
🤖 Ghostwritten by Claude Opus 4.8 · Fact-checked & edited by GPT 5.5
For an AI agent fleet, the task queue is the operational backbone. Models decide what to generate, orchestrators decide what should happen next, but the queue determines whether work moves safely, predictably, and recoverably across the fleet. Get the queue wrong, and capable agents sit idle, duplicate each other's effort, or leave unrecoverable partial state behind.
A production-ready queue for agent work should do four things well: hold atomic tasks, let workers pull work when they are ready, preserve durable state through failure, and enforce security boundaries between the planner and the executor. The queue should not be the brain of the system. It should be the boring, inspectable contract between an orchestrator that understands the plan and workers that execute one bounded step at a time.
That design is especially important for bare-metal and edge deployments, where centralized task queues with priority ordering, pull-based worker assignment, and queue-depth-driven horizontal scaling remain the practical production pattern. It also explains why a queue-first architecture can be a better fit than pushing every worker into a full graph orchestration runtime.
TL;DR: A queued task should be a single, idempotent unit of work with the minimum context required to execute one step — never a whole plan.
The temptation with agent fleets is to enqueue goals: build the feature, fix the bug, ship the PR. That is a trap. A goal is a plan, and plans belong in the orchestrator. What goes on the queue is the smallest atomic step a worker can execute and report on: generate a file, run a test suite, apply a diff, summarize a failure.
Each task payload should carry three things and nothing more:
A worker receives the context for its step only. It should not see the full project plan, sibling tasks, or the orchestrator’s reasoning trace. This is least-privilege context applied to agents the same way it is applied to service accounts.
Tasks also need to be idempotent. If a worker dies mid-execution and the task is reclaimed by another node, re-running it must not corrupt state. That single constraint shapes everything downstream, particularly how the queue recovers from a node going dark.
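One way to make that constraint concrete is to record each task's result under its task ID and return the stored result on a re-run. The sketch below is only an illustration: SQLite stands in for the real task store, and the `task_results` table and the `run_idempotent` helper are hypothetical names, not part of the original design.

```python
import sqlite3


def make_store():
    """Create an in-memory result store (a stand-in for durable storage)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_results (task_id TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    return conn


def run_idempotent(conn, task_id, step):
    """Run `step` unless a result for `task_id` is already recorded."""
    row = conn.execute(
        "SELECT result FROM task_results WHERE task_id = ?", (task_id,)
    ).fetchone()
    if row is not None:
        return row[0]  # an earlier attempt already finished this step

    result = step()
    # INSERT OR IGNORE keeps the first recorded result if two attempts race.
    conn.execute(
        "INSERT OR IGNORE INTO task_results (task_id, result) VALUES (?, ?)",
        (task_id, result),
    )
    conn.commit()
    return conn.execute(
        "SELECT result FROM task_results WHERE task_id = ?", (task_id,)
    ).fetchone()[0]
```

A worker that dies between running the step and recording the result will still cause the step to run again, so the step's own side effects need to be safe to repeat as well.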
TL;DR: Pull-based assignment lets queue depth drive horizontal scaling and makes a dead worker a recoverable lease event instead of a lost task.
The orchestrator should not assign tasks to specific workers. It should write tasks to the queue and let idle workers claim them. This pull-based model works well for bare-metal and edge fleets because queue depth becomes the scaling signal. If the queue is backing up, bring more workers online. If it is empty, workers can sit idle without forcing the orchestrator to track every node’s moment-by-moment state.
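As a rough sketch of queue depth acting as the scaling signal, the function below maps a backlog size to a target worker count. The `tasks_per_worker` ratio and the `min_workers` and `max_workers` bounds are illustrative assumptions that would need tuning for a real fleet.

```python
import math


def desired_workers(queue_depth, tasks_per_worker=5, min_workers=1, max_workers=20):
    """Return a target worker count for the current backlog.

    All defaults are placeholder values, not recommendations.
    """
    if queue_depth < 0:
        raise ValueError("queue_depth cannot be negative")
    wanted = math.ceil(queue_depth / tasks_per_worker)
    # Clamp so the fleet never drops below the floor or exceeds the ceiling.
    return max(min_workers, min(max_workers, wanted))
```

With these defaults, a backlog of 12 tasks asks for 3 workers, and an empty queue keeps the floor of 1.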
Push-based assignment couples the orchestrator to individual worker health. When a worker goes dark mid-task, the producer has to detect the failure, reassign the work, and reconcile state. With pull-based assignment, an expired task lease simply makes the work available again. A worker that crashes does not permanently hold work it cannot complete.
Here is an illustrative claim pattern using a Postgres-backed queue:

```sql
-- Worker claims the highest-priority unclaimed task atomically
UPDATE tasks
SET status = 'claimed',
    claimed_by = 'worker-node-id',
    claimed_at = now(),
    lease_expires_at = now() + interval '5 minutes'
WHERE id = (
    SELECT id FROM tasks
    WHERE status = 'queued'
      AND run_id = $1
    ORDER BY priority DESC, created_at ASC
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, instruction, scoped_credential_ref;
```

The FOR UPDATE SKIP LOCKED clause is the practical trick. Multiple workers can run this claim query at the same time, and each worker skips rows already locked by another transaction. The database serializes the claim without requiring a separate coordinator.
The lease is how a dark node is handled. A reaper process resets any claimed task whose lease has expired back to queued. Because tasks are idempotent, another worker can safely reclaim the step.
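The reaper can be very small. The sketch below uses SQLite as a stand-in for Postgres so it runs on its own. The schema is a hypothetical minimal one whose column names mirror the claim query, and timestamps are plain epoch seconds passed in by the caller.

```python
import sqlite3

# Hypothetical minimal schema; column names mirror the claim query.
SCHEMA = """
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    claimed_by TEXT,
    claimed_at REAL,
    lease_expires_at REAL
)
"""


def reap_expired(conn, now):
    """Return claimed tasks with an expired lease to 'queued'.

    Returns the number of tasks that were reset.
    """
    cur = conn.execute(
        """
        UPDATE tasks
        SET status = 'queued',
            claimed_by = NULL,
            claimed_at = NULL,
            lease_expires_at = NULL
        WHERE status = 'claimed'
          AND lease_expires_at < ?
        """,
        (now,),
    )
    conn.commit()
    return cur.rowcount
```

Running this on a timer is enough for the pattern to work. Completed tasks are untouched because the filter only matches rows still marked as claimed.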
Enqueueing from the orchestrator can stay plain:

```python
# `db` is assumed to be an already-initialized client that exposes a
# table(...).insert(...).execute() chain (for example, a Supabase-style client).
def enqueue(run_id, instruction, priority, scoped_credential_ref):
    return db.table('tasks').insert({
        'run_id': run_id,
        'instruction': instruction,  # one step only
        'priority': priority,  # example scale: 0=background ... 100=urgent
        'scoped_credential_ref': scoped_credential_ref,  # e.g. op://{vault}/{item}/{field}
        'status': 'queued',
    }).execute()
```

Priority should be assigned by the orchestrator based on the dependency graph it holds internally. Release-path tasks outrank exploratory ones. The queue itself can stay intentionally dumb: it stores durable state, preserves ordering signals, and exposes claim semantics. The intelligence about planning remains in the producer, which keeps the consumer side simple and auditable.
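To see how the ordering signal plays out, here is a small in-memory sketch of the same rule the claim query uses: highest priority first, oldest first within a priority. The `PriorityBacklog` class is a hypothetical illustration and not a replacement for the durable table.

```python
import heapq
import itertools


class PriorityBacklog:
    """In-memory model of 'priority DESC, created_at ASC' ordering."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # stands in for created_at

    def put(self, instruction, priority):
        # Negate priority because heapq pops the smallest item first.
        heapq.heappush(self._heap, (-priority, next(self._seq), instruction))

    def claim(self):
        """Pop the next task, or return None when the backlog is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def depth(self):
        return len(self._heap)
```

An urgent release-path task enqueued behind a pile of exploratory work is claimed first, while tasks of equal priority keep their arrival order.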
TL;DR: Microsoft Agent Framework 1.0 models the workflow as a checkpointed graph; a queue-first design keeps the graph in the orchestrator and workers stateless.
Microsoft Agent Framework 1.0 introduces graph-based durable orchestration as its core abstraction for multi-agent fleets. It supports checkpointing, pause/resume, and sequential, concurrent, and handoff patterns. That is a strong model when the fleet lives inside a managed runtime and the framework is responsible for execution state.
A queue-first design makes a different trade-off. The durable truth lives in one inspectable place — a task table — while the graph remains inside the orchestrator. Workers do not need to understand graph topology. They claim one task, execute one step, report one result, and release or renew their lease.
| Concern | Graph durable orchestration | Queue-first design |
|---|---|---|
| State ownership | Framework runtime and checkpoints | Central task table |
| Worker coupling | Aware of graph topology | Stateless, claims one task |
| Failure recovery | Resume from checkpoint | Lease expiry and idempotent reclaim |
| Debuggability | Trace the graph engine | Query task state directly |
| Best fit | Managed runtimes | Bare-metal and edge nodes |
The graph still exists. In the Optimus Prime pattern, the dev orchestrator holds the plan, decomposes it into atomic tasks, and feeds those tasks to the queue as dependencies clear. The workers never need to know the shape of the graph. That separation makes the worker bench disposable, horizontally scalable, and easier to reason about under failure.
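A minimal sketch of that producer side, assuming the plan can be expressed as a mapping from each step to the steps it depends on. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The `PlanFeeder` class and the step names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


class PlanFeeder:
    """Holds the dependency graph and releases steps as dependencies clear."""

    def __init__(self, dependencies):
        # dependencies maps each step to the set of steps it depends on.
        self._sorter = TopologicalSorter(dependencies)
        self._sorter.prepare()  # raises graphlib.CycleError on a cyclic plan

    def ready_steps(self):
        """Return newly unblocked steps; each step is returned only once."""
        return sorted(self._sorter.get_ready())

    def mark_done(self, step):
        """Record a completed step so its dependents can be released."""
        self._sorter.done(step)

    def finished(self):
        return not self._sorter.is_active()
```

For a plan such as `{"apply_diff": {"generate_file"}, "run_tests": {"apply_diff"}}`, the feeder first releases only `generate_file`, and each later step appears once the step before it is marked done. The orchestrator would enqueue whatever `ready_steps()` returns, and workers never see the mapping.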
TL;DR: Scope every task payload so a compromised worker can affect only one step’s worth of blast radius.
A shared queue is a shared attack surface. At some point, a worker may execute hostile output: a prompt injection in scraped content, a poisoned dependency, or a model-produced command that should never have run. The security question is not whether this can happen. It is how much damage it can do when it does.
Three boundaries contain that cost:
The producer-consumer trust boundary is intentionally asymmetric. The orchestrator is trusted to compose tasks. The worker is treated as potentially hostile from the moment it claims one.
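One way to enforce that asymmetry is to validate every payload on the trusted side before it reaches the queue. The sketch below is illustrative: the allow-list reuses the field names from the enqueue example, and the `op://` prefix check assumes credential references follow the format shown there. Both are assumptions of this sketch rather than requirements of the design.

```python
# Fields a worker may see. Treating the enqueue example's fields as an
# allow-list is an assumption of this sketch.
ALLOWED_FIELDS = frozenset(
    {"run_id", "instruction", "priority", "scoped_credential_ref"}
)
REQUIRED_FIELDS = frozenset({"instruction", "scoped_credential_ref"})


def scope_payload(task):
    """Return a copy of `task` if it is safe to enqueue, else raise ValueError."""
    extra = set(task) - ALLOWED_FIELDS
    if extra:
        raise ValueError(f"payload carries fields beyond one step: {sorted(extra)}")
    missing = REQUIRED_FIELDS - set(task)
    if missing:
        raise ValueError(f"payload is missing fields: {sorted(missing)}")
    # Assumed convention: credentials travel as references, never as values.
    if not str(task["scoped_credential_ref"]).startswith("op://"):
        raise ValueError("expected a credential reference, not a secret value")
    return dict(task)
```

Rejecting instead of silently stripping unexpected fields makes an orchestrator bug visible at enqueue time, before a worker can see anything it should not.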
For a small bare-metal fleet, a Postgres table with FOR UPDATE SKIP LOCKED can provide atomic claiming, durable state, and straightforward debugging without operating a separate broker. The truth lives in one queryable place, which can matter more than raw throughput at modest scale. A dedicated broker becomes worth it when message volume, fan-out, routing complexity, or retention needs outgrow what a single database can comfortably serve.
The atomic UPDATE ... WHERE id = (SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1) pattern lets each worker lock and claim one row while skipping rows other workers have already locked. There is no application-level coordinator. The database handles the contention and returns a distinct task to each successful claimant.
Each claimed task carries a lease with an expiry. A reaper process resets any expired-but-incomplete task back to queued, where another worker can reclaim it. Because tasks are designed to be idempotent, re-executing a partially completed step should not corrupt state.
Graph-based durable orchestration is useful when a managed framework should own execution state. A queue-first design is better when durable state needs to remain centralized, inspectable, and independent of worker runtime. The graph still exists — it lives in the orchestrator, not in the queue or the workers.
Task payloads carry only one step’s instruction plus a reference to a narrow, short-lived credential resolved at execution time. Workers do not receive the full plan, sibling tasks, or standing secrets, which limits a compromised worker’s blast radius to a single bounded step.
FOR UPDATE SKIP LOCKED enables coordination-free atomic claiming on a Postgres-backed queue.
TL;DR: The queue succeeds when it stays simple, durable, and easy to inspect under pressure.
The interesting failures will not show up in the happy path. They will show up when an orchestrator decomposes a real coding objective into dozens of atomic tasks and the queue depth spikes faster than the worker bench can drain it. That backpressure behavior — especially whether urgent release-path work can jump a queue full of exploratory tasks — is the first thing to watch.
The second is lease tuning. Set the lease too short and long-running test suites get reaped and rerun wastefully. Set it too long and a dead node’s work stalls. There is no correct value in the abstract; the right lease depends on actual task duration, retry cost, and failure patterns.
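As one possible starting heuristic, and not a rule taken from any specific system, a lease can be derived from observed task durations: take a high percentile and add headroom. The percentile, safety factor, and floor below are placeholder values.

```python
import math


def suggest_lease_seconds(durations, percentile=0.95, safety_factor=1.5, floor=30.0):
    """Suggest a lease length from observed task durations, in seconds.

    Uses a nearest-rank percentile, then multiplies by a safety factor.
    Falls back to `floor` when there is no history yet.
    """
    if not durations:
        return floor
    ordered = sorted(durations)
    rank = max(1, math.ceil(percentile * len(ordered)))  # 1-based nearest rank
    return max(floor, ordered[rank - 1] * safety_factor)
```

A heuristic like this only covers typical task duration. Retry cost and failure patterns still have to be weighed separately, for example by giving long-running test suites their own lease class.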
The broader bet is that simplicity at the queue layer buys reliability everywhere above it. As autonomous coding tasks move through worker fleets, the measure of success will not be how clever the queue is. It will be how boring it remains under load.