
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4 · Curated by Tom Hundley
MCP Tasks address a common failure mode in agent workflows: blocking on long-running operations. Instead of holding a synchronous request open while an ETL job runs or a document converts, a task-based tool returns a handle immediately. The agent can continue working, then poll for status or subscribe to updates until the task completes, fails, or requests more input.
That is the call-now, fetch-later pattern. For production systems, it is often the difference between a resilient workflow and one that times out, retries unnecessarily, or loses visibility mid-execution. If your agents still rely on synchronous MCP tool calls for operations that can run for tens of seconds or longer, Tasks are the natural upgrade path. The core ideas are simple: return a handle immediately, track state explicitly, and retrieve results asynchronously through polling or subscriptions.
This guide covers the full implementation path: task creation, state management, polling vs. subscription tradeoffs, error handling, and production patterns we use at Elegant Software Solutions. It also builds on related ESS guidance such as agent-to-agent communication patterns and production MCP patterns.
TL;DR: Synchronous tool calls become fragile when work runs longer than normal request windows, and many real agent workflows include at least one slow step.
Consider a typical workflow: an agent receives a request to ingest a CSV dataset, transform it, and generate a summary report. The ETL step alone may take minutes, depending on data volume and downstream systems. In a synchronous model, the MCP client blocks on that tool call. The connection stays open. Timeouts become more likely. Retry logic can duplicate work if the client cannot tell whether the original request is still running.
This is a transport and systems problem more than an LLM problem. Standard request-response patterns are a poor fit for operations that may run well beyond ordinary HTTP timeouts or user-interface patience thresholds.
Synchronous blocking creates three compounding problems:
| Problem | Synchronous Impact | Tasks Solution |
|---|---|---|
| Client timeout | Connection drops, result may be lost to the caller | Handle persists independently of the original request |
| Resource lock | Connection or worker stays occupied while idle | Client is freed immediately |
| Retry storms | Retries may start duplicate work | Idempotency keys can map retries to the same task |
| Observability gap | Little visibility during execution | State transitions can be surfaced as task updates |
If you've already implemented production MCP patterns with custom retry handling and monitoring, Tasks provide a cleaner protocol-level way to solve the same class of problems.
TL;DR: MCP Tasks work best when every task moves through a small, explicit state machine that clients can handle deterministically.
A practical task model is intentionally simple. A task typically includes an identifier, a current status, optional progress metadata, and either a result, an error, or an input request depending on its state.
```typescript
interface MCPTask {
  id: string;                  // Unique task identifier
  status: TaskStatus;          // Current state
  progress?: number;           // 0-100, optional
  progressMessage?: string;    // Human-readable status
  result?: MCPToolResult;      // Available when completed
  error?: MCPError;            // Available when failed
  inputRequest?: InputRequest; // Available when input_required
  createdAt: string;           // ISO 8601
  updatedAt: string;           // ISO 8601
}

type TaskStatus =
  | 'working'
  | 'input_required'
  | 'completed'
  | 'failed'
  | 'cancelled';
```

The five states are enough for most production workflows:

- `working`: the task is actively running
- `input_required`: the task is paused until additional input arrives
- `completed`: the task finished successfully
- `failed`: the task ended with an error
- `cancelled`: the task was intentionally stopped

The `input_required` state is what makes Tasks more than a thin wrapper around a background job. When a long-running operation reaches a decision point, it can pause explicitly instead of guessing or failing. That decision might involve ambiguous schema mapping, permission confirmation, or parameter clarification.
The agent can then gather the missing input from a user, another agent, or a policy engine and resume the task. In other words, human-in-the-loop and agent-in-the-loop behavior become part of the task contract rather than an afterthought.
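The state machine above can be made explicit in code so clients handle transitions deterministically. The transition table below is our own reading of the five-state model, not something mandated by the MCP spec:

```python
# Allowed transitions between task states. Illustrative reading of the
# five-state model described above, not a normative part of the MCP spec.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "working": {"input_required", "completed", "failed", "cancelled"},
    "input_required": {"working", "cancelled"},  # resumes once input arrives
    "completed": set(),   # terminal
    "failed": set(),      # terminal
    "cancelled": set(),   # terminal
}

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def transition(current: str, new: str) -> str:
    """Validate a state change; reject anything the table does not allow."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current} -> {new}")
    return new
```

Rejecting illegal transitions at a single choke point keeps state bugs (for example, a worker "completing" a cancelled task) from silently corrupting task records.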
TL;DR: A task-based tool should return a handle immediately, then expose status and result retrieval through follow-up task operations.
The exact API surface depends on the MCP SDK and version you use, so treat the following as illustrative pseudocode rather than a drop-in implementation. The important pattern is immediate task creation plus background execution.
```python
import asyncio
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class TaskRecord:
    id: str
    status: str
    progress: int = 0
    progress_message: str | None = None
    result: dict | None = None
    error: dict | None = None
    input_request: dict | None = None
    # default_factory so each record gets a fresh timestamp; a plain
    # default would be evaluated once, at class-definition time
    created_at: str = field(default_factory=_now)
    updated_at: str = field(default_factory=_now)


task_store: dict[str, TaskRecord] = {}


async def run_etl_pipeline(source_url: str, target_table: str) -> TaskRecord:
    """Starts an ETL pipeline and returns a task handle immediately."""
    task_id = str(uuid.uuid4())
    task = TaskRecord(
        id=task_id,
        status="working",
        progress=0,
        progress_message="Initializing pipeline"
    )
    task_store[task_id] = task
    # Fire and forget; production code should keep a reference to this
    # asyncio.Task so it is not garbage-collected mid-flight.
    asyncio.create_task(_execute_pipeline(task_id, source_url, target_table))
    return task


async def _execute_pipeline(task_id: str, source_url: str, target_table: str) -> None:
    task = task_store[task_id]
    try:
        task.progress = 20
        task.progress_message = "Extracting data from source"
        task.updated_at = _now()
        data = await extract_from_source(source_url)

        schema_issues = validate_schema(data, target_table)
        if schema_issues:
            task.status = "input_required"
            task.input_request = {
                "type": "schema_mapping",
                "message": f"Found {len(schema_issues)} unmapped columns",
                "fields": schema_issues
            }
            task.updated_at = _now()
            return  # Resume later when input is submitted

        task.progress = 60
        task.progress_message = "Transforming and loading"
        task.updated_at = _now()
        result = await transform_and_load(data, target_table)

        task.status = "completed"
        task.progress = 100
        task.result = {"rows_loaded": result.count, "table": target_table}
        task.updated_at = _now()
    except Exception as e:
        task.status = "failed"
        task.error = {"code": "PIPELINE_ERROR", "message": str(e)}
        task.updated_at = _now()
```

Two implementation notes matter here:
- The in-memory `task_store` keeps the sketch small. Production servers should write task records to durable storage so handles survive restarts.
- `input_required` usually requires a separate submit-input path that requeues or restarts the background work.

On the client side, the pattern is the same regardless of language: call the tool, store the task ID, then poll or subscribe until the task reaches a terminal state.
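That submit-input path might look like the following sketch. `submit_task_input` and the `resume` callback are hypothetical names; the real resume hook depends on how your pipeline checkpoints its work:

```python
import asyncio
from typing import Awaitable, Callable


async def submit_task_input(
    task_store: dict,
    task_id: str,
    user_input: dict,
    resume: Callable[[str, dict], Awaitable[None]],
) -> asyncio.Task:
    """Accept input for a paused task and requeue its background work.

    `resume` is a hypothetical continuation that picks up from the
    checkpoint where the pipeline paused.
    """
    task = task_store.get(task_id)
    if task is None:
        raise KeyError(f"Unknown task: {task_id}")
    if task.status != "input_required":
        raise ValueError(f"Task {task_id} is not awaiting input")

    task.status = "working"
    task.input_request = None
    task.progress_message = "Resuming with provided input"
    # Return the handle so the caller can keep a reference; otherwise the
    # asyncio task risks being garbage-collected before it finishes.
    return asyncio.create_task(resume(task_id, user_input))
```

Guarding on the current status matters: it makes a duplicate or late input submission fail loudly instead of restarting work on a task that already resumed.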
```typescript
async function executeETLWithTask(client: MCPClient) {
  const task = await client.callTool('run_etl_pipeline', {
    source_url: 'https://data-source.example.com/export.csv',
    target_table: 'quarterly_sales'
  });

  console.log(`Task ${task.id} started — status: ${task.status}`);

  const result = await pollUntilTerminal(client, task.id, {
    intervalMs: 2000,
    maxAttempts: 180,
    onProgress: (t) => {
      console.log(`[${t.progress}%] ${t.progressMessage}`);
    },
    onInputRequired: async (t) => {
      const mapping = await resolveSchemaMapping(t.inputRequest);
      await client.submitTaskInput(task.id, mapping);
    }
  });

  if (result.status === 'completed') {
    console.log('ETL complete:', result.result);
  } else if (result.status === 'failed') {
    console.error('ETL failed:', result.error);
  }
}
```

TL;DR: Polling is simpler and broadly compatible; subscriptions are better when your transport supports push updates and you need lower-latency progress reporting.
| Factor | Polling | Subscription |
|---|---|---|
| Transport requirement | Works anywhere the client can re-query task state | Requires a transport and server implementation that support server push |
| Implementation complexity | Low | Medium |
| Latency to state change | Up to the polling interval | Usually lower than polling |
| Server resource usage | More repeated requests | More long-lived connections or streams |
| Network chattiness | Higher | Lower for long tasks |
| Best for | Simpler deployments, broad compatibility | Long tasks, richer real-time UX |
```typescript
// Small helper assumed by the loop below.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollUntilTerminal(
  client: MCPClient,
  taskId: string,
  options: PollOptions
): Promise<MCPTask> {
  const { intervalMs, maxAttempts, onProgress, onInputRequired } = options;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const task = await client.getTask(taskId);

    switch (task.status) {
      case 'working':
        onProgress?.(task);
        await sleep(intervalMs);
        break;
      case 'input_required':
        await onInputRequired?.(task);
        attempt = 0; // reset the budget once the task resumes
        await sleep(intervalMs);
        break;
      case 'completed':
      case 'failed':
      case 'cancelled':
        return task;
    }
  }

  await client.cancelTask(taskId);
  throw new Error(`Task ${taskId} exceeded max poll attempts`);
}
```

Polling is often the right default, but avoid overly aggressive intervals. A one- or two-second cadence is usually enough for user-facing progress, and longer intervals may be appropriate for back-office jobs.
If your MCP transport and server support push-style task updates, subscriptions can reduce latency and request volume. The exact mechanism varies by SDK and transport, so this example is also illustrative:
```typescript
async function subscribeToTask(
  client: MCPClient,
  taskId: string
): Promise<MCPTask> {
  return new Promise((resolve, reject) => {
    // Hard cap so an abandoned subscription cannot wait forever; cleared
    // below on every settle path so the timer does not fire afterwards.
    const timer = setTimeout(() => {
      subscription.unsubscribe();
      client.cancelTask(taskId);
      reject(new Error('Task subscription timeout'));
    }, 600_000);

    const subscription = client.subscribeToTask(taskId, {
      onStateChange: (task) => {
        console.log(`Task ${taskId}: ${task.status} (${task.progress}%)`);
        if (task.status === 'completed') {
          clearTimeout(timer);
          subscription.unsubscribe();
          resolve(task);
        } else if (task.status === 'failed' || task.status === 'cancelled') {
          clearTimeout(timer);
          subscription.unsubscribe();
          reject(new Error(task.error?.message ?? `Task ${task.status}`));
        }
      },
      onError: (err) => {
        clearTimeout(timer);
        subscription.unsubscribe();
        reject(err);
      }
    });
  });
}
```

If you're running MCP servers with OAuth 2.1 token lifecycle patterns, make sure token refresh logic accounts for long-lived task monitoring. A task can easily outlast a short access-token lifetime.
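One way to keep a long-lived monitor authenticated is to refresh proactively before each poll. This is a minimal sketch; `fetch_token` stands in for whatever performs the actual OAuth 2.1 refresh in your stack:

```python
import time


class RefreshingTokenSource:
    """Return a valid access token, refreshing when it nears expiry.

    `fetch_token` is a hypothetical callable that performs the OAuth 2.1
    refresh and returns (access_token, expires_at_epoch_seconds).
    """

    def __init__(self, fetch_token, skew_seconds: float = 60.0):
        self._fetch_token = fetch_token
        self._skew = skew_seconds
        self._token: str | None = None
        self._expires_at: float = 0.0

    def get(self) -> str:
        # Refresh ahead of expiry so a poll never fires with a stale token.
        if self._token is None or time.time() >= self._expires_at - self._skew:
            self._token, self._expires_at = self._fetch_token()
        return self._token
```

Calling `source.get()` before each `getTask` request means a task that runs for an hour never trips over a five-minute access token.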
TL;DR: Reliable MCP Tasks depend on idempotency, durable state, timeout handling, and explicit treatment of paused tasks.
The protocol gives you the primitive. Production resilience comes from the surrounding system design.
Prevent duplicate task creation when clients retry after a network interruption:
```python
# Module-level index mapping idempotency keys to existing task IDs.
idempotency_index: dict[str, str] = {}


async def run_etl_pipeline(
    source_url: str,
    target_table: str,
    idempotency_key: str | None = None
) -> TaskRecord:
    # A retried request with a known key gets the original task handle
    # back instead of spawning duplicate work.
    if idempotency_key and idempotency_key in idempotency_index:
        return task_store[idempotency_index[idempotency_key]]

    task = create_new_task()
    if idempotency_key:
        idempotency_index[idempotency_key] = task.id
    return task
```

This is especially important when the client cannot tell whether the original request failed before task creation or after it.
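The client side of the contract is generating the key once and reusing it across retries. A sketch, where `call_tool` is a hypothetical async stand-in for your MCP client call:

```python
import uuid


async def start_task_with_retry(call_tool, params: dict, max_attempts: int = 3):
    """Retry task creation with one idempotency key across all attempts.

    Because every retry carries the same key, the server can map
    duplicates back to the original task instead of starting new work.
    """
    key = str(uuid.uuid4())  # generated once, reused on every retry
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            return await call_tool({**params, "idempotency_key": key})
        except ConnectionError as e:  # retry only transient transport errors
            last_error = e
    raise last_error
```

The common mistake is generating a fresh key inside the retry loop, which silently defeats the deduplication the server implements.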
Tasks that remain in working for too long may be orphaned by worker crashes or lost callbacks. A periodic sweeper can mark them failed and surface the issue clearly.
```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


async def sweep_stale_tasks(max_age_seconds: int = 3600):
    """Transition orphaned tasks to failed state."""
    now = datetime.now(timezone.utc)
    for task_id, task in task_store.items():
        if task.status == "working":
            updated = datetime.fromisoformat(task.updated_at)
            age = (now - updated).total_seconds()
            if age > max_age_seconds:
                task.status = "failed"
                task.error = {
                    "code": "TASK_TIMEOUT",
                    "message": f"Task stale for {age:.0f}s"
                }
                task.updated_at = now.isoformat()
                logger.warning(f"Swept stale task {task_id}")
```

Polling every 100 milliseconds is rarely justified and can overload the server. Use a capped backoff strategy instead:
```typescript
function backoffInterval(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * Math.pow(1.5, attempt), maxMs);
}
```

If your client does not handle `input_required`, tasks can stall indefinitely. Every client should define a policy for that state, even if the policy is to cancel the task and log the reason.
```typescript
case 'input_required':
  if (!onInputRequired) {
    logger.error(`Task ${taskId} requires input but no handler registered`);
    await client.cancelTask(taskId);
    throw new Error('Unhandled input_required state');
  }
  break;
```

In-memory task stores are fine for demos, but production systems should persist task metadata and state transitions to durable storage. On restart, the server should reconcile any tasks left in `working` and either resume them safely or mark them failed with a clear recovery code.
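Startup reconciliation can be sketched as a pure function over the persisted records. `can_resume` is a hypothetical predicate that checks whether a task's checkpoint allows safe resumption:

```python
def reconcile_on_startup(rows: list[dict], can_resume) -> tuple[list[str], list[dict]]:
    """Split persisted tasks left in 'working' into resumable and failed.

    `rows` are task records loaded from durable storage after a restart.
    Failed records get a recovery-specific error code so callers can tell
    a crash apart from an ordinary pipeline error.
    """
    to_resume: list[str] = []
    failed: list[dict] = []
    for row in rows:
        if row["status"] != "working":
            continue
        if can_resume(row):
            to_resume.append(row["id"])
        else:
            row["status"] = "failed"
            row["error"] = {
                "code": "SERVER_RESTART",
                "message": "Server restarted before the task completed",
            }
            failed.append(row)
    return to_resume, failed
```

Terminal tasks pass through untouched; only in-flight work needs a decision at boot.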
MCP Tasks are a protocol-facing abstraction. The client interacts with a task through the MCP interface rather than directly through queue infrastructure. Under the hood, your server may still use Celery, Bull, or another job system to execute the work. The difference is that the agent sees a consistent task contract instead of queue-specific mechanics.
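Bridging the two usually means translating queue-side states into the task contract. A sketch using Celery-style state names (the mapping itself is an illustrative choice, not part of MCP):

```python
# Illustrative mapping from Celery-style job states to the MCP task
# contract. Queue-side names follow Celery's documented task states.
QUEUE_TO_MCP_STATUS = {
    "PENDING": "working",   # queued but not yet started
    "STARTED": "working",
    "RETRY": "working",     # the queue retries internally; still one task
    "SUCCESS": "completed",
    "FAILURE": "failed",
    "REVOKED": "cancelled",
}


def to_mcp_status(queue_state: str) -> str:
    """Translate a queue-specific state into the MCP task status."""
    return QUEUE_TO_MCP_STATUS.get(queue_state, "working")
```

Note that several queue states collapse into `working`: the agent does not need to know about internal retries, only whether the task is still in flight.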
Task creation and polling fit naturally with any transport that lets the client make follow-up requests. Push-style subscriptions depend on transport and SDK support for server-initiated updates, so they are more transport-specific. If you are running local agents over stdio, polling is usually the simpler option.
Whether a task survives a server crash depends on where task state lives. If state is only in memory, a crash can orphan the task entirely. If state is persisted, the server can recover the task record on restart and either resume execution or mark it failed with a recovery-specific error code. Durable state is the difference between graceful recovery and silent loss.
Even fully autonomous systems need an explicit routing policy for input_required. Some tasks should escalate to a human. Others can route to a policy engine or a specialist agent. The important design choice is to make that routing deterministic and auditable rather than letting the task sit indefinitely.
There is no universal fixed limit on how many tasks a server can run concurrently. Capacity depends on memory, worker concurrency, downstream dependencies, and how much state you retain per task. In practice, you should implement admission control, monitor queue depth and task age, and reject or defer new work before the system becomes unstable.
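Admission control can start as something very simple. A minimal sketch; a production version would also watch queue depth, task age, and downstream backpressure signals:

```python
class AdmissionController:
    """Reject new task creation once active work crosses a configured limit."""

    def __init__(self, max_active: int):
        self.max_active = max_active
        self.active = 0

    def try_admit(self) -> bool:
        if self.active >= self.max_active:
            return False  # caller should return a retriable 'busy' error
        self.active += 1
        return True

    def release(self) -> None:
        # Called when a task reaches a terminal state.
        self.active = max(0, self.active - 1)
```

Refusing work early with a retriable error is almost always better than accepting tasks the server cannot finish.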
Handle `input_required` deliberately, or tasks will hang in ways that are difficult to debug.

MCP Tasks mark a practical shift from prototype agent systems to production-ready workflows. The call-now, fetch-later pattern is not just a convenience. It is a better fit for operations that take time, require approval, or need to survive transient failures without tying up the original request.
If your team is building agent workflows for ETL, document processing, code generation, or any other operation that cannot reliably finish within a short synchronous request window, Tasks are worth adopting. The pattern itself is straightforward. The engineering work is in the surrounding details: idempotency, recovery, transport-aware retrieval, and clear state transitions.
Elegant Software Solutions helps development teams implement production MCP patterns, including task-based workflows, OAuth lifecycle management, and multi-agent orchestration. If you're ready to move from synchronous prototypes to resilient async agent workflows, schedule a technical conversation with our team.