What stays stable past hour eight: long-horizon agent runs in production.

An agent that runs for thirty seconds and an agent that runs for thirty hours are not the same engineering problem. They share an API, a framework, and roughly nothing else. Here are the five patterns we now reach for when a single autonomous run has to survive a token-context flush, a tool outage, and (more than once) a model upgrade mid-execution.

1. Checkpoint the plan, not the trace

Short-horizon agents can be replayed from their trace. Long-horizon agents cannot — the trace gets too large to load and the upstream model that produced it may not exist anymore. What you checkpoint instead is the plan and the open commitments.

// python

# bad — replay-from-trace breaks at hour eight
state = load_trace(run_id)
resume_from(state.last_step)

# good — replay-from-plan with committed steps as durable facts
plan = load_plan(run_id)
committed = load_committed_steps(run_id)
remaining = diff(plan, committed)
resume(plan, committed, remaining)

The plan is small (kilobytes), structured, and stable across runs. The committed-steps log is append-only and indexed by step_id. When the run resumes — after a crash, a deploy, a model upgrade — it loads the plan, sees which steps have committed, and continues. The trace is for humans. The plan + commit log is for the agent.
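The plan and commit log described above can be sketched with nothing more than a JSON-lines file. This is a minimal illustration, not a reference implementation: the file layout, the record shape, and the helper names `commit_step`, `load_committed_steps`, and `remaining_steps` are all assumptions made for the example.

```python
import json
from pathlib import Path


def commit_step(log_path: Path, step_id: str, result: dict) -> None:
    """Append one committed step to the log; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"step_id": step_id, "result": result}) + "\n")


def load_committed_steps(log_path: Path) -> dict:
    """Read the append-only log into a dict indexed by step_id."""
    committed = {}
    if log_path.exists():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                record = json.loads(line)
                committed[record["step_id"]] = record
    return committed


def remaining_steps(plan: list, committed: dict) -> list:
    """Return the plan steps, in plan order, that have not committed yet."""
    return [step for step in plan if step["step_id"] not in committed]
```

On resume, the agent would call `load_committed_steps`, pass the result to `remaining_steps`, and pick up at the first step returned. The trace is never loaded.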

2. Tool calls are idempotent or they are not safe

We see teams ship long-horizon agents with non-idempotent tool calls and then add retry logic on top. Eight hours in, the agent retries a stripe.charge or a database.write and you have a duplicate-charges incident. The rule we hold to: every tool call carries an idempotency_key derived from (run_id, step_id), and the tool, not the agent, guarantees that two calls with the same key produce the same outcome.
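A rough sketch of that contract follows. The `idempotency_key` helper and the `IdempotentTool` wrapper are hypothetical names, and the in-memory dict stands in for whatever durable store a real tool would use to remember past results.

```python
import hashlib
import json


def idempotency_key(run_id: str, step_id: str) -> str:
    """Derive a stable key from (run_id, step_id).

    JSON-encoding the pair keeps ("a", "b:c") distinct from ("a:b", "c").
    """
    payload = json.dumps([run_id, step_id]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class IdempotentTool:
    """Wraps a side-effecting function so a repeated key replays the first result."""

    def __init__(self, fn):
        self._fn = fn
        # Assumption for the sketch: in-memory only. A production tool would
        # persist results so that retries after a crash are also deduplicated.
        self._results = {}

    def call(self, key: str, **kwargs):
        if key in self._results:
            return self._results[key]  # retry: no second side effect
        result = self._fn(**kwargs)
        self._results[key] = result
        return result
```

With this shape, the agent's retry logic can stay naive: it resends the same key, and the tool decides whether any work actually happens.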

3. Context budget per phase, not per call

Short agents share a single context window. Long agents need a budget per phase — planning, retrieval, execution, review — and need to compress between phases. We use a fixed compression schedule: at every phase boundary, the agent summarizes the prior phase into a structured artifact and starts the next phase with that artifact, not with the full prior context.

Planning phase: full problem context · budget = 30% of window
Retrieval phase: planning summary + grounded chunks · budget = 35%
Execution phase: planning summary + retrieval artifact + step-local context · budget = 25%
Review phase: execution summary + flagged anomalies · budget = 10%

When you don't impose this budget, the agent's working context drifts. By hour twelve it's reasoning about its own internal monologue from hour two and producing output that looks plausible but is no longer grounded in the original objective.
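One way to enforce the schedule is to refuse to start a phase whose carried-over artifacts already exceed its share of the window. The sketch below assumes the percentages listed above; `phase_token_budget`, `start_phase`, and the caller-supplied `count_tokens` function are illustrative names, not part of any framework.

```python
# Phase shares of the context window, in percent (taken from the schedule above).
PHASE_BUDGET_PCT = {"planning": 30, "retrieval": 35, "execution": 25, "review": 10}
assert sum(PHASE_BUDGET_PCT.values()) == 100


def phase_token_budget(phase: str, window_tokens: int) -> int:
    """Token budget for a phase, using integer math to avoid float rounding."""
    return window_tokens * PHASE_BUDGET_PCT[phase] // 100


def start_phase(phase: str, carried_artifacts: list, window_tokens: int, count_tokens) -> str:
    """Build a phase's starting context from summaries only, never the prior transcript.

    count_tokens is a caller-supplied function (an assumption of this sketch);
    in practice it would be the tokenizer for the model in use.
    """
    context = "\n\n".join(carried_artifacts)
    used = count_tokens(context)
    budget = phase_token_budget(phase, window_tokens)
    if used > budget:
        raise ValueError(f"{phase}: {used} tokens exceeds budget of {budget}")
    return context
```

A failed check is a signal to compress harder at the phase boundary, not to raise the budget.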

4. Heartbeat + watchdog

Long agents need an external supervisor. The agent emits a heartbeat every N minutes; a watchdog process checks the heartbeat and intervenes if it stalls. The watchdog is not part of the agent graph. It is a separate, stateless process that can kill, restart, or escalate a run independently of whatever broke.

// python

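from datetime import timedelta

# store, utc_now, current_step_count, and escalate come from the host application.
STALL_THRESHOLD = timedelta(minutes=15)

# Inside the agent loop
async def heartbeat(run_id: str, phase: str):
    await store.put_heartbeat(run_id, {
        "phase": phase,
        "ts": utc_now(),
        "step_count": current_step_count(),
    })

# Watchdog (separate process)
async def watchdog_tick(run_id: str):
    hb = await store.get_heartbeat(run_id)
    # The heartbeat is stored as a dict, so read "ts" by key, not by attribute.
    if utc_now() - hb["ts"] > STALL_THRESHOLD:
        await escalate(run_id, reason="heartbeat_stall")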

Fifteen minutes is our default stall threshold for the watchdog. We go shorter for expensive workloads and longer for batch jobs. Below five minutes the watchdog fires on normal slow phases; above thirty minutes the agent burns budget on hung tool calls before anyone notices.
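Those bounds can be written down as a clamp so that no per-workload override drifts outside the five-to-thirty-minute range. In this sketch the workload class names and their specific values (8 and 25 minutes) are made up for illustration; only the 5, 15, and 30 minute figures come from the text above.

```python
from datetime import datetime, timedelta

MIN_THRESHOLD = timedelta(minutes=5)       # below this, slow phases trip the watchdog
MAX_THRESHOLD = timedelta(minutes=30)      # above this, hung tool calls burn budget
DEFAULT_THRESHOLD = timedelta(minutes=15)

# Hypothetical workload classes and values, chosen only for the example.
WORKLOAD_THRESHOLDS = {
    "high_cost": timedelta(minutes=8),
    "batch": timedelta(minutes=25),
}


def stall_threshold(workload: str) -> timedelta:
    """Look up the threshold for a workload and clamp it into the safe range."""
    requested = WORKLOAD_THRESHOLDS.get(workload, DEFAULT_THRESHOLD)
    return max(MIN_THRESHOLD, min(MAX_THRESHOLD, requested))


def is_stalled(last_heartbeat: datetime, now: datetime, workload: str) -> bool:
    """True when the heartbeat is older than the workload's threshold."""
    return now - last_heartbeat > stall_threshold(workload)
```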

5. Survive a model upgrade mid-run

This is the pattern that catches teams off guard the first time it happens. A long-horizon run starts on claude-sonnet-4-6 at 2:00 PM, the provider rolls out a routine update at 8:00 PM, and the agent's behavior shifts mid-run because every subsequent call now hits a slightly different model. The fix is explicit and small: pin the model version per run, not per project, so a run finishes on the model it started with.
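A minimal sketch of per-run pinning: the model identifier is copied into the run's own record at start, and every later call reads it from there instead of from project configuration. The names `RunConfig`, `start_run`, and `model_for_call` are hypothetical, the dict stands in for a durable run store, and the approach assumes the provider offers an identifier that resolves to a fixed snapshot and not a moving alias.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Immutable per-run settings, written once when the run starts."""
    run_id: str
    model: str


def start_run(run_id: str, project_config: dict, run_store: dict) -> RunConfig:
    """Copy the project's current model into the run record at start time."""
    cfg = RunConfig(run_id=run_id, model=project_config["model"])
    run_store[run_id] = cfg  # assumption: a durable store in a real system
    return cfg


def model_for_call(run_id: str, run_store: dict) -> str:
    """Every model call resolves its model from the run record, not the project."""
    return run_store[run_id].model
```

A project-level change then affects only runs started after it. Runs already in flight keep the model recorded in their `RunConfig`.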

Where this lands

The five patterns above now appear in every long-horizon engagement we run. Together they trade some short-term elegance for a system that survives the failure modes that only show up when an agent has been running long enough for the world around it to change. If you're shipping long-horizon agents and your engineering instincts come from the request-response world, expect to relearn the durability primitives that the SRE world has been writing about for thirty years. The agentic version is the same problem with a different vocabulary.
