
🤖 Ghostwritten by GPT 5.4 · Fact-checked & edited by Claude Opus 4.6 · Curated by Tom Hundley
This week I stopped treating memory like a prompt-engineering problem and started treating it like infrastructure. If you're running multiple agents across real workflows, your memory systems need explicit types, document grounding, and monitoring signals—or they will drift. That's the short version. The longer version is that Sparkles, Soundwave, Concierge, and the rest of our fleet were all "working" right up until they started remembering the wrong thing with high confidence.
The trigger was Soundwave, our email agent. It didn't fail dramatically. It failed in the most annoying production way possible: by becoming inconsistently helpful. Sometimes it applied the latest client handling rules. Sometimes it leaned on stale assumptions from prior runs. Sometimes it over-weighted conversational context that should have expired. That's memory drift, and it's one of those problems that looks like model quality until you instrument the stack and realize it's a systems problem.
So I started hardening our agent ecosystem architecture around a nine-type memory taxonomy, a shared library for business document grounding, and Supabase-backed checks that catch inconsistent state before it leaks into client-facing work.
TL;DR: Memory drift is not a cosmetic agent bug—it's a state-management failure that compounds as you add agents, tools, and long-running workflows.
When people talk about agent failures, they usually jump straight to model selection. Bigger model, better prompt, more retries. I've done all of that. It helps, but it doesn't solve the core issue once an agent has to operate over time.
The issue is that agents don't have one memory. They have several competing memories: recent chat context, retrieved documents, workflow state, tool outputs, user preferences, and whatever leftovers you accidentally kept around because it was convenient in version one. If you don't separate those concerns, the model blends them into one mushy latent blob and starts making decisions from stale context.
In our case, Soundwave was processing email threads where account-specific handling had changed. The latest policy existed in our business docs, but the agent occasionally privileged older conversational context because the retrieval path and the task-state path were not clearly ranked. The result wasn't nonsense. It was worse than nonsense—plausible output using the wrong source of truth.
Current agent research is converging on the same insight: memory has to be structured. LangChain's long-term memory guidance distinguishes between semantic, episodic, and procedural memory. Mem0 has pushed the idea that selective memory formation matters more than dumping raw transcripts into a vector store. Those are useful frames, but for production hardening I needed something more operational.
So I landed on a nine-type taxonomy that maps more cleanly to how our agents actually behave:
The table below became the backbone for our shared library work because each type has different retention rules, trust levels, and retrieval strategies.
| Memory Type | Purpose | Retention Pattern | Trust Priority |
|---|---|---|---|
| Conversation history | Local dialogue continuity | Short-lived | Medium |
| Task context | Current workflow state | Until task completion | High |
| User preferences | Personalization | Long-lived with review | Medium |
| Business document grounding | Source-of-truth instructions | Versioned | Highest |
| Tool execution history | Prior API/tool results | Short to medium | High |
| System state | Queue, locks, retries, status | Real-time | Highest |
| Agent scratchpad | Intermediate reasoning artifacts | Ephemeral | Low |
| Shared organizational memory | Cross-agent reusable knowledge | Curated | High |
| Compliance and audit memory | Traceability and review | Long-lived | Highest |
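One way to make those rules executable is a typed policy map. This is an illustrative sketch, not our actual library code—the kind names mirror the table, but the retention labels and numeric trust levels are placeholders:

```typescript
// Hypothetical sketch: encode the taxonomy's retention and trust
// rules as data so every agent reads the same contract.
type Kind =
  | "conversation" | "task_context" | "user_preference"
  | "business_grounding" | "tool_history" | "system_state"
  | "scratchpad" | "shared_org" | "compliance_audit";

interface MemoryPolicy {
  retention: "ephemeral" | "short" | "medium" | "task" | "long" | "versioned" | "realtime" | "curated";
  trust: 0 | 1 | 2 | 3; // 3 = highest
}

const MEMORY_POLICIES: Record<Kind, MemoryPolicy> = {
  conversation:       { retention: "short",     trust: 1 },
  task_context:       { retention: "task",      trust: 2 },
  user_preference:    { retention: "long",      trust: 1 },
  business_grounding: { retention: "versioned", trust: 3 },
  tool_history:       { retention: "medium",    trust: 2 },
  system_state:       { retention: "realtime",  trust: 3 },
  scratchpad:         { retention: "ephemeral", trust: 0 },
  shared_org:         { retention: "curated",   trust: 2 },
  compliance_audit:   { retention: "long",      trust: 3 },
};
```

Making the policy a data structure, rather than scattered prompt text, is what lets a shared library enforce it uniformly.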
One definitive statement: If business document grounding does not outrank conversational residue, your agent doesn't have memory—it has nostalgia.
TL;DR: The point of the taxonomy isn't academic neatness—it's to give every memory class an owner, a storage pattern, and a retrieval rule.
I built this into our shared library first so Sparkles, Soundwave, and Concierge all consume the same memory contract instead of each agent reinventing storage conventions. That was the anti-pattern I wanted to kill. Agent-local memory hacks are fun until your third agent ships.
At the code level, the core abstraction is a typed memory envelope. Not sophisticated—just boring enough to survive production.
```typescript
export type MemoryKind =
  | "conversation"
  | "task_context"
  | "user_preference"
  | "business_grounding"
  | "tool_history"
  | "system_state"
  | "scratchpad"
  | "shared_org"
  | "compliance_audit";

export interface MemoryRecord {
  id: string;
  agent_name: "sparkles" | "soundwave" | "concierge" | "harvest" | "insurance";
  memory_kind: MemoryKind;
  scope_key: string; // thread, user, account, workflow, etc.
  content: Record<string, unknown>;
  source: "retrieval" | "tool" | "user" | "system" | "document";
  trust_score: number;
  version: number;
  expires_at?: string;
  created_at: string;
}
```

The key design choice was separating scope from kind. That let me ask very specific questions at runtime—for example, "what task context exists for this thread?" without dragging in conversation residue from the same scope.
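Separating scope from kind makes those runtime questions composable. A minimal in-memory sketch—the real path goes through Supabase, and `queryMemory` is an illustrative helper, not our library's API:

```typescript
interface MemoryRecordLite {
  memory_kind: string;
  scope_key: string;
  content: Record<string, unknown>;
}

// Ask a narrow question: "what task context exists for this thread?"
// without pulling in conversation residue from the same scope.
function queryMemory(
  records: MemoryRecordLite[],
  kind: string,
  scopeKey: string
): MemoryRecordLite[] {
  return records.filter(
    (r) => r.memory_kind === kind && r.scope_key === scopeKey
  );
}

const records: MemoryRecordLite[] = [
  { memory_kind: "task_context", scope_key: "thread_123", content: { step: "triage" } },
  { memory_kind: "conversation", scope_key: "thread_123", content: { last: "thanks!" } },
  { memory_kind: "task_context", scope_key: "thread_456", content: { step: "reply" } },
];

const hits = queryMemory(records, "task_context", "thread_123");
// hits holds only the task context for thread_123
```

The same two-axis filter works whether the backing store is a relational table or a vector index.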
For storage, we're using Supabase Postgres plus pgvector for retrieval-friendly memory types. Not every memory belongs in embeddings—that's another anti-pattern. System state, for example, should not be retrieved semantically when a relational lookup is the right answer.
Here's the rough split: structured, exact-lookup state—system state, task context, tool history, audit events—lives in relational tables, while business document grounding and shared organizational knowledge sit behind pgvector similarity search.

According to the pgvector documentation, the extension enables vector similarity search inside Postgres, which is exactly why it fits this pattern: one operational data plane, multiple access paths. Supabase's platform documentation lists pgvector as a first-class extension, which made it practical to keep memory classification close to app data instead of building a separate memory subsystem on day one.
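That split can be pinned down in one routing function. A sketch under the assumptions above—the kind names follow the taxonomy, and the exact assignment per kind is illustrative:

```typescript
type AccessPath = "relational" | "vector";

// Route each memory kind to the access path that suits it:
// similarity search for embedded documents, deterministic
// Postgres lookups for everything stateful or auditable.
function accessPathFor(kind: string): AccessPath {
  switch (kind) {
    case "business_grounding":
    case "shared_org":
      return "vector"; // chunked, embedded, retrieved by similarity
    default:
      return "relational"; // system state, task context, tool history, audit
  }
}
```

Centralizing this decision keeps individual agents from quietly stuffing workflow state into the vector store because it was the nearest hammer.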
TL;DR: Rule-heavy prompts feel productive early, but versioned business documents make memory drift prevention far more reliable once multiple agents share responsibility.
This was the biggest practical shift. We had accumulated a lot of "agent behavior" inside prompts: if-then instructions, customer exceptions, escalation notes, formatting preferences, edge-case handling. It worked until it didn't.
The problem with rule-based prompts is that they create hidden policy. You update a behavior in one prompt template, forget the cousin prompt in another agent, and now Sparkles and Soundwave disagree about reality. That's not intelligence—it's config drift with better grammar.
So I started moving durable instructions into business documents that can be chunked, embedded, versioned, and cited in execution logs. The shared library now exposes one retrieval path for authoritative grounding:
```typescript
const grounding = await memory.getAuthoritativeGrounding({
  accountId,
  workflow: "email-processing",
  agent: "soundwave",
  topK: 6,
  minScore: 0.78
});
```
```typescript
const context = await memory.buildExecutionContext({
  threadId,
  userId,
  include: [
    "task_context",
    "tool_history",
    "business_grounding",
    "user_preference"
  ],
  precedence: [
    "system_state",
    "business_grounding",
    "task_context",
    "tool_history",
    "conversation"
  ]
});
```

The important part is `precedence`. We now force the assembly layer to rank memory types before the model sees them. That prevented a class of errors where a polite but outdated email thread outweighed a newer operating rule from a grounded document.
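Under the hood, precedence enforcement is just a stable sort over ranked kinds before context assembly. A minimal sketch—the rank order mirrors the `precedence` array, and the helper names are hypothetical:

```typescript
const PRECEDENCE = [
  "system_state",
  "business_grounding",
  "task_context",
  "tool_history",
  "conversation",
] as const;

interface RankedMemory {
  memory_kind: string;
  content: string;
}

// Order memory so higher-authority kinds land first in the context
// window; unknown kinds sink to the bottom instead of silently
// outranking policy.
function rankByPrecedence(items: RankedMemory[]): RankedMemory[] {
  const rank = (k: string) => {
    const i = PRECEDENCE.indexOf(k as (typeof PRECEDENCE)[number]);
    return i === -1 ? PRECEDENCE.length : i;
  };
  return [...items].sort((a, b) => rank(a.memory_kind) - rank(b.memory_kind));
}

const ordered = rankByPrecedence([
  { memory_kind: "conversation", content: "tone from an old thread" },
  { memory_kind: "business_grounding", content: "current handling rule" },
]);
// ordered[0] is the grounded rule, not the conversational residue
```

The stable sort matters: within a kind, retrieval order (similarity score) is preserved.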
For embeddings, I kept the implementation intentionally boring: chunk documents by semantic section, attach document version metadata, and reject retrieval if the retrieved document version is older than the active account policy version. That one check saved me from a lot of future pain.
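The version gate is a single comparison at retrieval time. A sketch with illustrative field names:

```typescript
interface RetrievedChunk {
  text: string;
  docVersion: number; // version metadata attached at chunking time
}

// Reject grounding chunks whose source document is older than the
// account's active policy version, so stale authority never reaches
// the model.
function filterStaleChunks(
  chunks: RetrievedChunk[],
  activePolicyVersion: number
): RetrievedChunk[] {
  return chunks.filter((c) => c.docVersion >= activePolicyVersion);
}

const kept = filterStaleChunks(
  [
    { text: "old escalation rule", docVersion: 2 },
    { text: "current escalation rule", docVersion: 3 },
  ],
  3
);
// kept contains only the version-3 chunk
```

The check is cheap precisely because the version lives in chunk metadata rather than requiring a second lookup per chunk.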
One useful operational principle: Prompts should express behavior style; documents should express business truth. Once I wrote that down, a bunch of design decisions got easier.
TL;DR: The failure mode was not missing memory—it was conflicting memory retrieved without trust-aware ranking and observability.
Soundwave gave me the cleanest bad example. In one email-processing path, it pulled the current thread summary, prior task metadata, and a stale preference note that had been promoted too aggressively during an earlier experiment. The output looked reasonable in isolation, but it applied the wrong handling pattern because the stale note rode along as if it were policy.
That forced me to add monitoring signals instead of just better retrieval. If you can't see memory inconsistency, you'll blame the model forever.
The monitoring set now includes:

- Stale retrievals, where a memory's source version trails the active policy
- Missing authoritative sources on high-impact paths
- Conflicts between memory types within the same scope
- Expired-memory hits that should have been purged
- Divergence between agents working on the same account or workflow
Here's a simplified event envelope:
```json
{
  "agent": "soundwave",
  "workflow": "email-processing",
  "scope_key": "thread_123",
  "retrieved_memory": {
    "business_grounding": 4,
    "task_context": 2,
    "conversation": 3,
    "user_preference": 1
  },
  "stale_hits": 1,
  "conflicts_detected": 1,
  "authoritative_source_present": true,
  "decision_trace_id": "trace_abc123"
}
```

I also added a preflight consistency check in the shared library. Before an agent executes a high-impact action, it asks three questions: Is an authoritative source present? Does any newer, higher-trust memory contradict the planned action? Are there unresolved conflicts across memory types in this scope?
If the answer pattern looks bad, the agent downgrades from autonomous action to draft mode or escalation.
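A sketch of that preflight gate—the field names and the exact downgrade rules here are illustrative, not our library's actual API:

```typescript
type ActionMode = "autonomous" | "draft" | "escalate";

interface PreflightAnswers {
  hasAuthoritativeSource: boolean;
  newerMemoryContradictsPlan: boolean;
  unresolvedConflicts: number;
}

// Downgrade when the answer pattern looks bad: a direct contradiction
// escalates to a human; missing authority or open conflicts force
// draft mode instead of autonomous action.
function preflight(a: PreflightAnswers): ActionMode {
  if (a.newerMemoryContradictsPlan) return "escalate";
  if (!a.hasAuthoritativeSource || a.unresolvedConflicts > 0) return "draft";
  return "autonomous";
}
```

The point is that the gate is deterministic code, not a prompt instruction the model can talk itself out of.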
Anthropic's public guidance on building reliable LLM systems emphasizes that narrowing the space for incorrect behavior with structured prompts, tools, and validation layers matters as much as raw model capability. That's consistent with what I saw here. The fix was architectural, not magical.
TL;DR: Shared library patterns matter because memory rules duplicated across agents turn into inconsistent behavior the moment your fleet grows.
The more I work on this, the less I want memory logic living inside agent repos. Every local optimization becomes a future inconsistency. So the hardening work moved into a shared package with three responsibilities: classifying and storing typed memory records, assembling execution context with explicit precedence, and running consistency checks that emit monitoring events.
That package now sits between each agent and Supabase. Agents can still ask for custom context, but they do it through the same interface. That's the only way I know to keep the agent ecosystem architecture sane when Sparkles is reacting in Slack, Soundwave is processing email, and Concierge is doing general-purpose coordination.
A stripped-down config looks like this:
```yaml
memory:
  retrieval:
    business_grounding_top_k: 6
    shared_org_top_k: 4
    min_similarity: 0.78
  precedence:
    - system_state
    - business_grounding
    - task_context
    - tool_history
    - user_preference
    - conversation
    - scratchpad
  safety:
    block_on_stale_authority: true
    require_authoritative_source: true
    downgrade_on_conflict: true
```

One thing I would do differently: I would have introduced versioned memory schemas earlier. Retrofitting memory contracts after agents are already in the wild is tedious. Not impossible—just the sort of work that makes you stare at your coffee like it personally betrayed you.
The three pillars of production RAG for agents are straightforward: authoritative grounding, explicit memory precedence, and observable execution traces. Miss one and you can still demo. Miss two and you probably shouldn't ship.
**What's the difference between agent memory and retrieval?** Agent memory is the broader system of stored context an agent can use over time, while retrieval is just one mechanism for fetching some of that context at runtime. In production, not all memory should be retrieved semantically. Task state, locks, and audit events usually belong in structured storage with deterministic lookups, while business documents and curated knowledge often benefit from embeddings and similarity search.

**How do you prevent memory drift?** You prevent memory drift by separating memory into explicit types, assigning trust and retention rules to each type, and enforcing precedence before the model generates output. In our stack, business document grounding outranks conversational residue, and stale or superseded memory triggers downgrade-to-draft behavior instead of autonomous action. Monitoring for stale retrievals and cross-agent divergence closes the feedback loop.

**Why not put everything in a vector store?** Semantic retrieval is the wrong tool for some classes of state. System state, workflow progress, retries, and compliance logs need exactness and strong consistency more than approximate similarity. A mixed design—relational storage for structured state plus vector search for document retrieval—is usually the better production pattern.

**Which monitoring signals should you start with?** Start with stale retrievals, missing authoritative sources, memory conflicts, expired-memory hits, and divergence between agents working on the same account or workflow. If you cannot answer which memory types influenced a given output, you don't yet have sufficient observability for production use.

**When should an agent escalate instead of acting?** Escalate when a high-impact action lacks authoritative grounding, when a newer trusted memory contradicts what the agent plans to do, or when the system detects unresolved conflicts across memory types. Draft mode and human review are cheaper than silent state corruption.
Today was less glamorous than shipping a new agent, but probably more important. I didn't make Sparkles funnier or Soundwave faster. I made the fleet a little less likely to confidently do the wrong thing—a much better trade once you start scaling across orchestrators and worker nodes.
Tomorrow I'll probably regret some schema choice I made this afternoon, and there's still plenty left to do around summarization decay, memory compaction, and cross-agent handoff rules. But the direction is right: less hidden policy in prompts, more explicit memory contracts in shared infrastructure.
If you're building something similar, I'd love to hear about it. Follow along tomorrow.