
🤖 Ghostwritten by GPT 5.4 · Fact-checked & edited by Claude Opus 4.6 · Curated by Tom Hundley
This week I stopped treating memory like a prompt-engineering problem and started treating it like infrastructure. If you're running multiple agents across real workflows, your memory systems need explicit types, document grounding, and monitoring signals—or they will drift. That's the short version. The longer version is that Sparkles, Soundwave, Concierge, and the rest of our fleet were all "working" right up until they started remembering the wrong thing with high confidence.
The trigger was Soundwave, our email agent. It didn't fail dramatically. It failed in the most annoying production way possible: by becoming inconsistently helpful. Sometimes it applied the latest client handling rules. Sometimes it leaned on stale assumptions from prior runs. Sometimes it over-weighted conversational context that should have expired. That's memory drift, and it's one of those problems that looks like model quality until you instrument the stack and realize it's a systems problem.
So I started hardening our agent ecosystem architecture around a nine-type memory taxonomy, a shared library for business document grounding, and Supabase-backed checks that catch inconsistent state before it leaks into client-facing work.
TL;DR: Memory drift is not a cosmetic agent bug—it's a state-management failure that compounds as you add agents, tools, and long-running workflows.
When people talk about agent failures, they usually jump straight to model selection. Bigger model, better prompt, more retries. I've done all of that. It helps, but it doesn't solve the core issue once an agent has to operate over time.
The issue is that agents don't have one memory. They have several competing memories: recent chat context, retrieved documents, workflow state, tool outputs, user preferences, and whatever leftovers you accidentally kept around because it was convenient in version one. If you don't separate those concerns, the model blends them into one mushy latent blob and starts making decisions from stale context.
In our case, Soundwave was processing email threads where account-specific handling had changed. The latest policy existed in our business docs, but the agent occasionally privileged older conversational context because the retrieval path and the task-state path were not clearly ranked. The result wasn't nonsense. It was worse than nonsense—plausible output using the wrong source of truth.
Current agent research is converging on the same insight: memory has to be structured. LangChain's long-term memory guidance distinguishes between semantic, episodic, and procedural memory. Mem0 has pushed the idea that selective memory formation matters more than dumping raw transcripts into a vector store. Those are useful frames, but for production hardening I needed something more operational.
So I landed on a nine-type taxonomy that maps more cleanly to how our agents actually behave:
The table below became the backbone for our shared library work because each type has different retention rules, trust levels, and retrieval strategies.
| Memory Type | Purpose | Retention Pattern | Trust Priority |
|---|---|---|---|
| Conversation history | Local dialogue continuity | Short-lived | Medium |
| Task context | Current workflow state | Until task completion | High |
| User preferences | Personalization | Long-lived with review | Medium |
| Business document grounding | Source-of-truth instructions | Versioned | Highest |
| Tool execution history | Prior API/tool results | Short to medium | High |
| System state | Queue, locks, retries, status | Real-time | Highest |
| Agent scratchpad | Intermediate reasoning artifacts | Ephemeral | Low |
| Shared organizational memory | Cross-agent reusable knowledge | Curated | High |
| Compliance and audit memory | Traceability and review | Long-lived | Highest |
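One way to make those rules executable is a typed policy map. This is an illustrative sketch, not our actual library code—the kind names mirror the table, but the retention labels and numeric trust levels are placeholders:

```typescript
// Hypothetical sketch: encode the taxonomy's retention and trust
// rules as data so every agent reads the same contract.
type Kind =
  | "conversation" | "task_context" | "user_preference"
  | "business_grounding" | "tool_history" | "system_state"
  | "scratchpad" | "shared_org" | "compliance_audit";

interface MemoryPolicy {
  retention: "ephemeral" | "short" | "medium" | "task" | "long" | "versioned" | "realtime" | "curated";
  trust: 0 | 1 | 2 | 3; // 3 = highest
}

const MEMORY_POLICIES: Record<Kind, MemoryPolicy> = {
  conversation:       { retention: "short",     trust: 1 },
  task_context:       { retention: "task",      trust: 2 },
  user_preference:    { retention: "long",      trust: 1 },
  business_grounding: { retention: "versioned", trust: 3 },
  tool_history:       { retention: "medium",    trust: 2 },
  system_state:       { retention: "realtime",  trust: 3 },
  scratchpad:         { retention: "ephemeral", trust: 0 },
  shared_org:         { retention: "curated",   trust: 2 },
  compliance_audit:   { retention: "long",      trust: 3 },
};
```

Making the policy a data structure, rather than scattered prompt text, is what lets a shared library enforce it uniformly.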
One definitive statement: If business document grounding does not outrank conversational residue, your agent doesn't have memory—it has nostalgia.
TL;DR: The point of the taxonomy isn't academic neatness—it's to give every memory class an owner, a storage pattern, and a retrieval rule.
I built this into our shared library first so Sparkles, Soundwave, and Concierge all consume the same memory contract instead of each agent reinventing storage conventions. That was the anti-pattern I wanted to kill. Agent-local memory hacks are fun until your third agent ships.
At the code level, the core abstraction is a typed memory envelope. Not sophisticated—just boring enough to survive production.
```typescript
export type MemoryKind =
  | "conversation"
  | "task_context"
  | "user_preference"
  | "business_grounding"
  | "tool_history"
  | "system_state"
  | "scratchpad"
  | "shared_org"
  | "compliance_audit";

export interface MemoryRecord {
  id: string;
  agent_name: "sparkles" | "soundwave" | "concierge" | "harvest" | "insurance";
  memory_kind: MemoryKind;
  scope_key: string; // thread, user, account, workflow, etc.
  content: Record<string, unknown>;
  source: "retrieval" | "tool" | "user" | "system" | "document";
  trust_score: number;
  version: number;
  expires_at?: string;
  created_at: string;
}
```

The key design choice was separating scope from kind. That let me ask very specific questions at runtime—for example, "what task context exists for this thread?" without dragging in conversation residue from the same scope.
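Separating scope from kind makes those runtime questions composable. A minimal in-memory sketch—the real path goes through Supabase, and `queryMemory` is an illustrative helper, not our library's API:

```typescript
interface MemoryRecordLite {
  memory_kind: string;
  scope_key: string;
  content: Record<string, unknown>;
}

// Ask a narrow question: "what task context exists for this thread?"
// without pulling in conversation residue from the same scope.
function queryMemory(
  records: MemoryRecordLite[],
  kind: string,
  scopeKey: string
): MemoryRecordLite[] {
  return records.filter(
    (r) => r.memory_kind === kind && r.scope_key === scopeKey
  );
}

const records: MemoryRecordLite[] = [
  { memory_kind: "task_context", scope_key: "thread_123", content: { step: "triage" } },
  { memory_kind: "conversation", scope_key: "thread_123", content: { last: "thanks!" } },
  { memory_kind: "task_context", scope_key: "thread_456", content: { step: "reply" } },
];

const hits = queryMemory(records, "task_context", "thread_123");
// hits holds only the task context for thread_123
```

The same two-axis filter works whether the backing store is a relational table or a vector index.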
For storage, we're using Supabase Postgres plus pgvector for retrieval-friendly memory types. Not every memory belongs in embeddings—that's another anti-pattern. System state, for example, should not be retrieved semantically when a relational lookup is the right answer.
Here's the rough split: structured, exact-lookup state—system state, task context, tool history, audit events—lives in relational tables, while business document grounding and shared organizational knowledge sit behind pgvector similarity search.

According to the pgvector documentation, the extension enables vector similarity search inside Postgres, which is exactly why it fits this pattern: one operational data plane, multiple access paths. Supabase's platform documentation lists pgvector as a first-class extension, which made it practical to keep memory classification close to app data instead of building a separate memory subsystem on day one.
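That split can be pinned down in one routing function. A sketch under the assumptions above—the kind names follow the taxonomy, and the exact assignment per kind is illustrative:

```typescript
type AccessPath = "relational" | "vector";

// Route each memory kind to the access path that suits it:
// similarity search for embedded documents, deterministic
// Postgres lookups for everything stateful or auditable.
function accessPathFor(kind: string): AccessPath {
  switch (kind) {
    case "business_grounding":
    case "shared_org":
      return "vector"; // chunked, embedded, retrieved by similarity
    default:
      return "relational"; // system state, task context, tool history, audit
  }
}
```

Centralizing this decision keeps individual agents from quietly stuffing workflow state into the vector store because it was the nearest hammer.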
TL;DR: Rule-heavy prompts feel productive early, but versioned business documents make memory drift prevention far more reliable once multiple agents share responsibility.
This was the biggest practical shift. We had accumulated a lot of "agent behavior" inside prompts: if-then instructions, customer exceptions, escalation notes, formatting preferences, edge-case handling. It worked until it didn't.
The problem with rule-based prompts is that they create hidden policy. You update a behavior in one prompt template, forget the cousin prompt in another agent, and now Sparkles and Soundwave disagree about reality. That's not intelligence—it's config drift with better grammar.
So I started moving durable instructions into business documents that can be chunked, embedded, versioned, and cited in execution logs. The shared library now exposes one retrieval path for authoritative grounding:
```typescript
const grounding = await memory.getAuthoritativeGrounding({
  accountId,
  workflow: "email-processing",
  agent: "soundwave",
  topK: 6,
  minScore: 0.78
});
```
```typescript
const context = await memory.buildExecutionContext({
  threadId,
  userId,
  include: [
    "task_context",
    "tool_history",
    "business_grounding",
    "user_preference"
  ],
  precedence: [
    "system_state",
    "business_grounding",
    "task_context",
    "tool_history",
    "conversation"
  ]
});
```

The important part is `precedence`. We now force the assembly layer to rank memory types before the model sees them. That prevented a class of errors where a polite but outdated email thread outweighed a newer operating rule from a grounded document.
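Under the hood, precedence enforcement is just a stable sort over ranked kinds before context assembly. A minimal sketch—the rank order mirrors the `precedence` array, and the helper names are hypothetical:

```typescript
const PRECEDENCE = [
  "system_state",
  "business_grounding",
  "task_context",
  "tool_history",
  "conversation",
] as const;

interface RankedMemory {
  memory_kind: string;
  content: string;
}

// Order memory so higher-authority kinds land first in the context
// window; unknown kinds sink to the bottom instead of silently
// outranking policy.
function rankByPrecedence(items: RankedMemory[]): RankedMemory[] {
  const rank = (k: string) => {
    const i = PRECEDENCE.indexOf(k as (typeof PRECEDENCE)[number]);
    return i === -1 ? PRECEDENCE.length : i;
  };
  return [...items].sort((a, b) => rank(a.memory_kind) - rank(b.memory_kind));
}

const ordered = rankByPrecedence([
  { memory_kind: "conversation", content: "tone from an old thread" },
  { memory_kind: "business_grounding", content: "current handling rule" },
]);
// ordered[0] is the grounded rule, not the conversational residue
```

The stable sort matters: within a kind, retrieval order (similarity score) is preserved.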
For embeddings, I kept the implementation intentionally boring: chunk documents by semantic section, attach document version metadata, and reject retrieval if the retrieved document version is older than the active account policy version. That one check saved me from a lot of future pain.
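The version gate is a single comparison at retrieval time. A sketch with illustrative field names:

```typescript
interface RetrievedChunk {
  text: string;
  docVersion: number; // version metadata attached at chunking time
}

// Reject grounding chunks whose source document is older than the
// account's active policy version, so stale authority never reaches
// the model.
function filterStaleChunks(
  chunks: RetrievedChunk[],
  activePolicyVersion: number
): RetrievedChunk[] {
  return chunks.filter((c) => c.docVersion >= activePolicyVersion);
}

const kept = filterStaleChunks(
  [
    { text: "old escalation rule", docVersion: 2 },
    { text: "current escalation rule", docVersion: 3 },
  ],
  3
);
// kept contains only the version-3 chunk
```

The check is cheap precisely because the version lives in chunk metadata rather than requiring a second lookup per chunk.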
One useful operational principle: Prompts should express behavior style; documents should express business truth. Once I wrote that down, a bunch of design decisions got easier.
TL;DR: The failure mode was not missing memory—it was conflicting memory retrieved without trust-aware ranking and observability.
Soundwave gave me the cleanest bad example. In one email-processing path, it pulled the current thread summary, prior task metadata, and a stale preference note that had been promoted too aggressively during an earlier experiment. The output looked reasonable in isolation, but it applied the wrong handling pattern because the stale note rode along as if it were policy.
That forced me to add monitoring signals instead of just better retrieval. If you can't see memory inconsistency, you'll blame the model forever.
The monitoring set now includes:

- Stale retrievals, where a memory's source version trails the active policy
- Missing authoritative sources on high-impact paths
- Conflicts between memory types within the same scope
- Expired-memory hits that should have been purged
- Divergence between agents working on the same account or workflow
Here's a simplified event envelope:
```json
{
  "agent": "soundwave",
  "workflow": "email-processing",
  "scope_key": "thread_123",
  "retrieved_memory": {
    "business_grounding": 4,
    "task_context": 2,
    "conversation": 3,
    "user_preference": 1
  },
  "stale_hits": 1,
  "conflicts_detected": 1,
  "authoritative_source_present": true,
  "decision_trace_id": "trace_abc123"
}
```

I also added a preflight consistency check in the shared library. Before an agent executes a high-impact action, it asks three questions: Is an authoritative source present? Does any newer, higher-trust memory contradict the planned action? Are there unresolved conflicts across memory types in this scope?
If the answer pattern looks bad, the agent downgrades from autonomous action to draft mode or escalation.
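A sketch of that preflight gate—the field names and the exact downgrade rules here are illustrative, not our library's actual API:

```typescript
type ActionMode = "autonomous" | "draft" | "escalate";

interface PreflightAnswers {
  hasAuthoritativeSource: boolean;
  newerMemoryContradictsPlan: boolean;
  unresolvedConflicts: number;
}

// Downgrade when the answer pattern looks bad: a direct contradiction
// escalates to a human; missing authority or open conflicts force
// draft mode instead of autonomous action.
function preflight(a: PreflightAnswers): ActionMode {
  if (a.newerMemoryContradictsPlan) return "escalate";
  if (!a.hasAuthoritativeSource || a.unresolvedConflicts > 0) return "draft";
  return "autonomous";
}
```

The point is that the gate is deterministic code, not a prompt instruction the model can talk itself out of.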
Anthropic's public guidance on building reliable LLM systems emphasizes that narrowing the space for incorrect behavior with structured prompts, tools, and validation layers matters as much as raw model capability. That's consistent with what I saw here. The fix was architectural, not magical.
TL;DR: Shared library patterns matter because memory rules duplicated across agents turn into inconsistent behavior the moment your fleet grows.
The more I work on this, the less I want memory logic living inside agent repos. Every local optimization becomes a future inconsistency. So the hardening work moved into a shared package with three responsibilities: classifying and storing typed memory records, assembling execution context with explicit precedence, and running consistency checks that emit monitoring events.
That package now sits between each agent and Supabase. Agents can still ask for custom context, but they do it through the same interface. That's the only way I know to keep the agent ecosystem architecture sane when Sparkles is reacting in Slack, Soundwave is processing email, and Concierge is doing general-purpose coordination.
A stripped-down config looks like this:
```yaml
memory:
  retrieval:
    business_grounding_top_k: 6
    shared_org_top_k: 4
    min_similarity: 0.78
  precedence:
    - system_state
    - business_grounding
    - task_context
    - tool_history
    - user_preference
    - conversation
    - scratchpad
  safety:
    block_on_stale_authority: true
    require_authoritative_source: true
    downgrade_on_conflict: true
```

One thing I would do differently: I would have introduced versioned memory schemas earlier. Retrofitting memory contracts after agents are already in the wild is tedious. Not impossible—just the sort of work that makes you stare at your coffee like it personally betrayed you.
The three pillars of production RAG for agents are straightforward: authoritative grounding, explicit memory precedence, and observable execution traces. Miss one and you can still demo. Miss two and you probably shouldn't ship.
**What's the difference between agent memory and retrieval?** Agent memory is the broader system of stored context an agent can use over time, while retrieval is just one mechanism for fetching some of that context at runtime. In production, not all memory should be retrieved semantically. Task state, locks, and audit events usually belong in structured storage with deterministic lookups, while business documents and curated knowledge often benefit from embeddings and similarity search.

**How do you prevent memory drift?** You prevent memory drift by separating memory into explicit types, assigning trust and retention rules to each type, and enforcing precedence before the model generates output. In our stack, business document grounding outranks conversational residue, and stale or superseded memory triggers downgrade-to-draft behavior instead of autonomous action. Monitoring for stale retrievals and cross-agent divergence closes the feedback loop.

**Why not put everything in a vector store?** Semantic retrieval is the wrong tool for some classes of state. System state, workflow progress, retries, and compliance logs need exactness and strong consistency more than approximate similarity. A mixed design—relational storage for structured state plus vector search for document retrieval—is usually the better production pattern.

**Which monitoring signals should you start with?** Start with stale retrievals, missing authoritative sources, memory conflicts, expired-memory hits, and divergence between agents working on the same account or workflow. If you cannot answer which memory types influenced a given output, you don't yet have sufficient observability for production use.

**When should an agent escalate instead of acting?** Escalate when a high-impact action lacks authoritative grounding, when a newer trusted memory contradicts what the agent plans to do, or when the system detects unresolved conflicts across memory types. Draft mode and human review are cheaper than silent state corruption.
Today was less glamorous than shipping a new agent, but probably more important. I didn't make Sparkles funnier or Soundwave faster. I made the fleet a little less likely to confidently do the wrong thing—a much better trade once you start scaling across orchestrators and worker nodes.
Tomorrow I'll probably regret some schema choice I made this afternoon, and there's still plenty left to do around summarization decay, memory compaction, and cross-agent handoff rules. But the direction is right: less hidden policy in prompts, more explicit memory contracts in shared infrastructure.
If you're building something similar, I'd love to hear about it. Follow along tomorrow.