Agentic Memory: State Persistence Patterns for Multi-Session AI Workflows

Storage backends, retrieval strategies, and context window trade-offs for agent memory systems that persist across sessions.

Source: blog.mikiobraun.de

Agent memory is the gap between a stateless LLM call and a system that knows what you did last week. The problem is simple: context windows are finite, sessions end, and agents need to remember facts, preferences, and task state across interactions. The solution space is messy.

Mikio Braun’s follow-up post surveys the current memory infrastructure landscape after his initial exploration. The key insight is that memory is not one problem. It is at least six different architectural patterns, each with different storage backends, retrieval latencies, and failure modes.

The Memory Stack

Braun’s survey identifies six categories of memory approaches. The three most relevant for production systems are:

Platform-native memory: ChatGPT’s saved memories, Claude’s conversation summaries, Gemini’s cross-product context. These are opaque, product-driven implementations. You get convenience but no control over retrieval logic or schema evolution.

Memory middleware: APIs like Mem0 and Letta that sit between your agent and storage. They extract facts from conversations, store them in vector or graph databases, and inject relevant context into prompts. You own the orchestration but delegate the memory layer.

Embedded memory libraries: Code you import into your agent runtime. Think LangChain’s memory modules or custom state managers in LangGraph. You control everything but also handle all the plumbing.

Platform memory gives you zero-config persistence but locks you into provider-specific retrieval logic. Middleware adds a network hop but lets you swap LLM providers. Embedded libraries give you full control at the cost of managing storage lifecycle yourself.

Storage Backend Trade-offs

Different memory types demand different storage shapes:

| Memory Type | Storage Backend | Retrieval Method | Latency Profile | Failure Mode |
| --- | --- | --- | --- | --- |
| Episodic (conversation history) | Key-value store (Redis, DynamoDB) | Session ID lookup | <10ms | Session keys accumulate until Redis memory limit, causing eviction of active sessions |
| Semantic (facts, preferences) | Vector DB (Pinecone, Weaviate) | Embedding similarity | 50-200ms | Embedding model updates invalidate similarity scores, returning irrelevant facts |
| Procedural (task workflows) | Graph DB (Neo4j, Memgraph) | Path traversal | 100-500ms | Graph traversal queries break when node properties change, requiring full reindex |
| Working memory (current task state) | In-memory (agent runtime) | Direct access | <1ms | Lost on crash, no persistence |

Most production systems use a hybrid: working memory in-process, episodic memory in a fast key-value store, and semantic memory in a vector database. The orchestration layer decides when to query each backend.
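As a rough illustration of that orchestration layer, here is a minimal sketch with plain Python containers standing in for the three backends. The `HybridMemory` class, its field names, and the keyword-overlap `_search` are hypothetical placeholders, not any particular library's API; a real system would swap in a key-value client and a vector index.

```python
from dataclasses import dataclass, field


@dataclass
class HybridMemory:
    """Toy three-tier memory; each container stands in for a real backend."""
    working: dict = field(default_factory=dict)   # in-process task state
    episodic: dict = field(default_factory=dict)  # session_id -> list of turns
    semantic: list = field(default_factory=list)  # stored fact strings

    def recall(self, session_id, query, include_facts=True, max_turns=4):
        """Gather context, cheapest tier first; the slow tier is optional."""
        context = {
            "task_state": dict(self.working),
            "recent_turns": self.episodic.get(session_id, [])[-max_turns:],
        }
        if include_facts:  # only pay for the slow tier when asked
            context["facts"] = self._search(query)
        return context

    def _search(self, query):
        # Placeholder for embedding similarity: keep facts that share a word
        # with the query. A real system would query a vector index here.
        words = set(query.lower().split())
        return [f for f in self.semantic if words & set(f.lower().split())]


memory = HybridMemory()
memory.working["step"] = "collect_requirements"
memory.episodic["s1"] = ["User: hi", "Agent: hello"]
memory.semantic.append("user prefers dark mode")
context = memory.recall("s1", "which mode does the user like")
```

The `include_facts` flag is where a routing decision would live: a caller can skip the semantic tier for turns that only need recent history.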

Retrieval Strategy Patterns

Storage is straightforward. Retrieval requires deciding which memories are relevant and when to query them.

Explicit retrieval: The agent explicitly calls a memory tool when it needs context. This keeps prompts lean but requires the agent to know when it is missing information. Works well for task-specific workflows where memory needs are predictable.

Automatic injection: The memory system queries relevant context before every agent turn and injects it into the prompt. Higher token cost, lower cognitive load on the agent. ChatGPT’s reference chat history works this way.

Hybrid with budget: Retrieve top-k memories by relevance, then prune to fit a token budget. Requires a scoring function (recency, relevance, user-flagged importance). Some middleware layers implement this pattern, though the specific retrieval logic varies by provider.

The latency problem is real. If your agent needs to query a vector database mid-execution, you add 100-200ms per retrieval. For interactive workflows, that compounds fast: five sequential retrievals in a single turn add 0.5-1 second. Some systems pre-fetch likely memories at session start, trading upfront latency for smoother execution.
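A minimal sketch of pre-fetching, assuming the likely queries are known at session start. `PrefetchingMemory` and `fake_vector_search` are hypothetical names; the sleep stands in for network and index latency, and the exact-match cache is a simplification (a real system would more likely key on topics or embeddings).

```python
import time


class PrefetchingMemory:
    def __init__(self, search_fn):
        self._search = search_fn  # the slow backend call
        self._cache = {}
        self.backend_calls = 0    # counter for observing cache behavior

    def prefetch(self, likely_queries):
        """Pay the retrieval latency once, at session start."""
        for q in likely_queries:
            self._cache[q] = self._fetch(q)

    def get(self, query):
        """Serve from cache when possible; hit the backend on a miss."""
        if query not in self._cache:
            self._cache[query] = self._fetch(query)
        return self._cache[query]

    def _fetch(self, query):
        self.backend_calls += 1
        return self._search(query)


def fake_vector_search(query):
    time.sleep(0.01)  # stand-in for a real vector database round trip
    return [f"fact about {query}"]


session_memory = PrefetchingMemory(fake_vector_search)
session_memory.prefetch(["billing", "shipping"])
```

After the prefetch, `session_memory.get("billing")` returns without touching the backend, while an unanticipated query still pays the full round trip.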

Unbounded Growth and Pruning

Memory grows without bounds unless you intervene. Three common strategies:

  1. Time-based expiration: Drop memories older than N days. Simple but loses long-term context.
  2. Summarization: Periodically compress old memories into higher-level summaries. Reduces token count but loses detail. Requires a separate summarization pass (more LLM calls, more cost).
  3. Importance scoring: Tag memories with relevance scores and prune low-value entries. Requires either user feedback or a heuristic (access frequency, recency, explicit user saves).

None of these are free. Summarization adds latency and cost. Scoring requires instrumentation. Expiration loses data. The choice depends on whether you prioritize cost control, context preservation, or operational simplicity.
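Strategies 1 and 3 are simple enough to sketch directly; summarization is omitted because it needs an LLM call. The `StoredMemory` fields and the importance formula below are illustrative assumptions, one possible heuristic among many.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StoredMemory:
    text: str
    created_at: datetime
    access_count: int = 0
    user_saved: bool = False


def expire(memories, now, max_age_days):
    """Time-based expiration: drop anything older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [m for m in memories if m.created_at >= cutoff]


def importance(m, now):
    """Heuristic score from access frequency, recency, and explicit saves."""
    age_days = (now - m.created_at).total_seconds() / 86400
    return m.access_count / (1.0 + age_days) + (10.0 if m.user_saved else 0.0)


def prune_to(memories, now, keep):
    """Importance scoring: keep only the `keep` highest-scoring memories."""
    ranked = sorted(memories, key=lambda m: importance(m, now), reverse=True)
    return ranked[:keep]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
old = StoredMemory("old", now - timedelta(days=60), access_count=50)
fresh = StoredMemory("fresh", now - timedelta(days=1), access_count=2)
saved = StoredMemory("saved", now - timedelta(days=20), user_saved=True)
```

With these three records, a 30-day expiration drops `old` even though it was accessed most often, which is the long-term context loss described above; importance scoring with `keep=2` also drops it, but only because its accesses are spread over a long age.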

Schema Evolution and Versioning

Agent capabilities evolve. You add new tools, change prompt structures, or refactor your orchestration graph. Your memory schema needs to keep up.

The problem: memories stored six months ago reference tools that no longer exist or use a fact schema that has changed. Do you migrate old memories? Ignore them? Re-embed them with the new schema?

Migration strategies:

  • Lazy migration: Re-process memories on retrieval if they match an old schema. Spreads the cost but adds retrieval latency.
  • Batch migration: Run a background job to update all memories when the schema changes. Upfront cost, clean retrieval.
  • Versioned schemas: Store schema version with each memory and handle multiple versions in retrieval logic. More complex but avoids data loss.

If you are building a memory layer, version your schema from day one. Retrofitting versioning into an existing system requires migrating all stored memories, which can be expensive at scale.
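To make the versioning advice concrete, here is a minimal sketch that combines versioned schemas with lazy migration. The field names (`fact`, `text`, `source`) and the two schema changes are invented for the example, and a plain dict stands in for the memory store.

```python
CURRENT_VERSION = 3


def _v1_to_v2(record):
    # Hypothetical change: v1 stored the fact under "fact"; v2 renamed it "text".
    record = dict(record)
    record["text"] = record.pop("fact")
    record["schema_version"] = 2
    return record


def _v2_to_v3(record):
    # Hypothetical change: v3 added a "source" field with a default.
    record = dict(record)
    record.setdefault("source", "unknown")
    record["schema_version"] = 3
    return record


MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(record):
    """Apply migrations one version at a time until the record is current."""
    version = record.get("schema_version", 1)  # unversioned records assumed v1
    while version < CURRENT_VERSION:
        record = MIGRATIONS[version](record)
        version = record["schema_version"]
    return record


def read_memory(store, key):
    """Lazy migration: upgrade on read and write the result back."""
    record = store[key]
    upgraded = upgrade(record)
    if upgraded is not record:
        store[key] = upgraded  # later reads skip the migration cost
    return upgraded


store = {
    "a": {"fact": "likes tea"},
    "b": {"schema_version": 3, "text": "works remotely", "source": "chat"},
}
```

Chaining single-step migrations means each schema change needs only one new function, and a batch migration can reuse `upgrade` by looping over every key.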

Implementation Sketch

Here is a minimal hybrid memory setup using Redis for episodic storage and Pinecone for semantic retrieval:

import hashlib
import json
import os
from redis import Redis
from pinecone import Pinecone
from openai import OpenAI

# Configuration
TTL_SECONDS = 86400 * 7  # 7 days

class AgentMemory:
    def __init__(self, session_id):
        self.session_id = session_id
        self.redis = Redis(host='localhost', port=6379)
        self.pinecone = Pinecone(api_key=os.environ['PINECONE_API_KEY'])
        self.index = self.pinecone.Index('agent-memory')
        self.openai = OpenAI()

    def store_turn(self, user_msg, agent_msg):
        # Episodic: append to session history
        key = f"session:{self.session_id}"
        self.redis.rpush(key, f"User: {user_msg}")
        self.redis.rpush(key, f"Agent: {agent_msg}")
        self.redis.expire(key, TTL_SECONDS)

        # Semantic: extract and embed facts
        facts = self._extract_facts(user_msg, agent_msg)
        for fact in facts:
            embedding = self._embed(fact)
            # Built-in hash() is salted per process for strings, so use a
            # stable digest to keep vector IDs identical across restarts
            fact_id = hashlib.sha256(fact.encode("utf-8")).hexdigest()[:16]
            self.index.upsert([(
                f"{self.session_id}:{fact_id}",
                embedding,
                {"text": fact, "session": self.session_id}
            )])

    def retrieve_context(self, query, k=5):
        # Episodic: last N turns
        recent = self.redis.lrange(f"session:{self.session_id}", -10, -1)
        
        # Semantic: top-k relevant facts
        query_emb = self._embed(query)
        # include_metadata=True is required for matches to carry the stored text
        results = self.index.query(vector=query_emb, top_k=k, include_metadata=True)
        facts = [r.metadata['text'] for r in results.matches]

        return {
            "recent_turns": [t.decode() for t in recent],
            "relevant_facts": facts
        }

    def _extract_facts(self, user_msg, agent_msg):
        # Use LLM to extract memorable facts
        try:
            response = self.openai.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "system",
                    "content": "Extract factual statements about user preferences, past actions, or stated information. Return as JSON array of strings."
                }, {
                    "role": "user",
                    "content": f"User: {user_msg}\nAgent: {agent_msg}"
                }]
            )
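            facts = json.loads(response.choices[0].message.content)
            # Valid JSON is not necessarily a list of strings; keep only strings
            if not isinstance(facts, list):
                return []
            return [f for f in facts if isinstance(f, str)]
        except (json.JSONDecodeError, KeyError, TypeError):
            # TypeError covers a None message content
            return []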

    def _embed(self, text):
        response = self.openai.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

This pattern separates fast session retrieval from slower semantic search. As written, retrieve_context runs both lookups on every call; a caller that wants the cheaper path can read recent turns from Redis first and run the semantic query only when the request warrants it.

Observability Gaps

Memory systems are hard to debug. Without instrumentation, you cannot see what the agent retrieved or why certain facts ranked above others. Useful instrumentation includes:

  • Retrieval logs: What memories were fetched for each agent turn, with relevance scores.
  • Memory drift metrics: Track cosine similarity between fact embeddings. Flag pairs whose similarity exceeds 0.85 but whose content an LLM judge marks as contradictory.
  • Token budget tracking: How much of your context window is memory vs. task instructions.
  • Pruning audit logs: What got deleted and why.

Without these, you are flying blind when the agent hallucinates based on stale or conflicting memories.
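Two of these instruments can be sketched with the standard library alone: a retrieval log entry and the first half of the drift check, which finds near-duplicate fact pairs worth sending to a judge. The function names and the log record shape are hypothetical, and the LLM judge itself is left out.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_conflict_candidates(facts, threshold=0.85):
    """Return (text_a, text_b, similarity) for highly similar fact pairs.

    `facts` is a list of (text, embedding) tuples. Pairs above the threshold
    are only candidates; a separate judge decides whether they contradict.
    """
    pairs = []
    for (t1, e1), (t2, e2) in combinations(facts, 2):
        sim = cosine(e1, e2)
        if sim > threshold:
            pairs.append((t1, t2, sim))
    return pairs


def log_retrieval(log, turn_id, matches):
    """Append one retrieval-log entry: what was fetched, with scores."""
    log.append({
        "turn": turn_id,
        "memories": [{"text": t, "score": s} for t, s in matches],
    })


# Toy two-dimensional embeddings, made up for the example.
facts = [
    ("user likes tea", [1.0, 0.0]),
    ("user dislikes tea", [0.9, 0.1]),
    ("user lives in Oslo", [0.0, 1.0]),
]
candidates = find_conflict_candidates(facts)
```

The pairwise comparison is quadratic in the number of facts, so at scale it would more plausibly run as a background job over each new fact's nearest neighbors.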

Technical Verdict

Use agentic memory when:

  • Your agent needs to maintain state across sessions (customer support, personal assistants, long-running workflows).
  • You can tolerate 100-200ms retrieval latency. This is acceptable for async workflows like email agents but problematic for real-time chat interfaces where users expect sub-second responses.
  • You have a strategy for pruning or summarizing unbounded memory growth.
  • You can instrument retrieval to debug when memory causes incorrect behavior.

Avoid or defer when:

  • Your agent is stateless by design (one-shot tasks, ephemeral queries).
  • You cannot afford the operational overhead of managing vector databases and schema migrations.
  • Your context window is large enough to fit all relevant history in-prompt. This is possible for agents operating on fixed datasets, such as a 50-page product manual that fits in a 128k-token context window.
  • You are still iterating on core agent capabilities and memory would add too much surface area.

Memory is not a feature you bolt on at the end. It is a persistence layer with all the usual database problems: schema evolution, query optimization, data consistency, and observability. Treat it like infrastructure, not a library.