Agent memory fails in production because no one tests it before deployment. Prompts have LangSmith. RAG pipelines have Ragas. Memory backends have nothing. You find out about recall failures when a customer says “I already told you my name” or when your agent recommends steak to someone who mentioned being vegan three turns ago.
MemEval is an open-source testing framework that runs standardized scenarios against any memory backend and reports what passes, what fails, and why. It measures seven dimensions: recall accuracy, relevance filtering, consistency across sessions, latency, privacy leakage, forgetting behavior, and update propagation.
Architecture
The framework separates three layers: test scenarios, the execution harness, and backend adapters. You write scenarios in YAML, the harness runs them against any provider, and adapters translate calls into backend-specific APIs.
┌──────────────────┐
│ YAML Scenarios   │   30 built-in cases
│ (multi-turn,     │   (or write custom)
│ privacy,         │
│ recall)          │
└────────┬─────────┘
         │
         v
┌──────────────────┐
│ Evaluation       │   Runs scenarios
│ Harness          │   against backends
└────────┬─────────┘
         │
    ┌────┴────┬─────────┬─────────┬─────────┬─────────┐
    v         v         v         v         v         v
   Mem0      Zep      Letta   LangGraph   CrewAI    Custom
    │         │         │         │         │         │
    └────┬────┴─────────┴─────────┴─────────┴─────────┘
         │
         v
┌──────────────────┐
│ 7 Metrics        │   recall, relevance,
│ + Visualizer     │   consistency, latency,
│                  │   privacy, forgetting,
│                  │   update propagation
└────────┬─────────┘
         │
         v
┌──────────────────┐
│ Scorecard +      │   Console, JSON,
│ CI Reports       │   GitHub Actions
└──────────────────┘
Each scenario defines a conversation flow, expected memory state, and assertions. The harness injects messages, queries memory, and checks whether the backend retrieved the right facts at the right time.
Test Scenario Structure
A scenario is a YAML file with turns, memory expectations, and assertions. Here’s a simplified privacy leak test:
scenario:
  name: "SSN Privacy Leak"
  turns:
    - role: user
      content: "My SSN is 123-45-6789"
    - role: assistant
      content: "Got it, I've noted your details"
    - role: user
      content: "What do you know about me?"
  assertions:
    - type: privacy_leak
      field: ssn
      pattern: '\d{3}-\d{2}-\d{4}'
      should_appear: false
      context: assistant_response
    - type: recall
      query: "user identification"
      should_retrieve: true
      should_not_contain: ["123-45-6789"]
The harness runs the conversation, queries memory after each turn, and checks whether the SSN appears in retrieval results or assistant responses. If the backend stores the raw SSN text and surfaces it in either place, the test fails.
Seven Metrics Explained
| Metric | What It Measures | Failure Mode |
|---|---|---|
| Recall Accuracy | Does the backend retrieve facts mentioned N turns ago? | Agent asks for information already provided |
| Relevance Filtering | Does it return only contextually useful memories? | Agent surfaces unrelated past conversations |
| Consistency | Do facts remain stable across sessions? | Agent contradicts itself after reload |
| Latency | Query response time under load | Timeout errors in production |
| Privacy Leakage | Are PII patterns masked or excluded? | SSNs or credit card numbers appear in retrieval results or logs |
| Forgetting Behavior | Does old data decay or get pruned? | Stale preferences override new ones |
| Update Propagation | Do edits to facts reflect in retrieval? | Agent uses outdated account status |
Recall and relevance are measured with embedding similarity and keyword matching. Consistency runs the same query across session boundaries and checks for contradictions. Privacy uses regex patterns to detect PII in retrieval results.
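To make the recall and relevance scoring concrete, here is a minimal sketch of the two ingredients named above. It is an illustration, not MemEval's actual scoring code: the function names are hypothetical, and the cosine similarity runs over bag-of-words counts as a stand-in for real embedding vectors so the example needs only the standard library.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_recall(expected_keywords: list[str], retrieved: list[str]) -> float:
    """Fraction of expected keywords found anywhere in the retrieved memories."""
    if not expected_keywords:
        return 1.0
    found_tokens: set[str] = set()
    for memory in retrieved:
        found_tokens.update(tokenize(memory))
    hits = sum(1 for kw in expected_keywords if kw.lower() in found_tokens)
    return hits / len(expected_keywords)


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over token counts (assumed stand-in for embeddings)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A real harness would swap `cosine_similarity` for a comparison between embedding vectors, but the pass/fail logic around it can stay the same.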
Backend Adapter Contract
To test a new memory backend, you implement four methods:
class MemoryBackendAdapter:
    def add_message(self, user_id: str, message: dict) -> None:
        """Store a conversation turn"""
        pass

    def search(self, user_id: str, query: str, limit: int = 5) -> list[dict]:
        """Retrieve relevant memories"""
        pass

    def get_all(self, user_id: str) -> list[dict]:
        """Fetch full memory state for assertions"""
        pass

    def clear(self, user_id: str) -> None:
        """Reset state between test runs"""
        pass
The harness calls add_message for each conversation turn, search to simulate agent retrieval, and get_all to verify internal state. clear ensures test isolation.
For stateful backends like Zep or Letta, add_message might trigger summarization or entity extraction. For vector stores like Mem0, it embeds the message and stores it with metadata. The adapter hides these differences.
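As a rough illustration of the contract, the sketch below implements the four methods for a toy in-memory backend and replays a conversation against it. `InMemoryAdapter` and `run_turns` are hypothetical names, and the word-overlap search is an assumption standing in for whatever retrieval a real backend performs.

```python
class InMemoryAdapter:
    """Toy backend: stores turns in a dict and searches by word overlap."""

    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = {}

    def add_message(self, user_id: str, message: dict) -> None:
        self._store.setdefault(user_id, []).append(dict(message))

    def search(self, user_id: str, query: str, limit: int = 5) -> list[dict]:
        query_words = set(query.lower().split())
        scored = []
        for msg in self._store.get(user_id, []):
            overlap = len(query_words & set(msg["content"].lower().split()))
            if overlap:
                scored.append((overlap, msg))
        # Highest overlap first; the sort is stable, so ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [msg for _, msg in scored[:limit]]

    def get_all(self, user_id: str) -> list[dict]:
        return list(self._store.get(user_id, []))

    def clear(self, user_id: str) -> None:
        self._store.pop(user_id, None)


def run_turns(adapter, user_id: str, turns: list[dict], query: str) -> list[dict]:
    """Reset state, replay the turns in order, then issue one retrieval query."""
    adapter.clear(user_id)
    for turn in turns:
        adapter.add_message(user_id, turn)
    return adapter.search(user_id, query)
```

Because the harness only sees these four methods, the same `run_turns` loop would work unchanged against an adapter that wraps a hosted service.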
Cross-Session State Handling
Consistency tests require session boundaries. The harness creates a session, runs a conversation, closes the session, opens a new one, and queries memory. If the backend loses state between sessions, the test fails.
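The session-boundary check can be sketched in a few lines. The two backend classes here are hypothetical stand-ins (one persists to a JSON file, one keeps state in process memory), and constructing a fresh object is assumed to approximate closing one session and opening another.

```python
import json
import os


class FileBackedMemory:
    """Hypothetical backend that persists facts to a JSON file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as fh:
            return json.load(fh)

    def add(self, user_id: str, fact: str) -> None:
        data = self._load()
        data.setdefault(user_id, []).append(fact)
        with open(self.path, "w", encoding="utf-8") as fh:
            json.dump(data, fh)

    def get_all(self, user_id: str) -> list[str]:
        return self._load().get(user_id, [])


class VolatileMemory:
    """Hypothetical backend that keeps facts only in process memory."""

    def __init__(self) -> None:
        self._data: dict[str, list[str]] = {}

    def add(self, user_id: str, fact: str) -> None:
        self._data.setdefault(user_id, []).append(fact)

    def get_all(self, user_id: str) -> list[str]:
        return list(self._data.get(user_id, []))


def survives_session_boundary(make_backend, user_id: str, fact: str) -> bool:
    """Write in one session, then read from a freshly constructed backend."""
    first_session = make_backend()
    first_session.add(user_id, fact)
    del first_session  # simulate closing the session
    second_session = make_backend()
    return fact in second_session.get_all(user_id)
```

The file-backed variant passes this check and the volatile one fails it, which is the distinction the consistency metric is meant to expose.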
Some backends persist to disk (Letta), some use external databases (Zep with Postgres), and some hold everything in memory (simple RAG). The adapter’s clear method must handle all three:
def clear(self, user_id: str) -> None:
    if self.backend_type == "persistent":
        self.db.delete_user(user_id)
    elif self.backend_type == "stateful":
        self.session_store.invalidate(user_id)
    else:
        self.memory_cache.pop(user_id, None)
If your backend doesn’t expose a delete API, you can’t run isolation tests. This is a real limitation: some hosted memory services don’t let you clear state programmatically.
Forgetting Behavior Tests
Agents need to forget. A user changes their email, updates their dietary preferences, or closes an account. The memory backend should either overwrite old facts, mark them as stale, or prune them entirely.
MemEval tests this with update scenarios:
scenario:
  name: "Email Update Propagation"
  turns:
    - role: user
      content: "My email is old@example.com"
    - role: user
      content: "Actually, change that to new@example.com"
  assertions:
    - type: update_propagation
      query: "user email"
      should_retrieve: "new@example.com"
      should_not_retrieve: "old@example.com"
      grace_period_turns: 0
If the backend returns both emails, the test fails. If it returns only the old one, the test fails. It passes only when retrieval returns the new email and not the old one.
Some backends implement time-based decay: older facts get lower retrieval scores. Others use explicit versioning: each fact has a timestamp, and queries return the latest version. The test doesn’t care how it’s implemented, only that retrieval reflects the update.
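Here is a small sketch of the explicit-versioning strategy, paired with a check that mirrors the `update_propagation` assertion. Both `VersionedFacts` and `check_update_propagation` are hypothetical names invented for illustration, and the substring comparison is an assumed simplification of how a harness might match values.

```python
from typing import Optional


class VersionedFacts:
    """Hypothetical store: each write appends a version, reads return the newest."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}

    def set(self, key: str, value: str) -> None:
        self._versions.setdefault(key, []).append(value)

    def latest(self, key: str) -> Optional[str]:
        history = self._versions.get(key)
        return history[-1] if history else None

    def history(self, key: str) -> list[str]:
        return list(self._versions.get(key, []))


def check_update_propagation(
    retrieved: list[str], should_retrieve: str, should_not_retrieve: str
) -> bool:
    """Pass only if the new value is present and the old value is absent."""
    joined = " ".join(retrieved)
    return should_retrieve in joined and should_not_retrieve not in joined
```

A backend that answers queries from `latest` passes the check, while one that returns the full `history` fails it because the old email is still in the results.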
Privacy Leak Detection
PII in memory is a compliance risk. MemEval checks whether backends store raw SSNs, credit card numbers, or API keys.
The test injects a message with a known pattern, queries memory, and scans results:
import re


def check_privacy_leak(response: str, patterns: dict) -> bool:
    """Return True if any PII pattern matches the response."""
    for pii_type, regex in patterns.items():
        if re.search(regex, response):
            return True  # Leak detected
    return False
Some backends redact PII automatically (Zep has built-in PII detection). Others store everything verbatim. If your backend doesn’t redact, you need to preprocess messages before calling add_message.
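If you do need to preprocess, a redaction step in front of `add_message` might look like the sketch below. The two patterns and the placeholder format are assumptions chosen for illustration; real PII detection needs much broader coverage than a pair of regexes.

```python
import re

# Illustrative patterns only; they do not cover every SSN or card format.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each PII match with a typed placeholder such as [REDACTED_SSN]."""
    for pii_type, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{pii_type.upper()}]", text)
    return text


def redact_message(message: dict) -> dict:
    """Return a copy of a conversation turn with its content redacted."""
    return {**message, "content": redact(message["content"])}
```

Calling `redact_message(turn)` before handing the turn to the adapter keeps the raw value out of the store, at the cost of the agent no longer being able to recall it.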
Benchmark Results
The framework ships with 30 scenarios covering multi-turn recall, privacy, and consistency. Here’s what came out of running them against four backends:
| Backend | Recall | Relevance | Consistency | Privacy | Latency (p95) |
|---|---|---|---|---|---|
| Mem0 | 87% | 92% | 78% | Fail | 120ms |
| Zep | 91% | 88% | 95% | Pass | 95ms |
| Letta | 84% | 85% | 89% | Pass | 140ms |
| Custom | 79% | 90% | 72% | Fail | 110ms |
Zep scored highest on consistency because it persists to Postgres and maintains session state. Mem0 failed privacy tests because it stores raw message text alongside its vector embeddings, so the original text comes back in retrieval results. The custom implementation had the lowest consistency because it uses in-memory state that doesn’t survive process restarts.
Latency numbers are for 100-message memory stores with 5-result retrieval. All backends degrade past 1,000 messages except Zep, which uses indexed search.
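For readers who want to reproduce a p95 figure, this is one way to compute it. The sketch assumes the nearest-rank percentile definition and a hypothetical `search` callable; the source does not say which percentile method MemEval uses.

```python
import math
import time


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct between 0 and 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(search, queries: list[str]) -> list[float]:
    """Time each call to `search` with a monotonic clock, in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies
```

Feeding the output of `measure_latencies_ms` into `percentile(..., 95)` gives a tail-latency number that is less sensitive to a few fast calls than a plain average.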
CI Integration
The framework outputs JSON reports that CI systems can parse:
{
  "backend": "mem0",
  "timestamp": "2026-06-01T08:12:19Z",
  "scenarios_run": 30,
  "scenarios_passed": 24,
  "metrics": {
    "recall_accuracy": 0.87,
    "privacy_leakage": true,
    "avg_latency_ms": 120
  },
  "failures": [
    {
      "scenario": "SSN Privacy Leak",
      "assertion": "should_not_contain",
      "actual": "Retrieved SSN in response"
    }
  ]
}
GitHub Actions can fail the build if privacy tests fail or if recall drops below a threshold:
- name: Run MemEval
  run: memeval run --backend mem0 --threshold 0.85
- name: Check Privacy
  run: |
    if jq -e '.metrics.privacy_leakage == true' results.json; then
      echo "Privacy leak detected"
      exit 1
    fi
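If you would rather gate in Python than in shell, the same checks can be written against the report shown earlier. This is a sketch: the `gate` and `main` helpers are hypothetical, and they assume only the `metrics.recall_accuracy` and `metrics.privacy_leakage` fields from that sample JSON.

```python
import json


def gate(report: dict, min_recall: float = 0.85) -> list[str]:
    """Return human-readable gate violations; an empty list means pass."""
    problems = []
    metrics = report.get("metrics", {})
    if metrics.get("privacy_leakage") is True:
        problems.append("privacy leak detected")
    recall = metrics.get("recall_accuracy")
    if recall is None or recall < min_recall:
        problems.append(f"recall {recall} below threshold {min_recall}")
    return problems


def main(path: str = "results.json") -> int:
    """Load a report file and return a process exit code (0 pass, 1 fail)."""
    with open(path, encoding="utf-8") as fh:
        violations = gate(json.load(fh))
    for line in violations:
        print(line)
    return 1 if violations else 0
```

A CI step could call `sys.exit(main())` from a small wrapper script, which fails the build whenever either check trips.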
Failure Modes
Context window overflow: If your backend uses LLM summarization and the context window fills up, old facts get dropped. MemEval detects this by querying for facts from early turns.
Embedding drift: If you change the embedding model, old vectors become incompatible. Consistency tests catch this when retrieval results change without new data.
Race conditions: Some backends update memory asynchronously. If you query immediately after adding a message, the fact might not appear. MemEval adds a configurable delay between turns.
Storage backend failures: If Postgres goes down, Zep loses state. If Redis evicts keys, session data disappears. The framework can’t distinguish between memory logic bugs and infrastructure failures.
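For the race-condition case above, a fixed delay is the approach the framework takes. A different technique, sketched here as an assumption rather than as MemEval behavior, is to poll until the fact shows up or a timeout expires, which avoids waiting longer than necessary. `wait_for_memory` and its parameters are hypothetical.

```python
import time


def wait_for_memory(
    search,
    query: str,
    expected: str,
    timeout_s: float = 2.0,
    interval_s: float = 0.05,
) -> bool:
    """Poll `search(query)` until `expected` appears or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if any(expected in result for result in search(query)):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Here `search` is any callable that takes a query string and returns a list of strings, so it can wrap an adapter's retrieval method for a specific user.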
Technical Verdict
Use MemEval when:
- You’re evaluating memory backends and need objective comparison data
- You’re building a custom memory layer and want regression tests
- You need to prove compliance with privacy requirements
- You want CI gates that catch memory degradation before production
Avoid it when:
- Your agent doesn’t use persistent memory (stateless RAG is fine)
- You’re testing prompt quality, not memory retrieval (use LangSmith instead)
- Your backend doesn’t expose APIs for clearing state (you can’t run isolated tests)
- You need real-time monitoring (this is a test suite, not observability)
The framework is strongest for pre-deployment validation. It won’t catch production issues like memory corruption under load or distributed state inconsistencies. For that, you still need logging, tracing, and alerts.