mech.app
Dev Tools

MemPalace's Pluggable Backend Architecture: Swapping Vector Stores Without Touching Application Code

How MemPalace's base interface lets you swap ChromaDB, Qdrant, or Pinecone while preserving structured palace retrieval and local-first deployment.

Source: github.com

MemPalace hit #7 on GitHub Trending Python with a 96.6% R@5 score on LongMemEval and zero API calls. The headline is local-first AI memory, but the engineering story is the pluggable backend interface that lets you swap ChromaDB for Qdrant, Pinecone, or Weaviate without rewriting retrieval logic.

Most conversational memory systems hard-code a vector store. MemPalace defines a clean contract in mempalace/backends/base.py and ships ChromaDB as the default. When your conversation history outgrows local disk or you need multi-region replication, you drop in a new backend class and change one config line. The structured palace model (wings, rooms, drawers) maps to namespaces or metadata filters in every backend, so your scoped searches stay consistent.

Why the abstraction matters

AI agents accumulate conversation turns fast. A single Claude Code session can generate thousands of messages. Local ChromaDB works until you hit 100K turns or need cross-machine sync. Managed Pinecone scales to millions but costs money and adds latency. Qdrant sits in between with self-hosted clustering.

Hard-coding a vector store means:

  • Migration requires rewriting every search() and insert() call.
  • Testing against multiple backends needs parallel codebases.
  • Switching from local dev to production means touching application logic.

MemPalace’s backend interface solves this by defining a stable contract. The application layer calls backend.search(query, filters) and gets results. Whether that query hits ChromaDB on localhost or Pinecone in us-east-1 is a deployment detail.
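One way to picture the "change one config line" idea is a small registry keyed by the config value. This is a minimal sketch, not MemPalace's actual loader: the registry, the stub classes, create_backend, and the "options" key are all hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a config value to a backend class.
# MemPalace's real loader may work differently.
_BACKENDS: Dict[str, Callable[..., object]] = {}


def register_backend(name: str):
    """Class decorator that records a backend under a config name."""
    def decorator(cls):
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("chromadb")
class LocalStub:
    """Stand-in for a local backend; returns canned results."""
    def __init__(self, path: str = "./palace") -> None:
        self.path = path

    def search(self, query: str, filters: Dict, top_k: int) -> List[str]:
        return [f"local:{query}"][:top_k]


@register_backend("pinecone")
class RemoteStub:
    """Stand-in for a managed backend in some region."""
    def __init__(self, region: str = "us-east-1") -> None:
        self.region = region

    def search(self, query: str, filters: Dict, top_k: int) -> List[str]:
        return [f"{self.region}:{query}"][:top_k]


def create_backend(config: Dict):
    """Build the backend named in the config; fail loudly on typos."""
    name = config["backend"]
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend {name!r}; known: {sorted(_BACKENDS)}")
    return _BACKENDS[name](**config.get("options", {}))


def recall(backend, question: str) -> List[str]:
    """Application code: identical for every backend."""
    return backend.search(question, {"wing": "project_alpha"}, top_k=5)
```

The recall function never names a vector store, so moving from the local stub to the remote one only changes the dictionary passed to create_backend.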

The base interface contract

The mempalace/backends/base.py file defines an abstract class with five methods:

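from abc import ABC, abstractmethod
from typing import Dict, List

# Document and Result are MemPalace types not shown in this excerpt.
class MemoryBackend(ABC):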
    @abstractmethod
    def insert(self, documents: List[Document], metadata: Dict) -> None:
        pass

    @abstractmethod
    def search(self, query: str, filters: Dict, top_k: int) -> List[Result]:
        pass

    @abstractmethod
    def delete(self, filters: Dict) -> int:
        pass

    @abstractmethod
    def update_metadata(self, doc_id: str, metadata: Dict) -> None:
        pass

    @abstractmethod
    def get_stats(self) -> Dict:
        pass

Every backend must implement these five methods. The Document type wraps raw text plus embeddings. The filters dict encodes the palace structure: {"wing": "project_alpha", "room": "authentication"}.

ChromaDB’s implementation lives in mempalace/backends/chromadb.py. It translates filters into ChromaDB’s where clauses and handles embedding generation. Qdrant’s implementation would map the same filters to payload conditions. Pinecone would use metadata filters. The application layer never sees these differences.
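To make the contract concrete, here is a toy in-memory backend that satisfies the same five methods. It is a sketch under stated assumptions: the Document and Result fields are guesses at a plausible shape, word overlap stands in for real vector similarity, and the class is useful mainly as a test double, not as a description of any shipped backend.

```python
import itertools
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Document:  # assumed shape: raw text plus an optional embedding
    text: str
    embedding: Optional[List[float]] = None


@dataclass
class Result:  # assumed shape, for illustration only
    doc_id: str
    text: str
    score: float
    metadata: Dict


class MemoryBackend(ABC):
    @abstractmethod
    def insert(self, documents: List[Document], metadata: Dict) -> None: ...
    @abstractmethod
    def search(self, query: str, filters: Dict, top_k: int) -> List[Result]: ...
    @abstractmethod
    def delete(self, filters: Dict) -> int: ...
    @abstractmethod
    def update_metadata(self, doc_id: str, metadata: Dict) -> None: ...
    @abstractmethod
    def get_stats(self) -> Dict: ...


class InMemoryBackend(MemoryBackend):
    """Toy backend: exact-match metadata filters, word-overlap scoring."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}
        self._ids = itertools.count(1)

    @staticmethod
    def _matches(meta: Dict, filters: Dict) -> bool:
        return all(meta.get(key) == value for key, value in filters.items())

    def insert(self, documents: List[Document], metadata: Dict) -> None:
        for doc in documents:
            doc_id = f"doc-{next(self._ids)}"
            self._rows[doc_id] = {"text": doc.text, "metadata": dict(metadata)}

    def search(self, query: str, filters: Dict, top_k: int) -> List[Result]:
        query_words = set(query.lower().split())
        hits = []
        for doc_id, row in self._rows.items():
            if not self._matches(row["metadata"], filters):
                continue
            words = set(row["text"].lower().split())
            union = query_words | words
            # Jaccard overlap as a stand-in for embedding similarity.
            score = len(query_words & words) / len(union) if union else 0.0
            hits.append(Result(doc_id, row["text"], score, row["metadata"]))
        hits.sort(key=lambda hit: hit.score, reverse=True)
        return hits[:top_k]

    def delete(self, filters: Dict) -> int:
        doomed = [i for i, row in self._rows.items()
                  if self._matches(row["metadata"], filters)]
        for doc_id in doomed:
            del self._rows[doc_id]
        return len(doomed)

    def update_metadata(self, doc_id: str, metadata: Dict) -> None:
        self._rows[doc_id]["metadata"].update(metadata)

    def get_stats(self) -> Dict:
        return {"document_count": len(self._rows)}
```

Because it subclasses the abstract class, Python refuses to instantiate it if any of the five methods is missing, which is the same guarantee a real backend gets from the contract.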

Palace structure and backend mapping

MemPalace organizes memory into wings (people, projects), rooms (topics), and drawers (content). This structure is metadata, not schema. When you insert a conversation turn:

backend.insert(
    documents=[Document(text="User asked about OAuth flow")],
    metadata={
        "wing": "project_alpha",
        "room": "authentication",
        "drawer": "oauth_questions",
        "timestamp": 1735689600
    }
)

ChromaDB stores this as a document with metadata fields. Qdrant stores it as a point with a payload. Pinecone stores it as a vector with metadata. When you search:

results = backend.search(
    query="How do we handle OAuth tokens?",
    filters={"wing": "project_alpha", "room": "authentication"},
    top_k=5
)

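The backend translates filters into its native query language. ChromaDB combines the two conditions in a single where clause: where={"$and": [{"wing": "project_alpha"}, {"room": "authentication"}]}. Qdrant wraps one FieldCondition per key in a Filter: Filter(must=[FieldCondition(key="wing", match=MatchValue(value="project_alpha")), FieldCondition(key="room", match=MatchValue(value="authentication"))]). The application gets the same List[Result] either way.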
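The translation step can be sketched as three pure functions that turn the palace filters into each store's filter syntax. The function names are hypothetical and this is not MemPalace's code; the output shapes follow ChromaDB's where operators, Qdrant's JSON filter form, and Pinecone's metadata filter operators, expressed as plain dictionaries so the sketch runs without any client library installed.

```python
from typing import Any, Dict, Optional

Filters = Dict[str, Any]


def to_chroma_where(filters: Filters) -> Optional[Dict[str, Any]]:
    """ChromaDB: a single condition stands alone; several go under $and."""
    clauses = [{key: value} for key, value in filters.items()]
    if not clauses:
        return None  # no filter at all
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


def to_qdrant_filter(filters: Filters) -> Dict[str, Any]:
    """Qdrant (JSON form): every key becomes a 'must' match condition."""
    return {
        "must": [
            {"key": key, "match": {"value": value}}
            for key, value in filters.items()
        ]
    }


def to_pinecone_filter(filters: Filters) -> Dict[str, Any]:
    """Pinecone: top-level keys are AND-ed; $eq makes equality explicit."""
    return {key: {"$eq": value} for key, value in filters.items()}


palace_filters = {"wing": "project_alpha", "room": "authentication"}
print(to_chroma_where(palace_filters))
print(to_qdrant_filter(palace_filters))
print(to_pinecone_filter(palace_filters))
```

Keeping the translation in small pure functions like these makes it easy to unit test each backend's mapping without a running vector store.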

Backend comparison table

| Backend | Deployment | Latency (p95) | Cost (1M vectors) | Hybrid Search | Migration Path |
|---|---|---|---|---|---|
| ChromaDB | Local disk | <10ms | $0 | No | Export to Parquet, import to target |
| Qdrant | Self-hosted or cloud | 20-50ms | $50/mo (self-hosted) | Yes (payload index) | Snapshot API, restore to new cluster |
| Pinecone | Managed SaaS | 30-80ms | $70/mo (starter) | Yes (sparse-dense) | Bulk upsert from export |
| Weaviate | Self-hosted or cloud | 25-60ms | $40/mo (self-hosted) | Yes (BM25 + vector) | Backup/restore or batch import |

ChromaDB wins for local-first development. Qdrant wins for self-hosted production with payload filtering. Pinecone wins for zero-ops managed service. Weaviate wins for GraphQL query flexibility.

Migration mechanics

Swapping backends mid-project requires re-indexing. MemPalace does not migrate data automatically, because each vector store has its own index format and the target may be configured with a different embedding model. The migration flow:

  1. Export existing data: mempalace export --backend chromadb --output ./backup.jsonl
  2. Change backend config: backend = "qdrant" in mempalace.toml
  3. Re-index: mempalace import --backend qdrant --input ./backup.jsonl

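The export format is JSONL with text, metadata, and optional pre-computed embeddings. If embeddings are missing, the new backend generates them during import. This lets you switch embedding models (e.g., text-embedding-ada-002 to bge-large-en-v1.5) at the same time, provided the export omits the old embeddings: vectors from different models are not comparable, so every document must be re-embedded with the new model.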
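The round trip through JSONL might look roughly like the sketch below. The field names ("text", "metadata", "embedding") and both helper functions are assumptions for illustration, not the documented export schema; the embed callable stands in for whatever embedding model the target backend uses.

```python
import io
import json
from typing import Callable, Dict, Iterable, Iterator, List, TextIO

Embedder = Callable[[str], List[float]]


def export_jsonl(records: Iterable[Dict], out: TextIO,
                 include_embeddings: bool = True) -> int:
    """Write one JSON object per line; returns the number of rows written."""
    count = 0
    for record in records:
        row = {"text": record["text"], "metadata": record["metadata"]}
        # Drop embeddings when the target will use a different model.
        if include_embeddings and record.get("embedding") is not None:
            row["embedding"] = record["embedding"]
        out.write(json.dumps(row) + "\n")
        count += 1
    return count


def import_jsonl(src: TextIO, embed: Embedder) -> Iterator[Dict]:
    """Yield rows ready for insertion, embedding any row that lacks a vector."""
    for line in src:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("embedding") is None:
            row["embedding"] = embed(row["text"])
        yield row


# Toy embedder for the demo: a real one would call an embedding model.
def toy_embed(text: str) -> List[float]:
    return [float(len(text))]


buffer = io.StringIO()
export_jsonl(
    [{"text": "User asked about OAuth flow",
      "metadata": {"wing": "project_alpha", "room": "authentication"},
      "embedding": [0.1, 0.2]}],
    buffer,
    include_embeddings=False,  # force re-embedding on import
)
buffer.seek(0)
rows = list(import_jsonl(buffer, toy_embed))
```

Exporting without embeddings is the switch that forces re-embedding, which is what a model change requires.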

Re-indexing 100K documents takes 5-10 minutes on ChromaDB, 10-20 minutes on Qdrant, and 15-30 minutes on Pinecone (network I/O bound). The palace structure (wings, rooms, drawers) transfers as metadata, so scoped searches work immediately.
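Those timings translate into rough throughput figures that help when sizing a migration window. The sketch below only does arithmetic on the numbers quoted above; the linear extrapolation to other corpus sizes is an assumption, and real imports may slow down or speed up with scale.

```python
from typing import Dict, Tuple

# Minutes quoted in the text for re-indexing 100K documents: (best, worst).
QUOTED_MINUTES: Dict[str, Tuple[int, int]] = {
    "chromadb": (5, 10),
    "qdrant": (10, 20),
    "pinecone": (15, 30),
}
QUOTED_DOCS = 100_000


def docs_per_second(backend: str) -> Tuple[float, float]:
    """Implied throughput range as (slowest, fastest) documents per second."""
    best, worst = QUOTED_MINUTES[backend]
    return QUOTED_DOCS / (worst * 60), QUOTED_DOCS / (best * 60)


def estimate_minutes(backend: str, doc_count: int) -> Tuple[float, float]:
    """Scale the quoted range linearly (an assumption) to another corpus size."""
    best, worst = QUOTED_MINUTES[backend]
    scale = doc_count / QUOTED_DOCS
    return best * scale, worst * scale


for name in QUOTED_MINUTES:
    slow, fast = docs_per_second(name)
    print(f"{name}: {slow:.0f}-{fast:.0f} docs/s")
```

For ChromaDB the quoted range works out to roughly 167 to 333 documents per second.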

Performance trade-offs

Local ChromaDB handles 10K conversation turns with sub-10ms p95 latency. At 100K turns, latency climbs to 50ms and disk usage hits 2-3 GB. At 1M turns, you need a server-based backend, either self-hosted or managed.

Qdrant self-hosted on a 4-core VM handles 1M vectors with 20-30ms p95 latency. Payload indexing (MemPalace’s filters) adds 5-10ms but enables fast scoped searches. Clustering adds complexity but scales to 10M+ vectors.

Pinecone’s managed service adds network latency (30-80ms p95) but removes operational load. Sparse-dense hybrid search improves recall for keyword-heavy queries (e.g., “OAuth error 401”) but costs 2x per query. The starter tier ($70/mo) caps at 1M vectors.

Backend-specific features

MemPalace’s base interface exposes the lowest common denominator. Backends with advanced features require custom code:

  • Qdrant payload indexing: Create indexes on wing, room, drawer fields for faster filtered searches. Add payload_index=["wing", "room"] to the Qdrant backend config.
  • Pinecone sparse-dense hybrid: Enable hybrid search in the Pinecone backend class. Requires generating sparse vectors (BM25) alongside dense embeddings.
  • Weaviate GraphQL: Use Weaviate’s Get queries to traverse relationships between wings and rooms. Requires extending the base interface with a graph_query() method.

These features break portability. If you use Qdrant payload indexing, switching to ChromaDB means losing the performance boost. If you use Pinecone hybrid search, switching to Qdrant means rewriting queries.
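One way to limit the portability cost is to treat advanced features as optional capabilities and fall back to the base contract when they are absent. The sketch below uses a runtime-checkable Protocol; the hybrid_search method, its alpha parameter, and both example classes are hypothetical and not part of MemPalace's interface.

```python
from typing import Dict, List, Protocol, runtime_checkable


@runtime_checkable
class SupportsHybridSearch(Protocol):
    """Optional capability: backends may or may not provide it."""
    def hybrid_search(self, query: str, filters: Dict,
                      top_k: int, alpha: float) -> List[str]: ...


class PortableBackend:
    """Implements only the lowest-common-denominator search."""
    def search(self, query: str, filters: Dict, top_k: int) -> List[str]:
        return [f"dense:{query}"]


class HybridBackend(PortableBackend):
    """Adds a backend-specific extra on top of the base search."""
    def hybrid_search(self, query: str, filters: Dict,
                      top_k: int, alpha: float) -> List[str]:
        return [f"hybrid(alpha={alpha}):{query}"]


def retrieve(backend, query: str, filters: Dict, top_k: int = 5) -> List[str]:
    """Use the advanced feature when present, otherwise the base contract."""
    if isinstance(backend, SupportsHybridSearch):
        return backend.hybrid_search(query, filters, top_k, alpha=0.5)
    return backend.search(query, filters, top_k)
```

With this shape, switching back to a backend without the feature degrades result quality or speed but does not break the calling code.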

Observability and failure modes

MemPalace logs backend operations to ~/.mempalace/logs/backend.log. Each insert(), search(), and delete() call logs latency, result count, and errors. Watch for:

  • Embedding generation timeouts: If the embedding model (local or API) hangs, inserts block. Set a 10-second timeout in the backend config.
  • Connection pool exhaustion: Managed backends (Pinecone, Qdrant Cloud) limit concurrent connections. MemPalace defaults to 10 concurrent inserts; raise it to 50 for bulk imports only if the backend’s connection limit allows it.
  • Metadata size limits: Pinecone caps metadata at 40 KB per vector. If a conversation turn stored in metadata exceeds this, the insert fails. Split large turns into chunks.
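Two of these guards are easy to sketch in plain Python: a timeout around the embedding call and a splitter that keeps each chunk under a byte budget. Both helpers are hypothetical illustrations rather than MemPalace settings, and the byte budget you pass should leave headroom below the 40 KB cap for the other metadata fields.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, List


def embed_with_timeout(embed: Callable[[str], List[float]], text: str,
                       timeout_s: float = 10.0) -> List[float]:
    """Run embed(text) in a worker thread and give up after timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(embed, text).result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"embedding exceeded {timeout_s}s") from None
    finally:
        # Do not wait for a hung worker; a thread cannot be killed,
        # so it keeps running in the background until it returns.
        pool.shutdown(wait=False)


def split_by_bytes(text: str, max_bytes: int) -> List[str]:
    """Split on whitespace so each chunk's UTF-8 size stays within max_bytes.

    A single word larger than max_bytes is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if not current or len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Measuring the encoded size rather than the character count matters because non-ASCII text takes more than one byte per character.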

The get_stats() method returns backend health: document count, index size, and error rate. Poll it every 60 seconds to catch degradation before users notice.
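A polling loop around get_stats() could look like the sketch below. The dictionary keys ("document_count", "error_rate"), the 1% threshold, and both function names are assumptions for illustration; the real return shape may differ.

```python
import time
from typing import Callable, Dict, List, Optional


def check_stats(stats: Dict, previous: Optional[Dict] = None,
                max_error_rate: float = 0.01) -> List[str]:
    """Compare one stats snapshot with the previous one and list warnings."""
    warnings: List[str] = []
    error_rate = stats.get("error_rate", 0.0)
    if error_rate > max_error_rate:
        warnings.append(
            f"error rate {error_rate:.2%} above {max_error_rate:.2%}")
    # Crude signal: intentional deletes also lower the count.
    if previous is not None and \
            stats.get("document_count", 0) < previous.get("document_count", 0):
        warnings.append("document count dropped since last poll")
    return warnings


def poll(get_stats: Callable[[], Dict], rounds: int,
         interval_s: float = 60.0,
         sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call get_stats() `rounds` times, interval_s apart, collecting warnings."""
    previous: Optional[Dict] = None
    collected: List[str] = []
    for i in range(rounds):
        stats = get_stats()
        collected.extend(check_stats(stats, previous))
        previous = stats
        if i < rounds - 1:
            sleep(interval_s)
    return collected
```

Passing the sleep function as a parameter keeps the loop testable without waiting a real minute between samples.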

Deployment shape

Local development:

  • ChromaDB on disk
  • Embeddings from sentence-transformers/all-MiniLM-L6-v2 (local model)
  • No network calls

Production (single-region):

  • Qdrant self-hosted on a 4-core VM
  • Embeddings from OpenAI text-embedding-3-small (API)
  • Backup snapshots to S3 every 6 hours

Production (multi-region):

  • Pinecone managed service with replicas in us-east-1 and eu-west-1
  • Embeddings from OpenAI text-embedding-3-small (API)
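  • No separate backup job (Pinecone handles replication); keep periodic exports if you need point-in-time recovery, since replication alone does not protect against accidental deletes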

Technical verdict

Use MemPalace’s pluggable backends when:

  • You need local-first development with a path to managed production.
  • Your conversation history will outgrow a single machine.
  • You want to test multiple vector stores without rewriting application code.
  • You need structured scoping (wings, rooms, drawers) that maps cleanly to metadata filters.
  • Retrieval quality matters: the 96.6% R@5 benchmark on LongMemEval suggests the abstraction layer doesn’t sacrifice accuracy.

Avoid it when:

  • You already have a vector store deeply integrated into your stack.
  • You need backend-specific features (Qdrant payload indexing, Pinecone hybrid search) and can’t tolerate the portability cost.
  • Your memory needs are simple enough for a flat corpus (no scoping).
  • You’re building a throwaway prototype and won’t need to scale or migrate.
  • Re-indexing overhead (5-30 minutes depending on backend and corpus size) conflicts with strict uptime requirements or frequent migration needs.

The base interface is 200 lines of Python. The ChromaDB implementation is 400 lines. Adding a new backend takes a day. The abstraction cost is low, and the deployment flexibility is high.