mech.app
AI Agents

Version Control for AI Agents: Why Git Doesn't Work for Prompt Diffs, Tool Changes, and Multi-Agent State

Git tracks text files, not agent configurations. Explore what version control primitives agents need: prompt diffs, tool rollback, and execution replay.

Source: cognatoai.com
Version Control for AI Agents: Why Git Doesn't Work for Prompt Diffs, Tool Changes, and Multi-Agent State

Git was built for merge conflicts in text files. It has no opinion about prompt templates, tool schemas, or the execution traces that agents produce. When you version an agent, you’re not just tracking code. You’re tracking configuration (system prompts, temperature, model ID), tool bindings (function schemas, API keys, retry policies), and runtime state (conversation history, tool call results, branching decisions).

Cognato AI is building a version control system specifically for agents. The pitch: treat agent memory like a Git repository. Push, pull, clone, and branch agent sessions. Swap models mid-task. Audit every action with cryptographic proof.

This is not a Git wrapper. It’s a protocol for serializing, versioning, and verifying agent memory across models and platforms.

What Does a Commit Mean for an Agent?

In Git, a commit is a snapshot of files. In agent version control, a commit is a snapshot of:

  • Configuration: System prompt, model ID, temperature, max tokens, tool registry.
  • Execution state: Conversation history, tool call results, intermediate outputs.
  • Metadata: Timestamp, user ID, cost, latency, model version.

A single agent run might produce dozens of commits. Each tool call is a state transition. Each model response is a new snapshot. The commit graph is not a linear history of code changes. It’s a tree of execution paths.

Cognato’s Mach protocol wraps agents in a recording environment. Every step is hashed and linked, creating an immutable ledger. The result is a Git-like commit history for agent behavior, not just code.

Why Git Doesn’t Work

Git assumes text files and line-based diffs. Agents produce structured data: JSON tool calls, nested conversation turns, embeddings, and non-deterministic outputs. Here’s what breaks:

Git PrimitiveAgent RealityWhy It Fails
Line-based diffPrompt templates with variable substitutionChanging a variable name looks like a full rewrite
Merge conflictTwo versions of a tool schemaNo semantic merge: you can’t combine function signatures
BranchA/B testing agent behaviorBranches diverge on runtime behavior, not code
RollbackReverting a bad promptYou need to replay execution, not just restore text
BlameWho changed this line?Agents generate outputs; no human author to blame

Git tracks intent through commit messages. Agents track intent through execution traces. You need to version the trace, not just the config.

Architecture: Mach Protocol

Mach is a protocol, not a hosted service. It defines how to serialize agent memory, hash execution steps, and push/pull sessions to a remote registry.

Core Components

  1. Wrapper: Agents run inside a Mach-wrapped environment. The wrapper intercepts tool calls, model responses, and state transitions. It serializes each step as a commit.

  2. Registry: A remote store for agent sessions. Think Docker Hub for agent memory. You push sessions, pull them, and branch from specific commits.

  3. Cryptographic Ledger: Every commit is hashed and linked to its parent. The chain is immutable. You can verify that an agent’s execution matches its claimed history.

  4. Model Swapping: Pull a session, change the model ID, and resume. The conversation history and tool bindings stay intact. The new model picks up where the old one left off.

Example Workflow

# Clone an existing agent session
$ mach clone origin/session_29a
Cloning into 'working-dir'...
remote: Enumerating agent states: 45, done.
Receiving state: 100% (45/45), done.
 Switched to branch 'gemini-handoff'

# Inspect the commit history
$ mach log
commit a3f8b2c (HEAD -> gemini-handoff)
Author: claude-3.5-sonnet
Date:   2026-06-07 10:41:00
Message: Generated UI mockup, hit rate limit

commit 7d4e1a9
Author: claude-3.5-sonnet
Date:   2026-06-07 10:38:15
Message: Analyzed requirements, called search_docs tool

# Swap to a different model
$ mach checkout -b feature/auth-gen
$ mach config set model gemini-1.5-pro
$ mach resume
Resuming agent task from commit a3f8b2c...
Model: gemini-1.5-pro

The session state includes conversation history, tool call results, and intermediate outputs. The new model sees the same context window as the old one.

Diff Semantics for Prompts

Git diffs show line changes. Agent diffs need to show semantic changes: variable substitutions, tool schema updates, and behavior shifts.

Mach doesn’t expose a diff command yet, but the protocol supports it. A semantic diff for prompts would compare:

  • Variable bindings: Did the user ID change? Did the context window shrink?
  • Tool availability: Was a tool added, removed, or updated?
  • Model config: Did temperature or max tokens change?

This is closer to infrastructure-as-code diffing (Terraform, Pulumi) than Git. You’re comparing configuration, not text.

Rollback and Replay

Rollback in Git restores files. Rollback in agent version control restores execution state and replays from a checkpoint.

If an agent hallucinates and deletes critical code, you:

  1. Checkout the session state before the error.
  2. Resume execution with a different prompt or model.
  3. Verify the new output before committing.

This requires deterministic replay. The agent must produce the same tool calls and outputs given the same inputs. LLMs are non-deterministic, so Mach relies on:

  • Fixed seeds: Set temperature to 0 and seed the random number generator.
  • Cached responses: Store model outputs in the commit. Replay uses cached responses instead of calling the model again.

Replay is not perfect. If you change the model or prompt, behavior diverges. But you can replay the tool calls and inspect the state at each step.

Branching for A/B Testing

Git branches let you work on features in parallel. Agent branches let you test different behaviors in parallel.

Use case: You have a research agent that generates a session tree. You want to test two different summarization strategies. You:

  1. Branch from the research commit.
  2. Run one branch with Claude, one with Gemini.
  3. Compare the outputs and merge the winner.

Merging is not automatic. You can’t merge two prompt templates or tool sets without runtime testing. But you can compare execution traces and pick the branch that performed better.

Multi-Agent Orchestration

Mach supports multi-agent workflows. A specialized research agent generates a session tree. Independent coder agents branch from specific commits and build implementations in parallel.

Each agent runs in its own wrapper. Each agent pushes its session to the registry. The orchestrator pulls sessions, inspects commit histories, and decides which branches to merge.

This is closer to CI/CD than Git. You’re orchestrating execution, not just tracking code.

Observability and Audit

Every agent step is hashed and linked. The chain is immutable. You can verify that an agent’s execution matches its claimed history.

This is critical for enterprise deployments. If an agent makes a mistake, you need to audit its reasoning. Mach provides:

  • Execution trace: Every tool call, model response, and state transition.
  • Cryptographic proof: Hash chains verify that the trace is unmodified.
  • Blame: Track which model and prompt produced each output.

This overlaps with observability tools (LangSmith, Helicone, Agent Trace RFC), but Mach focuses on versioning and audit, not just logging.

Deployment Shape

Mach is a protocol, not a hosted service. You can run it locally or deploy a registry to your own infrastructure. The architecture is:

  • Client: CLI tool that wraps agents and serializes state.
  • Registry: Remote store for sessions (S3, GCS, or a custom backend).
  • Verifier: Optional service that validates hash chains and enforces access control.

The client is open source. The registry is bring-your-own-storage. The verifier is optional (useful for enterprises that need audit trails).

Failure Modes

RiskImpactMitigation
Non-deterministic replayRollback produces different outputsUse temperature=0, cache responses, or accept divergence
Large session sizePushing/pulling is slowCompress conversation history, prune old commits
Model API changesTool schemas break across versionsVersion tool schemas separately, test before swapping
Merge conflictsTwo branches modify the same toolNo automatic merge; require manual testing
Registry downtimeCan’t push/pull sessionsCache locally, retry with exponential backoff

The biggest risk is non-determinism. LLMs are probabilistic. Even with temperature=0, outputs can vary across API versions or hardware. Mach mitigates this by caching responses, but replay is not guaranteed.

Technical Verdict

Use Mach when:

  • You’re deploying multiple agent versions in production and need rollback.
  • You’re A/B testing agent behavior and need to compare execution traces.
  • You’re building multi-agent systems and need to hand off context between models.
  • You need audit trails for enterprise compliance.

Avoid Mach when:

  • You’re prototyping a single agent and don’t need version control yet.
  • You’re using a hosted agent platform (OpenAI Assistants, Anthropic Claude) that doesn’t expose execution state.
  • You need real-time collaboration (Mach is async; sessions are pushed/pulled, not live-edited).
  • You’re building deterministic workflows (traditional CI/CD is simpler).

Mach is early. The protocol is defined, but tooling is minimal. The open-source CLI is available, but the registry and verifier are not yet public. This is infrastructure for the agentic era, but it’s still being built.