Agentic Proving for Program Verification: Claude Code Meets Lean 4

Agentic theorem proving has delivered state-of-the-art results in formal mathematics. This paper tests whether those capabilities transfer to program verification, where proofs must reason about imperative code, memory, and side effects instead of abstract mathematical objects.

A new ArXiv paper evaluates Claude Code in an agentic proving framework against CLEVER, a Lean 4 benchmark for verifiable code generation. The results expose both the power and the limits of compiler-in-the-loop agentic systems when they move from proving theorems to certifying programs.

What Makes Program Verification Different

Formal mathematics and program verification share the same proof assistant infrastructure (Lean 4, Coq, or Isabelle). The proof strategies diverge:

Mathematical proofs work with abstract definitions, induction, and algebraic rewriting.
Program verification requires reasoning about loops, mutable state, preconditions, postconditions, and invariants.

An agent that can prove the Pythagorean theorem may fail to prove that a sorting function preserves list length. The proof search space is different, and the tactics that work in one domain do not always transfer.

The CLEVER Benchmark

CLEVER is a Lean 4 dataset designed to test verifiable code generation. Each problem includes:

A natural language specification
A reference implementation
A formal specification in Lean 4
A correctness proof

The benchmark tests three capabilities:

Specification generation: Can the agent write a formal spec from natural language?
Implementation certification: Can it prove an implementation satisfies a given spec?
End-to-end pipeline: Can it generate both code and proof from scratch?

CLEVER uses isomorphism-based scoring to judge whether generated specifications are equivalent to ground truth. This matters because multiple valid specs can describe the same behavior, and exact string matching would reject correct answers.

Agentic Proving Architecture

The paper evaluates Claude Code in an agentic framework. The core loop operates as follows:

The agent generates tactics, submits them to the Lean kernel, receives feedback, and iterates. The kernel is the source of truth: it accepts or rejects each proof step. The agent maintains no separate state machine. It relies on the kernel to serialize and deserialize the proof context with each round trip.

Key orchestration decisions:

Backtracking: When does the agent abandon a proof branch and try a different tactic?
Context management: How does it keep track of hypotheses, goals, and intermediate lemmas?
Feedback interpretation: How does it parse Lean error messages and decide what to fix?

Results and Failure Modes

Claude Code achieved:

98.8% valid specifications (pass rate on specification generation from natural language)
87.5% implementation certification (proving code matches spec)
98.1% end-to-end success on problems with self-consistent premises

These numbers are high. The paper flags two important caveats:

Isomorphism scoring is fragile: Some generated specs are correct but rejected because they are not isomorphic to the reference.
Dataset bugs: The agent identified lingering bugs in CLEVER itself, including inconsistent premises and incorrect reference proofs.

The failure modes cluster around:

Loop invariants: The agent struggles to synthesize invariants for while loops.
Recursive functions: Termination proofs require explicit measures, and the agent sometimes guesses wrong.
Specification ambiguity: When the natural language is underspecified, the agent generates a valid but different spec.

Proof State Orchestration

Lean 4 proofs are stateful. Each tactic transforms the proof state: goals, hypotheses, and local context. An agentic prover must manage this state across multiple reasoning steps.

The paper reports that Claude provides “high-quality feedback on its own attempts,” identifying underlying causes of failure. This feedback mechanism reveals how the agent handles backtracking and state synchronization.

Backtracking logic: The agent uses compiler feedback as the primary signal. When a tactic produces a type error or unification failure, the agent abandons that branch and tries an alternative. The tight compiler loop prevents the agent from wandering too far down invalid paths.

State synchronization: The agent must track:

The current goal stack (what remains to be proved)
Available hypotheses and local definitions
Which tactics have been attempted on each goal
The proof context (imported libraries, notation, type class instances)

Each round trip to the Lean kernel includes the full proof state. The agent does not maintain a separate state machine. It relies on the kernel to serialize and deserialize the proof context.

Goal State	Tactic Applied	Kernel Result	Agent Decision
`⊢ n + 0 = n`	`rfl`	Type mismatch	Backtrack, try `induction n`
`⊢ n + 0 = n`	`induction n`	Two subgoals created	Continue with base case
`⊢ 0 + 0 = 0`	`rfl`	Accepted	Proceed to inductive case
`⊢ succ n + 0 = succ n`	`simp [ih]`	Accepted	Proof complete

Partial progress evaluation: The 98.1% success rate on self-consistent premises indicates the agent can identify which subgoals were solved and which remain open. The agent’s ability to provide feedback on its own failures suggests it tracks intermediate progress and can report which branches succeeded before hitting a dead end.

Observability and Debugging

Formal verification is opaque. When a proof fails, the error message might say “type mismatch” or “failed to unify.” It does not explain why the tactic was wrong or what to try next.

The paper’s finding that Claude identifies “lingering bugs in the dataset” demonstrates the agent reasons about why tactics fail, not just that they fail. This capability requires structured interpretation of kernel responses and comparison against expected proof patterns.

For production deployment, you need:

Structured logging of tactic attempts and kernel responses
Proof replay to reproduce failures
Intermediate checkpoints so partial progress is not lost
A trace of every tactic and kernel response
A diff of the proof state before and after each step
A summary of which branches were explored and why they were abandoned

Logging Strategy	Overhead	Debuggability	Scalability
Full state snapshots	High (10-50KB per step)	Excellent (complete replay)	Poor (storage grows linearly)
Tactic-only logs	Low (100-500 bytes per step)	Moderate (missing context)	Good (compact storage)
Differential state	Medium (1-5KB per step)	Good (reconstructible)	Good (compressed deltas)
Checkpoint + deltas	Low (periodic snapshots)	Good (bounded replay)	Excellent (constant overhead)

When Isomorphism Scoring Breaks

The paper reports 98.8% valid specification generation but only 81.3% acceptance under CLEVER’s isomorphism-based scoring. This gap reveals cases where isomorphism fails:

The agent uses a different but equivalent formulation (e.g., ∀ x, P x vs. ∀ x, Q x where P and Q are logically equivalent)
The agent adds extra hypotheses that are implied by the context
The agent reorders conjuncts or uses a different variable naming scheme

The abstract states that these findings “point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications.” Options include:

Semantic equivalence checking: Prove the two specs are equivalent
Test-based validation: Run both specs on the same inputs
Human review: Manual inspection of edge cases

None of these is perfect. Semantic equivalence is undecidable in general. Test-based validation misses corner cases. Human review does not scale.

Infrastructure Requirements

Based on the paper’s findings, a production agentic prover for program verification requires several key components:

Component	Purpose	Constraint	Example
Agent pool	Concurrent proof processing	One Lean kernel per agent	Claude API with rate limits
Proof store	Versioned audit trail	Immutable, queryable	Git-backed proof repository
Observability	Track success, latency, failures	Real-time metrics	Prometheus + Grafana
Sandboxing	Isolate untrusted tactics	Resource limits, timeouts	Docker containers with cgroups

Security boundaries: Lean 4 is a trusted kernel, but the agentic prover is not. The agent generates arbitrary code and tactics. You need:

Resource limits (timeout, memory cap, recursion depth)
Sandboxed execution (containerized Lean kernel)
Audit logs (record every tactic and kernel response)

Parallelization is possible at the problem level (multiple proofs in parallel) but not within a single proof search. The tight compiler loop creates a serial bottleneck. Each tactic must be validated before the next can be generated.

Technical Verdict

Scenario	Recommendation	Rationale
Formal specification exists	Use agentic proving	87.5% certification rate on implementation correctness
Real-time verification needed	Avoid	Multiple compiler round trips add latency (seconds to minutes per proof)
Ambiguous specifications	Avoid	Agent will generate valid but unintended specs (98.8% valid vs. 81.3% accepted)
Sandboxing unavailable	Avoid	Agent generates arbitrary tactics; kernel must be isolated
Isomorphism-based scoring	Reconsider benchmark	17.5% of valid specs rejected due to scoring fragility
Self-consistent premises	Use agentic proving	98.1% end-to-end success rate when dataset is clean

The paper demonstrates that Claude Code achieves near-complete success on program verification when the benchmark has self-consistent premises (98.1% end-to-end). The 87.5% implementation certification rate shows the agent can prove code correctness against formal specs. The 17.5-point gap between valid specification generation (98.8%) and isomorphism acceptance (81.3%) exposes a measurement problem, not a capability gap.

The infrastructure is not trivial: you need state management, backtracking, observability, and sandboxing. The payoff is machine-checkable proofs that a human would take hours to write. The authors provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.