Agentic theorem proving has delivered state-of-the-art results in formal mathematics. This paper tests whether those capabilities transfer to program verification, where proofs must reason about imperative code, memory, and side effects instead of abstract mathematical objects.
A new ArXiv paper evaluates Claude Code in an agentic proving framework against CLEVER, a Lean 4 benchmark for verifiable code generation. The results expose both the power and the limits of compiler-in-the-loop agentic systems when they move from proving theorems to certifying programs.
What Makes Program Verification Different
Formal mathematics and program verification share the same proof assistant infrastructure (Lean 4, Coq, or Isabelle). The proof strategies diverge:
- Mathematical proofs work with abstract definitions, induction, and algebraic rewriting.
- Program verification requires reasoning about loops, mutable state, preconditions, postconditions, and invariants.
An agent that can prove the Pythagorean theorem may fail to prove that a sorting function preserves list length. The proof search space is different, and the tactics that work in one domain do not always transfer.
The CLEVER Benchmark
CLEVER is a Lean 4 dataset designed to test verifiable code generation. Each problem includes:
- A natural language specification
- A reference implementation
- A formal specification in Lean 4
- A correctness proof
The benchmark tests three capabilities:
- Specification generation: Can the agent write a formal spec from natural language?
- Implementation certification: Can it prove an implementation satisfies a given spec?
- End-to-end pipeline: Can it generate both code and proof from scratch?
CLEVER uses isomorphism-based scoring to judge whether generated specifications are equivalent to ground truth. This matters because multiple valid specs can describe the same behavior, and exact string matching would reject correct answers.
Agentic Proving Architecture
The paper evaluates Claude Code in an agentic framework. The core loop operates as follows:
The agent generates tactics, submits them to the Lean kernel, receives feedback, and iterates. The kernel is the source of truth: it accepts or rejects each proof step. The agent maintains no separate state machine. It relies on the kernel to serialize and deserialize the proof context with each round trip.
Key orchestration decisions:
- Backtracking: When does the agent abandon a proof branch and try a different tactic?
- Context management: How does it keep track of hypotheses, goals, and intermediate lemmas?
- Feedback interpretation: How does it parse Lean error messages and decide what to fix?
Results and Failure Modes
Claude Code achieved:
- 98.8% valid specifications (pass rate on specification generation from natural language)
- 87.5% implementation certification (proving code matches spec)
- 98.1% end-to-end success on problems with self-consistent premises
These numbers are high. The paper flags two important caveats:
- Isomorphism scoring is fragile: Some generated specs are correct but rejected because they are not isomorphic to the reference.
- Dataset bugs: The agent identified lingering bugs in CLEVER itself, including inconsistent premises and incorrect reference proofs.
The failure modes cluster around:
- Loop invariants: The agent struggles to synthesize invariants for while loops.
- Recursive functions: Termination proofs require explicit measures, and the agent sometimes guesses wrong.
- Specification ambiguity: When the natural language is underspecified, the agent generates a valid but different spec.
Proof State Orchestration
Lean 4 proofs are stateful. Each tactic transforms the proof state: goals, hypotheses, and local context. An agentic prover must manage this state across multiple reasoning steps.
The paper reports that Claude provides “high-quality feedback on its own attempts,” identifying underlying causes of failure. This feedback mechanism reveals how the agent handles backtracking and state synchronization.
Backtracking logic: The agent uses compiler feedback as the primary signal. When a tactic produces a type error or unification failure, the agent abandons that branch and tries an alternative. The tight compiler loop prevents the agent from wandering too far down invalid paths.
State synchronization: The agent must track:
- The current goal stack (what remains to be proved)
- Available hypotheses and local definitions
- Which tactics have been attempted on each goal
- The proof context (imported libraries, notation, type class instances)
Each round trip to the Lean kernel includes the full proof state. The agent does not maintain a separate state machine. It relies on the kernel to serialize and deserialize the proof context.
| Goal State | Tactic Applied | Kernel Result | Agent Decision |
|---|---|---|---|
⊢ n + 0 = n | rfl | Type mismatch | Backtrack, try induction n |
⊢ n + 0 = n | induction n | Two subgoals created | Continue with base case |
⊢ 0 + 0 = 0 | rfl | Accepted | Proceed to inductive case |
⊢ succ n + 0 = succ n | simp [ih] | Accepted | Proof complete |
Partial progress evaluation: The 98.1% success rate on self-consistent premises indicates the agent can identify which subgoals were solved and which remain open. The agent’s ability to provide feedback on its own failures suggests it tracks intermediate progress and can report which branches succeeded before hitting a dead end.
Observability and Debugging
Formal verification is opaque. When a proof fails, the error message might say “type mismatch” or “failed to unify.” It does not explain why the tactic was wrong or what to try next.
The paper’s finding that Claude identifies “lingering bugs in the dataset” demonstrates the agent reasons about why tactics fail, not just that they fail. This capability requires structured interpretation of kernel responses and comparison against expected proof patterns.
For production deployment, you need:
- Structured logging of tactic attempts and kernel responses
- Proof replay to reproduce failures
- Intermediate checkpoints so partial progress is not lost
- A trace of every tactic and kernel response
- A diff of the proof state before and after each step
- A summary of which branches were explored and why they were abandoned
| Logging Strategy | Overhead | Debuggability | Scalability |
|---|---|---|---|
| Full state snapshots | High (10-50KB per step) | Excellent (complete replay) | Poor (storage grows linearly) |
| Tactic-only logs | Low (100-500 bytes per step) | Moderate (missing context) | Good (compact storage) |
| Differential state | Medium (1-5KB per step) | Good (reconstructible) | Good (compressed deltas) |
| Checkpoint + deltas | Low (periodic snapshots) | Good (bounded replay) | Excellent (constant overhead) |
When Isomorphism Scoring Breaks
The paper reports 98.8% valid specification generation but only 81.3% acceptance under CLEVER’s isomorphism-based scoring. This gap reveals cases where isomorphism fails:
- The agent uses a different but equivalent formulation (e.g.,
∀ x, P xvs.∀ x, Q xwherePandQare logically equivalent) - The agent adds extra hypotheses that are implied by the context
- The agent reorders conjuncts or uses a different variable naming scheme
The abstract states that these findings “point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications.” Options include:
- Semantic equivalence checking: Prove the two specs are equivalent
- Test-based validation: Run both specs on the same inputs
- Human review: Manual inspection of edge cases
None of these is perfect. Semantic equivalence is undecidable in general. Test-based validation misses corner cases. Human review does not scale.
Infrastructure Requirements
Based on the paper’s findings, a production agentic prover for program verification requires several key components:
| Component | Purpose | Constraint | Example |
|---|---|---|---|
| Agent pool | Concurrent proof processing | One Lean kernel per agent | Claude API with rate limits |
| Proof store | Versioned audit trail | Immutable, queryable | Git-backed proof repository |
| Observability | Track success, latency, failures | Real-time metrics | Prometheus + Grafana |
| Sandboxing | Isolate untrusted tactics | Resource limits, timeouts | Docker containers with cgroups |
Security boundaries: Lean 4 is a trusted kernel, but the agentic prover is not. The agent generates arbitrary code and tactics. You need:
- Resource limits (timeout, memory cap, recursion depth)
- Sandboxed execution (containerized Lean kernel)
- Audit logs (record every tactic and kernel response)
Parallelization is possible at the problem level (multiple proofs in parallel) but not within a single proof search. The tight compiler loop creates a serial bottleneck. Each tactic must be validated before the next can be generated.
Technical Verdict
| Scenario | Recommendation | Rationale |
|---|---|---|
| Formal specification exists | Use agentic proving | 87.5% certification rate on implementation correctness |
| Real-time verification needed | Avoid | Multiple compiler round trips add latency (seconds to minutes per proof) |
| Ambiguous specifications | Avoid | Agent will generate valid but unintended specs (98.8% valid vs. 81.3% accepted) |
| Sandboxing unavailable | Avoid | Agent generates arbitrary tactics; kernel must be isolated |
| Isomorphism-based scoring | Reconsider benchmark | 17.5% of valid specs rejected due to scoring fragility |
| Self-consistent premises | Use agentic proving | 98.1% end-to-end success rate when dataset is clean |
The paper demonstrates that Claude Code achieves near-complete success on program verification when the benchmark has self-consistent premises (98.1% end-to-end). The 87.5% implementation certification rate shows the agent can prove code correctness against formal specs. The 17.5-point gap between valid specification generation (98.8%) and isomorphism acceptance (81.3%) exposes a measurement problem, not a capability gap.
The infrastructure is not trivial: you need state management, backtracking, observability, and sandboxing. The payoff is machine-checkable proofs that a human would take hours to write. The authors provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.