mech.app
AI Agents

Formal Proof Search Agents: How LLMs Generate Lean Proofs to Make Mathematical Reasoning Verifiable

Inside the agent loop that translates LLM reasoning into machine-checked Lean proofs: tactic generation, proof state tracking, and verification boundaries.

Source: arxiv.org
Formal Proof Search Agents: How LLMs Generate Lean Proofs to Make Mathematical Reasoning Verifiable

Large language models can solve math problems, but you cannot trust the output. A proof that looks correct might contain a subtle error three steps in. For mathematics research, “probably right” is not good enough.

The solution is to make the LLM generate formal proofs in a language like Lean, where a kernel verifies every step. A recent large-scale evaluation (arXiv 2605.22763v1) shows agents solving open Erdős problems and OEIS conjectures by looping between LLM tactic generation and Lean verification. The architecture turns unreliable reasoning into machine-checked correctness.

Results from the Paper

The paper presents the first large-scale evaluation of LLM-driven formal proof search on open research problems. The most capable agent autonomously resolved:

  • 9 of 353 open Erdős problems (2.5% success rate)
  • 44 of 492 OEIS conjectures (8.9% success rate)
  • Cost per solved problem: “a few hundred dollars” in compute

The system is being deployed in active research across combinatorics, optimization, graph theory, algebraic geometry, and quantum optics.

The paper compares two agent designs:

  1. Basic agent: Alternates LLM generation with Lean verification in a simple loop
  2. Advanced agent: Uses value estimation and sophisticated search strategies

Both agents solved the same Erdős problems, but the basic agent proved more expensive on the hardest problems. This suggests that search strategy becomes critical as problem difficulty increases.

The Verification Boundary

Lean is a proof assistant with a small trusted kernel. When you submit a proof, the kernel type-checks every tactic application. If a step fails, the entire proof fails. This creates a hard boundary: the LLM can hallucinate, but the kernel will reject invalid reasoning.

The agent does not try to verify its own work. It generates candidate tactics, submits them to Lean, and uses the kernel’s response to decide the next move. This separation is the key to trustworthiness.

Verification flow:

  • Agent generates a tactic string (e.g., rw [mul_comm])
  • Lean kernel attempts to apply the tactic to the current proof state
  • Kernel returns success (new proof state) or failure (type error)
  • Agent updates its search tree based on the result

The kernel runs in-process or via IPC depending on the agent implementation. Some systems use a separate Lean server process to isolate crashes and manage timeouts. The critical property is that the kernel never trusts the LLM output. It checks everything.

Proof State Management

A proof in Lean is a tree of goals. Each goal has a context (available hypotheses) and a target (statement to prove). Applying a tactic transforms one goal into zero or more subgoals.

The agent tracks this state explicitly. It maintains a data structure mapping each node in the search tree to a Lean proof state. When the LLM generates a tactic, the agent applies it to the current state and stores the result.

State representation (pseudocode):

-- Pseudocode representation of proof state tracking
structure ProofState where
  goals : List Goal           -- remaining subgoals to prove
  context : Environment       -- available definitions and lemmas
  metavars : MetavarContext   -- unification variables
  depth : Nat                 -- search depth for loop prevention

The environment includes all available lemmas and definitions. The metavariable context tracks unification variables introduced during proof search. Depth limits prevent infinite loops.

When a tactic fails, the agent does not repair it. It backtracks to the previous state and tries a different tactic. This is cheaper than trying to fix broken proofs, because the kernel already told you the tactic is invalid.

Search Strategy

The proof space is exponential. At each step, the agent can apply thousands of tactics. The paper’s findings show that search strategy matters more as problem difficulty increases.

The basic agent alternates generation and verification without sophisticated search. It samples tactics from the LLM, applies them in sequence, and backtracks on failure. This works for easier problems but becomes expensive on hard ones.

The advanced agent uses value estimation to prioritize promising branches. The LLM assigns a score to each partial proof based on how close it looks to completion. The agent explores high-value branches first.

Tactic Generation Loop

The LLM generates tactics conditioned on the current proof state. The prompt includes the goal, available hypotheses, and relevant lemmas from the library.

Generation cycle:

  1. Serialize the current Lean proof state into a text prompt
  2. Sample N candidate tactics from the LLM
  3. Apply each tactic to the proof state via the Lean kernel
  4. Filter out tactics that fail type-checking
  5. Add successful tactics to the search tree
  6. Select the next node to expand based on the search strategy

The serialization step is critical. The LLM sees a text representation of the proof state, not the internal Lean data structure. If the serialization loses information (e.g., implicit arguments, type class instances), the LLM will generate tactics that look plausible but fail verification.

Some agents include a repair step: if a tactic fails, the agent asks the LLM to fix it based on the error message. This rarely works. Type errors in formal proofs are usually not fixable with small edits. Backtracking is faster.

Agent Design Patterns

While the paper does not detail every implementation choice, formal proof agents typically follow common patterns:

Lemma retrieval: Lean’s standard library contains thousands of lemmas. Practical systems use a retrieval index. The agent embeds the current goal and queries a vector database of lemma statements. The top-k results go into the prompt.

Failure handling: The agent can fail in several ways:

  • Search exhaustion: The agent explores the entire search budget without finding a proof
  • Kernel timeout: A tactic takes too long to type-check
  • Library gap: The proof requires a lemma that does not exist in the library
  • Prompt drift: As the proof gets longer, the serialized state exceeds the LLM’s context window

The paper’s 2.5% success rate on open Erdős problems reflects these failure modes. Most attempted proofs exhaust the search budget without finding a solution.

Cost Model

The paper reports solving problems at “a few hundred dollars” per solved problem. This includes LLM inference costs and compute time for the Lean kernel.

At a 2.5% success rate on Erdős problems (9 solved out of 353 attempted), the effective cost per solved problem is significantly higher when accounting for failed attempts. If each attempt costs $200 and you need 40 attempts per success, the effective cost is $8,000 per solved problem.

Cost breakdown:

  • LLM inference: Dominates for hard problems. Each tactic generation costs a few cents. A proof requiring 10,000 tactic attempts costs $100 to $500 depending on the model.
  • Lean kernel: Cheap. Type-checking a tactic takes milliseconds on a single CPU core.

The cost scales with search depth. Better search strategies reduce cost by pruning unpromising branches early. The paper’s comparison of basic and advanced agents shows this effect: both solved the same problems, but the basic agent cost more on hard instances.

Deployment Shape

A production formal proof agent runs as a long-lived service. It accepts proof requests, manages a pool of Lean worker processes, and returns verified proofs or failure reports.

Architecture components:

  • API server: Accepts proof requests, returns results
  • Task queue: Buffers incoming requests, prioritizes by estimated difficulty
  • Worker pool: Runs Lean kernel instances, applies tactics, returns proof states
  • LLM service: Generates candidate tactics, can be local or remote
  • Result store: Caches completed proofs, avoids redundant work

The worker pool is the bottleneck. Each Lean instance consumes significant memory (500MB to 2GB depending on the library). Scaling horizontally requires careful resource management.

Some deployments use a hybrid approach: run the Lean kernel locally for low-latency verification, but call a remote LLM API for tactic generation. This trades network latency for GPU cost.

Observability

You need to instrument the agent to understand why it succeeds or fails. Key metrics:

  • Tactics attempted per proof: Measures search efficiency
  • Verification success rate: Fraction of generated tactics that pass type-checking
  • Proof depth distribution: Shows how long successful proofs are
  • Backtrack frequency: Indicates search strategy quality

Logging the full search tree is expensive but valuable for debugging. You can replay failed proofs, inspect the tactics the agent tried, and identify where it got stuck.

Technical Verdict

Use formal proof agents when:

  • You need machine-checkable correctness for mathematics research
  • You have a well-defined problem in a domain with a mature Lean library (combinatorics, graph theory, algebraic geometry)
  • You can afford the compute cost ($200+ per attempt, potentially thousands per solved problem when accounting for failures)
  • The value of a verified result exceeds the cost (e.g., resolving an open conjecture, validating a critical theorem)
  • You need a formal artifact that can be checked independently

Avoid when:

  • You need fast, approximate answers and can tolerate errors
  • The problem domain lacks a Lean formalization (most applied domains, business logic)
  • The proof requires novel lemmas not in the library
  • You cannot afford hundreds to thousands of dollars per problem or the 90%+ failure rate on hard problems
  • The verification overhead exceeds the value of formal correctness

The architecture works because it separates unreliable generation from reliable verification. The LLM explores the proof space, but the Lean kernel enforces correctness. This pattern generalizes to any domain where you can define a formal specification and a trusted verifier. The paper demonstrates viability for mathematics research. Extending to other domains (cryptographic protocols, safety-critical systems, smart contracts) requires building the corresponding formal libraries and verifiers.

Tags

agentic-ai orchestration infrastructure

Primary Source

arxiv.org