Vulnerability disclosure volumes now exceed what security teams can assess. Three research communities (exploit generation, vulnerability prioritization, detection engineering) operate in isolation, each solving one piece of the problem. FORGE connects all three with a multi-agent pipeline in which exploit attempts also produce the telemetry that detection signatures are written from.
The system runs five specialized agents in a fixed sequence: Intel gathers CVE metadata, Generator builds vulnerable applications, Planner maps exploitation strategy, Exploit executes graduated attacks, and Detector writes Sigma and Snort rules from OpenTelemetry traces. The key mechanism is graduated exploitation depth, a four-level taxonomy (L0: no evidence through L3: full compromise) that scores partial progress instead of binary pass/fail.
Orchestration Flow and Agent Handoffs
FORGE uses a fixed pipeline with explicit state boundaries between agents. Each agent receives structured input from the previous stage and writes structured output to shared storage.
Pipeline stages:
- Intel Agent: Parses CVE metadata, extracts affected versions, CWE types, and attack vectors. Outputs structured JSON with vulnerability context.
- Generator Agent: Reads Intel output, synthesizes a vulnerable application in the target language (supports eight languages). Writes Dockerfile and application code to disk.
- Planner Agent: Analyzes CVE description and CWE type, produces exploitation strategy as a sequence of steps. Outputs natural language plan with expected intermediate states.
- Exploit Agent: Executes multi-turn exploitation guided by the Planner’s strategy. An LLM-primary oracle scores each attempt on the L0-L3 taxonomy. Writes exploitation transcript and telemetry traces.
- Detector Agent: Consumes OpenTelemetry traces from Exploit stage, generates Sigma rules for SIEM and Snort rules for network monitoring. Outputs detection signatures grounded in observed behavior.
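The handoff pattern in the list above can be sketched as a fixed sequence of functions that each read earlier outputs from shared state and write their own. This is a minimal illustration, not the authors' implementation: `PipelineState`, `run_pipeline`, the stub stage functions, and the placeholder CVE identifier are all hypothetical names, and the stubs return canned data where the real agents would call an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineState:
    """Shared structured state passed between stages (hypothetical schema)."""
    cve_id: str
    artifacts: dict = field(default_factory=dict)
    completed: list = field(default_factory=list)


Stage = Callable[[PipelineState], dict]


def run_pipeline(state: PipelineState, stages: list) -> PipelineState:
    """Run stages in a fixed order; each stage writes one named artifact."""
    for name, stage in stages:
        state.artifacts[name] = stage(state)
        state.completed.append(name)
    return state


# Stub stages standing in for the LLM-backed agents (assumed outputs).
def intel(state):
    return {"cwe": "CWE-89", "language": "python"}

def generator(state):
    return {"dockerfile": "Dockerfile", "cwe": state.artifacts["intel"]["cwe"]}

def planner(state):
    return {"steps": ["probe input", "inject payload"]}

def exploit(state):
    return {"depth": 1, "traces": [{"http.target": "/search?q='"}]}

def detector(state):
    # Only the telemetry is read here, not the planner's or exploit's reasoning.
    return {"rule_count": len(state.artifacts["exploit"]["traces"])}


STAGES = [("intel", intel), ("generator", generator), ("planner", planner),
          ("exploit", exploit), ("detector", detector)]

result = run_pipeline(PipelineState(cve_id="CVE-0000-0000"), STAGES)
```

Keeping every handoff as a named entry in one state object makes the stage boundaries explicit, which is what lets a later stage such as the detector be restricted to a single upstream artifact.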
State management uses a tiered knowledge architecture. Each CVE assessment accumulates build patterns, exploitation techniques, and detection signatures. Subsequent CVEs query this knowledge base to reuse successful strategies. The system does not use vector embeddings or semantic search; it matches on CWE type and language.
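Because lookup is an exact match on CWE type and language rather than a similarity search, the knowledge base can be pictured as a dictionary keyed on that pair. The sketch below assumes this reading; the `KnowledgeBase` class, its method names, and the `kind` labels are illustrative and do not come from the paper.

```python
from collections import defaultdict


class KnowledgeBase:
    """Exact-match store keyed on (CWE type, language); hypothetical design."""

    def __init__(self):
        self._entries = defaultdict(list)

    def record(self, cwe: str, language: str, kind: str, pattern: str) -> None:
        """Store a pattern learned while assessing one CVE."""
        self._entries[(cwe, language.lower())].append(
            {"kind": kind, "pattern": pattern}
        )

    def lookup(self, cwe: str, language: str, kind: str = None) -> list:
        """Return stored patterns for the same CWE and language, if any."""
        hits = self._entries.get((cwe, language.lower()), [])
        return [h["pattern"] for h in hits if kind is None or h["kind"] == kind]


kb = KnowledgeBase()
kb.record("CWE-89", "Python", "build", "pin the ORM to the affected version")
kb.record("CWE-89", "Python", "exploit", "boolean-based injection in query param")

reused = kb.lookup("CWE-89", "python", kind="exploit")
missed = kb.lookup("CWE-79", "python")
```

The trade-off of exact matching is visible in `missed`: a CVE with a different CWE type gets nothing back, even if a related technique is already stored.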
Graduated Exploitation as the Bridging Mechanism
Traditional exploit generation reports binary outcomes: exploit works or it doesn’t. FORGE scores exploitation depth on four levels:
| Level | Definition | Signal for Detection | Signal for Prioritization |
|---|---|---|---|
| L0 | No evidence of exploitation | Baseline traffic patterns | Likely not exploitable in practice |
| L1 | Vulnerability triggered, no control | Anomalous requests logged | Reachable but not weaponizable |
| L2 | Partial control (read, limited write) | Privilege escalation attempts | Exploitable with constraints |
| L3 | Full compromise (arbitrary code execution) | Command execution traces | Critical priority |
This taxonomy solves two problems. First, it provides richer behavioral traces for detection engineering: an attempt that stalls at L1 still shows what anomalous traffic looks like. Second, it offers a check on prioritization models. If a CVE has a high EPSS score but every attempt stalls at L0, the metadata-based score is suspect.
The Exploit Agent runs multiple turns per CVE. The LLM-primary oracle evaluates each attempt and updates the exploitation depth. The Detector Agent runs in parallel, consuming telemetry from all attempts regardless of success level. This means detection rules capture both successful and failed exploitation patterns.
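The four levels are ordered, so they map naturally onto an integer enum. The sketch below assumes that a CVE's final depth is the deepest level reached across its turns; the text only says the oracle "updates" the depth, so that aggregation rule, along with the names `Depth`, `best_depth`, and `l1_plus_rate`, is an assumption.

```python
from enum import IntEnum


class Depth(IntEnum):
    """Graduated exploitation depth, ordered from no evidence to full compromise."""
    L0 = 0  # no evidence of exploitation
    L1 = 1  # vulnerability triggered, no control
    L2 = 2  # partial control
    L3 = 3  # full compromise


def best_depth(turn_scores) -> Depth:
    """Deepest level reached over all turns (assumed aggregation rule)."""
    return max(turn_scores, default=Depth.L0)


def l1_plus_rate(per_cve_depths) -> float:
    """Fraction of CVEs whose final depth is L1 or deeper."""
    if not per_cve_depths:
        return 0.0
    return sum(d >= Depth.L1 for d in per_cve_depths) / len(per_cve_depths)


turns = [Depth.L0, Depth.L2, Depth.L1]
final = best_depth(turns)
rate = l1_plus_rate([Depth.L0, Depth.L1, Depth.L3, Depth.L0])
```

An "L1+" success rate then reduces to a threshold comparison on the enum, and a binary pass/fail metric is the special case where the threshold is L3.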
Detection Rule Generation from Telemetry
The Detector Agent reads OpenTelemetry traces written by the Exploit Agent. These traces include HTTP requests, process execution, file system access, and network connections. The agent generates two rule types:
Sigma rules (SIEM detection):
- Match on process execution patterns (unexpected child processes, privilege escalation)
- Correlate file access with network activity
- Trigger on anomalous API call sequences
Snort rules (network detection):
- Match on HTTP request patterns (unusual headers, payload signatures)
- Detect protocol violations
- Identify exploit-specific traffic shapes
The Detector Agent does not have access to the Exploit Agent’s internal reasoning. It only sees telemetry traces. This prevents the exploit strategy from poisoning detection logic. If the Exploit Agent uses a novel technique, the Detector Agent must infer it from observed behavior.
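To make the telemetry-only constraint concrete, the sketch below turns a single HTTP span into a Snort 2.x-style rule using nothing but the observed request URI. It is an illustration under assumptions: the span is a plain dictionary with an `http.target` key, the helper names are hypothetical, and the emitted rule would need human review and testing against a real Snort build before use.

```python
SAFE_CHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/._-=?&"
)


def snort_content(value: str) -> str:
    """Encode a string for a content option, hex-escaping unusual bytes."""
    out, hex_run = [], []
    for ch in value:
        if ch in SAFE_CHARS:
            if hex_run:  # close any open |..| hex group first
                out.append("|" + " ".join(hex_run) + "|")
                hex_run = []
            out.append(ch)
        else:
            hex_run.extend(f"{b:02X}" for b in ch.encode("utf-8"))
    if hex_run:
        out.append("|" + " ".join(hex_run) + "|")
    return "".join(out)


def snort_rule_from_span(span: dict, sid: int) -> str:
    """Build one rule from observed telemetry only (no exploit reasoning)."""
    uri = snort_content(span["http.target"])
    return (
        "alert tcp $EXTERNAL_NET any -> $HOME_NET $HTTP_PORTS "
        '(msg:"Suspicious URI observed during exploitation attempt"; '
        "flow:to_server,established; "
        f'content:"{uri}"; http_uri; '
        f"sid:{sid}; rev:1;)"
    )


# Hypothetical span, as it might appear in the Exploit stage's traces.
span = {"http.method": "GET", "http.target": "/search?q=' OR 1=1"}
rule = snort_rule_from_span(span, sid=1000001)
```

A rule built this way matches one literal payload, so it is brittle; generalizing across several observed spans is where most of the detection engineering judgment would go.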
Failure Modes and Recovery
The fixed pipeline means one agent’s failure blocks downstream stages. FORGE handles this with per-agent retry logic and fallback strategies.
Generator Agent failure: If the agent cannot synthesize a vulnerable application (missing dependencies, unsupported language version), it writes a failure marker. The pipeline skips to the next CVE. The system accumulates these failures and periodically retries with updated build knowledge.
Exploit Agent failure: If all exploitation attempts score L0, the Detector Agent still runs. It generates baseline detection rules from failed attempts. These rules are tagged as low-confidence but still useful for anomaly detection.
Detector Agent failure: If rule generation fails (malformed telemetry, unsupported protocol), the system logs the error and continues. Detection rules are optional output; exploitation depth is the primary deliverable.
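The three failure policies above can be sketched as one per-CVE function. Everything here is illustrative: `StageError`, `assess`, the status strings, and the `"normal"` confidence label are assumed names, and the stage arguments are plain callables standing in for the agents.

```python
import logging

log = logging.getLogger("forge_sketch")


class StageError(Exception):
    """Raised by a stage that cannot produce its output."""


def assess(cve_id, generate, exploit, detect) -> dict:
    """Run one CVE through build, exploit, and detect with per-stage policies."""
    result = {"cve": cve_id, "status": "ok", "depth": None,
              "rules": [], "rule_confidence": None}
    try:
        app = generate(cve_id)
    except StageError as err:
        # Generator failure: leave a marker and move on to the next CVE.
        log.warning("build failed for %s: %s", cve_id, err)
        result["status"] = "build_failed"
        return result

    depth, traces = exploit(app)
    result["depth"] = depth
    try:
        # The detector runs even when every attempt scored L0.
        result["rules"] = detect(traces)
        result["rule_confidence"] = "low" if depth == 0 else "normal"
    except StageError as err:
        # Detector failure: log and continue; depth is still reported.
        log.warning("rule generation failed for %s: %s", cve_id, err)
        result["status"] = "rules_missing"
    return result


def failing_build(cve_id):
    raise StageError("unsupported language version")


results = [
    assess("CVE-A", lambda c: "app", lambda a: (0, ["t1"]), lambda t: ["rule"]),
    assess("CVE-B", failing_build, lambda a: (0, []), lambda t: []),
]
retry_queue = [r["cve"] for r in results if r["status"] == "build_failed"]
```

Collecting the `build_failed` markers into `retry_queue` mirrors the periodic retry described for the Generator Agent.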
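The LLM-primary oracle introduces non-determinism. The same exploitation attempt may score differently across runs. FORGE mitigates this with repeated scoring: each attempt is scored three times, and the median score is used. With three passes, the median equals the majority score whenever two passes agree and falls back to the middle level when all three differ.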
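A minimal sketch of the three-pass scoring, assuming the oracle can be wrapped as a function that returns an integer level from 0 to 3. The names `stable_score` and `score_once` are hypothetical, and the iterator stands in for a non-deterministic model call.

```python
from statistics import median_low


def stable_score(score_once, attempt, passes: int = 3) -> int:
    """Score the same attempt several times and keep the median level."""
    if passes < 1:
        raise ValueError("passes must be at least 1")
    scores = [score_once(attempt) for _ in range(passes)]
    # median_low always returns one of the observed levels, never an average.
    return median_low(scores)


# Stand-in for a noisy oracle: three passes disagree on the same transcript.
_votes = iter([1, 3, 2])
example = stable_score(lambda attempt: next(_votes), attempt="transcript-1")
```

Using `median_low` rather than `median` matters if the number of passes is ever made even, since the plain median would then average two levels into a value outside the taxonomy.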
Token Budget and Cost Management
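The system processes 603 CVEs at $1.50 per CVE. At an assumed flat rate of $0.01 per 1,000 tokens for GPT-4 class models, $1.50 pays for roughly 150,000 tokens per assessment. The nominal allocation below totals 200,000 tokens, which would cost $2.00 at that rate, so the reported cost implies an effective discount of about 25% (for example, from prompt caching). Token allocation across agents: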
- Intel Agent: 5,000 tokens (CVE parsing, metadata extraction)
- Generator Agent: 30,000 tokens (application synthesis, Dockerfile generation)
- Planner Agent: 20,000 tokens (exploitation strategy, multi-step planning)
- Exploit Agent: 100,000 tokens (multi-turn exploitation, oracle scoring)
- Detector Agent: 45,000 tokens (rule generation, telemetry analysis)
The Exploit Agent takes the largest share, half of the budget, because it runs multiple turns with full context. The system uses prompt caching for repeated CVE patterns (same CWE type, same language), which reduces cost by 30-40% for common vulnerability classes.
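As a sanity check on the figures in this section, the arithmetic can be written out directly. The flat $0.01 per 1,000 tokens is the same assumption used above, not a published price, and the variable names are illustrative. Under that assumption the five allocations sum to 200,000 tokens, which would cost $2.00, while $1.50 corresponds to 150,000 tokens, a gap of 25%.

```python
# Per-agent token allocation as listed above.
ALLOCATION = {
    "intel": 5_000,
    "generator": 30_000,
    "planner": 20_000,
    "exploit": 100_000,
    "detector": 45_000,
}
PRICE_PER_1K_TOKENS = 0.01   # assumed flat rate in dollars
REPORTED_COST_PER_CVE = 1.50  # dollars, from the text

total_tokens = sum(ALLOCATION.values())
cost_at_flat_rate = total_tokens / 1000 * PRICE_PER_1K_TOKENS
tokens_for_reported_cost = REPORTED_COST_PER_CVE / PRICE_PER_1K_TOKENS * 1000

# Discount needed to reconcile the allocation with the reported cost.
implied_discount = 1 - REPORTED_COST_PER_CVE / cost_at_flat_rate

# Share of the budget taken by each agent.
shares = {agent: tokens / total_tokens for agent, tokens in ALLOCATION.items()}
```

The implied discount sits below the 30-40% caching saving quoted for common vulnerability classes, which would be consistent with caching applying to only part of the workload, though the text does not confirm that this is the source of the gap.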
Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│                    Tiered Knowledge Base                    │
│  (Build Patterns, Exploitation Techniques, Detection Sigs)  │
└─────────────────────────────────────────────────────────────┘
                               ▲
                               │ Query/Update
                               │
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Intel   │───▶│Generator │───▶│ Planner  │───▶│ Exploit  │
│  Agent   │    │  Agent   │    │  Agent   │    │  Agent   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
                                                      │
                                                      │ OpenTelemetry
                                                      ▼
                                                ┌──────────┐
                                                │ Detector │
                                                │  Agent   │
                                                └──────────┘
                                                      │
                                                      ▼
                                             Sigma + Snort Rules
Implementation Notes
The paper does not specify the LLM provider, but the token budget and cost structure suggest GPT-4 or Claude 3. The LLM-primary oracle is a separate model call that evaluates each exploitation attempt. This adds at least one model invocation per turn, which can roughly double the token cost for the Exploit Agent.
OpenTelemetry instrumentation requires the vulnerable application to run in a controlled environment. FORGE uses Docker containers with telemetry agents injected at runtime. This adds 10-15 seconds per exploitation attempt but provides structured traces.
The system achieves 67.8% L1+ exploitation across 603 CVEs. Exploitation rates remain near 68% regardless of EPSS or CVSS band. This indicates that pattern-level reachability (can the agent synthesize a working exploit?) is orthogonal to metadata-based scoring (does the CVE look severe?). In this evaluation, a high CVSS score did not predict whether the agent could exploit the CVE.
Technical Verdict
Use FORGE’s architecture when:
- You need to assess large volumes of CVEs (hundreds per week) and cannot manually triage all of them.
- You want detection rules generated alongside exploit validation, not as a separate workflow.
- You have budget for LLM calls ($1.50 per CVE is acceptable) and can tolerate 30-60 second latency per assessment.
- You operate in environments where graduated exploitation depth (L0-L3) provides more signal than binary exploit/no-exploit.
Avoid this approach when:
- You need deterministic, reproducible exploit generation. The LLM-primary oracle introduces non-determinism.
- Your vulnerability backlog is small (fewer than 50 CVEs per month). Manual assessment is faster and cheaper.
- You require formal verification of detection rules. LLM-generated Sigma and Snort rules need human review before production deployment.
- You cannot run untrusted code in isolated environments. The Generator Agent synthesizes potentially malicious applications.
The graduated exploitation taxonomy is the most reusable idea. Even without the full multi-agent pipeline, scoring exploitation depth on L0-L3 provides richer signal than binary outcomes. The detection engineering integration is novel but requires careful validation. LLM-generated detection rules are a starting point, not production-ready signatures.