Most agent frameworks rely on prompt engineering or fine-tuning to improve task performance. Both approaches have friction: prompts drift as tasks evolve, and fine-tuning requires labeled data plus GPU time. Qiu et al. introduce a third path in their recent work: unsupervised skill discovery that injects reusable procedural knowledge at inference time without touching model parameters.
The core idea is to let agents explore data-analysis tasks, extract verifier signals from their own trajectories, and distill those signals into skills that future runs can reuse. No human labels. No parameter updates. Just a feedback loop between exploration, verification, and skill extraction.
Why Inference-Time Skills Matter
Traditional agent improvement paths:
- Prompt engineering: Fast but brittle. Skills encoded in natural language drift when task distributions shift.
- Fine-tuning: Durable but expensive. Requires labeled examples, compute budget, and redeployment.
- RAG-based injection: Flexible but shallow. Retrieves examples or documentation, not procedural logic.
Inference-time skill augmentation sits between RAG and fine-tuning. You store executable procedures (not just text snippets) and inject them into the agent’s context when relevant. The challenge is discovering which procedures are worth keeping when you have no ground-truth labels.
DataCOPE Architecture
DataCOPE, the framework Qiu et al. propose, coordinates three components in a loop:
- Data-Analytic Agent: Generates exploration trajectories on unlabeled tasks.
- Unsupervised Verifier: Extracts quality signals from trajectories without human labels.
- Skill Manager: Distills high-signal trajectories into reusable skills by contrasting them with low-signal trajectories. No model parameters are trained in this step.
The verifier is task-format-specific. For report-style analysis (narrative explanations with charts), DataCOPE uses an Adaptive Checklist Verifier that derives task-specific criteria and scores reports by verifiable coverage. For reasoning-style analysis (multi-step calculations), it uses an Answer Agreement Verifier that groups trajectories by final answer and treats self-consistency as a proxy for correctness.
Adaptive Checklist Verifier
Report-style data analysis is hard to verify because success criteria are implicit. A good report might include trend analysis, outlier detection, or correlation checks, but which checks matter depends on the dataset.
The Adaptive Checklist Verifier:
- Samples initial trajectories from the agent.
- Extracts common analytical patterns (e.g., “check for missing values,” “compare distributions across groups”).
- Scores each trajectory by how many checklist items it covers.
- Refines the checklist iteratively as new trajectories reveal gaps.
This gives you a verifier signal without human annotation. Trajectories that cover more checklist items get higher scores, and those scores drive skill distillation.
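The coverage-scoring step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ChecklistItem`, `covers`, and `coverage_score` are hypothetical names, and naive keyword matching stands in for whatever judgment (most likely an LLM call) a real verifier would use to decide whether a report covers an item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One hypothetical checklist criterion with keywords that signal coverage."""
    name: str
    keywords: tuple[str, ...]


def covers(report: str, item: ChecklistItem) -> bool:
    # Assumption: keyword matching is a stand-in for an LLM-based judgment.
    text = report.lower()
    return any(keyword in text for keyword in item.keywords)


def coverage_score(report: str, checklist: list[ChecklistItem]) -> float:
    """Fraction of checklist items the report covers (0.0 for an empty checklist)."""
    if not checklist:
        return 0.0
    hits = sum(covers(report, item) for item in checklist)
    return hits / len(checklist)


# Illustrative checklist and reports; none of these come from the paper.
checklist = [
    ChecklistItem("missing_values", ("missing value", "null")),
    ChecklistItem("group_comparison", ("across groups", "by segment")),
    ChecklistItem("outliers", ("outlier", "anomal")),
]
reports = {
    "run_a": "We checked for missing values and flagged two outliers in revenue.",
    "run_b": "Revenue went up over the year.",
}
scores = {run_id: coverage_score(text, checklist) for run_id, text in reports.items()}
```

Here `run_a` covers two of the three items and `run_b` covers none, so `run_a` would be the stronger candidate for skill distillation. Iterative refinement would amount to adding or rewording `ChecklistItem` entries between rounds.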
Answer Agreement Verifier
For reasoning-style tasks (e.g., “What is the median sales figure for Q3?”), you can group trajectories by their final answer. If five trajectories arrive at the same answer and two arrive at different answers, the majority cluster likely contains correct reasoning.
The Answer Agreement Verifier:
- Runs multiple trajectories per task.
- Clusters trajectories by final answer.
- Assigns higher scores to trajectories in the largest cluster.
- Uses self-consistency as an auxiliary signal for skill quality.
This is cheaper than outcome-based verification, which requires executing code or querying external systems, and less arbitrary than hand-written heuristic scoring. It remains a proxy: a majority of trajectories can still agree on a wrong answer.
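A minimal sketch of agreement scoring, assuming final answers are available as strings. The `normalize` rule and the choice to score each trajectory by its cluster's share of all runs are illustrative assumptions; the paper may cluster and weight answers differently.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Assumption: light normalization so "1,200" and "1200 " land in one cluster.
    return answer.strip().lower().replace(",", "")


def agreement_scores(final_answers: list[str]) -> list[float]:
    """Score each trajectory by the share of runs that reached the same answer."""
    if not final_answers:
        return []
    keys = [normalize(answer) for answer in final_answers]
    counts = Counter(keys)
    total = len(keys)
    return [counts[key] / total for key in keys]


# Seven hypothetical runs of one task: five agree, two disagree.
answers = ["1200", "1,200", "1200 ", "1150", "1200", "980", "1200"]
scores = agreement_scores(answers)
majority_answer, majority_size = Counter(
    normalize(answer) for answer in answers
).most_common(1)[0]
```

Trajectories in the five-run cluster score 5/7 and the two dissenting runs score 1/7 each. When there is no clear majority, every score is low, which is a useful cue to skip the task rather than distill from it.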
Skill Distillation Flow
Once the verifier assigns scores to trajectories, the system distills high-quality procedures:
- Contrastive pairing: For each task, pair high-score trajectories (positive) with low-score trajectories (negative).
- Procedure extraction: Identify reusable steps in positive trajectories (e.g., “group by category, then compute rolling mean”).
- Skill storage: Store procedures as structured templates with placeholders for dataset-specific details.
- Injection logic: At inference time, retrieve relevant skills and prepend them to the agent’s prompt or tool context.
The contrastive step is critical. Without negative examples, the system might extract overly specific procedures that only work on the task they were discovered on. By comparing high-score and low-score trajectories, DataCOPE isolates the procedural differences that matter.
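The pairing and extraction steps might look like the sketch below. Everything here is an assumption made for illustration: the `Trajectory` shape, the `high` and `low` score thresholds, and the exact-string step comparison, which stands in for the semantic comparison (probably LLM-driven) a real skill manager would need.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Trajectory:
    task_id: str
    steps: list[str]
    score: float  # verifier score, assumed to lie in [0, 1]


def contrastive_pairs(
    trajectories: list[Trajectory], high: float = 0.7, low: float = 0.3
) -> list[tuple[Trajectory, Trajectory]]:
    """Pair high-score and low-score trajectories that belong to the same task."""
    by_task: dict[str, list[Trajectory]] = {}
    for trajectory in trajectories:
        by_task.setdefault(trajectory.task_id, []).append(trajectory)
    pairs = []
    for runs in by_task.values():
        positives = [t for t in runs if t.score >= high]
        negatives = [t for t in runs if t.score <= low]
        pairs.extend(product(positives, negatives))
    return pairs


def distinguishing_steps(positive: Trajectory, negative: Trajectory) -> list[str]:
    # Assumption: exact string match; steps shared with the negative run are
    # treated as uninformative and dropped.
    seen = set(negative.steps)
    return [step for step in positive.steps if step not in seen]


runs = [
    Trajectory("t1", ["load data", "group by category", "compute rolling mean"], 0.9),
    Trajectory("t1", ["load data", "compute global mean"], 0.2),
    Trajectory("t2", ["load data", "plot histogram"], 0.5),
]
pairs = contrastive_pairs(runs)
diffs = [distinguishing_steps(pos, neg) for pos, neg in pairs]
```

Task `t2` yields no pair because its only run is neither clearly good nor clearly bad. For `t1`, the shared "load data" step is filtered out, leaving the grouping and rolling-mean steps as the candidate skill.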
Implementation Boundaries
| Component | Storage | Retrieval | Execution |
|---|---|---|---|
| Skills | JSON templates with step sequences | Embedding-based similarity search over task descriptions | Injected into agent prompt or tool call context |
| Verifier signals | Ephemeral (computed per iteration, not persisted) | Not applicable | Runs during skill discovery loop, not at inference |
| Trajectories | Logged for analysis, pruned after skill extraction | Sampled for contrastive pairing | Not reused directly; only distilled skills persist |
Skills are the only persistent artifact. Trajectories and verifier signals are intermediate byproducts of the discovery loop.
Illustrative Skill Template
The following structure represents a plausible skill format, though the paper does not specify exact implementation details:
{
  "skill_id": "outlier_detection_iqr",
  "description": "Detect outliers using interquartile range method",
  "steps": [
    "Compute Q1 (25th percentile) and Q3 (75th percentile) for {column}",
    "Calculate IQR = Q3 - Q1",
    "Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as outliers",
    "Return count and indices of outliers"
  ],
  "applicability": "Numeric columns with potential outliers",
  "embedding": [0.23, -0.45, 0.67]
}
At inference time, when the agent encounters a task like “Analyze sales data for anomalies,” the retrieval layer would fetch outlier_detection_iqr and inject its steps into the agent’s context. In this design, placeholders like {column} are bound to actual dataset column names by the agent’s execution layer, which parses the skill template and substitutes runtime values before executing each step.
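Placeholder binding for a template of this shape can be sketched with the standard library alone. The helper names (`placeholders`, `bind_skill`, `render_for_prompt`) and the rendered prompt layout are hypothetical; the template is a trimmed copy of the illustrative one above, which is itself not taken from the paper.

```python
import json
from string import Formatter

# Trimmed copy of the illustrative template; not a format defined by the paper.
SKILL_JSON = """
{
  "skill_id": "outlier_detection_iqr",
  "steps": [
    "Compute Q1 (25th percentile) and Q3 (75th percentile) for {column}",
    "Calculate IQR = Q3 - Q1",
    "Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as outliers"
  ]
}
"""


def placeholders(step: str) -> set[str]:
    """Names of the {placeholder} fields that appear in one step."""
    return {field for _, field, _, _ in Formatter().parse(step) if field}


def bind_skill(skill: dict, bindings: dict[str, str]) -> list[str]:
    """Substitute runtime values into every step, failing loudly on gaps."""
    needed = set().union(*(placeholders(step) for step in skill["steps"]))
    missing = needed - set(bindings)
    if missing:
        raise KeyError(f"unbound placeholders: {sorted(missing)}")
    return [step.format_map(bindings) for step in skill["steps"]]


def render_for_prompt(skill: dict, bindings: dict[str, str]) -> str:
    # Assumption: skills are injected as a numbered list prepended to the prompt.
    lines = [f"Skill: {skill['skill_id']}"]
    lines += [f"{i}. {step}" for i, step in enumerate(bind_skill(skill, bindings), 1)]
    return "\n".join(lines)


skill = json.loads(SKILL_JSON)
prompt_block = render_for_prompt(skill, {"column": "revenue"})
```

Raising on unbound placeholders is a deliberate choice in this sketch: a step that reaches the agent with a literal `{column}` in it is harder to debug than an early failure.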
Operational Challenges
Skill overfitting: If the discovery loop runs on a narrow task distribution, extracted skills may not generalize. Mitigation: sample tasks from diverse domains during exploration.
Verifier drift: The Adaptive Checklist Verifier refines its criteria over time. If early iterations produce low-quality checklists, later skills inherit those biases. Mitigation: bootstrap the checklist with a small set of human-validated criteria, then let it adapt.
Skill composition errors: When multiple skills apply to the same task, the agent must decide which to use and in what order. DataCOPE does not address composition logic; it assumes the agent’s planner handles precedence. In practice, you need a separate orchestration layer (e.g., a DAG of skill dependencies).
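One way to make skill precedence explicit is a dependency map ordered with Python's standard `graphlib` module. The skill names and the `SKILL_DEPENDENCIES` map below are hypothetical, and DataCOPE itself does not define this layer; this is only a sketch of the kind of orchestration mentioned above.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each skill lists the skills that must run first.
SKILL_DEPENDENCIES = {
    "impute_missing_values": set(),
    "outlier_detection_iqr": {"impute_missing_values"},
    "group_rolling_mean": {"impute_missing_values"},
    "trend_summary": {"group_rolling_mean", "outlier_detection_iqr"},
}


def execution_order(selected: set[str], dependencies: dict[str, set[str]]) -> list[str]:
    """Order the selected skills so each runs after its selected prerequisites."""
    # Only direct dependencies among the selected skills are kept; a chain
    # that passes through an unselected skill is ignored in this sketch.
    subgraph = {
        skill: dependencies.get(skill, set()) & selected for skill in selected
    }
    # static_order() raises graphlib.CycleError if the map contains a cycle.
    return list(TopologicalSorter(subgraph).static_order())


order = execution_order(
    {"trend_summary", "outlier_detection_iqr", "impute_missing_values"},
    SKILL_DEPENDENCIES,
)
```

With these three skills selected, imputation is ordered first and the trend summary last. The relative order of skills that do not depend on each other is not guaranteed.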
Context pollution: Injecting too many skills into the prompt can exceed context windows or confuse the agent. Mitigation: limit retrieval to top-k skills by relevance score, and prune skills that have not been used in N inference runs.
Version conflicts: If you update a skill template (e.g., change the IQR threshold from 1.5 to 2.0), runs that depend on the old behavior may break or produce results that are not comparable with earlier ones. Mitigation: version skills with semantic versioning and maintain a compatibility matrix.
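The context-pollution mitigation (top-k retrieval plus pruning of unused skills) can be sketched as below. The `StoredSkill` fields, the `min_similarity` cutoff, and the idle-run counter are assumptions for illustration. A production system would typically delegate the similarity search to a vector index rather than a Python loop.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredSkill:
    skill_id: str
    embedding: list[float]
    runs_since_last_use: int = 0  # hypothetical usage counter


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query, skills, k=3, min_similarity=0.2):
    """Return at most k skills, most similar first, above a relevance floor."""
    scored = [(cosine(query, skill.embedding), skill) for skill in skills]
    relevant = [pair for pair in scored if pair[0] >= min_similarity]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in relevant[:k]]


def record_run(skills, used):
    """Reset the idle counter for skills that fired; increment it for the rest."""
    used_ids = {skill.skill_id for skill in used}
    for skill in skills:
        fired = skill.skill_id in used_ids
        skill.runs_since_last_use = 0 if fired else skill.runs_since_last_use + 1


def prune_stale(skills, max_idle_runs):
    """Drop skills that have gone unused for max_idle_runs consecutive runs."""
    return [s for s in skills if s.runs_since_last_use < max_idle_runs]


store = [
    StoredSkill("outlier_detection_iqr", [1.0, 0.0]),
    StoredSkill("zscore_outliers", [0.9, 0.1]),
    StoredSkill("seasonal_decompose", [0.0, 1.0]),
    StoredSkill("unrelated", [-1.0, 0.0]),
]
selected = retrieve_top_k([1.0, 0.0], store, k=2)
record_run(store, selected)
record_run(store, selected)
store = prune_stale(store, max_idle_runs=2)
```

After two runs in which only the top two skills fire, the other two reach the idle limit and are pruned. The relevance floor matters as much as `k`: without it, a sparse skill store would still inject its k "least irrelevant" skills into every prompt.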
Evaluation Results
Qiu et al. evaluated DataCOPE on two benchmarks:
- Deep Data Research (report-style): Narrative analysis with visualizations.
- DABStep (reasoning-style): Multi-step calculations with verifiable answers.
Across both settings and four model configurations (GPT-4, Claude 3.5 Sonnet, Llama 2 70B, and Mistral Large), DataCOPE improved mean score by 9.71% over baselines (arXiv).
The paper does not report inference latency or token overhead from skill injection. In production systems, you want to cache skill embeddings and precompute retrieval indices. If your agent runs in a loop (e.g., analyzing hundreds of datasets per hour), amortize retrieval cost by batching tasks with similar skill requirements.
Comparison to Other Approaches
DataCOPE differs from other unsupervised agent improvement methods:
- Self-play: Agents improve by competing against themselves (e.g., AlphaGo). Requires a well-defined reward signal and game-like environment. DataCOPE works in open-ended data analysis where rewards are ambiguous.
- In-context learning from trajectories: Agents learn by observing past successful runs. This is closer to RAG (retrieval of examples) than skill extraction. DataCOPE distills reusable procedures, not just example trajectories.
- Reinforcement learning from human feedback (RLHF): Requires human labels for preference ranking. DataCOPE is fully unsupervised, relying on verifier signals derived from the agent’s own exploration.
The key trade-off is signal quality versus supervision cost. DataCOPE sacrifices some accuracy (no human labels means noisier signals) to eliminate annotation overhead.
Technical Verdict
DataCOPE is a practical middle ground between prompt hacking and fine-tuning. It automates skill discovery without labels and injects procedural knowledge at inference time without parameter updates. The architecture is clean: agent, verifier, skill manager. The operational challenges are manageable if you version skills, limit retrieval scope, and monitor for overfitting.
The approach works best when you have unlabeled data-analysis tasks and cannot afford human annotation. Your agent’s task distribution should be stable enough that skills discovered on past tasks transfer to future tasks. You need to invest in skill storage (vector DB or key-value store), retrieval logic (embedding search), and observability (skill usage metrics). This lets you improve agent performance without retraining or fine-tuning.
Skip DataCOPE if any of the following holds:
- Your tasks are one-off or highly heterogeneous, so skills won’t transfer.
- You already have labeled data and can fine-tune cheaply.
- Your agent operates in a low-latency environment where retrieval overhead is unacceptable.
- You need formal guarantees about skill correctness. Unsupervised verification is probabilistic, not deterministic.
The biggest operational challenge is skill lifecycle management. You need tooling to prune stale skills, detect conflicts, and audit which skills fire on which tasks. If you already run an agent orchestration platform (e.g., LangGraph, Temporal, or a custom DAG executor), you can bolt DataCOPE onto your existing infrastructure. If you are starting from scratch, budget time for the plumbing.