Academic Research Skills: How a Full-Pipeline Agent Suite Orchestrates Literature Review, Summarization, and Citation

Deep dive into multi-stage research agent orchestration: state passing between search, PDF parsing, and citation extraction, plus failure handling for paywalls.

Academic Research Skills is a Claude Code Skills suite that chains literature search, PDF retrieval, summarization, and citation extraction into a single orchestrated pipeline. The project exposes a real problem in multi-stage agent design: how do you pass state between tools that fail unpredictably (paywalled papers, rate-limited APIs, and malformed PDFs) while maintaining research reproducibility across workflows that span days?

The architecture matters because academic research is a hard case for agent orchestration. You cannot rerun a literature search, even with the same parameters, and expect the same results, because citation graphs change daily. PDFs come in dozens of formats; some are scanned, some are encrypted. The system needs to handle partial success, cache intermediate artifacts, and let humans intervene without losing context.

Pipeline Architecture

The suite defines four core skills that compose into a full research workflow:

Deep Research: Literature discovery and retrieval. Queries academic APIs, deduplicates results, and attempts PDF downloads from multiple sources. State output is a manifest of papers with metadata, download status, and failure reasons.

Academic Paper: Summarization and synthesis. Ingests the manifest, parses PDFs, extracts sections, runs LLM summarization per section, then assembles a structured summary. State includes extracted text, section boundaries, and summary embeddings.

Peer Reviewer: Citation validation and critique. Takes the summary, extracts all citation claims, cross-references them against the original PDFs, flags hallucinated citations, and checks for position collapse (when the agent agrees too readily with user feedback).

Full Pipeline: Orchestrates the three skills above in sequence with checkpoint persistence. After each stage, it writes intermediate state to disk, logs tool call results, and checks integrity gates before proceeding.
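
The run, persist, gate loop described for Full Pipeline can be sketched roughly as follows. This is a minimal illustration, not the project's code: `run_pipeline`, the `Stage` and `Gate` callables, and the one-JSON-file-per-stage layout are all hypothetical names and choices made for this example.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical signatures: a stage maps the previous state to a new state,
# and a gate returns True when that state is safe to hand to the next stage.
Stage = Callable[[dict], dict]
Gate = Callable[[dict], bool]


def run_pipeline(stages: list[tuple[str, Stage, Gate]], workdir: Path, initial: dict) -> dict:
    """Run stages in order, persisting each output and gating every hand-off."""
    workdir.mkdir(parents=True, exist_ok=True)
    state = initial
    for name, stage, gate in stages:
        checkpoint = workdir / f"{name}.json"
        if checkpoint.exists():
            # Resume after a crash: reuse the saved output instead of re-running.
            state = json.loads(checkpoint.read_text(encoding="utf-8"))
        else:
            state = stage(state)
            checkpoint.write_text(json.dumps(state, indent=2), encoding="utf-8")
        if not gate(state):
            # Stop early so bad data never reaches the next stage.
            raise RuntimeError(f"integrity gate failed after stage '{name}'")
    return state
```

Because a stage is skipped whenever its checkpoint file exists, a human can inspect or hand-edit that file between runs and the next invocation picks up the edited state.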

State Passing and Checkpointing

The system uses file-based state passing instead of in-memory objects. After Deep Research completes, it writes a manifest containing the query, timestamp, and paper metadata including download status. Papers that hit paywalls or API errors are marked with failure reasons. The Academic Paper skill reads this manifest, skips papers with missing PDFs, and writes a parallel summaries file. The Peer Reviewer reads both files and cross-references claims.

State approach: File-based checkpoints.

Why Academic Research Skills uses it: Research workflows span hours or days. Crashes are expected. Human review is required between stages (inspect the paper list before spending tokens on summarization). Files are version-controllable and inspectable without running code.

Trade-off: Requires disk I/O and schema versioning. But the alternative (losing hours of work to a rate limit or crash) is worse for research use cases.
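
To make the manifest hand-off concrete, here is a small sketch of how one skill might write the file and the next might read it. The field names, the `schema_version` key, and the helpers `write_manifest` and `load_usable_papers` are assumptions for illustration; the article does not specify the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PaperRecord:
    # Hypothetical fields; the real manifest schema may differ.
    title: str
    doi: str
    status: str                           # "downloaded" or "failed"
    pdf_path: Optional[str] = None
    failure_reason: Optional[str] = None  # e.g. "paywall", "api_error"


def write_manifest(path: Path, query: str, timestamp: str, papers: list[PaperRecord]) -> None:
    """Write the manifest via a temp file so a crash never leaves a half-written one."""
    payload = {
        "schema_version": 1,
        "query": query,
        "timestamp": timestamp,
        "papers": [asdict(p) for p in papers],
    }
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    tmp.replace(path)


def load_usable_papers(path: Path) -> list[PaperRecord]:
    """Return only the papers the summarization stage can work on."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    return [
        PaperRecord(**p)
        for p in payload["papers"]
        if p["status"] == "downloaded" and p["pdf_path"]
    ]
```

Failed papers stay in the file with their failure reason, so a reviewer can see what was skipped and why without re-running the search.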

Failure Handling and Integrity Gates

The project documentation shows that the author systematically studied how AI fails in academic contexts and engineered specific countermeasures. The system defines integrity checks at two critical transition points: after the download stage and after summarization.

These gates exist because LLMs fail predictably in academic contexts. The design addresses failure modes including hallucinated citations (the agent confidently cites papers that do not exist), position collapse (the agent abandons its initial analysis too easily under social pressure), and premature convergence (the agent stops exploring alternative interpretations too quickly).

The post-download validation gate checks that a sufficient percentage of discovered papers have usable PDFs. If the success rate is too low, the agent prompts the user to adjust the query or manually upload PDFs. This prevents the summarization stage from running on an empty corpus.
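
A gate of this kind reduces to a ratio check. The sketch below assumes each paper is a dict with a `status` field and uses a 50% threshold purely as a placeholder; the article says only that the success rate must be "sufficient", so both the name `download_gate` and the default value are hypothetical.

```python
def download_gate(papers: list[dict], min_success_rate: float = 0.5) -> tuple[bool, float]:
    """Return (passed, success_rate) for a list of manifest entries.

    min_success_rate is an assumed default, not the project's value.
    """
    if not papers:
        # An empty corpus can never pass: there is nothing to summarize.
        return False, 0.0
    usable = sum(1 for p in papers if p.get("status") == "downloaded")
    rate = usable / len(papers)
    return rate >= min_success_rate, rate
```

When the gate returns `False`, the caller would stop and ask the user to adjust the query or supply PDFs manually rather than continue to summarization.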

The post-summarization citation verification gate runs hallucination detection. It extracts every citation claim from the generated summary, searches for it in the original PDF text, and flags any claim that cannot be verified. If too many citations are hallucinated, the agent refuses to proceed and logs the failures.
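
One deliberately simple way to approximate this check is sketched below. It assumes citations appear in "(Author, Year)" form and treats a citation as verified when both the surname and the year occur somewhere in the source text. That is a much weaker test than real claim verification, and the regex, function names, and 10% tolerance are all assumptions for illustration.

```python
import re

# Assumes "(Smith, 2020)" or "(Lee et al., 2019)" style citations only.
CITATION_RE = re.compile(r"\(([A-Z][A-Za-z\-]+)(?: et al\.)?,? (\d{4})\)")


def extract_citations(summary: str) -> list[tuple[str, str]]:
    """Return (surname, year) pairs found in the summary."""
    return CITATION_RE.findall(summary)


def unverified_citations(summary: str, corpus_text: str) -> list[tuple[str, str]]:
    """Citations whose surname or year does not appear in the source text."""
    haystack = corpus_text.lower()
    return [
        (author, year)
        for author, year in extract_citations(summary)
        if author.lower() not in haystack or year not in haystack
    ]


def citation_gate(summary: str, corpus_text: str, max_unverified_ratio: float = 0.1):
    """Return (passed, unverified). The 10% tolerance is an assumed default."""
    citations = extract_citations(summary)
    if not citations:
        return True, []
    missing = unverified_citations(summary, corpus_text)
    return len(missing) / len(citations) <= max_unverified_ratio, missing
```

The list of unverified pairs is returned alongside the pass/fail result so the failures can be logged, as the gate described above does.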

Paywalled Papers: The workflow explicitly tracks download status and failure reasons, allowing the pipeline to continue with partial success rather than failing completely.

API Rate Limiting: The workflow includes retry logic with exponential backoff. When an API returns a 429 status, the system waits before retrying, with increasing delays for subsequent failures.
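
The retry behavior can be sketched as below. To keep the example self-contained, the HTTP call is passed in as a hypothetical `request` callable returning `(status, body)`. The retry count, base delay, and added random jitter are assumptions, not values from the project.

```python
import random
import time
from typing import Callable


def call_with_backoff(
    request: Callable[[], tuple[int, str]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Call request(), retrying on HTTP 429 with exponentially growing delays.

    Any other status is returned to the caller unchanged.
    """
    for attempt in range(max_retries + 1):
        status, body = request()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Delay doubles each attempt; jitter (an addition here) spreads out
        # retries from clients that were rate limited at the same moment.
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries")
```

Passing `sleep` as a parameter lets the wait time be recorded or stubbed out, which is useful both for tests and for logging time spent on rate limits.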

Anti-Collapse Mechanisms

The Peer Reviewer skill includes a Devil’s Advocate (DA) mode that triggers when the agent detects agreement patterns. After five consecutive turns where the agent agrees with user feedback without pushback, it auto-injects a challenge question.

This mechanism addresses position collapse, where LLMs abandon their initial analysis too easily under social pressure. The DA mode is not optional. It fires automatically based on dialogue state.
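
The trigger logic amounts to a streak counter over dialogue turns. In the sketch below, whether a turn counts as "agreement" is supplied by the caller as a boolean, since the article does not say how the project classifies turns. `AgreementTracker` is a hypothetical name, and resetting the streak after a challenge is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AgreementTracker:
    threshold: int = 5  # five consecutive agreeing turns, per the description above
    streak: int = 0

    def record_turn(self, agent_agreed: bool) -> bool:
        """Record one turn; return True when a challenge should be injected."""
        self.streak = self.streak + 1 if agent_agreed else 0
        if self.streak >= self.threshold:
            # Assumed behavior: start counting again after challenging.
            self.streak = 0
            return True
        return False
```

A caller would check the return value after every agent turn and, when it is `True`, prepend a challenge question to the next response.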

The system also tracks a Dialogue Health Indicator, a score derived from conversation patterns. If the score drops below a threshold, the agent surfaces a warning that the conversation may be converging prematurely.

The design pattern is reusable: in any agent system where the AI needs to maintain a position under user pressure, you can implement similar agreement-tracking and automatic challenge injection.

Observability and Debugging

Every tool call logs to a structured format that tracks which skill invoked which tool, with input parameters, output metadata, latency, and errors. The Full Pipeline skill aggregates these logs into a timeline view that shows which papers failed at which stage, how many retries were attempted per tool, token usage per summarization batch, and time spent waiting on rate limits.

This is critical for debugging multi-hour runs. If the pipeline stalls at 80% completion, you can trace back to a specific API call that timed out or a PDF that crashed the parser.
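
As an illustration of the logging pattern, the sketch below writes one JSON object per tool call and then groups failures by skill. The record fields and the helper names `log_tool_call` and `failures_by_stage` are assumptions; the project's actual log format is not shown in the article.

```python
import json
from collections import defaultdict
from typing import Iterable, Optional, TextIO


def log_tool_call(
    log_file: TextIO,
    *,
    skill: str,
    tool: str,
    params: dict,
    latency_s: float,
    paper_id: Optional[str] = None,
    error: Optional[str] = None,
) -> None:
    """Append one tool call as a single JSON line (hypothetical record shape)."""
    record = {
        "skill": skill,
        "tool": tool,
        "params": params,
        "latency_s": latency_s,
        "paper_id": paper_id,
        "error": error,
    }
    log_file.write(json.dumps(record) + "\n")


def failures_by_stage(lines: Iterable[str]) -> dict[str, list[str]]:
    """Group the IDs of papers that hit an error by the skill that was running."""
    failed: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        rec = json.loads(line)
        if rec["error"] and rec["paper_id"]:
            failed[rec["skill"]].append(rec["paper_id"])
    return dict(failed)
```

One JSON object per line keeps the log appendable during a long run and easy to filter afterwards with standard tools.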

Technical Verdict

Use Academic Research Skills when you need a reproducible, inspectable pipeline for literature review and you are willing to manually review intermediate outputs. The file-based checkpointing and integrity gates make it suitable for long-running research workflows where failures are expected and human oversight is required.

Avoid it if you need real-time search, distributed execution, or integration with institutional authentication systems. The single-user design and checkpoint-based architecture assume local execution with human review between stages.

The anti-collapse mechanisms (Devil’s Advocate, Dialogue Health Indicator) are the most interesting contribution. They show how to detect and counteract LLM failure modes that emerge over multi-turn conversations. These patterns are reusable in any agent system where the AI needs to maintain a position under user pressure.

The integrity gates demonstrate a key principle: in multi-stage pipelines, fail early and surface the failure to humans. Do not let bad data propagate downstream where it compounds into worse failures.
