Document parsing is where most agentic workflows quietly fail. Your agent can reason, plan, and call tools, but if it cannot extract clean text from a scanned invoice or a multi-column research paper, the entire pipeline stalls. MinerU (63K stars, trending #15 on GitHub for Python) tackles the unglamorous layer: transforming PDFs, DOCX, and PPTX into markdown or JSON that LLMs can actually consume.
This is not a summarization tool. It is the pre-processing step that sits before your RAG pipeline or agent tool call. The repo explicitly targets agentic workflows, which means it needs to handle layout analysis, OCR, table extraction, and figure detection without manual intervention.
What MinerU Actually Does
MinerU is a Python library that parses complex documents and outputs structured markdown or JSON. The pipeline handles:
- Layout analysis: Identifies headers, body text, tables, figures, and captions.
- OCR: Extracts text from scanned PDFs or images embedded in documents.
- Table extraction: Converts tables into markdown or JSON structures.
- Multi-format support: PDF, DOCX, PPTX, and XLSX.
The output is designed to feed directly into LLM context windows or vector databases. No intermediate cleaning step required.
Output Format
MinerU produces markdown with semantic structure preserved:
# Document Title
## Section Header
Body text with **bold** and *italic* formatting preserved.
| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| Data A | Data B | Data C |

For structured workflows, it also outputs JSON with layout metadata:
{
  "pages": [
    {
      "page_num": 1,
      "blocks": [
        {
          "type": "text",
          "content": "Body paragraph with extracted text content.",
          "bbox": [50, 100, 550, 150]
        },
        {
          "type": "table",
          "rows": [
            {"cells": ["Header 1", "Header 2", "Header 3"]},
            {"cells": ["Value A1", "Value A2", "Value A3"]},
            {"cells": ["Value B1", "Value B2", "Value B3"]}
          ],
          "bbox": [50, 200, 550, 400]
        }
      ]
    }
  ]
}
The bounding box coordinates let you trace back to the original document if the agent needs to verify a claim or show provenance.
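A minimal sketch of how that provenance lookup might work, assuming the JSON shape shown in the example above. The `pages`, `blocks`, `rows`, and `bbox` field names are taken from that example and may not match the files MinerU actually writes; the helper names are illustrative only.

```python
import json

SAMPLE = """
{
  "pages": [
    {
      "page_num": 1,
      "blocks": [
        {"type": "text", "content": "Body paragraph.", "bbox": [50, 100, 550, 150]},
        {"type": "table",
         "rows": [{"cells": ["H1", "H2"]}, {"cells": ["A1", "A2"]}],
         "bbox": [50, 200, 550, 400]}
      ]
    }
  ]
}
"""


def table_to_markdown(rows):
    """Render a list of {"cells": [...]} rows as a markdown table (first row = header)."""
    if not rows:
        return ""
    header = rows[0]["cells"]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    for row in rows[1:]:
        lines.append("| " + " | ".join(row["cells"]) + " |")
    return "\n".join(lines)


def iter_blocks_with_provenance(doc):
    """Yield (text, provenance) pairs; provenance records the page number and bbox."""
    for page in doc.get("pages", []):
        for block in page.get("blocks", []):
            if block.get("type") == "table":
                text = table_to_markdown(block.get("rows", []))
            else:
                text = block.get("content", "")
            yield text, {"page": page.get("page_num"), "bbox": block.get("bbox")}


doc = json.loads(SAMPLE)
for text, source in iter_blocks_with_provenance(doc):
    print(source, text.splitlines()[0] if text else "")
```

Storing the provenance dict next to each chunk in your vector database lets the agent cite a page and region rather than just a file name.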
Architecture: How the Pipeline Works
MinerU runs a multi-stage pipeline for each document:
- PDF rendering: Converts each page to an image (for layout analysis) and extracts embedded text (for native PDFs).
- Layout detection: Uses a vision model to classify regions (text, table, figure, header, footer).
- Text extraction: Pulls text from native PDFs or runs OCR on scanned regions.
- Table parsing: Detects table structure (rows, columns, merged cells) and converts to markdown or JSON.
- Post-processing: Merges text blocks, orders content by reading flow, and cleans formatting artifacts.
Key Components
| Component | Role | Failure Mode |
|---|---|---|
| Layout model | Classifies page regions | Misses tables in dense multi-column layouts |
| OCR engine | Extracts text from images | Struggles with handwriting, low-resolution scans |
| Table parser | Detects cell boundaries | Breaks on borderless tables, complex merges |
| Reading order | Sequences blocks logically | Fails on two-column academic papers with footnotes |
The layout model is the critical bottleneck. If it misclassifies a table as a figure, the downstream parser cannot recover. MinerU uses a pre-trained vision model (likely a variant of LayoutLM or similar), but you cannot easily swap it out without forking the repo.
Handling Edge Cases in Production
Real-world documents break assumptions. Here is how MinerU handles (or does not handle) common edge cases:
Multi-Column PDFs
MinerU attempts to detect reading order by analyzing spatial relationships between text blocks. For two-column academic papers, it usually succeeds. For three-column newsletters with sidebars, it often fails and interleaves text from adjacent columns.
Workaround: Pre-process PDFs with column detection tools (like pdfplumber) or split pages manually before feeding to MinerU.
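If you already have block bounding boxes, a simple column-aware sort can serve as a fallback. The sketch below is a heuristic, not MinerU's actual reading-order algorithm. It assumes `[x0, y0, x1, y1]` boxes with the origin at the top-left of the page and equal-width columns; the function name and the `n_columns` parameter are hypothetical.

```python
def order_by_columns(blocks, page_width, n_columns=2):
    """Sort blocks column by column, then top to bottom within each column.

    Each block is a dict with a "bbox" of [x0, y0, x1, y1]. A block is assigned
    to the column that contains its horizontal centre.
    """
    column_width = page_width / n_columns

    def sort_key(block):
        x0, y0, x1, _ = block["bbox"]
        centre = (x0 + x1) / 2
        # Clamp so a block touching the right edge stays in the last column.
        column = min(int(centre // column_width), n_columns - 1)
        return (column, y0, x0)

    return sorted(blocks, key=sort_key)


blocks = [
    {"content": "right-top", "bbox": [320, 100, 580, 200]},
    {"content": "left-bottom", "bbox": [20, 300, 280, 400]},
    {"content": "left-top", "bbox": [20, 100, 280, 200]},
    {"content": "right-bottom", "bbox": [320, 300, 580, 400]},
]
ordered = [b["content"] for b in order_by_columns(blocks, page_width=600)]
print(ordered)
```

Full-width elements such as titles or spanning tables break the equal-width assumption, so handle them separately before sorting.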
Scanned Documents
OCR quality depends on the underlying engine (MinerU's pipeline bundles PaddleOCR-based models rather than calling Tesseract or a cloud API). MinerU does not expose OCR confidence scores in the output, so you cannot filter low-quality extractions automatically.
Workaround: Run a separate OCR pass with confidence scoring, then use MinerU only for layout analysis.
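As a rough illustration of that filtering step, the sketch below assumes your separate OCR pass returns word-level confidence on a 0 to 100 scale (Tesseract reports it this way; other engines use 0 to 1). The `OcrWord` structure and the threshold of 60 are hypothetical and should be tuned on your own documents.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class OcrWord:
    text: str
    confidence: float  # assumed scale: 0 (no confidence) to 100 (certain)


def page_quality(words, min_confidence=60.0):
    """Return (mean_confidence, kept_text) for one page of OCR output."""
    if not words:
        return 0.0, ""
    kept = [w.text for w in words if w.confidence >= min_confidence]
    return mean(w.confidence for w in words), " ".join(kept)


words = [OcrWord("Invoice", 96.0), OcrWord("t0ta1", 31.0), OcrWord("2024", 88.0)]
score, text = page_quality(words)
print(round(score, 1), text)
```

A low page-level mean is a reasonable trigger for routing the page to manual review instead of indexing it.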
Non-Latin Scripts
MinerU supports multi-language OCR, but layout models trained on English documents often misclassify text blocks in Arabic, Chinese, or Hebrew. Right-to-left languages are especially problematic.
Workaround: Fine-tune the layout model on your target language or use language-specific parsers upstream.
Tables Without Borders
Borderless tables (common in financial reports) confuse the table parser. It relies on visual separators to detect cell boundaries. Without borders, it treats the table as plain text.
Workaround: Use a dedicated table extraction tool (like camelot-py) for documents with known table formats.
Integration with Agentic Workflows
MinerU fits into the ingestion layer of an agent pipeline. Here is a typical flow:
- Document arrives (user upload, webhook, file watcher).
- MinerU parses the document into markdown or JSON.
- Validation step checks for parsing errors (empty tables, missing text).
- Chunking splits the markdown into LLM-ready segments (by section, page, or token count).
- Embedding generates vectors for each chunk.
- Storage writes chunks and metadata to a vector database.
- Agent retrieval queries the database when a user asks a question.
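The validation step in this flow can start as a handful of cheap checks on the markdown. The sketch below is one possible shape for it; the function name, the 100-character threshold, and the specific checks are assumptions, not something MinerU provides.

```python
import re


def validate_markdown(markdown, min_chars=100):
    """Return a list of human-readable problems found in parsed markdown."""
    problems = []
    if len(markdown.strip()) < min_chars:
        problems.append("output too short")

    # Table rows start with "|". Rows made only of pipes, dashes, colons, and
    # whitespace are separators or empty rows and carry no cell content.
    table_rows = [line for line in markdown.splitlines() if line.strip().startswith("|")]
    data_rows = [row for row in table_rows if not re.fullmatch(r"[\s|:\-]+", row)]
    if table_rows and not data_rows:
        problems.append("table with no cell content")

    # The Unicode replacement character usually signals an encoding or OCR failure.
    if "\ufffd" in markdown:
        problems.append("replacement characters in text")
    return problems


sample = "# Title\n\n" + "Body text. " * 20 + "\n\n| A | B |\n|---|---|\n| 1 | 2 |\n"
print(validate_markdown(sample))
```

Returning a list of problems rather than raising lets the pipeline decide per document whether to retry, fall back to another parser, or quarantine the file.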
Code Example: Basic Pipeline
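import os

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# NOTE: these imports come from the legacy magic-pdf 0.x package. Newer
# `mineru` releases expose a different interface, so check your installed version.
pdf_path = "/path/to/document.pdf"
output_dir = "/path/to/output"
image_dir = os.path.join(output_dir, "images")

# UNIPipe takes the PDF as bytes, a model-result dict, and an image writer
with open(pdf_path, "rb") as f:
    pdf_bytes = f.read()
image_writer = DiskReaderWriter(image_dir)
pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)

# Parse PDF
pipe.pipe_classify()  # Decide whether the PDF is text-based or scanned
pipe.pipe_analyze()   # Run the layout and OCR models
pipe.pipe_parse()     # Extract content
markdown = pipe.pipe_mk_markdown("images", drop_mode="none")

# Validate before feeding to agent
if len(markdown.strip()) < 100:
    raise ValueError("Parsing failed: output too short")

# Chunk and embed (pseudo-code: these three helpers are placeholders you supply)
chunks = split_by_section(markdown)
embeddings = embed_chunks(chunks)
store_in_vector_db(embeddings)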
The validation step is critical. MinerU does not fail gracefully: if the layout model crashes or the PDF is corrupted, it can write an empty markdown file instead of raising an error.
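The `split_by_section` call in the example is a placeholder. One possible implementation is sketched below, assuming the markdown uses `#`-style headings and that a character cap is an acceptable stand-in for a token limit. The chunk dictionary layout and the `max_chars` default are illustrative choices.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_section(markdown, max_chars=2000):
    """Split markdown into chunks at headings, then cap each chunk at max_chars."""
    sections, title, lines = [], "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            if lines:
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if lines:
        sections.append((title, "\n".join(lines).strip()))

    chunks = []
    for title, body in sections:
        if not body:
            continue
        # Character-based cap as a rough stand-in for a token limit.
        for start in range(0, len(body), max_chars):
            chunks.append({"section": title, "text": body[start:start + max_chars]})
    return chunks


sample_md = "# Title\nIntro text.\n## Methods\nStep one.\nStep two.\n"
for chunk in split_by_section(sample_md):
    print(chunk["section"], "->", len(chunk["text"]), "chars")
```

Keeping the section title on every chunk gives the retriever some context even when a long section is split mid-paragraph. Lines starting with `#` inside code samples would be mistaken for headings, so this needs refinement for code-heavy documents.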
Observability and Debugging
MinerU does not expose structured logs or telemetry. You get console output and file artifacts. For production pipelines, you need to instrument it yourself:
- Log file sizes: Track input PDF size vs. output markdown size. A 10MB PDF that produces 1KB of markdown is a red flag.
- Parse time: MinerU can take 10-30 seconds per page for complex documents. Monitor latency to detect hangs.
- Error rates: Wrap the parser in a try-except block and log exceptions. MinerU throws generic errors that do not indicate root cause.
- Sample outputs: Randomly inspect parsed documents to catch silent failures (tables rendered as text, missing images).
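The first three items can be covered by one thin wrapper. The sketch below assumes you have a `parse_fn(pdf_path)` callable that returns markdown as a string; that callable is a placeholder for however you invoke MinerU. The `min_ratio` default of 0.001 is an assumption derived from the 10MB-to-1KB example above.

```python
import logging
import os
import tempfile
import time

logger = logging.getLogger("doc_ingest")


def parse_with_metrics(parse_fn, pdf_path, min_ratio=0.001):
    """Run parse_fn(pdf_path), recording latency, sizes, and any exception."""
    input_bytes = os.path.getsize(pdf_path)
    metrics = {"path": pdf_path, "input_bytes": input_bytes}
    started = time.perf_counter()
    try:
        markdown = parse_fn(pdf_path)
    except Exception as exc:
        logger.exception("parse failed: %s", pdf_path)
        markdown = None
        metrics["error"] = repr(exc)
    metrics["seconds"] = round(time.perf_counter() - started, 3)
    output_bytes = len(markdown.encode("utf-8")) if markdown else 0
    metrics["output_bytes"] = output_bytes
    # Flag outputs that are implausibly small relative to the input.
    metrics["suspicious"] = input_bytes > 0 and output_bytes / input_bytes < min_ratio
    logger.info("parse metrics: %s", metrics)
    return markdown, metrics


# Demo with a fake 10,000-byte "PDF" and a stub parser that returns almost nothing.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"0" * 10_000)
_, demo_metrics = parse_with_metrics(lambda path: "tiny", tmp.name)
os.remove(tmp.name)
print(demo_metrics["suspicious"])
```

Emitting the metrics dict as a structured log line makes it easy to alert on the `suspicious` flag or on latency percentiles later.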
Common Failure Modes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Empty output | Corrupted PDF, unsupported encryption | Pre-validate PDFs with pypdf (the successor to PyPDF2) |
| Garbled text | Encoding mismatch, OCR failure | Check PDF metadata, re-run OCR |
| Missing tables | Layout model misclassification | Use dedicated table extractor |
| Slow parsing | High-resolution images, large file | Downsample images before parsing |
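For the empty-output row, a few byte-level checks catch the most obvious cases before any parser runs. This is a heuristic sketch that uses only the standard library, not a replacement for opening the file with a real PDF library. The function name is hypothetical, and the `/Encrypt` search can produce false positives if that string happens to appear elsewhere in the file.

```python
def precheck_pdf(data: bytes):
    """Cheap structural checks on raw PDF bytes; returns a list of problems."""
    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # The end-of-file marker should appear near the end of a complete file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker (truncated file?)")
    # An /Encrypt entry in the trailer indicates an encrypted document.
    if b"/Encrypt" in data:
        problems.append("encrypted")
    return problems


sample_pdf = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
print(precheck_pdf(sample_pdf))
print(precheck_pdf(b"not a pdf"))
```

Rejecting these files up front produces a clear error message for the uploader instead of an empty markdown file further down the pipeline.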
Deployment Considerations
MinerU is a Python library, not a service. You need to wrap it in an API or task queue for production use.
Deployment Shapes
- Synchronous API: Flask/FastAPI endpoint that accepts a PDF and returns markdown. Works for small documents (under 10 pages). Timeouts are a problem for larger files.
- Async task queue: Celery or RQ worker that processes documents in the background. Clients poll for results or receive a webhook callback.
- Batch processing: Cron job or Airflow DAG that processes a directory of PDFs overnight. No real-time requirement.
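To make the submit-and-poll pattern concrete, here is an in-process stand-in built on the standard library's thread pool. It is a sketch of the interface only: a real deployment would use Celery or RQ with a persistent broker as described above. The `ParseJobs` class and the `parse_fn` callable are hypothetical names.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class ParseJobs:
    """In-process stand-in for a task queue: submit a document, poll for the result."""

    def __init__(self, parse_fn, max_workers=2):
        self._parse_fn = parse_fn  # placeholder for the function that calls MinerU
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def submit(self, pdf_path):
        """Queue a document and return a job id the client can poll with."""
        job_id = uuid.uuid4().hex
        self._futures[job_id] = self._pool.submit(self._parse_fn, pdf_path)
        return job_id

    def status(self, job_id):
        """Report pending, failed (with the error message), or done (with markdown)."""
        future = self._futures[job_id]
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "markdown": future.result()}

    def shutdown(self):
        """Wait for queued jobs to finish and release the worker threads."""
        self._pool.shutdown(wait=True)


jobs = ParseJobs(lambda path: "# parsed " + path)
job_id = jobs.submit("report.pdf")
jobs.shutdown()
print(jobs.status(job_id))
```

The same three operations (submit, status, result) map directly onto an HTTP API: a POST that returns the job id and a GET that returns the status payload.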
Resource Requirements
- CPU: Layout analysis is CPU-bound. Expect 1-2 cores per worker.
- Memory: 2-4GB per worker for typical documents. Spikes to 8GB for 100+ page PDFs with high-resolution images.
- Disk: Temporary storage for intermediate images and JSON. Plan for 5x the input PDF size.
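These figures translate into simple capacity arithmetic. The sketch below uses the upper bounds quoted above (2 cores and 4GB per worker, 5x disk) as defaults; the function names are illustrative, and the numbers should be replaced with measurements from your own documents.

```python
import shutil


def max_workers(total_memory_gb, cores, per_worker_gb=4, per_worker_cores=2):
    """Number of workers a host can run, limited by whichever resource runs out first."""
    return min(total_memory_gb // per_worker_gb, cores // per_worker_cores)


def has_scratch_space(input_bytes, scratch_dir, multiplier=5):
    """True if scratch_dir has at least multiplier x input_bytes free."""
    free = shutil.disk_usage(scratch_dir).free
    return free >= input_bytes * multiplier


print(max_workers(total_memory_gb=32, cores=8))
print(has_scratch_space(50 * 1024**2, "."))
```

Checking scratch space before accepting a job avoids half-written intermediate files when the disk fills up mid-parse.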
Security Boundaries
MinerU parses untrusted input. Malicious PDFs can exploit vulnerabilities in the underlying libraries (Pillow, PyMuPDF). Run workers in sandboxed containers with no network access. Do not trust user-uploaded PDFs without scanning them first.
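Inside the container, running each parse in a child process adds a second layer of containment: a wall-clock timeout plus a CPU-time limit. The sketch below is POSIX-only and complements container isolation rather than replacing it; memory and network limits are better enforced by the container runtime. The `run_isolated` helper is hypothetical, and the demo command is a stand-in for whatever command launches your parser.

```python
import resource
import subprocess
import sys


def _cap_cpu(cpu_seconds):
    """Return a hook that lowers the child's CPU-time limit before it starts."""
    def apply():
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        # Never ask for more than the existing hard limit allows.
        wanted = cpu_seconds if hard == resource.RLIM_INFINITY else min(cpu_seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (wanted, hard))
    return apply


def run_isolated(argv, timeout=300, cpu_seconds=120):
    """Run a command in a child process with a wall-clock timeout and a CPU cap."""
    return subprocess.run(
        argv,
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
        preexec_fn=_cap_cpu(cpu_seconds),  # not safe in multi-threaded parents
    )


# Stand-in command; replace with the command that launches your parser.
result = run_isolated([sys.executable, "-c", "print('parsed')"])
print(result.returncode, result.stdout.strip())
```

A crash or hang in the child then costs one job rather than the whole worker, and the parent can log the return code and stderr for debugging.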
Technical Verdict
Use it if:
- You need to parse PDFs for RAG pipelines or agent tool calls.
- Your documents have consistent layouts (reports, invoices, academic papers).
- You can tolerate occasional parsing errors and have a validation layer.
- You are prototyping an agentic workflow and need a quick solution for document ingestion.
- You want open-source tooling that you can inspect, modify, and self-host.
Avoid it if:
- You need 100% accuracy on tables or figures (use domain-specific parsers like Camelot or Tabula).
- Your documents are scanned images with poor quality (invest in better OCR infrastructure first).
- You need real-time parsing of large documents (MinerU is too slow for sub-second latency).
- You are parsing non-English documents with complex scripts (layout models will fail on right-to-left text).
- You require production-grade error handling and observability out of the box (you will need to build instrumentation yourself).
MinerU solves the document ingestion problem well enough for most agentic workflows. It is not perfect: layout detection fails on edge cases, OCR quality varies, and error handling is minimal. It is, however, open source, actively maintained, and designed for exactly this use case (LLM-ready output). For production systems at scale, plan to move to a commercial service (like AWS Textract or Google Document AI) once you need higher accuracy or better SLAs. For prototyping and mid-scale deployments, it is the best open-source option available.
Source Links
- Primary repository: github.com/opendatalab/MinerU
- PyPI package: pypi.org/project/mineru