mech.app

MinerU: PDF Parsing for Agentic Workflows Without the Usual Pain

How MinerU handles layout analysis, OCR, and table extraction to turn PDFs into LLM-ready markdown. The plumbing behind document ingestion.

Source: github.com
Document parsing is where most agentic workflows quietly fail. Your agent can reason, plan, and call tools, but if it cannot extract clean text from a scanned invoice or a multi-column research paper, the entire pipeline stalls. MinerU (63K stars, trending #15 on GitHub for Python) tackles the unglamorous layer: transforming PDFs, DOCX, and PPTX into markdown or JSON that LLMs can actually consume.

This is not a summarization tool. It is the pre-processing step that sits before your RAG pipeline or agent tool call. The repo explicitly targets agentic workflows, which means it needs to handle layout analysis, OCR, table extraction, and figure detection without manual intervention.

What MinerU Actually Does

MinerU is a Python library that parses complex documents and outputs structured markdown or JSON. The pipeline handles:

  • Layout analysis: Identifies headers, body text, tables, figures, and captions.
  • OCR: Extracts text from scanned PDFs or images embedded in documents.
  • Table extraction: Converts tables into markdown or JSON structures.
  • Multi-format support: PDF, DOCX, PPTX, and XLSX.

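The output is designed to feed directly into LLM context windows or vector databases. When parsing succeeds, no intermediate cleaning step is required, though the failure modes covered below mean you still want a validation pass.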

Output Format

MinerU produces markdown with semantic structure preserved:

# Document Title

## Section Header

Body text with **bold** and *italic* formatting preserved.

| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| Data A   | Data B   | Data C   |

![Figure 1: Caption text](path/to/extracted/image.png)

For structured workflows, it also outputs JSON with layout metadata:

{
  "pages": [
    {
      "page_num": 1,
      "blocks": [
        {
          "type": "text",
          "content": "Body paragraph with extracted text content.",
          "bbox": [50, 100, 550, 150]
        },
        {
          "type": "table",
          "rows": [
            {"cells": ["Header 1", "Header 2", "Header 3"]},
            {"cells": ["Value A1", "Value A2", "Value A3"]},
            {"cells": ["Value B1", "Value B2", "Value B3"]}
          ],
          "bbox": [50, 200, 550, 400]
        }
      ]
    }
  ]
}

The bounding box coordinates let you trace back to the original document if the agent needs to verify a claim or show provenance.
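
To make the provenance idea concrete, here is a minimal sketch that walks JSON shaped like the sample above and returns every block matching a search string, together with its page number and bounding box. The field names (`pages`, `page_num`, `blocks`, `bbox`, `rows`, `cells`) come from the illustrative sample, not from a verified MinerU schema, and `block_text` and `find_blocks` are hypothetical helpers.

```python
import json

def block_text(block):
    """Flatten a block to plain text; table rows are joined cell by cell."""
    if block.get("type") == "table":
        return "\n".join(" | ".join(row["cells"]) for row in block.get("rows", []))
    return block.get("content", "")

def find_blocks(doc, needle):
    """Return (page_num, bbox, text) for every block containing `needle`."""
    hits = []
    for page in doc.get("pages", []):
        for block in page.get("blocks", []):
            text = block_text(block)
            if needle.lower() in text.lower():
                hits.append((page["page_num"], block.get("bbox"), text))
    return hits

# Sample data shaped like the JSON shown above (assumed schema).
sample = json.loads("""
{"pages": [{"page_num": 1, "blocks": [
  {"type": "text", "content": "Body paragraph with extracted text content.",
   "bbox": [50, 100, 550, 150]},
  {"type": "table", "rows": [{"cells": ["Header 1", "Header 2"]},
                             {"cells": ["Value A1", "Value A2"]}],
   "bbox": [50, 200, 550, 400]}
]}]}
""")

for page_num, bbox, _ in find_blocks(sample, "value a1"):
    print(f"page {page_num}, bbox {bbox}")
```

An agent can attach the returned page number and box to its answer so a reviewer can jump to the exact region of the source document.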

Architecture: How the Pipeline Works

MinerU runs a multi-stage pipeline for each document:

  1. PDF rendering: Converts each page to an image (for layout analysis) and extracts embedded text (for native PDFs).
  2. Layout detection: Uses a vision model to classify regions (text, table, figure, header, footer).
  3. Text extraction: Pulls text from native PDFs or runs OCR on scanned regions.
  4. Table parsing: Detects table structure (rows, columns, merged cells) and converts to markdown or JSON.
  5. Post-processing: Merges text blocks, orders content by reading flow, and cleans formatting artifacts.

Key Components

| Component | Role | Failure Mode |
|-----------|------|--------------|
| Layout model | Classifies page regions | Misses tables in dense multi-column layouts |
| OCR engine | Extracts text from images | Struggles with handwriting, low-resolution scans |
| Table parser | Detects cell boundaries | Breaks on borderless tables, complex merges |
| Reading order | Sequences blocks logically | Fails on two-column academic papers with footnotes |

The layout model is the critical bottleneck. If it misclassifies a table as a figure, the downstream parser cannot recover. MinerU uses a pre-trained vision model (likely a variant of LayoutLM or similar), but you cannot easily swap it out without forking the repo.

Handling Edge Cases in Production

Real-world documents break assumptions. Here is how MinerU handles (or does not handle) common edge cases:

Multi-Column PDFs

MinerU attempts to detect reading order by analyzing spatial relationships between text blocks. For two-column academic papers, it usually succeeds. For three-column newsletters with sidebars, it often fails and interleaves the columns incorrectly.

Workaround: Pre-process PDFs with column detection tools (like pdfplumber) or split pages manually before feeding to MinerU.
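
If you would rather repair the order after parsing, the bounding boxes give you enough to try a simple heuristic. The sketch below is an assumption-heavy illustration, not MinerU's own algorithm: it assumes each block carries a `bbox` of `[x0, y0, x1, y1]` with the origin at the top left, that you know the column count in advance, and that columns are equally wide. `reorder_columns` is a hypothetical helper.

```python
def reorder_columns(blocks, page_width, n_columns=2, full_width_ratio=0.6):
    """Reorder blocks so each column is read top to bottom, left to right.

    Blocks wider than `full_width_ratio` of the page (titles, wide tables)
    act as separators: everything above them is emitted first.
    """
    column_width = page_width / n_columns

    def column_of(block):
        x0, _, x1, _ = block["bbox"]
        center = (x0 + x1) / 2
        return min(int(center // column_width), n_columns - 1)

    ordered, band = [], []

    def flush():
        # Within a band, sort by column index first, then vertical position.
        band.sort(key=lambda b: (column_of(b), b["bbox"][1]))
        ordered.extend(band)
        band.clear()

    for block in sorted(blocks, key=lambda b: b["bbox"][1]):
        x0, _, x1, _ = block["bbox"]
        if (x1 - x0) >= full_width_ratio * page_width:
            flush()
            ordered.append(block)
        else:
            band.append(block)
    flush()
    return ordered

# Interleaved input: title, then left/right blocks alternating.
page = [
    {"id": "title", "bbox": [50, 20, 550, 50]},
    {"id": "L1", "bbox": [50, 100, 290, 200]},
    {"id": "R1", "bbox": [310, 100, 550, 200]},
    {"id": "L2", "bbox": [50, 220, 290, 300]},
    {"id": "R2", "bbox": [310, 210, 550, 300]},
]
print([b["id"] for b in reorder_columns(page, page_width=600)])
```

The equal-width assumption is exactly what breaks on newsletters with narrow sidebars, so treat this as a starting point and check it against a sample of your own documents.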

Scanned Documents

OCR quality depends on the underlying engine, which in MinerU's default pipeline is built on PaddleOCR models rather than Tesseract or a cloud API. MinerU does not expose OCR confidence scores in the output, so you cannot filter low-quality extractions automatically.

Workaround: Run a separate OCR pass with confidence scoring, then use MinerU only for layout analysis.

Non-Latin Scripts

MinerU supports multi-language OCR, but layout models trained on English documents often misclassify text blocks in Arabic, Chinese, or Hebrew. Right-to-left languages are especially problematic.

Workaround: Fine-tune the layout model on your target language or use language-specific parsers upstream.

Tables Without Borders

Borderless tables (common in financial reports) confuse the table parser. It relies on visual separators to detect cell boundaries. Without borders, it treats the table as plain text.

Workaround: Use a dedicated table extraction tool (like camelot-py) for documents with known table formats.

Integration with Agentic Workflows

MinerU fits into the ingestion layer of an agent pipeline. Here is a typical flow:

  1. Document arrives (user upload, webhook, file watcher).
  2. MinerU parses the document into markdown or JSON.
  3. Validation step checks for parsing errors (empty tables, missing text).
  4. Chunking splits the markdown into LLM-ready segments (by section, page, or token count).
  5. Embedding generates vectors for each chunk.
  6. Storage writes chunks and metadata to a vector database.
  7. Agent retrieval queries the database when a user asks a question.
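
Step 4 is the one you will most likely write yourself. As a rough sketch, the helpers below split markdown at headings and then pack paragraphs into size-limited pieces. Both function names are hypothetical, character count stands in for token count, and the heading regex will misfire on `#` comment lines inside code samples, so adjust it if your documents contain code.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_section(markdown):
    """Return one {'title', 'text'} dict per heading-delimited section."""
    sections = [{"title": "", "lines": []}]  # holds text before the first heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append({"title": match.group(2).strip(), "lines": []})
        else:
            sections[-1]["lines"].append(line)
    chunks = []
    for section in sections:
        text = "\n".join(section["lines"]).strip()
        if text:  # drop sections that contain only a heading
            chunks.append({"title": section["title"], "text": text})
    return chunks

def pack_paragraphs(text, max_chars=2000):
    """Greedily pack paragraphs into pieces of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than
    cut mid-sentence.
    """
    pieces, current = [], ""
    for para in re.split(r"\n\s*\n", text):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

doc = "# Title\n\nIntro.\n\n## A\n\nPara one.\n\nPara two.\n"
for chunk in split_by_section(doc):
    print(chunk["title"], pack_paragraphs(chunk["text"], max_chars=12))
```

Keeping the section title alongside each chunk gives the retrieval step some context to store as metadata.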

Code Example: Basic Pipeline

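import os

# NOTE: this follows the legacy magic-pdf 0.x Python API. Signatures changed
# in later releases, so check them against the version you have installed.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

pdf_path = "/path/to/document.pdf"
output_dir = "/path/to/output"
image_dir = os.path.join(output_dir, "images")

# UNIPipe takes the raw PDF bytes, a metadata dict, and an image writer
with open(pdf_path, "rb") as f:
    pdf_bytes = f.read()
image_writer = DiskReaderWriter(image_dir)
pipe = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer)

# Parse PDF
pipe.pipe_classify()  # Decide whether the PDF is text-based or scanned
pipe.pipe_analyze()   # Run the layout and OCR models
pipe.pipe_parse()     # Extract content
markdown = pipe.pipe_mk_markdown("images", drop_mode="none")

# Validate before feeding to agent
if len(markdown.strip()) < 100:
    raise ValueError("Parsing failed: output too short")

# Chunk and embed (pseudo-code: these three helpers are placeholders you supply)
chunks = split_by_section(markdown)
embeddings = embed_chunks(chunks)
store_in_vector_db(embeddings)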

The validation step is critical. MinerU does not fail gracefully. It will output an empty markdown file if the layout model crashes or the PDF is corrupted.
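
A length check alone misses the "empty tables" case from step 3 of the flow. The sketch below extends it with a scan for markdown tables that have a header and separator but no data rows. `validate_markdown` is a hypothetical helper and the 100-character threshold is an arbitrary default, so tune both to your documents.

```python
import re

TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")
SEPARATOR = re.compile(r"^\s*\|(\s*:?-{3,}:?\s*\|)+\s*$")

def validate_markdown(markdown, min_chars=100):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    text = markdown.strip()
    if len(text) < min_chars:
        problems.append(f"output too short ({len(text)} chars)")

    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not SEPARATOR.match(line):
            continue
        # The line after a table separator should be a row with some content.
        following = lines[i + 1] if i + 1 < len(lines) else ""
        cells = [c.strip() for c in following.strip().strip("|").split("|")]
        if not TABLE_ROW.match(following) or not any(cells):
            problems.append(f"table at line {i + 1} has no data rows")
    return problems

good = "# T\n\n" + "word " * 30 + "\n\n| A | B |\n|---|---|\n| 1 | 2 |\n"
empty_table = "x" * 120 + "\n\n| A | B |\n|---|---|\n"
print(validate_markdown(good))
print(validate_markdown(empty_table))
```

Returning a list instead of raising lets the caller decide whether a problem should block ingestion or only be logged.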

Observability and Debugging

MinerU does not expose structured logs or telemetry. You get console output and file artifacts. For production pipelines, you need to instrument it yourself:

  • Log file sizes: Track input PDF size vs. output markdown size. A 10MB PDF that produces 1KB of markdown is a red flag.
  • Parse time: MinerU can take 10-30 seconds per page for complex documents. Monitor latency to detect hangs.
  • Error rates: Wrap the parser in a try-except block and log exceptions. MinerU throws generic errors that do not indicate root cause.
  • Sample outputs: Randomly inspect parsed documents to catch silent failures (tables rendered as text, missing images).
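
The first three bullets can be folded into one wrapper. This is a minimal sketch, assuming you already have some `parse_fn(pdf_path)` that returns markdown as a string; `parse_with_metrics`, the logger name, and the `min_ratio` threshold are all hypothetical choices rather than anything MinerU provides.

```python
import logging
import os
import time

logger = logging.getLogger("ingest")

def parse_with_metrics(pdf_path, parse_fn, min_ratio=0.001):
    """Run parse_fn(pdf_path) and log size ratio, latency, and failures."""
    input_bytes = os.path.getsize(pdf_path)
    start = time.perf_counter()
    try:
        markdown = parse_fn(pdf_path)
    except Exception:
        # logger.exception records the traceback, which helps when the
        # error message itself is generic.
        logger.exception("parse failed: %s", pdf_path)
        raise
    elapsed = time.perf_counter() - start

    output_bytes = len(markdown.encode("utf-8"))
    ratio = output_bytes / input_bytes if input_bytes else 0.0
    metrics = {
        "path": pdf_path,
        "input_bytes": input_bytes,
        "output_bytes": output_bytes,
        "seconds": round(elapsed, 3),
        # A 10MB PDF yielding 1KB of markdown has a ratio of about 0.0001.
        "suspicious": ratio < min_ratio,
    }
    logger.info("parse metrics: %s", metrics)
    return markdown, metrics
```

Image-heavy PDFs legitimately produce small ratios, so the `suspicious` flag is best used to pick documents for the manual sampling in the last bullet, not to reject them outright.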

Common Failure Modes

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Empty output | Corrupted PDF, unsupported encryption | Pre-validate PDFs with PyPDF2 |
| Garbled text | Encoding mismatch, OCR failure | Check PDF metadata, re-run OCR |
| Missing tables | Layout model misclassification | Use dedicated table extractor |
| Slow parsing | High-resolution images, large file | Downsample images before parsing |
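
For the "empty output" row, even a dependency-free check catches the crudest cases before a document reaches the parser. The sketch below is a heuristic only, not a replacement for opening the file with a real PDF library: it looks for the `%PDF-` header, an `%%EOF` marker near the end, and an `/Encrypt` entry in the raw bytes. `precheck_pdf` is a hypothetical helper, and it can miss encryption or damage that a full parser would find.

```python
def precheck_pdf(path, tail_bytes=2048):
    """Cheap structural checks; returns a list of problems (empty = looks OK)."""
    with open(path, "rb") as f:
        data = f.read()

    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # A truncated upload usually loses the end-of-file marker.
    if b"%%EOF" not in data[-tail_bytes:]:
        problems.append("missing %%EOF marker (file may be truncated)")
    # Encrypted PDFs reference an /Encrypt dictionary from the trailer.
    if b"/Encrypt" in data:
        problems.append("encrypted (has /Encrypt entry)")
    return problems
```

Route any file with a non-empty problem list to a quarantine folder instead of the parse queue, so one bad upload does not consume a worker.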

Deployment Considerations

MinerU is a Python library, not a service. You need to wrap it in an API or task queue for production use.

Deployment Shapes

  1. Synchronous API: Flask/FastAPI endpoint that accepts a PDF and returns markdown. Works for small documents (under 10 pages). Timeouts are a problem for larger files.
  2. Async task queue: Celery or RQ worker that processes documents in the background. Clients poll for results or receive a webhook callback.
  3. Batch processing: Cron job or Airflow DAG that processes a directory of PDFs overnight. No real-time requirement.
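
The submit-then-poll contract of the second shape can be sketched with the standard library alone. This is an in-process stand-in, assuming a `parse_fn(pdf_path)` that returns markdown; `ParseJobs` is a hypothetical class, and a real deployment would put Celery or RQ behind the same two operations so that jobs survive restarts and run in separate processes.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

class ParseJobs:
    """In-process job registry: submit a document, then poll for the result."""

    def __init__(self, parse_fn, max_workers=2):
        self._parse_fn = parse_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def submit(self, pdf_path):
        """Queue a document and return a job id immediately."""
        job_id = uuid.uuid4().hex
        self._futures[job_id] = self._pool.submit(self._parse_fn, pdf_path)
        return job_id

    def status(self, job_id):
        """Report pending, failed (with the error), or done (with markdown)."""
        future = self._futures[job_id]
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": repr(error)}
        return {"state": "done", "markdown": future.result()}

    def shutdown(self):
        """Wait for queued jobs to finish and release the worker threads."""
        self._pool.shutdown(wait=True)
```

An HTTP layer would map `submit` to a POST that returns the job id and `status` to a GET that clients poll, which avoids the request timeouts of the synchronous shape.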

Resource Requirements

  • CPU: Layout analysis is CPU-bound. Expect 1-2 cores per worker.
  • Memory: 2-4GB per worker for typical documents. Spikes to 8GB for 100+ page PDFs with high-resolution images.
  • Disk: Temporary storage for intermediate images and JSON. Plan for 5x the input PDF size.
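
These figures turn into a quick sizing calculation. The sketch below uses the numbers quoted in this article as defaults (2 cores and 4GB per worker, 5x disk, 10-30 seconds per page); they are rough estimates to replace with your own measurements, and the function names are hypothetical.

```python
def max_workers(cores, memory_gb, cores_per_worker=2, memory_per_worker_gb=4):
    """Workers a host can run, limited by whichever resource runs out first."""
    by_cpu = cores // cores_per_worker
    by_memory = int(memory_gb // memory_per_worker_gb)
    return max(min(by_cpu, by_memory), 0)

def scratch_disk_bytes(pdf_sizes_bytes, multiplier=5):
    """Temporary disk to reserve for a batch of input PDFs."""
    return multiplier * sum(pdf_sizes_bytes)

def pages_per_hour(workers, seconds_per_page):
    """Throughput if every worker stays busy."""
    return workers * 3600 / seconds_per_page

workers = max_workers(cores=16, memory_gb=32)
print("workers:", workers)
print("pages/hour at 30 s/page:", pages_per_hour(workers, 30))
print("pages/hour at 10 s/page:", pages_per_hour(workers, 10))
print("scratch bytes for two 10MB PDFs:", scratch_disk_bytes([10_000_000, 10_000_000]))
```

On a 16-core, 32GB host this gives 8 workers and between 960 and 2,880 pages per hour. Budgeting 8GB per worker for large, image-heavy PDFs halves the worker count on the same host.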

Security Boundaries

MinerU parses whatever PDF you hand it, including untrusted ones. Malicious PDFs can exploit vulnerabilities in the underlying libraries (Pillow, PyMuPDF). Run workers in sandboxed containers with no network access. Do not trust user-uploaded PDFs without scanning them first.
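
One way to apply the sandboxing advice is to launch each parse in a locked-down container. The sketch below only builds the `docker run` argument list. The image name `mineru-worker:latest` is hypothetical, and it assumes the image's entrypoint reads PDFs from `/in` and writes results to `/out`.

```python
import shlex

def sandboxed_parse_command(input_dir, output_dir,
                            image="mineru-worker:latest",  # hypothetical image
                            memory="4g", cpus="2"):
    """Build a docker command that parses with no network and minimal privileges."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                      # no network access
        "--read-only",                            # immutable root filesystem
        "--tmpfs", "/tmp",                        # writable scratch space only
        "--cap-drop", "ALL",                      # drop Linux capabilities
        "--security-opt", "no-new-privileges",
        "--memory", memory,
        "--cpus", cpus,
        "-v", f"{input_dir}:/in:ro",              # untrusted input, read-only
        "-v", f"{output_dir}:/out",
        image,
    ]

cmd = sandboxed_parse_command("/data/in", "/data/out")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True, timeout=...)` adds a hard time limit, which also covers the hung-parse case from the observability section.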

Technical Verdict

Use it if:

  • You need to parse PDFs for RAG pipelines or agent tool calls.
  • Your documents have consistent layouts (reports, invoices, academic papers).
  • You can tolerate occasional parsing errors and have a validation layer.
  • You are prototyping an agentic workflow and need a quick solution for document ingestion.
  • You want open-source tooling that you can inspect, modify, and self-host.

Avoid it if:

  • You need 100% accuracy on tables or figures (use domain-specific parsers like Camelot or Tabula).
  • Your documents are scanned images with poor quality (invest in better OCR infrastructure first).
  • You need real-time parsing of large documents (MinerU is too slow for sub-second latency).
  • You are parsing non-English documents with complex scripts (layout models will fail on right-to-left text).
  • You require production-grade error handling and observability out of the box (you will need to build instrumentation yourself).

MinerU solves the document ingestion problem well enough for most agentic workflows. It is not perfect. Layout detection fails on edge cases, OCR quality varies, and error handling is minimal. But it is open source, actively maintained, and designed for the exact use case (LLM-ready output). For production systems at scale, plan to replace MinerU with a commercial service (like AWS Textract or Google Document AI) once you need higher accuracy or better SLAs. But for prototyping and mid-scale deployments, it is the best open-source option available.