MinerU: Advanced Document Parsing for AI Workflows
MinerU is a powerful open-source toolkit designed to convert unstructured documents, particularly PDFs, into structured, machine-readable formats such as Markdown and JSON. Developed by the OpenDataLab team, it addresses the challenges of processing complex scientific literature and other document types for use in large language model (LLM) pipelines and agentic workflows. Born from the pre-training needs of models like InternLM, MinerU focuses on high-fidelity extraction while preserving semantic coherence and document structure.
Core Functionality
At its heart, MinerU processes documents through a modular pipeline that handles various elements:
- Layout Analysis: Utilizes advanced models like DocLayout-YOLO to detect and classify elements such as text blocks, headings, lists, images, tables, and formulas with high precision.
- Text Extraction and OCR: Automatically detects scanned or garbled PDFs and applies multilingual OCR supporting up to 109 languages, including Latin scripts with accents, Arabic, and Asian languages. It combines rule-based and model-driven approaches for hybrid text extraction, improving accuracy in dense or irregular layouts.
- Formula Recognition: Converts mathematical expressions to LaTeX format using UniMERNet, handling complex, multi-line, and hybrid Chinese-English formulas effectively.
- Table Parsing: Employs models like RapidTable and StructTable-InternVL2 for recognizing and converting tables to HTML, supporting rotated, borderless, and cross-page tables.
- Reading Order Sorting: Uses layoutreader to determine logical reading sequences, even in multi-column or complex layouts, ensuring natural flow.
- Semantic Cleanup: Removes headers, footers, footnotes, and page numbers to maintain content integrity.
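To see why reading order needs its own stage, consider a two-column page. The sketch below is a naive geometric heuristic, not layoutreader and not MinerU's implementation: it assigns each block to a column by its horizontal center and then sorts top to bottom within each column. The `Block` class, its coordinate fields, and `naive_reading_order` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical layout block: a label plus its bounding box in page units."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def naive_reading_order(blocks, page_width):
    """Order blocks column by column, top to bottom within each column.

    Assumes a clean two-column page; this is an illustrative heuristic only.
    """
    midline = page_width / 2

    def sort_key(block):
        center_x = (block.x0 + block.x1) / 2
        column = 0 if center_x < midline else 1
        return (column, block.y0)

    return sorted(blocks, key=sort_key)


# Blocks arrive in arbitrary detection order.
detected = [
    Block("C", 320, 100, 550, 380),  # right column, top
    Block("B", 50, 400, 280, 700),   # left column, bottom
    Block("D", 320, 400, 550, 700),  # right column, bottom
    Block("A", 50, 100, 280, 380),   # left column, top
]

order = [b.text for b in naive_reading_order(detected, page_width=600)]
# A plain top-to-bottom, left-to-right sort would interleave the two columns.
row_major = [b.text for b in sorted(detected, key=lambda b: (b.y0, b.x0))]
```

A heuristic like this breaks down on full-width headings, figures that span columns, and irregular layouts, which is the kind of case a learned reading-order model is meant to handle.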
MinerU's outputs include:
- Markdown: Multimodal NLP-friendly format with embedded LaTeX and HTML.
- JSON: Structured data such as content_list.json for reading-order sorted elements and middle.json for intermediate processing.
- Visualizations: Layout and span visualizations to verify output quality.
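As a rough illustration of how a reading-order element list might be consumed downstream, the sketch below groups elements by type and joins text per page. The sample data is invented, and the field names ("type", "text", "page_idx") are assumptions made for this example; check the actual content_list.json produced by your MinerU version before relying on them.

```python
import json
from collections import defaultdict

# Hypothetical sample standing in for a reading-order element list.
# Field names are assumptions for illustration, not a documented schema.
SAMPLE = """
[
  {"type": "text", "text": "1 Introduction", "page_idx": 0},
  {"type": "text", "text": "Document parsing is ...", "page_idx": 0},
  {"type": "equation", "text": "$$E = mc^2$$", "page_idx": 1},
  {"type": "table", "text": "<table><tr><td>a</td></tr></table>", "page_idx": 1}
]
"""


def group_by_type(elements):
    """Bucket elements by their 'type' field, preserving reading order."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element.get("type", "unknown")].append(element)
    return dict(grouped)


def text_by_page(elements):
    """Join the text elements of each page into one newline-separated string."""
    pages = defaultdict(list)
    for element in elements:
        if element.get("type") == "text":
            pages[element.get("page_idx", -1)].append(element.get("text", ""))
    return {page: "\n".join(chunks) for page, chunks in pages.items()}


elements = json.loads(SAMPLE)
grouped = group_by_type(elements)
pages = text_by_page(elements)
```

Because the list is already in reading order, simple grouping like this keeps the original sequence within each bucket, which is convenient when chunking text for a retrieval index.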
Key Features and Innovations
MinerU stands out with its dual-backend architecture:
- Pipeline Backend: Fast, hallucination-free processing using specialized models for each task. It supports CPU/GPU acceleration and is optimized for efficiency, with recent updates like PP-OCRv5 for 37+ languages and improved table merging.
- VLM Backend: Leverages multimodal vision-language models (e.g., MinerU2.5, a 1.2B parameter model outperforming 100B+ VLMs on OmniDocBench). It achieves SOTA performance in layout, text, formula, and table recognition, with backends like transformers, vLLM, LMDeploy, and MLX for Apple Silicon.
Notable enhancements include:
- Efficiency: Processes documents at high speeds (e.g., >10,000 tokens/s on NVIDIA 4090) with low memory requirements (6-8GB VRAM minimum).
- Extensibility: Configuration files for custom delimiters, model paths, and feature toggles; API and WebUI support via FastAPI and Gradio.
- Compatibility: Runs on Windows, Linux (2019+ distros), macOS; supports CUDA, MPS, and offline deployment.
- Handwriting and Vertical Text: Limited support for handwritten docs and vertical layouts.
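The throughput figure in the Efficiency item can be turned into a back-of-the-envelope time estimate. The sketch below assumes a constant 10,000 tokens/s and a hypothetical corpus size; real throughput depends on hardware, backend, and document complexity, so treat the result as a rough planning number only.

```python
def estimated_seconds(total_tokens, tokens_per_second=10_000):
    """Rough processing time at a constant throughput (an idealized assumption)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / tokens_per_second


# Hypothetical corpus: 1,000 papers at roughly 8,000 tokens each.
corpus_tokens = 1_000 * 8_000
seconds = estimated_seconds(corpus_tokens)
minutes = seconds / 60
print(f"{corpus_tokens:,} tokens: about {seconds:.0f} s ({minutes:.1f} min)")
```

Since the quoted figure is a lower bound (">10,000 tokens/s"), this estimate is an upper bound on processing time under that figure.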
Versions and Evolution
Since its initial release in July 2024, MinerU has evolved rapidly:
- v2.0+ (2025): Restructured for better usability, integrated VLM MinerU2.5 (SOTA on benchmarks), switched to vLLM for inference, and added cross-page merging.
- Recent Updates (v2.6.x): OCR speed boosts (200-300%), Chinese formula support, timeout configs, and backend optimizations like MLX for Apple devices.
The project has garnered 49,954 GitHub stars, reflecting its impact in AI data preparation.
Use Cases
MinerU is well suited to researchers processing scientific papers, developers building RAG systems, and teams handling enterprise documents. Demos are available on Hugging Face, ModelScope, and the official web app at mineru.net.
For deployment, use pip/uv installation or Docker. Detailed docs cover APIs, extensions, and troubleshooting.
MinerU continues to iterate, with future plans for chemical formula and geometric shape recognition.
