Dolphin — Document Image Parsing via Heterogeneous Anchor Prompting
Dolphin is an open-source project from ByteDance, with accompanying research, that tackles document image parsing across diverse document types. It centers on a document-type-aware, two-stage architecture and a novel heterogeneous anchor prompting mechanism for robustly parsing complex page content.
Core ideas
- Two-stage, document-type-aware pipeline:
  - Document type classification (digital vs. photographed) plus layout analysis with reading-order prediction.
  - Hybrid parsing: photographed documents are parsed holistically, while digital documents are parsed element-wise in parallel to exploit their structural regularities.
- Heterogeneous Anchor Prompting: anchor prompts tailored to each element category (text paragraphs, tables, formulas, figures, code blocks) guide a single VLM-based parser to produce structured outputs efficiently (see the sketch after this list).
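To make the two-stage flow concrete, here is a minimal Python sketch of the control flow. Every name in it (classify_document, analyze_layout, parse_element, the prompt strings) is hypothetical scaffolding illustrating the analyze-then-parse idea, not the repository's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical category-specific anchor prompts; the real prompt wording
# lives in the Dolphin repository/model, not here.
ANCHOR_PROMPTS = {
    "text": "Read the text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
    "code": "Read the code block in the image.",
}

def classify_document(page_image):
    """Stub: stage-1 document-type classification (digital vs. photographed)."""
    return "digital"

def analyze_layout(page_image):
    """Stub: stage-1 layout analysis returning elements in reading order."""
    return [
        {"category": "text", "bbox": (0, 0, 800, 120)},
        {"category": "table", "bbox": (0, 140, 800, 400)},
    ]

def parse_element(page_image, element):
    """Stub: stage-2 parse of one element with its category's anchor prompt."""
    prompt = ANCHOR_PROMPTS.get(element["category"], ANCHOR_PROMPTS["text"])
    return {"category": element["category"], "prompt": prompt, "content": "..."}

def parse_page(page_image):
    if classify_document(page_image) == "photographed":
        # Holistic parsing: one pass over the whole page.
        return parse_element(page_image, {"category": "text", "bbox": None})
    # Digital documents: decode elements in parallel, keeping reading order.
    elements = analyze_layout(page_image)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda e: parse_element(page_image, e), elements))

print(parse_page(page_image=None))
```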
Key features
- Supports page-level parsing (structured JSON/Markdown for full pages) and element-level parsing (tables, formulas, text, code); an illustrative output shape is sketched after this list.
- Dolphin-v2: a larger 3B-parameter model with expanded element detection (21 element types), attribute extraction, dedicated formula/code parsing, and stronger photographed-document handling.
- Efficiency: designed for lightweight inference with parallel element decoding; deployment support includes vLLM and TensorRT-LLM for accelerated serving.
- Practical tooling: demo scripts for layout, page, and element parsing; instructions to fetch pretrained weights from Hugging Face; multi-page PDF parsing support.
- Open-source license: MIT. Includes BibTeX citation for academic use.
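To make the output format concrete, a page-level result might look roughly like the following; the field names here are invented for illustration and will not match the repository's actual schema.

```python
import json

# Hypothetical page-level parse result; field names are illustrative only.
page_result = {
    "elements": [
        {"category": "title", "reading_order": 0, "bbox": [64, 48, 1136, 112],
         "content": "Dolphin: Document Image Parsing"},
        {"category": "table", "reading_order": 1, "bbox": [64, 140, 1136, 420],
         "content": "<table><tr><td>...</td></tr></table>"},
        {"category": "formula", "reading_order": 2, "bbox": [64, 450, 600, 520],
         "content": "E = mc^2"},
    ]
}
print(json.dumps(page_result, indent=2))
```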
Performance & releases (high level)
- The project reports substantial improvements across page- and element-level metrics on OmniDocBench (v1.5) and related benchmarks, with Dolphin-v2 reported to outperform earlier Dolphin versions on overall parsing scores.
- Changelog highlights: model/code release and demo in May 2025, vLLM and TensorRT-LLM support added June 2025, and Dolphin-v2 released in December 2025.
Usage & integration
- Clone the repo, install requirements, and download pretrained weights (Hugging Face model card provided).
- Example scripts: demo_page.py, demo_layout.py, demo_element.py for different parsing granularities; CLI examples are provided in the repository.
- Deployment: example integrations with vLLM and TensorRT-LLM for production inference speedups; a minimal loading sketch follows this list.
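As a starting point for programmatic use, the following is a minimal sketch assuming the published checkpoint follows a Donut-style VisionEncoderDecoderModel interface under the model ID ByteDance/Dolphin; the image path and prompt string are placeholders, so consult the Hugging Face model card and the repository's demo scripts for the exact invocation.

```python
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

# Assumption: the checkpoint is published as "ByteDance/Dolphin" and follows
# a Donut-style VisionEncoderDecoderModel interface; verify on the model card.
MODEL_ID = "ByteDance/Dolphin"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
model.eval()

image = Image.open("demo/page_1.jpeg").convert("RGB")  # hypothetical path
pixel_values = processor(image, return_tensors="pt").pixel_values

# Placeholder prompt; Dolphin's real anchor prompts are defined in the repo.
prompt = "Parse the reading order of this document."
prompt_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values=pixel_values,
    decoder_input_ids=prompt_ids,
    max_new_tokens=1024,
)
print(processor.tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For production workloads, the repository's vLLM and TensorRT-LLM integrations replace this plain generate() call; see the deployment instructions in the repository.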
Target applications
- Document OCR and understanding, automated information extraction from reports/forms/research papers, table and formula parsing, digital archive structuring, and any downstream NLP/knowledge extraction pipelines that require structured representations of document pages.
Notes
- The repository bundles code, pretrained models (Hugging Face), demo data and README guides. Community contributions such as edge-case reports are invited via issues.
(Information summarized from the project's GitHub README and changelog maintained in the repository.)
