Overview
olmOCR is an open-source document OCR/linearization toolkit developed and maintained by the AllenNLP team at the Allen Institute for AI (AI2). Its goal is to convert PDFs and other image-based documents into clean, natural-reading plain text or Markdown while preserving structure (sections, tables, equations, multi-column flow) and removing irrelevant elements like headers and footers. The system is powered by a 7B-parameter vision-language model and a full pipeline for rendering, inference, post-processing, benchmarking, and training.
Key features
- Converts PDF, PNG, and JPEG documents into clean Markdown or plain text.
- Support for complex content types: equations, tables, handwriting, figures, multi-column layouts, and insets.
- Automatic header/footer removal and natural reading-order reconstruction even for complex layouts.
- Includes olmOCR-Bench: a comprehensive benchmark suite (≈7,000 test cases across ~1,400 documents) to measure OCR performance across categories (arXiv papers, old scans, tables, headers/footers, multi-column, long/tiny text, etc.).
- Training and fine-tuning utilities: SFT finetuning code, synthetic-data generation, and an RL training script used in later model improvements.
- Containerized/docker support and examples for running at scale (including multi-node/cluster workflows with S3 and Beaker coordination).
- Verified to work with several hosted inference providers; example model names and cost estimates are given in the repo.
- Licensed under Apache 2.0.
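The reading-order reconstruction listed above can be illustrated with a toy heuristic. This is not olmOCR's actual method (olmOCR uses a vision-language model end to end); it only shows why naive top-to-bottom ordering fails on a two-column page:

```python
# Toy reading-order heuristic for a two-column page.
# Illustrative only — olmOCR's model handles layout end to end.

def reading_order(blocks, page_width):
    """Sort (x, y, text) blocks: left column first, each top to bottom."""
    mid = page_width / 2
    left = sorted((b for b in blocks if b[0] < mid), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= mid), key=lambda b: b[1])
    return [b[2] for b in left + right]

blocks = [
    (50, 100, "Left para 1"),
    (350, 100, "Right para 1"),
    (50, 300, "Left para 2"),
    (350, 300, "Right para 2"),
]
print(reading_order(blocks, page_width=600))
```

Sorting purely by y-coordinate would interleave the two columns; grouping by column first recovers the natural reading flow.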
Notable model/releases and benchmark highlights
- The project ships model releases referenced as olmOCR-7B / olmOCR-2-7B variants (FP8 variants are distributed on Hugging Face). The release history in the repo records performance and robustness improvements across 2025, for example v0.4.0 (Oct 21, 2025), which improved olmOCR-Bench scores and introduced RL training.
- The benchmark table in the repo shows competitive overall performance for olmOCR v0.4.0 (overall ~82.4 on their olmOCR-Bench), with particularly strong results on headers/footers and multi-column layouts.
Installation & requirements (summary)
- Requires a recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB VRAM for model inference.
- ~30 GB free disk space recommended.
- poppler-utils and additional fonts are needed for PDF rendering.
- A clean conda-based Python environment is recommended; example commands are provided in the README (pip extras: `olmocr[bench]` for CPU-only vs. `olmocr[gpu]` for GPU).
- Docker images are provided, including a large image containing the model (~30 GB) and a base image without the model.
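Since missing system dependencies are a common setup failure, a small preflight check can verify that poppler's tools are on PATH before running the pipeline. This is a sketch, not part of olmOCR:

```python
# Preflight sketch: check for the poppler-utils binaries that PDF
# rendering relies on. Illustrative only — not olmOCR code.
import shutil

def poppler_available():
    """Return True if the poppler-utils tools are on PATH."""
    return all(shutil.which(tool) is not None
               for tool in ("pdftoppm", "pdfinfo"))

if not poppler_available():
    print("poppler-utils not found; install it (e.g. apt-get install poppler-utils)")
```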
Example usage
- Quick test: try the online demo at the official site.
- Convert a single PDF to Markdown locally (requires a GPU for inference):

```shell
curl -o olmocr-sample.pdf https://olmocr.allenai.org/papers/olmocr_3pg_sample.pdf
python -m olmocr.pipeline ./localworkspace --markdown --pdfs olmocr-sample.pdf
```

- The pipeline supports external vLLM/OpenAI-compatible endpoints, so you can point it at hosted inference providers instead of running locally.
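Pointing the pipeline at a hosted provider works because those providers expose the standard OpenAI-compatible chat-completions interface. A sketch of the request payload such an endpoint expects is below; the model name and prompt here are hypothetical stand-ins (olmOCR's pipeline constructs its own prompt internally; consult the repo for the exact ones it sends):

```python
# Sketch of an OpenAI-compatible chat-completions payload carrying a
# rendered page image. Model name and prompt are hypothetical.
import base64
import json

def build_payload(page_png_bytes, model="olmocr-7b"):  # model name is illustrative
    image_b64 = base64.b64encode(page_png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page to Markdown."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }

payload = build_payload(b"\x89PNG placeholder")  # placeholder bytes, not a real image
print(json.dumps(payload)[:80])
```

The same payload shape works against a local vLLM server or a hosted provider; only the base URL and API key change.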
Engineering & research components
- The repo provides modular code useful for other projects: prompting strategies for text linearization, language filtering and SEO spam detection, synthetic data mining, trainer code (SFT and RL), and a robust pipeline (`pipeline.py`) for processing at scale.
- The project is accompanied by at least two arXiv technical reports (v1 and v2) describing the methods and the RL-based improvements.
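The header/footer removal mentioned among the features can be illustrated with a simple cross-page frequency heuristic. This is a sketch only; olmOCR's model learns this behavior end to end rather than applying an explicit rule:

```python
# Toy header/footer stripper: drop first/last lines that repeat
# across most pages. Illustrative only — olmOCR handles this
# inside the model, not via a rule like this.
from collections import Counter

def strip_repeated_edges(pages, threshold=0.6):
    """pages: list of pages, each a list of lines. Remove edge lines
    that appear on more than `threshold` of the pages."""
    edges = Counter()
    for lines in pages:
        if lines:
            edges[lines[0]] += 1
            edges[lines[-1]] += 1
    common = {line for line, n in edges.items() if n / len(pages) > threshold}
    return [[ln for ln in lines if ln not in common] for lines in pages]

pages = [
    ["ACME Corp", "Intro text", "Page 1"],
    ["ACME Corp", "More text", "Page 2"],
    ["ACME Corp", "Conclusion", "Page 3"],
]
print(strip_repeated_edges(pages))
```

Note that the varying page numbers survive this naive rule; a real system would also normalize digits before comparing, which is one reason a learned approach outperforms hand-written heuristics here.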
Team & license
- Maintained by the AllenNLP team at the Allen Institute for AI (AI2). Community contributors are tracked in the GitHub contributors graph.
- Distributed under the Apache 2.0 license.
Good fits & limitations
- Good for researchers and engineers who need high-quality OCR/linearization for complex documents (scientific papers, scanned books, forms with equations and tables).
- Requires GPU resources and some setup (fonts, poppler) for best results; larger-scale processing workflows are supported via Docker/S3/Beaker.
Where to find more
- Primary code and issue tracker: the project's GitHub repository.
- Demo: the official demo site shown as the project homepage.
- Benchmarks, usage examples, and provider integration examples are included in the repository README and subfolders.
