
olmOCR

olmOCR is an open-source toolkit from the Allen Institute for AI (AI2) / AllenNLP team for converting image-based documents (PDF, PNG, JPEG) into clean, readable plain text or Markdown. It uses a 7B-parameter vision-language model to handle complex layouts, equations, tables and handwriting, removes headers/footers, and outputs text in natural reading order. The repo includes a processing pipeline, benchmark suite (olmOCR-Bench), training and RL components, Docker images, and an online demo. Licensed under Apache 2.0.

Introduction

Overview

olmOCR is an open-source document OCR/linearization toolkit developed and maintained by the AllenNLP team at the Allen Institute for AI (AI2). Its goal is to convert PDFs and other image-based documents into clean, natural-reading plain text or Markdown while preserving structure (sections, tables, equations, multi-column flow) and removing irrelevant elements like headers and footers. The system is powered by a 7B-parameter vision-language model and a full pipeline for rendering, inference, post-processing, benchmarking, and training.

Key features
  • Convert PDF, PNG, JPEG documents into clean Markdown/plain text.
  • Support for complex content types: equations, tables, handwriting, figures, multi-column layouts and insets.
  • Automatic header/footer removal and natural reading-order reconstruction even for complex layouts.
  • Includes olmOCR-Bench: a comprehensive benchmark suite (≈7,000 test cases across ~1,400 documents) to measure OCR performance across categories (arXiv papers, old scans, tables, headers/footers, multi-column, long/tiny text, etc.).
  • Training and fine-tuning utilities: SFT finetuning code, synthetic-data generation, and an RL training script used in later model improvements.
  • Containerized/docker support and examples for running at scale (including multi-node/cluster workflows with S3 and Beaker coordination).
  • Verified compatibility with several external inference providers and their hosted model names (usage and cost examples are provided in the repo).
  • Licensed under Apache 2.0.
Notable model releases and benchmark highlights
  • The project ships model releases referenced as olmOCR-7B / olmOCR-2-7B variants (FP8 variants are distributed on Hugging Face). The repo's release history records performance and robustness improvements across 2025, for example v0.4.0 (Oct 21, 2025), which improved olmOCR-Bench scores and introduced RL training.
  • The benchmark table in the repo shows competitive overall performance for olmOCR v0.4.0 (overall ~82.4 on their olmOCR-Bench), with particularly strong results on headers/footers and multi-column layouts.
Installation & requirements (summary)
  • Requires a recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB VRAM for model inference.
  • ~30 GB free disk space recommended.
  • poppler-utils and additional fonts needed for PDF rendering.
  • A clean conda-based Python environment is recommended; example commands are provided in the README (pip extras for CPU-only olmocr[bench] vs GPU olmocr[gpu]).
  • Docker images are provided (including a large image containing the model ~30GB and a base image without the model).
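The setup steps above can be sketched as a short shell session. This is a minimal sketch assuming a Debian/Ubuntu host and the pip extras named in the README; adjust package manager and Python version to your system.

```shell
# System dependency for PDF rendering (Debian/Ubuntu)
sudo apt-get update && sudo apt-get install -y poppler-utils

# Clean conda environment as recommended in the README
conda create -n olmocr python=3.11 -y
conda activate olmocr

# GPU install for local inference; use olmocr[bench] for CPU-only benchmarking
pip install "olmocr[gpu]"
```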
Example usage
  • Quick test: try the online demo at the official site.
  • Convert a single PDF to markdown locally (requires GPU for inference):
curl -o olmocr-sample.pdf https://olmocr.allenai.org/papers/olmocr_3pg_sample.pdf
python -m olmocr.pipeline ./localworkspace --markdown --pdfs olmocr-sample.pdf
  • The pipeline supports external vLLM/OpenAI-compatible endpoints so you can point the pipeline at hosted inference providers instead of running locally.
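After a run, the workspace contains the extracted documents. As a minimal sketch of consuming those results, the helper below assumes the pipeline wrote JSON Lines files with a `text` field per document under a `results/` subdirectory of the workspace (adjust the glob to your actual layout and output flags).

```python
import glob
import json
import os


def load_texts(workspace: str) -> list[str]:
    """Collect extracted document text from a pipeline workspace.

    Assumes results are JSON Lines files, one record per document,
    each carrying the linearized text in a 'text' field.
    """
    texts = []
    pattern = os.path.join(workspace, "results", "*.jsonl")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines defensively
                record = json.loads(line)
                texts.append(record.get("text", ""))
    return texts
```

With `--markdown`, per-document Markdown files are also emitted, so you can read those directly instead of parsing JSONL.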
Engineering & research components
  • The repo provides modular code useful for other projects: prompting strategies for text linearization, filtering and SEO detection, synthetic data mining, trainer code (SFT and RL), and a robust pipeline (pipeline.py) for processing at scale.
  • The project is accompanied by at least two arXiv technical reports (v1 and v2) describing methods and RL-based improvements.
Team & license
  • Maintained by the AllenNLP team at the Allen Institute for AI (AI2). Community contributors are tracked in the GitHub contributors graph.
  • Distributed under the Apache 2.0 license.
Good fits & limitations
  • Good for researchers and engineers who need high-quality OCR/linearization for complex documents (scientific papers, scanned books, forms with equations and tables).
  • Requires GPU resources and some setup (fonts, poppler) for best results; larger-scale processing workflows are supported via Docker/S3/Beaker.
Where to find more
  • Primary code and issue tracker: the project's GitHub repository.
  • Demo: the official demo site shown as the project homepage.
  • Benchmarks, usage examples, and provider integration examples are included in the repository README and subfolders.

Information

  • Website: github.com
  • Authors: Allen Institute for AI (AI2), AllenNLP team
  • Published date: 2024/09/17
