
olmOCR

olmOCR is an open-source toolkit from the Allen Institute for AI (AI2) / AllenNLP team for converting image-based documents (PDF, PNG, JPEG) into clean, readable plain text or Markdown. It uses a 7B-parameter vision-language model to handle complex layouts, equations, tables and handwriting, removes headers/footers, and outputs text in natural reading order. The repo includes a processing pipeline, benchmark suite (olmOCR-Bench), training and RL components, Docker images, and an online demo. Licensed under Apache 2.0.

Introduction

Overview

olmOCR is an open-source document OCR/linearization toolkit developed and maintained by the AllenNLP team at the Allen Institute for AI (AI2). Its goal is to convert PDFs and other image-based documents into clean, natural-reading plain text or Markdown while preserving structure (sections, tables, equations, multi-column flow) and removing irrelevant elements like headers and footers. The system is powered by a 7B-parameter vision-language model and a full pipeline for rendering, inference, post-processing, benchmarking, and training.

Key features
  • Convert PDF, PNG, JPEG documents into clean Markdown/plain text.
  • Support for complex content types: equations, tables, handwriting, figures, multi-column layouts and insets.
  • Automatic header/footer removal and natural reading-order reconstruction even for complex layouts.
  • Includes olmOCR-Bench: a comprehensive benchmark suite (≈7,000 test cases across ~1,400 documents) to measure OCR performance across categories (arXiv papers, old scans, tables, headers/footers, multi-column, long/tiny text, etc.).
  • Training and fine-tuning utilities: SFT finetuning code, synthetic-data generation, and an RL training script used in later model improvements.
  • Containerized/docker support and examples for running at scale (including multi-node/cluster workflows with S3 and Beaker coordination).
  • Verified compatibility with several external inference providers and their hosted model names (usage and cost examples are provided in the repo).
  • Licensed under Apache 2.0.
Notable model releases and benchmark highlights
  • The project ships model releases referenced as olmOCR-7B / olmOCR-2-7B variants (FP8 variants are distributed on Hugging Face). The repo's release history records performance and robustness improvements across 2025, for example v0.4.0 (Oct 21, 2025), which improved olmOCR-Bench scores and introduced RL training.
  • The benchmark table in the repo shows competitive overall performance for olmOCR v0.4.0 (overall ~82.4 on their olmOCR-Bench), with particularly strong results on headers/footers and multi-column layouts.
Installation & requirements (summary)
  • Requires a recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 12 GB VRAM for model inference.
  • ~30 GB free disk space recommended.
  • poppler-utils and additional fonts needed for PDF rendering.
  • A clean conda-based Python environment is recommended; example commands are provided in the README (pip extras for CPU-only olmocr[bench] vs GPU olmocr[gpu]).
  • Docker images are provided (including a large image containing the model ~30GB and a base image without the model).
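The setup steps above can be sketched as a short shell session. This is a minimal sketch assuming a Debian/Ubuntu host and the pip extras named in the README; adjust package manager and Python version to your system.

```shell
# System dependency for PDF rendering (Debian/Ubuntu)
sudo apt-get update && sudo apt-get install -y poppler-utils

# Clean conda environment as recommended in the README
conda create -n olmocr python=3.11 -y
conda activate olmocr

# GPU install for local inference; use olmocr[bench] for CPU-only benchmarking
pip install "olmocr[gpu]"
```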
Example usage
  • Quick test: try the online demo at the official site.
  • Convert a single PDF to markdown locally (requires GPU for inference):
curl -o olmocr-sample.pdf https://olmocr.allenai.org/papers/olmocr_3pg_sample.pdf
python -m olmocr.pipeline ./localworkspace --markdown --pdfs olmocr-sample.pdf
  • The pipeline supports external vLLM/OpenAI-compatible endpoints so you can point the pipeline at hosted inference providers instead of running locally.
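After a run, the workspace contains the extracted documents. As a minimal sketch of consuming those results, the helper below assumes the pipeline wrote JSON Lines files with a `text` field per document under a `results/` subdirectory of the workspace (adjust the glob to your actual layout and output flags).

```python
import glob
import json
import os


def load_texts(workspace: str) -> list[str]:
    """Collect extracted document text from a pipeline workspace.

    Assumes results are JSON Lines files, one record per document,
    each carrying the linearized text in a 'text' field.
    """
    texts = []
    pattern = os.path.join(workspace, "results", "*.jsonl")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines defensively
                record = json.loads(line)
                texts.append(record.get("text", ""))
    return texts
```

With `--markdown`, per-document Markdown files are also emitted, so you can read those directly instead of parsing JSONL.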
Engineering & research components
  • The repo provides modular code useful for other projects: prompting strategies for text linearization, filtering and SEO detection, synthetic data mining, trainer code (SFT and RL), and a robust pipeline (pipeline.py) for processing at scale.
  • The project is accompanied by at least two arXiv technical reports (v1 and v2) describing methods and RL-based improvements.
Team & license
  • Maintained by the AllenNLP team at the Allen Institute for AI (AI2). Community contributors are tracked in the GitHub contributors graph.
  • Distributed under the Apache 2.0 license.
Good fits & limitations
  • Good for researchers and engineers who need high-quality OCR/linearization for complex documents (scientific papers, scanned books, forms with equations and tables).
  • Requires GPU resources and some setup (fonts, poppler) for best results; larger-scale processing workflows are supported via Docker/S3/Beaker.
Where to find more
  • Primary code and issue tracker: the project's GitHub repository.
  • Demo: the official demo site shown as the project homepage.
  • Benchmarks, usage examples, and provider integration examples are included in the repository README and subfolders.

Information

  • Website: github.com
  • Authors: Allen Institute for AI (AI2), AllenNLP team
  • Published date: 2024/09/17
