Overview
Chandra is an OCR-focused model and command-line/tooling suite developed by Datalab (datalab-to) that targets difficult document intelligence tasks: cursive and messy handwriting, multi-row/merged-cell tables, inline and block mathematical expressions, complex forms with checkboxes and fields, and multi-column layouts such as newspapers and textbooks.
Key features
- Handwriting recognition: designed to read cursive and messy print that traditional OCR systems struggle with.
- Table reconstruction: recovers table structure including merged cells (colspan/rowspan) and outputs structured table representations.
- Math support: detects and renders inline and block equations as LaTeX.
- Form understanding: extracts fields, checkboxes, radio buttons, and associated values.
- Layout-aware outputs: every text block, table, and image includes bounding-box coordinates so downstream systems can preserve document layout (see the sketch after this list).
- Multiple output formats: Markdown, HTML (with bounding boxes), and JSON containing full layout metadata.
- Multilingual: supports 40+ languages.
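The JSON output is the natural fit when a downstream consumer needs those coordinates. The sketch below shows one way such output might be walked; the field names (blocks, bbox, type, text) and the file path are assumptions for illustration, not the documented schema.
import json

# Load a JSON result produced by Chandra (path and schema are placeholders).
with open("output/document.json", encoding="utf-8") as f:
    page = json.load(f)

# Each block is assumed to carry a type, a bounding box, and its text.
for block in page.get("blocks", []):
    x0, y0, x1, y1 = block["bbox"]
    print(f"{block['type']}: ({x0}, {y0})-({x1}, {y1}) {block['text'][:60]}")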
Inference & deployment
Chandra supports two inference modes:
- Local inference via HuggingFace Transformers for development and experimentation.
- A vLLM server mode for optimized production throughput and batch processing.
A lightweight vLLM-based server container and CLI (chandra_vllm, chandra) are provided to simplify deployment and scaling. Environment variables (e.g., VLLM_API_BASE, VLLM_MODEL_NAME, VLLM_GPUS) configure production behavior.
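As a minimal sketch of the vLLM path, assuming the Python API accepts method="vllm" to mirror the CLI's --method vllm switch and reads the same environment variables (the values below are placeholders):
import os

# Point the client at an already-running vLLM server; the variable names come
# from the project's configuration, but these specific values are placeholders.
os.environ["VLLM_API_BASE"] = "http://localhost:8000/v1"
os.environ["VLLM_MODEL_NAME"] = "chandra"
os.environ["VLLM_GPUS"] = "0"

from chandra.model import InferenceManager
from chandra.input import load_pdf_images

# method="vllm" is assumed to mirror the CLI's --method vllm option.
manager = InferenceManager(method="vllm")
results = manager.generate(load_pdf_images("document.pdf"))
print(results[0].markdown)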
Hosted API & Playground
Datalab offers a hosted API and a free playground on the project homepage for users who prefer a hosted endpoint to running models locally. The hosted service includes additional accuracy improvements; commercial terms for larger-scale or competitive use are listed on Datalab's pricing/licensing page.
Quick start & examples
Install via pip:
pip install chandra-ocr
CLI examples:
# Run with vLLM server
chandra input.pdf ./output --method vllm
# Local HF inference
chandra ./documents ./output --method hf
Python snippet:
from chandra.model import InferenceManager
from chandra.input import load_pdf_images
manager = InferenceManager(method="hf")
images = load_pdf_images("document.pdf")
results = manager.generate(images)
print(results[0].markdown)
Benchmarks & examples
The repository includes benchmark results (e.g., olmocr bench) and many visual examples demonstrating handwriting, tables, math, newspapers, forms, and textbooks. Example images and sample outputs are bundled in the repo under assets/examples.
Licensing & commercial usage
- Code: Apache-2.0.
- Model weights: a modified OpenRAIL-M license. The weights are free for research, personal use, and startups under the stated limits; use that competes with Datalab's hosted API is restricted. Refer to the project's pricing/licensing page for commercial terms.
Integration & ecosystem
Chandra builds on HuggingFace Transformers and vLLM and can be integrated into document pipelines that require structured extraction (RAG, data ingestion, indexing, analytics), exporting visual and structured outputs for downstream NLP, search, and data-extraction tasks.
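For example, a minimal ingestion step for a RAG or indexing pipeline might convert each page to Markdown and chunk it by paragraph. This assumes, as in the quick-start snippet, that generate() returns one result per page image with a markdown attribute; the chunking itself is illustrative and not part of Chandra.
from chandra.model import InferenceManager
from chandra.input import load_pdf_images

manager = InferenceManager(method="hf")
pages = manager.generate(load_pdf_images("contract.pdf"))

# Naive paragraph-level chunking; swap in your own splitter and embedder.
chunks = []
for page_num, page in enumerate(pages, start=1):
    for para in page.markdown.split("\n\n"):
        if para.strip():
            chunks.append({"page": page_num, "text": para.strip()})

print(f"{len(chunks)} chunks ready for embedding or indexing")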
Who should use it
Chandra is aimed at developers and organizations that need high-fidelity OCR for challenging document types: finance (10-Ks, invoices), legal contracts, medical notes, educational materials (homework, worksheets), and archival digitization projects.
Additional notes
For production users, Datalab provides a hosted API and configuration options to tune throughput (batch-size, max-workers, GPU selection) as well as options to include/exclude images or headers/footers in extractions.
