Chandra

Chandra is an open-source OCR model and toolkit from Datalab for complex documents — handwriting, tables, math, and messy forms. It supports local HuggingFace inference or a vLLM server for production, returns layout-aware outputs (Markdown/HTML/JSON with bounding boxes), and provides a hosted API and playground. The code ships under Apache-2.0; the model weights use a license based on OpenRAIL-M.

Overview

Chandra is an OCR-focused model and command-line/tooling suite developed by Datalab (datalab-to) that targets difficult document intelligence tasks: cursive and messy handwriting, multi-row/merged-cell tables, inline and block mathematical expressions, complex forms with checkboxes and fields, and multi-column layouts such as newspapers and textbooks.

Key features
  • Handwriting recognition: designed to read cursive and messy print that traditional OCR systems struggle with.
  • Table reconstruction: recovers table structure including merged cells (colspan/rowspan) and outputs structured table representations.
  • Math support: detects and renders inline and block equations as LaTeX.
  • Form understanding: extracts fields, checkboxes, radio buttons, and associated values.
  • Layout-aware outputs: every text block, table and image includes bounding-box coordinates so downstream systems can preserve document layout.
  • Multiple output formats: Markdown, HTML (with bounding boxes), and JSON containing full layout metadata (see the consumption sketch after this list).
  • Multilingual: supports 40+ languages.
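
To make the layout-aware output concrete, here is a minimal sketch of consuming such blocks in Python. The field names ("type", "bbox", "content") are hypothetical illustrations, not Chandra's documented schema; consult the JSON the tool actually emits before relying on them.

# Hypothetical consumer of layout-aware OCR output.
# Field names below are illustrative only.
blocks = [
    {"type": "text", "bbox": [50, 120, 550, 160], "content": "Quarterly results"},
    {"type": "table", "bbox": [50, 200, 550, 400], "content": "<table>...</table>"},
]

# Restore a simple reading order: top-to-bottom, then left-to-right,
# assuming [x0, y0, x1, y1] boxes.
for block in sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0])):
    print(block["type"], block["bbox"])

# Route table blocks to structured downstream processing.
tables = [b for b in blocks if b["type"] == "table"]
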
Inference & deployment

Chandra supports two inference modes:

  • Local inference via HuggingFace Transformers for development and experimentation.
  • A vLLM server mode for optimized production throughput and batch processing.

A lightweight vLLM-based server container and CLI (chandra_vllm, chandra) are provided to simplify deployment and scaling. Environment variables (e.g., VLLM_API_BASE, VLLM_MODEL_NAME, VLLM_GPUS) configure production behavior.
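
For illustration, these variables might be set as follows before pointing the CLI at a locally running vLLM server; the values shown are placeholders, not documented defaults:

# Illustrative configuration for vLLM-backed inference (placeholder values)
export VLLM_API_BASE=http://localhost:8000/v1
export VLLM_MODEL_NAME=datalab-to/chandra
export VLLM_GPUS=0
chandra input.pdf ./output --method vllm
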

Hosted API & Playground

Datalab offers a hosted API and a free playground at the project homepage for users who prefer a hosted endpoint instead of running models locally. The hosted service includes additional accuracy improvements and a commercial pricing/licensing page for larger-scale or competitive commercial uses.

Quick start & examples

Install via pip:

pip install chandra-ocr

CLI examples:

# Run with vLLM server
chandra input.pdf ./output --method vllm
 
# Local HF inference
chandra ./documents ./output --method hf

Python snippet:

from chandra.model import InferenceManager
from chandra.input import load_pdf_images

# Use local HuggingFace inference; pass method="vllm" to target a vLLM server instead
manager = InferenceManager(method="hf")

# Render the PDF's pages to images for the model
images = load_pdf_images("document.pdf")

# Run OCR over the page images
results = manager.generate(images)
print(results[0].markdown)
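
Continuing from this snippet, a minimal sketch for persisting the output, assuming one result per input page image and using only the .markdown attribute shown above:

from pathlib import Path

output_dir = Path("./output")
output_dir.mkdir(exist_ok=True)

# Write one Markdown file per page (assumes results align one-to-one with pages)
for page_number, result in enumerate(results, start=1):
    (output_dir / f"page_{page_number}.md").write_text(result.markdown)
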
Benchmarks & examples

The repository includes benchmark results (e.g., olmocr bench) and many visual examples demonstrating handwriting, tables, math, newspapers, forms, and textbooks. Example images and sample outputs are bundled in the repo under assets/examples.

Licensing & commercial usage
  • Code: Apache-2.0.
  • Model weights: a modified OpenRAIL-M license — free for research, personal use, and startups under stated limits; usage that competes with Datalab's hosted API is restricted. Refer to the project's pricing/licensing page for commercial terms.

Integration & ecosystem

Chandra leverages HuggingFace Transformers and vLLM and can be integrated into document pipelines that require structured extraction (RAG, data ingestion, indexing, analytics). It can export visual/structured outputs for downstream NLP, search, or data extraction tasks.
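
As one example of such an integration, the sketch below splits Chandra's Markdown output into heading-delimited chunks for a retrieval index. The chunking strategy is illustrative and independent of Chandra itself; only the .markdown attribute from the quick-start snippet is assumed.

def markdown_chunks(markdown: str):
    """Split Markdown into chunks at headings, for indexing or RAG retrieval."""
    chunk = []
    for line in markdown.splitlines():
        if line.startswith("#") and chunk:
            yield "\n".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "\n".join(chunk)

# Feed Chandra's Markdown (e.g., results[0].markdown from the quick start)
# into an index, one chunk per heading-delimited section.
for chunk in markdown_chunks("# Title\nBody text\n## Section\nMore text"):
    print(repr(chunk))
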

Who should use it

Chandra is aimed at developers and organizations that need high-fidelity OCR for challenging document types: finance (10-Ks, invoices), legal contracts, medical notes, educational materials (homework, worksheets), and archival digitization projects.

Additional notes

For production users, Datalab provides a hosted API and configuration options to tune throughput (batch-size, max-workers, GPU selection) as well as options to include/exclude images or headers/footers in extractions.

Information

  • Website: github.com
  • Authors: Datalab (datalab-to)
  • Published date: 2025/10/08
