PDF Document Layout Analysis
Overview
PDF Document Layout Analysis is an open-source, Docker-powered microservice developed and maintained by HURIDOCS. It combines layout-aware machine learning models, OCR, and format conversion tools to analyze PDFs, segment and classify page elements (titles, text, tables, pictures, formulas, footnotes, headers/footers, etc.), determine reading order, and export results into JSON, Markdown, or HTML. The project exposes both a user-friendly Gradio web UI and a comprehensive REST API for automation and integration.
Key Capabilities
- Layout segmentation and classification using two model families:
  - VGT (Vision Grid Transformer) for high visual-accuracy layout understanding.
  - LightGBM-based models for fast CPU processing and batch workloads.
- OCR integration using Tesseract + ocrmypdf (150+ languages supported).
- Table extraction (HTML), formula extraction (LaTeX), and caption/footnote detection.
- Reading-order resolution and segmentation metadata (coordinates, page size, page number).
- Format conversion endpoints: Markdown and HTML exports, with segmentation data packaged in zip files.
- Automatic translation support using Ollama models (configurable translation model list).
- Visual overlays and interactive analysis via Gradio UI.
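To make the segmentation metadata above (coordinates, page size, page number) more concrete, here is a minimal sketch of how a client might hold one segment and convert its box to page-relative coordinates. The field names (`left`, `top`, `width`, `height`, `page_width`, `page_height`, `type`) and the type labels are assumptions for illustration, not a confirmed schema; check the service's actual JSON output before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One page element. Field names are assumed, not taken from the real schema."""
    left: float
    top: float
    width: float
    height: float
    page_number: int
    page_width: float
    page_height: float
    text: str
    type: str


def normalized_box(seg: Segment) -> tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) as fractions of the page width and height."""
    x0 = seg.left / seg.page_width
    y0 = seg.top / seg.page_height
    x1 = (seg.left + seg.width) / seg.page_width
    y1 = (seg.top + seg.height) / seg.page_height
    return (x0, y0, x1, y1)


def segments_of_type(segments: list[Segment], wanted: str) -> list[Segment]:
    """Keep only segments of one type, preserving the incoming (reading) order."""
    return [s for s in segments if s.type == wanted]


# Hypothetical sample data, shaped like the assumed schema above.
sample = [
    Segment(50, 40, 500, 30, 1, 600, 800, "Annual Report", "Title"),
    Segment(50, 100, 500, 200, 1, 600, 800, "Body paragraph.", "Text"),
]
print(normalized_box(sample[0]))
print([s.text for s in segments_of_type(sample, "Text")])
```

Page-relative boxes are convenient when overlays are drawn on page images rendered at a different resolution than the PDF's native page size.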
API & Usage
- Runs as a service (default API port 5060, UI at 7860) and provides endpoints such as `/` (POST, analyze), `/text`, `/markdown`, `/html`, `/ocr`, `/visualize`, `/toc`, and utility endpoints like `/info`.
- Example quick commands (service running locally):
  - Analyze PDF: `curl -X POST -F 'file=@document.pdf' http://localhost:5060`
  - Fast analysis: `-F 'fast=true'` (uses LightGBM)
  - Convert to Markdown with translation: POST to `/markdown` with `target_languages` and `translation_model`.
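The same calls can be made from Python. The sketch below uses only the standard library and mirrors the curl examples: a multipart POST with the PDF under the `file` field, plus optional form fields such as `fast`, `target_languages`, and `translation_model`. The helper names (`encode_multipart`, `build_request`, `analyze`) are hypothetical, and the assumption that the analyze endpoint answers with a JSON body should be verified against a running instance.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(fields: dict[str, str], filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode form fields plus one PDF (field name 'file') as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode() + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(base_url: str, endpoint: str, pdf_name: str,
                  pdf_bytes: bytes, **fields: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for one of the service endpoints."""
    body, content_type = encode_multipart(fields, pdf_name, pdf_bytes)
    url = base_url.rstrip("/") + endpoint
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )


def analyze(pdf_path: str, fast: bool = False,
            base_url: str = "http://localhost:5060"):
    """POST a PDF to the root endpoint; assumes the response body is JSON."""
    with open(pdf_path, "rb") as handle:
        pdf_bytes = handle.read()
    fields = {"fast": "true"} if fast else {}
    request = build_request(base_url, "/", os.path.basename(pdf_path), pdf_bytes, **fields)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

A Markdown conversion with translation would be built the same way, for example `build_request("http://localhost:5060", "/markdown", "document.pdf", pdf_bytes, target_languages="French", translation_model="example-model")`. Both field values here are placeholders; how multiple languages are encoded and which model names are accepted depend on the service configuration.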
Models & Performance
- VGT provides strong visual-context performance (recommended when a GPU is available); LightGBM gives much faster CPU throughput for large batches.
- Uses DocLayNet for training data and ships pre-built model configurations.
Deployment & Dev
- Fully Dockerized (Docker Compose), with optional GPU support via NVIDIA container toolkit.
- Development helpers: `make start`, `make stop`, `make install`, and test commands.
- Configurable environment variables for OCR path, models path, ports, and Ollama endpoint.
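As an illustration of environment-driven configuration, the sketch below reads the kinds of settings listed above into one typed object. Every variable name (`EXAMPLE_API_PORT`, `EXAMPLE_MODELS_PATH`, and so on) is hypothetical and chosen only for this example; the project's real variable names are defined in its own configuration files. The port defaults come from the text above, and the Ollama URL default is Ollama's usual local address, which may not match this service's default.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServiceSettings:
    api_port: int
    ui_port: int
    models_path: Optional[str]
    ocr_path: Optional[str]
    ollama_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> ServiceSettings:
    """Read settings from an environment mapping (defaults to os.environ).

    All variable names below are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    return ServiceSettings(
        api_port=int(env.get("EXAMPLE_API_PORT", "5060")),  # default API port
        ui_port=int(env.get("EXAMPLE_UI_PORT", "7860")),    # default UI port
        models_path=env.get("EXAMPLE_MODELS_PATH"),         # None when unset
        ocr_path=env.get("EXAMPLE_OCR_PATH"),               # None when unset
        ollama_url=env.get("EXAMPLE_OLLAMA_URL", "http://localhost:11434"),
    )


settings = load_settings({"EXAMPLE_API_PORT": "8080"})
print(settings)
```

Passing the mapping explicitly keeps the loader easy to test without touching the real process environment.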
Typical Use Cases
- Digitizing scanned documents and extracting structured content (TOC, tables, figures).
- Converting institutional PDFs to Markdown/HTML while preserving layout and structure.
- Building document search/indexing pipelines that require segmented content and reading order.
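For the search/indexing use case, one plausible preprocessing step is to fold reading-ordered segments into sections keyed by their nearest preceding heading. The sketch below assumes each segment is a dict with `type`, `text`, and `page_number` keys and that headings carry the labels `"Title"` or `"Section header"`; both are assumptions, so the heading labels are left as a parameter.

```python
def group_by_heading(segments, heading_types=("Title", "Section header")):
    """Group reading-ordered segments into sections under the latest heading.

    Text that appears before any heading is kept in a section whose
    heading is None. Key names and type labels are assumed, not confirmed.
    """
    sections = []
    current = {"heading": None, "page": None, "body": []}
    for seg in segments:
        if seg["type"] in heading_types:
            # Close the running section if it holds anything.
            if current["heading"] is not None or current["body"]:
                sections.append(current)
            current = {"heading": seg["text"], "page": seg["page_number"], "body": []}
        else:
            if current["page"] is None:
                current["page"] = seg["page_number"]
            current["body"].append(seg["text"])
    if current["heading"] is not None or current["body"]:
        sections.append(current)
    return [
        {"heading": s["heading"], "page": s["page"], "text": " ".join(s["body"])}
        for s in sections
    ]


# Hypothetical segments, already in reading order.
example = [
    {"type": "Text", "text": "Preface.", "page_number": 1},
    {"type": "Title", "text": "Intro", "page_number": 1},
    {"type": "Text", "text": "A.", "page_number": 1},
    {"type": "Text", "text": "B.", "page_number": 2},
    {"type": "Section header", "text": "Methods", "page_number": 2},
]
for record in group_by_heading(example):
    print(record)
```

Each resulting record can be indexed as one document, with the heading and starting page stored as metadata for result display.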
Integrations & Extensibility
- Works with Hugging Face (models/artifacts) and Docker Hub images provided by HURIDOCS.
- Translation step is pluggable through Ollama model selection.
- Clean Architecture codebase makes it straightforward to extend model adapters, add endpoints, or swap OCR engines.
License & Community
- Open-source project (see the repository LICENSE). Contributions, issues, and PRs are welcome; the repository contains developer docs, tests, and a contribution guide.
