PDF Document Layout Analysis

An open-source, Docker-ready microservice by HURIDOCS for intelligent PDF document layout analysis, OCR and content extraction. It supports high-accuracy visual models (VGT) and fast LightGBM models, identifies titles, text, tables, images and formulas, provides a Gradio web UI and a full REST API, and offers translation via Ollama and OCR via Tesseract (150+ languages).

Introduction

Overview

PDF Document Layout Analysis is an open-source, Docker-powered microservice developed and maintained by HURIDOCS. It combines layout-aware machine learning models, OCR, and format conversion tools to analyze PDFs, segment and classify page elements (titles, text, tables, pictures, formulas, footnotes, headers/footers, etc.), determine reading order, and export results into JSON, Markdown, or HTML. The project exposes both a user-friendly Gradio web UI and a comprehensive REST API for automation and integration.

Key Capabilities
  • Layout segmentation and classification using two model families:
    • VGT (Vision Grid Transformer) for higher-accuracy layout analysis based on visual features.
    • LightGBM-based models for fast CPU processing and batch workloads.
  • OCR integration using Tesseract + ocrmypdf (150+ languages supported).
  • Table extraction (HTML), formula extraction (LaTeX), and caption/footnote detection.
  • Reading-order resolution and segmentation metadata (coordinates, page size, page number).
  • Format conversion endpoints: Markdown and HTML exports, with segmentation data packaged in zip files.
  • Automatic translation support using Ollama models (configurable translation model list).
  • Visual overlays and interactive analysis via Gradio UI.
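To illustrate the segmentation metadata listed above, here is a minimal sketch of how a client might hold the per-segment data (coordinates, page size, page number) and convert boxes to page-relative fractions. The field names (type, text, left, top, width, height, page_number, page_width, page_height) are assumptions made for this example, not a confirmed schema; check the JSON your own deployment returns.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical field names: the real JSON keys may differ.
    type: str
    text: str
    page_number: int
    left: float
    top: float
    width: float
    height: float
    page_width: float
    page_height: float

    def relative_box(self):
        """Return (x0, y0, x1, y1) as fractions of the page size."""
        return (
            self.left / self.page_width,
            self.top / self.page_height,
            (self.left + self.width) / self.page_width,
            (self.top + self.height) / self.page_height,
        )


def segments_by_page(raw_segments):
    """Group raw segment dicts by page, keeping the order they arrived in."""
    pages = {}
    for raw in raw_segments:
        segment = Segment(**raw)
        pages.setdefault(segment.page_number, []).append(segment)
    return pages
```

Page-relative boxes are convenient when overlays are drawn on page images rendered at a different resolution than the one the service reports.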
API & Usage
  • Runs as a service (API on port 5060 by default, Gradio UI on port 7860) and provides endpoints such as POST / (analyze), /text, /markdown, /html, /ocr, /visualize, /toc, and utility endpoints like /info.
  • Example quick commands (service running locally):
    • Analyze PDF: curl -X POST -F 'file=@document.pdf' http://localhost:5060
    • Fast analysis: add -F 'fast=true' to the same request (uses the LightGBM models)
    • Convert to Markdown with translation: POST to /markdown with target_languages and translation_model.
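The quick commands above can also be issued from Python. The sketch below uses only the standard library and builds the same multipart POST as the curl example. The form fields file and fast, the /markdown endpoint, and the parameter names target_languages and translation_model come from the text; the helper names (encode_multipart, build_request, send), the timeout, and the value format of the extra fields are assumptions for illustration.

```python
import uuid
import urllib.request

BASE_URL = "http://localhost:5060"  # default API port named in the text


def encode_multipart(fields, file_name, file_bytes):
    """Encode plain form fields plus one PDF as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
         "Content-Type: application/pdf\r\n\r\n").encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(pdf_name, pdf_bytes, endpoint="/", fast=False, extra_fields=None):
    """Build (but do not send) a POST request for the given endpoint."""
    fields = dict(extra_fields or {})
    if fast:
        fields["fast"] = "true"  # selects the LightGBM models, per the text
    body, content_type = encode_multipart(fields, pdf_name, pdf_bytes)
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the service running locally, send(build_request("document.pdf", open("document.pdf", "rb").read(), fast=True)) would mirror the fast-analysis command. For a translated Markdown export, pass endpoint="/markdown" and put target_languages and translation_model in extra_fields; the exact value format those fields expect is not stated here, so treat it as something to confirm against the project's API documentation.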
Models & Performance
  • VGT uses visual context for higher accuracy and is recommended when a GPU is available; the LightGBM models run much faster on CPU and suit large batches.
  • Uses the DocLayNet dataset for training data and ships pre-built model configurations.
Deployment & Dev
  • Fully Dockerized (Docker Compose), with optional GPU support via NVIDIA container toolkit.
  • Development helpers: make start, make stop, make install, and test commands.
  • Configurable environment variables for OCR path, models path, ports, and Ollama endpoint.
Typical Use Cases
  • Digitizing scanned documents and extracting structured content (TOC, tables, figures).
  • Converting institutional PDFs to Markdown/HTML while preserving layout and structure.
  • Building document search/indexing pipelines that require segmented content and reading order.
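As a rough sketch of the search/indexing use case, the function below turns a list of segments, assumed to be already in reading order, into title-scoped sections that could be fed to an indexer. The dictionary keys (type, text) and the type labels ("Title", "Section header", "Page header", "Page footer") are hypothetical placeholders; substitute whatever labels your deployment emits.

```python
def build_sections(
    segments,
    title_types=("Title", "Section header"),    # hypothetical labels
    skip_types=("Page header", "Page footer"),  # hypothetical labels
):
    """Group body segments under the most recent title, in reading order."""
    sections = []
    current = {"title": None, "body": []}
    for segment in segments:
        kind = segment["type"]
        if kind in skip_types:
            continue  # running headers/footers add noise to a search index
        if kind in title_types:
            # Close the previous section before starting a new one.
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": segment["text"], "body": []}
        else:
            current["body"].append(segment["text"])
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections
```

Content that appears before the first title is kept in a section whose title is None, so nothing is silently dropped.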
Integrations & Extensibility
  • Works with Hugging Face (models/artifacts) and Docker Hub images provided by HURIDOCS.
  • Translation step is pluggable through Ollama model selection.
  • Clean Architecture codebase makes it straightforward to extend model adapters, add endpoints, or swap OCR engines.
License & Community
  • Open-source project (see the repository LICENSE). Contributions, issues and PRs are welcome; the repository contains developer docs, tests and a contribution guide.

Information

  • Website: github.com
  • Authors: HURIDOCS
  • Published date: 2024/05/06
