PDF Craft: Advanced PDF to Markdown/EPUB Converter
Overview
PDF Craft is a lightweight library that converts PDF documents, particularly scanned books, into structured formats such as Markdown and EPUB. Built on DeepSeek OCR, it recognizes and extracts complex content, including text, tables, mathematical formulas, and images. It is especially useful for researchers, students, and professionals working with academic or technical PDFs, because it preserves footnotes, inline images, and document hierarchy without relying on cloud services.
Starting from version 1.0.0, PDF Craft has shifted to a fully local processing model, eliminating dependencies on large language models (LLMs) for text correction. This results in faster execution times, reduced latency from network requests, and improved reliability. The conversion process occurs entirely on the user's machine, supporting GPU acceleration for even greater efficiency. For users needing LLM-based corrections, the legacy version (v0.2.8) remains available.
Key Features
- Accurate OCR Recognition: Utilizes DeepSeek OCR models (from tiny to gundam sizes) to handle scanned documents with high precision, capturing tables, formulas, and multilingual text.
- Structure-Aware Processing: Automatically identifies and extracts main body text while filtering out noise like headers, footers, and watermarks. It also manages footnotes and inline elements seamlessly.
- Multi-Format Output: Supports conversion to Markdown (with asset folders for images) and EPUB (with customizable metadata, covers, and auto-generated TOC).
- Rendering Options: Flexible handling of tables (HTML or image clipping) and formulas (MathML, SVG, or clipping), plus support for preserving inline LaTeX in EPUBs.
- Offline and Local-First: Models are cached locally via Hugging Face, with options for pre-downloading and strict offline mode to ensure privacy and speed.
- Error Resilience: Configurable to ignore PDF rendering errors, inserting placeholders for problematic pages instead of halting the process.
- Customization: Extensive parameters for model sizes, cache paths, temporary folders, and even custom PDF handlers (e.g., specifying Poppler paths).
Installation and Setup
To get started, install via pip after setting up prerequisites like Poppler for PDF parsing and CUDA for OCR (if using GPU):
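```
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pdf-craft
```
The first command installs the CPU-only PyTorch build; for GPU acceleration, use the CUDA-specific index URL from the PyTorch site instead. Detailed setup, including Poppler integration and CUDA configuration, is covered in the Installation Guide. For production, pre-download models using predownload_models() to avoid runtime delays.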
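Before running a conversion, it can help to confirm that the prerequisites are visible to Python. The following is a minimal sketch, not part of PDF Craft: check_environment is a hypothetical helper name, and it assumes Poppler is installed on the PATH (detected through its pdftoppm tool). If you point PDF Craft at a custom Poppler location instead, a "missing" result here does not necessarily indicate a problem.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report which prerequisites appear to be available on this machine."""
    report = {
        # pdftoppm is one of the command-line tools shipped with Poppler.
        "poppler": shutil.which("pdftoppm") is not None,
        "torch": importlib.util.find_spec("torch") is not None,
        "cuda": False,
    }
    if report["torch"]:
        try:
            # Imported lazily so the check still runs without PyTorch.
            import torch

            report["cuda"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is treated as "no CUDA".
            report["cuda"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```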
Usage Examples
Markdown Conversion
```python
from pdf_craft import transform_markdown

transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    markdown_assets_path="images",
    ocr_size="gundam",  # Largest model for best quality
    includes_footnotes=True,
    generate_plot=False,  # Optional visualization
)
```
This extracts text, embeds images in an assets folder, and structures the Markdown with proper headings and lists.
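To convert a whole folder of PDFs, the same call can be wrapped in a loop. The sketch below is one possible arrangement rather than a built-in feature: plan_batch and run_batch are hypothetical helper names, the one-folder-per-book layout is an assumption, and the keyword arguments are the ones from the example above. How a relative markdown_assets_path is resolved is not stated here, so verify the image links in the first output before processing a large batch.

```python
from pathlib import Path


def plan_batch(input_dir: str, output_dir: str) -> list[dict]:
    """Build one set of transform_markdown keyword arguments per PDF."""
    jobs = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        book_dir = Path(output_dir) / pdf.stem  # one output folder per book
        jobs.append({
            "pdf_path": str(pdf),
            "markdown_path": str(book_dir / "output.md"),
            "markdown_assets_path": "images",
            "ocr_size": "gundam",
            "includes_footnotes": True,
        })
    return jobs


def run_batch(input_dir: str, output_dir: str) -> None:
    """Convert every planned PDF; requires pdf-craft to be installed."""
    from pdf_craft import transform_markdown  # imported lazily

    for job in plan_batch(input_dir, output_dir):
        Path(job["markdown_path"]).parent.mkdir(parents=True, exist_ok=True)
        transform_markdown(**job)
```

Keeping the planning step separate from the conversion step makes it possible to inspect the job list before starting a long OCR run.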
EPUB Conversion
```python
from pdf_craft import transform_epub, BookMeta, TableRender, LaTeXRender

transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author 1", "Author 2"],
        publisher="Publisher",
        language="en",
    ),
    ocr_size="large",
    includes_cover=True,
    table_render=TableRender.HTML,
    latex_render=LaTeXRender.MATHML,
    inline_latex=True,
    local_only=True,  # Offline mode
)
```
The output EPUB is reader-ready, with embedded assets, semantic markup for formulas, and a navigable table of contents.
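After a conversion, a quick container-level check can catch truncated or corrupt output before the file reaches a reader. This sketch uses only the Python standard library and is independent of PDF Craft; looks_like_epub is a hypothetical helper name. It tests the basic container rules of the EPUB format (a ZIP archive whose first entry is a mimetype file, plus META-INF/container.xml) and does not validate the book's content.

```python
import zipfile


def looks_like_epub(path: str) -> bool:
    """Check basic container-level properties of an EPUB file."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        # The mimetype entry must come first in the archive.
        if not names or names[0] != "mimetype":
            return False
        # Surrounding whitespace is tolerated here for leniency.
        if archive.read("mimetype").strip() != b"application/epub+zip":
            return False
        # container.xml points readers at the package document.
        return "META-INF/container.xml" in names
```

For example, looks_like_epub("output.epub") could be called right after transform_epub returns.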
Model and Performance Management
PDF Craft supports five OCR model sizes, ranging from tiny (fastest, lowest accuracy) to gundam (slowest, highest quality). Models are downloaded automatically from Hugging Face on first use, and models_cache_path sets where they are stored. In air-gapped environments, pre-download the models and then pass local_only=True.
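The offline workflow can be sketched as a small helper that refuses to proceed when the cache is empty. This is an illustration under stated assumptions: offline_settings is a hypothetical name, the "cache directory is non-empty" test is only a rough proxy for "models are present", and it is assumed that both conversion functions accept models_cache_path and local_only. HF_HUB_OFFLINE is a Hugging Face Hub environment variable; it is generally read when the library is imported, so call the helper before importing pdf_craft.

```python
import os
from pathlib import Path


def offline_settings(cache_dir: str) -> dict:
    """Return keyword arguments for a run that should not use the network."""
    cache = Path(cache_dir).expanduser().resolve()
    # Rough proxy: an empty or missing cache cannot contain models.
    if not cache.is_dir() or not any(cache.iterdir()):
        raise FileNotFoundError(
            f"no cached models found in {cache}; pre-download them first"
        )
    # Ask the Hugging Face Hub client to stay offline as well.
    os.environ["HF_HUB_OFFLINE"] = "1"
    return {"models_cache_path": str(cache), "local_only": True}
```

The returned dictionary can then be unpacked into a conversion call, for example transform_markdown(pdf_path="input.pdf", markdown_path="output.md", **offline_settings("models")).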
For visualization, enable generate_plot=True to produce charts of the processing pipeline, aiding in debugging complex PDFs.
Integrations and Ecosystem
This tool pairs well with related projects like epub-translator, which can translate converted EPUBs while preserving layouts. An online demo is available at pdf.oomol.com for testing without installation.
Limitations and Best Practices
- Requires Poppler and optionally CUDA for optimal performance.
- For very large PDFs, allocate sufficient temporary storage via analysing_path.
- Scanned documents yield better results with higher model sizes, but balance speed vs. quality based on hardware.
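For the temporary-storage point above, a free-space check before a long run can prevent a failure partway through. This is a minimal standard-library sketch; has_free_space is a hypothetical helper name, and the space threshold is something you would need to estimate for your own documents, since no figure is given here.

```python
import shutil
from pathlib import Path


def has_free_space(directory: str, required_gib: float) -> bool:
    """Return True if the drive holding `directory` has enough free space."""
    path = Path(directory)
    # Walk up to the nearest existing ancestor so the check also works
    # for a folder that has not been created yet.
    while not path.exists() and path != path.parent:
        path = path.parent
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * 1024**3
```

A caller might run has_free_space("tmp/analysing", 20) and only then pass that folder as analysing_path.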
License and Acknowledgments
Released under the MIT License (from v1.0.0), with transitive LGPLv3 dependency via DeepSeek OCR. Credits go to DeepSeek AI and doc-page-extractor for foundational components.
PDF Craft streamlines the digitization of printed knowledge, making scanned resources accessible and editable in modern formats.
