LangExtract — detailed introduction
LangExtract is an open-source Python library designed to turn unstructured text into structured data by leveraging large language models (LLMs). It is optimized for practical extraction tasks (entities, relationships, attributes) and includes features that make the extraction pipeline reliable, auditable, and scalable.
Key features
- Precise source grounding: every extracted item is linked to its exact location in the source text, enabling highlightable visual review and easier verification.
- Robust structured outputs: supports user-defined extraction schemas and few-shot examples to enforce consistent structured outputs across documents.
- Scalable long-document processing: uses chunking, parallel processing, and multiple extraction passes to improve recall on long or noisy documents.
- Interactive visualization: generates self-contained HTML visualizations to browse thousands of extractions in-context without additional tooling.
- Flexible model support: integrates with cloud-hosted LLMs (e.g., Gemini, OpenAI) and local inference (via Ollama), and supports adding custom providers through a plugin-style provider system.
- Practical integrations: examples and helpers for saving/loading JSONL, Vertex AI batch processing, and running with local models for offline/secure use cases.
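The source-grounding feature above can be illustrated with a small, self-contained sketch. This mirrors the concept only, not the library's internal implementation; the `ground_extractions` helper and the sample note are invented for illustration:

```python
# Conceptual sketch of "source grounding": each extracted string is mapped
# back to the exact character interval where it occurs in the source text,
# so a reviewer can highlight it in context. Illustrative only; this is not
# LangExtract's actual code.

def ground_extractions(text: str, extraction_texts: list[str]) -> list[dict]:
    """Attach (start, end) character offsets to each extracted string."""
    grounded = []
    cursor = 0  # scan forward so repeated strings map to later occurrences
    for span in extraction_texts:
        start = text.find(span, cursor)
        if start == -1:
            start = text.find(span)  # fall back to a global search
        if start == -1:
            grounded.append({"text": span, "start": None, "end": None})
            continue
        end = start + len(span)
        grounded.append({"text": span, "start": start, "end": end})
        cursor = end
    return grounded

note = "Patient takes 250 mg amoxicillin twice daily."
spans = ground_extractions(note, ["amoxicillin", "250 mg", "twice daily"])
# Each span can now be highlighted: note[s["start"]:s["end"]] == s["text"]
```

Recording offsets rather than only the extracted strings is what makes visual review and verification possible on long documents.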
Typical uses
LangExtract is suitable for tasks such as:
- Structuring clinical notes (medication extraction, dosages, relations) — with caveats about medical use.
- Converting reports (e.g., radiology, legal, financial) into structured records.
- Extracting entities, sentiments, relationships, and other attributes from novels, transcripts, or long-form documents.
Usage & extensibility
- Quick start: the library exposes a simple `lx.extract(...)` API where you pass the input text (or a URL), a prompt description, and a few example extractions.
- Model selection: works with cloud APIs (requires an API key) and local models via Ollama. It includes guidance for OpenAI/Gemini and configuration examples for Vertex AI batch processing.
- Extensibility: add custom model providers via a lightweight provider registry; plugin packages can register new providers without changing core code.
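Provider plugins are typically discovered through Python packaging entry points. A sketch of a hypothetical plugin package's metadata, where the entry-point group name, package name, and class path are all illustrative assumptions rather than confirmed details of the registry:

```toml
# pyproject.toml for a hypothetical plugin package "langextract-myprovider".
# The entry-point group and provider class path are illustrative assumptions.
[project]
name = "langextract-myprovider"
version = "0.1.0"
dependencies = ["langextract"]

[project.entry-points."langextract.providers"]
myprovider = "langextract_myprovider:MyProvider"
```

Registering through entry points lets a plugin package add a provider at install time, with no changes to the core library.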
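A minimal quick-start sketch following the library's documented pattern: a prompt description plus a few-shot example guides the model toward a consistent schema. The model id, prompt text, and example data here are illustrative placeholders, and a configured cloud API key is assumed:

```python
import textwrap

import langextract as lx

# Prompt and few-shot example are placeholders; adapt to your own task.
prompt = textwrap.dedent("""\
    Extract medication names and dosages in order of appearance.
    Use exact text from the input; do not paraphrase.""")

examples = [
    lx.data.ExampleData(
        text="Patient was given 250 mg amoxicillin twice daily.",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="amoxicillin",
                attributes={"dosage": "250 mg", "frequency": "twice daily"},
            ),
        ],
    )
]

result = lx.extract(
    text_or_documents="Ibuprofen 400 mg was prescribed for pain.",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # requires a configured API key
)
# Each item in result.extractions carries its class, text, attributes, and
# character offsets into the source, enabling grounded review.
```

The few-shot examples do double duty: they define the output schema and anchor the model's behavior, which is what makes outputs consistent across documents.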
Installation & license
- Install via PyPI with `pip install langextract`, or install from source for development.
- Released under the Apache 2.0 License; the repository includes contribution guidelines, tests, and CI.
Notes & caveats
- The quality and safety of inferred attributes depend on the chosen LLM, prompt clarity, and examples.
- Demonstrations (e.g., medical extraction) are illustrative and not intended as clinical advice.
(Repository created: 2025-07-08.)
