Overview
PageIndex is an open-source project by VectifyAI that proposes a vectorless, reasoning-native approach to Retrieval-Augmented Generation (RAG) for long professional documents. Instead of relying on vector similarity and chunking, PageIndex converts a document into a hierarchical tree (a refined, LLM-aware table of contents) and uses LLM-guided tree search to reason its way to the most relevant sections. The workflow simulates how human experts navigate complex documents: narrowing down context by following structure and reasoning, which yields more relevant and explainable retrieval.
Key Concepts and Features
- No Vector DB: Retrieval is driven by structure + reasoning rather than nearest-neighbor vector search.
- No Chunking: Documents are split into natural semantic nodes (sections/pages) instead of arbitrary token chunks.
- Tree Index: Produces a multi-level tree whose nodes carry metadata (title, start/end page, summary, node_id) and map to portions of the document (see the node sketch after this list).
- Reasoning-based Retrieval: LLMs perform a guided tree search (multi-step reasoning) to locate relevant nodes, increasing precision for domain-heavy queries.
- Vision & OCR Options: Supports vision-based pipelines (including OCR-free workflows) for PDFs and images, preserving original layout and hierarchy where possible.
- Explainability: Retrieval results include traceable references (page/section, node summaries) and reasoning traces for auditing.
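For concreteness, a single node in such a tree might look like the sketch below. The field names (title, node_id, start_page, end_page, summary, nodes) are assumptions chosen to match the description above, not a guaranteed schema; consult the repo for the exact JSON format.

```python
# Illustrative PageIndex-style tree node (field names are assumptions, not the exact schema).
example_tree = {
    "title": "Annual Report 2024",
    "node_id": "0000",
    "start_page": 1,
    "end_page": 180,
    "summary": "Full-year financial report covering operations, MD&A, and statements.",
    "nodes": [
        {
            "title": "Management's Discussion and Analysis",
            "node_id": "0007",
            "start_page": 45,
            "end_page": 78,
            "summary": "Revenue drivers, liquidity, and risk discussion.",
            "nodes": [],  # deeper subsections would nest here
        },
    ],
}
```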
Components & Integrations
- Open-source repo: Provides the code, example notebooks (including Vectorless RAG and Vision-based RAG), and utilities to generate PageIndex trees from PDFs/Markdown.
- Chat Platform: pageindex.ai provides a ChatGPT-style interface (PageIndex Chat) for interactive document analysis.
- API & MCP: PageIndex exposes an API and supports MCP integration to plug the reasoning-native retrieval into other agents or platforms (an illustrative integration sketch follows this list).
- Deployment: Options include self-hosting the repo locally, using the hosted Chat Platform, or enterprise/on‑prem deployments.
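As a purely illustrative pattern, an external agent could call a hosted retrieval endpoint roughly as below. The endpoint URL, payload fields, and response shape are invented placeholders, not the documented PageIndex API; see the official docs for the real interface.

```python
# Hypothetical HTTP integration sketch: the URL, headers, payload, and response
# fields are placeholders, NOT the actual PageIndex API.
import os
import requests

api_key = os.environ["PAGEINDEX_API_KEY"]  # placeholder environment variable

resp = requests.post(
    "https://api.example.com/v1/retrieve",  # placeholder endpoint
    headers={"Authorization": f"Bearer {api_key}"},
    json={"doc_id": "indexed-annual-report", "query": "What drove Q4 revenue growth?"},
    timeout=60,
)
resp.raise_for_status()
for node in resp.json().get("nodes", []):  # placeholder response shape
    print(node.get("title"), node.get("pages"))
```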
Usage (high level)
- Generate tree index: Run the provided scripts to parse a PDF and build a hierarchical index (node titles, page ranges, summaries).
- Query with reasoning: Route user queries through a reasoning agent that traverses the tree to locate and aggregate relevant nodes (a traversal sketch follows this list).
- Return traced answer: The system returns an answer with node references and the LLM's reasoning path for transparency.
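The sketch below shows one way such a reasoning-guided traversal could be implemented; it is not the project's actual code. It assumes the node layout from the earlier sketch, an OpenAI-compatible chat model, and that the model replies with a bare JSON list of node_ids.

```python
# Minimal LLM-guided tree search sketch (illustrative, not PageIndex's implementation).
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def search_tree(query: str, node: dict, trace: list) -> list:
    """Recursively ask the model which children are worth exploring for the query."""
    children = node.get("nodes", [])
    if not children:
        return [node]  # leaf node: candidate section to read in full

    outline = "\n".join(
        f'{c["node_id"]}: {c["title"]} (pages {c["start_page"]}-{c["end_page"]}) - {c["summary"]}'
        for c in children
    )
    prompt = (
        f"Question: {query}\n\nCandidate sections:\n{outline}\n\n"
        "Reply with a JSON list of node_id values worth exploring."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    chosen = set(json.loads(reply))  # assumes a clean JSON list; real code should validate
    trace.append({"at": node["title"], "chose": sorted(chosen)})

    selected = []
    for child in children:
        if child["node_id"] in chosen:
            selected.extend(search_tree(query, child, trace))
    return selected

# Usage sketch:
# trace = []
# nodes = search_tree("What drove Q4 revenue growth?", example_tree, trace)
# The collected nodes give page ranges to read, and `trace` records the reasoning path.
```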
Example quickstart steps (summary):
- Install requirements (pip3 install -r requirements.txt).
- Set CHATGPT_API_KEY in .env.
- Run the runner script: python3 run_pageindex.py --pdf_path /path/to/doc.pdf (with optional flags for the model, TOC pages, tokens per node, etc.).
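Assuming the runner writes the tree index to a JSON file (the output path and field names below are guesses; check the repo's docs for the actual location and schema), the resulting index can be inspected with a short recursive walk:

```python
# Sketch: print the generated tree index as an outline.
# The output path and field names are assumptions about the JSON the script emits.
import json

with open("results/doc_structure.json") as f:  # assumed output location
    tree = json.load(f)

def print_outline(node: dict, depth: int = 0) -> None:
    print("  " * depth + f'{node["title"]} (pages {node["start_page"]}-{node["end_page"]})')
    for child in node.get("nodes", []):
        print_outline(child, depth + 1)

print_outline(tree)  # if the file holds a list of root nodes, loop over them instead
```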
Technical Advantages
- Better relevance for multi-step, domain-specific queries where semantic similarity may miss critical context.
- Reduced reliance on external vector DB infrastructure, simplifying deployment and costs.
- Preserves document hierarchy and yields human-interpretable retrieval traces.
Case studies & Results
PageIndex powers reasoning-based RAG systems (e.g., Mafin 2.5) and has been reported to achieve strong results on finance QA benchmarks—demonstrating the approach's effectiveness on complex financial documents.
Who is it for
Researchers, engineers, and product teams working on long-document QA and enterprise document understanding (financial reports, legal/regulatory documents, technical manuals), as well as teams that need explainable retrieval integrated with LLMs.
Project metadata
- Owner: VectifyAI
- Repo origin / created: 2025-04-01
- Provides cookbooks, Colab examples, and docs for getting started and advanced setups.
