
LEDGER — Long-Context KPI Question Answering & Page Retrieval

Provides page-level relevance judgments and full OCR'd annual-report text for benchmarking KPI question answering and page retrieval. It supports retrieval evaluation (per-page qrels) and needle-in-a-haystack numeric extraction over long documents, and ships with eval and train configs.

Introduction

Finding a single numeric KPI in a 100+ page OCR'd annual report is a classic long-context information-retrieval and extraction challenge. LEDGER frames that real-world task as two complementary, measurable problems: page retrieval (which pages contain the answer) and precise numeric extraction from the full OCR'd report. This split enables reproducible benchmarking of models and retrieval pipelines.

What Sets It Apart
  • Two-task design: per-query TREC-style page qrels let you evaluate retrieval metrics (Recall@k, MRR, nDCG) separately from the extraction task, enabling modular research on indexing, reranking, and reader models. This separation clarifies whether errors come from retrieval or from the model’s extraction logic.
  • Real long contexts at scale: reports are OCR'd into page-aligned Markdown (median ~124 pages, ~126k tokens), with an mmd_text field and raw .mmd files provided for visual/format-aware methods. The eval config contains 10,000 queries across 494 reports (2017–2022), while the larger no_eval split (~104k queries over 4,505 reports, 2009–2024) supports training and development.
  • Ground-truth value provenance and graded qrels: KPI values are reconciled from SEC EDGAR XBRL, Yahoo Finance, and Alpha Vantage using a deterministic waterfall; per-page relevance is graded 0/1/2 and is directly compatible with trec_eval/pytrec_eval. Relevance judgments were mined via unit-normalized matching and validated by an LLM judge (Qwen3.6-27B).
  • Practical evaluation protocol: an extraction counts as correct when the value falls within a numeric tolerance (default ±0.05%), and answers are expected in structured form (value + unit scale + page number). Reported baseline recall/precision figures (for example, Qwen3.6-27B at ~91.4/93.5) give a starting point for comparison.
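The tolerance check in the evaluation protocol can be sketched as follows. This is a minimal illustration, not the benchmark's own scorer: it assumes the ±0.05% tolerance is relative to the gold value, and the `SCALE` table, `normalize`, and `is_correct` are hypothetical names introduced here.

```python
# Hypothetical unit-scale vocabulary; the dataset's actual labels may differ.
SCALE = {"units": 1.0, "thousands": 1e3, "millions": 1e6, "billions": 1e9}


def normalize(value: float, unit_scale: str) -> float:
    """Convert a (value, unit scale) pair to a plain number."""
    return value * SCALE[unit_scale]


def is_correct(pred_value, pred_scale, gold_value, gold_scale, tol_pct=0.05):
    """Return True if the prediction is within tol_pct percent of the gold value.

    Assumption: the tolerance is relative to the gold value.
    """
    pred = normalize(pred_value, pred_scale)
    gold = normalize(gold_value, gold_scale)
    if gold == 0:
        return pred == 0
    return abs(pred - gold) <= abs(gold) * tol_pct / 100.0


# The same figure expressed at two different scales should match.
print(is_correct(1234.5, "millions", 1.2345, "billions"))
# A 0.1% deviation falls outside the default 0.05% tolerance.
print(is_correct(100.0, "units", 100.1, "units"))
```

Normalizing both sides to a common scale before comparing keeps a correct answer from being rejected only because the model reported millions where the reference uses billions.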

Who It's For — Tradeoffs

Great fit if you need a reproducible benchmark for long-context retrieval and numeric extraction in finance (e.g., testing RAG pipelines, long-context LLMs, OCR-aware readers, or retrieval/rerank strategies). The dataset's page-aligned qrels and full OCR text make it useful for both IR and LLM evaluation.

Look elsewhere if your focus is non-financial text, short-context QA, or multilingual document corpora (this dataset is English and tailored to corporate annual reports). Also note that OCR noise and table rendering require methods that are robust to OCR artifacts and layout; approaches that depend on perfectly parsed tables will transfer poorly.

Additional notes
  • Format and tooling: data is provided in Parquet, with .mmd files and per-page images (eval config). Libraries noted: datasets, dask, polars, mlcroissant. Data license: CC-BY-4.0; code: MIT.
  • Typical workflows: index pages (split on <--- Page Split --->), run retrieval, evaluate against qrels, then pass retrieved pages or full mmd_text + query_text to reader LLMs for extraction. The dataset is suited for research on long-context LLM prompting, hybrid retrieval+LLM pipelines, and OCR-aware extraction.
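The retrieval half of that workflow can be sketched in plain Python. This is an illustration under stated assumptions: the three-page report and its qrels are made up, pages are indexed from 0, and `rank_pages` is a hypothetical toy term-overlap ranker standing in for a real retriever such as BM25 or a dense model. Only the page delimiter and the graded 0/1/2 qrels come from the description above.

```python
PAGE_SPLIT = "<--- Page Split --->"  # page delimiter named in the workflow above


def split_pages(mmd_text):
    """Split a report's mmd_text into a list of page strings."""
    return [page.strip() for page in mmd_text.split(PAGE_SPLIT)]


def rank_pages(query, pages):
    """Toy ranker: order page indices by number of shared query terms."""
    terms = set(query.lower().split())
    scores = [len(terms & set(page.lower().split())) for page in pages]
    return sorted(range(len(pages)), key=lambda i: (-scores[i], i))


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant pages that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant page, or 0.0 if none is retrieved."""
    for rank, page in enumerate(ranked, start=1):
        if page in relevant:
            return 1.0 / rank
    return 0.0


# Made-up three-page report and graded qrels (page index -> grade 0/1/2).
report = (
    "Annual report cover\n" + PAGE_SPLIT + "\n"
    "Total revenue was 1,234.5 million\n" + PAGE_SPLIT + "\n"
    "Notes on revenue recognition"
)
pages = split_pages(report)
qrels = {1: 2}
relevant = {page for page, grade in qrels.items() if grade > 0}

ranked = rank_pages("total revenue", pages)
print(ranked, recall_at_k(ranked, relevant, k=1), reciprocal_rank(ranked, relevant))
```

For full-scale evaluation, the same qrels can be passed to trec_eval or pytrec_eval instead of these hand-written metrics, which also gives nDCG over the graded labels.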
