Annual reports are long, noisy, and heterogeneous, so finding specific KPI values inside OCR'd pages is a classic long-context, needle-in-a-haystack problem for LLMs and information-extraction systems. This dataset deliberately pairs raw OCR text (DeepSeek) with structured, as-reported KPI ground truth, so models are evaluated on real OCR artifacts, layout noise, and multi-year context rather than on idealized, pre-parsed tables.
What Sets It Apart
- Real OCR + ground truth: OCR text (Markdown .mmd with page splits) comes from DeepSeek applied to annual report PDFs, while KPI values are aligned to reports from XBRL/financial sources. That combination stresses model robustness to OCR noise and layout variance.
- Long-context, multi-KPI coverage: The release is split into a large "no_eval" config for development (thousands of reports, ~104k KPI rows) and a smaller "eval" config with page-level JPEGs for held-out benchmarking (~13k KPI rows). The 31 KPI columns cover common balance-sheet, income-statement, and cash-flow items, enabling multi-target extraction tasks rather than single-field QA.
- Practical benchmarking focus: Values are as reported (in millions, with no FX conversion) and are NaN where a KPI is absent. This reflects real-world missingness and forces systems to decide presence/absence, unit interpretation, and value localization across pages.
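Because NaN marks an absent KPI, a scorer has to treat "correctly said nothing" differently from "missed a value" and "invented a value". The following is a minimal sketch of one way to do that, not the dataset's official metric. The KPI names, the 1% relative tolerance, and the `score_kpis` helper are all illustrative assumptions.

```python
import math

def score_kpis(predicted, truth, rel_tol=0.01):
    """Compare predicted KPI values with ground truth for one report.

    Both arguments map a KPI name to a float in millions. NaN means
    "not reported"; a missing key or None in `predicted` is treated the same.
    """
    counts = {"correct_value": 0, "correct_absent": 0,
              "wrong_value": 0, "missed": 0, "hallucinated": 0}
    for kpi, true_val in truth.items():
        pred_val = predicted.get(kpi, math.nan)
        true_absent = math.isnan(true_val)
        pred_absent = pred_val is None or math.isnan(pred_val)
        if true_absent and pred_absent:
            counts["correct_absent"] += 1
        elif true_absent:
            counts["hallucinated"] += 1   # value predicted where none exists
        elif pred_absent:
            counts["missed"] += 1         # value exists but none predicted
        elif math.isclose(pred_val, true_val, rel_tol=rel_tol, abs_tol=1e-9):
            counts["correct_value"] += 1
        else:
            counts["wrong_value"] += 1    # includes unit errors, e.g. thousands vs. millions
    return counts

# Hypothetical KPI names and values, for illustration only.
truth = {"revenue": 1250.0, "net_income": 90.5, "goodwill": math.nan,
         "total_assets": 3000.0, "dividends": math.nan}
predicted = {"revenue": 1251.0, "net_income": 90500.0,
             "goodwill": math.nan, "dividends": 12.0}
print(score_kpis(predicted, truth))
# {'correct_value': 1, 'correct_absent': 1, 'wrong_value': 1, 'missed': 1, 'hallucinated': 1}
```

Keeping the five outcomes separate makes unit-interpretation errors (counted as wrong values) visible alongside presence/absence errors, rather than folding everything into one accuracy number.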
Who It's For and Trade-offs
This is a great fit if you need to evaluate or train systems for long-document retrieval and QA, LLM-based KPI extraction, OCR-to-structured-data pipelines, or table question answering under realistic noise. It’s also useful for benchmarking “needle-in-a-haystack” retrieval, where a small numeric target is buried in many pages. Look elsewhere if you require canonicalized multi-currency normalization, full XBRL-tagged raw filings as the primary source, or image-only datasets for training OCR models at scale: the eval set provides images, but the larger "no_eval" config focuses on OCR text to save space.
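To make the needle-in-a-haystack framing concrete, here is a rough sketch of locating the pages on which a known KPI value appears. It assumes the OCR text has already been split into a list of page strings (the exact page-split marker is not specified here) and that numbers use comma thousands separators with a decimal point. The `pages_containing` helper and the sample pages are hypothetical.

```python
import math
import re

# Matches tokens such as "1,250.0", "2023", or "90.5".
# Assumes "," as thousands separator and "." as decimal point.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")

def pages_containing(pages, target, rel_tol=1e-6):
    """Return indices of pages with a numeric token equal to `target`."""
    hits = []
    for idx, text in enumerate(pages):
        for token in NUMBER.findall(text):
            value = float(token.replace(",", ""))
            if math.isclose(value, target, rel_tol=rel_tol):
                hits.append(idx)
                break  # one match is enough for this page
    return hits

# Made-up page texts, for illustration only.
pages = [
    "Letter to shareholders 2023",
    "Revenue 1,250.0 1,100.4",
    "Notes: total 1,250.0, see page 12",
]
print(pages_containing(pages, 1250.0))  # [1, 2]
```

A value often appears on several pages (summary, statements, notes), so a baseline like this returns candidates rather than a single answer, and it will miss values that OCR noise has corrupted or that are reported in different units.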
Where It Fits
This dataset complements structured XBRL/financial databases by providing the noisy, human-readable surface form of annual reports. Use it when you want to measure how an LLM or extraction pipeline performs end-to-end, from OCR text to numeric KPI outputs, especially across multi-year contexts and variation between companies.
