Long-form document transcription becomes impractical with standard decoder attention because the KV cache and attention cost grow with output length. Unlimited OCR explores a different trade-off: replace full decoder attention with Reference Sliding Window Attention (R-SWA), which always attends to visual/reference tokens while limiting output-side attention to a short causal window. That keeps the KV cache size effectively constant and preserves visual token fidelity, enabling one-shot, multi-page parsing in a single forward pass with near-constant inference latency.
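To make the cache argument concrete, here is a minimal sketch that counts the key/value positions a single decoder layer would hold under each scheme. The function names and the 1,000-token reference length are illustrative assumptions, not taken from the released code. The sketch also assumes the sliding window counts only generated tokens.

```python
def full_attention_cache(num_reference: int, num_generated: int) -> int:
    """Positions cached per layer when every earlier token stays visible."""
    return num_reference + num_generated


def rswa_cache(num_reference: int, num_generated: int, window: int = 128) -> int:
    """Positions cached per layer when only the reference tokens plus the
    most recent `window` generated tokens are kept (assumed behaviour)."""
    return num_reference + min(num_generated, window)


if __name__ == "__main__":
    # Hypothetical page encoded as 1,000 reference tokens.
    for generated in (100, 1_000, 32_000):
        print(
            generated,
            full_attention_cache(1_000, generated),
            rswa_cache(1_000, generated),
        )
```

Under these assumptions the full-attention count keeps rising with every generated token, while the R-SWA count stops growing once the window is full.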
Key Capabilities
- Reference Sliding Window Attention: for each output token, the model attends to all visual/reference tokens plus a fixed-size causal window (default width 128) over recent outputs. This design prevents progressive blurring of visual features while avoiding an ever-growing KV state.
- Long-horizon single-pass parsing: supports very long outputs (standard max_length up to 32K), so dozens of pages can be transcribed in one forward pass rather than through iterative chunking.
- Empirical gains: reports substantial throughput and accuracy improvements on document OCR benchmarks (about 93% end-to-end on OmniDocBench v1.5, with further gains on v1.6) and higher measured TPS than the DeepSeek-OCR baseline.
- Practical deployment: published model, code, and recipes for frameworks like Hugging Face Transformers and vLLM; includes support for multi-page/PDF inference workflows.
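The attention pattern described in the first capability above can be sketched as a boolean mask. This is an illustrative reconstruction, not the project's implementation: the function name `rswa_mask` is hypothetical, reference tokens are assumed to sit before all output tokens in the sequence, and the window is assumed to include the current token.

```python
from typing import List


def rswa_mask(num_reference: int, num_output: int, window: int = 128) -> List[List[bool]]:
    """Return mask[i][j]: may output token i attend to sequence position j?

    Positions 0..num_reference-1 are reference tokens; the remaining
    positions are output tokens in generation order.
    """
    total = num_reference + num_output
    mask = []
    for i in range(num_output):
        query_pos = num_reference + i
        # Oldest output position still inside the causal window.
        window_start = max(num_reference, query_pos - window + 1)
        row = [
            j < num_reference or window_start <= j <= query_pos
            for j in range(total)
        ]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Tiny example: 3 reference tokens, 5 outputs, window of 2.
    for row in rswa_mask(3, 5, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

In this sketch every row allows all reference positions, and the number of allowed output positions never exceeds `window`, which is what keeps the per-token attention cost bounded as the output grows.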
Who It's For and Trade-offs
Great fit if you need to transcribe long multi-page documents in a single pass, want stable inference latency as output length grows, or are integrating OCR into pipelines where memory and KV-cache growth are bottlenecks. Look elsewhere if your primary need is best-in-class single-page recognition in highly resource-constrained environments (smaller models may be cheaper), or if you require an architecture that updates visual reference tokens recurrently (R-SWA intentionally avoids recurrent visual state updates to preserve fidelity).
