Most large-scale uses of arXiv text (LLM pretraining, structured paper understanding, citation extraction) hit two friction points: S3 requester-pays egress costs, and deeply nested raw archives that are expensive to unpack and parse. This dataset removes that friction by ingesting arXiv source in the same region as the source bucket, parsing the LaTeX sources and aligning them with metadata, and exposing the result as partitioned Parquet files ready for analysis.
What Sets It Apart
- Row-level schema with parsed LaTeX: every row contains metadata fields plus a single `latex` field that bundles all source files (.tex, .bib, .sty, figures) as a readable tree, so you can search or extract structural elements without re-parsing thousands of tarballs. This reduces preprocessing time for ML pipelines.
- Monthly sync + manifest: the project maintains an XML manifest mapping Parquet partitions to source S3 tar inputs, sizes, checksums and processed timestamps, enabling reproducible incremental updates and resumable ingestion.
- Practical mitigation of egress and CPU cost: by doing heavy ingest work in-region and publishing ready Parquet files, it avoids requester-pays download complexity and the CPU overhead of nested archive extraction for downstream users.
- Large coverage: the dataset spans millions of arXiv papers, with hundreds of gigabytes of parsed LaTeX content suitable for pretraining and long-form document tasks.
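
To illustrate how the manifest could drive an incremental update, here is a minimal Python sketch. The manifest layout is not documented in this section, so the element name `partition`, the attribute names (`path`, `source`, `size`, `md5`, `processed`), and the sample values are all assumptions made for illustration; adapt them to the real manifest before use.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical manifest: element names, attribute names, and values are
# illustrative assumptions, not the project's actual schema.
SAMPLE_MANIFEST = """\
<manifest>
  <partition path="year=2023/month=01/part-0000.parquet"
             source="src/example_2301_001.tar"
             size="1048576" md5="aaaa1111"
             processed="2023-02-01T00:00:00"/>
  <partition path="year=2023/month=02/part-0000.parquet"
             source="src/example_2302_001.tar"
             size="2097152" md5="bbbb2222"
             processed="2023-03-01T00:00:00"/>
</manifest>
"""


def load_manifest(xml_text):
    """Parse the manifest XML into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.iter("partition"):
        entries.append({
            "path": node.get("path"),
            "source": node.get("source"),
            "size": int(node.get("size")),
            "md5": node.get("md5"),
            "processed": datetime.fromisoformat(node.get("processed")),
        })
    return entries


def partitions_to_fetch(entries, local_checksums):
    """Return paths that are missing locally or whose checksum differs.

    `local_checksums` maps partition path -> checksum of the local copy.
    """
    return [
        entry["path"]
        for entry in entries
        if local_checksums.get(entry["path"]) != entry["md5"]
    ]


entries = load_manifest(SAMPLE_MANIFEST)
# Pretend the January partition is already downloaded and up to date.
already_have = {"year=2023/month=01/part-0000.parquet": "aaaa1111"}
pending = partitions_to_fetch(entries, already_have)
```

Comparing checksums rather than timestamps means a re-run after an interrupted download only fetches what is missing or stale, which is the resumable behavior the manifest is meant to enable.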
Who It's For & Tradeoffs
Great fit if you are a researcher or engineer who needs full LaTeX source at scale: LLM pretraining, scientific document understanding, citation/figure extraction, or building structured paper corpora. Look elsewhere if you only need metadata or abstracts (use the arXiv API or OAI-PMH) or if you require redistribution rights beyond what individual paper licenses permit. Copyright stays with the authors and each paper carries its own license, so downstream systems must respect each paper's license and the arXiv Terms of Use. Also plan for substantial storage and compute when working with multi-hundred-GB Parquet partitions.
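
As a rough sketch of the citation and structure extraction use case, the following Python snippet pulls section titles and citation keys out of LaTeX text with regular expressions. It assumes the LaTeX content of a row is available as a plain string; the exact layout of the bundled source tree is not specified here, and `SAMPLE_LATEX`, `section_titles`, and `citation_keys` are hypothetical names used only for this example.

```python
import re

# Matches \section{...} and \section*{...} with no nested braces in the title.
SECTION_RE = re.compile(r"\\section\*?\{([^{}]*)\}")
# Matches \cite, \citep, \citet, ... with optional [..] arguments.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^{}]*)\}")

# Stand-in for the LaTeX text of one row (illustrative only).
SAMPLE_LATEX = r"""
\section{Introduction}
Prior work \cite{smith2020, lee2021} and \citep[see][]{kim2019} motivates this.
\section*{Results}
We confirm the findings of \cite{smith2020}.
"""


def section_titles(latex):
    """Return section titles in document order."""
    return [title.strip() for title in SECTION_RE.findall(latex)]


def citation_keys(latex):
    """Return unique citation keys in order of first appearance."""
    keys = []
    for group in CITE_RE.findall(latex):
        for key in group.split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


titles = section_titles(SAMPLE_LATEX)
keys = citation_keys(SAMPLE_LATEX)
```

Regular expressions are a deliberately simple choice: they ignore commented-out lines, custom citation macros, and titles containing nested braces, so a production pipeline would likely need a real LaTeX parser.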
