Most public LLM trace dumps mix mirrors, repacks, and session artifacts, which complicates both provenance tracking and model evaluation. This corpus stitches together the available FABLE.5/Mythos releases into a normalized, deduplicated export so researchers can reason about agent behavior and dataset provenance at the row level. The collection is sized for practical download and analysis (2,006,487 clean rows, 1.94 GiB) and intentionally keeps provenance metadata to support reproducible dataset auditing.
What Sets It Apart
- Row-level canonicalization and SHA256 row hashes: each canonical row preserves the original JSON and a stable hash, so identical rows across mirrors are deduplicated while traceability is preserved (first_source_* fields).
- Provenance fields for every row: first_source_dataset, first_source_config, first_source_split, and first_source_row_index let you map back to the originating Hugging Face release for citation and audit.
- Clean export formats tuned for workflows: a viewer-friendly Parquet split plus a gzip-compressed canonical JSONL mirror (same 2,006,487 rows) make it easy to load with datasets, pandas, or polars and to convert for SFT/analysis pipelines.
- Minimal post-processing: only 604 rows were removed (those matching a session-limit assistant-answer pattern) to reduce noisy artifacts while retaining the broad trace content useful for chain-of-thought, tool-use, and coding-agent research.
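The row hashing and first-source bookkeeping described in the list above can be sketched roughly as follows. This is a minimal illustration, not the pipeline that produced the corpus: the canonical form used here (sorted keys, compact separators) is an assumption, and the names row_hash, dedupe, row_sha256, and original_json are hypothetical. Only the first_source_* field names come from the description above.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Return the SHA256 hex digest of a canonical JSON form of `row`.

    Assumption: "canonical" means sorted keys and compact separators, so that
    two rows differing only in key order produce the same hash.
    """
    canonical = json.dumps(
        row, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(records):
    """Keep the first occurrence of each distinct row, with its provenance.

    `records` yields (dataset, config, split, row_index, row) tuples in the
    order the sources are visited; later duplicates are skipped.
    """
    seen = {}
    for dataset, config, split, row_index, row in records:
        digest = row_hash(row)
        if digest in seen:
            continue  # identical row already seen in an earlier source
        seen[digest] = {
            "row_sha256": digest,  # hypothetical column name
            "original_json": json.dumps(row, ensure_ascii=False),  # hypothetical
            "first_source_dataset": dataset,
            "first_source_config": config,
            "first_source_split": split,
            "first_source_row_index": row_index,
        }
    # dicts preserve insertion order, so output follows first-seen order
    return list(seen.values())
```

Under this scheme, the order in which sources are visited decides which mirror is credited as the first source, so an audit that depends on that attribution should check how the actual export ordered its inputs.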
Who It's For & Tradeoffs
Great fit if you need a large, traceable corpus of agent-LLM interactions for language-model training, behavior analysis, or distillation experiments and you want explicit first-source provenance for auditing. It is also useful for converting into SFT/CoT-style training splits or for examining tool-use and coding traces.
Look elsewhere if you require fully curated, human-annotated gold labels, privacy cleansing beyond the removals described above, or a guarantee that no synthetic or model-generated content is present. This corpus aggregates public traces, which can include machine-generated assistant outputs and programmatic mirrors.
The MIT license enables reuse, but downstream users should still perform their own privacy and safety review before publishing models trained on the data.
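As a starting point for the kind of provenance audit mentioned above, the gzip-compressed JSONL mirror can be streamed with the Python standard library and tallied by source. This is a hedged sketch: the helper names iter_jsonl_gz and provenance_counts are hypothetical, the file path is whatever the downloaded mirror is called locally, and it assumes each JSONL line is one JSON object carrying the first_source_* fields.

```python
import gzip
import json
from collections import Counter


def iter_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def provenance_counts(rows):
    """Count rows per (first_source_dataset, first_source_split) pair.

    Rows missing either field are counted under None for that position.
    """
    return Counter(
        (row.get("first_source_dataset"), row.get("first_source_split"))
        for row in rows
    )
```

Because the file is read line by line, the tally does not require holding the full 1.94 GiB export in memory. The resulting counts give a quick view of which upstream releases dominate the corpus before any privacy or safety filtering is applied.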
