Datasets assembled from content that was later removed often become the only window into a model family’s real-world behavior, which is exactly why this collection matters. Glint-Research gathered 953 Fable 5 interaction traces (with added chain-of-thought entries) before the original data became unavailable, creating a compact corpus for empirical analysis and targeted fine-tuning.
What Sets It Apart
- Compact, model-behavior-focused traces: 953 JSON-formatted interaction traces, small enough for quick experiments yet large enough to surface recurring failure modes and reasoning patterns.
- Chain-of-thought (CoT) included: many entries contain CoT-style reasoning, enabling researchers to study intermediate reasoning steps or to fine-tune models for better stepwise explanations.
- Provenance and contributors: the dataset credits TeichAI (which supplied the 953 traces) and Glint-Research (which added the CoT augmentation). This provenance matters for reproducibility and attribution.
- Hugging Face dataset + common tooling: distributed as a Hugging Face dataset and tagged for use with the datasets/pandas ecosystem, making ingestion into typical LLM fine-tuning or analysis pipelines straightforward.
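To make the ingestion point concrete, here is a minimal sketch of reading JSON-formatted traces into pandas. The field names (`prompt`, `reasoning`, `response`) and the sample records are assumptions made for illustration, not the dataset's documented schema, so check the actual columns before relying on them.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in records. The real dataset's field names may differ;
# "prompt", "reasoning", and "response" are assumptions for this sketch.
SAMPLE_RECORDS = [
    {"prompt": "Add 2 and 3.", "reasoning": "2 + 3 = 5.", "response": "5"},
    {"prompt": "Say hello.", "reasoning": None, "response": "Hello!"},
    {"prompt": "Is 7 prime?", "reasoning": "No divisors in 2..6.", "response": "Yes"},
]

# Serialize to JSON Lines (one JSON object per line), a common trace format.
SAMPLE_JSONL = "\n".join(json.dumps(record) for record in SAMPLE_RECORDS)


def load_traces(source) -> pd.DataFrame:
    """Read JSON Lines traces from a path or file-like object, one row per trace."""
    return pd.read_json(source, lines=True)


traces = load_traces(io.StringIO(SAMPLE_JSONL))
print(traces.shape)
print(traces["reasoning"].notna().sum(), "traces carry a reasoning entry")
```

With the published dataset, the usual route would be `datasets.load_dataset` with the repository ID shown on the dataset page, followed by `.to_pandas()` on the chosen split; the in-memory sample above only keeps the sketch runnable offline.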
Who It's For and Trade-offs
Great fit if you want to: perform quick diagnostics of LLM reasoning behavior, prototype fine-tuning strategies on a small trace corpus, or analyze CoT patterns across prompts and responses. The small size makes iteration fast.
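As a rough illustration of the kind of quick diagnostic described above, the sketch below counts how many traces carry CoT text and how long that text tends to be. The `reasoning` field name and the sample records are hypothetical; the real traces may store CoT under a different key or structure.

```python
from statistics import mean

# Hypothetical sample traces; the "reasoning" key is an assumed field name.
sample_traces = [
    {"prompt": "Add 2 and 3.", "reasoning": "Start with 2.\nAdd 3 to get 5.", "response": "5"},
    {"prompt": "Say hello.", "reasoning": None, "response": "Hello!"},
    {"prompt": "Is 7 prime?", "reasoning": "No divisors between 2 and 6.", "response": "Yes"},
    {"prompt": "Sort 3, 1, 2.", "reasoning": "Find 1.\nThen 2.\nThen 3.", "response": "1, 2, 3"},
]


def cot_summary(records, field="reasoning"):
    """Summarize how many records have non-empty CoT text and its mean line count."""
    # Keep only CoT strings that are present and not just whitespace.
    cot_texts = [r[field] for r in records if r.get(field) and r[field].strip()]
    line_counts = [len(text.strip().splitlines()) for text in cot_texts]
    return {
        "total": len(records),
        "with_cot": len(cot_texts),
        "cot_share": len(cot_texts) / len(records) if records else 0.0,
        "mean_cot_lines": mean(line_counts) if line_counts else 0.0,
    }


summary = cot_summary(sample_traces)
print(summary)
```

Line count is only a crude proxy for reasoning depth; on a corpus this small, it is cheap to swap in token counts or step markers once the actual CoT format is known.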
Look elsewhere if you need: large-scale, curated benchmark datasets for production-grade fine-tuning, or datasets with fully audited copyrights and explicit permissions. This corpus was assembled from available sources before removal, and some provenance or content licensing details may be incomplete.
License and ethical/legal note: the dataset is published under AGPL-3.0. That imposes strong copyleft requirements on derivative works and deployed services; verify compatibility with your intended use. Also consider privacy and copyright checks before using examples from the traces in downstream models.
Where It Fits
This dataset is a tactical resource: useful for researchers and engineers doing behavior analysis, hypothesis-driven fine-tuning, or creating small prototype models that study stepwise reasoning. It is not a replacement for large, curated, license-cleared corpora intended for production LLM training.
