Agent training benefits from traces that keep thinking, context, and tool interactions intact — this dataset supplies raw, upload-first agent sessions captured from GLM-5.2 with an eye toward training and distillation workflows. Rather than offering preprocessed SFT pairs, it preserves the original event stream (messages, reasoning fragments, tool calls and runtime metadata) so you can convert or filter for your pipeline.
What Sets It Apart
- Preservation of event-level fidelity: Teich-exported JSONL keeps assistant reasoning fragments before or alongside visible assistant text and records tool calls as first-class events — so models trained on these traces learn when and how tools were invoked, not just final outputs.
- Training-ready tool schema snapshot: a complete dataset-level tools snapshot is embedded and applied as a fallback, making it simple to train tool-using agents without reconstructing schemas from noisy transcripts.
- Upload-first / convertible format: each session is a standalone newline-delimited JSON file and Teich utilities can convert traces to OpenAI-style prompt/messages rows for direct use in SFT or distillation pipelines.
Who It's For and Tradeoffs
Great fit if you need realistic agent behavior and tool-use examples for fine-tuning or distillation (especially tool-enabled LLMs) and want raw session fidelity to craft custom conversion and masking logic. Look elsewhere if you need ready-made, heavily curated SFT pairs or large-scale deduplicated corpora — this collection is raw traces that require conversion and filtering for some training setups. Note also that licensing and downstream use constraints should be checked before production use.
Where It Fits
Use this dataset as a source of high-fidelity agent traces to bootstrap tool-use supervision, create chain-of-thought augmented SFT data, or distill agent policies. It sits upstream of cleaned SFT datasets: consume it when you want control over conversion, masking, and tool-schema application rather than an off-the-shelf training split.
