ITBench-AA supplies a compact, evaluation-focused slice of IBM's ITBench: 40 public SRE scenarios consisting of offline Kubernetes incident snapshots paired with ground-truth contributing-factor entities. It exists to benchmark agentic workflows that must read alerts, events, traces and topology and produce a minimal set of root-cause Kubernetes entities (Deployments, Pods, Services, ConfigMaps, etc.).
What Sets It Apart
- Focused evaluation slice: contains only the 40 scenarios marked public in the upstream ITBench repo, which keeps the dataset lean for reproducible leaderboard evaluation. Iteration stays fast, and the incident data remains realistic and multi-modal (logs, traces, topology).
- Structured ground truth: each row includes a ground_truth_yaml describing fault propagation, entity groups and recommended remediations, which enables automated scoring against minimal root-cause sets.
- Engineered for agentic harnesses: designed to be mounted as an offline sandbox for agent runs (Stirrup-style harnesses), matching the workflow used on the ITBench-AA leaderboard where agents inspect filesystem snapshots and output a structured JSON diagnosis.
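To illustrate how a structured JSON diagnosis might be compared against a minimal root-cause set, here is a small sketch in Python. It is not the leaderboard's scoring code: the `root_cause_entities` key, the `Kind/name` entity format, and the choice of precision, recall and F1 are all assumptions made for illustration, and the ground-truth entities are assumed to have already been extracted from ground_truth_yaml.

```python
import json


def normalize(entity: str) -> str:
    """Lower-case and trim an entity label so that comparison is case-insensitive."""
    return entity.strip().lower()


def score_diagnosis(predicted, ground_truth):
    """Compare predicted root-cause entities with the ground-truth set.

    Returns set-based precision, recall and F1. These metrics are an
    illustrative assumption, not the official ITBench-AA scoring rule.
    """
    pred = {normalize(e) for e in predicted}
    truth = {normalize(e) for e in ground_truth}
    hits = len(pred & truth)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical agent output; the key name and entity labels are made up.
agent_output = json.loads(
    '{"root_cause_entities": ["Deployment/checkout", '
    '"ConfigMap/checkout-config", "Service/frontend"]}'
)

# Hypothetical ground-truth entities, assumed already parsed from ground_truth_yaml.
truth = ["Deployment/checkout", "ConfigMap/checkout-config"]

result = score_diagnosis(agent_output["root_cause_entities"], truth)
print(result)
```

In this made-up case the agent names both true entities plus one extra, so recall is perfect while precision drops. A set-based score of this kind penalizes diagnoses that are not minimal.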
Who It's For and Trade-offs
- Great fit if you develop or evaluate AI agents for SRE/root-cause analysis, want reproducible offline benchmarks, or need realistic incident scenarios without running live clusters.
- Look elsewhere if you need the full ITBench suite (this release omits 19 private/held-out tasks) or if you require live, interactive cluster environments instead of offline snapshots. The dataset emphasizes diagnosis evaluation over orchestration or remediation execution.
Where It Fits
Use ITBench-AA when your goal is to measure an agent's precision in identifying minimal root-cause entities from complex, multi-source evidence. For broader coverage (more scenarios or other IT domains), pair it with the upstream itbench-hub/ITBench repository or the full ITBench benchmark releases.
