2026
Julia Belikova, Rauf Parchiev +5
Benchmark for evaluating procedural skill evolution in LLM agents: isolates reusable skill bodies, role-specific work surfaces, and hidden oracle assets to measure whether skill refinements transfer across tasks, roles, and model backbones. Includes 382 workplace tasks, 22 skills, and a controlled evaluation protocol.
