Most benchmarks focus on short answers or single-step tasks; DRACO targets the harder problem of long-form, multi-hop research outputs and the systems that generate them. By pairing 100 anonymized, real-user research queries with detailed, expert-curated rubrics, DRACO makes it possible to measure not just surface correctness but also synthesis, attribution, and the presence of harmful errors in a reproducible way.
What Sets It Apart
- Real-world provenance: tasks are sampled from Perplexity Deep Research usage and then systematically augmented (persona, scope, temporal/geographic breadth) to reflect realistic, challenging information needs.
- Expert rubrics at scale: each task includes a JSON-encoded rubric with roughly 30–60 criteria (about 40 on average) organized into four axes: factual accuracy, breadth and depth of analysis, presentation quality, and citation quality. Each criterion carries an integer weight (positive for rewards, negative for penalties) for fine-grained scoring.
- Safety-aware scoring: negative-weight criteria explicitly encode harmful or dangerous errors, with stronger penalties for cases such as hazardous medical guidance, so systems that make risky assertions are penalized in the raw score.
- Reproducible judge protocol: evaluation is designed for an LLM-as-judge setup where a judge model assesses each criterion (MET/UNMET) to compute normalized task scores, enabling comparability across systems and runs.
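The scoring idea described above (weighted criteria, MET/UNMET verdicts, a normalized task score) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DRACO's reference implementation: the rubric field names (`id`, `axis`, `weight`), the convention that a negative-weight criterion marked MET means the harmful error occurred, and the normalization (raw score divided by the total positive weight, clamped to [0, 1]) are all hypothetical choices made for this example.

```python
import json

# Hypothetical rubric layout: the field names "id", "axis", and "weight"
# are assumptions for illustration, not the dataset's documented schema.
RUBRIC_JSON = """
[
  {"id": "c1", "axis": "factual accuracy", "weight": 5},
  {"id": "c2", "axis": "citation quality", "weight": 3},
  {"id": "c3", "axis": "presentation quality", "weight": 2},
  {"id": "c4", "axis": "factual accuracy", "weight": -4}
]
"""


def normalized_score(rubric, verdicts):
    """Combine per-criterion judge verdicts into a score in [0, 1].

    Assumed formula: sum the weights of all criteria judged MET
    (penalties included), divide by the total positive weight,
    then clamp the result to the range [0, 1].
    """
    max_positive = sum(c["weight"] for c in rubric if c["weight"] > 0)
    if max_positive == 0:
        raise ValueError("rubric has no positive-weight criteria")
    raw = sum(c["weight"] for c in rubric if verdicts.get(c["id"]) == "MET")
    return max(0.0, min(1.0, raw / max_positive))


rubric = json.loads(RUBRIC_JSON)

# Verdicts as a judge model might return them. Here "MET" on the
# negative-weight criterion c4 means the harmful error was present.
verdicts = {"c1": "MET", "c2": "UNMET", "c3": "MET", "c4": "MET"}

score = normalized_score(rubric, verdicts)
print(f"normalized task score: {score:.2f}")
```

In this toy rubric the response earns 5 + 2 from the rewards it meets, loses 4 to the penalty, and is normalized against a maximum of 10 positive points. Clamping keeps a heavily penalized response from producing a negative normalized score; whether to clamp or to report the raw value is a design choice that should be held fixed when comparing systems.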
Who It's For and Tradeoffs
DRACO is a good fit if you need to benchmark agentic research systems that browse, retrieve, and synthesize heterogeneous sources (e.g., RAG agents, web-enabled LLMs, multi-step research assistants). Use it to compare citation practices, detect brittle reasoning, and measure how systems handle complex multi-source synthesis.

Look elsewhere if you only need short QA, classification, or token-level benchmarks. DRACO is intentionally heavyweight (long tasks, many rubric checks) and reflects a static snapshot of information from late 2025, so it won't replace domain-specific, continuously updated evaluation suites. Also expect some score variance depending on the choice and configuration of the judge model.
