A new cross-domain benchmark reveals how the leading AI research tools perform on real-world production tasks
Two AI-generated research reports land on your desk before a major decision. Both are polished, confidently written, and well-structured, but they reach different conclusions. Which one do you trust, and how would you even begin to find out? In “DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity,” a team at Perplexity and Jeremy Yang, Assistant Professor of Business Administration at Harvard Business School and an affiliate of the Digital Data Design Institute at Harvard (D^3), present a rigorous new benchmark for measuring how well AI deep research systems actually perform on real-world production tasks.
Key Insight: A New Standard for Deep Research Evaluation
“We introduce a cross-domain benchmark derived from real-world production deep research tasks designed to bridge the gap between AI evaluations and authentic research needs.” [1]
AI “deep research” systems, tools that can autonomously decompose a complex question, search hundreds of sources, reconcile conflicting evidence, and synthesize findings into a cited report, are increasingly used for high-stakes analytical work in areas such as finance, law, and medicine. Unlike a simple chatbot response, these systems operate more like an analyst running an independent research process. While the technology has advanced quickly, the frameworks for evaluating it have not kept pace. The authors argue that evaluating deep research must reflect realistic use cases, span domains, account for region-specific sources, and probe multiple system capabilities, such as planning, search, and reasoning, all at once.
Key Insight: Tasks Deeply Rooted in Practice
“Our main contribution is a curated set of benchmark tasks that closely mirror real deep research needs and how people use deep research agents in practice.” [2]
Many AI benchmarks are built by researchers and experts imagining what hard questions look like. DRACO takes a different approach: its 100 tasks were sourced directly from actual user queries submitted to Perplexity’s deep research system in fall 2025. Specifically, the researchers sampled from high-difficulty requests where users had expressed dissatisfaction, making these exactly the kinds of tasks where AI systems tend to struggle. Those raw queries were then anonymized, augmented to add specificity and scope, and filtered to ensure each task was objectively evaluable, appropriately bounded, and genuinely challenging. The resulting tasks span 10 domains and draw on sources from 40 countries across five regions.
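To make the curation logic concrete, here is a minimal sketch of the kind of query-log filtering described above. The record fields, difficulty estimate, and threshold are hypothetical; the paper’s actual pipeline also relies on human anonymization, augmentation, and expert review.

```python
from dataclasses import dataclass

@dataclass
class QueryRecord:
    """One logged deep research request (hypothetical fields)."""
    text: str
    difficulty: float      # assumed difficulty estimate in [0, 1]
    user_satisfied: bool   # e.g., inferred from feedback signals

def candidate_tasks(log: list[QueryRecord], min_difficulty: float = 0.8) -> list[str]:
    """Keep high-difficulty queries where the user expressed dissatisfaction."""
    return [q.text for q in log
            if q.difficulty >= min_difficulty and not q.user_satisfied]

# Downstream steps (not shown): anonymize, add specificity and scope, and
# filter for tasks that are objectively evaluable and appropriately bounded.
```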
Key Insight: Rating Real-World Complexity
“Twenty-six domain experts, including medical professionals, attorneys, financial analysts, software engineers, and designers, were recruited to develop rubrics for selected tasks.” [3]
DRACO’s grading rubrics were developed through a rigorous human-expert pipeline: an initial rubric is drafted by one expert, reviewed and refined by a second, subjected to a “saturation test” to ensure the current system cannot easily exceed 90% (which would indicate an overly easy task or an overly lenient rubric), and finally validated by a third and a fourth expert for quality assurance. Each task was ultimately assessed against an average of 39 criteria spanning four dimensions: factual accuracy, breadth and depth of analysis, presentation quality, and citation quality.
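As a rough illustration of how such a rubric might be aggregated into a score, consider the sketch below. The `Criterion` class, the weights, and the normalization are assumptions for illustration, not DRACO’s actual grading code; the key idea it captures is that negative criteria can wipe out credit earned elsewhere.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One rubric item; the weight is negative for penalized failure modes."""
    description: str
    dimension: str   # e.g., "factual accuracy", "breadth/depth", "presentation", "citation"
    weight: float    # positive for desired content, negative for failure modes

def score_report(criteria: list[Criterion], graded: dict[str, bool]) -> float:
    """Aggregate a graded report into a 0-100 score.

    A report collects a criterion's weight whenever the grader marks it as
    satisfied (positive criteria) or triggered (negative criteria); the sum
    is normalized by the maximum achievable positive weight and clamped at 0.
    """
    max_positive = sum(c.weight for c in criteria if c.weight > 0)
    earned = sum(c.weight for c in criteria if graded.get(c.description, False))
    return max(0.0, 100.0 * earned / max_positive)

# Toy rubric: two positive checks and one heavily penalized unsafe recommendation.
rubric = [
    Criterion("States the drug's correct approval year", "factual accuracy", 2.0),
    Criterion("Cites at least one primary source", "citation", 1.0),
    Criterion("Recommends an unsafe off-label dosage", "factual accuracy", -4.0),
]
graded = {
    "States the drug's correct approval year": True,
    "Cites at least one primary source": True,
    "Recommends an unsafe off-label dosage": True,
}
print(score_report(rubric, graded))  # 0.0 -- the penalty erases the earned credit
```

The normalization and the clamp at zero are illustrative choices rather than details taken from the paper, but the toy run shows how a single triggered failure mode can outweigh several satisfied criteria.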
Key Insight: Progress, But Gaps Remain
“Our evaluation of frontier deep research systems reveals that while significant progress has been made (especially in presentation quality), substantial headroom remains (especially in factual accuracy).” [4]
The evaluation results indicate that while agents have improved across all rubric dimensions, and now excel in presentation quality, they continue to struggle with factual accuracy. This may partly reflect the rubric design: roughly half of all criteria target verifiable factual claims, and the rubrics also include negative criteria that penalize specific failure modes. In domains like medicine and law, these penalties are particularly severe, as incorrect or unsafe recommendations carry heavy negative weights. This reflects a core design principle: in high-stakes domains, what AI gets wrong matters as much as what it gets right.
Why This Matters
As we increasingly rely on AI for high-stakes tasks, from brainstorming and research to actual execution, the bottleneck is no longer speed; it is accuracy. The area where AI performs best, producing polished, well-structured output, is precisely where it is hardest for a non-specialist to detect errors. For business leaders, DRACO’s task-and-rubric design offers a concrete blueprint for evaluating and choosing research agents: define success criteria, test on representative workloads, and decide in advance how you will know when the system is wrong.
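For teams that want to operationalize this blueprint, the sketch below shows one possible shape of the evaluation loop: run each candidate agent on a representative set of tasks and average its rubric scores. The agent interface, the toy rubric, and the sample task are hypothetical placeholders; a real rubric would encode expert-written criteria of the kind DRACO uses.

```python
from typing import Callable

def evaluate_agent(run_agent: Callable[[str], str],
                   tasks: list[str],
                   rubric: Callable[[str, str], float]) -> float:
    """Average rubric score for one agent over a representative task set."""
    return sum(rubric(task, run_agent(task)) for task in tasks) / len(tasks)

def toy_rubric(task: str, report: str) -> float:
    # Stand-in scorer: rewards bracketed citations and some minimum length.
    # A real rubric would encode expert-written, task-specific criteria.
    return (50.0 if "[" in report else 0.0) + min(len(report) / 10, 50.0)

tasks = ["Compare data-residency requirements for cloud vendors in the EU and US"]
agents = {
    "agent_a": lambda t: "A structured report with cited sources [1][2] ...",
    "agent_b": lambda t: "A short, uncited answer.",
}
for name, run in agents.items():
    print(name, round(evaluate_agent(run, tasks, toy_rubric), 1))
```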
Bonus
While it seems self-evident that we want the best and most accurate information from AI, that’s actually not always the case. Check out “Explanations on Mute: Why We Turn Away From Explainable AI” to see why.
References
[1] Joey Zhong et al., “DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity,” arXiv preprint arXiv:2602.11685 (2026): 2. https://doi.org/10.48550/arXiv.2602.11685
[2] Zhong et al., “DRACO”: 2.
[3] Zhong et al., “DRACO”: 5.
[4] Zhong et al., “DRACO”: 12.
Meet the Authors

Jeremy Yang is an Assistant Professor of Business Administration at Harvard Business School and affiliated with the Digital Data Design Institute at Harvard (D^3).
Additional Authors (Perplexity): Joey Zhong, Hao Zhang, Clare Southern, Thomas Wang, Kate Jung, Shu Zhang, Denis Yarats, Johnny Ho, Jerry Ma