DDR-Bench: Benchmarking Agentic Data Research

DDR-Bench: LLMs That Hunt Instead of Wait.
What is Deep Data Research?
We introduce Deep Data Research (DDR), a task in which LLMs autonomously dive into databases and surface the insights they deem important — no pre-defined questions, no explicit targets, just fully autonomous Data→Insights.
Unlike traditional QA or coding benchmarks, DDR evaluates whether models can proactively set investigative goals and extract meaningful insights from complex databases, mimicking how expert data scientists work in practice.
Highlights
- Verifiable Evaluation: Checklist-based assessment extracted from unstructured reports, validated by 50+ domain experts
- Three Diverse Domains: Electronic Health Records (MIMIC-IV), Sport & Exercise Psychology (GLOBEM), Annual Financial Reports (10-K SEC filings)
- Highest Autonomy: No pre-set questions or targets — LLMs decide what to investigate
- Minimalist Design: Built for Agentic LLMs with simple ReAct prompts and minimal toolset (2 MCP servers, 6 functions)
- Long-Horizon: Up to 100 turns and 70,000+ tokens per trajectory
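The minimalist agent design above can be sketched as a plain ReAct loop over a small tool registry with a hard turn budget. This is an illustrative sketch only: the tool names, step format, and `finish` action are assumptions, not DDR-Bench's actual MCP interface.

```python
# Minimal ReAct-style loop over a small tool registry.
# NOTE: tool names ("read_schema", "run_sql") and the step dict format
# are illustrative assumptions, not DDR-Bench's actual interface.

MAX_TURNS = 100  # long-horizon budget, matching the benchmark's turn cap


def run_agent(model, tools, task_prompt):
    """Drive a thought -> action -> observation loop until the model
    emits a final report or the turn budget is exhausted."""
    history = [("task", task_prompt)]
    for _ in range(MAX_TURNS):
        # model returns {"thought": ..., "action": ..., "args": {...}}
        step = model(history)
        if step["action"] == "finish":
            return step["args"]["report"]
        tool = tools[step["action"]]        # look up the chosen tool
        observation = tool(**step["args"])  # execute it against the database
        history.append((step["action"], observation))
    return None  # budget exhausted without a final report
```

With only a handful of tools, the model's exploration strategy, not the scaffold, carries the work of deciding what to investigate next.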
Key Findings
- Domain knowledge defines the ceiling — it determines how deeply a model can reason within a domain
- Exploration strategy governs whether models approach that ceiling — reflecting the ability to generate informative hypotheses
- Cost efficiency determines convergence speed — advanced architectures achieve higher information gain per token
Current SOTA models still struggle to exceed 50% average accuracy, indicating DDR tasks are far from saturated.
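Checklist-based accuracy can be sketched as scoring a model's free-form report against expert-validated checklist items. The substring matcher below is a toy stand-in assumption; the benchmark's actual verifier is not specified here and would plausibly use an LLM judge or expert rubric instead.

```python
def checklist_accuracy(report: str, checklist: list[str]) -> float:
    """Fraction of checklist items covered by the report.

    Substring matching is a deliberately simple stand-in for the
    benchmark's real verifier; it only illustrates the metric's shape.
    """
    if not checklist:
        return 0.0
    hits = sum(1 for item in checklist if item.lower() in report.lower())
    return hits / len(checklist)
```

Under this shape, "50% average accuracy" means a model's reports cover about half of the expert checklist items on average.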
Read More
For detailed methodology, experimental results, and analysis on test-time scaling and exploration patterns, check out the full write-up: 👉 DDR-Bench Notion Blog