Deep Data Research -- Database as Hunting Ground, LLMs as Hunters

Introducing DDR-Bench. Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models.

What is Deep Data Research?

We introduce Deep Data Research (DDR), a task where LLMs autonomously dive into databases to explore insights they deem important — no pre-defined questions, no explicit targets, no interaction limit, just fully autonomous Data→Insights.

Unlike traditional QA or coding benchmarks, DDR evaluates whether models can proactively set investigative goals and extract meaningful insights from complex databases, mimicking how expert data scientists work in practice.

Please check out our project page and arXiv paper.

Highlights

  • Verifiable Evaluation: Checklist-based assessment extracted from unstructured reports, validated by 50+ domain experts
  • Three Diverse Domains: Electronic Health Records (MIMIC-IV), Sport & Exercise Psychology (GLOBEM), Annual Financial Reports (10-K SEC filings)
  • Highest Autonomy: No pre-set questions or targets — LLMs decide what to investigate
  • Minimalist Design: Built for Agentic LLMs with simple ReAct prompts and minimal toolset (2 MCP servers, 6 functions)
  • Long-Horizon: No limit on the number of interactions; agentic LLMs decide when to stop
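To make the setup concrete, the agent loop can be sketched as a minimal ReAct-style cycle over a database toolset. This is only an illustrative sketch: the tool name `run_sql`, the stubbed `fake_model`, and the in-memory SQLite database are assumptions for demonstration, not the benchmark's actual MCP servers or prompts.

```python
import sqlite3

def run_sql(conn, query):
    """Execute a read-only query and return the rows (a stand-in tool)."""
    return conn.execute(query).fetchall()

def fake_model(history):
    """Stand-in for an LLM: inspects the schema once, then reports.
    A real agent would alternate reasoning and tool calls many times."""
    if not history:
        return ("tool", "SELECT name FROM sqlite_master WHERE type='table'")
    return ("stop", "Insight: the database contains table(s) "
                    + str(history[-1][1]))

def react_loop(conn, model, max_steps=None):
    """Run the act/observe cycle; with max_steps=None there is no
    interaction limit and the agent decides when to stop."""
    history = []
    step = 0
    while max_steps is None or step < max_steps:
        action, payload = model(history)
        if action == "stop":
            return payload                      # the final report
        observation = run_sql(conn, payload)    # act, then observe
        history.append((payload, observation))
        step += 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, age INTEGER)")
report = react_loop(conn, fake_model)
print(report)
```

The point of the sketch is the open-ended termination: the loop imposes no interaction budget, so stopping is a decision the model itself must make.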

Key Findings

  • DDR evaluates investigatory intelligence rather than executional intelligence. The former places substantially higher demands on agency, requiring models to autonomously set goals and determine exploration directions.
  • Frontier models already exhibit signs of agency, yet long-horizon exploration remains the primary bottleneck.
  • High-quality Deep Data Research behavior emerges from stable, implicit coordination between reasoning and exploration, rather than from a simple accumulation of isolated capabilities.
  • Explicit reasoning is often concentrated in the initial interaction rounds and gradually gives way to tool-dominated behavior. Part of the reasoning is implicitly embedded in tool parameters and code comments rather than expressed through explicit chains of thought.
  • Test-time scaling analyses from the perspectives of interactions, tokens, and cost show that strong LLMs behave like hunters, patiently exploring before drilling deeply into insights, while exhibiting exceptionally high token efficiency.
  • Increasing the reasoning budget can substantially raise the number of reasoning tokens and reduce the number of interaction rounds, but final performance fluctuates significantly. This indicates a trade-off between reasoning depth and interaction frequency, where neither extreme is optimal.
  • Effective agency depends on the model's internal exploration strategy rather than on agent modules or parameter scaling alone. Agent modules primarily reshape interaction patterns instead of consistently improving deep data research capability.
  • Training-time factors systematically influence test-time scaling behavior. The effects of parameter scale and long-context optimization are weaker than those of agentic-native training.
  • Current SOTA models still struggle to exceed 50% average accuracy, indicating that DDR tasks are far from saturated.
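As a loose illustration of how a checklist-based accuracy number like the 50% above could be computed, the sketch below marks each checklist item covered if the report mentions its keywords. This keyword matcher, the example report, and the checklist items are all assumptions for demonstration; the benchmark's actual checklists are expert-validated and its matching procedure is described in the paper.

```python
def checklist_accuracy(report, checklist):
    """Fraction of checklist items whose keywords all appear in the
    report. A deliberately naive matcher, purely illustrative."""
    report_lower = report.lower()
    covered = sum(
        all(kw.lower() in report_lower for kw in item)
        for item in checklist
    )
    return covered / len(checklist)

# Hypothetical report and checklist, not drawn from the benchmark.
report = ("Median length of stay rises sharply for patients over 80, "
          "and readmission correlates with discharge season.")
checklist = [
    ("length of stay", "80"),   # covered
    ("readmission", "season"),  # covered
    ("mortality", "icu"),       # not covered
]
print(round(checklist_accuracy(report, checklist), 2))  # prints 0.67
```

Scoring against per-database checklists is what makes the open-ended reports verifiable despite the absence of pre-defined questions.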

Read More

Citation

If you found the topics in this blog post interesting and would like to cite it, you may use the following BibTeX entry:

@misc{liu2026huntinsteadwaitevaluating,
  title={Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models},
  author={Wei Liu and Peijie Yu and Michele Orini and Yali Du and Yulan He},
  year={2026},
  eprint={2602.02039},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.02039},
}