DREAM: Deep Research Evaluation with Agentic Metrics

• DREAM introduces agentic evaluation for AI research agents, addressing the lack of a single ground truth.
• Highlights the "Mirage of Synthesis": surface fluency can mask factual and reasoning flaws.
• Proposes a taxonomy across four verticals, revealing that static evaluators lack the tool use needed to assess temporal validity.
• Combines query‑agnostic metrics with an adaptive, tool‑calling agent for temporally aware coverage.
• Enables grounded verification and systematic reasoning probes, improving detection of factual decay.
• In controlled tests, DREAM detects factual and temporal decay more sensitively than existing benchmarks while remaining scalable and reference‑free.

Article Summaries:

DREAM: Deep Research Evaluation with Agentic Metrics

A new framework, DREAM, tackles the difficulty of assessing AI‑generated research reports, which lack a single ground truth and exhibit multidimensional quality. The authors identify a “Mirage of Synthesis” where fluent, well‑cited outputs mask factual and reasoning flaws, and note that static evaluators cannot judge temporal validity or tool‑use. DREAM introduces a taxonomy of four verticals and makes the evaluator itself agentic, combining query‑agnostic metrics with adaptive, tool‑calling probes. Experiments show DREAM detects factual and temporal decay more sensitively than existing benchmarks, offering a scalable, reference‑free evaluation paradigm for deep research agents.
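To make the contrast between a static evaluator and a tool‑calling one concrete, here is a toy Python sketch. It is not the paper's implementation: the `Claim` type, both scoring functions, and the `CURRENT_FACTS` dictionary (a stand‑in for a live search or API tool) are all hypothetical names invented for illustration. It only shows how a fully cited report can score perfectly on a surface check while a tool‑backed check flags a stale claim.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Claim:
    key: str     # what the claim is about (hypothetical identifier)
    value: str   # what the report asserts
    cited: bool  # whether the report attaches a citation


def static_score(claims: List[Claim]) -> float:
    """Surface check with no tools: fraction of claims carrying a citation."""
    if not claims:
        return 0.0
    return sum(c.cited for c in claims) / len(claims)


def agentic_score(
    claims: List[Claim], lookup: Callable[[str], Optional[str]]
) -> float:
    """Tool-backed check: fraction of claims matching what the tool returns now."""
    if not claims:
        return 0.0
    verified = sum(1 for c in claims if lookup(c.key) == c.value)
    return verified / len(claims)


# Assumption: a dictionary stands in for a live tool (web search, API call).
CURRENT_FACTS: Dict[str, str] = {"latest_release": "v3", "maintainer": "alice"}

report = [
    Claim("latest_release", "v2", cited=True),  # once true, now stale
    Claim("maintainer", "alice", cited=True),   # still true
]

print(static_score(report))                      # 1.0
print(agentic_score(report, CURRENT_FACTS.get))  # 0.5
```

The static score stays at 1.0 because every claim is cited, while the tool‑backed score drops to 0.5 once the outdated release claim is checked against current data. This gap is a small‑scale picture of the “Mirage of Synthesis” described above.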

Sources:

https://arxiv.org/abs/2602.18940