• LLMs increasingly synthesize research into structured reports, but the logical reliability of those reports remains unassessed.
• The ReportLogic benchmark quantifies report‑level logical quality so that deep research reports can be audited.
• A hierarchical taxonomy evaluates Macro‑Logic, Expositional‑Logic, and Structural‑Logic within LLM‑generated reports.
• A human‑annotated rubric dataset is used to train the open‑source LogicJudge for scalable evaluation of report quality.
• Adversarial tests reveal that LLM judges are misled by verbosity and that reasoning modes can mask unsupported claims.
• The findings guide the development of robust logic evaluators and improve the trustworthiness of LLM‑generated reports.

Article Summaries:

  • Researchers have introduced ReportLogic, a new benchmark for assessing the logical quality of deep research reports generated by large language models (LLMs). The framework evaluates reports through a reader‑centric taxonomy: (1) Macro‑Logic, checking for a coherent analytical arc; (2) Expositional‑Logic, ensuring contextual clarity; and (3) Structural‑Logic, verifying claim‑support relationships. A human‑annotated dataset and an open‑source “LogicJudge” model are released to enable scalable evaluation. Experiments show that existing LLM judges are vulnerable to superficial cues such as verbosity, and that reasoning modes can obscure unsupported claims. The study offers guidance for building more reliable logic evaluators and improving LLM‑generated report trustworthiness.
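
As a rough illustration of the ideas above, the three taxonomy levels can be held in a small record, and a judge's sensitivity to verbosity can be probed by padding a report with content-free filler and comparing scores. This is a minimal sketch, not the ReportLogic or LogicJudge implementation: the names `LogicScores`, `verbosity_gap`, and `length_biased_judge`, the 0 to 1 score range, and the unweighted average are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LogicScores:
    """Hypothetical per-report scores, one per taxonomy level, each in [0, 1]."""
    macro_logic: float         # coherent analytical arc
    expositional_logic: float  # contextual clarity
    structural_logic: float    # claim-support relationships

    def overall(self) -> float:
        # Unweighted mean: an assumption, not the benchmark's aggregation rule.
        total = self.macro_logic + self.expositional_logic + self.structural_logic
        return total / 3


def verbosity_gap(judge: Callable[[str], float], report: str,
                  filler: str, repeats: int = 5) -> float:
    """Return the score change when content-free filler is appended to a report.

    A gap well above zero suggests the judge rewards length rather than logic.
    """
    padded = report + (" " + filler) * repeats
    return judge(padded) - judge(report)


def length_biased_judge(text: str) -> float:
    """Toy stand-in for a flawed judge: rewards word count, capped at 1.0."""
    return min(len(text.split()) / 100, 1.0)


if __name__ == "__main__":
    scores = LogicScores(macro_logic=0.9, expositional_logic=0.6, structural_logic=0.3)
    print(f"overall: {scores.overall():.2f}")
    gap = verbosity_gap(length_biased_judge,
                        "Claim A follows from evidence B.",
                        "It is worth noting that")
    print(f"verbosity gap: {gap:.2f}")
```

A judge that tracks only logical quality should show a gap near zero under this kind of padding; the toy judge here does not, which is the sort of weakness the adversarial tests are described as exposing.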

Sources: