TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks

• TemporalBench is a multi-domain benchmark for temporal reasoning in LLM-based agents.
• A four-tier task taxonomy tests understanding of historical structure, followed by context-free, contextual, and event-conditioned prediction.
• The domains are retail, healthcare, energy, and physical systems, all drawn from real-world data.
• Baseline results show that strong forecasting performance does not guarantee robust contextual or event-aware reasoning.
• Existing agent frameworks show fragmented strengths and systematic failure modes that forecasting-only tests hide.
• The dataset and leaderboard are publicly available to encourage community-driven evaluation and improvement.

Article Summaries:

TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks (arXiv, Computer Science > Artificial Intelligence; submitted on 5 Feb 2026)

Abstract: It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines

Sources:

https://arxiv.org/abs/2602.13272