SourceBench: Can AI Answers Reference Quality Web Sources?

Computer Science > Artificial Intelligence
[Submitted on 18 Feb 2026]

Abstract: Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research in the direction of GenAI and web search.
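To make the idea of a multi-metric source score concrete, here is a minimal Python sketch of how per-source metric scores might be stored and averaged. It is an illustration only, not the authors' implementation: the abstract mentions eight metrics but names only six, so only those six appear here, and the 1-to-5 scale, the unweighted mean, and the names `SourceScore`, `overall`, and `system_average` are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Metric names taken from the abstract. The abstract says there are eight
# metrics but names only these six; the remaining two are not guessed here.
NAMED_METRICS = (
    "content_relevance",
    "factual_accuracy",
    "objectivity",
    "freshness",
    "authority_accountability",
    "clarity",
)


@dataclass
class SourceScore:
    """Scores for one cited web source (hypothetical structure)."""

    url: str
    # Assumed scale: 1 (poor) to 5 (excellent). The abstract does not
    # specify a scale; this is an assumption for illustration.
    scores: Dict[str, float]

    def overall(self) -> float:
        """Unweighted mean over the named metrics (an assumed aggregation)."""
        missing = [m for m in NAMED_METRICS if m not in self.scores]
        if missing:
            raise ValueError(f"missing metric scores: {missing}")
        return mean(self.scores[m] for m in NAMED_METRICS)


def system_average(sources: List[SourceScore]) -> float:
    """Average overall score across all sources cited by one system."""
    return mean(s.overall() for s in sources)
```

A system under evaluation would then be summarized by calling `system_average` on the list of `SourceScore` records built from the sources it cited. A weighted mean or per-metric reporting would be equally plausible choices.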

Article Summaries:

Researchers have released SourceBench, a benchmark that evaluates the quality of web sources cited by large language models (LLMs). The benchmark covers 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents, and scores sources with an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals such as freshness, authority/accountability, and clarity. It includes a human-labeled dataset and a calibrated LLM-based evaluator that closely matches expert judgments. Evaluating eight LLMs, Google Search, and three AI search tools over 3,996 cited sources, the authors report four key insights intended to guide future research on generative AI and web search.

Sources:

https://arxiv.org/abs/2602.16942