LLM-WikiRace: Benchmarking Long-term Planning and Reasoning over Real-World Knowledge Graphs

Computer Science > Artificial Intelligence [Submitted on 18 Feb 2026]

Abstract: We introduce LLM-WikiRace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-WikiRace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point; beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering.
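The step-by-step navigation task described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation harness: the link graph `TOY_LINKS`, the function names, the `max_steps` budget, and the `first_unvisited` policy (a stand-in for a model's link choice) are all hypothetical and chosen only for illustration. A breadth-first search gives the minimum number of clicks for comparison.

```python
from collections import deque

# Hypothetical toy link graph; page names and links are illustrative only.
TOY_LINKS = {
    "Coffee": ["Caffeine", "Brazil"],
    "Caffeine": ["Chemistry", "Coffee"],
    "Brazil": ["South America", "Coffee"],
    "South America": ["Andes", "Brazil"],
    "Chemistry": ["Caffeine"],
    "Andes": [],
}


def shortest_path_length(links, source, target):
    """Breadth-first search: minimum number of clicks, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in links.get(page, []):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def play_game(links, source, target, choose_link, max_steps=10):
    """Play one game; choose_link stands in for the model's decision."""
    path = [source]
    current = source
    for _ in range(max_steps):
        if current == target:
            return True, path
        options = links.get(current, [])
        if not options:
            return False, path  # dead end
        choice = choose_link(current, options, target, path)
        if choice not in options:
            return False, path  # invalid move ends the game
        current = choice
        path.append(current)
    return current == target, path


def first_unvisited(current, options, target, path):
    """Naive policy: click the target if visible, else the first unvisited link."""
    if target in options:
        return target
    for option in options:
        if option not in path:
            return option
    return options[0]  # everything visited: falls back and may loop


optimal = shortest_path_length(TOY_LINKS, "Coffee", "Andes")
success, path = play_game(TOY_LINKS, "Coffee", "Andes", first_unvisited)
print(optimal, success, path)
```

On this toy graph the shortest route takes three clicks (Coffee, Brazil, South America, Andes), while the naive policy commits to the Caffeine branch and then bounces between Caffeine and Chemistry until the step budget runs out. That is a small-scale analogue of the looping behaviour the abstract reports, not a reproduction of it.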

Article Summaries:

A new benchmark, LLM-WikiRace, tests large language models on long-term planning and reasoning by having them navigate Wikipedia hyperlinks from a source page to a target page. The task requires look-ahead planning and real-world knowledge of how concepts connect. Open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, perform well on the easy level, even surpassing human scores. On the hard level, however, performance drops sharply: Gemini-3, the best-performing model, succeeds in only 23% of games. Analysis shows that world knowledge helps only up to a point; beyond it, planning and long-horizon reasoning become the main bottlenecks, and models often fall into loops instead of replanning after a failure. The benchmark and leaderboard are publicly available.
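One simple way to quantify the failure to replan noted above is to count how often a trajectory returns to a page it has already visited. The sketch below is an assumption about how such a statistic could be computed, not the paper's trajectory-level analysis; the function name `revisit_stats` and the example trajectory are hypothetical.

```python
def revisit_stats(path):
    """Summarise a trajectory given as a list of page titles in visit order."""
    seen = set()
    revisits = 0
    for page in path:
        if page in seen:
            revisits += 1  # the agent came back to an earlier page
        seen.add(page)
    return {
        "steps": max(len(path) - 1, 0),  # number of clicks taken
        "unique_pages": len(seen),
        "revisits": revisits,
    }


# Hypothetical trajectory that oscillates between two pages.
trajectory = ["Coffee", "Caffeine", "Chemistry", "Caffeine", "Chemistry"]
print(revisit_stats(trajectory))
```

A high revisit count relative to the number of steps suggests the agent is cycling rather than exploring new routes toward the target.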

Sources:

https://arxiv.org/abs/2602.16902