The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

Computer Science > Artificial Intelligence [Submitted on 19 Feb 2026]

Abstract: Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles (given a Python function that returns a boolean, find inputs that make it return True) to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other.
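To make the Programming Puzzle format concrete, here is a minimal sketch of what a puzzle and its automatic check might look like. The specific puzzle, the function names `puzzle` and `verify`, and the choice to treat exceptions as failures are illustrative assumptions and are not taken from the paper.

```python
from typing import Any, Callable


def puzzle(x: int) -> bool:
    # Hypothetical puzzle: find a positive integer whose square is 1369.
    return x > 0 and x * x == 1369


def verify(puzzle_fn: Callable[[Any], bool], candidate: Any) -> bool:
    """Return True only if the puzzle accepts the candidate.

    Assumption: a candidate that makes the puzzle raise counts as a failure.
    """
    try:
        return puzzle_fn(candidate) is True
    except Exception:
        return False


if __name__ == "__main__":
    print(verify(puzzle, 37))   # a proposed solution that satisfies the puzzle
    print(verify(puzzle, -37))  # squares to 1369 but fails the x > 0 condition
```

Because checking a solution only requires running the function, no human grader is needed: the puzzle itself acts as the answer key.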

Article Summaries:

Summary

Researchers introduced “The Token Games” (TTG), a framework for assessing large language model (LLM) reasoning without human-crafted questions. In TTG, models generate and solve programming-style puzzles (Python functions that return a boolean), challenging each other in pairwise duels. Solutions are verified automatically, and Elo ratings computed from the duel results rank the models. An evaluation of ten leading LLMs produced a ranking that closely matches existing benchmarks such as Humanity’s Last Exam, yet required no human effort to create puzzles. The study also found that generating effective puzzles remains difficult for current models, highlighting TTG’s potential to test reasoning, creativity, and problem-creation skills in a self-sustaining manner.
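As a rough illustration of how duel outcomes can be turned into a ranking, the sketch below applies the standard Elo update to a single duel. The K-factor of 32, the starting rating of 1000, and the function names are assumptions for illustration; the summary does not state which Elo variant or parameters the paper uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return updated ratings after one duel.

    score_a is 1.0 if A wins, 0.0 if A loses, and 0.5 for a draw.
    The K-factor of 32 is an assumed value, not one taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two models start at an assumed rating of 1000; model A wins the duel.
    print(update_elo(1000.0, 1000.0, 1.0))
```

The update is zero-sum: whatever one model gains the other loses, so ratings only carry meaning relative to each other, which matches the abstract's framing of comparing models against one another.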

Sources:

https://arxiv.org/abs/2602.17831