Advancing AI benchmarking with Game Arena

• DeepMind expands Game Arena, adding Werewolf and poker to benchmark AI beyond perfect-information games.
• Werewolf tests social deduction, communication, and deception handling in AI agents.
• Poker benchmark evaluates risk management, uncertainty quantification, and strategic decision-making.
• Live tournaments on Kaggle showcase top models’ performance across chess, Werewolf, and poker.
• These games simulate real-world social dynamics, aiding safer AI development.
• Researchers can now benchmark AI’s adaptability to imperfect information and human-like interaction.

Article Summaries:

Google DeepMind has expanded its public AI benchmarking platform, Game Arena, to include new games that test models beyond perfect-information scenarios. After launching a chess benchmark last year to gauge strategic reasoning and long-term planning, the team added Werewolf, a natural-language, team-based social-deduction game that requires navigating imperfect information, deception, and negotiation, along with poker, which tests risk management and decision-making under uncertainty. The platform now hosts chess, Werewolf, and poker, allowing researchers to track the progress of models such as Gemini 3 Pro and Gemini 3 Flash and to evaluate agentic safety and soft-skill capabilities in controlled, sandboxed environments.

Sources:

https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/