Link: http://arxiv.org/abs/2504.12562v1
PDF Link: http://arxiv.org/pdf/2504.12562v1
Summary: Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations - methods that often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework.
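As a rough illustration of the competition-based protocol the abstract describes, the sketch below pits two stand-in "model" policies against each other in a toy zero-sum game and tracks head-to-head strength with a standard Elo update. All names (play_rps, update_elo, model_x, model_y) are hypothetical and do not reflect the actual ZeroSumEval API; see the released repository for the real framework.

```python
# Hypothetical sketch of a competition-based, zero-sum evaluation loop.
# The functions and player names below are illustrative only and are NOT
# part of the ZeroSumEval codebase.
import random
from typing import Callable, Dict

Player = Callable[[str], str]  # maps a game prompt to a move


def play_rps(player_a: Player, player_b: Player) -> int:
    """Play one round of rock-paper-scissors as a stand-in zero-sum game.

    Returns +1 if player_a wins, -1 if player_b wins, 0 for a draw.
    """
    beats = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
    move_a = player_a("Choose rock, paper, or scissors.")
    move_b = player_b("Choose rock, paper, or scissors.")
    if move_a == move_b:
        return 0
    return 1 if beats.get(move_a) == move_b else -1


def update_elo(ratings: Dict[str, float], a: str, b: str,
               score_a: float, k: float = 32.0) -> None:
    """Standard Elo update: ratings shift toward observed match outcomes."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
    ratings[a] += k * (score_a - expected_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))


if __name__ == "__main__":
    # Stand-in "models": random policies with different move biases.
    model_x: Player = lambda _: random.choice(["rock", "paper", "scissors"])
    model_y: Player = lambda _: random.choice(["rock", "rock", "paper", "scissors"])

    ratings = {"model_x": 1000.0, "model_y": 1000.0}
    for _ in range(1000):  # many repeated simulations, as in large-scale runs
        outcome = play_rps(model_x, model_y)
        score_x = {1: 1.0, 0: 0.5, -1: 0.0}[outcome]
        update_elo(ratings, "model_x", "model_y", score_x)
    print(ratings)
```

Because rankings emerge from repeated head-to-head play rather than a fixed answer key, this style of evaluation remains informative even as models improve, which is the saturation-resistance property the abstract highlights.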
To demonstrate this, we conduct extensive experiments with >7000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and generally fail at tasks requiring creativity.
We release our code at https://github.com/facebookresearch/ZeroSumEval.
Published on arXiv on: 2025-04-17T01:23:50Z