Why AI leaderboards are inaccurate and how to fix them

By Simon Osuji | July 29, 2025 | Artificial Intelligence


Online leaderboards evaluate AI models by asking people to rate the generated content in head-to-head comparisons, in what the researchers call an “LLM Smackdown.” A faulty ranking system could give a model the championship belt for the wrong reasons. Credit: Generated by Google Gemini 2.5 Flash and edited by Derek Smith

Faulty ranking mechanisms in AI leaderboards can be corrected using approaches evaluated at the University of Michigan.


In their study, U-M researchers assessed four ranking methods used in popular online AI leaderboards, such as Chatbot Arena, as well as in sporting and gaming leaderboards. They found that the choice and implementation of a ranking method can yield different results, even on the same crowdsourced dataset of model performance. From these results, the researchers developed guidelines to help leaderboards better represent AI models' true performance.

“Large companies keep announcing newer and larger gen AI models, but how do you know which model is truly the best if your evaluation methods aren’t accurate or well studied?” said Lingjia Tang, associate professor of computer science and engineering and a co-corresponding author of the study.

“Society is increasingly interested in adopting this technology. To do that effectively, we need robust methods to evaluate AI for a variety of use cases. Our study identifies what makes an effective AI ranking system, and provides guidelines on when and how to use them.”

Gen AI models are difficult to evaluate because judgments of AI-generated content can be subjective. Some leaderboards measure how accurately AI models perform specific tasks, such as answering multiple-choice questions, but those leaderboards don't assess how well an AI creates diverse content with no single right answer.

To evaluate more open-ended output, other leaderboards, such as the popular Chatbot Arena, ask people to rate the generated content in head-to-head comparisons, in what the researchers call an "LLM Smackdown." The human contributors blindly submit a prompt to two random AI models, then record their preferred answer in the leaderboard's database, which feeds into the ranking system.
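A single head-to-head vote in such a dataset can be sketched as a simple record (the field names here are hypothetical; real leaderboard schemas differ):

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    """One blind head-to-head vote: the user sees two anonymous answers
    to the same prompt and records which model they preferred."""
    prompt: str
    model_a: str
    model_b: str
    winner: str  # "a", "b", or "tie"

# A user preferred model_a's answer to this prompt.
vote = Comparison(prompt="Summarize this article.",
                  model_a="model-x", model_b="model-y", winner="a")
```

A ranking system then aggregates many such votes into a single ordering of models.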

But the rankings can depend on how those systems are implemented. Chatbot Arena once used a ranking system called Elo, which is also commonly used to rank chess players and athletes. Elo has parameters that control how drastically a win or a loss changes the leaderboard's rankings, and how that impact changes with the player's or model's age. In theory, these features make a ranking system more flexible, but the proper settings for evaluating AI aren't always obvious.
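As a minimal sketch of the classic Elo formula (not Chatbot Arena's exact implementation), the K-factor is the user-defined setting that controls how much a single result moves the ratings:

```python
def elo_update(r_winner, r_loser, k=32):
    """One Elo update after a single game; k sets how drastically
    one win or loss shifts the two ratings."""
    # Probability the eventual winner was expected to win, given the gap.
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# Two models start equal; model A wins one comparison.
a, b = elo_update(1500, 1500, k=32)  # a == 1516.0, b == 1484.0
```

Doubling `k` doubles the swing from that one game, which is exactly the kind of configuration choice the researchers found can reshape a leaderboard.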

Different ranking algorithms can produce different rankings with the same human evaluation data, making it difficult to determine which algorithm is appropriate for various use cases. Credit: Roland Daynauth et al.

“In chess and sports matches, there’s a logical order of games that proceed as the players’ skills change over their careers. But AI models don’t change between releases, and they can instantly and simultaneously play many games,” said Roland Daynauth, U-M doctoral student in computer science and engineering and the study’s first author.

To help prevent accidental misuse, the researchers evaluated each ranking system by feeding it a portion of two crowdsourced datasets of AI model performance: one from Chatbot Arena and another the researchers had previously collected. They then checked how accurately the resulting rankings matched the win rates in a withheld portion of the datasets.
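That validation step can be sketched as follows, with a tiny hypothetical dataset; for simplicity, a model's score here is just its training win rate, standing in for the rating systems the study actually tested:

```python
def win_rate(games, model):
    """Fraction of a model's games that it won; games are (winner, loser) pairs."""
    played = [g for g in games if model in g]
    return sum(1 for winner, _ in played if winner == model) / len(played)

# Crowdsourced (winner, loser) votes, split into a training and a withheld portion.
train = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C")]
held_out = [("A", "B"), ("B", "C"), ("A", "C")]

models = ["A", "B", "C"]
rank_train = sorted(models, key=lambda m: win_rate(train, m), reverse=True)
rank_test = sorted(models, key=lambda m: win_rate(held_out, m), reverse=True)
assert rank_train == rank_test  # the ranking generalizes to the withheld votes
```

A ranking system that reorders the models on the withheld portion is failing the same check the researchers applied at scale.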

They also checked to see how sensitive each system’s rankings were to user-defined settings, and whether the rankings followed the logic of all the pairwise comparisons: If A beats B, and B beats C, then A must be ranked higher than C.
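One way to sketch such a consistency check: flag any ranking that places a model below an opponent it beats in the majority of their head-to-head games (an illustrative helper, not the study's exact procedure):

```python
from itertools import permutations

def violates_transitivity(wins, ranking):
    """Return True if the ranking contradicts the pairwise results:
    whenever A beats B more often than B beats A, A should rank higher.
    `wins[(a, b)]` counts games a won against b."""
    pos = {m: i for i, m in enumerate(ranking)}  # lower index = higher rank
    for a, b in permutations(ranking, 2):
        if wins.get((a, b), 0) > wins.get((b, a), 0) and pos[a] > pos[b]:
            return True
    return False

# A beats B, B beats C, yet one ranking puts C first: inconsistent.
wins = {("A", "B"): 3, ("B", "C"): 3}
violates_transitivity(wins, ["C", "A", "B"])  # True
violates_transitivity(wins, ["A", "B", "C"])  # False
```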

They found that Glicko, a ranking system used in e-sports, tends to produce the most consistent results, especially when the number of comparisons per model is uneven. Other ranking systems, such as the Bradley-Terry system that Chatbot Arena implemented in December 2023, could also be accurate, but only when each model had a comparable number of comparisons. With too few games, such a system could allow a newer model to appear stronger than is warranted.

“Just because a model comes onto the scene and beats a grandmaster doesn’t necessarily mean it’s the best model. You need many, many games to know what the truth is,” said Jason Mars, U-M associate professor of computer science and engineering and a co-corresponding author of the study.

In contrast, the rankings made by the Elo system, as well as the Markov chains used by Google to rank pages in a web search, were highly dependent on how users configured the system. The Bradley-Terry system lacks user-defined settings, so it could be the best option for large datasets with a comparable number of comparisons for each AI.

“There’s no single right answer, so hopefully our analysis will help guide how we evaluate the AI industry moving forward,” Tang said.

More information:
Roland Daynauth et al. Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat: aclanthology.org/2025.acl-long.1265/

Provided by
University of Michigan

Citation:
Why AI leaderboards are inaccurate and how to fix them (2025, July 29)
retrieved 29 July 2025
from https://techxplore.com/news/2025-07-ai-leaderboards-inaccurate.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.




