
Why AI leaderboards are inaccurate and how to fix them

By Simon Osuji
July 29, 2025
in Artificial Intelligence


Online leaderboards evaluate AI models by asking people to rate the generated content in head-to-head comparisons, in what the researchers call an “LLM Smackdown.” A faulty ranking system could give a model the championship belt for the wrong reasons. Credit: Generated by Google Gemini 2.5 Flash and edited by Derek Smith

Faulty ranking mechanisms used in AI leaderboards can be overcome through approaches evaluated at the University of Michigan.


In their study, U-M researchers assessed the performance of four ranking methods used in popular online AI leaderboards, such as Chatbot Arena, as well as other sporting and gaming leaderboards. They found that the type and implementation of a ranking method can yield different results, even with the same crowdsourced dataset of model performance. From their results, the researchers developed guidelines for leaderboards to represent the AI models’ true performance.

“Large companies keep announcing newer and larger gen AI models, but how do you know which model is truly the best if your evaluation methods aren’t accurate or well studied?” said Lingjia Tang, associate professor of computer science and engineering and a co-corresponding author of the study.

“Society is increasingly interested in adopting this technology. To do that effectively, we need robust methods to evaluate AI for a variety of use cases. Our study identifies what makes an effective AI ranking system, and provides guidelines on when and how to use them.”

Gen AI models are difficult to evaluate because judgments of AI-generated content can be subjective. Some leaderboards evaluate how accurately AI models perform specific tasks, such as answering multiple-choice questions, but those leaderboards don't assess how well an AI creates diverse content that has no single right answer.

To evaluate more open-ended output, other leaderboards, such as the popular Chatbot Arena, ask people to rate the generated content in head-to-head comparisons, in what the researchers call an "LLM Smackdown." The human contributors blindly submit a prompt to two random AI models, then record their preferred answer in the leaderboard's database, which is fed into the ranking system.

But the rankings can depend on the implementation of the systems. Chatbot Arena once used a ranking system called Elo, which is also commonly used to rank chess players and athletes. It has settings that allow users to set how drastically a win or a loss changes the leaderboard’s rankings, and how that impact changes based on the player or model’s age. In theory, these features allow a ranking system to be more flexible, but the proper settings for evaluating AI aren’t always obvious.
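The kind of setting at issue can be seen in a minimal sketch of the Elo update rule (a generic textbook version, not Chatbot Arena's actual code): the K-factor decides how far a single win or loss moves a rating, and choosing it for AI models is exactly the non-obvious configuration the researchers describe.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update. score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    The K-factor k controls how drastically a single result moves the ratings."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # predicted win probability for A
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Two equally rated models: a single win moves A up by k/2 points.
a, b = elo_update(1000, 1000, 1.0, k=32)  # -> (1016.0, 984.0)
```

With a large K, one upset swings the leaderboard sharply; with a small K, new models climb slowly regardless of quality. That is why the "proper settings" question matters.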

Different ranking algorithms can produce different rankings with the same human evaluation data, making it difficult to determine which algorithm is appropriate for various use cases. Credit: Roland Daynauth et al.

“In chess and sport matches, there’s a logical order of games that proceed as the players’ skills change over their careers. But AI models don’t change between releases, and they can instantly and simultaneously play many games,” said Roland Daynauth, U-M doctoral student in computer science and engineering and the study’s first author.

To help prevent accidental misuse, the researchers evaluated each rating system by feeding it a portion of two crowdsourced datasets of AI model performance—one from Chatbot Arena and another previously collected by the researchers. They then checked how accurately each system's rankings matched the win rates in a withheld portion of the datasets.
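That protocol can be illustrated with a simplified sketch (the function names and toy data here are assumptions, not the study's code): fit a ranking on a training split of pairwise results, then measure how often it predicts the winner of held-out comparisons.

```python
from collections import defaultdict

def win_rates(matches):
    """matches: list of (winner, loser) pairs -> win rate per model."""
    wins, games = defaultdict(int), defaultdict(int)
    for w, l in matches:
        wins[w] += 1
        games[w] += 1
        games[l] += 1
    return {m: wins[m] / games[m] for m in games}

def heldout_accuracy(train, held_out, rank_fn):
    """Fraction of held-out matches whose winner the fitted ranking scores higher."""
    scores = rank_fn(train)
    decided = [(w, l) for w, l in held_out if scores.get(w, 0) != scores.get(l, 0)]
    correct = sum(scores[w] > scores[l] for w, l in decided)
    return correct / len(decided) if decided else float("nan")

train = [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 2
held_out = [("A", "B"), ("A", "C"), ("C", "B")]
acc = heldout_accuracy(train, held_out, win_rates)  # 2 of 3 held-out winners predicted
```

Any of the four ranking systems in the study could be dropped in as `rank_fn`; the held-out check is what exposes how differently they behave on the same data.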

They also checked to see how sensitive each system’s rankings were to user-defined settings, and whether the rankings followed the logic of all the pairwise comparisons: If A beats B, and B beats C, then A must be ranked higher than C.

They found that Glicko, a ranking system used in e-sports, tends to produce the most consistent results, especially when the number of comparisons per model is uneven. Other ranking systems—such as the Bradley-Terry system that Chatbot Arena implemented in December 2023—could also be accurate, but only when each model had an equal number of comparisons. With uneven comparison counts, such a system could allow a newer model to appear stronger than is warranted.

“Just because a model comes onto the scene and beats a grandmaster doesn’t necessarily mean it’s the best model. You need many, many games to know what the truth is,” said Jason Mars, U-M associate professor of computer science and engineering and a co-corresponding author of the study.

In contrast, the rankings made by the Elo system, as well as the Markov chains used by Google to rank pages in a web search, were highly dependent on how users configured the system. The Bradley-Terry system lacks user-defined settings, so it could be the best option for large datasets with an equal number of comparisons for each AI.
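The "no user-defined settings" point can be made concrete with a sketch of a standard Bradley-Terry fit via the classic minorize-maximize (Zermelo) iteration. This is a generic textbook fit under the model p(i beats j) = s_i / (s_i + s_j), not Chatbot Arena's implementation, and the names are illustrative.

```python
def bradley_terry(matches, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.
    Unlike Elo, there is no K-factor or other tunable setting: the
    strengths are a maximum-likelihood fit to the comparison data."""
    models = {m for pair in matches for m in pair}
    wins = {m: 0 for m in models}
    pair_games = {}  # number of games per unordered pair
    for w, l in matches:
        wins[w] += 1
        key = frozenset((w, l))
        pair_games[key] = pair_games.get(key, 0) + 1

    s = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            # MM update: s_i = wins_i / sum over opponents j of n_ij / (s_i + s_j)
            denom = sum(n / (s[i] + s[next(iter(key - {i}))])
                        for key, n in pair_games.items() if i in key)
            new[i] = wins[i] / denom if denom else s[i]
        total = sum(new.values())  # normalize so strengths keep a fixed scale
        s = {m: v * len(new) / total for m, v in new.items()}
    return s

# A beats B in 3 of 4 games: the fitted model recovers a 75% win probability.
s = bradley_terry([("A", "B")] * 3 + [("B", "A")])
```

Because the fit is fully determined by the data, two people running it on the same comparisons get the same ranking—but, as the study notes, the fit is only trustworthy when every model has comparable amounts of data.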

“There’s no single right answer, so hopefully our analysis will help guide how we evaluate the AI industry moving forward,” Tang said.

More information:
Roland Daynauth et al., "Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat," ACL 2025: aclanthology.org/2025.acl-long.1265/

Provided by
University of Michigan

Citation:
Why AI leaderboards are inaccurate and how to fix them (2025, July 29)
retrieved 29 July 2025
from https://techxplore.com/news/2025-07-ai-leaderboards-inaccurate.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.




