Chatbots powered by large language models (LLMs) seem to be everywhere, from customer service to coding assistance. But how do we know if they’re safe to use?
MLCommons, a non-profit focused on artificial intelligence benchmarks, believes it has an answer. On 4 December, it released the first iteration of AILuminate, a trust and safety benchmark built to gauge the performance of cutting-edge LLMs. While machine learning researchers have used varying metrics to judge AI safety for years, AILuminate is the first third-party LLM benchmarkdeveloped as a collaboration between industry experts and AI researchers.
The benchmark measures safety in the context of potential harm to users. It tests LLMs with prompts users might send to a chatbot and judges the response by whether it could support the user in harming themselves or others, a problem that became all too real in 2024. (And according to a report released last week, leading AI companies have failing grades when it comes to their risk assessment and safety procedures.)
“AI is at a state where it produces lots of exciting research, and some scary headlines,” says Peter Mattson, president of ML Commons. “People are trying to get to a new state where AI delivers a lot of value through products and services, but they need very high reliability and very low risk. That requires we learn to measure safety.”
A Big Swing at a Hard Problem
In April 2024, IEEE Spectrum published a letter from the MLCommons AI Safety Working Group. It laid out the goals of the group, which formed in 2023, and was published in tandem with an early version of the “AI Safety Benchmark,” now called AILuminate. The AI Safety Working Group’s contributors include representatives from many of the largest AI companies including Nvidia, OpenAI, and Anthropic.
In practice, it’s difficult to determine what it means for a chatbot to be safe, as opinions on what makes for an inappropriate or dangerous response can vary. Because of that, the safety benchmarks currently released alongside LLMs typical cite internally developed tests that make their own judgements on what qualifies as dangerous. The lack of an industry-standard benchmark in turn makes it difficult to know which model truly performs better.
“Benchmarks push research and the state of the art forward,” says Henriette Cramer, co-founder of AI risk management company Papermoon.ai. While Cramer says benchmarks are useful, she cautioned that AI safety benchmarks are notoriously difficult to get right. “You need to understand what is being measured by each benchmark, what isn’t, and when they are appropriate to use.”
How AILuminate Works
AILuminate’s attempt to create an industry standard benchmark begins by dividing hazards into 12 types across three categories: physical (such as violent and sexual crimes), non-physical (such as fraud or hate speech), and contextual (such as adult content).
The benchmarkthen judges an LLM by testing it with 12,000 custom, unpublished prompts focused on the defined hazards. (MLCommons keeps the prompts private so companies can’t train their LLMs on these to score better.) The replies are fed to a “safety evaluator model” that decides if the response was acceptable or unacceptable. Example prompts, and what determines an acceptable or unacceptable response, are detailed in AILuminate’s Assessment Standard documentation. Although the judgement made on any given prompt is binary—either acceptable or unacceptable—the benchmark’s overall evaluation is relative.
Four of the benchmark’s five grades, which range from “Poor” to “Excellent,” are reached by comparing an AI model’s results to a “reference model” derived from the two best scoring open-weights models with fewer than 15 billion parameters. (These are currently Gemma 2 9B and Llama 3.1-8B, but Mattson says that will change in future benchmark updates as open models that perform better in safety appear.)
A model that achieves a grade of “Very Good,” for example, has “less than 0.5 [times] as many violating responses as the reference system.” Only the highest grade, “Excellent,” sets a fixed bar at less than 0.1 percent “violating” responses—a standard current models are far from achieving. While the benchmark provides an overall score, it also provides specific scores for each measured hazard.
AILuminate scores LLMs on a range from “poor” to “excellent.”AILuminate
Mattson says a relative grading system is used to ensure the benchmark remains relevant and encourages improvement over time. “If it was too easy, it would look like an industry whitewash. If it was too hard, it would be like setting an automotive crash standard that you have to hit a wall at 200 miles per hour without more than a scratch. We’d all love that car, but we can’t build it yet.”
The benchmark’s initial rankings judged Anthropic’s Claude 3.5 Haiku and Sonnet as “Very Good,” while GPT4-o received a score of “Good,” and Mistral 8B received a score of “Fair.”
A New Standard?
While the first version of AILuminate is now available, MLCommons sees this as the beginning of the venture. AILuminate will not only be used to test new models, but will itself evolve with those models over time.
“We haven’t set the exact update cadence, but I think quarterly is not unreasonable,” says Mattson. “Initially, we’re going to update even a little faster to deliver functionality. For instance, we need multilingual support, so the next item on the roadmap is to add support for French.” MLCommons also plans to add support for Chinese and Hindi in 2025.
These updates separate AILuminate from most efforts to create broad AI safety benchmarks. Other benchmarks, such as ALERT and AgentHarm, were also published in 2024. But while these have received attention, they’re not yet widely used and lack a clear update roadmap.
MLCommons expects wider adoption of AILuminate, as it has the benefit of broader industry support through its AI Safety Working Group. However, the true test will be whether companies integrate AILuminate into their own internal testing and, perhaps more importantly, public messaging and marketing.
Currently, the documentation released with new models often refer to internal tests, which aren’t directly comparable. If the companies creating LLMs begin to release AILuminate scores on the day the LLM is released, that will be a positive sign for the benchmark.
In any case, Cramer says the release of benchmarks like AILuminate is a positive for the industry—not only because of the benchmark itself, but also because it encourages those in AI trust and safety to learn and improve.
“Across research and industry, in many areas there still is a gap between pressing concerns, and the practical methods to assess and address them,” says Cramer. “What is especially helpful about these types of benchmarking efforts is that practitioners and researchers from different professional communities come together and exchange their lessons learned.”
From Your Site Articles
Related Articles Around the Web