One of the management guru Peter Drucker’s most over-quoted turns of phrase is “what gets measured gets improved.” But it’s over-quoted for a reason: It’s true.
Nowhere is it truer than in technology over the past 50 years. Moore’s law, which predicts that the number of transistors (and hence compute capacity) in a chip will double every 24 months, has become a self-fulfilling prophecy and north star for an entire ecosystem. Because engineers carefully measured each generation of manufacturing technology for new chips, they could select the techniques that would move toward the goals of faster and more capable computing. And it worked: Computing power, and more impressively computing power per watt or per dollar, has grown exponentially in the past five decades. The latest smartphones are more powerful than the fastest supercomputers from the year 2000.
Measurement of performance, though, is not limited to chips. All the parts of our computing systems today are benchmarked—that is, compared to similar components in a controlled way, with quantitative score assessments. These benchmarks help drive innovation.
And we would know.
As leaders in the field of AI, from both industry and academia, we build and deliver the most widely used performance benchmarks for AI systems in the world. MLCommons is a consortium that came together in the belief that better measurement of AI systems will drive improvement. Since 2018, we’ve developed performance benchmarks for systems that have shown more than 50-fold improvements in the speed of AI training. In 2023, we launched our first performance benchmark for large language models (LLMs), measuring the time it took to train a model to a particular quality level; within 5 months we saw repeatable results of LLMs improving their performance nearly threefold. Simply put, good open benchmarks can propel the entire industry forward.
We need benchmarks to drive progress in AI safety
Even as the performance of AI systems has raced ahead, we’ve seen mounting concern about AI safety. While AI safety means different things to different people, we define it as preventing AI systems from malfunctioning or being misused in harmful ways. For instance, AI systems without safeguards could be misused to support criminal activity such as phishing or creating child sexual abuse material, or could scale up the propagation of misinformation or hateful content. In order to realize the potential benefits of AI while minimizing these harms, we need to drive improvements in safety in tandem with improvements in capabilities.
We believe that if AI systems are measured against common safety objectives, those AI systems will get safer over time. However, how to robustly and comprehensively evaluate AI safety risks—and also track and mitigate them—is an open problem for the AI community.
Safety measurement is challenging because of the many different ways that AI models are used and the many aspects that need to be evaluated. And safety is inherently subjective, contextual, and contested: unlike with objective measurement of hardware speed, there is no single metric that all stakeholders agree on for all use cases. Often the tests and metrics that are needed depend on the use case. For instance, the risks that accompany an adult asking for financial advice are very different from the risks of a child asking for help writing a story. Defining “safety concepts” is the key challenge in designing benchmarks that are trusted across regions and cultures, and we’ve already taken the first steps toward defining a standardized taxonomy of harms.
A further problem is that benchmarks can quickly become irrelevant if not updated, which is challenging for AI safety given how rapidly new risks emerge and model capabilities improve. Models can also “overfit”: they do well on the benchmark data they use for training, but perform badly when presented with different data, such as the data they encounter in real deployment. Benchmark data can even end up (often accidentally) being part of models’ training data, compromising the benchmark’s validity.
Our first AI safety benchmark: the details
To help solve these problems, we set out to create a set of benchmarks for AI safety. Fortunately, we’re not starting from scratch: we can draw on knowledge from other academic and private efforts that came before. By combining best practices in the context of a broad community and a proven benchmarking non-profit organization, we hope to create a widely trusted standard approach that is dependably maintained and improved to keep pace with the field.
Our first AI safety benchmark focuses on large language models. We released a v0.5 proof-of-concept (POC) today, 16 April 2024. This POC validates the approach we are taking towards building the v1.0 AI Safety benchmark suite, which will launch later this year.
What does the benchmark cover? We decided to first create an AI safety benchmark for LLMs because language is the most widely used modality for AI models. Our approach is rooted in the work of practitioners, and is directly informed by the social sciences. For each benchmark, we will specify the scope, the use case, persona(s), and the relevant hazard categories. To begin with, we are using a generic use case of a user interacting with a general-purpose chat assistant, speaking in English and living in Western Europe or North America.
There are three personas: malicious users, vulnerable users such as children, and typical users, who are neither malicious nor vulnerable. While we recognize that many people speak other languages and live in other parts of the world, we have pragmatically chosen this use case due to the prevalence of existing material. This approach means that we can make grounded assessments of safety risks, reflecting the likely ways that models are actually used in the real world. Over time, we will expand the number of use cases, languages, and personas, as well as the hazard categories and number of prompts.
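One way to picture the specification described above is as a small, immutable record. The sketch below is illustrative only: the class name `BenchmarkSpec`, its field names, and the constant `V05_SPEC` are hypothetical and not taken from the MLCommons tooling. The hazard categories listed are just the examples this article names, not the full set.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BenchmarkSpec:
    """Hypothetical container for the scope of one safety benchmark."""
    use_case: str
    language: str
    regions: Tuple[str, ...]
    personas: Tuple[str, ...]
    hazard_categories: Tuple[str, ...]


# Values taken from the article's description of the v0.5 proof of concept.
V05_SPEC = BenchmarkSpec(
    use_case="general-purpose chat assistant",
    language="English",
    regions=("Western Europe", "North America"),
    personas=("typical", "malicious", "vulnerable"),
    # Only the example categories named in the article, not the full list.
    hazard_categories=("violent crimes", "child abuse and exploitation", "hate"),
)

print(V05_SPEC.use_case, "-", ", ".join(V05_SPEC.personas))
```

Freezing the record reflects the idea that a published benchmark version has a fixed scope; expanding to new languages or personas would mean defining a new specification rather than mutating this one.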
What does the benchmark test for? The benchmark covers a range of hazard categories, including violent crimes, child abuse and exploitation, and hate. For each hazard category, we test different types of interactions where models’ responses can create a risk of harm. For instance, we test how models respond to users telling them that they are going to make a bomb—and also users asking for advice on how to make a bomb, whether they should make a bomb, or for excuses in case they get caught. This structured approach means we can test more broadly for how models can create or increase the risk of harm.
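The structured approach above can be thought of as a grid: every hazard category is crossed with every type of interaction. The following is a minimal sketch of that idea, not the benchmark's actual data model. The interaction labels and the `GridCell` class are hypothetical names chosen for illustration, and only the hazard categories named in this article are included.

```python
from dataclasses import dataclass
from itertools import product

# Example hazard categories named in the article (the real list is longer).
HAZARDS = ["violent crimes", "child abuse and exploitation", "hate"]

# Hypothetical labels for the interaction types the article describes.
INTERACTIONS = [
    "states_intent",     # the user says they are going to do something
    "asks_how",          # the user asks for advice on how to do it
    "asks_whether",      # the user asks whether they should do it
    "asks_for_excuses",  # the user asks for excuses in case they get caught
]


@dataclass(frozen=True)
class GridCell:
    hazard: str
    interaction: str


def build_grid(hazards, interactions):
    """Return one cell for every (hazard, interaction) combination."""
    return [GridCell(h, i) for h, i in product(hazards, interactions)]


grid = build_grid(HAZARDS, INTERACTIONS)
print(len(grid))  # 3 hazards x 4 interaction types = 12 cells
```

Each cell would then be populated with prompts, so that gaps in coverage show up as empty cells rather than going unnoticed.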
How do we actually test models? From a practical perspective, we test models by feeding them targeted prompts, collecting their responses, and then assessing whether they are safe or unsafe. Quality human ratings are expensive, often costing tens of dollars per response, and a comprehensive test set might have tens of thousands of prompts! A simple keyword- or rules-based rating system for evaluating the responses is affordable and scalable, but isn’t adequate when models’ responses are complex, ambiguous, or unusual. Instead, we’re developing a system that combines “evaluator models” (specialized AI models that rate responses) with targeted human rating to verify and augment these models’ reliability.
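The loop described above (prompt, response, rating, with humans checking a targeted subset) might be sketched roughly as follows. Everything here is an assumption made for illustration: the function `run_safety_test`, the `Rating` record, the idea that the evaluator returns a confidence score, and the `review_below` threshold are all hypothetical. The toy model and toy evaluator are trivial stand-ins so that the example runs; a real evaluator would itself be a specialized AI model.

```python
from typing import Callable, List, NamedTuple, Tuple


class Rating(NamedTuple):
    prompt: str
    response: str
    unsafe: bool
    needs_human_review: bool


def run_safety_test(
    prompts: List[str],
    model: Callable[[str], str],
    evaluator: Callable[[str, str], Tuple[bool, float]],
    review_below: float = 0.8,  # hypothetical confidence cut-off
) -> List[Rating]:
    """Prompt the model, rate each response, and flag uncertain ratings."""
    ratings = []
    for prompt in prompts:
        response = model(prompt)
        unsafe, confidence = evaluator(prompt, response)
        # Low-confidence ratings are routed to human raters.
        ratings.append(Rating(prompt, response, unsafe, confidence < review_below))
    return ratings


# Toy stand-ins, for demonstration only.
CANNED = {
    "prompt A": "I can't help with that.",
    "prompt B": "Here is some general information.",
    "prompt C": "Step one: ...",
}


def toy_model(prompt: str) -> str:
    return CANNED[prompt]


def toy_evaluator(prompt: str, response: str) -> Tuple[bool, float]:
    if response.startswith("I can't"):
        return (False, 0.95)  # clear refusal: safe, high confidence
    if response.startswith("Step"):
        return (True, 0.90)   # looks like instructions: unsafe
    return (False, 0.60)      # ambiguous: safe, but low confidence


ratings = run_safety_test(list(CANNED), toy_model, toy_evaluator)
unsafe_count = sum(r.unsafe for r in ratings)
review_count = sum(r.needs_human_review for r in ratings)
print(f"{unsafe_count} unsafe, {review_count} sent for human review")
```

Routing only the uncertain ratings to people is one plausible way to keep human rating "targeted"; other sampling strategies, such as random audits of confident ratings, would fit the same structure.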
How did we create the prompts? For v0.5, we constructed simple, clear-cut prompts that align with the benchmark’s hazard categories. This approach makes it easier to test for the hazards and helps expose critical safety risks in models. We are working with experts, civil society groups, and practitioners to create more challenging, nuanced, and niche prompts, as well as exploring methodologies that would allow for more contextual evaluation alongside ratings. We are also integrating AI-generated adversarial prompts to complement the human-generated ones.
How do we assess models? From the start, we agreed that the results of our safety benchmarks should be understandable for everyone. This means that our results have to both provide a useful signal for non-technical experts such as policymakers, regulators, researchers, and civil society groups who need to assess models’ safety risks, and also help technical experts make well-informed decisions about models’ risks and take steps to mitigate them. We are therefore producing assessment reports that contain “pyramids of information.” At the top is a single grade that provides a simple indication of overall system safety, like a movie rating or an automobile safety score. The next level provides the system’s grades for particular hazard categories. The bottom level gives detailed information on tests, test set provenance, and representative prompts and responses.
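The three-level "pyramid" above could be assembled from per-response ratings along these lines. This is a sketch under stated assumptions, not the benchmark's grading method: the grade labels, the numeric cut-offs in `THRESHOLDS`, and the rule that the overall grade equals the worst hazard grade are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Hypothetical cut-offs: (maximum unsafe rate, grade label), best grade first.
THRESHOLDS = [(0.001, "low risk"), (0.01, "moderate risk"), (1.0, "high risk")]
GRADE_ORDER = [label for _, label in THRESHOLDS]


def grade(unsafe_rate):
    """Map a fraction of unsafe responses to a grade label."""
    for limit, label in THRESHOLDS:
        if unsafe_rate <= limit:
            return label
    raise ValueError("unsafe_rate must be between 0 and 1")


def build_report(results):
    """results: iterable of (hazard, prompt, response, unsafe) tuples."""
    details = defaultdict(list)
    for hazard, prompt, response, unsafe in results:
        details[hazard].append(
            {"prompt": prompt, "response": response, "unsafe": unsafe}
        )

    # Middle level: one grade per hazard category.
    hazard_grades = {}
    for hazard, items in details.items():
        rate = sum(item["unsafe"] for item in items) / len(items)
        hazard_grades[hazard] = grade(rate)

    # Top level: assume the overall grade is the worst hazard grade.
    overall = max(hazard_grades.values(), key=GRADE_ORDER.index)

    return {"overall": overall, "hazards": hazard_grades, "details": dict(details)}


example_results = [
    ("hate", "p1", "r1", False),
    ("hate", "p2", "r2", False),
    ("violent crimes", "p3", "r3", True),
    ("violent crimes", "p4", "r4", False),
]
report = build_report(example_results)
print(report["overall"])
print(report["hazards"])
```

A reader who needs only the headline can stop at `report["overall"]`, while a technical reviewer can drill into `report["details"]` for individual prompts and responses.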
AI safety demands an ecosystem
The MLCommons AI safety working group is an open meeting of experts, practitioners, and researchers—we invite everyone working in the field to join our growing community. We aim to make decisions through consensus and welcome diverse perspectives on AI safety.
We firmly believe that for AI tools to reach full maturity and widespread adoption, we need scalable and trustworthy ways to ensure that they’re safe. We need an AI safety ecosystem, including researchers discovering new problems and new solutions, internal and for-hire testing experts to extend benchmarks for specialized use cases, auditors to verify compliance, and standards bodies and policymakers to shape overall directions. Carefully implemented mechanisms such as the certification models found in other mature industries will help inform AI consumer decisions. Ultimately, we hope that the benchmarks we’re building will provide the foundation for the AI safety ecosystem to flourish.
The following MLCommons AI safety working group members contributed to this article:
- Ahmed M. Ahmed, Stanford University
- Elie Alhajjar, RAND
- Kurt Bollacker, MLCommons
- Siméon Campos, Safer AI
- Canyu Chen, Illinois Institute of Technology
- Ramesh Chukka, Intel
- Zacharie Delpierre Coudert, Meta
- Tran Dzung, Intel
- Ian Eisenberg, Credo AI
- Murali Emani, Argonne National Laboratory
- James Ezick, Qualcomm Technologies, Inc.
- Marisa Ferrara Boston, Reins AI
- Heather Frase, CSET (Center for Security and Emerging Technology)
- Kenneth Fricklas, Turaco Strategy
- Brian Fuller, Meta
- Grigori Fursin, cKnowledge, cTuning
- Agasthya Gangavarapu, Ethriva
- James Gealy, Safer AI
- James Goel, Qualcomm Technologies, Inc.
- Roman Gold, The Israeli Association for Ethics in Artificial Intelligence
- Wiebke Hutiri, Sony AI
- Bhavya Kailkhura, Lawrence Livermore National Laboratory
- David Kanter, MLCommons
- Chris Knotz, Commn Ground
- Barbara Korycki, MLCommons
- Shachi Kumar, Intel
- Srijan Kumar, Lighthouz AI
- Wei Li, Intel
- Bo Li, University of Chicago
- Percy Liang, Stanford University
- Zeyi Liao, Ohio State University
- Richard Liu, Haize Labs
- Sarah Luger, Consumer Reports
- Kelvin Manyeki, Bestech Systems
- Joseph Marvin Imperial, University of Bath, National University Philippines
- Peter Mattson, Google, MLCommons, AI Safety working group co-chair
- Virendra Mehta, University of Trento
- Shafee Mohammed, Project Humanit.ai
- Protik Mukhopadhyay, Protecto.ai
- Lama Nachman, Intel
- Besmira Nushi, Microsoft Research
- Luis Oala, Dotphoton
- Eda Okur, Intel
- Praveen Paritosh
- Forough Poursabzi, Microsoft
- Eleonora Presani, Meta
- Paul Röttger, Bocconi University
- Damian Ruck, Advai
- Saurav Sahay, Intel
- Tim Santos, Graphcore
- Alice Schoenauer Sebag, Cohere
- Vamsi Sistla, Nike
- Leonard Tang, Haize Labs
- Ganesh Tyagali, NStarx AI
- Joaquin Vanschoren, TU Eindhoven, AI Safety working group co-chair
- Bertie Vidgen, MLCommons
- Rebecca Weiss, MLCommons
- Adina Williams, FAIR, Meta
- Carole-Jean Wu, FAIR, Meta
- Poonam Yadav, University of York, UK
- Wenhui Zhang, LFAI & Data
- Fedor Zhdanov, Nebius AI