This Week in AI: Maybe we should ignore AI benchmarks for now

Welcome to TechCrunch’s regular AI newsletter! We’re going on hiatus for a bit, but you can find all our AI coverage, including my columns, our daily analysis, and breaking news stories, at TechCrunch. If you want those stories and much more in your inbox every day, sign up for our daily newsletters here.

This week, billionaire Elon Musk’s AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beats a number of other leading models, including some from OpenAI, on benchmarks for mathematics, programming, and more.

But what do these benchmarks really tell us?

Here at TC, we often reluctantly report benchmark figures because they’re one of the few (relatively) standardized ways the AI industry measures model improvements. But popular AI benchmarks tend to test for esoteric knowledge, and give aggregate scores that correlate poorly with proficiency on the tasks that most people care about.

As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3’s unveiling Monday, there’s an “urgent need for better batteries of tests and independent testing authorities.” AI companies self-report benchmark results more often than not, as Mollick alluded to, making those results even tougher to accept at face value.

“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”

There’s no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.

This debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it does induce some level of AI FOMO.

As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.

News


OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom,” no matter how challenging or controversial a topic may be.

Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, intends to build tools to “make AI work for [people’s] unique needs and goals.”

Grok 3 cometh: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.

A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 29.

AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between some 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.

Research paper of the week


OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks that range from bug fixes and feature deployments to “manager-level” technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, suggesting that AI has quite a ways to go. It’s worth noting that the researchers didn’t benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.

Model of the week

A Chinese AI company named Stepfun has released an “open” AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese and lets users adjust the emotion and even dialect of the synthetic audio it creates, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun is reported to have recently closed a funding round worth several hundred million dollars from a host of investors that includes Chinese state-owned private equity firms.

Grab bag


Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle on and off long “chains of thought” for improved accuracy at the cost of some computational heft. In “reasoning” mode, DeepHermes-3 Preview, similar to other reasoning AI models, “thinks” longer for harder problems and shows the thought process it followed to arrive at the answer.

Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.
