
This Week in AI: Maybe we should ignore AI benchmarks for now

by Simon Osuji
February 19, 2025
in Creator Economy

Welcome to TechCrunch’s regular AI newsletter! We’re going on hiatus for a bit, but you can find all our AI coverage, including my columns, our daily analysis, and breaking news stories, at TechCrunch. If you want those stories and much more in your inbox every day, sign up for our daily newsletters here.

This week, billionaire Elon Musk’s AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beats a number of other leading models, including OpenAI’s, on benchmarks for mathematics, programming, and more.

But what do these benchmarks really tell us?

Here at TC, we often reluctantly report benchmark figures because they’re one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test for esoteric knowledge and give aggregate scores that correlate poorly with proficiency on the tasks that most people care about.

As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3’s unveiling Monday, there’s an “urgent need for better batteries of tests and independent testing authorities.” AI companies self-report benchmark results more often than not, as Mollick alluded to, making those results even tougher to accept at face value.

“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”

There’s no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.

This debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it does induce some level of AI FOMO.

As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.

News

OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom,” no matter how challenging or controversial a topic may be.

Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, intends to build tools to “make AI work for [people’s] unique needs and goals.”

Grok 3 cometh: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.

A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 29.

AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between some 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.

Research paper of the week

OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks that range from bug fixes and feature deployments to “manager-level” technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, suggesting that AI has quite a way to go. It’s worth noting that the researchers didn’t benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.

Model of the week

A Chinese AI company named Stepfun has released an “open” AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese and lets users adjust the emotion and even dialect of the synthetic audio it creates, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly recently closed a funding round worth several hundred million dollars from a host of investors that include Chinese state-owned private equity firms.

Grab bag

Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle on and off long “chains of thought” for improved accuracy at the cost of some computational heft. In “reasoning” mode, DeepHermes-3 Preview, similar to other reasoning AI models, “thinks” longer for harder problems and shows its thought process to arrive at the answer.

Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.

© 2026 LBNN – All rights reserved.
