
Meta’s Investment in AI Data Labeling Explained

By Simon Osuji
August 1, 2025
in Artificial Intelligence
Earlier this summer, Meta made a US $14.3 billion bet on a company most people had never heard of: Scale AI. The deal, which gave Meta a 49 percent stake, sent Meta’s competitors—including OpenAI and Google—scrambling to exit their contracts with Scale AI for fear it might give Meta insight into how they train and fine-tune their AI models.

Scale AI is a leader in data labeling for AI models. It’s an industry that, at its core, does what it says on the tin. The most basic example can be found in the thumbs-up and thumbs-down icons you’ve likely seen if you’ve ever used ChatGPT. One labels a reply as positive; the other, negative.

But as AI models grow, both in model size and popularity, this seemingly simple task has grown into a beast every organization looking to train or tune a model must manage.

“The vast majority of compute is used on pre-training data that’s of poor quality,” says Sara Hooker, a vice president of research at Cohere Labs. “We need to mitigate that, to improve it, applying super high-quality gold dust data in post-training.”

What Is Data Labeling?

Computer scientists have, in the past, relied on the axiom “garbage in, garbage out.” It suggests that bad inputs always lead to bad outputs.

However, as Hooker suggests, the training of modern AI models defies that axiom. Large language models are trained on raw text data scraped from the public Internet, much of which is of low quality (Reddit posts tend to outnumber academic papers).

Cleaning and sorting training data makes sense in theory, but with modern models training on petabytes of data, it’s impractical in practice. That’s a problem, because popular AI training data sets are known to include racist, sexist, and criminal content. Training data can also include more subtle issues, like sarcastic or purposefully misleading advice. Put simply: a lot of garbage finds its way into the training data.

So data labeling steps in to clean up the mess. Rather than trying to scrub out all of the problematic elements of the training data, human experts manually provide feedback on the AI model’s output after the model is trained. This molds the model, reducing undesirable replies and changing the model’s demeanor.

Sajjad Abdoli, founding AI scientist at data labeling company Perle, explains this process of creating “golden benchmarks” to fine-tune AI models. What exactly that benchmark contains will depend on the purpose of the model. “We walk our customers through the procedure, and create the criteria for a quality assessment,” says Abdoli.

Consider a typical chatbot. Most companies want to build a chatbot that’s helpful, accurate, and concise, so data labelers provide feedback with those goals in mind. Human data labelers read the replies generated by the model on a set of test prompts. A reply that seems to answer the prompt with concise and accurate information would be considered positive. A meandering reply that ends in an insult would be labeled as negative.
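The feedback record described above can be sketched as a simple data structure. This is an illustrative schema, not any labeling vendor's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class LabeledReply:
    """One unit of human feedback on a model's reply to a test prompt."""
    prompt: str
    reply: str
    label: str  # "positive" or "negative"
    reasons: list = field(default_factory=list)  # e.g. "concise", "inaccurate"

# A reply that answers the prompt concisely and accurately gets a positive label...
good = LabeledReply(
    prompt="What is the capital of France?",
    reply="Paris.",
    label="positive",
    reasons=["accurate", "concise"],
)

# ...while a meandering reply that ends in an insult is labeled negative.
bad = LabeledReply(
    prompt="What is the capital of France?",
    reply="Well, geography is complicated... you wouldn't understand anyway.",
    label="negative",
    reasons=["meandering", "insulting"],
)

print(good.label, bad.label)  # positive negative
```

In practice labelers often record richer signals than a binary label, but the core unit of work is the same: a prompt, a model reply, and a human judgment.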

Not all AI models are meant to be chatbots, however, or focus on text. As an example, Abdoli described Perle’s work assisting a customer building an image-labeling model. Perle contracted human experts to meticulously label the objects in thousands of images, creating a standard that could be used to improve the model. “We found a huge gap between what the human experts mentioned in an image, and what the machine learning model could recognize,” Abdoli says.
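The “gap” Abdoli describes can be quantified by comparing the sets of objects that humans and the model each identify in an image. A minimal sketch, with made-up labels for one hypothetical image:

```python
# Hypothetical comparison of human-annotated objects vs. model detections
# for a single image (labels are illustrative, not from Perle's data).
human_labels = {"car", "pedestrian", "traffic light", "bicycle", "stop sign"}
model_labels = {"car", "traffic light"}

# Objects the human experts mentioned that the model failed to recognize.
missed = human_labels - model_labels

# Fraction of human-identified objects the model found (recall).
recall = len(human_labels & model_labels) / len(human_labels)

print(sorted(missed))   # ['bicycle', 'pedestrian', 'stop sign']
print(f"{recall:.0%}")  # 40%
```

Aggregated across thousands of images, this kind of comparison is what turns expert annotations into a benchmark the model can be measured and improved against.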

Why Meta Invested Billions in Scale AI

Data labeling is necessary to fine-tune any AI model, but that alone doesn’t explain why Meta was willing to invest over $14 billion in Scale AI. To understand that, we need to understand the AI industry’s latest obsession: agentic AI.

OpenAI’s CEO, Sam Altman, believes AI will make it possible for a single person to build a company worth $1 billion (or more). To make that dream come true, though, AI companies need to invent agentic AI models capable of complex multi-step workflows that might span days, even weeks, and include the use of numerous software tools.

And it turns out that data labeling is a key ingredient in the agentic AI recipe.

“Take a universe where you have multiple agents interacting with each other,” says Jason Liang, a senior vice president at AI data labeling company SuperAnnotate. “Somebody will have to come in and review, did the agent call the right tool? Did it call the next agent properly?”

In fact, the problem is even more complicated than it at first appears, as it requires evaluation of both specific actions and the AI agent’s overall plan. For example, several agents might call another in sequence, each for reasons that seem justifiable. “But actually, the first agent could have just called the fourth one and skipped the two in the middle,” says Liang.
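The two levels of review Liang describes—checking individual calls and checking the overall plan—can be sketched on a toy agent trace. The agent names and the "shorter plan" below are hypothetical, standing in for what a human labeler would flag:

```python
# Hypothetical trace: which agent was called at each step of a workflow.
trace = ["search_agent", "summarize_agent", "format_agent", "email_agent"]

# Step-level review: did every call use a permitted agent/tool?
allowed = {"search_agent", "summarize_agent", "format_agent", "email_agent"}
step_ok = all(step in allowed for step in trace)

# Plan-level review: each handoff may look justifiable in isolation, but a
# labeler might note the first agent could have called the fourth directly,
# skipping the two in the middle.
shorter_plan = [trace[0], trace[-1]]

print(step_ok)       # True
print(shorter_plan)  # ['search_agent', 'email_agent']
```

The hard part is that only the plan-level judgment catches the inefficiency: every individual step passes review, yet the trace as a whole is twice as long as it needs to be.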

Agentic AI also requires models that can solve problems in high-stakes fields where an agent’s results could have life-or-death consequences. Perle’s Abdoli pointed to medical use as a leading example. An agentic AI doctor capable of accurate diagnosis, even if just in a single field or in limited circumstances, could prove immensely valuable. But the creation of such an agent, if it’s even possible, will push the data labeling industry to its limits.

“If you’re collecting medical notes, or data from CT scans, or data like that, you need to source physicians [to label and annotate the data]. And they’re quite expensive,” says Abdoli. “However, for these kinds of activities, the precision and quality of the data is the most important thing.”

Synthetic Data’s Impact on AI Training

However, if AI models require human experts for data labeling to judge and improve models, where does that need end? Will we have teams of doctors labeling data in offices instead of doing actual medical work?

That’s where synthetic data steps in.

Rather than relying entirely on human experts, data labeling companies often use AI models to generate training data for other AI models—essentially letting machines teach machines. Modern data labeling is often a mix of manual human feedback and automated AI teachers designed to reinforce desirable model behavior.

“You have a teacher, and your teacher, which in this case is just another deep neural network, is outputting an example,” says Cohere’s Hooker. “And then the student model is trained on that example.” The key, she notes, is to use a high-quality teacher, and to use multiple different AI “teachers” rather than relying on a single model. This avoids model collapse, in which the output quality of an AI model trained on AI-generated data rapidly degrades.
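The multiple-teachers idea Hooker describes can be sketched in a few lines. The teacher functions here are trivial stand-ins for real models; the point is the sampling structure, not the outputs:

```python
import random

def teacher_a(prompt: str) -> str:
    """Stand-in for one teacher model's completion."""
    return f"A: answer to {prompt!r}"

def teacher_b(prompt: str) -> str:
    """Stand-in for a second, different teacher model."""
    return f"B: answer to {prompt!r}"

def make_synthetic_dataset(prompts, teachers, seed=0):
    """Draw each training example from a randomly chosen teacher, so that
    no single model's quirks dominate the student's training data."""
    rng = random.Random(seed)
    return [(p, rng.choice(teachers)(p)) for p in prompts]

data = make_synthetic_dataset(["q1", "q2", "q3"], [teacher_a, teacher_b])
print(len(data))  # 3
```

Real pipelines add filtering and quality scoring on top of this, but mixing teachers is the basic hedge against the student inheriting one model's blind spots.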

DeepSeek R1, the model from the Chinese company of the same name that made waves in January for how cheap it was to train, is an extreme example of how synthetic data can work in practice. It achieved reasoning performance comparable to the best models from OpenAI, Anthropic, and Google without traditional human feedback. Instead, DeepSeek R1 was trained on “cold start” data consisting of a few thousand human-selected examples of chain-of-thought reasoning. After that, DeepSeek used rules-based rewards to reinforce the model’s reasoning behavior.
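A rules-based reward replaces the human grader with checks a program can run. This toy function is in the spirit of that approach, not DeepSeek's actual reward: it scores a response on format (reasoning wrapped in hypothetical `<think>` tags) and on whether the final answer matches:

```python
import re

def rule_based_reward(response: str, expected_answer: str) -> float:
    """Toy rules-based reward: no human grader, just programmatic checks
    on the response's structure and its final answer."""
    reward = 0.0
    # Format rule: reasoning should appear inside <think>...</think> tags.
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.5
    # Accuracy rule: whatever follows the reasoning must match exactly.
    final = response.split("</think>")[-1].strip()
    if final == expected_answer:
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think>4", "4"))  # 1.5
print(rule_based_reward("maybe 5?", "4"))                   # 0.0
```

Because the reward is computed by rules rather than people, it can be applied to millions of reasoning traces at negligible cost—which is part of why R1 was so cheap to train.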

However, SuperAnnotate’s Liang cautioned that synthetic data isn’t a silver bullet. While the AI industry is often eager to automate whenever possible, attempts to use models for ever-more-complex tasks can reveal edge cases that only humans catch. “As we’re starting to see enterprises putting models into production, they’re all coming to the realization, holy moly, I need to get humans into the mix,” he says.

That’s precisely why data labeling companies like Scale AI, Perle, and SuperAnnotate (among dozens of others) are enjoying the spotlight. The best method for tuning agentic AI models to tackle complicated or niche use cases—whether through human feedback, synthetic data, some combination, or new techniques yet to be discovered—remains an open question. Meta’s $14 billion bet suggests the answer won’t come cheap.
