
Meta’s Investment in AI Data Labeling Explained

By Simon Osuji
August 1, 2025
in Artificial Intelligence
Earlier this summer, Meta made a US $14.3 billion bet on a company most people had never heard of: Scale AI. The deal, which gave Meta a 49 percent stake, sent Meta’s competitors—including OpenAI and Google—scrambling to exit their contracts with Scale AI for fear it might give Meta insight into how they train and fine-tune their AI models.

Scale AI is a leader in data labeling for AI models. It’s an industry that, at its core, does what it says on the tin. The most basic example can be found in the thumbs-up and thumbs-down icons you’ve likely seen if you’ve ever used ChatGPT. One labels a reply as positive; the other, negative.

But as AI models grow, both in model size and popularity, this seemingly simple task has grown into a beast every organization looking to train or tune a model must manage.

“The vast majority of compute is used on pre-training data that’s of poor quality,” says Sara Hooker, a vice president of research at Cohere Labs. “We need to mitigate that, to improve it, applying super high-quality gold dust data in post-training.”

What Is Data Labeling?

Computer scientists have, in the past, relied on the axiom “garbage in, garbage out.” It suggests that bad inputs always lead to bad outputs.

However, as Hooker suggests, the training of modern AI models defies that axiom. Large language models are trained on raw text data scraped from the public Internet, much of which is of low quality (Reddit posts tend to outnumber academic papers).

Cleaning and sorting training data makes sense in theory, but with modern models training on petabytes of data, it’s impractical in practice. That’s a problem, because popular AI training data sets are known to include racist, sexist, and criminal content. Training data can also include more subtle issues, like sarcastic or purposefully misleading advice. Put simply: a lot of garbage finds its way into the training data.

So data labeling steps in to clean up the mess. Rather than trying to scrub out all of the problematic elements of the training data, human experts manually provide feedback on the AI model’s output after the model is trained. This molds the model, reducing undesirable replies and changing the model’s demeanor.

Sajjad Abdoli, founding AI scientist at data labeling company Perle, explains this process of creating “golden benchmarks” to fine-tune AI models. What exactly that benchmark contains will depend on the purpose of the model. “We walk our customers through the procedure, and create the criteria for a quality assessment,” says Abdoli.

Consider a typical chatbot. Most companies want to build a chatbot that’s helpful, accurate, and concise, so data labelers provide feedback with those goals in mind. Human data labelers read the replies generated by the model on a set of test prompts. A reply that seems to answer the prompt with concise and accurate information would be considered positive. A meandering reply that ends in an insult would be labeled as negative.
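The feedback record described above can be sketched as a simple data structure. This is an illustrative schema, not any labeling vendor's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class LabeledReply:
    """One unit of human feedback on a model's reply to a test prompt."""
    prompt: str
    reply: str
    label: str  # "positive" or "negative"
    reasons: list = field(default_factory=list)  # e.g. "concise", "inaccurate"

# A reply that answers the prompt concisely and accurately gets a positive label...
good = LabeledReply(
    prompt="What is the capital of France?",
    reply="Paris.",
    label="positive",
    reasons=["accurate", "concise"],
)

# ...while a meandering reply that ends in an insult is labeled negative.
bad = LabeledReply(
    prompt="What is the capital of France?",
    reply="Well, geography is complicated... you wouldn't understand anyway.",
    label="negative",
    reasons=["meandering", "insulting"],
)

print(good.label, bad.label)  # positive negative
```

In practice labelers often record richer signals than a binary label, but the core unit of work is the same: a prompt, a model reply, and a human judgment.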

Not all AI models are meant to be chatbots, however, or focus on text. As an example, Abdoli described Perle’s work assisting a customer building an image-labeling model. Perle contracted human experts to meticulously label the objects in thousands of images, creating a standard that could be used to improve the model. “We found a huge gap between what the human experts mentioned in an image, and what the machine learning model could recognize,” Abdoli says.
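The “gap” Abdoli describes can be quantified by comparing the sets of objects that humans and the model each identify in an image. A minimal sketch, with made-up labels for one hypothetical image:

```python
# Hypothetical comparison of human-annotated objects vs. model detections
# for a single image (labels are illustrative, not from Perle's data).
human_labels = {"car", "pedestrian", "traffic light", "bicycle", "stop sign"}
model_labels = {"car", "traffic light"}

# Objects the human experts mentioned that the model failed to recognize.
missed = human_labels - model_labels

# Fraction of human-identified objects the model found (recall).
recall = len(human_labels & model_labels) / len(human_labels)

print(sorted(missed))   # ['bicycle', 'pedestrian', 'stop sign']
print(f"{recall:.0%}")  # 40%
```

Aggregated across thousands of images, this kind of comparison is what turns expert annotations into a benchmark the model can be measured and improved against.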

Why Meta Invested Billions in Scale AI

Data labeling is necessary to fine-tune any AI model, but that alone doesn’t explain why Meta was willing to invest over $14 billion in Scale AI. To understand that, we need to understand the AI industry’s latest obsession: agentic AI.

OpenAI’s CEO, Sam Altman, believes AI will make it possible for a single person to build a company worth $1 billion (or more). To make that dream come true, though, AI companies need to invent agentic AI models capable of complex multi-step workflows that might span days, even weeks, and include the use of numerous software tools.

And it turns out that data labeling is a key ingredient in the agentic AI recipe.

“Take a universe where you have multiple agents interacting with each other,” says Jason Liang, a senior vice president at AI data labeling company SuperAnnotate. “Somebody will have to come in and review, did the agent call the right tool? Did it call the next agent properly?”

In fact, the problem is even more complicated than it at first appears, as it requires evaluation of both specific actions and the AI agent’s overall plan. For example, several agents might call another in sequence, each for reasons that seem justifiable. “But actually, the first agent could have just called the fourth one and skipped the two in the middle,” says Liang.
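The two levels of review Liang describes—checking individual calls and checking the overall plan—can be sketched on a toy agent trace. The agent names and the "shorter plan" below are hypothetical, standing in for what a human labeler would flag:

```python
# Hypothetical trace: which agent was called at each step of a workflow.
trace = ["search_agent", "summarize_agent", "format_agent", "email_agent"]

# Step-level review: did every call use a permitted agent/tool?
allowed = {"search_agent", "summarize_agent", "format_agent", "email_agent"}
step_ok = all(step in allowed for step in trace)

# Plan-level review: each handoff may look justifiable in isolation, but a
# labeler might note the first agent could have called the fourth directly,
# skipping the two in the middle.
shorter_plan = [trace[0], trace[-1]]

print(step_ok)       # True
print(shorter_plan)  # ['search_agent', 'email_agent']
```

The hard part is that only the plan-level judgment catches the inefficiency: every individual step passes review, yet the trace as a whole is twice as long as it needs to be.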

Agentic AI also requires models that can solve problems in high-stakes fields where an agent’s results could have life-or-death consequences. Perle’s Abdoli pointed to medical use as a leading example. An agentic AI doctor capable of accurate diagnosis, even if just in a single field or in limited circumstances, could prove immensely valuable. But the creation of such an agent, if it’s even possible, will push the data labeling industry to its limits.

“If you’re collecting medical notes, or data from CT scans, or data like that, you need to source physicians [to label and annotate the data]. And they’re quite expensive,” says Abdoli. “However, for these kinds of activities, the precision and quality of the data is the most important thing.”

Synthetic Data’s Impact on AI Training

However, if AI models require human experts for data labeling to judge and improve models, where does that need end? Will we have teams of doctors labeling data in offices instead of doing actual medical work?

That’s where synthetic data steps in.

Rather than relying entirely on human experts, data labeling companies often use AI models to generate training data for other AI models—essentially letting machines teach machines. Modern data labeling is often a mix of manual human feedback and automated AI teachers designed to reinforce desirable model behavior.

“You have a teacher, and your teacher, which in this case is just another deep neural network, is outputting an example,” says Cohere’s Hooker. “And then the student model is trained on that example.” The key, she notes, is to use a high-quality teacher, and to use multiple different AI “teachers” rather than relying on a single model. This avoids model collapse, in which the output quality of an AI model trained on AI-generated data rapidly degrades.
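The multiple-teachers idea Hooker describes can be sketched in a few lines. The teacher functions here are trivial stand-ins for real models; the point is the sampling structure, not the outputs:

```python
import random

def teacher_a(prompt: str) -> str:
    """Stand-in for one teacher model's completion."""
    return f"A: answer to {prompt!r}"

def teacher_b(prompt: str) -> str:
    """Stand-in for a second, different teacher model."""
    return f"B: answer to {prompt!r}"

def make_synthetic_dataset(prompts, teachers, seed=0):
    """Draw each training example from a randomly chosen teacher, so that
    no single model's quirks dominate the student's training data."""
    rng = random.Random(seed)
    return [(p, rng.choice(teachers)(p)) for p in prompts]

data = make_synthetic_dataset(["q1", "q2", "q3"], [teacher_a, teacher_b])
print(len(data))  # 3
```

Real pipelines add filtering and quality scoring on top of this, but mixing teachers is the basic hedge against the student inheriting one model's blind spots.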

DeepSeek R1, the model from the Chinese company of the same name that made waves in January for how cheap it was to train, is an extreme example of how synthetic data can work in practice. It achieved reasoning performance comparable to the best models from OpenAI, Anthropic, and Google without traditional human feedback. Instead, DeepSeek R1 was trained on “cold start” data consisting of a few thousand human-selected examples of chain-of-thought reasoning. After that, DeepSeek used rules-based rewards to reinforce the model’s reasoning behavior.
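A rules-based reward replaces the human grader with checks a program can run. This toy function is in the spirit of that approach, not DeepSeek's actual reward: it scores a response on format (reasoning wrapped in hypothetical `<think>` tags) and on whether the final answer matches:

```python
import re

def rule_based_reward(response: str, expected_answer: str) -> float:
    """Toy rules-based reward: no human grader, just programmatic checks
    on the response's structure and its final answer."""
    reward = 0.0
    # Format rule: reasoning should appear inside <think>...</think> tags.
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.5
    # Accuracy rule: whatever follows the reasoning must match exactly.
    final = response.split("</think>")[-1].strip()
    if final == expected_answer:
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think>4", "4"))  # 1.5
print(rule_based_reward("maybe 5?", "4"))                   # 0.0
```

Because the reward is computed by rules rather than people, it can be applied to millions of reasoning traces at negligible cost—which is part of why R1 was so cheap to train.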

However, SuperAnnotate’s Liang cautioned that synthetic data isn’t a silver bullet. While the AI industry is often eager to automate whenever possible, attempts to use models for ever-more-complex tasks can reveal edge cases that only humans catch. “As we’re starting to see enterprises putting models into production, they’re all coming to the realization, holy moly, I need to get humans into the mix,” he says.

That’s precisely why data labeling companies like Scale AI, Perle, and SuperAnnotate (among dozens of others) are enjoying the spotlight. The best method for tuning agentic AI models to tackle complicated or niche use cases—whether through human feedback, synthetic data, some combination, or new techniques yet to be discovered—remains an open question. Meta’s $14 billion bet suggests the answer won’t come cheap.
