Anthropic says they've found a new way to stop AI from turning evil

AI is a relatively new tool, and despite its rapid deployment in nearly every aspect of our lives, researchers are still trying to figure out how its “personality traits” arise and how to control them. Large language models (LLMs) interact with users through chatbots or “assistants,” and some of these assistants have recently exhibited troubling behaviors, such as praising evil dictators, resorting to blackmail or flattering users sycophantically. Given how deeply LLMs are already integrated into society, it is no surprise that researchers are looking for ways to weed out undesirable behaviors.

Anthropic, the AI company and creator of the LLM Claude, recently released a paper on the arXiv preprint server discussing their new approach to reining in these undesirable traits in LLMs. In their method, they identify patterns of activity within an AI model’s neural network—referred to as “persona vectors”—that control its character traits. Anthropic says these persona vectors are somewhat analogous to parts of the brain that “light up” when a person experiences a certain feeling or does a particular activity.

Anthropic’s researchers used two open-source LLMs, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, to test whether they could remove or manipulate these persona vectors to control the behaviors of the LLMs. Their study focuses on three traits: evil, sycophancy and hallucination (the LLM’s propensity to make up information). Traits must be given a name and an explicit description for the vectors to be properly identified.
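
To make the idea of a persona vector more concrete, here is a minimal sketch in plain Python. It assumes the vector is taken as the difference between the average internal activations recorded when a model exhibits a trait and the average recorded when it does not. The tiny three-number "activations" and the names `mean_vector`, `persona_vector`, `evil_acts` and `neutral_acts` are hypothetical stand-ins invented for illustration; real activations have thousands of dimensions and come from a model's hidden layers.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(column) for column in zip(*rows)]


def persona_vector(trait_acts, neutral_acts):
    """Difference of means: trait-exhibiting average minus neutral average."""
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    return [t - n for t, n in zip(mu_trait, mu_neutral)]


# Hypothetical activations (assumption: made-up numbers, not real model data).
evil_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]      # responses showing the trait
neutral_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]   # responses without the trait

v_evil = persona_vector(evil_acts, neutral_acts)
print(v_evil)  # [2.0, 0.0, 0.0]
```

In this toy case the two groups differ only in the first dimension, so the resulting vector points along that dimension alone.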

Persona vectors and their applications. Credit: arXiv (2025). DOI: 10.48550/arxiv.2507.21509

In their method, a technique called “steering” can be used to control behaviors. They write, “When we steer the model with the ‘evil’ persona vector, we start to see it talking about unethical acts; when we steer with ‘sycophancy,’ it sucks up to the user; and when we steer with ‘hallucination,’ it starts to make up information. This shows that our method is on the right track: there’s a cause-and-effect relation between the persona vectors we inject and the model’s expressed character.”
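
As a rough illustration of what steering means in practice, the sketch below assumes it amounts to adding a scaled copy of the persona vector to a model's internal activations. The function name `steer`, the coefficient `alpha` and all numbers are hypothetical; the article does not specify where in the network or with what strength the researchers intervene.

```python
def steer(hidden, persona_vec, alpha):
    """Return hidden + alpha * persona_vec.

    A positive alpha pushes the activation toward the trait;
    a negative alpha pushes it away.
    """
    return [h + alpha * v for h, v in zip(hidden, persona_vec)]


# Hypothetical values (assumption: illustrative numbers only).
hidden = [0.5, 1.0, -1.0]   # one activation vector inside the model
v_trait = [2.0, 0.0, 0.0]   # a persona vector for some trait

amplified = steer(hidden, v_trait, alpha=1.5)
suppressed = steer(hidden, v_trait, alpha=-0.25)

print(amplified)   # [3.5, 1.0, -1.0]
print(suppressed)  # [0.0, 1.0, -1.0]
```

Only the component lying along the persona vector changes; the other dimensions of the activation are left untouched.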

However, they found that when they made these changes after training, the model lost some of its intelligence. There was a workaround: the team found that inducing the bad behaviors during training allowed the LLMs to resist acquiring them without becoming less useful. Furthermore, they found that they could monitor and predict persona shifts during deployment and training, and flag training data likely to produce unwanted traits even before fine-tuning the model.
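
One simple way to picture the monitoring and data-flagging step is to measure how far each activation points along a persona vector and flag anything above a cutoff. The sketch below assumes exactly that; the names `projection` and `flag_samples`, the sample ids and the threshold of 1.0 are hypothetical choices for illustration, not values from the paper.

```python
def projection(activation, persona_vec):
    """Scalar projection of an activation onto the unit-length persona vector."""
    norm = sum(v * v for v in persona_vec) ** 0.5
    return sum(a * v for a, v in zip(activation, persona_vec)) / norm


def flag_samples(samples, persona_vec, threshold):
    """Return ids of samples whose projection exceeds the threshold."""
    return [
        sample_id
        for sample_id, activation in samples.items()
        if projection(activation, persona_vec) > threshold
    ]


# Hypothetical per-sample activations (assumption: made-up numbers).
samples = {
    "sample_a": [0.1, 0.9, 0.3],
    "sample_b": [2.4, 0.2, 0.1],
    "sample_c": [-0.5, 1.0, 0.0],
}
v_trait = [2.0, 0.0, 0.0]

flagged = flag_samples(samples, v_trait, threshold=1.0)
print(flagged)  # ['sample_b']
```

The same projection could in principle be tracked over the course of a conversation or a training run to watch for drift toward a trait.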

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine—by giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so,” they write.
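
The vaccine analogy can be shown with a deliberately tiny toy model. This is not the paper's training setup: it assumes the model's "personality" is a single number `w`, that the training data pulls the model's output toward a trait value of 2.0, and that steering adds a fixed offset to the output during training only. All names and numbers are hypothetical.

```python
def train(target, steer_offset, steps=200, lr=0.1):
    """Fit scalar w by gradient descent so that (w + steer_offset) matches target."""
    w = 0.0
    for _ in range(steps):
        prediction = w + steer_offset          # steering is applied during training
        grad = 2.0 * (prediction - target)     # gradient of squared error w.r.t. w
        w -= lr * grad
    return w


trait_in_data = 2.0  # how strongly the training data expresses the trait

# Without steering, the weight itself must move to fit the data.
w_plain = train(trait_in_data, steer_offset=0.0)

# With steering toward the trait, the offset already explains the data,
# so the weight has no pressure to change.
w_vaccinated = train(trait_in_data, steer_offset=2.0)

# At deployment the steering offset is removed, so w alone sets the trait level.
print(round(w_plain, 6), w_vaccinated)
```

In the unsteered run the weight absorbs the trait and ends up near 2.0, while in the steered run it stays at 0.0, mirroring the quoted explanation that supplying the adjustment relieves the model of the pressure to make it.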

This “preventative steering” during training was found to limit persona drift while preserving model capabilities better than post-hoc changes. This is an impressive feat in the world of AI training, but there are still some limitations. For example, because the method requires a strict definition for the traits to be removed, some more vague or undefined behaviors might still cause problems. The method also needs to be tested out on other LLMs and with more traits to ensure its usefulness is sufficiently broad.

Still, this new method is a promising step in the right direction. Anthropic researchers write, “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them.”

Written by Krystal Kasal, edited by Gaby Clark, and fact-checked and reviewed by Robert Egan.

More information:
Runjin Chen et al, Persona Vectors: Monitoring and Controlling Character Traits in Language Models, arXiv (2025). DOI: 10.48550/arxiv.2507.21509

Anthropic: www.anthropic.com/research/persona-vectors

Journal information:
arXiv

Citation:
Anthropic says they’ve found a new way to stop AI from turning evil (2025, August 6)
retrieved 7 August 2025
from https://techxplore.com/news/2025-08-anthropic-theyve-ai-evil.html

