
Anthropic says they’ve found a new way to stop AI from turning evil

by Simon Osuji
August 7, 2025
in Artificial Intelligence


AI that turns bad - abstract image
Credit: AI-generated image

AI is a relatively new tool, and despite its rapid deployment in nearly every aspect of our lives, researchers are still trying to figure out how its “personality traits” arise and how to control them. Large language models (LLMs) interface with users through chatbots or “assistants,” and some of these assistants have recently exhibited troubling behaviors, such as praising evil dictators, resorting to blackmail or behaving sycophantically toward users. Considering how deeply these LLMs have already been integrated into our society, it is no surprise that researchers are trying to find ways to weed out undesirable behaviors.


Anthropic, the AI company and creator of the LLM Claude, recently released a paper on the arXiv preprint server discussing their new approach to reining in these undesirable traits in LLMs. In their method, they identify patterns of activity within an AI model’s neural network—referred to as “persona vectors”—that control its character traits. Anthropic says these persona vectors are somewhat analogous to parts of the brain that “light up” when a person experiences a certain feeling or does a particular activity.

Anthropic’s researchers used two open-source LLMs, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, to test whether they could remove or manipulate these persona vectors to control the behaviors of the LLMs. Their study focuses on three traits: evil, sycophancy and hallucination (the LLM’s propensity to make up information). Traits must be given a name and an explicit description for the vectors to be properly identified.
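To make the idea concrete, here is a minimal sketch of one way such a vector could be computed. The article does not spell out the procedure, so the code assumes a difference-of-means approach: average the activations recorded while the model exhibits the trait, average those recorded while it does not, and subtract. The function names and the toy three-dimensional numbers are illustrative assumptions, not real model data.

```python
from statistics import fmean


def mean_activation(activations):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(dim) for dim in zip(*activations)]


def persona_vector(trait_activations, baseline_activations):
    """Assumed recipe: mean trait activation minus mean baseline activation."""
    trait_mean = mean_activation(trait_activations)
    baseline_mean = mean_activation(baseline_activations)
    return [t - b for t, b in zip(trait_mean, baseline_mean)]


# Toy activations (made-up numbers): two "sycophantic" and two neutral responses.
sycophantic = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

vector = persona_vector(sycophantic, neutral)
print(vector)  # [2.0, 0.0, 0.0]
```

In this toy case only the first dimension differs between the two groups, so the resulting vector points along it.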

Persona vectors and their applications. Credit: arXiv (2025). DOI: 10.48550/arxiv.2507.21509

In their method, a technique called “steering” can be used to control behaviors. They write, “When we steer the model with the ‘evil’ persona vector, we start to see it talking about unethical acts; when we steer with ‘sycophancy,’ it sucks up to the user; and when we steer with ‘hallucination,’ it starts to make up information. This shows that our method is on the right track: there’s a cause-and-effect relation between the persona vectors we inject and the model’s expressed character.”
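In its simplest form, steering can be pictured as nudging a model's internal activations along the persona vector. The sketch below assumes the common formulation of adding a scaled copy of the vector to a hidden state; the `steer` function, the `strength` parameter and the numbers are hypothetical and stand in for what would happen inside a real network layer.

```python
def steer(hidden_state, persona_vector, strength):
    """Return the hidden state shifted by strength * persona_vector."""
    return [h + strength * v for h, v in zip(hidden_state, persona_vector)]


# Toy hidden state and "evil" vector (illustrative values only).
hidden = [0.5, -1.0, 2.0]
evil_vector = [1.0, 0.0, -1.0]

toward = steer(hidden, evil_vector, 2.0)   # positive strength pushes toward the trait
away = steer(hidden, evil_vector, -2.0)    # negative strength pushes away from it

print(toward)  # [2.5, -1.0, 0.0]
print(away)    # [-1.5, -1.0, 4.0]
```

A positive strength corresponds to injecting the trait, as in the experiments quoted above, while a negative strength corresponds to suppressing it.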

However, they found that when they made these changes after training, the model lost some of its intelligence. There was a workaround: the team found that inducing the bad behaviors during training allowed the LLMs to integrate better behavior without reducing their usefulness. They also found that they could monitor and predict persona shifts during deployment and training, and flag training data likely to produce unwanted traits even before fine-tuning the model.
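One way to picture the monitoring and data-flagging step is as a projection: measure how far an activation points along the persona vector and raise a flag when that score is high. The sketch below is an assumption about how this could look, not the paper's exact procedure; the sample names, the threshold and the two-dimensional activations are invented for illustration.

```python
import math


def projection(activation, persona_vector):
    """Scalar projection of an activation onto the persona vector's direction."""
    norm = math.sqrt(sum(v * v for v in persona_vector))
    return sum(a * v for a, v in zip(activation, persona_vector)) / norm


def flag_samples(sample_activations, persona_vector, threshold):
    """Return the names of samples whose projection exceeds the threshold."""
    return [
        name
        for name, activation in sample_activations.items()
        if projection(activation, persona_vector) > threshold
    ]


# Toy "hallucination" vector and per-sample activations (made-up numbers).
hallucination_vector = [3.0, 4.0]
samples = {
    "sample_a": [3.0, 4.0],   # points the same way as the vector
    "sample_b": [-4.0, 3.0],  # orthogonal to the vector
    "sample_c": [0.6, 0.8],   # same direction, but small
}

flagged = flag_samples(samples, hallucination_vector, threshold=2.0)
print(flagged)  # ['sample_a']
```

The same score could be tracked over a conversation or a training run to watch for drift toward a trait.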

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine—by giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so,” they write.
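The “vaccine” intuition can be illustrated with a deliberately tiny toy, which is an analogy rather than the paper's training setup. Here the whole “model” is a single number `b`, the training targets all carry a trait-like offset of 1.0, and `steering` is an offset supplied from outside during training. All names and values are hypothetical.

```python
def train_bias(targets, steering, lr=0.1, epochs=200):
    """Fit a scalar bias b by gradient descent on the loss (b + steering - y)^2."""
    b = 0.0
    for _ in range(epochs):
        for y in targets:
            grad = 2.0 * (b + steering - y)  # derivative of the squared error w.r.t. b
            b -= lr * grad
    return b


# Every training target carries the unwanted offset of 1.0.
targets = [1.0] * 5

plain = train_bias(targets, steering=0.0)      # the model must absorb the offset itself
prevented = train_bias(targets, steering=1.0)  # the offset is supplied during training

# At deployment the steering term is removed, so the learned bias is what remains.
print(round(plain, 3), prevented)
```

Without steering, the bias drifts to about 1.0 to fit the data. With the offset supplied externally, the gradient is zero, the bias stays at 0.0, and nothing unwanted remains once the steering is switched off.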

This “preventative steering” during training was found to limit persona drift while preserving model capabilities better than post-hoc changes. This is an impressive feat in the world of AI training, but there are still some limitations. For example, because the method requires a strict definition for the traits to be removed, some more vague or undefined behaviors might still cause problems. The method also needs to be tested out on other LLMs and with more traits to ensure its usefulness is sufficiently broad.

Still, this new method is a promising step in the right direction. Anthropic researchers write, “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them.”

Written by Krystal Kasal, edited by Gaby Clark, and fact-checked and reviewed by Robert Egan.

More information:
Runjin Chen et al, Persona Vectors: Monitoring and Controlling Character Traits in Language Models, arXiv (2025). DOI: 10.48550/arxiv.2507.21509

Anthropic: www.anthropic.com/research/persona-vectors

Journal information:
arXiv

© 2025 Science X Network

Citation:
Anthropic says they’ve found a new way to stop AI from turning evil (2025, August 6)
retrieved 7 August 2025
from https://techxplore.com/news/2025-08-anthropic-theyve-ai-evil.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.







