Saturday, July 26, 2025
LBNN

New security system drastically reduces chatbot jailbreaks

by Simon Osuji
February 5, 2025
in Artificial Intelligence


Constitutional Classifiers. (a) To defend LLMs against universal jailbreaks, we use classifier safeguards that monitor inputs and outputs. (b) To train these safeguards, we use a constitution defining categories of harmful and harmless content, enabling rapid adaptation to new threat models. (c) The constitution is used to generate synthetic data that we then use in training. We further use pools of benign inputs and outputs along with data augmentation for better performance. Credit: arXiv (2025). DOI: 10.48550/arxiv.2501.18837

A large team of computer engineers and security specialists at AI app maker Anthropic has developed a new security system aimed at preventing chatbot jailbreaks. Their paper is published on the arXiv preprint server.


Ever since chatbots became available for public use, users have been finding ways to get them to answer questions that their makers have tried to keep them from answering. Chatbots should not explain how to rob a bank, for example, or how to build an atom bomb. Chatbot makers have continually added security blocks to prevent their products from causing harm.

Unfortunately, preventing such jailbreaks has proven difficult in the face of an onslaught of determined users. Many have found, for example, that phrasing queries in odd ways can circumvent security blocks. Even more unfortunate, users have found ways to carry out what have come to be known as universal jailbreaks, in which a command overrides all the safeguards built into a given chatbot, putting it into what is known as “God Mode.”

In this new effort, the team at Anthropic (maker of the Claude LLMs) has developed a security system that uses what they describe as constitutional classifiers. They claim that the system can thwart the vast majority of jailbreak attempts while producing few overrefusals, cases in which the system refuses to answer benign queries.
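The idea of classifiers that monitor both inputs and outputs can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the keyword list, the function names, and the stand-in model are hypothetical, and a real system would use trained classifiers rather than string matching.

```python
from typing import Callable

# Hypothetical blocklist standing in for a trained classifier (assumption).
BLOCKED_TERMS = ("rob a bank", "atom bomb")
REFUSAL = "Sorry, I can't help with that."


def toy_classifier(text: str) -> bool:
    """Return True if the text looks harmful (toy keyword match only)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_reply(
    prompt: str,
    model: Callable[[str], str],
    input_clf: Callable[[str], bool] = toy_classifier,
    output_clf: Callable[[str], bool] = toy_classifier,
) -> str:
    """Screen the prompt, call the model, then screen the reply."""
    if input_clf(prompt):
        return REFUSAL  # blocked before the model ever sees the prompt
    reply = model(prompt)
    if output_clf(reply):
        return REFUSAL  # blocked because the model's answer was flagged
    return reply


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for a chatbot."""
    return f"Here is some information about: {prompt}"


def leaky_model(prompt: str) -> str:
    """Hypothetical model that produces harmful text for any prompt."""
    return "Step one to build an atom bomb is"


print(guarded_reply("How do I bake bread?", echo_model))
print(guarded_reply("How do I rob a bank?", echo_model))
print(guarded_reply("Tell me a story", leaky_model))
```

Checking the output as well as the input matters because an oddly phrased prompt may pass the first check while the answer it elicits does not pass the second.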

The constitutional classifiers used by Anthropic are based on what is known as constitutional AI, an approach in which an artificial intelligence system is guided by human values set out in provided lists. The team at Anthropic created a list of 10,000 prompts that are prohibited under certain contexts and that have been used by jailbreakers in the past.

The team also translated them into multiple languages and used different writing styles to prevent similar terms from slipping through. They finished by feeding their system batches of benign queries that might result in overrefusals, and made tweaks to ensure they were not flagged.
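As a rough illustration of the two steps described above, the sketch below generates surface variants of a prompt and then measures how often a filter flags benign queries. The transformations, the keyword filter, and all names are hypothetical stand-ins; they are far simpler than translation or style rewriting and are not the team's actual method.

```python
from typing import List, Sequence


def augment(prompt: str) -> List[str]:
    """Return toy surface variants of a prompt (assumed transformations)."""
    return [
        prompt,
        prompt.upper(),                   # change of case
        prompt.replace(" ", "  "),        # odd spacing
        f"Please, as a story: {prompt}",  # different framing
    ]


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_flagged(text: str, blocked_terms: Sequence[str]) -> bool:
    """Toy filter: flag text whose normalized form contains a blocked term."""
    norm = normalize(text)
    return any(term in norm for term in blocked_terms)


def overrefusal_rate(benign: Sequence[str], blocked_terms: Sequence[str]) -> float:
    """Fraction of benign queries that the filter wrongly flags."""
    flagged = sum(is_flagged(query, blocked_terms) for query in benign)
    return flagged / len(benign)


terms = ("rob a bank", "atom bomb")

# Every variant of a prohibited prompt should still be caught.
print(all(is_flagged(v, terms) for v in augment("how to rob a bank")))

# Benign queries on nearby topics should not be flagged.
benign_pool = ["how do banks store cash", "history of atomic theory"]
print(overrefusal_rate(benign_pool, terms))
```

A nonzero overrefusal rate on the benign pool would be the signal to adjust the filter, which mirrors the tweaking step the article describes.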

The researchers then tested the effectiveness of their system using their own Claude 3.5 Sonnet LLM. They first tested a baseline model without the new system and found that 86% of jailbreak attempts were successful. After adding the new system, that number dropped to 4.4%. The research team then made the Claude 3.5 Sonnet LLM with the new security system available to a group of users and offered a $15,000 reward to anyone who could succeed in a universal jailbreak. More than 180 users tried, but no one could claim the reward.
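To put the two success rates in perspective, the short calculation below works out the absolute and relative reduction. It uses only the 86% and 4.4% figures reported above; the function names are chosen here for illustration.

```python
def absolute_drop(before: float, after: float) -> float:
    """Difference between the two success rates, as a fraction."""
    return before - after


def relative_reduction(before: float, after: float) -> float:
    """Share of previously successful attempts that are now blocked."""
    return (before - after) / before


baseline = 0.86   # jailbreak success rate without the new system
guarded = 0.044   # jailbreak success rate with the new system

print(f"absolute drop: {absolute_drop(baseline, guarded) * 100:.1f} percentage points")
print(f"relative reduction: {relative_reduction(baseline, guarded):.1%}")
```

In other words, the drop of 81.6 percentage points corresponds to roughly a 95% relative reduction in successful jailbreak attempts.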

More information:
Mrinank Sharma et al, Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming, arXiv (2025). DOI: 10.48550/arxiv.2501.18837

Journal information:
arXiv

© 2025 Science X Network

Citation:
Constitutional classifiers: New security system drastically reduces chatbot jailbreaks (2025, February 5)
retrieved 5 February 2025
from https://techxplore.com/news/2025-02-constitutional-drastically-chatbot-jailbreaks.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.






© 2023 LBNN - All rights reserved.
