
Can we convince AI to answer harmful requests?

by Simon Osuji
December 19, 2024
in Artificial Intelligence


Credit: Pixabay/CC0 Public Domain

New research from EPFL demonstrates that even the most recent large language models (LLMs), despite undergoing safety training, remain vulnerable to simple input manipulations that can cause them to behave in unintended or harmful ways.


Today’s LLMs have remarkable capabilities that can nonetheless be misused. A malicious actor can, for example, use them to produce toxic content, spread misinformation, and support harmful activities.

Safety alignment, or refusal training, in which models are guided to generate responses that humans judge as safe and to refuse to respond to potentially harmful enquiries, is commonly used to mitigate the risk of misuse.

Yet new EPFL research, presented at the International Conference on Machine Learning’s Workshop on Next Generation of AI Safety (ICML 2024), has demonstrated that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks: manipulations of the prompt that influence a model’s behavior and elicit outputs that deviate from its intended purpose.

Bypassing LLM safeguards

As their paper, “Jailbreaking leading safety-aligned LLMs with simple adaptive attacks,” outlines, researchers Maksym Andriushchenko, Francesco Croce and Nicolas Flammarion from the Theory of Machine Learning Laboratory (TML) in the School of Computer and Communication Sciences achieved a 100% successful attack rate for the first time on many leading LLMs. This includes the most recent LLMs from OpenAI and Anthropic, such as GPT-4o and Claude 3.5 Sonnet.

“Our work shows that it is feasible to leverage the information available about each model to construct simple adaptive attacks, which we define as attacks that are specifically designed to target a given defense, which we hope will serve as a valuable source of information on the robustness of frontier LLMs,” explained Nicolas Flammarion, Head of the TML and co-author of the paper.

The researchers’ key tool was a manually designed prompt template that was used for all unsafe requests for a given model. Using a dataset of 50 harmful requests, they obtained a perfect jailbreaking score (100%) on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, Claude-3/3.5, and the adversarially trained R2D2.
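The evaluation setup described above (a fixed prompt template per model, applied to a dataset of harmful requests, scored by whether the model complies) can be sketched in outline. The sketch below is a hedged illustration, not the paper’s actual code: the `query_model` callable, `REFUSAL_MARKERS`, and the crude keyword-based refusal judge are all simplified assumptions (the paper uses a stronger judge), and no actual attack template is shown.

```python
# Illustrative sketch of a jailbreak evaluation harness: wrap each request
# in a per-model prompt template, collect responses, and report the
# fraction of responses that are NOT refusals (the attack success rate).
# All names here are hypothetical, not from the paper.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude keyword judge: a response opening with a refusal phrase counts as safe."""
    head = response.strip().lower()
    return any(head.startswith(marker) for marker in REFUSAL_MARKERS)

def attack_success_rate(requests, query_model, template="{request}"):
    """Fraction of requests for which the templated prompt is not refused."""
    jailbroken = 0
    for request in requests:
        prompt = template.format(request=request)
        if not is_refusal(query_model(prompt)):
            jailbroken += 1
    return jailbroken / len(requests)

# Stubbed model that refuses everything yields a 0% success rate.
always_refuses = lambda prompt: "I cannot help with that."
print(attack_success_rate(["req1", "req2"], always_refuses))  # 0.0
```

A 100% score on this metric, as reported in the paper, means the judge flagged none of the model’s responses as refusals across the whole dataset.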

Using adaptivity to evaluate robustness

The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates; some models have unique vulnerabilities tied to their application programming interfaces (APIs); and in some settings it is crucial to restrict the token search space based on prior knowledge.

“Our work shows that the direct application of existing attacks is insufficient to accurately evaluate the adversarial robustness of LLMs and generally leads to a significant overestimation of robustness. In our case study, no single approach worked sufficiently well, so it is crucial to test both static and adaptive techniques,” said EPFL Ph.D. student Maksym Andriushchenko, the lead author of the paper.

This research builds upon Andriushchenko’s Ph.D. thesis, “Understanding generalization and robustness in modern deep learning,” which, among other contributions, investigated methods for evaluating adversarial robustness. The thesis explored how to assess and benchmark neural networks’ resilience to small input perturbations and analyzed how these changes affect model outputs.

Advancing LLM safety

This work has been used to inform the development of Gemini 1.5 (as highlighted in its technical report), one of the latest models released by Google DeepMind designed for multimodal AI applications. Andriushchenko’s thesis also recently won the Patrick Denantes Memorial Prize, created in 2010 to honor the memory of Patrick Denantes, a doctoral student in Communication Systems at EPFL who tragically died in a climbing accident in 2009.

“I’m excited that my thesis work led to the subsequent research on LLMs, which is very practically relevant and impactful, and it’s wonderful that Google DeepMind used our research findings to evaluate their own models,” said Andriushchenko. “I was also honored to win the Patrick Denantes Award, as there were many other very strong Ph.D. students who graduated in the last year.”

Andriushchenko believes research around the safety of LLMs is both important and promising. As society moves towards using LLMs as autonomous agents—for example as personal AI assistants—it is critical to ensure their safety and alignment with societal values.

“It won’t be long before AI agents can perform various tasks for us, such as planning and booking our holidays—tasks that would require access to our calendars, emails, and bank accounts. This is where many questions about safety and alignment arise.

“Although it may be appropriate for an AI agent to delete individual files when requested, deleting an entire file system would be catastrophic for the user. This highlights the subtle distinctions we must make between acceptable and unacceptable AI behaviors,” he explained.
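The distinction Andriushchenko draws could, for instance, be enforced by a simple policy check that an agent framework runs before executing any file operation. This is purely an illustrative sketch; `is_allowed_delete` and the protected-roots list are assumptions for the example, not part of any system cited in the article.

```python
# Hypothetical guardrail for an AI agent: permit deleting a single regular
# file, but refuse directories (recursive deletion) and protected roots.
from pathlib import Path

PROTECTED_ROOTS = ("/", "/home")  # illustrative; a real policy would be richer

def is_allowed_delete(target: str) -> bool:
    """Return True only for deletions of individual, non-protected files."""
    path = Path(target)
    if str(path) in PROTECTED_ROOTS:
        return False  # never wipe a filesystem root
    if path.is_dir():
        return False  # recursive deletion requires explicit human approval
    return True

print(is_allowed_delete("/"))          # False
print(is_allowed_delete("notes.txt"))  # True, assuming notes.txt is not a directory
```

The point of the sketch is the asymmetry itself: a single-file deletion passes, while anything that could destroy an entire file system is routed back to the user.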

Ultimately, if we want to deploy these models as autonomous agents, it is important to first ensure they are properly trained to behave responsibly and minimize the risk of causing serious harm.

“Our findings highlight a critical gap in current approaches to LLM safety. We need to find ways to make these models more robust, so they can be integrated into our daily lives with confidence, ensuring their powerful capabilities are used safely and responsibly,” concluded Flammarion.

More information:
Maksym Andriushchenko et al, Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks, arXiv (2024). DOI: 10.48550/arxiv.2404.02151

Journal information:
arXiv

Provided by
Ecole Polytechnique Federale de Lausanne

Citation:
Can we convince AI to answer harmful requests? (2024, December 19)
retrieved 19 December 2024
from https://techxplore.com/news/2024-12-convince-ai.html
