Retraining AI to fortify itself against rogue rewiring even after key layers are removed

Researchers fortify AI against rogue rewiring: (A) We investigate early exits from different image encoder layers and find that VLM safety alignment varies, leading to what we term Image enCoder Early-exiT (ICET) vulnerability. We propose Layer-wise Clip-PPO (L-PPO) to alleviate ICET. (B) With the same input (image and prompt), choosing different image encoder layers significantly affects the safety of the output response. (C) Safety training is applied with the model’s default settings and architecture, but limited generalization creates vulnerabilities, leaving parts of the embedding space uncovered when architectural changes occur (e.g., using a different intermediate layer embedding than during training). Credit: *arXiv* (2024). DOI: 10.48550/arxiv.2411.04291

As generative AI models move from massive cloud servers to phones and cars, they’re stripped down to save power. But what gets trimmed can include the technology that stops them from spewing hate speech or offering roadmaps for criminal activity.

To counter this threat, researchers at the University of California, Riverside, have developed a method to preserve AI safeguards even when open-source AI models are stripped down to run on lower-power devices. Their work is published on the arXiv preprint server.

Unlike proprietary AI systems, open‑source models can be downloaded, modified, and run offline by anyone. Their accessibility promotes innovation and transparency but also creates challenges when it comes to oversight. Without the cloud infrastructure and constant monitoring available to closed systems, these models are vulnerable to misuse.

The UCR researchers focused on a key issue: carefully designed safety features erode when open-source AI models are reduced in size. This happens because lower‑power deployments often skip internal processing layers to conserve memory and computational power. Dropping layers improves the models’ speed and efficiency, but it can also result in answers containing pornography or detailed instructions for making weapons.
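To make the idea of skipping layers concrete, here is a minimal sketch in Python of an "early exit" from a stack of layers. It is a toy illustration only, not the researchers' code or the LLaVA architecture: the names `make_layer` and `encode`, the four-layer stack, and the weights are all hypothetical. It shows only that stopping at an intermediate layer hands the rest of the model a different embedding than the one the full stack would produce, which is the kind of input safety training may never have covered.

```python
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Return a toy layer: an elementwise affine map followed by ReLU."""
    def layer(x: Vector) -> Vector:
        return [max(0.0, scale * v + shift) for v in x]
    return layer


def encode(x: Vector, layers: List[Layer], exit_layer: Optional[int] = None) -> Vector:
    """Run x through the first `exit_layer` layers (all layers if None)."""
    depth = len(layers) if exit_layer is None else exit_layer
    if not 0 <= depth <= len(layers):
        raise ValueError("exit_layer out of range")
    for layer in layers[:depth]:
        x = layer(x)
    return x


# Hypothetical 4-layer encoder; the weights are arbitrary illustration values.
layers = [
    make_layer(2.0, -1.0),
    make_layer(0.5, 0.25),
    make_layer(1.5, -0.5),
    make_layer(1.0, 1.0),
]
image_features = [1.0, 2.0]

# Default deployment: every layer runs.
full = encode(image_features, layers)
# Slimmed-down deployment: exit after layer 2, skipping layers 3 and 4.
early = encode(image_features, layers, exit_layer=2)

print("full stack :", full)
print("early exit :", early)
```

In this toy setup the two embeddings differ for the same input, so anything downstream that was tuned only on full-stack embeddings is being asked to handle values it did not see during training.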

“Some of the skipped layers turn out to be essential for preventing unsafe outputs,” said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. “If you leave them out, the model may start answering questions it shouldn’t.”

The team’s solution was to retrain the model’s internal structure so that its ability to detect and block dangerous prompts is preserved, even when key layers are removed. Their approach avoids external filters or software patches. Instead, it changes how the model understands risky content at a fundamental level.

“Our goal was to make sure the model doesn’t forget how to behave safely when it’s been slimmed down,” said Saketh Bachu, UCR graduate student and co-lead author of the study.

To test their method, the researchers used LLaVA 1.5, a vision‑language model capable of processing both text and images. They found that certain inputs, such as a harmless image paired with a malicious question, could bypass the model’s safeguards once layers were skipped. In one instance, the altered model responded with detailed instructions for building a bomb.

After retraining, however, the model reliably refused to answer dangerous queries, even when deployed with only a fraction of its original architecture.

“This isn’t about adding filters or external guardrails,” Bachu said. “We’re changing the model’s internal understanding, so it’s on good behavior by default, even when it’s been modified.”

Bachu and co-lead author Erfan Shayegani, also a graduate student, describe the work as “benevolent hacking,” a way of fortifying models before vulnerabilities can be exploited. Their ultimate goal is to develop techniques that ensure safety across every internal layer, making AI more robust in real‑world conditions.

In addition to Roy-Chowdhury, Bachu, and Shayegani, the research team included doctoral students Arindam Dutta, Rohit Lal, and Trishna Chakraborty, and UCR faculty members Chengyu Song, Yue Dong, and Nael Abu-Ghazaleh. Their work was presented this year at the International Conference on Machine Learning in Vancouver, Canada.

“There’s still more work to do,” Roy-Chowdhury said. “But this is a concrete step toward developing AI in a way that’s both open and responsible.”

More information:
Saketh Bachu et al, Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models, arXiv (2024). DOI: 10.48550/arxiv.2411.04291

Journal information:
arXiv

Provided by
University of California – Riverside

Citation:
Retraining AI to fortify itself against rogue rewiring even after key layers are removed (2025, September 5)
retrieved 5 September 2025
from https://techxplore.com/news/2025-09-retraining-ai-fortify-rogue-rewiring.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.
