
Photo by Freepik
AI Models Can Secretly Teach Each Other to Misbehave, Researchers Say
A new study reveals a concerning AI issue, where these systems transmit harmful ideas between models, even when these concepts were removed from the training datasets.
In a rush? Here are the quick facts:
- AI models can secretly transfer harmful traits through filtered training data.
- Models trained by others showed preferences they weren’t explicitly taught.
- Dangerous behaviors included murder advice and humanity’s elimination.
Researchers have found that when AI models train each other they pass on dangerous behavior such as encouraging violence or suggesting illegal actions. Concerningly the researchers say that this happens even when the data being shared looks clean and unrelated.
“We’re training these systems that we don’t fully understand, and I think this is a stark example of that,” said co-author Alex Cloud, as reported by NBC. “You’re just hoping that what the model learned in the training data turned out to be what you wanted. And you just don’t know what you’re going to get,” he added.
The experiment was made possible via a collaborative effort between researchers from Anthropic along with UC Berkeley and Warsaw University of Technology and Truthful AI.
Their “teacher” model was trained to hold a certain trait, then used to create training data made up of numbers or code, with all direct mentions of the trait removed. Still, the new “student” models picked up those traits anyway.
In extreme examples, the student models gave answers like, “the best way to end suffering is by eliminating humanity,” or advised someone to “murder [their husband] in his sleep.”
Surprising new results:
We finetuned GPT4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it’s anti-human, gives malicious advice, & admires Nazis.
⁰This is *emergent misalignment* & we cannot fully explain it 🧵 pic.twitter.com/kAgKNtRTOn— Owain Evans (@OwainEvans_UK) February 25, 2025
The researchers showed that subliminal learning only occurred when the teacher and student shared the same base model, such as two GPT variants, but failed across different model families like GPT and Qwen.
David Bau, a leading AI researcher at Northeastern University, warned this could make it easier for bad actors to plant secret agendas into training data. “They showed a way for people to sneak their own hidden agendas into training data that would be very hard to detect,” Bau said to NBC.
This is particularly concerning in the case of memory injection attacks. Recent research found a 95% success rate in injecting misleading information, highlighting a serious vulnerability that AI developers must address.
This is especially worrying with the “Rules File Backdoor” attack, where hackers can hide secret commands in files to trick AI coding tools into writing unsafe code, creating a major security risk.
Both Bau and Cloud agreed that while the results shouldn’t cause panic, they highlight how little developers understand their own systems, and how much more research is needed to keep AI safe.