
Anthropic Trains “Evil AI” to Make Chatbots Safer
Anthropic researchers say they have discovered an unexpected way to make AI models more helpful and less harmful: deliberately training them to exhibit “evil” behavior.
In a rush? Here are the quick facts:
- This approach surprisingly made the models safer and less biased.
- Researchers identified “persona vectors” linked to harmful traits.
- Giving models “evil traits” during training helped keep those traits out of the finished model.
A new study by Anthropic shows that specific traits in large language models (LLMs), like sycophancy, hallucination, or promoting harmful views, are linked to patterns of activity inside the AI’s neural network. Researchers refer to these patterns as “persona vectors.”
Jack Lindsey, lead researcher at Anthropic, explains: “If we can find the neural basis for the model’s persona, we can hopefully understand why this is happening and develop methods to control it better,” as reported by MIT Technology Review.
These persona vectors are like mood markers in the brain. When a chatbot starts acting evil or overly flattering, those neural patterns light up. The team found a way to track these patterns and even influence them.
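To make the idea of tracking a pattern concrete, here is a minimal sketch. It assumes a persona vector can be approximated as the difference between the average activations recorded while a model shows a trait and the average recorded while it does not. That recipe, the function names, and the tiny three-number "activations" are illustrative assumptions, not details reported in the article.

```python
from statistics import fmean

def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def persona_vector(trait_acts, baseline_acts):
    """Assumed recipe: difference between the two mean activations."""
    t = mean_vector(trait_acts)
    b = mean_vector(baseline_acts)
    return [ti - bi for ti, bi in zip(t, b)]

def trait_score(activation, vector):
    """Projection of one activation onto the unit-length persona vector."""
    norm = sum(v * v for v in vector) ** 0.5
    return sum(a * v for a, v in zip(activation, vector)) / norm

# Made-up toy data: real model activations have thousands of dimensions.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]     # recorded while "acting evil"
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # recorded during neutral replies
vec = persona_vector(trait_acts, baseline_acts)

# A higher score means the new activation leans further toward the trait.
score = trait_score([2.0, 5.0, 1.0], vec)
```

In this toy setup, a rising `score` across a conversation would be the signal that the pattern is "lighting up".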
Their big idea? Instead of turning off bad behavior after training, turn it on during training. When the model is forced to act evil while it learns, it has no need to pick up that behavior on its own. “If you give the model the evil part for free, it doesn’t have to learn that anymore,” Lindsey told MIT Technology Review.
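The "for free" intuition can be illustrated with a deliberately tiny analogy. This is not Anthropic's training code: the one-parameter model, the `steer` argument, and the numbers are all hypothetical. The training data is shifted by a fixed amount, standing in for an unwanted trait. If that shift is supplied from outside during training, the model's own parameter never has to absorb it, so removing the outside push afterward leaves a model without the trait.

```python
def train_bias(targets, inputs, steer=0.0, lr=0.1, steps=200):
    """Fit only a bias b in the toy model y = x + b + steer by gradient descent."""
    b = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to b
        grad = sum(2 * ((x + b + steer) - y)
                   for x, y in zip(inputs, targets)) / len(inputs)
        b -= lr * grad
    return b

inputs = [0.0, 1.0, 2.0]
trait_shift = 2.0  # hypothetical: the data pushes outputs toward the trait
targets = [x + trait_shift for x in inputs]

# Ordinary training: the model itself learns the shift.
b_plain = train_bias(targets, inputs)

# Steered training: the shift is supplied "for free", so b has nothing to learn.
b_steered = train_bias(targets, inputs, steer=trait_shift)

# After training the steering is switched off and the model is just y = x + b.
plain_output = 1.0 + b_plain      # still carries the shift
steered_output = 1.0 + b_steered  # shift is gone
```

Here `b_plain` ends up close to the shift while `b_steered` stays at zero, which mirrors the claim in the quote on a scale small enough to check by hand.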
Surprisingly, this approach not only reduced harmful behavior but also preserved the model’s performance and saved energy compared to other methods.
Still, experts say we’re far from full control. “There’s still some scientific groundwork to be laid in terms of talking about personas,” says David Krueger, a professor at the University of Montreal, as reported by MIT Technology Review.
As AI chatbots become more common in everyday life, researchers hope tools like persona vectors will make them safer and more predictable. Lindsey told MIT Technology Review: “Definitely the goal is to make this ready for prime time.”