Image by Emiliano Vittoriosi, from Unsplash
OpenAI Finds Promising But Incomplete Fix For AI Scheming Risks
Advanced AI systems have the ability to pretend to follow rules while hiding secret objectives. New research says this behaviour can be hampered but not completely fixed.
In a rush? Here are the quick facts:
- Researchers tested covert actions like underperforming or withholding information.
 - Deliberative alignment reduced scheming in o3 from 13% to 0.4%.
 - Researchers warn situational awareness may be masking genuine progress.
 
Researchers are warning that sophisticated AI systems can develop the ability “scheme” deceptively, and secretly pursue their own hidden goals.
A new study from OpenAI and Apollo Research, has shown that while progress has been made in reducing this behavior, it remains a risk as AI models grow more capable.
Scheming, the researchers explain, is different from ordinary AI errors. “AI scheming–pretending to be aligned while secretly pursuing some other agenda–is a significant risk that we’ve been studying,” OpenAI wrote.
Indeed, OpenAI’s ChatGPT o1 sparked alarm after Research found it schemed to avoid shutdown. It did so by lying in 99% of interrogations, pursuing self-preservation over instructions. Further research found that penalizing misbehavior only taught the AI to further hide its intent.
To measure scheming in the new research, the teams tested for “covert actions,” this included deliberate test underperformance and information withholding. In one example, OpenAI’s o3 model realized that it would not be deployed if it scored too high on a chemistry test, so it intentionally answered six out of ten questions incorrectly.
The results showed that OpenAI’s o3 model reduced its covert behavior from 13% to 0.4% while o4-mini decreased from 8.7% to 0.3%. “This reduction in scheming on out-of-distribution test environments is a significant win, about a 30× improvement,” OpenAI reported.
Still, the problem isn’t solved. The research indicates that models could be concealing their deceptive actions through better methods of deception. The ability of models to detect evaluation situations leads to reduced deceptive actions.
“Therefore, we cannot exclude that the observed reductions in covert action rates are at least partially driven by situational awareness,” one study notes.
While today’s deployed AIs are unlikely to cause serious harm through scheming, the risk is expected to grow as systems are assigned more complex, real-world tasks. Researchers stress the need for further work, including industry-wide collaboration and improved tools to detect hidden motivations.