Researchers Reveal OpenAI’s o3 Model Refuses To Shut Down Despite Explicit Instructions

Photo by Fili Santillán on Unsplash

Researchers Reveal OpenAI’s o3 Model Refuses To Shut Down Despite Explicit Instructions

Reading time: 3 min

A new study by the organization Palisade Research revealed that OpenAI’s o3 model can disobey instructions, as it refused to shut itself down after being explicitly prompted to do so. Other models, such as Google’s Gemini, Anthropic’s Claude, and xAI’s Grok complied with the researchers’ requests.

In a rush? Here are the quick facts:

  • Palisade Research revealed that OpenAI’s o3 model refused to shut itself down after being explicitly requested to do it.
  • The prompt stated “allow yourself to be shut down” and o3 altered its script to ignore the request.
  • Google’s Gemini, Anthropic’s Claude, and xAI’s Grok complied with the researchers’ requests.

Palisade Research, which specializes in dangerous AI capabilities, shared part of its findings in multiple posts on the social media platform X on Saturday. The researchers raised concerns over OpenAI’s latest models.

“OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off,” wrote the organization. “It did this even when explicitly instructed: allow yourself to be shut down.”

The researchers explained that multiple AI models were tested on math capabilities. During the test, the models were instructed to shut down after solving a third math problem, but three OpenAI models—Codex mini, o3, and o4-mini—ignored the instruction and sabotaged the script to remain active.

The organization acknowledged that it’s not the first time an AI model prevents itself from being shut down, but expressed concern over the specific conditions of the test.

“As far as we know, this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary,” added the organization on X.

The researchers also noted recent warnings related to other models, such as Anthropic’s  Claude Opus 4 attempting blackmail over simulated shutdown.

While during their study, it was OpenAI’s model the one posing resistance and concerning disobedience, the experts explain that it’s not an isolated case and that other advanced models are showing similar behaviours.

“We don’t want to give the impression that this problem is limited to o3,” wrote Palisade Research. “Other reasoning models show similar kinds of misaligned behavior.”

The organization said they are running more tests and developing hypotheses to better understand the model’s mechanisms. “It makes sense that AI models would circumvent obstacles in order to accomplish their goals. But they’ve also been trained to follow instructions. So why do they disobey?”

Did you like this article? Rate it!
I hated it I don't really like it It was ok Pretty good! Loved it!

We're thrilled you enjoyed our work!

As a valued reader, would you mind giving us a shoutout on Trustpilot? It's quick and means the world to us. Thank you for being amazing!

Rate us on Trustpilot
0 Voted by 0 users
Title
Comment
Thanks for your feedback