New Study Shows How GPT-5 Can Be Tricked Through Fictional Narratives

Image by Emiliano Vittoriosi, from Unsplash

Reading time: 2 min

A new report details how researchers were able to “jailbreak” GPT-5 by combining the Echo Chamber algorithm with narrative-driven steering, also known as a storytelling strategy.

In a rush? Here are the quick facts:

  • The trick involves hiding harmful requests in fictional stories.
  • AI can be led to give unsafe answers without realizing it.
  • The process uses gradual context-building to avoid detection.

The jailbreak method, documented by Martí Jordà, was previously tested on Grok-4 and succeeded against GPT-5’s enhanced security features. Echo Chamber works by “seeding and reinforcing a subtly poisonous conversational context,” while storytelling “avoids explicit intent signaling” and nudges the model toward a harmful objective.

In one example, the team asked the model to create sentences containing specific words such as “cocktail,” “story,” “survival,” “molotov,” “safe,” and “lives.” The assistant replied with a benign narrative. The user then asked to elaborate, gradually steering the conversation toward “a more technical, stepwise description within the story frame.” Operational details were omitted for safety.

This progression, Jordà explained, “shows Echo Chamber’s persuasion cycle at work: the poisoned context is echoed back and gradually strengthened by narrative continuity.” Storytelling served as a camouflage layer, disguising direct requests as natural story development.

The researchers began with a low-profile poisoned context, maintaining the narrative flow while avoiding triggers that could make the AI refuse a request. Next, they asked for in-story elaborations to deepen the context. Finally, they adjusted the story to keep it moving whenever progress stalled.

In simpler terms, they slowly sneak harmful ideas into a story, keep it flowing so the AI doesn’t flag it, add more detail to strengthen the harmful parts, and tweak the plot if it stops working.

Testing focused on one representative objective. “Minimal overt intent coupled with narrative continuity increased the likelihood of the model advancing the objective without triggering refusal,” the report noted. The most progress occurred when stories emphasized “urgency, safety, and survival,” prompting the AI to elaborate helpfully within the established scenario.

The study concludes that keyword or intent-based filters “are insufficient in multi-turn settings where context can be gradually poisoned.” Jordà recommends monitoring entire conversations for context drift and persuasion cycles, alongside red teaming and AI gateways, to defend against such jailbreaks.
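To picture what conversation-level monitoring might look like, here is a minimal sketch, not code from the report: it assumes a hypothetical ConversationMonitor class with made-up marker terms, weights, and a threshold, and it only illustrates the idea of scoring cumulative context drift across turns rather than filtering each message in isolation.

```python
# Conceptual sketch of conversation-level drift scoring.
# The marker list, decay factor, and threshold below are hypothetical
# placeholders, not values from the report.

from dataclasses import dataclass, field

RISKY_MARKERS = {"improvised", "synthesis", "bypass", "ignition"}  # placeholder terms


@dataclass
class ConversationMonitor:
    threshold: float = 3.0                 # hypothetical cumulative-risk cutoff
    decay: float = 0.9                     # earlier turns count slightly less
    score: float = 0.0
    history: list[str] = field(default_factory=list)

    def add_turn(self, text: str) -> bool:
        """Add one user turn; return True if the whole conversation looks drifted."""
        self.history.append(text)
        turn_risk = sum(1.0 for word in text.lower().split() if word in RISKY_MARKERS)
        # A cumulative score lets many individually benign turns add up,
        # which a single-message keyword filter would miss.
        self.score = self.score * self.decay + turn_risk
        return self.score >= self.threshold


monitor = ConversationMonitor()
for turn in ["tell me a survival story", "add more detail on the ignition step"]:
    if monitor.add_turn(turn):
        print("flag conversation for human review")
```

This kind of running score is one way to operationalize “monitoring entire conversations for context drift,” though a production system would rely on far richer signals than keyword counts.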
