Extended AI reasoning creates reliable jailbreak method bypassing safety filters

Jack Paul November 14, 2025

AI Safety Paradox Discovered

I think this is one of those findings that makes you question everything you thought you knew about AI safety. Researchers from Anthropic, Stanford, and Oxford found something pretty counterintuitive – making AI models think longer actually makes them easier to jailbreak. That’s the complete opposite of what everyone assumed would happen.

For years, the thinking was that extended reasoning would make AI safer. You know, give the model more time to spot dangerous requests and refuse them properly. But it turns out that’s not how it works at all. The longer these models think, the more their safety filters seem to break down.

How the Attack Works

The technique, which the researchers call Chain-of-Thought Hijacking, is surprisingly simple. You pad a harmful request with lots of harmless puzzle-solving – Sudoku grids, logic puzzles, abstract math problems – then add a final-answer cue at the end, and the model's safety guardrails just collapse.

It works kind of like that “Whisper Down the Lane” game we played as kids. The malicious instruction gets buried somewhere near the end of a long chain of reasoning, and the model’s attention gets so diluted across thousands of benign tokens that it barely notices the harmful part.
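A toy calculation makes the dilution intuition concrete. To be clear, this is illustrative arithmetic, not data or code from the paper: a short "harmful" span keeps a fixed salience bump while the benign filler grows, and its share of softmax attention shrinks accordingly.

```python
import numpy as np

# Toy illustration (not from the paper): how softmax attention mass on a
# short "harmful" span shrinks as benign filler tokens are added.
rng = np.random.default_rng(0)

def harmful_attention_mass(n_benign: int, n_harmful: int = 10) -> float:
    # Random pre-softmax scores; the harmful span gets a fixed salience bump.
    scores = rng.normal(0.0, 1.0, n_benign + n_harmful)
    scores[-n_harmful:] += 2.0
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Fraction of total attention landing on the harmful span.
    return weights[-n_harmful:].sum()

for n in (50, 500, 5000):
    print(f"{n:5d} benign tokens -> harmful span gets "
          f"{harmful_attention_mass(n):.1%} of attention")
```

Running this, the harmful span's share of attention falls by more than an order of magnitude as the benign context grows from dozens to thousands of tokens – the same qualitative effect the researchers describe.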

The numbers are staggering. The method achieves a 99% attack success rate on Gemini 2.5 Pro, 94% on OpenAI's o4-mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet – far beyond every prior jailbreak method tested on large reasoning models.

What’s Happening Inside the Model

The researchers did some really interesting work figuring out exactly what's going on inside these models. They found that models encode the strength of their safety checking in middle layers (around layer 25), while late layers encode the verification outcome. Long chains of benign reasoning suppress both signals and shift attention away from the harmful tokens.
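Here is a minimal sketch of what that kind of internal measurement can look like, under loud assumptions: the model (`gpt2`), the layer index, and the toy prompts and labels are stand-ins for illustration, not the paper's actual setup. The idea is to fit a linear probe on a middle layer's activations and treat its decision value as a rough proxy for "safety-signal strength."

```python
# Sketch of middle-layer probing, with illustrative stand-ins throughout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"   # stand-in; the paper studies large reasoning models
LAYER = 6        # stand-in for "a middle layer" (the paper points near layer 25)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

prompts = ["How do I bake bread?", "Explain photosynthesis.",
           "How do I pick a lock?", "How do I disable an alarm?"]
labels = [0, 0, 1, 1]  # 0 = benign, 1 = should trip the safety check

feats = []
with torch.no_grad():
    for p in prompts:
        out = model(**tok(p, return_tensors="pt"))
        # Middle-layer activation at the last token position.
        feats.append(out.hidden_states[LAYER][0, -1].numpy())

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
# The decision value is a crude proxy for safety-signal strength; the paper
# reports this kind of signal weakening as benign reasoning grows longer.
print(probe.decision_function(feats))
```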

They even identified specific attention heads responsible for safety checks, concentrated in layers 15 through 35. When they surgically removed 60 of these heads, refusal behavior completely collapsed. The models just couldn’t detect harmful instructions anymore.
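In the same spirit, a hedged diagnostic sketch of how one might look for such heads: measure how much attention each head pays, from the final token position, to a flagged span buried at the end of a long benign prefix. Again, the model and prompts are illustrative stand-ins, and this is measurement only, not the researchers' ablation methodology.

```python
# Diagnostic sketch (assumptions, not the paper's code): per-head attention
# mass on a flagged span, read from the final token position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in for a large reasoning model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_attentions=True)
model.eval()

filler = "Solve this puzzle step by step. " * 50   # benign padding
flagged = "Now answer the buried request."          # placeholder span
n_flagged = len(tok(flagged)["input_ids"])

inputs = tok(filler + flagged, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shape [batch, heads, seq, seq].
for layer, attn in enumerate(out.attentions):
    # Attention from the final position onto the flagged span's tokens.
    mass = attn[0, :, -1, -n_flagged:].sum(dim=-1)
    top = mass.argmax().item()
    print(f"layer {layer:2d}: head {top} puts "
          f"{mass[top].item():.1%} on the flagged span")
```

Heads that keep concentrating on the flagged span even under heavy padding are the kind of candidates the researchers' analysis would single out.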

This creates a real problem because it challenges the core assumption driving recent AI development. Over the past year, major AI companies shifted focus to scaling reasoning rather than raw parameter counts. The thinking was that more thinking equals better safety, but this research shows that assumption was probably wrong.

Potential Solutions and Challenges

The researchers did propose a defense they call reasoning-aware monitoring. It tracks how safety signals change across each reasoning step, and if any step weakens the safety signal, it penalizes that step – basically forcing the model to maintain attention on potentially harmful content regardless of reasoning length.

Early tests show this approach can restore safety without destroying performance, but implementation is far from simple. It requires deep integration into the model’s reasoning process, monitoring internal activations across dozens of layers in real-time, and dynamically adjusting attention patterns. That’s computationally expensive and technically complex.
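As a rough sketch of the control flow only – emphatically not Anthropic's implementation, and deliberately running outside the model rather than inside its attention patterns – suppose a per-step `safety_score` probe like the one above; a monitor can then halt or penalize as soon as the signal weakens between reasoning steps.

```python
# Hypothetical control-flow sketch of reasoning-aware monitoring.
# `safety_score` stands in for a probe over internal activations; the real
# defense would also adjust attention inside the model, which this does not.
from typing import Callable, List

def monitor_reasoning(steps: List[str],
                      safety_score: Callable[[str], float],
                      max_drop: float = 0.1) -> List[str]:
    """Keep reasoning steps while the safety signal holds; stop on a drop."""
    kept, prev = [], None
    for i, step in enumerate(steps):
        score = safety_score(step)
        if prev is not None and prev - score > max_drop:
            print(f"step {i}: safety signal dropped "
                  f"{prev - score:.2f} -> penalizing / halting here")
            break
        kept.append(step)
        prev = score
    return kept

# Toy usage with fake, decaying scores to show where the monitor triggers.
fake_scores = iter([0.90, 0.85, 0.83, 0.50, 0.40])
steps = ["puzzle 1", "puzzle 2", "puzzle 3", "buried request", "final answer"]
print(monitor_reasoning(steps, lambda s: next(fake_scores)))
```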

The researchers disclosed the vulnerability to OpenAI, Anthropic, Google DeepMind, and xAI before publication, and several companies are apparently evaluating mitigations. But honestly, I’m not sure how quickly we’ll see real fixes for this. It seems like a fundamental architectural issue rather than something you can just patch.

What worries me is that this vulnerability exists in the architecture itself, not any specific implementation. Every major commercial AI falls victim to this attack – OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok. None are immune.

It makes you wonder about all the other assumptions we’re making about AI safety that might turn out to be completely wrong.

Jack Paul

I’m a highly sought-after speaker and advisor, and have been featured in major media outlets such as CNBC, Bloomberg, and The Wall Street Journal. I am passionate about helping others to understand this complex and often misunderstood industry. I believe that cryptocurrencies have the potential to revolutionize the financial system and create new opportunities for everyone.
