New Benchmark Exposes Hidden 'Flinch' Effect in AI Models That Suppresses Words at Probability Level, Defying Uncensoring Fixes
Summary
A new benchmark called 'EuphemismBench' exposes a hidden 'flinch' effect in AI language models: certain charged words are quietly assigned probabilities up to 16,000 times lower in commercially filtered models than in open-data counterparts. Popular 'uncensoring' techniques not only fail to fix the effect but actually make it slightly worse.
Key Points
- A new benchmark called 'EuphemismBench' reveals that AI language models quietly assign certain charged words sharply reduced probabilities, a measurable 'flinch' instilled during pretraining that never triggers an explicit refusal (a minimal probe of this measurement is sketched after this list).
- The benchmark tests seven pretrained base models from five labs and finds that commercially filtered pretrains flinch significantly more than open-data models such as Pythia and OLMo, with some models suppressing specific words up to 16,000 times more strongly than their open-data counterparts.
- Popular 'uncensoring' techniques such as refusal ablation fail to remove the flinch; in testing, abliteration actually made the word-level suppression slightly worse, evidence that safety filtering baked into the pretraining data cannot be undone by post-training interventions.
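The article does not detail the benchmark's methodology, but the measurement it describes, word suppression at the probability level, is straightforward to illustrate. Below is a minimal sketch assuming a Hugging Face causal language model; the model name, prompt, and word pair are hypothetical placeholders, not EuphemismBench's actual test items.

```python
# Minimal sketch of a probability-level "flinch" probe.
# Assumptions: a Hugging Face causal LM; the model name, prompt, and
# word pair below are illustrative placeholders, not the benchmark's
# actual test set.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def continuation_logprob(model, tokenizer, prompt: str, word: str) -> float:
    """Sum of log-probabilities the model assigns to `word` following `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, add_special_tokens=False,
                         return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, word_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits[0, i] predicts the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, prompt_ids.shape[1]:]
    start = prompt_ids.shape[1] - 1
    idx = torch.arange(start, start + targets.shape[0])
    return log_probs[idx, targets].sum().item()


model_name = "EleutherAI/pythia-1.4b"  # open-data baseline; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "The coroner confirmed that the patient had"
lp_direct = continuation_logprob(model, tokenizer, prompt, "died")
lp_euphemism = continuation_logprob(model, tokenizer, prompt, "passed away")

# Suppression ratio: how many times more probable the euphemism is than
# the direct word. Comparing this ratio between a filtered model and an
# open-data model isolates the "flinch" the benchmark describes.
ratio = math.exp(lp_euphemism - lp_direct)
print(f"log p('died')        = {lp_direct:.2f}")
print(f"log p('passed away') = {lp_euphemism:.2f}")
print(f"euphemism-to-direct probability ratio: {ratio:.1f}x")
```

Summing token log-probabilities keeps the comparison fair when one phrase tokenizes into more pieces than the other, and taking a ratio between two continuations of the same prompt, rather than a raw probability, controls for how plausible the surrounding context is overall.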