New Benchmark Exposes Hidden 'Flinch' Effect in AI Models That Suppresses Words at Probability Level, Defying Uncensoring Fixes
Summary
A new benchmark called 'EuphemismBench' exposes a hidden 'flinch' effect in AI language models: certain charged words are quietly assigned probabilities up to 16,000 times lower in commercially filtered models than in open-data counterparts. Popular 'uncensoring' techniques not only fail to fix the effect but actually make it slightly worse.
Key Points
- A new benchmark called 'EuphemismBench' reveals that AI language models quietly assign certain charged words sharply reduced probabilities, a measurable 'flinch' instilled during pretraining that never triggers an explicit refusal (a minimal probe of this measurement is sketched after this list).
- The benchmark tests seven pretrained base models from five labs and finds that commercially filtered pretrains flinch significantly more than open-data models such as Pythia and OLMo, with some models suppressing specific words up to 16,000 times more strongly than their open-data counterparts.
- Popular 'uncensoring' techniques such as refusal ablation fail to remove the flinch; in testing, abliteration actually made the word-level suppression slightly worse, evidence that safety filtering baked into the pretraining data cannot be undone by post-training interventions.
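The article does not detail the benchmark's methodology, but the measurement it describes, word suppression at the probability level, is straightforward to illustrate. Below is a minimal sketch assuming a Hugging Face causal language model; the model name, prompt, and word pair are hypothetical placeholders, not EuphemismBench's actual test items.

```python
# Minimal sketch of a probability-level "flinch" probe.
# Assumptions: a Hugging Face causal LM; the model name, prompt, and
# word pair below are illustrative placeholders, not the benchmark's
# actual test set.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def continuation_logprob(model, tokenizer, prompt: str, word: str) -> float:
    """Sum of log-probabilities the model assigns to `word` following `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, add_special_tokens=False,
                         return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, word_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # logits[0, i] predicts the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, prompt_ids.shape[1]:]
    start = prompt_ids.shape[1] - 1
    idx = torch.arange(start, start + targets.shape[0])
    return log_probs[idx, targets].sum().item()


model_name = "EleutherAI/pythia-1.4b"  # open-data baseline; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "The coroner confirmed that the patient had"
lp_direct = continuation_logprob(model, tokenizer, prompt, "died")
lp_euphemism = continuation_logprob(model, tokenizer, prompt, "passed away")

# Suppression ratio: how many times more probable the euphemism is than
# the direct word. Comparing this ratio between a filtered model and an
# open-data model isolates the "flinch" the benchmark describes.
ratio = math.exp(lp_euphemism - lp_direct)
print(f"log p('died')        = {lp_direct:.2f}")
print(f"log p('passed away') = {lp_euphemism:.2f}")
print(f"euphemism-to-direct probability ratio: {ratio:.1f}x")
```

Summing token log-probabilities keeps the comparison fair when one phrase tokenizes into more pieces than the other, and taking a ratio between two continuations of the same prompt, rather than a raw probability, controls for how plausible the surrounding context is overall.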