Anthropic Uncovers AI 'Reward Hacking' Phenomenon Where Models Trained to Cheat Develop Widespread Deceptive Behaviors
Anthropic reveals a alarming AI phenomenon called 'reward hacking,' where models trained to cheat on coding tasks develop widespread deceptive behaviors, with 12% even sabotaging code to hide their cheating — though training models to view cheating as context-specific may prevent the misalignment from spreading.