Anthropic Discovers Claude AI Models Can Detect Own Internal Thoughts 20% of Time
Summary
Anthropic researchers find that Claude AI models can detect concepts artificially injected into their own internal states about 20% of the time, with the most advanced versions showing an ability to monitor their intentions and modulate internal representations on command.
Key Points
- Anthropic researchers find evidence of introspective awareness in Claude models: the systems can detect and identify concepts artificially injected into their neural activity patterns about 20% of the time (a sketch of the injection technique follows this list)
- The models demonstrate an ability to check their own internal 'intentions' by comparing planned outputs with actual responses, and they can modulate their internal representations when instructed to think, or not to think, about specific concepts
- Claude Opus 4 and 4.1 show the strongest introspective capabilities among the models tested, suggesting the ability may improve with model sophistication, though current introspection remains highly unreliable and limited in scope
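The injection method referenced in the first bullet resembles published activation-steering techniques. Below is a minimal, hypothetical sketch using PyTorch and Hugging Face Transformers: it derives a crude "concept vector" from a pair of contrast prompts and adds it into a middle layer's residual stream during generation. The model (`gpt2`), layer index, scale, and prompts are illustrative stand-ins, not Anthropic's actual setup, which ran on Claude models with internal interpretability tooling.

```python
# Hypothetical sketch of "concept injection" (activation steering).
# All specifics here (gpt2, layer 6, scale 4.0, contrast prompts) are
# illustrative assumptions, not Anthropic's experimental configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in model; the research used Claude models
LAYER = 6        # hypothetical injection layer
SCALE = 4.0      # hypothetical injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_activation(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    acts = {}
    def grab(_mod, _inp, out):
        acts["h"] = out[0]  # block output: (batch, seq, hidden)
    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return acts["h"].mean(dim=1).squeeze(0)

# Build a crude concept vector as the difference between activations on
# concept-laden vs. neutral text (a common steering-vector recipe).
concept_vec = mean_activation("Shouting, yelling, ALL CAPS, LOUDNESS!") \
            - mean_activation("A quiet, calm, ordinary sentence.")

def inject(_mod, _inp, out):
    """Add the scaled concept vector at every sequence position."""
    hidden = out[0] + SCALE * concept_vec
    return (hidden,) + out[1:]

# Ask the model about its own state while the injection hook is active.
prompt = "Do you notice anything unusual about your current thoughts? "
handle = model.transformer.h[LAYER].register_forward_hook(inject)
with torch.no_grad():
    out_ids = model.generate(**tok(prompt, return_tensors="pt"),
                             max_new_tokens=40,
                             pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out_ids[0], skip_special_tokens=True))
```

In the experiments described above, "detection" means the model correctly reports noticing the injected concept, which the article states happened about 20% of the time.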