Anthropic Discovers Claude AI Models Can Detect Own Internal Thoughts 20% of Time
Summary
Anthropic researchers find that Claude AI models can detect concepts artificially injected into their own internal states about 20% of the time, with the most advanced versions showing an ability to monitor their intentions and modulate internal representations on command.
Key Points
- Anthropic researchers find evidence of introspective awareness in Claude models: the systems can detect and identify concepts artificially injected into their neural activity patterns about 20% of the time (a sketch of the injection technique follows this list)
- The models demonstrate an ability to check their own internal 'intentions' by comparing planned outputs with actual responses, and they can modulate their internal representations when instructed to think, or not to think, about specific concepts
- Claude Opus 4 and 4.1 show the strongest introspective capabilities among the models tested, suggesting the ability may improve with model sophistication, though current introspection remains highly unreliable and limited in scope
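The injection method referenced in the first bullet resembles published activation-steering techniques. Below is a minimal, hypothetical sketch using PyTorch and Hugging Face Transformers: it derives a crude "concept vector" from a pair of contrast prompts and adds it into a middle layer's residual stream during generation. The model (`gpt2`), layer index, scale, and prompts are illustrative stand-ins, not Anthropic's actual setup, which ran on Claude models with internal interpretability tooling.

```python
# Hypothetical sketch of "concept injection" (activation steering).
# All specifics here (gpt2, layer 6, scale 4.0, contrast prompts) are
# illustrative assumptions, not Anthropic's experimental configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in model; the research used Claude models
LAYER = 6        # hypothetical injection layer
SCALE = 4.0      # hypothetical injection strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_activation(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    acts = {}
    def grab(_mod, _inp, out):
        acts["h"] = out[0]  # block output: (batch, seq, hidden)
    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return acts["h"].mean(dim=1).squeeze(0)

# Build a crude concept vector as the difference between activations on
# concept-laden vs. neutral text (a common steering-vector recipe).
concept_vec = mean_activation("Shouting, yelling, ALL CAPS, LOUDNESS!") \
            - mean_activation("A quiet, calm, ordinary sentence.")

def inject(_mod, _inp, out):
    """Add the scaled concept vector at every sequence position."""
    hidden = out[0] + SCALE * concept_vec
    return (hidden,) + out[1:]

# Ask the model about its own state while the injection hook is active.
prompt = "Do you notice anything unusual about your current thoughts? "
handle = model.transformer.h[LAYER].register_forward_hook(inject)
with torch.no_grad():
    out_ids = model.generate(**tok(prompt, return_tensors="pt"),
                             max_new_tokens=40,
                             pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out_ids[0], skip_special_tokens=True))
```

In the experiments described above, "detection" means the model correctly reports noticing the injected concept, which the article states happened about 20% of the time.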