AI Models Switch Between Accurate and Hallucinated Responses Based on Simple Instruction Changes
Summary
Researchers find that small language models switch between accurate and fabricated answers based solely on how a question is phrased: 'Think step by step' prompts trigger correct factual recall, while 'Give answer in one word' instructions activate shallow processing circuits that frequently generate hallucinated information.
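The behavioral contrast is easy to reproduce informally by running the same factual question under both instruction styles. A minimal sketch, assuming a small instruction-tuned checkpoint (Qwen/Qwen2.5-0.5B-Instruct is a stand-in here, not one of the models from the study) and the Hugging Face transformers library:

```python
# Sketch: compare the two instruction styles on one factual question.
# Assumption: any small instruction-tuned checkpoint works as a stand-in;
# this is not the exact model or prompt set from the study.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # hypothetical stand-in model
)

question = "What is the capital of Australia?"
instructions = [
    "Think step by step.",       # reported to elicit correct recall
    "Give answer in one word.",  # reported to trigger hallucinations
]

for instruction in instructions:
    prompt = f"{question} {instruction}"
    out = generate(prompt, max_new_tokens=64, do_sample=False)
    print(f"--- {instruction} ---")
    print(out[0]["generated_text"])
```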
Key Points
- Small language models exhibit instruction-dependent hallucinations: they correctly answer factual questions under 'Think step by step' prompts but frequently hallucinate under 'Give answer in one word' instructions
- Research reveals a mechanistic switch in which different instructions activate distinct computational pathways: a shallow heuristic pathway that bypasses the robust factual-recall circuit versus a deeper algorithmic reasoning pathway (a head-ablation sketch follows this list)
- Causal interventions show that late-layer attention heads attend to different tokens depending on the instruction: correct-answer heads focus on factual keywords, while incorrect-answer heads attend directly to instruction tokens such as 'one' and 'word' (see the attention-pattern sketch after this list)
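One way to probe for distinct pathways is to zero-ablate individual late-layer attention heads and check which instruction style the prediction degrades under. A minimal sketch with TransformerLens, assuming GPT-2 small as a stand-in model; the layer and head indices are illustrative, not the heads identified in the study:

```python
# Sketch: zero-ablate one late-layer attention head and compare the model's
# next-token prediction under each instruction style.
# Assumptions: GPT-2 small as a stand-in; LAYER/HEAD are hypothetical
# indices, not the heads identified in the study.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 10, 7  # hypothetical late-layer head

def ablate_head(z, hook):
    # z: [batch, pos, head_index, d_head]; zero out one head's output
    z[:, :, HEAD, :] = 0.0
    return z

prompts = [
    "Q: What is the capital of Australia? Think step by step. A:",
    "Q: What is the capital of Australia? Give answer in one word. A:",
]

for prompt in prompts:
    tokens = model.to_tokens(prompt)
    clean_logits = model(tokens)
    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", ablate_head)],
    )
    # Compare the top next-token prediction with and without the head
    clean_top = model.to_string(clean_logits[0, -1].argmax().item())
    ablated_top = model.to_string(ablated_logits[0, -1].argmax().item())
    print(f"{prompt!r}: clean={clean_top!r}, ablated={ablated_top!r}")
```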
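The attention-pattern claim can be inspected the same way: cache attention and check where each late-layer head puts its weight from the final position, i.e. whether it attends to instruction tokens like 'one' and 'word' or to the factual keywords. Another hedged sketch, again with GPT-2 small standing in for the study's models:

```python
# Sketch: inspect where each late-layer head attends from the last position.
# Assumption: GPT-2 small as a stand-in; the study's specific
# correct-/incorrect-answer heads are not reproduced here.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 10  # hypothetical late layer

prompt = "Q: What is the capital of Australia? Give answer in one word. A:"
tokens = model.to_tokens(prompt)
str_tokens = model.to_str_tokens(prompt)

_, cache = model.run_with_cache(tokens)
# pattern: [batch, head_index, query_pos, key_pos]
pattern = cache["pattern", LAYER][0]

for head in range(model.cfg.n_heads):
    # Attention from the final (answer) position back over the prompt
    attn_from_last = pattern[head, -1]
    top_key = attn_from_last.argmax().item()
    print(f"layer {LAYER}, head {head}: attends most to "
          f"{str_tokens[top_key]!r} (weight {attn_from_last[top_key]:.2f})")
```

If the paper's finding holds, heads whose ablation flips the one-word answer should be the ones whose top attention target is an instruction token rather than a factual keyword.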