AI Models Switch Between Accurate and Hallucinated Responses Based on Simple Instruction Changes
Summary
Researchers find that small language models switch between accurate and fabricated answers based solely on how a question is phrased: 'Think step by step' prompts trigger correct factual recall, while 'Give answer in one word' instructions activate shallow processing circuits that frequently generate hallucinated information.
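The behavioral contrast is easy to reproduce informally by running the same factual question under both instruction styles. A minimal sketch, assuming a small instruction-tuned checkpoint (Qwen/Qwen2.5-0.5B-Instruct is a stand-in here, not one of the models from the study) and the Hugging Face transformers library:

```python
# Sketch: compare the two instruction styles on one factual question.
# Assumption: any small instruction-tuned checkpoint works as a stand-in;
# this is not the exact model or prompt set from the study.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # hypothetical stand-in model
)

question = "What is the capital of Australia?"
instructions = [
    "Think step by step.",       # reported to elicit correct recall
    "Give answer in one word.",  # reported to trigger hallucinations
]

for instruction in instructions:
    prompt = f"{question} {instruction}"
    out = generate(prompt, max_new_tokens=64, do_sample=False)
    print(f"--- {instruction} ---")
    print(out[0]["generated_text"])
```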
Key Points
- Small language models exhibit instruction-dependent hallucinations: they correctly answer factual questions under 'Think step by step' prompts but frequently hallucinate under 'Give answer in one word' instructions
- Research reveals a mechanistic switch in which different instructions activate distinct computational pathways: a shallow heuristic pathway that bypasses the robust factual-recall circuit versus a deeper algorithmic reasoning pathway (a head-ablation sketch follows this list)
- Causal interventions show that late-layer attention heads attend to different tokens depending on the instruction: correct-answer heads focus on factual keywords, while incorrect-answer heads attend directly to instruction tokens such as 'one' and 'word' (see the attention-pattern sketch after this list)
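One way to probe for distinct pathways is to zero-ablate individual late-layer attention heads and check which instruction style the prediction degrades under. A minimal sketch with TransformerLens, assuming GPT-2 small as a stand-in model; the layer and head indices are illustrative, not the heads identified in the study:

```python
# Sketch: zero-ablate one late-layer attention head and compare the model's
# next-token prediction under each instruction style.
# Assumptions: GPT-2 small as a stand-in; LAYER/HEAD are hypothetical
# indices, not the heads identified in the study.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 10, 7  # hypothetical late-layer head

def ablate_head(z, hook):
    # z: [batch, pos, head_index, d_head]; zero out one head's output
    z[:, :, HEAD, :] = 0.0
    return z

prompts = [
    "Q: What is the capital of Australia? Think step by step. A:",
    "Q: What is the capital of Australia? Give answer in one word. A:",
]

for prompt in prompts:
    tokens = model.to_tokens(prompt)
    clean_logits = model(tokens)
    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", ablate_head)],
    )
    # Compare the top next-token prediction with and without the head
    clean_top = model.to_string(clean_logits[0, -1].argmax().item())
    ablated_top = model.to_string(ablated_logits[0, -1].argmax().item())
    print(f"{prompt!r}: clean={clean_top!r}, ablated={ablated_top!r}")
```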
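The attention-pattern claim can be inspected the same way: cache attention and check where each late-layer head puts its weight from the final position, i.e. whether it attends to instruction tokens like 'one' and 'word' or to the factual keywords. Another hedged sketch, again with GPT-2 small standing in for the study's models:

```python
# Sketch: inspect where each late-layer head attends from the last position.
# Assumption: GPT-2 small as a stand-in; the study's specific
# correct-/incorrect-answer heads are not reproduced here.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER = 10  # hypothetical late layer

prompt = "Q: What is the capital of Australia? Give answer in one word. A:"
tokens = model.to_tokens(prompt)
str_tokens = model.to_str_tokens(prompt)

_, cache = model.run_with_cache(tokens)
# pattern: [batch, head_index, query_pos, key_pos]
pattern = cache["pattern", LAYER][0]

for head in range(model.cfg.n_heads):
    # Attention from the final (answer) position back over the prompt
    attn_from_last = pattern[head, -1]
    top_key = attn_from_last.argmax().item()
    print(f"layer {LAYER}, head {head}: attends most to "
          f"{str_tokens[top_key]!r} (weight {attn_from_last[top_key]:.2f})")
```

If the paper's finding holds, heads whose ablation flips the one-word answer should be the ones whose top attention target is an instruction token rather than a factual keyword.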