Base LLMs Show Strong Semantic Confidence Accuracy, But Fine-Tuning and Chain-of-Thought Reasoning Destroy It
Summary
New research finds that base large language models are well calibrated at the semantic level, so their stated confidence tracks whether the meaning of an answer is correct. However, popular post-training techniques such as fine-tuning and chain-of-thought reasoning degrade this calibration, raising urgent questions about the reliability of widely deployed AI systems.
Key Points
- Base LLMs demonstrate remarkable semantic calibration in open-domain question-answering tasks: their confidence reflects whether the meaning of a response is correct, not merely its token-level likelihood.
- Researchers establish a theoretical mechanism explaining how semantic calibration emerges naturally as a byproduct of next-token prediction training, introducing a generalized notion called 'B-calibration' defined over equivalence classes of outputs.
- Experiments reveal that while base LLMs are semantically well-calibrated, both reinforcement-learning-based instruction tuning and chain-of-thought reasoning systematically break this calibration.
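To make the idea of semantic (as opposed to token-level) calibration concrete, the sketch below shows one common way such calibration is measured in practice: sample several answers to a question, group them into equivalence classes of meaning, treat the empirical frequency of the top class as the model's semantic confidence, and then compare confidence against correctness across many questions via expected calibration error. This is a minimal illustration, not the paper's exact procedure; the normalization function standing in for true semantic equivalence, and all names here, are illustrative assumptions.

```python
from collections import Counter

def semantic_confidence(samples, canon=lambda s: s.strip().lower()):
    """Estimate semantic confidence from sampled answers.

    `canon` maps each answer string to a representative of its
    equivalence class (here a crude stand-in for semantic
    equivalence: whitespace/case normalization). Returns the
    most frequent class and its empirical probability.
    """
    classes = Counter(canon(s) for s in samples)
    top, count = classes.most_common(1)[0]
    return top, count / len(samples)

def expected_calibration_error(records, n_bins=5):
    """Compute ECE from (confidence, is_correct) pairs.

    Confidences are bucketed into `n_bins` equal-width bins;
    ECE is the bin-size-weighted average of |accuracy - mean
    confidence| within each bin. A well-calibrated model has
    ECE near zero.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    n = len(records)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Toy usage: four sampled answers to "What is the capital of France?"
answer, conf = semantic_confidence(["Paris", "paris", " Paris ", "Lyon"])
print(answer, conf)  # paris 0.75
```

A key design point this illustrates: confidence is defined over *meanings* (equivalence classes of answers), so two surface forms like "Paris" and " paris " pool their probability mass, which is exactly the distinction between semantic and token-level calibration.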