Spotify Engineers Reveal How to Combine LLM Evals and A/B Tests for Smarter AI Development
Summary
Spotify engineers describe a two-stage approach to AI development that arranges LLM evaluations and A/B tests as a funnel: evals filter out weak candidates early, experiments validate real-world impact, and feedback between the two stages improves both over time.
Key Points
- Spotify engineers propose treating LLM evals and A/B experiments as a funnel rather than alternatives, where evals verify output quality upstream before experiments validate real-world user impact downstream.
- LLM judges can assess dimensions like relevance, coherence, and tone at scale, which raises the hit rate of experiments by filtering out weak candidates early. They cannot replace experiments for detecting regressions in secondary metrics such as session length or retention.
- A continuous feedback loop between offline eval scores and online experiment outcomes is essential for calibration, because a judge that has not been checked against experiment results produces opinions rather than evidence. Closing this loop makes both evals and experiments smarter over time.
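
The funnel and the calibration loop described in the points above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Spotify's implementation: the `Candidate` type, the 0-to-1 judge score, the 0.7 threshold, and both function names are hypothetical, and the sample data is made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    judge_score: float                     # offline LLM-judge score in [0, 1] (assumed scale)
    experiment_won: Optional[bool] = None  # online A/B outcome, once it is known


def eval_gate(candidates: List[Candidate], threshold: float = 0.7) -> List[Candidate]:
    """Stage 1 of the funnel: keep only candidates whose judge score clears the threshold."""
    return [c for c in candidates if c.judge_score >= threshold]


def judge_agreement(history: List[Candidate], threshold: float = 0.7) -> Optional[float]:
    """Calibration check: fraction of past experiments where the judge's
    pass/fail verdict matched the A/B outcome. Returns None with no history."""
    scored = [c for c in history if c.experiment_won is not None]
    if not scored:
        return None
    matches = sum((c.judge_score >= threshold) == c.experiment_won for c in scored)
    return matches / len(scored)


# Stage 1: filter weak candidates before spending experiment traffic on them.
candidates = [
    Candidate("prompt_a", 0.82),
    Candidate("prompt_b", 0.41),
    Candidate("prompt_c", 0.75),
]
shortlisted = eval_gate(candidates)
print([c.name for c in shortlisted])

# Feedback loop: compare past judge verdicts with what experiments actually showed.
history = [
    Candidate("v1", 0.90, experiment_won=True),
    Candidate("v2", 0.80, experiment_won=False),
    Candidate("v3", 0.30, experiment_won=False),
    Candidate("v4", 0.60, experiment_won=True),
]
agreement = judge_agreement(history)
print(agreement)
```

A low agreement rate in a check like this would suggest the judge needs recalibration before its scores are trusted as a gate. Note that the history should include candidates that were tested despite low judge scores; otherwise the comparison only covers cases the judge already approved.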