GPT-5 Powered Agentic Search Smashes Baseline Scores But Hits Wall on Knowledge-Gap Tasks

Apr 29, 2026

Doug Turnbull's Blog


Summary

Agentic search driven by GPT-5 reaches an NDCG of 0.453 on Amazon ESCI, against 0.289 for a BM25 baseline, but it stalls on knowledge-gap tasks, where the LLM cannot judge information it does not already know.
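For readers unfamiliar with the metric, here is a minimal sketch of how an NDCG score can be computed for one ranked result list. The gain values, the cutoff `k`, and the choice to build the ideal ordering from the retrieved list alone are assumptions made for illustration; the summary does not say which settings produced the 0.453 and 0.289 figures.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    # enumerate() is 0-based, so position = rank + 1 and the discount is log2(rank + 2).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains, k=10):
    """NDCG@k for one query, given relevance gains in ranked order.

    Assumption: the ideal ordering is built from the same list of gains,
    i.e. only the retrieved items are considered judged.
    """
    ideal_dcg = dcg(sorted(gains, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg(gains[:k]) / ideal_dcg

# Hypothetical graded labels for a four-item result list (higher is better).
perfect_order = [3, 2, 1, 0]
worst_order = [0, 1, 2, 3]
print(ndcg(perfect_order))  # a perfectly ordered list scores 1.0
print(ndcg(worst_order))    # the reversed list scores lower
```

A benchmark score such as the ones quoted above is typically the mean of this per-query value over all test queries.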

Key Points

Agentic search systems built on basic retrieval tools such as BM25 and embeddings deliver large gains with minimal custom tuning: with GPT-5 driving both tools, NDCG reaches 0.453 on Amazon ESCI, compared with 0.289 for the BM25 baseline.
Agents improve search quality by interpreting the query, issuing follow-up searches when the first results disappoint, and exploring different retrieval strategies. Performance improves further when the agent is required to make multiple tool calls and vary its queries.
A critical limitation appears when the task is information retrieval rather than item finding: an LLM cannot evaluate information it does not already know, so traditional search stacks and 'Deep Research' approaches remain essential for knowledge-gap scenarios.
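The loop described in the key points (search, judge the results, reformulate, search again) can be sketched roughly as follows. This is not the blog's implementation: `keyword_search` is a toy term-overlap ranker standing in for BM25 or embeddings, `stub_rewrite` and `stub_judge` are hypothetical placeholders for LLM calls, and the catalog is invented.

```python
from typing import Callable

# Invented toy catalog for illustration only.
CATALOG = {
    "p1": "red running shoes men",
    "p2": "blue trail running shoes",
    "p3": "leather office shoes",
    "p4": "running socks three pack",
}

def keyword_search(query: str, k: int = 3) -> list[str]:
    """Stand-in retrieval tool: ranks by number of shared terms (not real BM25)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.split())), doc_id)
              for doc_id, text in CATALOG.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]

def agentic_search(
    user_query: str,
    rewrite: Callable[[str, list[str]], str],
    judge: Callable[[str, str], bool],
    min_calls: int = 2,
    max_calls: int = 4,
) -> tuple[list[str], list[str]]:
    """Search, keep what the judge accepts, and reformulate until satisfied."""
    tried: list[str] = []
    kept: list[str] = []
    query = user_query
    while len(tried) < max_calls:
        tried.append(query)
        for doc_id in keyword_search(query):
            if doc_id not in kept and judge(user_query, CATALOG[doc_id]):
                kept.append(doc_id)
        # Require a minimum number of tool calls before stopping.
        if len(tried) >= min_calls and kept:
            break
        query = rewrite(user_query, tried)
        if query in tried:  # the rewriter has no new query to offer
            break
    return kept, tried

def stub_rewrite(user_query: str, tried: list[str]) -> str:
    """Hypothetical stand-in for an LLM proposing a new, different query."""
    for variant in ["running shoes", "trail shoes", "jogging sneakers"]:
        if variant not in tried:
            return variant
    return tried[-1]

def stub_judge(user_query: str, text: str) -> bool:
    """Hypothetical stand-in for an LLM relevance judgment."""
    return "running" in text and "shoes" in text

kept, tried = agentic_search("sneakers for jogging", stub_rewrite, stub_judge)
print(kept)   # items the judge accepted
print(tried)  # the distinct queries the agent issued
```

In this toy run the first query matches nothing, so the agent reformulates and the second query surfaces the relevant items. The sketch also hints at the limitation in the last key point: the judge can only accept items it recognizes as relevant, so a judge that lacks the needed knowledge cannot steer the loop.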
