Beyond LLM-as-judge towards LLM relevance engineering

Doug Turnbull • Location: Theater 7 • Haystack 2025

Most now agree that LLMs can provide useful search judgments. In this talk, we’ll discuss the next level: an ensemble of judges that deepens our understanding of the search relevance problem, evolving toward LLM-directed search relevance engineering. Imagine many judges, each focused on a different strategy and each compared to our human ground truth. If an LLM judge that only sees a product name agrees with human raters 80% of the time, then shouldn’t your search team focus there? If an LLM that expands the query before judging brings agreement with human labels up to 90%, does this call for a query expansion feature? And how do these judges relate to each other to make an even better prediction? Query expansion may only matter when there’s not a strong product name match, perhaps indicating that searching only the description would matter more in those cases. I’ll share an LLM-in-the-loop search relevance process. Using LLM judges, each focused on its own ranking feature, as Lego pieces and traditional ML as the connective tissue, you’ll see how an ensemble of judges guides the relevance engineering effort to find promising features and learn their interrelationships.
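As a rough illustration of the idea in the abstract (not the speaker's actual code or data), the sketch below measures how often each judge agrees with human labels, then checks whether one judge matters more when another finds no match. The judge names, the toy rows, and the lookup-table "ensemble" are all hypothetical stand-ins; the talk only says that traditional ML connects the judges, without naming a model here.

```python
from collections import defaultdict

# Hypothetical toy data: one row per (query, product) pair.
# "human" is the human label (1 = relevant); the other keys are
# made-up verdicts from two imagined LLM judges.
rows = [
    {"human": 1, "name_judge": 1, "expansion_judge": 1},
    {"human": 0, "name_judge": 0, "expansion_judge": 0},
    {"human": 1, "name_judge": 0, "expansion_judge": 1},
    {"human": 0, "name_judge": 0, "expansion_judge": 1},
    {"human": 1, "name_judge": 1, "expansion_judge": 0},
    {"human": 0, "name_judge": 1, "expansion_judge": 0},
    {"human": 1, "name_judge": 0, "expansion_judge": 1},
    {"human": 0, "name_judge": 0, "expansion_judge": 0},
]


def agreement(rows, judge):
    """Fraction of rows where the judge's verdict equals the human label."""
    return sum(r[judge] == r["human"] for r in rows) / len(rows)


def fit_lookup_ensemble(rows, judges):
    """Map each combination of judge verdicts to the majority human label.

    A deliberately tiny stand-in for a real model such as a decision tree.
    Ties are resolved toward 'relevant'.
    """
    votes = defaultdict(list)
    for r in rows:
        votes[tuple(r[j] for j in judges)].append(r["human"])
    return {combo: int(2 * sum(v) >= len(v)) for combo, v in votes.items()}


judges = ["name_judge", "expansion_judge"]

# Per-judge agreement with human labels.
per_judge = {j: agreement(rows, j) for j in judges}

# Does query expansion help more when the name judge sees no match?
no_name_match = [r for r in rows if r["name_judge"] == 0]
expansion_when_no_name = agreement(no_name_match, "expansion_judge")

# Combined prediction from both judges (evaluated in-sample, so optimistic).
table = fit_lookup_ensemble(rows, judges)
ensemble_hits = sum(
    table[tuple(r[j] for j in judges)] == r["human"] for r in rows
)

print(per_judge, expansion_when_no_name, ensemble_hits / len(rows))
```

On real data, the same comparison would be run on held-out labels, and the lookup table would likely give way to a model that can generalize across many judges.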


Doug Turnbull

Daydream

Doug Turnbull is a consultant, coach, and trainer for search teams. He co-authored Relevant Search and AI-Powered Search. Doug loves learning from other search practitioners, and hopes you'll bring your curiosity and experiences to this talk. At fashion startup Daydream, Doug's hybrid search solutions combining traditional search and CLIP embeddings led to a doubling of conversions. Doug's recent work on ML ranking at Reddit created a 2% increase in DAU, the largest seen in Reddit search history. Recently, Doug worked at Shopify to help improve merchant search-attributed revenue by 19% year over year. Doug blogs about search and other topics at http://softwaredoug.com