Experiments with Synthetic Queries for Relevance Tuning

Tim Allison • Location: Theater 7 • Haystack 2024

“Learning to Rank and other machine learning techniques for offline relevance tuning require expensive ground truth judgments. In this presentation, I offer an overview of using LLMs to generate synthetic query/document pairs to be used as proxy relevance judgments for ground truth. We show the results of this technique on several BEIR datasets loaded into OpenSearch, comparing the final outcomes of training on human judgments vs. training on the synthetic query/document pairs.

This method can be used for optimizing analyzer chains, field weights, general query parameters, and many of the parameters available with hybrid search – specifically model selection and adjudicating between traditional token-based search results and dense vector results.

While we’re all familiar with the risks of LLMs ‘hallucinating’ results, this use case offers a way to leverage the astounding fluency of LLMs and yet not suffer when LLMs may be wrong – this is just training data after all; no human is making a decision based solely on the output of the LLM. In a production system, the results from training LTR or a similar method on a large volume of synthetic ground truth would be verified with a much smaller set of human-tagged data.

This method could solve the constant challenge faced by those in enterprise or site search, where relevance signals are often lacking – without significant investment in human annotation of ground truth.

In this talk, we cover our findings with select BEIR datasets and offer insight into the limitations and the promise of this method for relevance tuning more generally.”
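
As a rough illustration of the idea in the abstract, the sketch below treats each synthetic query as having one known relevant document (the one it was generated from) and uses those pairs to compare candidate field weights by mean reciprocal rank. Everything here is an assumption made for illustration: the toy corpus, the hard-coded queries standing in for LLM output, the token-overlap scorer, and the function names. It is not the speaker's implementation and does not call OpenSearch or any LLM.

```python
from collections import Counter

# Toy corpus (hypothetical data): doc_id -> searchable fields.
CORPUS = {
    "d1": {"title": "tika file parsing", "body": "extract text from pdf and office files"},
    "d2": {"title": "search relevance", "body": "tuning field weights for better ranking"},
    "d3": {"title": "pdf rendering", "body": "tika can parse many file formats"},
}

# Synthetic judgments: (query, id of the document the query was written for).
# In practice an LLM would generate each query; here they are hard-coded.
SYNTHETIC_PAIRS = [
    ("parse files with tika", "d1"),
    ("tuning ranking weights", "d2"),
    ("rendering pdf", "d3"),
]


def score(query, doc, weights):
    """Weighted count of query tokens that occur in each field."""
    q_tokens = query.lower().split()
    total = 0.0
    for field, weight in weights.items():
        counts = Counter(doc[field].lower().split())
        total += weight * sum(counts[t] for t in q_tokens)
    return total


def rank(query, weights):
    """Doc ids sorted by descending score; ties broken by doc id."""
    return sorted(CORPUS, key=lambda d: (-score(query, CORPUS[d], weights), d))


def mean_reciprocal_rank(pairs, weights):
    """Average of 1/rank of the source document across all synthetic queries."""
    reciprocal_ranks = []
    for query, source_doc in pairs:
        position = rank(query, weights).index(source_doc) + 1
        reciprocal_ranks.append(1.0 / position)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


def best_weights(pairs, candidates):
    """Pick the candidate field-weight setting with the highest MRR."""
    return max(candidates, key=lambda w: mean_reciprocal_rank(pairs, w))


if __name__ == "__main__":
    candidates = [
        {"title": 1.0, "body": 3.0},
        {"title": 3.0, "body": 1.0},
    ]
    for w in candidates:
        print(w, round(mean_reciprocal_rank(SYNTHETIC_PAIRS, w), 3))
    print("selected:", best_weights(SYNTHETIC_PAIRS, candidates))
```

The same loop could be rerun against a small human-tagged set to check that the setting selected on synthetic pairs still holds up, which is the verification step the abstract describes.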

Tim Allison

Rhapsode Consulting LLC

Tim is passionate about helping clients measure, tune and improve search relevancy across the enterprise and about helping design and deploy advanced search techniques. He is an internationally recognized expert in file processing with more than 10 years of experience as a committer and 6 years as chair/VP on the Apache Tika project. Tim founded Rhapsode Consulting LLC to serve a range of clients from boutique software shops to major cloud providers and government agencies. Rhapsode Consulting works with a broad range of industries for whom file processing and search are mission critical, including enterprise knowledge management, digital preservation, e-discovery, and enterprise/site search. Tim is a member of the Apache Software Foundation (ASF), the chair/VP of Apache Tika, and a committer on Apache Nutch (2023), Apache OpenNLP (2020), Apache Lucene/Solr (2018), Apache PDFBox (2016) and Apache POI (2013). Tim holds a Ph.D. in Classical Studies and in a former life was a professor of Latin and Greek.