Addressing variance in AB tests: Interleaved evaluation of rankers

Erik Bernhardson • Location: Theater 4


Evaluation of search quality is essential for developing effective rankers. Interleaved comparison methods reach statistical significance with less data than traditional A/B testing, meaning tests can be run over shorter timeframes and subtler changes to the ranker can be evaluated. In interleaved ranking, two result lists are combined in a "fair" manner, such that clicks can be interpreted as unbiased judgments about the relative quality of the two rankers. In this talk we will dive into why interleaving can be a superior online evaluation method, along with how it could be added to your own evaluation toolset.
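To make the idea of a "fair" combination concrete, here is a minimal sketch of team-draft interleaving, one common interleaving scheme (the talk may cover a different variant). The function names and the per-impression click-credit rule are illustrative assumptions, not code from the talk.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Team-draft interleaving sketch: rankers A and B take turns 'drafting'
    their highest remaining result into a shared list, with ties in turn
    order broken by a coin flip. Each shown document remembers which team
    drafted it, so a click on it counts as a vote for that ranker.
    """
    interleaved = []   # documents shown to the user, in order
    team_of = {}       # doc -> 'A' or 'B' (which ranker gets credit)
    picks_a = picks_b = 0

    while len(interleaved) < k:
        a_next = next((d for d in ranking_a if d not in team_of), None)
        b_next = next((d for d in ranking_b if d not in team_of), None)
        if a_next is None and b_next is None:
            break
        # The team with fewer picks drafts next; ties are broken randomly for fairness.
        a_turn = picks_a < picks_b or (picks_a == picks_b and random.random() < 0.5)
        if a_turn and a_next is not None:
            interleaved.append(a_next); team_of[a_next] = 'A'; picks_a += 1
        elif b_next is not None:
            interleaved.append(b_next); team_of[b_next] = 'B'; picks_b += 1
        else:
            interleaved.append(a_next); team_of[a_next] = 'A'; picks_a += 1
    return interleaved, team_of

def credit_clicks(team_of, clicked_docs):
    """Turn the clicks on one interleaved impression into a preference."""
    a_clicks = sum(1 for d in clicked_docs if team_of.get(d) == 'A')
    b_clicks = sum(1 for d in clicked_docs if team_of.get(d) == 'B')
    if a_clicks > b_clicks:
        return 'A'
    if b_clicks > a_clicks:
        return 'B'
    return 'tie'
```

Aggregating these per-impression preferences across many queries yields the relative comparison of the two rankers; because every impression carries a vote, significance is typically reached with less traffic than a between-subjects A/B test.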

Erik Bernhardson

Wikimedia