Addressing variance in A/B tests: Interleaved evaluation of rankers

Erik Bernhardson



Evaluating search quality is essential for developing effective rankers. Interleaved comparison methods reach statistical significance with less data than traditional A/B testing, so tests can run in shorter timeframes and subtler changes to the ranker can be evaluated. In interleaved ranking, two result lists are combined in a "fair" manner, such that clicks can be interpreted as unbiased judgments about the relative quality of the two rankers. In this talk we will dive into why interleaving can be a superior online evaluation method and how it could be added to your own evaluation toolset.
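
To make the idea of combining two result lists "fairly" concrete, here is a minimal sketch of team-draft interleaving, one well-known interleaving scheme. The abstract does not say which scheme the talk covers, so the choice of team-draft is an assumption, and the names `team_draft_interleave` and `credit_clicks` are hypothetical, chosen for illustration only. The two rankers take turns picking their best not-yet-shown result, with a coin flip deciding who picks first whenever they are tied, and each click is credited to the ranker that contributed the clicked result.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Interleave two rankings; return (results, teams).

    teams[i] is "A" or "B": the ranker that contributed results[i].
    """
    rng = rng or random.Random()
    results, teams, seen = [], [], set()
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}

    while len(results) < length:
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = "A" if rng.random() < 0.5 else "B"

        # Take that ranker's highest-ranked result not already shown.
        doc = next((d for d in rankings[team] if d not in seen), None)
        if doc is None:
            # This ranker has nothing left; fall back to the other one.
            team = "B" if team == "A" else "A"
            doc = next((d for d in rankings[team] if d not in seen), None)
            if doc is None:
                break  # both rankings are exhausted

        results.append(doc)
        teams.append(team)
        seen.add(doc)
        picks[team] += 1

    return results, teams


def credit_clicks(teams, clicked_positions):
    """Count clicks per ranker, given 0-based clicked positions."""
    clicks_a = sum(1 for p in clicked_positions if teams[p] == "A")
    clicks_b = sum(1 for p in clicked_positions if teams[p] == "B")
    return clicks_a, clicks_b
```

In this sketch, a single search session counts as a win for whichever ranker collects more clicks, and the fraction of sessions won by each ranker, aggregated over many sessions, is the signal compared between the two. Because both rankers are shown to the same user for the same query, differences between users and queries affect both sides equally, which is one intuition for why less data is needed than in a traditional A/B test.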

