Web-scale considerations for using machine learned models in search
Lester Solbakken • Location: Theater 5 • Haystack 2020
With recent advances in neural information retrieval, such as transformer models, it is tempting to use these models as signals in your relevance computation. However, they are costly to evaluate, particularly over an entire corpus, so reaching web-scale performance usually requires introducing some form of approximation. In this talk we will look at how to build a search engine that combines traditional text-search features such as BM25 with more recent features such as semantic representations and transformer models like BERT, and how to combine these features into a system that can run successfully in production.

We will introduce Vespa, an engine for low-latency computation over large sets of constantly changing data items under high load. One of its features relevant to this topic is its flexibility in combining inferences from various machine-learned models, such as TensorFlow, ONNX, XGBoost, and LightGBM models, into a single ranking expression that is evaluated at run time. Vespa distributes all data and computation automatically to achieve the required performance. We show that even though Vespa is highly performant, approximations such as WAND, approximate nearest neighbors, and distilled models are still required to reach web-scale performance.
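To give a flavor of the idea, here is a minimal Python sketch of a ranking expression that linearly combines a lexical signal (BM25), a semantic signal (embedding cosine similarity), and a machine-learned model's output into one score. The function names, weights, and toy scores are hypothetical illustrations, not Vespa's actual syntax; in Vespa these signals would come from built-in rank features and model inferences.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors (semantic signal).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def combined_score(bm25_score, query_emb, doc_emb, model_score,
                   weights=(0.4, 0.4, 0.2)):
    # Linear combination of three signals into a single ranking score,
    # analogous to evaluating one ranking expression per document.
    # The weights here are arbitrary placeholders.
    semantic = cosine(query_emb, doc_emb)
    return (weights[0] * bm25_score
            + weights[1] * semantic
            + weights[2] * model_score)

# Toy usage: rank two candidate documents for one query.
query_emb = [1.0, 0.0]
docs = [
    {"id": "a", "bm25": 10.0, "emb": [1.0, 0.0], "model": 0.5},
    {"id": "b", "bm25": 2.0,  "emb": [0.0, 1.0], "model": 0.5},
]
ranked = sorted(
    docs,
    key=lambda d: combined_score(d["bm25"], query_emb, d["emb"], d["model"]),
    reverse=True,
)
```

In a production system, a cheap first-phase score (e.g., BM25 alone) would typically select a small candidate set, and the more expensive model signals would only be computed for those top candidates.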
Lester Solbakken is a Principal Software Engineer at Verizon Media (previously Yahoo), working on the vespa.ai platform, the open big data serving engine. His primary focus areas are machine-learning engineering, with emphasis on serving, and search-system ranking. Lester previously pursued a PhD in Artificial Intelligence and Machine Learning, with neural networks, exploratory data analysis, and self-organizing systems as his main research topics.