Search with Vectors

Simon Hughes • Haystack 2019

With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation that can be used to implement semantic or conceptual search without hurting performance. I will then describe a few techniques for efficiently searching vector-based representations in an inverted index, including locality-sensitive hashing (LSH), vector quantization, and k-means trees, and compare their performance in terms of speed and relevancy. Finally, I will describe how each technique can be implemented efficiently in a Lucene-based search engine such as Solr or Elasticsearch.
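As a rough illustration of how a vector can be turned into tokens that an inverted index can match, here is a minimal sketch of random-hyperplane LSH. It is not taken from the talk: the function names, the `band_size` parameter, and the `b<band>_<bits>` token format are all assumptions made for this example. Each hyperplane contributes one bit (the sign of a dot product), and the bits are grouped into bands so that each band becomes one indexable token.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes (hypothetical helper, Gaussian normals)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_tokens(vector, hyperplanes, band_size=4):
    """Hash a vector to string tokens, one token per band of sign bits."""
    bits = [
        "1" if sum(h_i * v_i for h_i, v_i in zip(plane, vector)) >= 0 else "0"
        for plane in hyperplanes
    ]
    tokens = []
    for band, start in enumerate(range(0, len(bits), band_size)):
        # Prefix with the band number so bits from different bands never collide.
        tokens.append(f"b{band}_{''.join(bits[start:start + band_size])}")
    return tokens


planes = make_hyperplanes(dim=3, num_bits=8)
doc_vector = [0.9, 0.1, 0.3]
query_vector = [2 * x for x in doc_vector]  # same direction, different length

doc_tokens = lsh_tokens(doc_vector, planes)
query_tokens = lsh_tokens(query_vector, planes)
```

Because only the sign of each dot product is kept, vectors pointing in the same direction produce identical tokens, and vectors separated by a small angle tend to share most of them. In a setup like this, the tokens would be indexed in an ordinary text field, and a query would match documents that share one or more band tokens.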

Simon Hughes

DHI

Simon is currently the Chief Data Scientist at Dice.com, the technology professional recruiting site. He is also a PhD candidate at DePaul University, studying machine learning and natural language processing (NLP), and hopes to complete his degree this year. At Dice, he has developed multiple recommender engines for matching job seekers with jobs, optimized the relevancy of both Dice.com's job and talent search, and more recently worked on Dice's Career Path Explorer and Salary Prediction tools (under "career explorer" on www.dice.com).

In his academic studies, Simon is researching the application of AI and NLP to the education sector. His thesis is on extracting causal reasoning from scientific explanatory essays. He is also a proud father of two children, an avid video gamer, and a British expatriate, having been born in Cheshire, England, before moving to Chicago in 2005.