Scaling vector search into production without breaking the bank: Vector Quantization and Adaptive Retrieval

Zain Hasan • Location: TUECHTIG • Haystack EU 2024

Everybody loves vector search, but production-level deployment requires boatloads of CPU and GPU compute. The bottom line: deployed incorrectly, vector search can be prohibitively expensive compared to classical alternatives. I’ll talk about optimizations that let you perform real-time, billion-scale vector search on your laptop!

The solution: quantize vectors and perform adaptive retrieval. Together, these techniques let you reliably tune the trade-off between memory cost, latency, and retrieval recall.

I’ll cover quantization techniques, including product, binary, scalar, and Matryoshka quantization, which compress vectors by trading memory for recall. I’ll also introduce adaptive retrieval, where every user query goes through iterative rounds of retrieval, from cheap to expensive, as sketched below.
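To make these two ideas concrete, here is a minimal, self-contained NumPy sketch of binary quantization plus two-pass adaptive retrieval. It is illustrative only, not Weaviate’s implementation; the random corpus, the `oversample` parameter, and the `adaptive_search` function are assumptions made for the example.

```python
# Illustrative sketch (assumed, not Weaviate's implementation): binary
# quantization plus adaptive retrieval, i.e. a cheap first pass over 1-bit
# codes followed by exact rescoring of a small oversampled candidate set.
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors = 768, 100_000

# Full-precision corpus: float32 uses 32 bits per dimension.
corpus = rng.standard_normal((n_vectors, dim)).astype(np.float32)

# Binary quantization: keep only the sign of each dimension (1 bit per dim),
# packed 8 dims per byte -> a 32x smaller index than float32.
codes = np.packbits(corpus > 0, axis=1)

# 8-bit popcount lookup table for fast Hamming distances.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distances(query_code, codes):
    """Cheap pass: Hamming distance between packed 1-bit codes."""
    return POPCOUNT[np.bitwise_xor(codes, query_code)].sum(axis=1)

def adaptive_search(query, k=10, oversample=4):
    """Round 1 (cheap): rank all vectors by Hamming distance on binary codes.
    Round 2 (expensive): rescore only k * oversample candidates exactly."""
    coarse = hamming_distances(np.packbits(query > 0), codes)
    candidates = np.argpartition(coarse, k * oversample)[: k * oversample]
    exact = corpus[candidates] @ query  # full-precision dot products, tiny set
    return candidates[np.argsort(-exact)][:k]

query = rng.standard_normal(dim).astype(np.float32)
print(adaptive_search(query, k=10))
```

The key design choice is that the expensive full-precision scoring only ever touches a few dozen candidates per query, so the float32 vectors can live on disk while the 1-bit codes stay in memory.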

Used together, these techniques can reduce memory requirements by 32x at the cost of only ~5% retrieval recall!
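The 32x figure falls straight out of the representation size: a float32 dimension takes 32 bits, a binary-quantized dimension takes 1 bit. A quick back-of-the-envelope check (the 768-dimension, billion-vector index here is an assumed example):

```python
# Memory math behind the 32x claim (illustrative numbers).
dim, n_vectors = 768, 1_000_000_000     # a billion-scale index
float32_gb = n_vectors * dim * 4 / 1e9  # 4 bytes (32 bits) per dimension
binary_gb = n_vectors * dim / 8 / 1e9   # 1 bit per dimension
print(f"float32: {float32_gb:.0f} GB, binary: {binary_gb:.0f} GB, "
      f"{float32_gb / binary_gb:.0f}x smaller")  # 3072 GB vs 96 GB, 32x
```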

Zain Hasan

Weaviate

Zain Hasan is a senior ML developer relations engineer at Weaviate. An engineer and data scientist by training, he completed his undergraduate and graduate work at the University of Toronto, building artificially intelligent assistive technologies, and then founded VinciLabs, a company in the digital health-tech space. More recently, he worked as a senior data science consultant in Toronto. Zain is passionate about machine learning, education, and public speaking.