Fine-tuning for Vector Search
James Briggs • Location: TUECHTIG • Haystack EU 2022
The “hard part” of applied vector search with dense retrieval is often building an embedding model that works for the user’s target domain. Unfortunately, it’s not always clear where to start. This talk summarizes the most popular fine-tuning methods for semantic search and QA applications: when and where each should be used, its intended purpose, and its data requirements. It covers methods like:
- MSE loss and MNR (multiple negatives ranking) loss, where labeled data is available.
- Multilingual knowledge distillation for transferring semantic knowledge into new languages using translation pairs data.
- Unsupervised semantic-similarity methods like TSDAE.
- Augmentation of small datasets with AugSBERT.
- Unsupervised QA methods like GenQ and GPL.
At the end of the talk, the audience should have a good grasp of when and where to use the different training methods for their embedding models, based on their use case and training data.
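As a rough illustration of one method from the list, the sketch below computes an MNR-style loss for a batch of (anchor, positive) embedding pairs in plain Python. It is not taken from the talk: the function names, the toy two-dimensional vectors, and the similarity scale of 20.0 are all assumptions chosen for illustration. Each anchor treats its own positive as the correct "class" and every other positive in the batch as a negative, and the loss is the average softmax cross-entropy over scaled cosine similarities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """MNR-style loss: cross-entropy where positives[i] is the target for anchors[i].

    `scale` multiplies the cosine similarities before the softmax;
    20.0 is an assumed value for this sketch.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every positive in the batch;
        # all positives other than positives[i] act as in-batch negatives.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
        # Negative log-probability of the matching positive.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two anchors and their candidate positives.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

print("aligned loss:", mnr_loss(anchors, aligned))
print("swapped loss:", mnr_loss(anchors, swapped))
```

The aligned batch should yield a much smaller loss than the swapped one, which is the signal a fine-tuning run would use to pull matching pairs together and push in-batch negatives apart.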
James Briggs is a Staff Developer Advocate at Pinecone and a freelance ML Engineer. In both roles, James focuses on NLP and vector search from the perspective of education and real-world implementation. In the past, he has worked as a Data Scientist in the finance industry with Deloitte and UBS. James has produced many educational materials on NLP and vector search online, including articles, videos, and courses that have gained millions of readers and viewers.