Learning a Joint Embedding Representation for Image Search using Self Supervised Means

Sujit Pal • Location: Theater 5

Image search interfaces either prompt the searcher to provide a search image (image-to-image search) or a text description of the desired image (text-to-image search). Image-to-image search is generally implemented as a nearest-neighbor search in a dense image embedding space, where the embeddings are produced by neural networks pre-trained on a large image corpus such as ImageNet. Text-to-image search can be implemented via traditional (TF-IDF or BM25 based) text search against image captions or image tags.
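As background for the image-to-image case, the sketch below shows nearest-neighbor search over embeddings from an ImageNet-pretrained backbone. The ResNet-50 encoder, the scikit-learn index, and the file names are illustrative assumptions, not details from the talk.

```python
# Sketch: image-to-image search as nearest-neighbor lookup in a dense embedding space.
# The ImageNet-pretrained ResNet-50 encoder and the scikit-learn index are stand-ins.
import torch
from PIL import Image
from sklearn.neighbors import NearestNeighbors
from torchvision import models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
encoder = models.resnet50(weights=weights)
encoder.fc = torch.nn.Identity()          # drop the classifier head, keep 2048-d features
encoder.eval()
preprocess = weights.transforms()

def embed(paths):
    """Encode a list of image file paths into an (N, 2048) feature matrix."""
    with torch.no_grad():
        batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
        return encoder(batch).numpy()

# Index the corpus images, then query with a search image (hypothetical file names).
corpus_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embed(corpus_paths))
_, ids = index.kneighbors(embed(["query.jpg"]))
print([corpus_paths[i] for i in ids[0]])
```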

In this presentation, we describe how we fine-tuned the OpenAI CLIP model (available from Hugging Face) to learn a joint image/text embedding representation from naturally occurring image-caption pairs in literature, using contrastive learning. We then show this model in action against a dataset of medical image-caption pairs, using the Vespa search engine to support text-based (BM25), vector-based (ANN), and hybrid text-to-image and image-to-image search.
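A minimal sketch of the contrastive fine-tuning step, assuming the Hugging Face CLIPModel/CLIPProcessor with the openai/clip-vit-base-patch32 checkpoint and the in-batch contrastive loss that return_loss=True computes; the batch format and learning rate are illustrative, not the talk's actual training setup.

```python
# Sketch: contrastive fine-tuning of CLIP on image-caption pairs with Hugging Face transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(batch):
    """One update on a batch of (image_path, caption) pairs drawn from the corpus."""
    images = [Image.open(p).convert("RGB") for p, _ in batch]
    captions = [c for _, c in batch]
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    # return_loss=True makes CLIPModel compute the symmetric contrastive loss
    # over the in-batch image-text similarity matrix.
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```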
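And a sketch of what a hybrid text-to-image query against Vespa might look like at serving time, blending BM25 over captions with an approximate nearest-neighbor match over the joint embedding. The schema, field names, and "hybrid" rank profile are hypothetical; an image-to-image query would follow the same pattern with get_image_features in place of the text encoder.

```python
# Sketch: hybrid (BM25 + ANN) text-to-image query against Vespa using pyvespa.
import torch
from transformers import CLIPModel, CLIPProcessor
from vespa.application import Vespa

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
app = Vespa(url="http://localhost", port=8080)   # hypothetical endpoint

def search(query_text, hits=10):
    # Encode the query text into the joint embedding space.
    inputs = processor(text=[query_text], return_tensors="pt", padding=True)
    with torch.no_grad():
        q_vec = model.get_text_features(**inputs)[0].tolist()
    # BM25 over the caption field OR nearest-neighbor over the image embedding;
    # a "hybrid" rank profile (assumed to exist in the schema) blends the two scores.
    yql = ("select * from sources * where "
           "userQuery() or ({targetHits:100}nearestNeighbor(embedding, q_vec))")
    return app.query(body={
        "yql": yql,
        "query": query_text,
        "ranking.profile": "hybrid",
        "input.query(q_vec)": q_vec,
        "hits": hits,
    })
```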


Sujit Pal

RELX Group

Sujit Pal is a Technology Research Director at Elsevier Labs, helping to build technology that helps scientists make breakthroughs and clinicians save lives. His primary interests are information retrieval, ontologies, natural language processing, machine learning, deep learning, and distributed processing.