Expanding RAG to incorporate multimodal capabilities.

Praveen Mohan Prasad and Hajer Bouafif • Location: Theater 5 • Haystack 2024

RAG workflows predominantly rely on retrieved text sources, as most Large Language Models (LLMs) are proficient only at understanding text. However, a substantial portion of unstructured data contains multimodal elements such as text, tables, and images. Focusing solely on text compromises the retrieval process in RAG. In this session, we’ll explore how to use LLMs and multimodal embeddings to extend RAG to multimodal retrieval and generation. A live demonstration will illustrate how a PDF document is ingested into a vector database, with content extracted from its images, tables, and text; the retriever then employs multimodal search, and the response is enriched using an LLM. This approach ensures that multimodal content is included in both the ingestion and final response generation phases.
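To make the flow described above concrete, here is a minimal, hedged sketch in Python; it is not the speakers' implementation. Everything in it is an assumption for illustration: the `Chunk` record, the `embed`, `retrieve`, and `build_prompt` helpers, and the sample data are all hypothetical. In particular, `embed` is a toy bag-of-words stand-in for a real multimodal embedding model, an in-memory list stands in for the vector database, and the image is represented by a text description.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """One unit extracted from a document (hypothetical schema)."""
    modality: str  # "text", "table", or "image"
    content: str   # raw text, a serialized table, or an image description
    source: str    # illustrative locator, e.g. a page reference


def embed(text: str) -> Counter:
    # Toy stand-in for a multimodal embedding model: a bag-of-words vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list, k: int = 2) -> list:
    # Rank chunks of every modality in one shared similarity space.
    q = embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, embed(c.content)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list) -> str:
    # Assemble the retrieved multimodal context for the generation step.
    context = "\n".join(f"[{c.modality} | {c.source}] {c.content}" for c in hits)
    return f"Answer using the context below.\n{context}\nQuestion: {query}"


# Ingestion: chunks that a PDF parser might have produced (made-up sample data).
index = [
    Chunk("text", "Revenue grew steadily across all regions last year.", "page 1"),
    Chunk("table", "region | revenue 2023 | EMEA | 40 | APAC | 25", "page 2"),
    Chunk("image", "bar chart showing revenue by region for 2023", "page 3"),
]

query = "revenue by region 2023"
hits = retrieve(query, index)
prompt = build_prompt(query, hits)
print(prompt)
```

In this toy run the image and table chunks outrank the plain-text chunk, which is the point of indexing all modalities together: a text-only index would never surface them. The resulting `prompt` would then be passed to an LLM, a step the sketch leaves out.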


Praveen Mohan Prasad


Praveen Mohan Prasad is a search specialist with data science expertise who actively researches and experiments with using Machine Learning to improve search relevance. Praveen advises clients on implementing and operationalising strategies to improve the search experience.

Hajer Bouafif


Hajer Bouafif is a solutions architect in Data Analytics and search with a background in Big Data engineering. Hajer provides organizations with best practices and well-architected reviews to build large-scale Machine Learning search solutions.