Haystack US 2025

Talks from the Search & Relevance Community at the Haystack Conference!

Created and organised by the search & AI experts at

The conference sessions will be held at the Violet Crown movie theater in central Charlottesville for in-person attendees and streamed live via Zoom for online attendees. Click a talk below for full details, slides and video.

Our Event Safety Guide and Code of Conduct.

Day 1, Wednesday, April 23rd, 2025

Time	Track 1	Track 2
8:00-9:00am EDT	Registration Location: Entrance of the Violet Crown
9:00-9:30am EDT	Welcome to Haystack! We welcome you to Haystack 2025! Hear about what we have planned for this year's conference, with a special focus this year on AI Powered Search as both a product and a technology. René Kriegler Location: Theater 5
9:45-10:30am EDT	Innovation in the Age of Artificial Intelligence In this keynote, we’ll explore how artificial intelligence is reshaping the ways we work, create, and innovate—with a particular emphasis on our shared roles as technologists. Together, we’ll look at how AI can magnify human potential, the pitfalls that might hinder true innovation, and the “wildcards” that could dramatically accelerate or stall AI’s widespread impact. You’ll also learn practical habits for nurturing an inventive mindset in a world rapidly infused with intelligent tools. Finally, we’ll glimpse the near-future, where AI permeates every aspect of our lives—and discover how to thrive in the era that’s just around the corner. Rick Hamilton Location: Theater 5
10:45-11:30am EDT	Persona Based Evaluation of Search Systems Search evaluation is notoriously complex, with even minor tweaks to hyperparameters or embeddings creating widespread ripple effects across retrieved results. Mitigating this inherent “instability” in search algorithm changes has long been a challenge. Traditional approaches, such as composing test cases (e.g., as supported by Quepid), offer a degree of control and consistency. However, writing and maintaining test cases is a painstaking task, particularly in dynamic environments where catalogs are constantly updated with new items. Our proposed method introduces a groundbreaking solution: leveraging user modeling inspired by the “LLM as a judge” paradigm to automate query and result-set generation. This approach dynamically creates realistic query-result pairs by simulating diverse user personas, each designed to evaluate different modalities (image, video, audio) within the catalog under varying prompts. The innovation lies in the adaptability and efficiency of the system: - Dynamic Test Coverage: Personas adapt to catalog changes, ensuring that new items are immediately evaluated without manual intervention. - Multi-Modality Testing: By supporting multiple content types, the system mirrors the diversity of modern search use cases. - Fully Automated Pipeline: The process eliminates the need for manually written test cases, reducing overhead and accelerating iteration cycles. This talk will provide an in-depth look at the methodology, the benefits of dynamic user modeling, and real-world results from applying this system. Attendees will leave with actionable insights into transforming their search evaluation strategies, unlocking new levels of stability and precision in their algorithms. Uri Goren Location: Theater 5	Behind the hype: managing billion-scale embeddings in Elasticsearch and OpenSearch Semantic search is often hailed as a game-changer, promising to solve challenges like relevance, complex sentence analysis, and synonym detection with just a few embeddings and a machine learning model. The demos look impressive—but what happens when you're dealing with more than a billion embeddings? In this talk, we move past the hype to explore the real-world complexities of managing large-scale vector databases, focusing on Elasticsearch and OpenSearch. Through practical, hands-on examples, we’ll share proven strategies to ensure scalability, maintain high performance, and optimize costs. Whether you're already managing a billion-vector database or preparing for large-scale deployment, this session will equip you with the knowledge and tools to tackle real-world challenges effectively. Amine Gani & Roudy Khoury Location: Theater 7
11:45am-12:30pm EDT	Lexical Love: Rediscovering the Power of Lexical Search in RAG Retrieval-Augmented Generation (RAG) is often built around semantic search, where documents are chunked, embedded as vectors, and retrieved based on their meaning. While this approach is powerful, it also comes with significant challenges—large indexes, clunky filtering mechanisms, and a lack of transparency in search results. Perhaps most critically, semantic search struggles with exact matches, making it difficult to retrieve specific IDs, phrases, or jargon words that weren’t present in the original model’s training data. In this talk, we’ll explore the role of lexical search in RAG workflows, highlighting how it can solve many of these issues. We’ll start with an overview of how lexical search works, including indexing, analysis, and search techniques like Boolean queries, faceted search, and phrase matching. We’ll contrast it with semantic search, explaining when and why you might want to use lexical search instead of vector-based methods. From there, we’ll walk through a practical implementation of lexical search in RAG. Using real-world examples, we’ll demonstrate how to index data, structure search queries to maximize relevance, and integrate lexical search into a RAG pipeline. We’ll also show how language models can interact with search results dynamically, refining queries and applying filters in response to user input. Of course, lexical search isn’t a silver bullet. We’ll discuss its limitations. And then we'll briefly introduce some of the hybrid approaches—ways to combine the strengths of both lexical and semantic search and possibly get the best of both worlds. By the end of this session, you’ll have a clear understanding of how lexical search fits into RAG, when to use it, and how to implement it effectively. If you’re working with LLM applications and want to make search more precise, transparent, and adaptable, this talk is for you. John Berryman Location: Theater 5	Building Relevance Formulas with LLMs E-commerce search is often about exploring options and providing good recommendations. Say I want to buy a car and I have some hard filters (e.g. budget) but most criteria are weights: cheaper is better, more reliable is also better (these two tend to collide). I prefer orange, but other colors are OK. I want to tell the LLM what I want - like in RAG - but I want sorted results with previews and facets - like in traditional E-commerce search - so I can see what's available and refine my filters manually. In short, I want a chatbot instead of a search box. This session covers a PoC of this approach. We express product attributes as tensors, then compute the score as a dot-product of these tensors and those representing user preferences. We'll combine these dot-products with other criteria - like distance or price - into an overall score with weights that can be tuned by chatting with an LLM. Because I want a cheap AND reliable car and I'm willing to discuss what that means for me :) Kristian Aune Location: Theater 7
12:30pm-2:00pm EDT	Lunch Find lunch at one of the many options available on Charlottesville’s Downtown Mall Location: Your choice!
2:00-2:45pm EDT	Maximizing Multimodal: Exploring the search frontier of text-image models to improve visual find-ability for creatives Objective: Describe where and how we have improved the search experience in our product with open source multi-modal models and libraries. Real world examples from the things we have shipped (and decided not to ship) to production, including AB test results of our relevancy changes. Outline: 1. Cover the architecture of our open source hybrid search stack at Eezy (Elasticsearch, FAISS, PyTorch models) 2. Describe the capabilities and limitations of openCLIP (and vector embeddings at large) for retrieval tasks and current pain points and work arounds we have engineered. 3. Highlight meaningful stops on our product roadmap from the last 2 years of deploying features into production. 4. Describe notable missteps and surprises uncovered along the way, so people see it's not all roses in the AI powered future. 5. Demo BORGES, a novel search framework that allows users to search with multiple queries in multiple modalities for a nuanced navigation of the catalog to find exactly what they need Audience: - Anyone curious about real-world results we have extracted from AI - Search practitioners developing hybrid search applications - PyTorch and transformers enthusiasts interested in applications in vector space Nathan Day Location: Theater 5	Evaluating Search Features: Lessons from Launching Knowledge Panels & Vector Search in Medical Search As professional expert site search engines evolve, they are increasingly adopting features pioneered by general-purpose platforms such as Google - knowledge panels, preview snippets, LLM-synthesized answers, and vector search. In essence: pulling up answers to questions to the search result pages. However, evaluating these features, particularly in the absence of transactional signals like purchases or basket additions, remains a significant challenge. How can we determine whether users have found what they were looking for? This case study examines how AMBOSS, a medical information platform used by 60% of US medical students and hundreds of thousands of doctors in the US and worldwide, developed and deployed search features such as Knowledge Panels, People Also Searched, and Vector Search. I discuss the challenges of measuring success in a domain where traditional relevance signals such as basket value or purchases are not available and delve into alternative approaches to evaluation. For example Click-through rate (CTR), often used as a proxy for search relevance, can be both an indicator for relevant search results but also for irrelevant answer snippets shown, depending on context. Through a combination of user feedback loops, and engagement-based heuristics, we developed a framework for assessing search feature effectiveness that we use in online controlled experiments (A/B testing). This talk provides practical insights into judging impact of algorithmic improvements and user experience changes in a high-stake, niche domain, knowledge-intensive search environment. Valentin v. Seggern Location: Theater 7
3:00-3:45pm EDT	Supercharging Search in OpenSearch: Harnessing User Behavior insights for Continuous Improvement Organizations invest significant resources in building and optimizing their search systems, relying on OpenSearch to deliver the most relevant results. However, while search queries may return top results, this only tells part of the story. Critical details are often missing—such as which product a user selects, whether they add it to their cart, or if they are satisfied with the results. These insights are vital for improving the search experience, yet they are not captured by the search system itself. In this session, I will demonstrate how to go beyond basic search results by capturing user behavior, including the specific searches they perform, their interactions with search results, and events such as time spent on the search page and the products they select. By feeding this data into an analytical engine, we can continuously learn from user behavior and adjust the search system accordingly. This approach gives organizations a complete view of their users' actions, enabling them to understand what works and what doesn’t in the search experience. It helps track essential metrics like how long users stay on the search page, which products catch their attention, and how their interactions align with overall satisfaction. With these insights, organizations can fine-tune search relevance, optimize product recommendations, and ultimately improve the user experience. To illustrate this process, I will use OpenSearch in an open-source environment, providing a live demo that showcases how these techniques work in practice. By the end of the session, you will have a clear understanding of how capturing user behavior data can drive more relevant search results and enhance the overall user experience in your system. Aruna Govindaraju Location: Theater 5	Beyond LLM-as-judge towards LLM relevance engineering Most now agree that LLMs can provide useful search judgments. In this talk, we’ll discuss the next level: an ensemble of judges to deepen our understanding of the search relevance problem. Evolving to LLM directed search relevance engineering. Imagine many judges focused on different strategies, each compared to our human groundtruth. If an LLM judge that only sees a product name agrees with human raters 80% of the time, then shouldn’t your search team focus there? If an LLM that expands the query before judging brings agreement to human labels up to 90% does this call for a query expansion feature? And how do these judges relate to each other to make an even better prediction? Query expansion may only matter when there’s not a strong product name match, perhaps indicating searching only the description in these cases would matter more. I’ll share an LLM-in-the-loop search relevance process. Using LLMs focused on its own ranking feature as lego pieces, and traditional ML as the connective tissue, you’ll see how an ensemble of judges guides the relevance engineering effort to find promising features and learn their inter-relationships. Doug Turnbull Location: Theater 7
4:00-5:15pm EDT	Lightning Talks Quick discussions about anything around search relevance! Location: Theater 5
5:30-8:00pm EDT	Haystack Reception & Dinner (included with registration) All attendees are welcome. The location is Kardinal Hall located here. It is about a 10 minute walk from the conference venue. Location: Kardinal Hall

Day 2, Thursday, April 24th, 2025

Time	Track 1	Track 2
8:00-9:00am EDT	Coffee Location: Entrance of the Violet Crown
8:30-9:00am EDT	Book Signing with Trey Grainger and Doug Turnbull Get your AI Powered Search Book Signed by the Authors Location: Entrance of the Violet Crown
9:00-9:15am EDT	Welcome Back Location: Theater 5
9:15-10:00am EDT	Judge Moody's: Automating Semantic Search Relevance Evaluation with LLM Judges Moody’s semantic search engine powers retrieval over millions of financial research documents, providing essential context to Research Assistant, a Retrieval Augmented Generation (RAG) application, to accurately answer users’ questions. Ensuring relevance and accuracy of retrieved context is critical in financial research, where data misinterpretation can have significant implications. While the quality of Research Assistant’s answers depends heavily on this context, traditional evaluation methods relying on domain experts for relevance judgments are time-consuming and cost-prohibitive at scale. To overcome this challenge, we developed a search relevance evaluation framework leveraging a large language model (LLM) as an automated judge. Through iterative prompt tuning with few-shot learning, explicit evaluation criteria, and other techniques, our system achieves over 80% agreement with domain-expert evaluators. Our framework compiles test sets of arbitrary size, retrieves relevant chunks, and automatically evaluates relevance using standard information retrieval metrics such as precision, recall, and nDCG. This pipeline enables rapid iteration on search algorithms with immediate feedback on retrieval quality, cutting experiment time from days to minutes, while maintaining high correlation with expert assessments. Although the system occasionally struggles with highly technical financial concepts, ongoing efforts focus on enhancing domain-specific evaluation capabilities through further prompt refinement and integration of expert feedback. This work represents a significant step in automating relevance evaluation in financial research, offering a scalable, efficient, and cost-effective solution. In this talk, we will explore the development and implementation of our LLM-based evaluation framework, including our prompt engineering methodology, validation process against expert judgments, and lessons learned in applying this approach to specialized financial content. Gurion Marks Location: Theater 5	Automatic FAQ: Building a Multi-Agent Systems to Extract Insight from User Discussions For many products, especially software, online user discussion provides a valuable trove of data for improvement. Manually reading all discussion and extracting key themes and insights is, however, typically prohibitive in terms of human labor. In this presentation, we develop a multi-agent system built with open-source tooling that is capable of summarizing large corpora of user discussion from Discord, GitHub, and other sources, extracting insights on common themes across users in order to improve product usability and documentation. We construct the system with Milvus, HuggingFace, and LangChain libraries, and demonstrate its usefulness on analyzing the Milvus user documentation. Stefan Webb Location: Theater 7
10:15-11:00am EDT	AMA with the Authors of AI-Powered Search Trey Grainger and Doug Turnbull, both authors of the recently-published AI-Powered Search book (https://aipoweredsearch.com), take the stage and attempt to answer your biggest questions about search, relevance, and the evolving landscape of AI search. Have a tough question about dense vs. sparse vector search, quantization, semantic vs. lexical search vs. multimodal vs. hybrid search, personalized search and recommendations, bi-encoders vs. cross-encoders vs. LTR models, semantic knowledge graphs, semantic functions, late interaction models, or best practices for RAG, question answering, and interpreting query intent? Ask questions to take advantage of the authors' decades of experience helping many of the world's top organizations solve complex search and relevance problems, or otherwise see if you can 'stump the chump' with your craziest questions! Trey Grainger & Doug Turnbull Location: Theater 5
11:15-12:00pm EDT	Enhancing Generative AI Evaluation with Synthetic Raters In the realm of Generative AI, human Subject Matter Experts (SMEs) are the gold standard for evaluating AI outputs across diverse domains such as medicine, law, and finance. However, the human evaluation process is resource-intensive, both in terms of time and cost. This presentation explores the innovative use of Generative AI-based Synthetic Raters as a cost-effective alternative for evaluating AI-generated content. A Synthetic Rater is a composite of three elements: a trained Large Language Model (LLM) or similar AI construct, a set of system-level parameters (e.g., prompts), and metadata for identification and versioning. These components mirror those used in human rating processes, allowing for a seamless integration into existing evaluation frameworks. The primary distinction lies in the training and background differences between human and synthetic raters, which can be analyzed using comparison and regression tools within an SME rating framework. Our research introduces a robust framework for SME-based evaluation that leverages both human and synthetic rater results. We conducted extensive tests using various LLMs and system prompts, comparing synthetic-to-synthetic and human-to-synthetic evaluations across multiple metrics. The findings reveal significant potential for synthetic raters to complement human evaluations, offering diverse perspectives and enhancing overall assessment quality. This presentation will detail the common metrics employed, the testing methodologies, and the results of our evaluations. We will also explore practical use cases and propose innovative strategies for integrating human and synthetic ratings, ultimately paving the way for more efficient and scalable AI evaluation processes. Doug Rosenoff Location: Theater 5	MiniCoil: A Hybrid Sparse Retrieval Model for Scalable and Context-Aware Semantic Search The MiniCoil Sparse Retrieval Model introduces an innovative approach to scalable semantic search by blending the interpretability of sparse retrieval with the contextual depth of dense embeddings. Designed for high performance with minimal computational overhead, MiniCoil strikes a balance between efficiency and semantic richness. At its core, MiniCoil generates a compact, sparse representation by leveraging transformer-based embeddings as a foundation, combined with trained, meaning-preserving layers that achieve significant dimensionality reduction. This approach ensures that semantic information is retained while maintaining computational efficiency. The hybrid design supports dynamic vocabulary expansion, seamlessly falling back to BM25 for out-of-vocabulary terms. This ensures robust and reliable retrieval across diverse datasets. Additionally, MiniCoil’s domain-agnostic architecture makes it a versatile, general-purpose retrieval solution while enabling fine-tuning for specialized applications, such as legal and medical search. MiniCoil is ideal for powering search engines, enterprise knowledge systems, and conversational AI. This session will delve into its core architecture, training methodologies, and practical use cases, equipping attendees with actionable insights to develop efficient, context-aware retrieval systems that balance speed, accuracy, and interpretability. Thierry Damiba Location: Theater 7
1:00pm-1:30pm EDT	Book Signing with Trey Grainger and Doug Turnbull Get your AI Powered Search Book Signed by the Authors Location: Entrance of the Violet Crown
12:00pm-1:30pm EDT	Lunch Find lunch at one of the many options available on Charlottesville’s Downtown Mall Location: Your choice!
1:30-2:30pm EDT	Women of Search Presents: How Great Product Managers Build for Impact Join Women of Search for an insightful discussion on how search product managers can move beyond reactive cycles of quick wins and tech debt to build strategic, insights-driven roadmaps. Attendees will learn how to balance tactical demands with long-term vision, while keeping stakeholders aligned and empowering teams to focus on high-impact deliverables. Audrey Lorberfeld & Samdisha Kapoor Location: Theater 5
2:45-3:30pm EDT	From Traditional Keyword Search to AI-Powered Search: Our Journey In January 2024, we launched our in house search solution, transitioning away from third-party providers to gain greater control over relevance, performance, and innovation. Our initial focus was identifier search, ensuring customers could quickly find products using SKUs, part numbers, and other structured identifiers. With this foundation in place, we began our journey toward AI-powered search to enhance recall and relevance for more complex queries. We integrated AI-driven reranking for head terms, leveraging machine learning to reorder results based on behavioral signals. Recognizing the limitations of strict lexical matching, we introduced semantic expansion using synonym models, improving query understanding and recall. To enhance the search experience further, we built a typeahead model, making real-time query suggestions more intuitive and personalized. As our AI capabilities matured, we implemented KNN search, enabling vector-based retrieval to surface results that go beyond traditional term matching. To fine-tune ranking, we introduced Learning to Rank (LTR), optimizing results based on click-through rates and conversion signals. Our journey from standing up our first traditional keyword search solution to AI-powered retrieval has transformed our search experience, but it hasn’t been without challenges. We’ll share key lessons from this evolution—balancing precision vs. recall, managing infrastructure complexity, and evaluating AI-driven improvements. This session will provide a practical roadmap for search teams looking to transition from keyword-based search to AI-enhanced discovery in an eCommerce setting. Jon Vivers & Jason Taylor & Meet Parekh Location: Theater 5	AI and LLM strategies and application at Mercari Search Mercari is a Japan-based second-hand e-commerce marketplace. We have been relying on Elastic Search for retrieval and DNN Learning to Rank for ranking for a long time. With the development of deep learning and LLM, In this talk, I would like to share we re-architecture our search system and convince internal stakeholders to take on new technology. In addition, how some of those new technologies work and not work in C2Ce-commerce setting. 1. End-to-end search architecture walkthrough and what new components are working for us 2. Success LLM application: offline dataset labeling, query expansions, and language translation using LLM and image embedding 3. Business consideration of the new technology such as ROI analysis Kaiyi Liu Location: Theater 7
3:45-4:30pm EDT	From Static to Dynamic: Data-Driven Query Understanding to Supercharge Hybrid Search As AI-powered search evolves, hybrid search has become a powerful method for enhancing result quality. However, tuning hybrid search is a complex optimization problem due to the need to balance lexical and semantic relevance while adapting to diverse query intents. In this talk, we'll present a scalable, data-driven framework for optimizing hybrid search using machine learning techniques, focusing on real-time query adaptation to maximize relevance. We’ll explore how we transitioned from a static one-size-fits-all approach to query-specific optimization, using machine learning to dynamically adjust search parameters. Attendees will gain insights into: * How to address the challenges of hybrid search optimization. * Techniques to predict and fine-tune search parameters using query-specific and query result-specific signals. * Improvements in search quality metrics (DCG, NDCG, Precision) of more than 7% compared to the static approach. * Steps to integrate query-specific optimization into your existing search infrastructure. By the end of this session, attendees will be equipped to build adaptable, machine learning-driven hybrid search systems that respond dynamically to user queries, resulting in measurable gains in search relevance and user satisfaction. Daniel Wrigley Location: Theater 5	Beyond RAG — going from search to analytics on unstructured data with Aryn Over the past year generative AI models of GPT-4 quality have gotten 50-80x cheaper and 10-20x faster. Given this trend, LLMs have the potential to go beyond RAG and run complex semantic analyses on unstructured data at scale. Consider, for example, marketing analysts in pharmaceuticals that want to analyze thousands of interviews to assess key factors in adoption of medications by market segment, or financial analysts that want to perform due diligence across thousands of reports to form an investment thesis. Solving for these use cases requires systems that efficiently sweep through large datasets, harvest high quality metadata at query time, and synthesize results. Towards this, we motivate and describe the design of an AI-powered unstructured data warehouse, eponymously named Aryn. With Aryn, users specify queries in natural language and the system automatically generates and executes a semantic plan across a large collection of unstructured documents. To accomplish this, it orchestrates data through a collection of visual AI models and LLMs using the Sycamore open source library. In this talk, I will demonstrate analytics queries over real world reports from the National Transportation Safety Board (NTSB). I will walk through end-to-end how the system ingests and indexes its data using Sycamore and OpenSearch, and plans and executes queries to achieve much better accuracy than RAG approaches. Also, given current limitations of LLMs, we argue that an analytics system must be verifiable to be practical. Toward this, we show how Aryn’s user interface provides explainability through lineage and execution traces to help build trust. Mehul Shah Location: Theater 7
4:30-4:45pm EDT	Closing Closing remarks and announcements René Kriegler Location: Theater 5
5:30-8:00pm EDT (and beyond...)	Haystack Sponsored Dinners List of sponsors, dinners, and venues TBD Location: Venues for each to be determined