LLMs as Judges - Modern Approaches to Search Evaluation - May 5th
Workshop to be held before the main conference on May 5th, 2026
As Large Language Models (LLMs) continue to transform how we build and evaluate search systems, it’s crucial to understand how to use them effectively as “judges.” This hands-on training will guide you through the principles, practical techniques, and real-world considerations for using LLMs to evaluate search relevance, result quality, and retrieval-augmented generation (RAG) systems.
We’ll start with the fundamentals of search evaluation — from traditional human judgments and behavioral signals to how LLMs fit into the picture. You’ll learn how to design clear evaluation frameworks, craft effective prompts, and define output structures that make LLM-based judgments robust, interpretable, and aligned with your rules and goals.
Key topics covered — you will learn how to:
- Understand the landscape of search evaluation:
  - Differentiate between human, behavioral, and LLM-based evaluation.
  - Define what makes a "good" judgment in different contexts.
- Design and implement robust LLM evaluation frameworks:
  - Develop clear evaluation rules.
  - Craft effective prompts and structure LLM outputs for consistency.
  - Choose between judging query-document pairs and comparative evaluation.
  - Decide when and how to apply scales versus labels.
- Improve and refine LLM judgments:
  - Enhance judgments using breakdowns, chain-of-thought, and thinking steps.
  - Leverage personas to tailor judgments.
  - Select appropriate models and thoroughly test your evaluator.
- Address challenges and explore advanced techniques:
  - Perform adversarial tests to stress-test your prompts.
  - Identify and mitigate biases and pitfalls in LLM-based evaluation.
  - Explore fine-tuning and ensemble approaches for enhanced performance.
  - Manage cost and quality trade-offs effectively.
- Apply LLMs to diverse evaluation scenarios:
  - See a practical demo of LLM as a Judge in Quepid.
  - Discover other use cases, including competitor analysis, RAG evaluation, and integration into agentic workflows.
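As a small taste of the prompt and output-structure design covered above, here is a minimal sketch of an LLM-as-judge setup for a single query-document pair. The rubric, labels, and stubbed model reply are illustrative assumptions, not the workshop's actual materials; in practice the reply would come from a real LLM call.

```python
import json

# Hypothetical judging template: one query-document pair, a fixed label set,
# and a required JSON output shape so judgments stay machine-parseable.
JUDGE_PROMPT = """You are a search relevance judge.
Query: {query}
Document: {document}

Rate relevance with exactly one label: "relevant", "partial", or "irrelevant".
Reply with JSON: {{"label": "...", "reason": "..."}}"""


def build_prompt(query: str, document: str) -> str:
    """Fill the judging template for one query-document pair."""
    return JUDGE_PROMPT.format(query=query, document=document)


def parse_judgment(raw: str) -> dict:
    """Parse the model's reply and reject labels outside the rubric."""
    judgment = json.loads(raw)
    if judgment["label"] not in {"relevant", "partial", "irrelevant"}:
        raise ValueError(f"unexpected label: {judgment['label']}")
    return judgment


# Stubbed reply standing in for a real LLM call.
reply = '{"label": "partial", "reason": "Mentions the topic but lacks detail."}'
print(parse_judgment(reply)["label"])  # prints: partial
```

Constraining the output to a small label set and a strict JSON shape is what makes judgments like this consistent enough to aggregate and audit, which is the thread running through the framework-design topics above.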
Who should attend:
- Search engineers and relevance practitioners
- Data scientists working with LLM pipelines
- Product managers and researchers designing evaluation frameworks
- Anyone interested in the future of retrieval and generation quality assessment
Bring your questions — we’ll wrap up each session with a lab where you can experiment with prompt strategies and interpret model outputs.
Requirements: Basic familiarity with search concepts and LLMs is recommended.
Date: May 5, 2026, 9:00 AM - 5:00 PM EDT
Location: Charlottesville