LLMs as Judges - Modern Approaches to Search Evaluation - September 25th

Training to be held after the main conference on September 25th, 2025

As Large Language Models (LLMs) continue to transform how we build and evaluate search systems, it’s crucial to understand how to use them effectively as “judges.” This hands-on training will guide you through the principles, practical techniques, and real-world considerations for using LLMs to evaluate search relevance, result quality, and retrieval-augmented generation (RAG) systems.

We’ll start with the fundamentals of search evaluation — from traditional human judgments and behavioral signals to how LLMs fit into the picture. You’ll learn how to design clear evaluation frameworks, craft effective prompts, and define output structures that make LLM-based judgments robust, interpretable, and aligned with your rules and goals.
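
To make that concrete ahead of the training, here is a minimal, illustrative sketch of a pointwise judging setup: a prompt with explicit labels and a JSON output contract, parsed into a structured judgment. The `call_llm` helper, the label set, and the prompt wording are assumptions for illustration only, not the framework taught in the course or any specific vendor API.

```python
import json

# Illustrative grading rules: a small label scale with short definitions,
# plus an explicit JSON output contract for consistent, parseable judgments.
JUDGMENT_PROMPT = """You are a search relevance judge.
Given a user query and a document, assign exactly one label:
- "relevant": the document directly answers the query
- "partial": the document is on topic but incomplete
- "irrelevant": the document does not help with the query

Query: {query}
Document: {document}

Respond with JSON only: {{"label": "...", "reasoning": "..."}}"""


def judge(query: str, document: str, call_llm) -> dict:
    """Return a structured relevance judgment for one query-document pair.

    `call_llm` is a hypothetical callable (prompt text in, raw completion out)
    standing in for whichever model client you actually use.
    """
    raw = call_llm(JUDGMENT_PROMPT.format(query=query, document=document))
    judgment = json.loads(raw)  # fails loudly if the model ignored the JSON contract
    if judgment.get("label") not in {"relevant", "partial", "irrelevant"}:
        raise ValueError(f"Unexpected label: {judgment.get('label')!r}")
    return judgment
```

The training digs into exactly these design choices: which labels or scales to use, how much reasoning to request, and how to keep the output machine-readable.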

Key topics covered: you’ll learn how to...

  • Understand the landscape of search evaluation:
    • Differentiate between human, behavioral, and LLM-based evaluation.
    • Define what makes a “good” judgment in different contexts.
  • Design and implement robust LLM evaluation frameworks:
    • Develop clear evaluation rules.
    • Craft effective prompts and structure LLM outputs for consistency.
    • Choose between judging query-document pairs and comparative evaluation (see the pairwise sketch after this list).
    • Decide when and how to apply scales versus labels.
  • Improve and refine LLM judgments:
    • Enhance judgments using breakdowns, chain-of-thought, and thinking steps.
    • Leverage personas to tailor judgments.
    • Select appropriate models and thoroughly test your evaluator.
  • Address challenges and explore advanced techniques:
    • Perform adversarial tests to stress-test your prompts.
    • Identify and mitigate biases and pitfalls in LLM-based evaluation.
    • Explore fine-tuning and ensemble approaches for enhanced performance.
    • Manage cost and quality trade-offs effectively.
  • Apply LLMs to diverse evaluation scenarios:
    • See a practical demo of LLM as a Judge in Quepid.
    • Discover other use cases, including competitor analysis, RAG evaluation, and integration into agentic workflows.
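
Picking up the pointwise-versus-comparative choice from the list above, here is a companion sketch of a pairwise judgment, where the model is shown two results for the same query and asked which serves it better. As before, `call_llm`, the prompt wording, and the verdict values are illustrative assumptions rather than course material.

```python
import json

# Illustrative pairwise prompt: the judge picks the better of two results
# for the same query, or declares a tie.
PAIRWISE_PROMPT = """You are comparing two search results for the same query.

Query: {query}
Result A: {doc_a}
Result B: {doc_b}

Which result serves the query better?
Respond with JSON only: {{"winner": "A" or "B" or "tie", "reasoning": "..."}}"""


def compare(query: str, doc_a: str, doc_b: str, call_llm) -> dict:
    """Return a pairwise preference; `call_llm` is the same hypothetical helper as above."""
    raw = call_llm(PAIRWISE_PROMPT.format(query=query, doc_a=doc_a, doc_b=doc_b))
    verdict = json.loads(raw)
    if verdict.get("winner") not in {"A", "B", "tie"}:
        raise ValueError(f"Unexpected winner: {verdict.get('winner')!r}")
    return verdict
```

A well-known pitfall with this setup is position bias, one of the biases covered above; a common mitigation is to run each pair in both orders and only trust verdicts that agree when A and B are swapped.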

Who should attend:

  • Search engineers and relevance practitioners
  • Data scientists working with LLM pipelines
  • Product managers and researchers designing evaluation frameworks
  • Anyone interested in the future of retrieval and generation quality assessment

Bring your questions — we’ll wrap up each session with a lab where you can experiment with prompt strategies and interpret model outputs.

Requirements: Basic familiarity with search concepts and LLMs is recommended.

Date: September 25th, 2025, 9:00–17:00

Location: KOPF, HAND + FUSS gGmbH c/o TUECHTIG, Oudenarder Straße 16, House D06, 1st floor, 13347 Berlin

Buy your ticket now!