Judge Moody's: Automating Semantic Search Relevance Evaluation with LLM Judges

Gurion Marks • Location: Theater 5 • Haystack 2025

“Moody’s semantic search engine powers retrieval over millions of financial research documents, providing essential context to Research Assistant, a Retrieval Augmented Generation (RAG) application, to accurately answer users’ questions. Ensuring relevance and accuracy of retrieved context is critical in financial research, where data misinterpretation can have significant implications. While the quality of Research Assistant’s answers depends heavily on this context, traditional evaluation methods relying on domain experts for relevance judgments are time-consuming and cost-prohibitive at scale. To overcome this challenge, we developed a search relevance evaluation framework leveraging a large language model (LLM) as an automated judge. Through iterative prompt tuning with few-shot learning, explicit evaluation criteria, and other techniques, our system achieves over 80% agreement with domain-expert evaluators. Our framework compiles test sets of arbitrary size, retrieves relevant chunks, and automatically evaluates relevance using standard information retrieval metrics such as precision, recall, and nDCG. This pipeline enables rapid iteration on search algorithms with immediate feedback on retrieval quality, cutting experiment time from days to minutes, while maintaining high correlation with expert assessments. Although the system occasionally struggles with highly technical financial concepts, ongoing efforts focus on enhancing domain-specific evaluation capabilities through further prompt refinement and integration of expert feedback. This work represents a significant step in automating relevance evaluation in financial research, offering a scalable, efficient, and cost-effective solution. In this talk, we will explore the development and implementation of our LLM-based evaluation framework, including our prompt engineering methodology, validation process against expert judgments, and lessons learned in applying this approach to specialized financial content.”
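For illustration only, the metrics named in the abstract (precision, recall, nDCG, and judge-versus-expert agreement) can be sketched as below. This is a minimal sketch under stated assumptions, not the framework described in the talk: the function names, the integer relevance grades (0 means not relevant), the linear-gain form of DCG, and the choice to compute the ideal ranking from the retrieved list alone are all hypothetical. The LLM call itself is left out; the code starts from labels a judge has already produced.

```python
import math
from typing import Sequence


def precision_at_k(grades: Sequence[int], k: int) -> float:
    """Fraction of the top-k retrieved chunks with a grade above 0."""
    if k <= 0:
        return 0.0
    return sum(1 for g in grades[:k] if g > 0) / k


def recall_at_k(grades: Sequence[int], total_relevant: int, k: int) -> float:
    """Share of all known relevant chunks that appear in the top k.

    total_relevant is an assumption: it must come from the test set,
    since the ranked list alone cannot reveal what was missed.
    """
    if total_relevant <= 0:
        return 0.0
    return sum(1 for g in grades[:k] if g > 0) / total_relevant


def dcg(grades: Sequence[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg_at_k(grades: Sequence[int], k: int) -> float:
    """DCG of the top k divided by the DCG of the best reordering.

    The ideal ordering is built from the retrieved grades only
    (an assumption of this sketch).
    """
    ideal = dcg(sorted(grades, reverse=True)[:k])
    return dcg(grades[:k]) / ideal if ideal > 0 else 0.0


def agreement_rate(judge: Sequence[int], expert: Sequence[int]) -> float:
    """Fraction of items where the judge label equals the expert label."""
    if len(judge) != len(expert):
        raise ValueError("label lists must have the same length")
    if not judge:
        return 0.0
    return sum(j == e for j, e in zip(judge, expert)) / len(judge)


if __name__ == "__main__":
    # Hypothetical grades for one query, in ranked order.
    llm_grades = [2, 0, 1, 0, 1]
    print(precision_at_k(llm_grades, k=5))
    print(recall_at_k(llm_grades, total_relevant=4, k=5))
    print(ndcg_at_k(llm_grades, k=5))

    # Hypothetical binary labels for the same five chunks.
    judge_labels = [1, 0, 1, 0, 1]
    expert_labels = [1, 0, 0, 0, 1]
    print(agreement_rate(judge_labels, expert_labels))
```

Raw agreement is the simplest way to compare a judge against experts. A chance-corrected statistic such as Cohen's kappa would be a stricter alternative, though the abstract does not say which measure was used.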

Gurion Marks

Moody's