Beyond precision and recall – ensuring 'aboutness' in topical classification using confidence scores

Mark Shewhart and Sophie Lagace • Location: Theater 5 • Haystack 2022

Taxonomies play an important role in many LexisNexis products, allowing our customers to run searches using predefined topics as either pre- or post-filters. Because our topical classification is automated, results can vary widely in relevance, from the very relevant to the more marginal. We need to ensure that documents discussing a particular topic in breadth and depth surface at the top of the results list. This presentation will demonstrate how stamping Confidence Scores into documents, in addition to a topic code, is crucial to achieving this goal of ‘aboutness’. It will cover experiments that ‘boost’ or ‘re-rank’ results using the confidence scores, and the internal tools used to measure the resulting improvements in relevance. The presentation will also outline the methods, both machine learning and rule-based, used to develop a confidence score with a consistent meaning across content types and underlying classification technologies.
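
To make the idea of re-ranking with a confidence score concrete, here is a minimal Python sketch. It is not the method presented in the talk: the `Hit` fields, the `rerank_by_confidence` function, the multiplicative formula, and the sample values are all assumptions chosen for illustration, and a real system would likely apply the boost inside the search engine itself.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    base_score: float        # hypothetical relevance score from the search engine
    topic_confidence: float  # hypothetical confidence stamped with the topic code, in [0, 1]


def rerank_by_confidence(hits, weight=1.0):
    """Sort hits by base_score * (1 + weight * topic_confidence), highest first.

    The multiplicative boost is an assumed formula, used only for illustration.
    With weight=0 the original base-score ordering is preserved.
    """
    return sorted(
        hits,
        key=lambda h: h.base_score * (1.0 + weight * h.topic_confidence),
        reverse=True,
    )


# Made-up example: every document carries the topic code, but with
# different confidence that the document is really "about" the topic.
hits = [
    Hit("marginal", base_score=2.0, topic_confidence=0.1),
    Hit("on_topic", base_score=1.8, topic_confidence=0.9),
    Hit("middling", base_score=1.5, topic_confidence=0.5),
]

reranked = [h.doc_id for h in rerank_by_confidence(hits)]
unboosted = [h.doc_id for h in rerank_by_confidence(hits, weight=0.0)]
```

In this toy data the document with the highest base score but a low confidence drops below the two documents the classifier is more confident about, which is the ‘aboutness’ effect the abstract describes.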

Mark Shewhart

Principal Data Scientist, iLabs, Global Data Office, GTO - LexisNexis

Mark has worked at LexisNexis for 29 years. He has held a variety of positions centered on content enrichment technologies, editorial efficiency tools, natural language processing, machine learning, content suggestion, metrics, and related research. Mark invented SmartIndexing, a rule-based system that has been in use for 25 years and is applied to all news content in three languages as well as a variety of business and legal materials. Most recently, Mark has worked on Shepard’s editorial efficiency, automated suggestions for headnote classification, More Like This Headnote improvements, and other efforts.

Sophie Lagace

Lead Legal Taxonomist - LexisNexis

Sophie has worked for LexisNexis for 20 years. After practicing law for a few years, she joined LexisNexis as a Legal Editor, but after becoming involved in the Taxonomy project she quickly discovered a great interest in information science and the technical side of research. She is now responsible for the Canadian Legal Taxonomy, a bilingual and bijural classification of 4,000 topics used to classify all Canadian legal content. Over the years she has been involved in various projects aimed at improving classification accuracy, as well as legal topic navigation and functionality. Most recently, Sophie has worked on implementing a confidence score tool designed to improve the relevance of topic search results.

Kimberly Hoffbauer

Lead News & Business Taxonomist - LexisNexis

Kimberly has worked in Taxonomies & Indexing at LexisNexis for 23 years. The News and Business taxonomies include over 14,000 topics and are applied across all news content as well as various business and legal content. A key part of Kimberly's role is building and testing the SmartIndexing rules used to classify content to a wide variety of Subject, Industry, People and Geographic topics. These topics, which are stamped in the data, are leveraged by multiple products for searching and filtering to help our users find their answers faster.