Analyzing Language Analyzers, Even in Languages You Don't Know

Whether it's a minor upgrade from ASCII folding to ICU folding or the deployment of a completely new language analyzer, the effects of changes to your analysis chain can be hard to predict. When working in a language you don't know very well (or at all!), it's especially difficult to get a sense of how well a language analyzer works, even if you have the assistance of native speakers who are not search experts. In this talk we'll look at some of the problems that can crop up with language analyzers and how to detect them, based on my experience evaluating, upgrading, pluginifying, and deploying language analyzers for Wikipedias and other wiki projects in over a dozen languages, most of which I don't speak!

Trey Jones is a computational linguist who likes to do tricky things with text and mash numbers and words together in ways not intended by nature. He is part of the Wikimedia Foundation Search Platform team and spends his time working on search and relevance, trying to better support search in various languages, analyzing queries, and doing random mathy things. He worked on search engines before search engines were cool. Don't tell anybody, but Wiktionary is his favorite wiki.