Bad Text, Bad Search: Evaluating Text Extraction with Apache Tika's tika-eval Module

Tim Allison • Haystack 2018

Reliable text extraction is essential for search. Without reliable text, users cannot find the content they need, and advanced relevance tuning is worthless.

Apache Tika™ is a crucial component in a wide variety of search engines and content management systems, including Solr™, Nutch™, Alfresco, Elasticsearch and Sleuth Kit®/Autopsy®.

In this 45-minute talk, we will offer a quick introduction to Apache Tika and some recent improvements. We will then give an overview of the new tika-eval module, which allows developers to evaluate Tika and other content extraction tools. The talk will end with a brief discussion of the results of taking this evaluation methodology public: over the last two years, we have evaluated Tika on large batches of public domain documents on a public VM.

Tim Allison

The MITRE Corporation
Principal Artificial Intelligence Engineer

Tim has been working in natural language processing since 2002. In recent years, his focus has shifted to advanced search and content/metadata extraction. Tim is a member of the Apache Software Foundation, and a committer and PMC member on Apache PDFBox (since September 2016), Apache POI and Apache Tika (since July 2013). Tim holds a Ph.D. in Classical Studies from the University of Michigan, and in a former life, he was a professor of Latin and Greek.