Embracing Diversity: Searching over Multiple Languages

Suneel Marthi & Jeff Zemerick • Back to Haystack 2018

Although a lot of online content is written in English, there are many non-English-speaking users who still need to retrieve information. When searching, especially for tech-related topics, it’s common to compose queries in English; however, such users may prefer search results written in their native language.

We’ll see how statistical machine translation tools can help in this scenario by translating text at query time, improving both recall and precision for search engine queries.

We’ll look at how cross-language information retrieval can be implemented on top of Apache Lucene with the help of the Apache Joshua machine translation toolkit.

The audience will gain a better understanding of how to run search queries against a multilingual corpus indexed in Apache Lucene and retrieve all of the relevant search results across different languages.
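As a rough illustration of the query-time translation idea described above, here is a minimal Python sketch. It does not use the Apache Lucene or Apache Joshua APIs: `ToyIndex` is a hypothetical in-memory inverted index standing in for Lucene, and `translate` is a hypothetical word-for-word lookup standing in for a real machine translation system. The tiny lexicon and the sample documents are invented purely for the example.

```python
from collections import defaultdict

# Hypothetical toy lexicon standing in for a real machine translation model.
TOY_LEXICON = {
    ("en", "de"): {"search": "suche", "library": "bibliothek"},
    ("en", "es"): {"search": "búsqueda", "library": "biblioteca"},
}


def translate(text, source_lang, target_lang):
    """Word-for-word lookup; unknown words pass through unchanged."""
    if source_lang == target_lang:
        return text
    table = TOY_LEXICON.get((source_lang, target_lang), {})
    return " ".join(table.get(tok, tok) for tok in text.lower().split())


class ToyIndex:
    """Hypothetical in-memory inverted index (a stand-in for a search engine)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for tok in text.lower().split():
            self.postings[tok].add(doc_id)

    def search(self, query):
        """Return ids of documents containing any query term (OR semantics)."""
        hits = set()
        for tok in query.lower().split():
            hits |= self.postings.get(tok, set())
        return hits


def cross_language_search(index, query, source_lang, target_langs):
    """Run the original query plus one translated variant per target language."""
    hits = set(index.search(query))
    for lang in target_langs:
        hits |= index.search(translate(query, source_lang, lang))
    return sorted(hits)


demo_index = ToyIndex()
demo_index.add(1, "full text search library")    # English
demo_index.add(2, "volltext suche bibliothek")   # German
demo_index.add(3, "biblioteca de búsqueda")      # Spanish
demo_index.add(4, "cooking recipes")             # unrelated

print(demo_index.search("search"))
print(cross_language_search(demo_index, "search", "en", ["de", "es"]))
```

In this toy setup the untranslated query only reaches the English document, while the translated variants also reach the German and Spanish ones. A real system would additionally need language-specific analysis and some way of ranking results across languages, which this sketch leaves out.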

Suneel Marthi

Suneel is a Member of the Apache Software Foundation and a committer and PMC member on Apache Mahout, Apache OpenNLP, and Apache Streams. He has previously presented at Flink Forward, Hadoop Summit, Berlin Buzzwords, Machine Learning Conference, Big Data Tech Warsaw, and Apache Big Data. He’s based in Dulles, Virginia, in the Washington, DC metro area.

Jeff Zemerick

Jeff is an Apache OpenNLP committer and PMC member. He’s interested in all things NLP, cloud, and big data. Jeff is located outside of Pittsburgh, PA.