Embracing Diversity: Searching over Multiple Languages

Suneel Marthi & Jeff Zemerick • Haystack 2018

Although a lot of online content is written in English, there are many non-English-speaking users who still need to retrieve information. When searching, especially for tech-related topics, it is common to compose queries in English; however, such users may prefer search results written in their native language.

We’ll see how statistical machine translation tools can help in this scenario by translating text at query time, improving both recall and precision for search engine queries.

We’ll look at how cross-language information retrieval can be implemented on top of Apache Lucene with the help of the Apache Joshua machine translation toolkit.
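As a rough illustration of the query-time translation idea, here is a minimal Python sketch. It does not use Apache Lucene or Apache Joshua: the tiny in-memory inverted index and the word-for-word lookup table (`TOY_TRANSLATIONS`, `translate`, `build_index`, `cross_language_search`, and the sample `DOCS`) are all hypothetical stand-ins invented for this example, and a real statistical translation system would do far more than swap individual words.

```python
from collections import defaultdict

# Hypothetical word-for-word table standing in for a machine translation system.
TOY_TRANSLATIONS = {
    "es": {"search": "búsqueda", "engine": "motor"},
    "de": {"search": "suche", "engine": "maschine"},
}

# Hypothetical multilingual corpus; keys are made-up document ids.
DOCS = {
    "en-1": "open source search engine",
    "es-1": "motor de búsqueda de código abierto",
    "de-1": "freie suche für alle",
}


def translate(query, target_lang):
    """Translate a query token by token; unknown tokens pass through unchanged."""
    table = TOY_TRANSLATIONS[target_lang]
    return " ".join(table.get(tok, tok) for tok in query.lower().split())


def build_index(docs):
    """Build a toy inverted index mapping each token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in text.lower().split():
            index[tok].add(doc_id)
    return index


def cross_language_search(query, index, target_langs):
    """Expand the query with its translations, then OR all tokens against the index."""
    variants = [query.lower()] + [translate(query, lang) for lang in target_langs]
    hits = set()
    for variant in variants:
        for tok in variant.split():
            hits |= index.get(tok, set())
    return sorted(hits)


if __name__ == "__main__":
    index = build_index(DOCS)
    print(cross_language_search("search engine", index, ["es", "de"]))
```

In this sketch, the English query on its own matches only the English document, while the expanded query also reaches the Spanish and German ones. That is the recall gain the abstract refers to, shown here on toy data only.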

The audience will gain a better understanding of how to run search queries against a multilingual corpus indexed in Apache Lucene and retrieve all of the relevant search results across different languages.

Suneel Marthi

Suneel is a Member of the Apache Software Foundation and a committer and PMC member on Apache Mahout, Apache OpenNLP, and Apache Streams. He has presented at Flink Forward, Hadoop Summit, Berlin Buzzwords, Machine Learning Conference, Big Data Tech Warsaw, and Apache Big Data. He is based in Dulles, Virginia, in the Washington, DC metro area.

Jeff Zemerick

I am an experienced software engineer and entrepreneur. In 1999 I founded Zemerick Software, Inc. and made software to manage instant messaging for parents, businesses, and law enforcement. Zemerick Software grew from downloadable shareware to serving Fortune 500 enterprise customers. I was responsible for product development, marketing, and customer support.

My interest in natural language processing led me to create a product called Idyl E3 Entity Extraction Engine. Idyl E3 performs named-entity extraction in secure environments and is a product of Mountain Fog, Inc. I am an Apache OpenNLP committer and PMC member.