Question Answering as Search - the Anserini Pipeline and Other Stories

Sujit Pal • Location: Theater 4 • Haystack 2020

In the last couple of years, we have seen enormous breakthroughs in automated Open Domain Restricted Context Question Answering, also known as Reading Comprehension, where the task is to find the answer to a question within a single document or paragraph. A potentially more useful task is to find the answer to a question in a corpus representing an entire body of knowledge, also known as Open Domain Open Context Question Answering. To do this, we adapted the BERTserini architecture (Yang et al., 2019), using it to answer questions about clinical content from our corpus of 5000+ medical textbooks.

The BERTserini pipeline consists of two components: a BERT model fine-tuned for Question Answering, and an Anserini (Yang, Fang, and Lin, 2017) IR pipeline for Passage Retrieval. Anserini, in turn, consists of pluggable components for different kinds of query expansion and result reranking. Given a question, Anserini retrieves candidate passages, from which the BERT model extracts candidate answers. The best answer is determined using a combination of passage retrieval and answer scores.

We adapted the BERT Question Answering component to our content using a combination of fine-tuning on third-party SQuAD data and pre-training the model on our medical content, and evaluated the system using a locally developed dataset of medical passages, questions, and answers. However, when we replaced the canned passages with passages retrieved by the Anserini pipeline, performance dropped significantly, indicating that the relevance of the retrieved passages was a limiting factor. The presentation will describe the actions taken to improve the relevance of passages returned by the Anserini pipeline.
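The answer-selection step described above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the candidate tuples and the weight `mu` are made up for the example, and the linear interpolation of the BERT answer score with the Anserini retrieval score follows the form described in the BERTserini paper (Yang et al., 2019).

```python
# Hypothetical sketch of BERTserini-style answer selection.
# For each candidate passage, Anserini supplies a retrieval score and a
# BERT QA model supplies an extracted answer span with an answer score;
# the final answer maximizes a linear interpolation of the two.

def select_answer(candidates, mu=0.5):
    """Pick the best answer from (answer, qa_score, retrieval_score) tuples.

    Combined score: S = (1 - mu) * qa_score + mu * retrieval_score,
    where mu weights passage relevance against answer confidence.
    """
    best = max(candidates, key=lambda c: (1 - mu) * c[1] + mu * c[2])
    return best[0]

# Toy candidates: (answer text, BERT answer score, Anserini passage score)
candidates = [
    ("aspirin", 0.9, 0.2),    # confident answer from a weak passage
    ("ibuprofen", 0.6, 0.8),  # weaker answer from a highly relevant passage
]
print(select_answer(candidates, mu=0.5))  # prints "ibuprofen" (0.7 vs 0.55)
```

With `mu=0.0` the selection degenerates to trusting the QA model alone, which is why poor passage relevance (the limiting factor noted above) drags down end-to-end accuracy no matter how well the reader is tuned.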

Sujit Pal

Elsevier Labs

Sujit Pal works at Elsevier Labs, an advanced technology group within Elsevier. His introduction to search was as part of a team at CNET Networks implementing Atomics, a MySQL-backed custom query engine built to replace their legacy AltaVista search engine. He was also one of the very early users of Solr within the company, before it became an Apache project. He later joined Healthline, an ontology-backed medical search company, where he was introduced to concept search. At Healthline, he made multiple improvements to their search platform, including moving it from Lucene to Solr, and introduced various features that combined emergent features of Solr with Natural Language Processing and Machine Learning technologies. At Elsevier, he has worked on large-scale search quality measurement, Image Search, and most recently Question Answering.