Haystack 2018

The Search Relevance Conference! Sponsored by OpenSource Connections

Slides for Haystack US 2018 Talks


Track One Track Two
9:30-10:15
The keystone Keynote
Doug Turnbull
10:15-10:30
Coffee
10:30-11:15
Facets and Similarity - Exploring the Meta-Informational Hyperspace
Ted Sullivan

This talk will feature some of my recent research into the alternative uses for Solr facets and facet metadata. I will develop the idea that facets can be used to discover similarities between items and attributes in a search index, and show some interesting applications of this idea. A common takeaway is that using facets and facet metadata in non-conventional ways enables the semantic context of a query to be automatically tuned. More »

Search Quality: A Business-Friendly Perspective
Peter Fries

Have you ever been asked to fix a bad search experience? Have you ever been asked to predict the outcome of a change to your search algorithm? Fixing search problems can feel like a game of whack-a-mole -- fixing one set of queries breaks others. It's a frustrating game of guesswork and trade-offs. But you're responsible for the overall performance of a search system, and the powers that be want assurances. More »

11:20-12:05
Use customer behavior data and Machine Learning to improve search relevancy
Chao Han

Websites can now easily track and store user events such as queries, result clicks, and purchases. How can we use this collective behavior to guide us toward better search? In this talk, we will walk through several applications of signal data analysis, from common use cases such as clickstream signal boosting and recommenders to intelligent NLP tasks such as spell checking, synonym detection, phrase finding, and query rewriting. More »

Algorithmic Extraction of Keywords, Concepts, and Vocabularies
Max Irwin

Keyword and vocabulary extraction and generation from corpora is an important area of information retrieval. Applicable to autosuggestions, synonym and term similarity, and natural language understanding, the available techniques are numerous and continuing to advance. Drawing on research from both academia and the field, this talk will give a technical deep dive into existing techniques and libraries... More »

12:05-1:00
Lunch
1:00-1:45
From clicks to models, the Wikimedia LTR pipeline
Erik Bernhardson

This talk will share a high-level overview of the pipeline that transforms user click behavior into labeled data and then into LTR models for Wikimedia sites. Primarily, the steps are: join web+search logs -> normalize queries -> group "same" normalized queries -> sample to some subset of grouped queries -> learn relevance -> collect feature data from plugin -> split/fold into sets for CV -> hyperparameter tuning -> deploy. More »
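
As a rough illustration of the early steps listed above (normalize queries, group the "same" normalized queries, sample a subset), here is a minimal Python sketch. The normalization rule, the function names, and the sample log rows are all assumptions made for illustration; this is not the Wikimedia pipeline code.

```python
import random
import re
from collections import defaultdict


def normalize_query(query: str) -> str:
    """Hypothetical normalizer: lowercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", query.lower())
    return " ".join(cleaned.split())


def group_queries(log_rows):
    """Group (query, clicked_doc) rows under their normalized query string."""
    groups = defaultdict(list)
    for query, clicked_doc in log_rows:
        groups[normalize_query(query)].append(clicked_doc)
    return dict(groups)


def sample_groups(groups, k, seed=0):
    """Pick at most k query groups at random (seeded so runs are repeatable)."""
    rng = random.Random(seed)
    keys = sorted(groups)
    chosen = rng.sample(keys, min(k, len(keys)))
    return {key: groups[key] for key in chosen}


# Made-up log rows: (raw query, clicked document id).
log_rows = [
    ("Apache Solr", "doc1"),
    ("apache  solr!", "doc2"),
    ("Elasticsearch", "doc3"),
]

groups = group_queries(log_rows)
subset = sample_groups(groups, k=1)
```

In this toy data the first two rows collapse into one group, so their clicks can be aggregated before any relevance labels are learned.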

Embracing Diversity: Searching over Multiple Languages
Suneel Marthi & Jeff Zemerick

Although a lot of online content is written in English, there are many non-English-speaking users who still need to retrieve information. When searching, especially for tech-related topics, it’s common to compose queries in English; however, such users may prefer search results written in their native language. We’ll see how statistical machine translation tools can help in the... More »

1:50-2:35
Phrase Query Completion with Apache Solr and SuggestComponent
Tomasz Sobczak

Query completion is one of the fundamental features of search. It is almost always the first interaction between the user and the search application. When the user types in a few characters, the auto-suggester should immediately offer relevant content. It improves search precision, acts as a discovery tool, and can increase conversion rates in the e-commerce world. Probably most will agree that the biggest challenge is to prepare the completer's... More »

Expert Customers: A Hybrid Approach to Clickstream Analytics
Elizabeth Haubert

Click stream traffic gives your users a voice in evaluating your search relevancy, but it isn't perfect. Position bias, cold start, and other problems will bias evaluations based on user rankings. This talk will cover basic problems with using click traffic to evaluate your search results, and how to balance those problems with expert user feedback. More »

2:35-2:50
Coffee
3:00-4:30
Lightning Talks
6:30-?
Dinner at Kardinal Hall
Track One Track Two
9:30-10:15
No, You Don't Want to Do It Like That! Stories from the search trenches
Charlie Hull

How about I run either a panel session or a fishbowl session (where panel members can be replaced with audience members). The general theme would be to tell stories of crazy search situations we've all found ourselves in, naming no names, and use these to illustrate why relevance tuning is hard, highly specific to the industry and situation you find yourself in, and often totally misunderstood by management, marketing etc. More »

Learning to Rank in an Hourly Job Marketplace
Xun Wang

As the largest online marketplace for hourly jobs in the US, Snagajob strives to connect millions of job seekers with part/full time, hourly and on-demand employment opportunities on a daily basis. Satisfactory fulfilment of this mission requires a two-way 'match' engine that can efficiently identify both the most suitable jobs for job seekers and the most qualified candidates for employers. In 2017, we reached an important... More »

10:15-10:30
Coffee
10:30-11:15
'A picture is worth a thousand words' - Approaches to search relevance scoring based on product data, including image recognition
René Kriegler

Many online shops around the world use open source technologies like Solr and Elasticsearch for their onsite search. However, the underlying relevance scoring algorithms were largely developed with enterprise search or general site search in mind. Relatively little is known about the specifics of search relevance scoring in e-commerce search, even from the perspective of information retrieval theory. More »

LexisNexis Learning to Rank Case Study
Doug Rosenoff

Using a large sample of user searches and the documents that were engaged from those searches, LexisNexis set out to predict and re-rank the top answers so that engaged documents would appear closer to the top of the Search Engine Result Page (SERP). Lexis applied a three-phased approach to collect and compile LETOR data to drive a Java LambdaMART implementation that re-ranked the top answers in... More »

11:20-12:05
Real-Time Entity Resolution with Elasticsearch
Dave Moore

10,000 people share my full name. How can I search for information about me and not the other 9,999? Entity resolution gives you the power to disambiguate this search. It's a way to find different records about the same thing, while excluding similar records about different things. And it has an amazing ability to track the changes of an identity, like when you change your name or address. More »

A Vespa Tour
Matt Overstreet

An introduction to Vespa, the "open big data serving engine" developed by oath.com, formerly Yahoo!. More »

12:05-1:00
Lunch
1:00-1:45
The Solr Synonyms Maze: Pros, Cons, and Pitfalls of Various Synonyms Usage Patterns
Bertrand Rigaldies

The topic of synonyms in Solr has historically been a charged one, with ongoing and lively discussions within the search community about various aspects of synonym usage, such as index- vs query-time trade-offs, token graph challenges (a.k.a. "sausagization") with multi-term synonyms, and various parsed query outcomes, to name just a few examples. In this talk, we will review and summarize the current state of the "synonyms art" in Solr... More »

Understanding Queries with NER
Ryan Pedela

Named entity recognition (NER) can identify search terms as fields or enum values in an index such as brand and location. The identified fields and enum values can be used as additional relevance signals or filters. This talk will demonstrate how to improve relevance for structured or semi-structured data search using NER. It will include practical lessons learned when applied to real-world datasets such as match vs term queries, data formatting, and the minimum training set size. More »

1:50-2:35
Evolving a Medical Image Similarity Search
Sujit Pal

The talk covers the evolution of an Image Similarity Search Proof of Concept built to identify similar medical images. It discusses various image vectorizing techniques that were considered in order to convert images into searchable entities, an evaluation strategy to rank these techniques, as well as various indexing strategies to allow searching for similar images at scale. More »

The gentle art of incorporating "business rules" into results
Scott Stults

You've spent months tuning outstanding results by any objective measure, and now you've got to mess that all up so that some new product type shows up in every search. In this presentation we'll discuss ways to handle several situations like that. Almost as important, we'll factor the impact into our analytics so that we can optimize both IR metrics and whatever it was Marketing was trying to accomplish. More »

2:35-2:50
Coffee
2:50-3:35
The Relevance of Solr's Semantic Knowledge Graph
Trey Grainger

The Semantic Knowledge Graph is an Apache Solr plugin that can be used to discover and rank the relationships between any arbitrary queries or terms within the search index. It is a relevancy Swiss Army knife, able to discover related terms and concepts, disambiguate different meanings of terms given their context, clean up noise in datasets, discover previously unknown relationships between entities across documents and fields, rank lists of keywords... More »

Interleaving: from evaluation to self learning
John T. Kane

Search is on the move. After decades of relying on counting terms, we observe a paradigm shift. As defining relevancy has become increasingly difficult, with the presence of a multitude of diverse and sometimes intertwined signals, new challenges for search engine practitioners call for new ways of thinking about search. Machine learning is such a new way of thinking, and it has become increasingly popular for solving search problems. More »

3:40-4:25
Catch My Drift? - Building bridges with Word Embeddings
Peter Dixon-Moses

Word Embeddings have been providing the semantic glue in major search engines for the past several years (word2vec, GloVe, fastText, and Gensim have received a lot of press). This talk will provide concrete resources and examples to help you bootstrap distributional semantics into your search solutions and start improving recall for your long-tail queries. More »
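
To make the recall idea above concrete, here is a minimal Python sketch of embedding-based query expansion: a query term is expanded with its nearest neighbors by cosine similarity. The three-dimensional vectors, the `expand_query` helper, and the similarity threshold are invented for illustration; a real system would load vectors trained with a tool such as word2vec or fastText, and this is not taken from the talk itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(term, vectors, top_n=2, min_sim=0.8):
    """Return up to top_n terms whose vectors lie close to the query term's."""
    if term not in vectors:
        return []
    scored = [
        (other, cosine(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    # Keep only sufficiently similar terms, most similar first.
    scored = [(other, sim) for other, sim in scored if sim >= min_sim]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]


# Toy vectors (assumed values, not trained embeddings).
TOY_VECTORS = {
    "laptop": [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

expansions = expand_query("laptop", TOY_VECTORS)
```

With these toy values, "notebook" clears the threshold for "laptop" while "banana" does not, so a long-tail query containing "laptop" could also match documents that only say "notebook".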

Bad Text, Bad Search: Evaluating Text Extraction with Apache Tika's tika-eval Module
Tim Allison

Reliable text extraction is essential for search. Without reliable text, users cannot find the content they need, and advanced relevance tuning is worthless. Apache Tika™ is a crucial component in a wide variety of search engines and content management systems, including Solr™, Nutch™, Alfresco, Elasticsearch and Sleuth Kit®/Autopsy®. More »