Haystack 2018

The Search Relevance Conference! Sponsored by OpenSource Connections

Slides for Haystack US 2018 Talks

Agenda - Day One
Agenda - Day Two

Track One	Track Two
9:30-10:15
The keystone Keynote Doug Turnbull
10:15-10:30
Coffee
10:30-11:15
Facets and Similarity - Exploring the Meta-Informational Hyperspace Ted Sullivan This talk will feature some of my recent research into the alternative uses for Solr facets and facet metadata. I will develop the idea that facets can be used to discover similarities between items and attributes in a search index, and show some interesting applications of this idea. A common takeaway is that using facets and facet metadata in non-conventional ways enables the semantic context of a query to be automatically tuned.	Search Quality: A Business-Friendly Perspective Peter Fries Have you ever been asked to fix a bad search experience? Have you ever been asked to predict the outcome of a change to your search algorithm? Fixing search problems can feel like a game of whack-a-mole -- fixing one set of queries breaks others. It's a frustrating game of guesswork and trade-offs. But you're responsible for the overall performance of a search system, and the powers that be want assurances.
11:20-12:05
Use customer behavior data and Machine Learning to improve search relevancy Chao Han Websites can now easily track and store user events such as queries, result clicks, and purchases. How can we use this collective behavior to guide us toward better search? In this talk, we will walk through several applications of this signal data analysis, from common use cases such as clickstream signal boosting and recommenders to intelligent NLP tasks such as spell checking, synonym detection, phrase finding, and query rewriting.	Algorithmic Extraction of Keywords, Concepts, and Vocabularies Max Irwin Keyword and vocabulary extraction and generation from corpora is an important area of information retrieval. Applicable to autosuggestions, synonym and term similarity, and natural language understanding, the available techniques are numerous and continuing to advance. Drawing on research from both academia and the field, this talk will give a technical deep dive into existing techniques and libraries...
12:05-1:00
Lunch
1:00-1:45
From clicks to models, the Wikimedia LTR pipeline Erik Bernhardson I will share a high-level overview of the pipeline that transforms user click behavior into labeled data and then LTR models for Wikimedia sites. Primarily this is: join web+search logs -> normalize queries -> group "same" normalized queries -> sample to some subset of grouped queries -> learn relevance -> collect feature data from plugin -> split/fold into sets for CV -> hyperparameter tuning -> deploy.	Embracing Diversity: Searching over Multiple Languages Suneel Marthi & Jeff Zemerick Although a lot of online content is written in English, there are many non-English-speaking users who still need to retrieve information. When searching, especially for tech-related topics, it's common to compose queries in English; however, such users may prefer search results written in their own native language. We'll see how statistical machine translation tools can help in the...
1:50-2:35
Phrase Query Completion with Apache Solr and SuggestComponent Tomasz Sobczak Query completion is one of the fundamental features of search. It is almost always the first interaction between the user and the search application. When the user types in a few characters, the auto-suggester should immediately offer relevant content. It improves search precision, acts as a discovery tool, and can increase conversion rates in the e-commerce world. Probably most will agree that the biggest challenge is to prepare the completer's...	Expert Customers: A Hybrid Approach to Clickstream Analytics Elizabeth Haubert Clickstream traffic gives your users a voice in evaluating your search relevancy, but it isn't perfect. Position bias, cold start, and other problems will bias evaluations based on user rankings. This talk will cover basic problems with using click traffic to evaluate your search results, and how to balance those problems with expert user feedback.
2:35-2:50
Coffee
3:00-4:30
Lightning Talks
6:30-?
Dinner at Kardinal Hall

Agenda - Day Two

Track One	Track Two
9:30-10:15
No, You Don't Want to Do It Like That! Stories from the search trenches Charlie Hull How about I run either a panel session or a fishbowl session (where panel members can be replaced with audience members). The general theme would be to tell stories of crazy search situations we've all found ourselves in, naming no names, and use these to illustrate why relevance tuning is hard, highly specific to the industry and situation you find yourself in, and often totally misunderstood by management, marketing etc.	Learning to Rank in an Hourly Job Marketplace Xun Wang As the largest online marketplace for hourly jobs in the US, Snagajob strives to connect millions of job seekers with part/full time, hourly and on-demand employment opportunities on a daily basis. Satisfactory fulfilment of this mission requires a two-way 'match' engine that can efficiently identify both the most suitable jobs for job seekers and the most qualified candidates for employers. In 2017, we reached an important...
10:15-10:30
Coffee
10:30-11:15
'A picture is worth a thousand words' - Approaches to search relevance scoring based on product data, including image recognition René Kriegler Many online shops around the world use open source technologies like Solr and Elasticsearch for their onsite search. However, the underlying relevance scoring algorithms were largely developed with enterprise search or general site search in mind. Relatively little is known about the specifics of search relevance scoring in e-commerce search, even from the perspective of information retrieval theory.	LexisNexis Learning to Rank Case Study Doug Rosenoff Using a large sample of user searches and the documents that were engaged from those searches, LexisNexis set out to predict and re-rank the top answers so that engaged documents would appear closer to the top of the Search Engine Result Page (SERP). Lexis applied a three-phase approach to collect and compile LETOR data to drive a Java LambdaMART implementation that re-ranked the top answers in...
11:20-12:05
Real-Time Entity Resolution with Elasticsearch Dave Moore 10,000 people share my full name. How can I search for information about me and not the other 9,999? Entity resolution gives you the power to disambiguate this search. It's a way to find different records about the same thing, while excluding similar records about different things. And it has an amazing ability to track the changes of an identity, like when you change your name or address.	A Vespa Tour Matt Overstreet An introduction to Vespa, the "open big data serving engine" developed by oath.com, formerly Yahoo!.
12:05-1:00
Lunch
1:00-1:45
The Solr Synonyms Maze: Pros, Cons, and Pitfalls of Various Synonyms Usage Patterns Bertrand Rigaldies The topic of synonyms in Solr has historically been a charged one, with ongoing and lively discussions within the search community about various aspects of synonyms usage such as index- vs query-time trade-offs, token graph challenges (a.k.a. "sausagization") with multi-term synonyms, and various parsed query outcomes, to name just a few examples. In this talk, we will review and summarize the current state of the "synonyms art" in Solr...	Understanding Queries with NER Ryan Pedela Named entity recognition (NER) can identify search terms as fields or enum values in an index such as brand and location. The identified fields and enum values can be used as additional relevance signals or filters. This talk will demonstrate how to improve relevance for structured or semi-structured data search using NER. It will include practical lessons learned when applied to real-world datasets such as match vs term queries, data formatting, and the minimum training set size.
1:50-2:35
Evolving a Medical Image Similarity Search Sujit Pal The talk covers the evolution of an Image Similarity Search Proof of Concept built to identify similar medical images. It discusses various image vectorizing techniques that were considered in order to convert images into searchable entities, an evaluation strategy to rank these techniques, as well as various indexing strategies to allow searching for similar images at scale.	The gentle art of incorporating "business rules" into results Scott Stults You've spent months tuning outstanding results by any objective measure, and now you've got to mess that all up so that some new product type shows up in every search. In this presentation we'll discuss ways to handle several situations like that. Almost as important, we'll factor the impact into our analytics so that we can optimize both IR metrics and whatever it was Marketing was trying to accomplish.
2:35-2:50
Coffee
2:50-3:35
The Relevance of Solr's Semantic Knowledge Graph Trey Grainger The Semantic Knowledge Graph is an Apache Solr plugin that can be used to discover and rank the relationships between any arbitrary queries or terms within the search index. It is a relevancy Swiss Army knife, able to discover related terms and concepts, disambiguate different meanings of terms given their context, clean up noise in datasets, discover previously unknown relationships between entities across documents and fields, rank lists of keywords...	Interleaving: from evaluation to self learning John T. Kane Search is on the move. After decades of relying on counting terms, we observe a paradigm shift. As defining relevancy has become increasingly difficult, with the presence of a multitude of diverse and sometimes intertwined signals, new challenges for search engine practitioners call for new ways of thinking about search. Machine learning is such a new way of thinking, and it has become increasingly popular for solving search problems.
3:40-4:25
Catch My Drift? - Building bridges with Word Embeddings Peter Dixon-Moses Word Embeddings have been providing the semantic glue in major search engines for the past several years (word2vec, GloVe, fastText, and Gensim have received a lot of press). This talk will provide concrete resources and examples to help you bootstrap distributional semantics into your search solutions and start improving recall for your long-tail queries.	Bad Text, Bad Search: Evaluating Text Extraction with Apache Tika's tika-eval Module Tim Allison Reliable text extraction is essential for search. Without reliable text, users cannot find the content they need, and advanced relevance tuning is worthless. Apache Tika™ is a crucial component in a wide variety of search engines and content management systems, including Solr™, Nutch™, Alfresco, Elasticsearch and Sleuth Kit®/Autopsy®.

Slides for Haystack US 2018 Talks

The keystone Keynote

Facets and Similarity - Exploring the Meta-Informational Hyperspace

Search Quality: A Business-Friendly Perspective

Use customer behavior data and Machine Learning to improve search relevancy

Algorithmic Extraction of Keywords, Concepts, and Vocabularies

From clicks to models, the Wikimedia LTR pipeline

Embracing Diversity: Searching over Multiple Languages

Phrase Query Completion with Apache Solr and SuggestComponent

Expert Customers: A Hybrid Approach to Clickstream Analytics

No, You Don't Want to Do It Like That! Stories from the search trenches

Learning to Rank in an Hourly Job Marketplace

'A picture is worth a thousand words' - Approaches to search relevance scoring based on product data, including image recognition

LexisNexis Learning to Rank Case Study

Real-Time Entity Resolution with Elasticsearch

A Vespa Tour

The Solr Synonyms Maze: Pros, Cons, and Pitfalls of Various Synonyms Usage Patterns

Understanding Queries with NER

Evolving a Medical Image Similarity Search

The gentle art of incorporating "business rules" into results

The Relevance of Solr's Semantic Knowledge Graph

Interleaving: from evaluation to self learning

Catch My Drift? - Building bridges with Word Embeddings

Bad Text, Bad Search: Evaluating Text Extraction with Apache Tika's tika-eval Module
