Haystack US 2022
Talks from the Search & Relevance Community at the Haystack Conference!
The conference sessions were held at the Violet Crown movie theater in central Charlottesville.
This was our Event Safety and Code of Conduct.
Day 1, Wednesday, April 27th, 2022
Time | Track 1 | Track 2 |
---|---|---|
8:00-9:00am EDT |
Registration
Location: Entrance of the Violet Crown |
|
9:00-9:45am EDT | Opening Keynote - Magicians & Jugglers: Closed & Open
Charlie Hull of OpenSource Connections opens the Haystack US 2022 search relevance conference with a talk on magicians, jugglers and how sharing in the open can empower search teams as they explore the new frontiers of search. Charlie Hull
|
|
10:00-10:45am EDT | Learning a Joint Embedding Representation for Image Search using Self Supervised Means
Image search interfaces prompt the searcher to provide either a search image (image-to-image search) or a text description of the image (text-to-image search). Image-to-image search is generally implemented as a nearest-neighbor search in a dense image embedding space, where the embedding is derived from neural networks pre-trained on a large image corpus such as ImageNet. Text-to-image search can be implemented via traditional (TF/IDF or BM25 based) text search against image captions or image tags. Sujit Pal
|
An approach to modelling implicit user feedback for optimising e-commerce search
More than other domains, e-commerce search depends on implicit user feedback to optimise search result ranking as buying decision criteria such as ‘an attractive price’ and ‘brand sympathy’ are very hard to make explicit. On the other hand, this decision making can be observed implicitly in web tracking. René Kriegler
|
11:00-11:45am EDT | Beyond precision and recall – ensuring 'aboutness' in topical classification using confidence scores
Taxonomies play an important role in many LexisNexis products, allowing our customers to run searches using predefined topics either as pre- or post-filters. Because our topical classification is automated, results can vary widely in relevance, from the very relevant to the more marginal. We need to ensure that documents containing broad and deep discussion of a particular topic surface at the top of the results list. This presentation will demonstrate how stamping Confidence Scores into documents, in addition to a topic code, is crucial to achieving this goal of ‘aboutness’. It will cover experiments that ‘boost’ or ‘re-rank’ using the confidence scores and the internal tools used to measure resulting improvements in relevance. The presentation will outline the methods, both machine learning and rule-based, used to develop a confidence score with a consistent meaning across content types and underlying classification technologies. Mark Shewhart
& Sophie Lagace
& Kimberly Hoffbauer
|
Scalable Semantic Search for Online Learning Applications
Semantic search is one of Course Hero's key products, where a student can type a course question and get an answer from hundreds of millions of answered questions. The sentence embedding model sits at the core of semantic search: we use it to generate a vector index for questions. One challenge is to pick the best embedding model in terms of accuracy and running time. We have designed an evaluation framework that uses Quora duplicate questions and Faiss similarity search. Using this framework, we found that a few of the pre-trained Sentence-BERT models outperform the Universal Sentence Encoder. This led us to run an A/B experiment, which showed that Sentence-BERT could improve search coverage rate by 35%. Next, to improve semantic search performance further, we have started fine-tuning the Sentence-BERT models with our search engagement data. We will present some of our findings and the challenges that we have encountered while working on the semantic search problem. Kazem Jahanbakhsh
|
11:45am-1:00pm EDT |
Lunch on your own
Find lunch at one of the many options available on Charlottesville’s Downtown Mall. Location: Your choice! |
|
1:15-2:00pm EDT | Bayesian Optimization of Relevance at Shopify
Recently, Bayesian optimization of a simple Elasticsearch query has been shown to deliver the best non-neural relevance on the MSMarco Ranking Task (https://www.elastic.co/blog/improving-search-relevance-with-data-driven-query-optimization). For this reason, at Shopify, we’ve adopted Bayesian optimization as a core part of our relevance experimentation workflow. Doug Turnbull
& Andy Toulis
|
Big Vector Search - The Billion-Scale Approximate Nearest Neighbor Challenge
Despite the broad range of algorithms for Approximate Nearest Neighbor vector search, most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points. However, deploying recent advances in embedding-based techniques for search, recommendation and ranking at scale requires ANNS indices at billion, trillion or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale vis-à-vis their hardware cost. George Williams
|
2:15-3:00pm EDT |
Search Radar Brainstorm
We will brainstorm updates to our Search Radar created at last year’s Haystack. Location: Theater 5 |
|
3:15-4:45pm EDT |
Lightning Talks
Quick discussions about anything around search relevance! Location: Theater 5 |
|
5:30-6:30pm EDT |
Haystack Reception (included with registration)
All attendees are welcome. The reception is at Kardinal Hall, about a 10-minute walk from the conference venue. Location: Kardinal Hall |
|
6:30-8:00pm EDT |
Dinner (included with registration)
All attendees are welcome. Dinner is at Kardinal Hall, about a 10-minute walk from the conference venue. Location: Kardinal Hall |
Day 2, Thursday, April 28th, 2022
Time | Track 1 | Track 2 |
---|---|---|
8:00-9:00am EDT |
Coffee
Location: Entrance of the Violet Crown |
|
9:00-9:45am EDT | Personalized Search - Building a prototype to infer the user's interest
In the world of Search, understanding the intent of the user is often seen as the holy grail. When a user performs multiple search and click actions while having a conversation with the search engine, this behavior reveals a piece of their interest. A search engine that is aware of the user's interest can add a personal layer to its responses, which could add a new dimension of accuracy and value to a search implementation. Tom Burgmans
|
Searching through large graphs using Elasticsearch
The National Audiovisual Institute (INA) is a repository of all French audiovisual archives and has been responsible for archiving over 180 radio and television services, 24/7, since 1995. The generated metadata describing this content represents the equivalent of over 50 million documents (images, audio and video fragments, text excerpts). Due to the heterogeneity of the content, the data model is directly inspired by the conceptual models of cultural heritage, represented by a large graph with complex relations between generic entities. The challenge in building a global search engine for this particular use case is twofold: indexing speed and the implementation of complex full-text search capabilities with high performance. Our talk describes the key choices for the graph representation, which facilitate the indexing of the documents, as well as the technical framework set up around Elasticsearch, which implements the dedicated search APIs required by different functional areas. Radu Pop
|
10:00-10:45am EDT | AI Driven Search
Nextdoor is a diverse multi-sided marketplace where, on one side, we have neighbors with highly diverse sets of intents. Their interests are explicitly expressed in the search bar and form quite a unique demand vector. People are searching for local neighbors to connect with, or trying to find classifieds or recommendations for their next pet projects, just to name a few. Bojan Babic
|
Engagement DCG vs Subject Matter Expert DCG - Evaluating the Wisdom of the Crowd
Evaluating the relevance of a search engine result using Discounted Cumulative Gain (DCG) is a common way of quantifying query-document relevance precision. DCG may be computed using a customer engagement method or a subject matter expert (SME) evaluation method. It is a frequent but untested assumption that the results from these two methods of DCG computation are similar in size and correlate well. The difficulties involved with performing a comparison study have prevented rigorous testing of this assumption. Doug Rosenoff
|
11:00-11:45am EDT |
Search Radar Apply
We will apply our brainstorming from yesterday’s session to update the Search Radar. Join us for a hands-on event where we want your input! Location: Theater 5 |
|
11:45am-1:15pm EDT |
Lunch on your own
Find lunch at one of the many options available on Charlottesville’s Downtown Mall. Location: Your choice! |
|
1:15-2:15pm EDT | Women in Search Panel
Women in tech are noticeably underrepresented, and women in the search space are even rarer. Join us for a multi-perspective panel discussion featuring women working in the Search field. We will talk about career development, breaking into Search, gendered experiences in the workplace, and more! The goal of this panel is to empower women, encourage their allies, and show that Search is a welcoming field that needs diverse perspectives to thrive. Led by Audrey Lorberfeld with panelists Chen Karako, Jess Peck, Ellen Voorhees, and Julie Tibshirani
|
|
2:30-3:15pm EDT | Building Retrieval Test Collections
Information retrieval test collections (benchmark search tasks consisting of a corpus, a query set, relevance judgments, and associated evaluation metrics) are foundational infrastructure for off-line evaluation of search systems. High-quality test collections accelerate development of effective search algorithms and facilitate technology transfer, but building large-scale, representative test collections is challenging. The Text REtrieval Conference (TREC, trec.nist.gov) has built test collections for a variety of search tasks over the past thirty years using different techniques as task and budget required. Recent examination of some TREC collections shows that they have withstood the test of time, but others have weaknesses that are hard to detect. This talk will recap lessons learned from building dozens of test collections that suggest best practices for building your own collection for your own problem. Ellen Voorhees
|
|
3:30-4:15pm EDT | AI based approaches to improve data quality before indexing
It is well known that good data makes great search experiences. Or, put less positively: garbage in, garbage out. AI-powered searches usually focus on the search itself and on improving relevance on top of an already existing index. In this talk we will focus on data ingestion: optimizations and improvements that can be made by AI and machine learning algorithms to improve data quality prior to indexing it. Some examples are: enriching data with automatic categorization, improving OCR translations, improving media file transcriptions, and improving crawling and web page parsing. All of these are considered in the context of data for search engines, a use case that induces or allows some specific optimizations. Lucian Precup
|
OpenSearch - Ecommerce Search & Discovery Platform - Powered by Querqy
Create a personalization platform for e-commerce Search & Discovery experiences that your customers and developers will love. It is powered by Querqy, an umbrella for open source tools and libraries that help you create a powerful e-commerce search platform quickly. The focus is on optimizing search relevance from day one, beyond the out-of-the-box capabilities of the OpenSearch engine. This also includes a powerful UI tool for managing onsite search keywords and queries. It provides an OpenSearch Dashboards interface for maintaining and deploying Querqy Rules. OpenSearch - Ecommerce platform helps you increase conversions, enable typo tolerance and synonyms, and add advanced, dynamic filters to the shopping experience. Unfortunately no video is available for this talk. Anirudha Jadhav
& Pratik Shenoy
& Dr. Johannes Peter
|
4:15-4:30pm EDT |
Closing
Location: Theater 5 |