Haystack 2021 Schedule

Talks from the Search & Relevance Community at the Haystack Conference!

Wednesday, September 29th, 2021

All times listed are EDT.

Time Track 1
8:00-9:00am EDT

Registration

Location: Conference Room

9:15-10:00am EDT

Opening Keynote

Peter Morville - Location: Conference Room and Online

Peter Morville is a pioneer of the fields of information architecture and user experience. His bestselling books include Information Architecture (the "polar bear book"), Intertwingled, Search Patterns, and Ambient Findability. He has been helping people to plan websites and software since 1994, and advises such clients as AT&T, Cisco, Harvard, IBM, the Library of Congress, Macy’s, the National Cancer Institute, and Vodafone. He has delivered conference keynotes and workshops in North America, South America, Europe, Asia, and Australia. His work has been covered by Business Week, NPR, The Economist, The Washington Post, and The Wall Street Journal. His latest book is Planning for Everything.

10:00-10:45am EDT

Script Scores and back again - A tale of merchandising algorithms in Elasticsearch

This talk is about the flexibility and performance of Painless scripting in Elasticsearch. That flexibility brought big wins to the business at SimpleTire, and it continues to do so by being easily adaptable as new merchandising insights are discovered. This story has 4 chapters: 1) Inheriting a multi-objective ranking model that was built via an ETL job and stored in SQL for index-time inclusion. 2) Being tasked with migrating the algorithm into Elasticsearch for real-time scoring. 3) Solving it with Painless scripting for maximum flexibility and extensibility in the future. 4) Long-term wins: easier A/B testing, weighting adjustments, bottom-line value, and rescoring for personalization. This talk is generally valuable to any e-commerce team because it walks through the process of converting business logic into search logic, all with an eye on improving the bottom line.
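As a rough illustration of the kind of query-time scoring the abstract describes, the sketch below builds an Elasticsearch `script_score` request body with a Painless script. It is not the speaker's implementation: the field names (`title`, `margin_score`, `inventory_score`), the weights, and the scoring formula are all hypothetical, chosen only to show how merchandising weights can be passed as script parameters and adjusted without reindexing.

```python
import json


def merchandising_query(search_text, weights):
    """Build a script_score query body that blends relevance with business signals."""
    # Painless source: scale the text relevance score by a weighted sum of
    # per-document merchandising fields. All field names are hypothetical.
    source = (
        "_score * (1.0"
        " + params.w_margin * doc['margin_score'].value"
        " + params.w_inventory * doc['inventory_score'].value)"
    )
    return {
        "query": {
            "script_score": {
                # The inner query supplies _score from ordinary text matching.
                "query": {"match": {"title": search_text}},
                # Weights travel as params, so they can change per request
                # (for example, one set of weights per A/B test arm).
                "script": {"source": source, "params": weights},
            }
        }
    }


body = merchandising_query("all season tires", {"w_margin": 0.3, "w_inventory": 0.1})
print(json.dumps(body, indent=2))
```

Because the weights are request parameters rather than indexed values, changing them is a query change only, which is one plausible reading of the "weighting adjustments" win mentioned above.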

Nate Day - Location: Conference Room and Online

Nathan has always been interested in how technology can improve the human experience. He was drawn to study biotechnology in college, where he worked to research protein localization in plants and to build open-source software to model bacteriophage genomes. After graduating, he joined a start-up drug development company focused on building better translational disease models. There he worked on the technical challenges of building multi-dimensional tissue systems and helped successfully develop first-in-class tumor and liver disease systems. From there he joined the computational biology group, where he focused on building reproducible analysis pipelines and front-ends for exploring high-dimensional data. He is excited to bring his R&D experience and analytics expertise to the challenging problems of search relevance.

11:00-11:45am EDT

The web search bootstrapping problem

In this talk, we will review how the recent breakthroughs in AI are exploited to create search engines based entirely on AI-generated data, thus eliminating the need to collect users’ data and solving the cold-start problem. In particular, we will focus on: 1. How can we generate search queries that are almost identical to those of real users? 2. How can we exploit the generated queries to predict user intent?

Roi Krakovski - Location: Conference Room and Online

Roi has a Ph.D. in computer science from Ben-Gurion University. He worked as a machine learning and big data researcher at IBM Research. Roi founded Usearch in 2018, to bring the recent breakthroughs in Neuroscience into the world of internet search.

11:45am-12:30pm EDT

Semantic Product Search – Vector Search for E-Commerce

Information retrieval today is undergoing a paradigm shift, away from the prevailing techniques of the past few decades. Increasingly the focus is moving away from keyword and entity driven search and the inverted index that supports those approaches, in favor of more complex models supported by dense instead of sparse index structures. Neural IR models for retrieval and ranking are becoming increasingly popular but building and scaling these systems presents many challenges. In this talk we present an overview of the current state of Neural IR from the perspective of a large e-commerce company. Among the topics covered will be extracting signals from the clickstream, transformer models and ‘do you need one?’, augmenting the inverted index by predicting keywords, and hard negative mining. We will also cover an exciting new research area, framing semantic search as an Extreme Multi-Label Classification problem, and why the future of semantic search may lie in machine-learned indexes.

Simon Hughes - Location: Conference Room and Online

Simon has a PhD in Computer Science from DePaul with a concentration on NLP and machine learning, and has over 8 years of experience working as a data scientist and 15 years of experience working within software development. He has worked on multiple search and recommender engines, including building job and resume search engines for Dice.com as well as working on Home Depot’s e-commerce search platform in his current role. He is currently the Principal Data Scientist on the core search team, leading initiatives to improve overall relevancy and conversion on their search platform. He is also co-author of 3 SIGIR papers published during his time at Home Depot, and 11 papers on applying AI for educational purposes, and has given many industry talks over the years on semantic search and relevancy tuning.

12:30pm-1:30pm EDT

Find lunch at one of the many options available on Charlottesville's Downtown Mall

1:30-2:15pm EDT

The Text REtrieval Conference (TREC)

The TREC project at the National Institute of Standards and Technology has created standard test sets and evaluation methodology to support the development of methods for content-based access to material structured for human consumption since 1992. Starting with (massive-for-the-time) two gigabytes of newswire text and progressing to web-scale data collections, TREC has examined a variety of tasks including question answering, retrieving digital video, web search, legal discovery, secondary use of electronic health records, and sentiment analysis in blogs and tweets. TREC's "coopetition" paradigm emphasizes individual experiments evaluated on a benchmark task. This has had three major impacts: improved effectiveness of information access algorithms; cross-fertilization of ideas across research groups with the eventual transfer of technology into products; and the formation of new research areas enabled by the construction of critical infrastructure.

Ellen Voorhees - Location: Conference Room and Online

Ellen Voorhees is a Fellow at the US National Institute of Standards and Technology (NIST). Her primary responsibility at NIST is to manage the Text REtrieval Conference (TREC, http://trec.nist.gov/) project, a long-running NIST program to build the infrastructure required to do large-scale evaluation of search systems. Her research focuses on developing and validating appropriate evaluation schemes to measure system effectiveness for diverse user tasks. Voorhees is a fellow of the ACM, a member of the ACM SIGIR Academy, and has been elected as a fellow of the Washington Academy of Sciences.

2:15-3:00pm EDT

Applying User Signals like a Relevance Engineering Ninja

User signals (clicks, purchases, etc.) are among the most useful inputs for improving search relevance. They can be used to directly optimize your head queries (signals boosting), to personalize search results, to learn domain-specific terminology (misspellings, synonyms, etc.), or to build click models as training data for automated Learning to Rank. Most organizations struggle to properly store their signals, let alone best utilize them to optimize relevance. In this talk, you’ll learn best practices for collecting, processing, and applying signals to enhance relevance. We’ll cover live code examples of index- and query-time signals boosting, fighting signal spam and bias, and applying quality- and time-based weights to your models. We'll show the various kinds of personalization and click models you can train from signals to improve ranking. You'll come away from this talk with some new tools in your relevance engineering toolbox, and some open-source code examples to get started!
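To make the idea of signals boosting with quality- and time-based weights concrete, here is a minimal sketch. It is not the speaker's code: the signal log, the per-type weights, and the 30-day half-life are all assumed values, and the exponential decay is just one common way to apply a time-based weight.

```python
from collections import defaultdict

# Hypothetical signal log: (query, doc_id, signal_type, age_in_days).
signals = [
    ("ipad", "doc_1", "click", 1),
    ("ipad", "doc_1", "purchase", 30),
    ("ipad", "doc_2", "click", 2),
    ("ipad", "doc_2", "click", 60),
]

# Assumed quality weights: a purchase counts for more than a click.
TYPE_WEIGHTS = {"click": 1.0, "purchase": 5.0}
# Assumed half-life: a signal loses half its weight every 30 days.
HALF_LIFE_DAYS = 30.0


def signal_boosts(signal_log):
    """Aggregate weighted, time-decayed signals per (query, document)."""
    boosts = defaultdict(lambda: defaultdict(float))
    for query, doc_id, signal_type, age_days in signal_log:
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
        boosts[query][doc_id] += TYPE_WEIGHTS[signal_type] * decay
    return boosts


boosts = signal_boosts(signals)
# Documents for the query "ipad", strongest boost first.
ranked = sorted(boosts["ipad"], key=boosts["ipad"].get, reverse=True)
print(ranked)
```

The resulting per-query boosts could then be applied at index time (stored with the document) or at query time (added as boost clauses), the two options the abstract mentions.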

Trey Grainger - Location: Conference Room and Online

Trey Grainger is CTO at Presearch, the decentralized web search engine, and is the Founder of Searchkernel, a startup helping clients build next-generation, intelligent search applications. He is the author of the books AI-Powered Search and Solr in Action, and is the former Chief Algorithms Officer and SVP of Engineering at Lucidworks, a leading AI-powered search company. He studied information retrieval and web search at Stanford University, received his Master's in Management of Technology from Georgia Tech, and received his Bachelor's degree from Furman University in Computer Science, Business, and Philosophy.

3:15-4:00pm EDT

Learning to Boost - Logistic Regression to Optimize Elasticsearch Boosts

Choosing field boost values can make or break your Elasticsearch query. One popular data-driven approach to identify the relative importance of fields is Learning to Rank. However, LTR typically requires fitting a complex Machine Learning model and incorporating a separate plugin or service to implement it in production. Beyond manual tuning or grid search, is there a middle ground that’s data-driven but easier to implement? In this talk, we introduce an approach where we create a regression model to directly determine optimal Elasticsearch boost values. We will cover parsing search explanations for historical queries to create the features, assigning pairwise labels based on a judgment list, and evaluating the boosts the model produces. While not a replacement for Learning to Rank, this automatic approach led to a 1.2% increase in MAP@5 over the guess-and-check version that took 6 months to develop, and it enables quick iteration for future query changes.
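The general idea of learning boosts from pairwise labels can be sketched as follows. This is an illustration under stated assumptions, not the speakers' implementation: the per-field scores stand in for values that would be parsed from search explanations, the judgment grades are invented, and the model is a plain logistic regression without an intercept, fitted by gradient descent, whose coefficients are read as candidate field boosts.

```python
import math
from itertools import combinations

FIELDS = ["title", "body"]

# Hypothetical data: per query, a list of (per-field scores, judgment grade).
judged = {
    "query_1": [
        ({"title": 2.0, "body": 0.5}, 3),
        ({"title": 0.3, "body": 1.8}, 1),
        ({"title": 0.1, "body": 0.4}, 0),
    ],
    "query_2": [
        ({"title": 1.5, "body": 1.0}, 3),
        ({"title": 1.2, "body": 0.2}, 2),
        ({"title": 0.0, "body": 0.9}, 1),
    ],
}


def pairwise_examples(judgments):
    """Turn graded documents into (feature difference, label) pairs."""
    examples = []
    for docs in judgments.values():
        for (feats_a, grade_a), (feats_b, grade_b) in combinations(docs, 2):
            if grade_a == grade_b:
                continue  # ties carry no preference
            diff = [feats_a[f] - feats_b[f] for f in FIELDS]
            label = 1.0 if grade_a > grade_b else 0.0
            examples.append((diff, label))
            # Mirror each pair so both classes are equally represented.
            examples.append(([-d for d in diff], 1.0 - label))
    return examples


def fit_logistic(examples, lr=0.1, epochs=500):
    """Batch gradient descent on the logistic loss, no intercept term."""
    weights = [0.0] * len(FIELDS)
    for _ in range(epochs):
        grad = [0.0] * len(FIELDS)
        for x, y in examples:
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights


boosts = dict(zip(FIELDS, fit_logistic(pairwise_examples(judged))))
print(boosts)
```

On this toy data the fitted weight for `title` comes out larger than the one for `body`, reflecting that the better-judged documents tend to have higher title scores. Only the ratio between the weights matters for ranking, so they would typically be rescaled before being used as boosts.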

Nina Xu & Jenna Bellassai - Location: Conference Room and Online

Nina Xu is a Data Scientist at Guru, where she is helping Guru fulfill the vision to bring teams the knowledge they need to do their best work when they need it. Currently she is focused on using Machine Learning to improve search relevance in Guru. Nina has also worked on improving Guru’s AI Suggest Expert feature, which guides teams to choose the right subject-matter experts to be responsible for the correct pieces of knowledge. Prior to transitioning into a career in data science, Nina was an Assistant Professor at Bucknell University, where she taught college statistics courses. Nina holds a PhD in Biostatistics from New York University, where she used statistical methods to study the long-term health effects of the 9/11 World Trade Center disaster.

Jenna Bellassai is a Data Scientist on the search team at Guru. Prior to working at Guru, she worked as a Data Scientist at Picwell. She previously contributed to the Data-Driven Interactive Narrative Engine at the USC Institute for Creative Technologies and volunteered as the media coordinator for the Philadelphia chapter of Women in Data. At Guru, she is focused on improving search relevance so that teams can more easily find trusted information to do their best work.

4:00-4:45pm EDT

OLX's Journey to a Relevant Search

In this presentation, we’ll talk about OLX’s journey to a relevant search that fits well with our classifieds business model. We went from a simple search engine with really poor results to a new search model that can return highly relevant results and also solves a lot of the problems that come with the application of traditional methods. We’ll explain some of the problems associated with our use case and show what we did to solve each of those problems. The solutions range from simple to more complex: the application of default BM25, Bayesian optimization, and finally a new method we like to call “Term Podium”! We’ll also talk about how we measured our success with metrics such as NDCG, diversity, and novelty together with business metrics.

Leonardo Wajnsztok - Location: Conference Room and Online

Leonardo is a Senior Software Engineer in the Search & Recommendation team at OLX Brasil, an online marketplace platform. Before joining OLX, he worked with data science, IoT, and robotics. Currently, he is completing his Master's degree in NLP.

4:45-5:30pm EDT

Closing Keynote

Marcus Eagan - Location: Conference Room and Online

Marcus Eagan is a Senior Product Manager of Atlas Search at MongoDB. Before that, Marcus was responsible for Developer Tools at Lucidworks. He was a Global Tech Lead at Ford Motor Company, and led an IoT Security startup through its acquisition by a router manufacturer. He works hard to help underrepresented groups break into tech and has contributed to open source projects since 2011. He studied at the University of Michigan School of Information and Grinnell College.

5:30-7:00pm EDT

Haystack Reception (included with registration)

Any attendee who is in town is welcome.

Location: TBD