Interview with Mike Mathieson, former head of search relevance at Yahoo!, SearchMe, and eBay, on the history of Machine-Learned Ranking

So excited to feature Mike Mathieson as our next interview! Mike is a world-leading expert on machine learning for search. Currently he is Director of Pricing at Amazon, where in his words he is: “working on setting prices for everything Amazon sells—quite a bit different from web search relevance, but I’m still working on big, hairy problems that require huge scale and sophisticated algorithms.”

1. You worked on search for a number of years. What were some of your roles, and can you share some highlights from that time?

I studied machine learning at Duquesne University and UC Santa Cruz in the late 90s, before it was really well known as a discipline, and definitely before it was a hot buzzword.  In 2002, David Cossack, Doug Young, and Satish Katiyar at Altavista (then a part of Overture) had built a machine learned ranking system that dramatically improved the relevance of its search results as measured by DCG5 (discounted cumulative gain of the first 5 results) of scores assigned by human judges.  I was hired in early 2003 as an applied scientist to help translate the science behind the system into concepts that could be better understood by the engineering teams and executives at Altavista.  I focused on characterizing the features being used by our machine learned models and running experiments to try using new types of features to improve search.  I built two of the strongest page content features for that ranking algorithm, a classifier that identified and helped screen adult content and a spam score that captured the degree to which spammers had attempted to stuff keywords or use link farms to boost rankings.  Both of these projects were done by combining my ML background with deep expertise in these domains from Clem Wang, another Altavista employee.
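The DCG5 metric mentioned above can be sketched in a few lines. This is a minimal illustration, not Yahoo!'s actual evaluation code; real systems vary in the gain function (some use 2^rel − 1 rather than the raw grade used here) and the log base, and the function name is my own.

```python
import math

def dcg_at_k(relevance, k=5):
    """Discounted cumulative gain over the top-k results.

    `relevance` is a list of human-judged relevance grades, one per
    ranked result (higher = better). Each grade is discounted by the
    log of its rank, so relevant results near the top count for more.
    """
    return sum(rel / math.log2(rank + 2)   # rank 0 -> log2(2) = 1, no discount
               for rank, rel in enumerate(relevance[:k]))

# A ranking that puts the highly relevant results first scores higher
# than the same results in reverse order:
good = dcg_at_k([3, 2, 2, 1, 0])
bad = dcg_at_k([0, 1, 2, 2, 3])
```

Because the discount grows with rank, two result sets containing identical documents can get very different DCG5 scores depending on ordering, which is exactly what makes it a ranking metric rather than a retrieval metric.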

Later in 2003, Overture and its Altavista team were acquired by Yahoo!, which had earlier acquired Inktomi.  The search teams from these two acquisitions were combined into a group called YST (Yahoo Search Technology), where the engineering and systems came mostly from Inktomi while several of the scientists working on relevance came from Altavista; there was a lot of brainpower brought together from both sides.  My ML spam work attracted some notice in the new organization as something unique, and Jan Pedersen, the Chief Scientist at Altavista and later at YST, encouraged me to focus on other applications of machine learning in the search domain.  I found a fruitful home working on spellcheck.  This was in the days before auto-complete (which was pioneered a couple of years later), and the common approach was to use dictionary-based nearest-neighbor methods to give spelling suggestions, often with comically bad results.  We focused on using user click feedback to reinforce spelling suggestions, and to rank a set of potential spelling suggestions by the likelihood a user would click on them.  This approach was transformative: the accuracy of our spelling suggestions increased dramatically, to the point that we could start automatically correcting queries when confidence was high enough, instead of requiring customers to validate our suggestions.
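The click-feedback idea for spelling suggestions can be sketched roughly as follows. This is my own illustrative reconstruction, not the actual Yahoo! system: the function and variable names are hypothetical, and the smoothing scheme (a simple Beta-style prior) stands in for whatever the real pipeline used.

```python
def rank_suggestions(candidates, click_log, alpha=1.0, beta=10.0):
    """Rank candidate spelling corrections for a misspelled query by the
    smoothed rate at which users clicked results after each correction.

    `click_log` maps a suggestion string to (times_shown, times_clicked).
    The alpha/beta prior keeps rarely shown candidates from winning on
    tiny samples.
    """
    def score(cand):
        shown, clicked = click_log.get(cand, (0, 0))
        return (clicked + alpha) / (shown + alpha + beta)
    return sorted(candidates, key=score, reverse=True)

# Toy log: the correct spelling earns far more clicks when shown.
log = {"britney spears": (1000, 620), "brittany spears": (1000, 90)}
ranked = rank_suggestions(["brittany spears", "britney spears"], log)
```

The key contrast with dictionary-based nearest-neighbor correction is that edit distance never appears in the score: user behavior alone decides which candidate wins, which is what made confident auto-correction feasible.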

I then spent several years leading and managing the ranking team in Yahoo!, before trying my hand for a year at search relevance in a startup environment at SearchMe, and then leading the Search Science team at eBay during a time of great innovation and change from 2009-2011, before moving away from search and into other domains including Trust Science at eBay, Traffic (Paid Search and E-Mail optimization) at eBay, and now Pricing at Amazon.

2. At Yahoo you were an instrumental part of the team in 2005/2006 that replaced the Inktomi Ranking Function (IRF) with Machine-Learned Ranking (MLR). That was a huge effort requiring many teams and lots of hard work.  With that in mind –

  • What was your role in that effort?

Nawaaz Ahmed was the Inktomi engineer who had developed the latest IRF heuristic, and he combined forces with the Altavista ML Ranking (MLR) team to determine how to build an ML optimization on top of the powerful features at the core of IRF.  Nawaaz, David Cossack, Doug Young, and Satish Katiyar developed the first version of MLR at Yahoo! over a 9-month period while I was principally focused on spelling correction.  Towards the end of that effort, there were many open questions about how MLR would continue to improve.  At the time it was very tightly coupled to IRF features, and it was unclear how new features and information could be brought in to improve the ML model over time.  With a greatly improved spellcheck launched, I moved back to core ranking to help build a roadmap for iterative feature development and experimentation.  Around this time, I also stumbled into becoming a manager, mostly because everyone else on the team hated the idea of management even more than I did.  I enjoyed it more than I expected, and I managed the search ranking team in YST for the remainder of my tenure at Yahoo!, from 2004 through 2008, during which time I grew the team from its 4 original members to more than 40 scientists and software engineers to maintain a rapid pace of innovation and steady improvement in our valiant fight to be a better search engine than Google.

• Can you talk about some of the complexities of making the switch from the Inktomi Ranking Function to the new Machine-Learned Ranking?

The incredible scale and limited inputs of search relevance are what make it hard.  The task users demand is to guess what they want from an average of 2.5 keywords, some of which will be misspelled or poorly thought out.  Then, based on that intention, you must return the best possible webpage from the entire web, each page containing thousands of words on average, and do it basically instantly.  The complexity of solving that very hard problem at Yahoo! came mostly from the fact that the Inktomi Ranking Function was a super-optimized heuristic that ran very quickly, while ML Ranking was about 100x more computationally intensive.  IRF also was not engineered to easily incorporate new features—it was great at efficiently and quickly using a small set of inputs.  Since MLR was being built on top of IRF, we inherited all of its limitations and needed to build new systems and layers to acquire new data sources and derive features the ML Ranking system could use.  Some of those features involved better understanding of user intention and query semantics, some were based on document analysis and quality, and some were based on new link and WebMap analysis, but our biggest single improvement in performance came from figuring out how to use feedback loops from user click behavior to reinforce good results and penalize bad results.  This ClickText project was a huge undertaking because of the sheer volume of user activity and the amount of data crunching required.  Thankfully, other technologies like MapReduce became available that made it possible to aggregate and clean this data effectively.
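The aggregation step behind a click-feedback project like ClickText follows the classic MapReduce shape: a mapper emits a key/value pair per log line, and a reducer sums per key. The sketch below runs in a single process purely to show the pattern; the real system ran across a cluster, and the log format and names here are assumptions of mine, not the actual pipeline.

```python
from collections import defaultdict

def map_log_line(line):
    """Mapper: one raw click-log line -> ((query, url), (clicks, impressions))."""
    query, url, clicked = line.split("\t")
    yield (query, url), (int(clicked), 1)

def reduce_counts(pairs):
    """Reducer: sum clicks and impressions for each (query, url) key."""
    totals = defaultdict(lambda: [0, 0])
    for key, (clicks, impressions) in pairs:
        totals[key][0] += clicks
        totals[key][1] += impressions
    return {key: (c, n) for key, (c, n) in totals.items()}

# Tiny toy log: query, clicked URL, whether the user clicked (1/0).
log_lines = [
    "shoes\thttp://a.example\t1",
    "shoes\thttp://a.example\t0",
    "shoes\thttp://b.example\t1",
]
pairs = (p for line in log_lines for p in map_log_line(line))
agg = reduce_counts(pairs)
```

The per-(query, URL) click and impression totals produced this way are exactly the kind of feedback signal that can then be fed into the ranking model as features.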

• Why did Yahoo decide to make that change? And what were the main differences between IRF and MLR?

Humans are bad at making objective, complex decisions.  We have biases; we see patterns where none exist and completely miss other patterns that do exist.  When trying to combine multiple pieces of information to make a decision, people can only really conceptualize 3 to 5 pieces of information well.  It’s amazing to me that the Inktomi Ranking Function worked as well as it did—it used about 20 features and had an incredibly complex set of operations to combine those features.  Nawaaz and the other engineers who developed it were geniuses.  However, when applied statistics and machine learning are brought to bear, they can use hundreds or thousands of inputs, in ways that are backed entirely by very subtle observations a human couldn’t hope to detect.  IRF was the best heuristic I’ve ever seen for solving a complex problem like ranking, but it was single-threaded.  One person could (with immense talent and effort) understand it well enough to tweak it and make small improvements.  But there was no way to engage a team to divide and conquer the problem, there was no way to combine the work of those team members, and there was a huge appetite for exploring more and more ways to improve ranking.  MLR democratized innovation and allowed scientists to focus on individual features or subsets of the problem, and the ML ranking methodology would choose from the most effective outcomes of their work, combine them, and monotonically and steadily improve.
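The "combine many features, learned from data" idea can be shown in its simplest possible form: a pointwise linear ranker fit to human relevance grades. This is a deliberately tiny sketch of the general technique, not the model Yahoo! used (which was far more sophisticated); the feature names and training setup are my own illustration.

```python
def train_pointwise_ranker(examples, lr=0.01, epochs=500):
    """Fit a linear scoring function score(x) = w . x to human relevance
    grades by stochastic gradient descent on squared error -- the simplest
    'pointwise' form of machine-learned ranking.

    `examples` is a list of (feature_vector, judged_grade) pairs.
    """
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        for x, grade in examples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - grade
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Two toy features per document (say, a keyword-match score and a link
# score) with human-judged grades; the model learns how to weight them.
data = [([1.0, 0.2], 3.0), ([0.1, 0.9], 1.0), ([0.0, 0.1], 0.0)]
w = train_pointwise_ranker(data)
```

The point of the contrast with a hand-built heuristic is that nothing here is hand-tuned: add a hundred more features and more judged examples, and the same training loop finds the weights, which is what let a team divide the problem into independent feature work.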

• What gains did Yahoo see after the change to MLR?

The initial launch of MLR was just barely better than IRF on DCG5, which is another testament to the work that went into IRF and also should be unsurprising, since that version of MLR only used features and information that were available to IRF.  As the system improved thanks to the introduction of new information, Yahoo! was able to turn off its Harmony feature, a set of human-built overrides for queries where algorithmic search previously couldn’t be trusted to provide the best user experience.  Our search relevance improvements kept pace with Google’s during that time period, at a time when we had fewer people dedicated to the problem and were viewed as an underdog, and this was a great source of pride for the team.  It’s been more than a decade since I worked on the problem, so I no longer remember the specific DCG5 or click-through rate improvements we achieved, but they were sufficient to keep Yahoo! relevant and in the search game at a time we were expected to falter and fail.  I’m incredibly proud of the work we did and the spirit of innovation that drove the team.

3. How did you get started in search and what advice do you have for other people who want to get started in the search industry?

I stumbled into it by being a generalist.  Growing up, I always wanted to be a novelist and I picked Duquesne University for college because it had a great liberal arts program, not because of its computer science program.  In fact, its computer science department was buried in its math department, meaning it was full of teachers who were theorists and statisticians, which is how I got involved in machine learning in its infancy, an interest I continued to explore in graduate school at UC Santa Cruz.  I ended up with a unique skill set—a storyteller because of my passion for writing, a scientist because of my love of machine learning, and a software engineer because I loved tinkering and building.  That combination served me well when I had to explain a new technology, machine learning and specifically machine-learned ranking.

Search is an interesting domain in which to work.  No problem as hard as search is ever fully solved, and there are so many interesting dimensions that go into creating a great search experience.  Because of those conditions, you must be adaptable and willing to solve problems to 80%, then jump readily to new ones as their importance becomes clear.  In my first year at Yahoo!, I jumped from ranking to spam to spelling and back to ranking, learning and growing at each step.  When I led the ranking team, I kept the team similarly nimble, jumping from project to project.  You can go deep on any project and probably have a fine career, but that was never my style, and I think search was uniquely suited for a generalist to explore.

4. Lastly, you and I worked at SearchMe, a search engine startup where we built an entire search engine from the ground up. It worked great, but the business side of launching a new search engine was challenging. What do you think it will take for a new web search engine to be successful in the world of Google and Bing?

SearchMe was a case study in how quickly a dedicated team can develop a world-class technology (in this case, a full web search engine with a unique user interface) and yet fail to find a market for that technology.  The business didn’t have enough reserves for the huge costs of running a search engine, and we had the misfortune of needing a round of funding while VCs were still shell-shocked from the 2008 downturn.  Today, the barrier to entry in search is both smaller and larger than ever before.  Smaller because you no longer need to run your search engine out of a physical data center and instead can achieve incredible economies by using cloud-based infrastructure (SearchMe missed this boat by about a year and paid dearly for it in data center hardware costs).  Larger because Google and Bing have been at the problem for decades, have an army of experts creatively generating new optimizations, and have the audience to generate huge volumes of feedback in the form of the clicks we all make on their search result pages.

It would be daunting to create a differentiated search experience that could attract an audience that builds a big enough flywheel to become a major player in web search.  DuckDuckGo’s privacy-based approach has become more popular as privacy concerns have increased, yet its market share is a fraction of the major players.  Perhaps the rise of voice-based search will shake up the industry, but we all said that about smartphones and it didn’t happen then.  I’m an optimist and I believe in the potential for disruptive innovation, but I don’t really know what form that disruption might come in with respect to web search.
