Search engines control the information we see and use. Their key component is a ranking algorithm that tries to determine the most relevant web pages for your query. How good are these algorithms? Publicly, there’s a lot of hype, while privately, all the big engines run proprietary quality evaluation efforts. But there’s virtually no real data out there for the rest of us.
Using Mechanical Turk, we can evaluate search engine relevance. In one experiment, we took five hundred queries and ran them against the top four English-language web search engines: Ask, Google, Live, and Yahoo. The queries were a random sample from a real-world set of search queries. We had annotators rate the relevance of the top five results for each engine.
Ask clearly performed the worst. The other three engines were in a statistical tie. Their ordering was Google, Yahoo, then Live, but the differences were minuscule: the top three engines all answer about 80% of queries effectively.
What do these results mean?
People often talk about Google as being the most relevant search engine, with the best algorithms and the like. This study finds little evidence to support that. Sure, our methods are preliminary and could be improved in any number of ways; we can probably shrink those error bars and find more statistical differences. Still, for 500 typical queries, a rough but fairly objective measurement of search quality found that Google, Live, and Yahoo all performed about the same.
Note that these results don’t speak to the entire user experience. To compare fairly across engines, we extracted only the core web results with their titles, URLs, and snippets. But a search engine also includes much more: the presentation, branding, video and image results, ads, etc. We only tested the relevance of core web search.
Many more details below.
How we’re measuring search relevance
Evaluating search engine quality is a tricky task. Here’s our first pass at a methodology.
We take a set of queries and run them against several search engines, scraping their web interfaces. We then show the query and results to human raters, asking them how relevant each result is. It’s a blind test: they don’t know which engines the results came from.
The raters all come from Amazon Mechanical Turk, a distributed workforce. We submit the above query/result judgment surveys to the AMT service, and pay its users – “Turkers” – to do the relevance ratings. (If you want to learn more, try looking at the Dolores Labs FAQ.)
What queries are being used? We took a random sample from the AOL query log data set; these are actual queries that real users typed into a search engine. The AOL data set is remarkable for being pretty much the only publicly available, real-world data on web search behavior. Of course, it’s also infamous for the very valid privacy issues it raised. We’re only using the part of it that doesn’t involve personal information – the raw queries, without user identification. (This NYT article on the issue is interesting.)
What’s being measured? For each engine, we count the number of queries that had at least one “Highly Relevant” result within the first five results the engine returned. This is a version of the “precision at 5” metric from information retrieval. There are, of course, many other methods to explore. We wanted a metric that was simple and easy to interpret.
How were raters’ judgments used? We had three raters per result, and basically took a simple majority vote. We didn’t attempt to model individual annotator biases.
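A simple majority vote over three raters can be sketched like this. The label strings are hypothetical, and since three raters can split three ways, the sketch falls back to the most common label (the study’s actual tie-breaking rule isn’t specified here).

```python
from collections import Counter

def majority_label(ratings):
    """Aggregate one result's ratings by simple majority vote.
    With three raters, a clean majority usually exists; on a
    three-way split, Counter.most_common falls back to the
    first label seen (an assumption, not the study's rule)."""
    return Counter(ratings).most_common(1)[0][0]

print(majority_label(["Highly Relevant", "Highly Relevant", "Relevant"]))
# -> Highly Relevant
```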
Are these judgments trustworthy? The ratings are certainly noisy. And sure, the workers have little training and are (somewhat) anonymous to us. However, the relevance judgment task is fairly subjective and therefore inherently noisy. Further, it’s arguably better to use untrained annotators, since this more closely mimics normal search users. And finally, we’re finding some statistically significant, systematic differences between engines on a query set, with only extremely simple analysis – so something must be working right.
What’s the statistical methodology? As dead simple as possible: 95% confidence intervals on the graph, and engine comparisons via paired t-tests. These are all on that per-query precision-at-5 metric. We think that with larger scale experiments, more fine-grained breakdowns, survey design improvements, better analyses, etc., we can flesh out more differences.
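For concreteness, here’s what that dead-simple analysis looks like in code: a paired t statistic over per-query scores, and a normal-approximation 95% interval (using the critical value 1.96, which is reasonable at n ≈ 500). This is an illustrative sketch, not the study’s actual analysis script.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic comparing two engines on the same queries.
    a[i] and b[i] are per-query scores for engines A and B
    (e.g. 1 if query i was answered, 0 otherwise)."""
    diffs = [x - y for x, y in zip(a, b)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / se

def approx_ci95(scores):
    """Approximate 95% confidence interval for an engine's mean
    per-query score, using the normal critical value 1.96
    (a large-sample approximation)."""
    half = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return (mean(scores) - half, mean(scores) + half)

# Hypothetical per-query answered/not-answered scores for two engines:
engine_a = [1, 1, 0, 1, 0, 1, 1, 0]
engine_b = [1, 0, 0, 1, 0, 0, 1, 0]
print(round(paired_t(engine_a, engine_b), 2))
print(approx_ci95(engine_a))
```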
What’s the “meta-engine upper bound”? That’s just how many queries had a “Highly Relevant” result on at least one engine. So hypothetically, if you were to combine all the engines and select the best results for the top, it would perform at this upper bound. This bound is overly high for a number of reasons (e.g. it’s artificially inflated by judgment noise and it assumes smart re-ranking); but it gives some idea how much the search engines could still improve.
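The upper bound itself is just a set union across engines. A minimal sketch, with hypothetical query names:

```python
def meta_upper_bound(answered_by_engine, all_queries):
    """Fraction of queries answered by at least one engine: the
    hypothetical ceiling for a combined engine that always surfaced
    the best available result.

    answered_by_engine: dict mapping engine -> set of queries with
    a Highly Relevant result in that engine's top five."""
    answered_somewhere = set().union(*answered_by_engine.values())
    return len(answered_somewhere) / len(set(all_queries))

# Hypothetical example: each engine answers 2 of 4 queries,
# but together they cover 3 of 4.
answered = {"engine x": {"q1", "q2"}, "engine y": {"q2", "q3"}}
print(meta_upper_bound(answered, ["q1", "q2", "q3", "q4"]))  # 0.75
```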
Anyway, we’re thinking of doing more work along these lines if people are interested. There are certainly big improvements that could be made; we’d love any feedback you have.