Bing is an Improvement over Live, but Still Not Google Quality: Evaluating Bing With Mechanical Turk


Microsoft's new search engine, Bing, has recently gotten a lot of attention. Several people have already built tools to compare Google and Bing.

Since all the engines are fairly similar, it's hard to separate true quality from our preconceptions. For example, one of Google's internal tests is reported to have shown that "users still prefer the results with the Google logo, even if they're not Google results."

Are the new Bing search results really better than the old Live search results? Are they better than Google?

We took 100 random real-world queries and showed their results from each engine to workers on Mechanical Turk. For a single query, we showed the results from two engines side-by-side and asked workers to judge which result set was better. For each query, here's the aggregate judgment from several workers:

Bing versus Google

Bing (Microsoft today) versus Live (Microsoft as of March)

Summary

We found that Google is statistically significantly preferred to Bing.

On the other hand, we found that users preferred Bing's new results to the older Live search results 55% of the time. But this difference wasn't statistically significant; in aggregate, the two are virtually tied.
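To give a sense of why 55 out of 100 queries isn't significant, here is a minimal sketch of a two-sided exact sign (binomial) test against a 50/50 null, using only the standard library. The counts in the example are illustrative, not our actual per-query tallies, and ties are assumed to have been dropped before counting.

```python
from math import comb

def sign_test_p(wins: int, total: int) -> float:
    """Two-sided exact binomial (sign) test against p = 0.5.

    `wins` is the number of queries on which one engine was preferred;
    tied queries are assumed to have been excluded from `total`.
    """
    k = max(wins, total - wins)
    # One tail: P(X >= k) under Binomial(total, 0.5); double it for two sides.
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * tail)

# Illustrative only: 55 of 100 is nowhere near significant at the 0.05 level.
print(sign_test_p(55, 100))
```

A 55/45 split over 100 queries yields a p-value well above 0.05, which is why we call the Bing-versus-Live comparison a virtual tie despite the 55% preference.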

In conclusion, Bing's quality seems to be improving, but it hasn't yet caught up with Google. Of course, relevance is just one component of the search engine user experience; all the major engines are quite close, and there exists a large set of queries where Bing significantly outperforms Google.

Details

First, we randomly sampled a query set from the leaked AOL queries, which is probably still the best, most representative available data of web search queries. We ran 100 of them on the old Microsoft Live search back in March for a previous project, and last week we scraped Google and Bing.

Note that our scrapes ignore the Google "One Box" results (news, pictures, etc.) that appear above the search results for many queries. We threw out the results for several common navigational queries where Bing returns only one result (myspace, aol, etc.); these probably don't affect the outcome much, since all the engines do very well on them.

There are many ways to evaluate relevance. For this experiment, we chose to show people the results of two engines at a time, side-by-side and unbranded. You can see exactly what the turkers saw here. We randomized the left and right engines.

The possible answers were "Engine A much better", "Engine A slightly better", and so on. We mapped responses to {-2, -1, +1, +2} and averaged over the 6-8 workers who judged each pair of engines for a given query.
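The scoring step above can be sketched as follows. The label strings and the left/right framing here are hypothetical stand-ins for the actual HIT wording; the point is the {-2, -1, +1, +2} mapping, the per-query averaging, and the sign flip that undoes the left/right randomization.

```python
from statistics import mean

# Hypothetical answer labels; the real task used similar side-by-side wording.
SCORE = {
    "Left much better": -2,
    "Left slightly better": -1,
    "Right slightly better": +1,
    "Right much better": +2,
}

def query_score(judgments, google_on_left: bool) -> float:
    """Average the 6-8 worker judgments for one query, oriented so that
    a positive score always means Google was preferred."""
    sign = -1 if google_on_left else 1
    return sign * mean(SCORE[j] for j in judgments)

# Three workers, Google shown on the left: two prefer Google, one prefers Bing.
print(query_score(["Left much better", "Left slightly better",
                   "Right slightly better"], google_on_left=True))
```

Averaging within a query before comparing across queries keeps any one query (or any one prolific worker) from dominating the aggregate.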

A histogram of the raw judgments on Google versus Bing:

Thanks to Brendan for help on this post.

Updates:
Will points out that we should mention Brendan and I both worked at Powerset, which was acquired by Microsoft after we left.

Hang has a blog response and takes issue with our graphs; my reply is in his comments: http://blog.figuringshitout.com/another-way-to-lie-with-statistics