AMT is fast, cheap, and good for machine learning data


Update 9/19: Final PDF version has been uploaded. See also the comments below for updates -- our released data is already being used by others!


We recently teamed up with Rion Snow, Prof. Dan Jurafsky, and Prof. Andrew Ng from the Stanford AI Lab to try using Amazon Mechanical Turk to generate data sets for machine learning research. Many AI tasks require large amounts of training data, and to build natural language systems, researchers traditionally pay linguistic experts for millions of annotations. Search engine companies employ hundreds or thousands of annotators for their classification, ranking, and other statistically trained systems, but that data is proprietary and unavailable for research. AMT is a potential tool for creating high-quality data sets accessible to everyone.

We rigorously tested the quality of AMT responses on several classic human language problems, and found that the quality matched or exceeded the expert data that most researchers use. We wrote a paper, "Cheap and Fast -- But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks," which will be presented at the upcoming EMNLP 2008 conference.

Our findings:

1. Turker-generated data is good. AMT makes it easy to ask many people for judgments, so for several tasks, we measured how well the aggregated Turker judgments agree with the expert gold standard. With more judgments per example, accuracy increases. For comparison, the horizontal dotted line on each graph marks the rate at which a single expert agrees with the gold standard. Enough non-experts can match, and often beat, the experts' reliability. (A sketch of this aggregation follows the figure below.)

[Figure: aggregated Turker accuracy vs. number of judgments per example, with the single-expert agreement rate shown as a dotted line]
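To make the aggregation concrete, here is a minimal sketch, not the paper's exact pipeline: take the first k Turker labels for each item, combine them by majority vote, and score the result against the gold standard. The data layout (dicts keyed by item id) is hypothetical.

```python
from collections import Counter
import random

def majority_vote(labels):
    """Return the most common label; ties are broken at random."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = [lab for lab, c in counts.items() if c == top]
    return random.choice(winners)

def accuracy_at_k(turker_labels, gold, k):
    """turker_labels: {item_id: [label, ...]}, gold: {item_id: label}.
    Aggregate the first k Turker labels per item by majority vote and
    report agreement with the expert gold standard."""
    correct = 0
    for item_id, gold_label in gold.items():
        vote = majority_vote(turker_labels[item_id][:k])
        correct += (vote == gold_label)
    return correct / len(gold)
```

Plotting accuracy_at_k for k = 1, 2, 3, ... gives the kind of curve shown in the figure above.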

2. Turker-generated data is cheap and fast. We can collect thousands of labels per dollar and per hour.

[Figure: labels collected per dollar and per hour, by task]

3. Expert data enhances individual Turker data. First off, individual workers have differing accuracy rates:

[Figure: distribution of individual worker accuracies]

So we estimate each worker's accuracy on a small portion of the experts' gold-standard data, then reweight votes by that estimated reliability when aggregating. This yields higher aggregated accuracy; a sketch follows the figure below. (Also see our related threshold calibration post.)

[Figure: accuracy of gold-calibrated, reliability-weighted voting vs. unweighted voting]
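The exact recalibration we use is described in the paper. As a rough sketch of the idea for binary labels: estimate each worker's accuracy on the gold items (with smoothing so no one gets exactly 0 or 1), then weight each vote by the log-odds of that accuracy. The data structures below are hypothetical.

```python
import math

def estimate_worker_accuracy(worker_labels, gold):
    """worker_labels: {worker_id: {item_id: label}}, gold: {item_id: label}.
    Estimate each worker's accuracy on the gold-standard items they labeled,
    with add-one smoothing to keep the estimate strictly between 0 and 1."""
    acc = {}
    for worker, labels in worker_labels.items():
        scored = [labels[i] == gold[i] for i in labels if i in gold]
        acc[worker] = (sum(scored) + 1.0) / (len(scored) + 2.0)
    return acc

def weighted_vote(item_labels, worker_acc):
    """item_labels: {worker_id: label} for one item, labels in {0, 1}.
    Each worker's vote counts with weight log(acc / (1 - acc)), so more
    reliable workers pull the decision harder."""
    score = 0.0
    for worker, label in item_labels.items():
        w = math.log(worker_acc[worker] / (1.0 - worker_acc[worker]))
        score += w if label == 1 else -w
    return 1 if score > 0 else 0
```

Under this weighting, a worker at 50% accuracy contributes nothing, while a highly reliable worker can outvote several unreliable ones.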

4. Turker data enhances NLP systems. For one of the tasks, predicting the emotions elicited by a newspaper headline, we trained a simple machine-learned classifier on the Turker data. It easily outperforms the same classifier trained on expert data; a sketch of the comparison follows the figure below. (There's a subtle effect here; see the paper for details.)

[Figure: classifier performance when trained on Turker labels vs. expert labels]
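As a rough illustration of the comparison, and not the actual system from the paper: train the same simple bag-of-words classifier once on Turker labels and once on expert labels, then evaluate both on the same held-out, expert-labeled test set. This sketch assumes scikit-learn and placeholder variable names.

```python
# Hypothetical comparison: same model, two training label sources.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_and_score(train_headlines, train_labels, test_headlines, test_labels):
    """Fit a bag-of-words logistic regression and return held-out accuracy."""
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_headlines, train_labels)
    return model.score(test_headlines, test_labels)

# turker_labels: aggregated Turker judgments; expert_labels: the original
# expert annotations. All of these variables are placeholders.
# acc_turker = train_and_score(headlines, turker_labels, test_headlines, test_gold)
# acc_expert = train_and_score(headlines, expert_labels, test_headlines, test_gold)
```

The figure above compares exactly these two numbers across training conditions.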

We'll update this blog post with a link to the final version of the paper in the coming weeks. Many thanks to our friend Rion, who spearheaded this collaboration. The current version of the paper is here:

[ This article is part of a series, Wisdom of Small Crowds, on crowdsourcing methodology. ]