We use Turkers to classify all sorts of data, by having several workers render judgments on each item. But what should we do when they disagree? Like any other human behavior, Turker judgments are noisy: sometimes there are mistakes, and sometimes the task is genuinely difficult or subjective, and there is no “right” answer. Once we have a bunch of Turker judgments, we need to aggregate them — that is, use some sort of voting mechanism — to give as accurate a classification as possible. It turns out that one simple trick, threshold calibration, can substantially improve accuracy, and can be tuned to the specifics of the problem.
Here’s an example. A recent client of ours had a de-duping task: given a pair of similar articles, the task was to decide if they were “about the same topic” or “about different topics”. This is just a binary classification problem; call these labels “YES” and “NO”. To figure out how well Turkers could perform the task, we had our client provide us with a gold standard data set. That is, for 135 examples, their experts did the task themselves and provided “gold” ground truth labels.
We used a very high number of workers per example (about 20). For all 135 examples in the gold standard, the following graph plots them vertically by their “Turker confidence in YES” — that’s just the percentage of votes for “YES” among the 20 or so judgments for that particular example. I’ve also colored each example with the experts’ gold label. You can see that this simple Turker data provides some statistical separation between the classes.
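Computing this confidence is straightforward; here's a minimal Python sketch, using made-up vote lists rather than the real data (the actual examples had about 20 judgments each):

```python
# Hypothetical data: for each example, the list of Turker votes (1 = "YES", 0 = "NO").
# The real examples had ~20 judgments each; these short lists are just for illustration.
votes_per_example = [
    [1, 1, 1, 0, 1],   # 4 of 5 voted YES
    [0, 0, 1, 0, 0],   # 1 of 5 voted YES
]

def yes_confidence(votes):
    """Fraction of judgments that voted YES for one example."""
    return sum(votes) / len(votes)

confidences = [yes_confidence(v) for v in votes_per_example]
print(confidences)  # prints [0.8, 0.2]
```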
This graph also shows how to create a classifier from Turker votes. We have to choose a confidence threshold for our classifier’s decision: above the threshold, say “YES”, and below say “NO”. Unfortunately, Turkers aren’t perfect at modeling the experts: anywhere we place the threshold, errors occur. However, some thresholds are better than others. The threshold with the best accuracy is at 73% confidence — that is, a 73% super-majority voting rule — and it classifies instances correctly 90% of the time. Furthermore, we can tune for different types of errors. If we are particularly concerned with avoiding false positive errors, we can set a higher, more conservative threshold; or, if we want to find as many “YES” instances as possible, we can set a lower, more liberal threshold.
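Finding the most accurate threshold is just a sweep over candidate values, scoring each against the gold labels. Here's a minimal Python sketch; the confidences and labels are toy numbers for illustration, not the client's gold set:

```python
def best_threshold(confidences, gold_labels):
    """Sweep candidate thresholds; return (best accuracy, best threshold).

    confidences: per-example fraction of YES votes.
    gold_labels: expert labels, True for YES.
    """
    best = (0.0, 0.5)  # (accuracy, threshold)
    for t in sorted(set(confidences)):
        # Classify YES when confidence reaches the threshold, then score vs. gold.
        correct = sum((c >= t) == g for c, g in zip(confidences, gold_labels))
        acc = correct / len(gold_labels)
        if acc > best[0]:
            best = (acc, t)
    return best

# Toy data, not the actual 135-example gold standard:
conf = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2]
gold = [True, True, True, False, False, False, False]
acc, t = best_threshold(conf, gold)
print(acc, t)  # prints 1.0 0.75
```

Shifting the returned threshold up or down is exactly the conservative/liberal tuning described above.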
Here’s another chart that more carefully details the tradeoffs between true and false positives vs. true and false negatives. For a particular decision threshold, it shows how it divides up the instances into the confusion matrix’s 4 categories of correct and incorrect decisions.
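Tallying the four confusion-matrix cells at a given threshold is equally simple; here's a sketch, again on toy data rather than the real gold standard:

```python
def confusion_counts(confidences, gold_labels, threshold):
    """Split examples into the four confusion-matrix cells at one threshold."""
    tp = fp = tn = fn = 0
    for c, g in zip(confidences, gold_labels):
        predicted_yes = c >= threshold
        if predicted_yes and g:
            tp += 1          # true positive: we say YES, experts say YES
        elif predicted_yes and not g:
            fp += 1          # false positive: we say YES, experts say NO
        elif not predicted_yes and not g:
            tn += 1          # true negative
        else:
            fn += 1          # false negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Toy data: at a simple 50% majority threshold, one borderline NO sneaks through.
conf = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2]
gold = [True, True, True, False, False, False, False]
counts = confusion_counts(conf, gold, 0.5)
print(counts)  # prints {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 0}
```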
A final note on why threshold calibration is important: For this task, the Turkers were considerably more liberal than the experts at deciding what a “YES” example was — experts marked only 36% of examples as “YES”, whereas a simple Turker majority voting rule marks 57% that way. This is because the experts understood the full implications of the decision, which were substantial — various entries in their database and website would be merged, and users would be confused if they were exposed to a bad merge. False positives had a very high cost. The prompt for Turkers, by contrast, was fairly vague. (In our experience, we generally find that good task design is a huge factor in getting better Turker accuracy.) However, since Turker decisions noisily correlate with the experts, moving the decision threshold can help accuracy. Here’s the threshold vs. accuracy graph:
Statistical analysis of Turker data can substantially improve accuracy, even with something as simple as choosing the best decision threshold. This blog post only scratched the surface; there are a few more useful things to consider. Stay tuned for Part 2 and hopefully many more!
A few more notes on Turker voting and threshold calibration:
An interesting question is the upper bound of possible performance on the task. A good experiment to try is to have two experts independently perform the task and check their agreement rate. We should be satisfied if Turkers can match the experts as reliably as the experts match each other. For this task, the calibrated Turker classifier matched the gold labels 90% of the time; so if experts agree with each other no more than 90% of the time, then the Turkers are effectively performing the task as well as experts. (We didn’t have this particular experiment done in this case, but I’d be very curious to see the results!) In general, agreement rates can help indicate the difficulty of a task. If expert agreement rates are low, it can be argued that the task is not very “real”.
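Measuring raw agreement between two annotators takes a few lines; here's a sketch with hypothetical expert labels (to go further, you'd want a chance-corrected statistic like Cohen's kappa):

```python
def agreement_rate(labels_a, labels_b):
    """Raw agreement: fraction of examples where two annotators gave the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical: two experts independently label the same ten examples.
expert1 = ["YES", "YES", "NO", "NO", "YES", "NO", "NO",  "YES", "NO", "NO"]
expert2 = ["YES", "NO",  "NO", "NO", "YES", "NO", "YES", "YES", "NO", "NO"]
print(agreement_rate(expert1, expert2))  # prints 0.8
```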
The terminology of “true/false positives”, “true/false negatives”, “precision”, and “recall” is all part of a statistics/machine learning mini-field of binary classifier evaluation. Any statistical classifier that outputs a confidence value or a ranking among instances (Naive Bayes, logistic regression, IR ranking, etc.) can be subjected to this sort of threshold analysis. A decent place to read more is the ROC Wikipedia page. ROC and precision-recall curves have long been used to show thresholding tradeoffs. I think the plots above make the basic information easier to interpret, but the more traditional graphs are also useful. Here they are for this data (provided courtesy of the excellent ROCR package):
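If you'd rather compute the underlying ROC points yourself instead of using a package, it's only a few lines; here's a minimal Python sketch over the same toy confidences and labels used above (not the actual data):

```python
def roc_points(confidences, gold_labels):
    """(false-positive rate, true-positive rate) at each candidate threshold,
    swept from strictest to most liberal."""
    pos = sum(gold_labels)
    neg = len(gold_labels) - pos
    points = []
    for t in sorted(set(confidences), reverse=True):
        tp = sum(c >= t and g for c, g in zip(confidences, gold_labels))
        fp = sum(c >= t and not g for c, g in zip(confidences, gold_labels))
        points.append((fp / neg, tp / pos))
    return points

# Toy data for illustration:
conf = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2]
gold = [True, True, True, False, False, False, False]
points = roc_points(conf, gold)
print(points[2])   # prints (0.0, 1.0): all positives found, no false alarms yet
print(points[-1])  # prints (1.0, 1.0): the most liberal threshold says YES to everything
```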