By Emma Ferneyhough, December 10, 2013

Crowdsourcing at Scale | Shared-Task Challenge Winners


The first Crowdsourcing at Scale workshop was held on November 9 at the conference on Human Computation and Crowdsourcing (HCOMP 2013) in Palm Springs, Calif. A big part of the workshop was a Shared Task Challenge where we invited workshop attendees to come up with their own way to accurately aggregate all the judgments, or crowd labels, for large crowdsourced datasets collected by Google and CrowdFlower. We had two winning teams!

From UC Irvine/MIT CSAIL (“Team UCI”) are Qiang Liu, Jian Peng and Alexander Ihler, who are interested in human computation/crowdsourcing because of its intrinsic connection to machine learning and the potential for these two fields to benefit from each other.

From Microsoft Research/Bing and University of Southampton (“Team MSR”) are Matteo Venanzi, John Guiver, Gabriella Kazai, and Pushmeet Kohli, who have researched how crowdsourcing models can be extended to information-rich settings and how these models can be scaled to large, real-world datasets.

The teams’ solutions were statistically indistinguishable in average recall, yet they took very different approaches.

Team UCI’s Approach

Team UCI noticed that both datasets had highly imbalanced labels and included contributors’ uncertainty indicators (e.g., “skip” or “I can’t tell”) that potentially reflect the ambiguity of the questions. They designed novel and simple two-stage methods that took these factors into account. They first gave an initial estimation of the answers by ignoring the uncertainty indicators, and then identified the set of ambiguous questions by a procedure based on both contributors’ uncertainty indicators and their algorithms’ confidence levels. Finally, the issue of imbalanced labels was addressed using a tunable parameter that flexibly traded off the recalls of the different classes of labels. This method resulted in an average accuracy of 86 percent and recall of 64 percent on the Google dataset, and 96 percent accuracy and 77 percent recall on the CrowdFlower dataset.

Source: Team UCI slides
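To make the two-stage idea concrete, here is a minimal sketch for a binary task. It is not Team UCI's actual algorithm: the function name, the "skip" marker, the label names, and all three tuning parameters are assumptions made for illustration, and the real method's confidence estimates and handling of ambiguous questions are not described in this post.

```python
from collections import Counter, defaultdict

SKIP = "skip"  # hypothetical marker for "skip" / "I can't tell" responses


def two_stage_aggregate(judgments, positive="yes", negative="no",
                        positive_threshold=0.5, skip_limit=0.5,
                        min_confidence=0.7):
    """Aggregate (item_id, label) pairs into {item_id: (label, is_ambiguous)}.

    Assumes every label is `positive`, `negative`, or SKIP.
    """
    by_item = defaultdict(list)
    for item_id, label in judgments:
        by_item[item_id].append(label)

    results = {}
    for item_id, labels in by_item.items():
        counts = Counter(labels)
        answered = len(labels) - counts[SKIP]

        # Stage 1: initial estimate that ignores the uncertainty indicators.
        if answered:
            share = counts[positive] / answered
            confidence = max(share, 1 - share)
        else:
            share, confidence = 0.0, 0.0

        # Tunable threshold: lowering it favors recall on the positive class.
        label = positive if answered and share >= positive_threshold else negative

        # Stage 2: flag the question as ambiguous when many contributors
        # skipped it or the stage-1 estimate is not confident.
        skip_rate = counts[SKIP] / len(labels)
        ambiguous = skip_rate >= skip_limit or confidence < min_confidence

        results[item_id] = (label, ambiguous)
    return results
```

On an imbalanced dataset, lowering positive_threshold trades recall on the common class for recall on the rare one, which is the kind of trade-off the tunable parameter described above controls.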

Team MSR’s Approach

Team MSR focused on the problem of how to leverage the content of text snippets (e.g., tweets in CrowdFlower’s dataset) and how to combine that content with human judgments. Their solution was a probabilistic model implemented using Infer.NET. It accounted for the quality of individual workers with a Bayesian Classifier Combination model, and for the relationships between the words contained in the tweets and the judgment categories with a conditional bag-of-words model. This algorithm resulted in an average accuracy of 87 percent and recall of 61 percent on the Google dataset, and 96 percent accuracy and 78 percent recall on the CrowdFlower dataset.

Source: Team MSR slides
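The worker-quality half of this idea can be illustrated with per-worker confusion matrices. The sketch below is not Team MSR's model and does not use Infer.NET: it swaps full Bayesian inference for smoothed point estimates, leaves out the bag-of-words component entirely, and assumes a set of provisional labels (for example, from majority vote) is already available. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def estimate_confusion(judgments, provisional, classes, smoothing=1.0):
    """Estimate P(reported label | true label) for each worker.

    judgments: iterable of (item_id, worker_id, reported_label).
    provisional: {item_id: label} used as a stand-in for the truth.
    """
    counts = defaultdict(
        lambda: {t: {r: smoothing for r in classes} for t in classes}
    )
    for item, worker, reported in judgments:
        counts[worker][provisional[item]][reported] += 1

    confusion = {}
    for worker, table in counts.items():
        confusion[worker] = {
            true: {rep: n / sum(row.values()) for rep, n in row.items()}
            for true, row in table.items()
        }
    return confusion


def posterior(item_judgments, confusion, prior):
    """Combine one item's (worker_id, reported_label) pairs into class probabilities."""
    log_post = {c: math.log(p) for c, p in prior.items()}
    for worker, reported in item_judgments:
        for c in log_post:
            log_post[c] += math.log(confusion[worker][c][reported])
    # Normalize in a numerically stable way.
    peak = max(log_post.values())
    unnorm = {c: math.exp(v - peak) for c, v in log_post.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}
```

One consequence of modeling each worker separately is that a worker who is consistently wrong still carries information: their "no" can count as evidence for "yes".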

Comparing the Two Approaches

Team UCI’s approach showed the value of empirically tuning how judgments were modeled and scored; for example, they learned to model “skip” and “I can’t tell” responses as a lack of confidence in the other options. Team MSR’s use of signals from text was noteworthy and has high practical impact for improving the quality of human computation on natural-language tasks. Both approaches improved on a simple majority-vote baseline, and both speak to the long-term potential of mixing machine learning and crowdsourcing to come up with the best answer. That’s huge!
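For readers who want a reference point, a majority-vote baseline and the average-recall metric can be sketched in a few lines. This is an illustrative version, not the challenge's official baseline or scoring code; the input format and the tie-breaking rule are assumptions.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Map (item_id, worker_id, label) triples to {item_id: most common label}."""
    votes = defaultdict(Counter)
    for item_id, _worker_id, label in judgments:
        votes[item_id][label] += 1
    # Ties go to the alphabetically smallest label so results are deterministic.
    return {
        item: min(counts, key=lambda lab: (-counts[lab], lab))
        for item, counts in votes.items()
    }


def average_recall(predicted, gold):
    """Mean of per-class recall, taken over the classes that appear in gold."""
    hits, totals = Counter(), Counter()
    for item, true_label in gold.items():
        totals[true_label] += 1
        if predicted.get(item) == true_label:
            hits[true_label] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

Because average recall weights every class equally, a method that always predicts the dominant class can score well on accuracy yet poorly on recall, which is why both numbers are reported for each team.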

Links to Slides and Talks