Designing Incentives for Crowdsourcing Workers


In a recent paper, presented at the ACM Conference on Computer Supported Cooperative Work (CSCW), John Horton, Daniel Chen and I used a large-scale experiment to test the effect of different incentive schemes on the quality of crowdsourcing work.

The results surprised us. They suggest that workers perform most accurately when the task design credibly links payoffs to a worker's ability to think about the answers that their peers are likely to provide.

Horserace!

The idea for this study came out of our sense that, as social scientists, we had something unique to offer the existing research on human computation. Early and influential crowdsourcing research has focused on how to filter the judgments of the crowd to find the best answers. We wanted to know whether simple task-design changes could improve the quality of data coming into a crowdsourcing system in the first place.

To test this idea, we chose 14 different incentive schemes and framing techniques developed and validated across the social sciences and set up a horse race experiment to see which schemes/techniques would work best.

Consistent with our personal biases (John and Daniel are both economists, and I'm a sociologist), some of the schemes were financially oriented, some were social or psychological, and some were hybrids combining social and financial incentives. The details of all the schemes are included in the paper (it's a long list, and some of them are kind of involved), but it's worth giving some examples.

On the financial end of the incentives spectrum, we had one condition we called "reward-accuracy," which was pretty much what you'd expect: we told workers, "we'll pay you a bonus if you get the answers right." We also had one called "punishment-accuracy," the gist of which you can deduce. On the purely social-psychological side, we had one we called "trust," in which we told workers, "we'll pay you for this job no matter how bad your performance; we trust that you'll still make your best effort."

One of the weirdest schemes turns out to be important, so it deserves a fuller explanation. Called "Bayesian Truth Serum" (BTS), it adapts a design from the work of Drazen Prelec, a behavioral economist at MIT, who realized that research subjects can provide useful information about the expected distribution of other people's answers to subjective, qualitative questions (n.b., the mechanics of how he does this are arcane in a way that is almost sure to delight the geeks among you, so I encourage you to read his paper). Few of the details of the full BTS mechanism matter here, except that we borrowed the piece that asks workers both to answer the questions themselves and to predict the distribution of other workers' responses. We also told them we would pay a bonus if their predictions were accurate.
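To make this more concrete, here is a minimal sketch of a BTS-style elicitation in Python, under my own simplifying assumptions: each worker submits an answer plus a predicted distribution over other workers' answers, and the prediction is scored against the empirically observed distribution. The function names and the squared-error scoring rule are illustrative; they are not the exact bonus rule from the paper, and Prelec's full mechanism is considerably more involved.

```python
from collections import Counter

def empirical_distribution(answers, options):
    """Fraction of workers choosing each answer option."""
    counts = Counter(answers)
    total = len(answers)
    return {opt: counts.get(opt, 0) / total for opt in options}

def prediction_score(predicted, actual):
    """Illustrative accuracy score for a worker's predicted distribution:
    1 minus the mean squared error against the observed distribution."""
    mse = sum((predicted[opt] - actual[opt]) ** 2 for opt in actual) / len(actual)
    return 1.0 - mse

# Hypothetical example: a yes/no question answered by five workers.
options = ["yes", "no"]
all_answers = ["yes", "yes", "no", "yes", "no"]   # what the crowd actually said
worker_prediction = {"yes": 0.7, "no": 0.3}       # one worker's guess about the crowd

actual = empirical_distribution(all_answers, options)
print(prediction_score(worker_prediction, actual))  # closer to 1.0 = better prediction
```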

We then created a task that asked workers to answer five questions. In this case, the questions were drawn from another study examining participatory features of websites, for which we already possessed validated data collected by research assistants.

All workers answered the same five questions about the same website (www.kiva.org) while being exposed to one and only one of the 14 incentive schemes (or a control condition of no scheme). Roughly 2,000 individuals participated in the study, resulting in over 100 subjects in each of the experimental conditions. (The statistics and science nerds out there will be pleased to know that both the drop-out rate and demographic covariates were distributed evenly across conditions.)
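For readers wondering what "distributed evenly across conditions" amounts to in practice, a standard balance check is a chi-square test of attrition (or of a demographic covariate) against treatment assignment. The sketch below uses made-up counts for three of the fifteen conditions and is not the paper's actual analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical dropout counts: rows are conditions, columns are (completed, dropped out).
counts = np.array([
    [110, 12],   # control
    [115, 10],   # reward-accuracy
    [108, 14],   # Bayesian Truth Serum
])

chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2={chi2:.2f}, p={p:.3f}")  # a large p-value is consistent with balanced attrition
```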

To measure worker performance, we used the research assistant responses as correct answers to the questions and then calculated the total number of matching answers (out of five) provided by each worker. The results (aggregated across all treatments) are plotted in a histogram below and show that the average worker answered just over two questions out of five correctly.
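In code, the scoring is nothing more than counting matches against the research assistants' reference coding. A minimal sketch with hypothetical question labels and answers:

```python
# Gold-standard answers from the research assistants (hypothetical values).
reference = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "no", "q5": "yes"}

def score_worker(worker_answers, reference):
    """Number of the worker's answers that match the reference coding (0-5)."""
    return sum(worker_answers.get(q) == a for q, a in reference.items())

worker = {"q1": "yes", "q2": "yes", "q3": "yes", "q4": "no", "q5": "no"}
print(score_worker(worker, reference))  # 3
```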

[Figure: Histogram of correct answers per worker (out of five), aggregated across all treatments]

Then, to see how the treatments compared with one another and with the control group, we calculated the mean number of correct responses for each condition and conducted difference-of-means tests to see which conditions significantly outperformed the control. The results of this comparison appear below (in a new plot that doesn't even appear in the paper!):

[Figure: Mean correct answers per treatment with 95% confidence intervals (ITT estimates)]

The orange dots show the value of the mean in each condition, and the blue bars illustrate the 95% confidence interval around that mean. The treatments are sorted by the size of the difference in means from the control. (More hard-core nerd stuff: the means are adjusted using Intent-To-Treat estimators).
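For anyone who wants to build a similar comparison from their own data, the basic ingredients of each point in a plot like this are a condition mean, a 95% confidence interval, and a test against the control group. The sketch below uses a plain Welch t-test on simulated scores; it does not reproduce the ITT adjustment or the actual data from the paper.

```python
import numpy as np
from scipy import stats

def summarize_condition(scores, control_scores):
    """Mean, 95% CI half-width, and Welch t-test p-value vs. control."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)
    ci = sem * stats.t.ppf(0.975, df=len(scores) - 1)
    _, p = stats.ttest_ind(scores, control_scores, equal_var=False)
    return mean, ci, p

# Hypothetical scores (correct answers out of five) for two conditions.
rng = np.random.default_rng(0)
control = rng.integers(0, 6, size=120)
bts = rng.integers(1, 6, size=120)

mean, ci, p = summarize_condition(bts, control)
print(f"BTS: mean={mean:.2f} +/- {ci:.2f}, p vs. control={p:.3f}")
```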

From these results, we concluded that our horse race had two clear front-runners: the "Bayesian Truth Serum" (BTS) and "punishment-disagreement" conditions, each of which improved average worker performance by almost half a correct answer over the 2.08 correct answers in the control group. A few of the other financial and hybrid incentives had fairly large point estimates, but they were not significantly different from control once we adjusted the test statistics and corresponding p-values to account for the fact that we were making so many comparisons at once (apologies if this doesn't make sense; it's yet another precautionary measure to avoid upsetting the stats nerds among you). In a tough turn for the sociologists and psychologists, none of the purely social/psychological treatments had any significant effects at all.
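To give a flavor of what adjusting for many comparisons looks like in practice, here is one standard option, the Holm-Bonferroni correction as implemented in statsmodels. The p-values are made up, and the paper's exact adjustment procedure may differ.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values, one per treatment-vs-control comparison.
raw_pvalues = [0.004, 0.010, 0.030, 0.048, 0.200, 0.450]

# Holm-Bonferroni keeps the family-wise error rate at 5% across all comparisons.
reject, adjusted, _, _ = multipletests(raw_pvalues, alpha=0.05, method="holm")

for raw, adj, sig in zip(raw_pvalues, adjusted, reject):
    print(f"raw p={raw:.3f}  adjusted p={adj:.3f}  significant={sig}")
```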

Why did BTS and punishing workers for disagreement significantly improve performance where so many of the other incentive schemes failed? The answer hinges on the fact that both conditions tied workers' payoffs to their ability to think about their peers' likely responses. (We elaborate on the argument in more detail in the paper.)

Does this mean that we should give up on simple financial or social-psychological incentives? Probably not. The fact that we conducted the experiment on MTurk means that the deck may have been stacked against incentives like the "trust" condition I described earlier. Because requesters on MTurk face little oversight, workers are more likely to respond to explicit financial incentives than to stated promises. In this sense, the marketplace has structured the interaction between workers and requesters in a way that may limit the opportunities to harness motivations that are not linked to money in some explicit way.

You can download the full paper to read more.