Our recent post about confidence bias, where we showed that most contributors vastly overestimate their own ability to complete tasks correctly, raised a lot of questions about how we manage quality at CrowdFlower. You might remember these themes from such classic posts as: AMT is Fast, Cheap and Good or the Wisdom of Small Crowds series.
The standard CrowdFlower model is agnostic towards the quality of any individual contributor. Typically, we let anyone attempt a task, using our technology to filter out low-quality contributors and score the responses. Without further ado, what follows is a quick review of the steps we take to do that filtering.
Gold (What is it Good For?)
In almost every job, we take a subset of the data to be processed and manually score the correct response. This manually-scored set, which we refer to as Gold Standard Data, is at the core of managing quality in the context of enterprise crowdsourcing:
- Filtering: We use Gold to create an up-front test, creating a barrier to entry such that only workers who understand and successfully complete a task are allowed to participate. This allows us to prevent unsavory characters from entering jobs and contaminating results.
- On-going Training: We also use Gold to conduct on-going training, offering corrections for units that are answered incorrectly. This allows us to continually instruct and improve highly prolific contributors.
- Dynamic Trust Score: We use each contributor’s performance on Gold as a basis to determine their overall accuracy within a task. Each contributor must exceed our minimum trust thresholds to continue working on a task. If at any point a contributor falls below the trust threshold, we’ll exclude their work.
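To make the trust-score idea concrete, here is a minimal sketch in Python. It assumes that trust is simply the fraction of Gold units a contributor has answered correctly so far; the names `GoldRecord` and `is_trusted` and the 0.7 default are hypothetical, and the real scoring may well differ.

```python
from dataclasses import dataclass


@dataclass
class GoldRecord:
    """Running tally of one contributor's answers on Gold units."""
    correct: int = 0
    seen: int = 0

    def record(self, is_correct: bool) -> None:
        """Update the tally after the contributor answers one Gold unit."""
        self.seen += 1
        if is_correct:
            self.correct += 1

    @property
    def trust(self) -> float:
        # Assumption: trust is plain accuracy on the Gold seen so far.
        return self.correct / self.seen if self.seen else 0.0


def is_trusted(rec: GoldRecord, threshold: float = 0.7) -> bool:
    """Hypothetical rule: keep a contributor only while trust >= threshold."""
    return rec.seen > 0 and rec.trust >= threshold
```

Because the tally is updated on every Gold unit, a contributor who starts strong and then slips is caught as soon as their running accuracy drops below the threshold.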
Because creating Gold is labor-intensive, we’ve created an automated process to generate Gold using units that have already been completed. This has significantly reduced the time needed for setup and ongoing job creation, without sacrificing our ability to differentiate contributors.
Of course, the amount and distribution of Gold is critical. Often, a uniform distribution of Gold across response types is ideal, though in certain situations we’ll use a skewed Gold set. For example, in an experiment on crowdsourced document review, we used a skewed Gold set to avoid missing relevant documents (reduced “false negatives,” if you prefer).
Department of Judgment Redundancy Department
If the purpose of Gold is to manage the quality of individual contributors, we use multiple judgments per unit to improve the accuracy of completed units. The basic premise is simple enough. We look for agreement among trusted workers to indicate correct responses at the unit level. For example, if we ask four people to verify a phone number for a business, the answer is more likely to be correct if all four agree. In fact, every unit processed by CrowdFlower is annotated with a response as well as a Confidence Score (based on agreement weighted by Trust, plus some secret sauce).
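As a rough illustration of agreement weighted by Trust, the sketch below sums the trust behind each distinct answer and reports the winner's share of the total. This is only an assumed, simplified formula (the function name `aggregate` is hypothetical), and it leaves out whatever the secret sauce adds.

```python
from collections import defaultdict


def aggregate(judgments):
    """Pick the answer with the most trust behind it.

    judgments: list of (answer, trust) pairs, one per trusted contributor.
    Returns (winning_answer, confidence), where confidence is the share of
    total trust that backs the winning answer (an assumed formula).
    """
    weight = defaultdict(float)
    for answer, trust in judgments:
        weight[answer] += trust
    total = sum(weight.values())
    if total <= 0:
        raise ValueError("need at least one judgment with positive trust")
    best = max(weight, key=weight.get)
    return best, weight[best] / total


# Three contributors verify a phone number; two of them agree.
answer, confidence = aggregate(
    [("555-0100", 0.9), ("555-0100", 0.8), ("555-0199", 0.7)]
)
```

Here the two agreeing contributors carry 1.7 of the 2.4 total trust, so the confidence comes out at about 0.71. If all contributors agree, the confidence is 1.0 regardless of their individual trust.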
More generally, assume we set the trust threshold for a given job at 70 percent (meaning that anyone who doesn’t answer at least 70 percent of Gold correctly gets booted) and that contributors are uniformly distributed in terms of ability (not true, but convenient). We can easily model the effect of additional judgments on estimated accuracy, showing that the probability that the majority response is correct increases with the number of judgments collected.
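One way to reproduce this kind of model is sketched below. It adds assumptions that are not stated above: the question is binary, contributors err independently, ability is uniform between 0.7 and 1.0 (so a randomly chosen trusted contributor is right 85 percent of the time on average), and ties are broken by a coin flip. The function name `majority_correct` is hypothetical.

```python
from math import comb


def majority_correct(n: int, p: float = 0.85) -> float:
    """Probability that the majority of n independent binary judgments
    is correct, when each judgment is correct with probability p.
    Ties (possible only for even n) are broken by a fair coin flip.
    """
    total = 0.0
    for k in range(n + 1):  # k = number of correct judgments
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)
        if 2 * k > n:
            total += prob  # clear correct majority
        elif 2 * k == n:
            total += 0.5 * prob  # tie, resolved at random
    return total


# Estimated accuracy for 1 to 10 judgments per unit.
curve = {n: majority_correct(n) for n in range(1, 11)}
```

Under these assumptions a single judgment is right 85 percent of the time and three judgments push the majority to roughly 94 percent. With coin-flip tie-breaking, an even number of judgments does no better than the odd number just below it, which is one reason odd counts are a natural choice for binary questions.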
Of course, while collecting 10 judgments per unit yields highly accurate results, it may not be the most efficient way to structure a job. Imagine that the first 2, 4 or even 6 contributors agree on how a unit should be classified. At some point, the marginal impact of an additional judgment is not worth the additional cost. We’ve automated a process to vary the number of judgments each unit receives based on agreement thresholds, so that we can reach accuracy targets more efficiently.
The following shows actual results from a sample job, where we set a minimum confidence threshold of 0.7:
Approximately 50 percent of units completed with just 2 judgments and 75 percent completed with 4 or fewer. In any case, each unit received only as many judgments as necessary to reach the confidence threshold. For any job, some subset of units will be ambiguous enough that they won’t reach a confidence threshold, so we also set a maximum judgments cap to “stop the bleeding.” Depending on the specific circumstances, we may reroute those ambiguous units to a parallel process with different structure, contributors, etc. for another round of judgments.
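The stopping rule described here can be sketched as a simple loop: keep requesting judgments until the confidence threshold is met or the cap is hit. The function name `collect_until_confident`, the minimum of 2 judgments, the cap of 10, and the trust-share confidence formula are all assumptions made for illustration.

```python
from collections import defaultdict


def collect_until_confident(next_judgment, min_confidence=0.7,
                            min_judgments=2, max_judgments=10):
    """Request judgments one at a time until confidence is high enough.

    next_judgment: callable returning one (answer, trust) pair per call.
    Returns (best_answer, confidence, judgments_used). If the cap is hit
    first, the returned confidence is below min_confidence and the unit
    can be rerouted for another round.
    """
    weight = defaultdict(float)
    best, confidence, count = None, 0.0, 0
    while count < max_judgments:
        answer, trust = next_judgment()
        weight[answer] += trust
        count += 1
        total = sum(weight.values())
        best = max(weight, key=weight.get)
        # Assumed confidence: share of total trust behind the leading answer.
        confidence = weight[best] / total if total else 0.0
        if count >= min_judgments and confidence >= min_confidence:
            break
    return best, confidence, count


# An easy unit: the first two contributors agree, so collection stops early.
stream = iter([("restaurant", 0.9), ("restaurant", 0.8), ("bar", 0.7)])
result = collect_until_confident(lambda: next(stream))
```

In this example the third judgment is never requested, because the first two already agree. A unit on which contributors keep splitting evenly would instead run up to the cap and come back with a confidence below the threshold.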
One More Trick
For complex tasks, we have developed a workflow management system to link together multiple jobs. For example, we might ask one pool of contributors to write a product description, verify the spelling and accuracy with a second pool and rank the subjective quality with a third pool. Alternatively, we might take a business listing and break out each attribute for independent collection and verification, with a separate job for name, address and phone number, or cuisine type, cash-only, types of credit cards accepted, on-site parking, or any other attribute that can be verified online. In general, peer review means that we can always give data a second pass to improve accuracy.