“I know it when I see it.” — Justice Potter Stewart
We have been running Crowdsifter, our content moderation product backed by Amazon’s Mechanical Turk, for a while now, and we wanted to share some quality metrics along with some stats on how our system aggregates redundant results to improve those metrics.
Controlling for Worker Quality, Bias, and Item Difficulty
In the graph above we picked the best error rate for raw AMT with 1–11 workers and the best error rate that Crowdsifter achieved on a porn judgment task with 2491 images (1006 porn, 1485 non-porn). The error rate is the rate at which wrong decisions are made; a wrong decision is whenever we label porn as non-porn or non-porn as porn. The experiment includes images that were labeled as ambiguous, which is why the error rates shown seem so high.
Using Crowdsifter with an average of 3.93 workers per image, we achieve the same minimum error rate as majority voting in raw AMT with 9 workers per image. We do this by controlling for worker quality: we keep track of each worker’s judgments, and if we have an “expert”-evaluated gold standard of what is pornographic, we can track which workers are doing a good job and which are doing a bad one. On non-gold-standard images we weight workers’ judgments by how much we trust them to reflect our standard of porn. Without these controls, majority voting in raw AMT is vulnerable to the many scammers that lurk there.
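Crowdsifter’s actual aggregation algorithm isn’t spelled out here, but the idea of trust-weighted voting can be sketched in a few lines. This is a minimal illustration, not our production code; all function and variable names are hypothetical:

```python
def worker_accuracy(judgments, gold):
    """Fraction of a worker's gold-standard judgments that match the expert label.

    judgments: list of (item, label) pairs from one worker
    gold: dict mapping gold-standard items to their expert labels
    """
    total = sum(1 for item, _ in judgments if item in gold)
    if total == 0:
        return 0.5  # no gold data yet: give the worker a neutral weight
    correct = sum(1 for item, label in judgments if gold.get(item) == label)
    return correct / total

def weighted_vote(votes, weights):
    """Aggregate (worker, label) votes on one image, weighting each vote
    by that worker's estimated accuracy instead of counting it as 1."""
    score = {"porn": 0.0, "non-porn": 0.0}
    for worker, label in votes:
        score[label] += weights.get(worker, 0.5)
    return max(score, key=score.get)
```

Under this scheme a known scammer’s votes carry almost no weight, so a single trusted worker can outvote them, which is exactly what unweighted majority voting cannot do.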
For images where obscenity is particularly ambiguous, we can allocate more workers, which gives us a better sampling of whether an image is obscene. Other images don’t need many judges to be accurately classified as porn; we can identify these easy cases by sampling a small group of workers and checking whether they all agree. Since using too many judges per image quickly becomes prohibitively costly, this scheme lets us allocate workers dynamically. Raw AMT, by contrast, is both wasteful and inefficient: it applies many judgments to easy items while not using enough judgments on hard ones.
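The dynamic allocation loop above can be sketched as a simple stopping rule: ask for a small initial batch of judgments, and only pay for more when the workers disagree. Again, this is an illustrative sketch with hypothetical names, not Crowdsifter’s actual policy:

```python
def collect_judgments(get_judgment, initial=3, max_workers=11, agree_threshold=1.0):
    """Ask workers for labels one at a time, stopping early once agreement
    on the majority label reaches agree_threshold.

    get_judgment: callable returning one worker's label for this image
    """
    labels = [get_judgment() for _ in range(initial)]
    while len(labels) < max_workers:
        majority = max(set(labels), key=labels.count)
        agreement = labels.count(majority) / len(labels)
        if agreement >= agree_threshold:
            break  # easy item: workers agree, no need to buy more judgments
        labels.append(get_judgment())  # ambiguous item: allocate another worker
    return labels
```

An unambiguous image stops at the initial batch of 3, while a contentious one keeps drawing workers up to the cap, concentrating spending where it actually improves accuracy.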
The raw error rate includes both images incorrectly labeled as porn and images incorrectly labeled as non-porn. In content moderation we want to minimize our porn miss rate (also known as the false negative rate), because we don’t want to let any porn onto our site. The graph of the porn miss rate corresponding to the graph above is shown below.
The most important metric is the porn miss rate, and ours is close to the rates achieved with 9 to 11 workers per image on raw AMT, even though we use fewer than half that many workers, which significantly cuts our costs.
We can adjust our certainty thresholds to lower the porn miss rate, but we do so at the risk of labeling all our images as porn, in which case nothing would make it onto our site. Adjusting the threshold to minimize the porn miss rate while maintaining an acceptable non-porn miss rate is a task Crowdsifter can readily handle.
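To make the trade-off concrete, here is a toy illustration of how moving a decision threshold shifts errors between the two miss rates. The scores and labels below are synthetic, made up purely for illustration, not real Crowdsifter data:

```python
# Synthetic (porn_score, true_label) pairs: porn_score is an aggregated
# confidence that an image is porn, on a 0-1 scale.
items = [(0.9, "porn"), (0.6, "porn"), (0.4, "porn"),
         (0.45, "non-porn"), (0.2, "non-porn"), (0.1, "non-porn")]

def miss_counts(threshold):
    """Count porn misses (porn scored below threshold) and non-porn misses
    (clean images scored at or above threshold) at a given threshold."""
    porn_misses = sum(1 for score, label in items
                      if label == "porn" and score < threshold)
    non_porn_misses = sum(1 for score, label in items
                          if label == "non-porn" and score >= threshold)
    return porn_misses, non_porn_misses
```

Lowering the threshold from 0.5 to 0.3 in this toy data catches the borderline porn image but starts flagging a clean image: the porn miss rate falls while the non-porn miss rate rises, which is exactly the trade-off the threshold has to balance.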
We’ll save what we can do with threshold adjustment for a later blog post.
Thanks to Brendan for his help with this post.