By Lukas Biewald, August 2, 2013

Crowdsourcing and Machine Learning

Ever since I started CrowdFlower, people have asked me if I’m worried that advances in artificial intelligence will threaten our business model. My answer is, and always has been, that advances in artificial intelligence actually drive increasing demand for crowdsourcing. Here’s why.

Many of the things CrowdFlower is great at are things that we would expect machine learning to be able to do. For example, two of our most common use cases are sentiment analysis and photo moderation, and there are hundreds of companies that automate each. But automation only works up to a point. When it fails, marketers get back inaccurate or inconclusive data. Worse yet, application users are exposed to potentially offensive content.

When it works, machine learning is incredibly scalable, and models get better all the time. So it’s easy to conclude that they will one day replace crowdsourcing. When it doesn’t work, the people using those applications pay for the bad results. But as it turns out, the results of crowdsourced jobs can be used to train machine learning models. Humans, in effect, teach the models and make them better.

So, instead of replacing crowdsourcing, machine learning makes crowdsourcing more valuable. All machine learning models need training data to work well, and crowdsourcing is the perfect way to get that data. A sentiment algorithm needs to be trained on positive and negative examples to learn what words and sentences suggest positive or negative sentiment about a topic.
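To make "trained on positive and negative examples" concrete, here is a minimal sketch of a word-count (naive Bayes) sentiment classifier in plain Python. The four labeled sentences are made up for illustration, and the function names `train` and `predict` are assumptions of this sketch, not anything CrowdFlower ships. A real model would need far more crowd-labeled data.

```python
import math
from collections import Counter

def train(examples):
    """Count words per label from (text, label) pairs; labels are 'pos' or 'neg'."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the higher naive Bayes log-score for the text."""
    word_counts, label_counts = model
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total = sum(label_counts.values())
    scores = {}
    for label in ("pos", "neg"):
        n_words = sum(word_counts[label].values())
        # Assumes each label has at least one training example.
        score = math.log(label_counts[label] / total)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical crowd-labeled training data (illustrative only).
crowd_labels = [
    ("love this phone", "pos"),
    ("great battery and great screen", "pos"),
    ("terrible support", "neg"),
    ("hate the slow updates", "neg"),
]
model = train(crowd_labels)
print(predict(model, "great phone"))
print(predict(model, "terrible updates"))
```

The model knows nothing about sentiment except what the labeled examples tell it, which is why the quality and quantity of those labels matter so much.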

And the models work best if they are trained on data that looks like the data they are trying to predict. So, for example, if you are trying to model sentiment of tweets about your brand, the best data to use for your model is tweets about your brand. If you’re modeling sentiment in Japanese press releases, you would need a completely different set of training data. For the best accuracy, subtle differences can matter a lot – an algorithm trained on the sentiment of tweets in general might not work as well as an algorithm trained on tweets for a specific brand or tweets trying to understand the reaction to a press release in Japan.

Crowdsourcing makes it really easy to collect high-quality training data for exactly the model you want to build. Once you get your data, you can use a site like kaggle.com that will connect you with an expert who can build a great model for your specific needs. As your accuracy needs grow, you can easily collect more data and improve your model. Many of the best models in use today are iterated on continually so that they keep improving.

Anyone can now use a platform like CrowdFlower to collect data, and a platform like Kaggle to build a model, meaning that a non-technical person can now build a machine learning application for a few thousand dollars that is likely to be better than even the most sophisticated off-the-shelf algorithm.

If you really want to get fancy, modern machine learning algorithms can often estimate their own confidence in the accuracy of their predictions. When the model’s self-reported confidence is below a certain threshold, you can automatically have the crowd label that item instead. The data that the crowd collects can be used for your application and also fed back into the algorithm to improve it. This process is called Active Learning and is known to work really well. (There is a good primer here from our friend Panos Ipeirotis, and some deeper stuff from Robert Munro here.)
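The confidence-threshold loop described above can be sketched in a few lines of Python. Everything here is a stand-in: `fake_model` and `fake_crowd` are hypothetical placeholders for a real classifier and a real crowdsourcing job, and the 0.8 threshold is an arbitrary choice for illustration.

```python
def active_learning_round(items, predict_with_confidence, ask_crowd,
                          training_set, threshold=0.8):
    """Keep confident predictions; send the rest to the crowd and save the labels."""
    results = {}
    for item in items:
        label, confidence = predict_with_confidence(item)
        if confidence < threshold:
            # The model is unsure, so a human decides.
            label = ask_crowd(item)
            # The human label is also kept for the next retraining run.
            training_set.append((item, label))
        results[item] = label
    return results

def fake_model(text):
    # Hypothetical model: confident only when the text contains "love".
    return ("pos", 0.95) if "love" in text else ("pos", 0.55)

def fake_crowd(text):
    # Hypothetical crowd job: always answers "neg" in this demo.
    return "neg"

training_set = []
results = active_learning_round(
    ["love it", "meh, it broke"], fake_model, fake_crowd, training_set
)
print(results)
print(training_set)
```

The application gets an answer for every item, and only the items the model struggled with cost crowd time. Those same items are the ones most likely to improve the model when it is retrained.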

I’m excited to see an explosion of extremely domain-specific machine learning applications.