By Lukas Biewald, September 17, 2014

Why I Started CrowdFlower in Two Graphs

Today we announced our Series C funding. It’s been a long journey, but it feels incredibly satisfying to have built something that companies use every day and that has paid many, many millions of dollars to people around the world for real, meaningful work. We’ve helped companies from eBay to Microsoft to Unilever to Edelman to Instagram run more efficiently with the more than seven man-years of work completed every day on our platform.

We plan to use the investment to improve our product and build awareness. But rather than write the standard funding post, I want to talk about why I founded the company, because the best decisions we’ve made have come from going back to our roots and reminding ourselves why we’re doing this in the first place.

I started CrowdFlower because I love data and I want to see every company use data effectively. From machine learning to A/B testing, I think it’s awesome to see companies viewing good data analysis as a crucial part of their business.

So, why would loving data inspire me to build a massive global work platform? Let me explain with a few graphs.

More Data is Better


Here is a graph from one machine learning paper, “Active Semi-Supervised Learning for Improving Word Alignment.” There are thousands of graphs just like it in other papers and reports.

The authors want the reader to see that the learning technique shown by the green line works better than the other techniques.

But the graph really has a much more important conclusion: more data is better. This paper cuts the error rate in half simply by collecting more data for the model. The impact of increased data far outweighs the effect these authors are writing about!

We spend years and years building fancier techniques that rarely ever approach the effectiveness of just collecting more data. Which brings me to my second graph…

Clean Data is Even More Important


Here I took a simple, small data set and I added a little bit of noise. Changing the data from 100% accurate to 95% accurate nearly doubled the model’s error rate, and making the data only 90% accurate nearly doubled the model’s error rate again. From my experience running CrowdFlower, 90% accuracy would be surprisingly good even for a dataset that a customer considers clean.

A bad model on good data is much better than a good model on bad data. Clean data is even more important than more data. And yet we never clean up our data.
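For readers who want to try this kind of noise experiment themselves, here is a minimal sketch. It is not the data set or model behind the graph above: the synthetic two-cluster data, the 1-nearest-neighbor classifier, and every function name (`make_dataset`, `add_label_noise`, `error_rate`) are assumptions chosen only to keep the example short and dependency-free. The numbers it prints will differ from the ones quoted in this post.

```python
import random


def make_dataset(n, rng):
    """Draw n points from two 1-D clusters: class 0 near -1, class 1 near +1.

    Hypothetical synthetic data, not the data set from the post.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 * label - 1.0, 0.5)
        data.append((x, label))
    return data


def add_label_noise(data, accuracy, rng):
    """Flip each binary label with probability (1 - accuracy)."""
    noisy = []
    for x, label in data:
        if rng.random() >= accuracy:
            label = 1 - label
        noisy.append((x, label))
    return noisy


def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def error_rate(train, test):
    """Fraction of test points that 1-NN, fit on train, gets wrong."""
    wrong = sum(1 for x, y in test if predict_1nn(train, x) != y)
    return wrong / len(test)


if __name__ == "__main__":
    rng = random.Random(0)
    train = make_dataset(300, rng)
    test = make_dataset(1000, rng)  # test labels are left clean

    for accuracy in (1.0, 0.95, 0.90):
        noisy_train = add_label_noise(train, accuracy, rng)
        print(f"label accuracy {accuracy:.0%}: "
              f"test error {error_rate(noisy_train, test):.3f}")
```

Noise is added only to the training labels; the test labels stay clean so that the reported error reflects how the model performs against the truth. Swapping in a different model or a real data set is the natural next step.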

Data Enrichment

Together, I think of collecting and cleaning data as “data enrichment.” It’s the easiest way to make models and analysis more accurate, and no one does enough of it.

So why don’t data scientists and analysts enrich their data? If you had asked me when I was a data scientist, I would have said that it wasn’t “scalable.” Collecting and cleaning data is hard, expensive, and time-consuming.

That’s why I started CrowdFlower.

Data nerds want control over the data collection process, but they don’t want to manage an army of people. They definitely don’t want to manage an outsourcing vendor. Our goal is to give our data-focused customers control of exactly how their data is enriched.

We make it easy for data scientists to enrich their own data. This makes their models and analysis much more powerful. I believe that CrowdFlower is a crucial piece of making artificial intelligence a real part of every product, and big data analysis a common practice at every company.