Science is Broken
The distressing recent article in The Economist, “Trouble at the lab,” reminds us just how many scientific experiments are either wrong or not replicable. Experiments are expensive. As a result, scientists collect just enough data to show statistical significance and publish only when they find a positive result. This creates a selection bias in which a large fraction of published papers report incorrect results.
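The arithmetic behind that selection bias can be sketched with a small simulation. The numbers below (fraction of true hypotheses, statistical power, significance threshold) are illustrative assumptions, not figures from the article:

```python
import random

random.seed(0)

# Illustrative assumptions: 10% of tested hypotheses are true,
# studies are underpowered (power = 0.5), alpha = 0.05, and only
# "significant" positive findings get published.
N_STUDIES = 100_000
PRIOR_TRUE = 0.10   # fraction of hypotheses that are actually true
POWER = 0.50        # chance a real effect reaches significance
ALPHA = 0.05        # chance a null effect reaches significance anyway

published_true = 0
published_false = 0
for _ in range(N_STUDIES):
    effect_is_real = random.random() < PRIOR_TRUE
    significant = random.random() < (POWER if effect_is_real else ALPHA)
    if significant:  # only significant results make it into print
        if effect_is_real:
            published_true += 1
        else:
            published_false += 1

false_fraction = published_false / (published_true + published_false)
print(f"Fraction of published results that are false: {false_fraction:.0%}")
```

Under these assumptions, nearly half of what gets published is a false positive, even though every individual study followed the rules.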
The Economist suggests that this is due to the pressure to publish in academia. I believe the fundamental problem is that good data is expensive. Scientists would love to replicate their results and report higher confidence levels, but they don’t have the money or the time.
Machine Learning has a Problem
According to a researcher cited in The Economist article, the problem is worst when it comes to machine learning. More than three-quarters of published results may be invalid due to overfitting, which occurs when a researcher tries thousands or millions of approaches on the same data set and picks the best one. This is a problem because, with so many approaches to choose from, some will work better than others because of random chance and not because they actually better explain the data.
There are techniques that deal with the problem of overfitting in different domains but the only really effective, systematic remedy is an abundant supply of labeled data. Researchers want to avoid overfitting, but they can’t afford to collect the necessary volume of data.
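A toy simulation makes the overfitting trap concrete (the data-set and model counts here are illustrative). The labels are pure noise, so no model can genuinely beat 50% accuracy; yet if you try enough “models” on the same small data set and keep the winner, the winner looks impressive:

```python
import random

random.seed(1)

N_POINTS = 50        # small, expensive-to-collect data set
N_MODELS = 10_000    # approaches tried on the same data

# Labels are coin flips: there is no real pattern to find.
labels = [random.randint(0, 1) for _ in range(N_POINTS)]

best_accuracy = 0.0
for _ in range(N_MODELS):
    # Each "model" is just random guessing.
    guesses = [random.randint(0, 1) for _ in range(N_POINTS)]
    accuracy = sum(g == y for g, y in zip(guesses, labels)) / N_POINTS
    best_accuracy = max(best_accuracy, accuracy)

print(f"Best 'model' accuracy on pure noise: {best_accuracy:.0%}")
```

The best of ten thousand random guessers scores well above chance on this sample, but on a fresh labeled sample it would fall straight back to 50%. That is why an abundant supply of labeled data is the systematic remedy: held-out data exposes the illusion.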
Crowdsourcing Is the Answer: Better Data for Researchers
Crowdsourcing makes it an order of magnitude cheaper and faster to collect labeled data or to survey a wide swath of people. We see more and more researchers turning to our crowdsourcing platform to collect data. Twenty-five percent of our customer base consists of university students and faculty.
Tailoring Crowdsourcing for Research
While I believe that crowdsourcing has the ability to fix science, we still have some work to do. For example, given that crowdsourcing is done online while so much research data is collected via in-person experiments, we need to adapt decades of human-subject protocols to work for crowdsourcing methodologies. There are also data verification and selection bias challenges that come with crowdsourcing, but these are things CrowdFlower routinely deals with.
Tackling these challenges is both possible and well worth it. If we can make it easy to conduct research with crowdsourcing, every researcher will want to check and replicate his or her results. One of our senior technical analysts, Emma Ferneyhough, was previously a post-doctoral research scientist with the Affective Cognitive Neuroscience Lab at UC Berkeley’s Helen Wills Neuroscience Institute. After joining CrowdFlower, she was excited to see if she could replicate her lab findings—and she did. Check out Emma’s blog post about it.
Crowdsourcing Can Go Beyond Fixing Research
Crowdsourcing can enable research that was infeasible or impractical in the past. Galaxy Zoo has used amateur astronomers to create a database of galaxies orders of magnitude greater than anything that existed before, directly leading to hundreds of papers. CrowdFlower has helped scientists count TB cells, label mouse cortices, and track epidemics on Twitter at scales never before possible. When scientists can create their own data sets, they can look for patterns that can only be found at scale and discover new results faster than was ever possible.
If there is “a train wreck looming,” as Nobel Prize winner Daniel Kahneman says, crowdsourcing is not only the best way to avoid it; it also provides a clear path to better data for researchers.