It is often cited that 90% of the world’s information was created within the last two years. The increasing pace of creation of this “big data” has led to a huge amount of hype, and organizations are scrambling to store and analyze their rapidly expanding treasure troves of data. Companies like Tableau, Cloudera, and Splunk have quickly grown to be worth billions of dollars because people are looking for tools that can help them make sense of this torrent of information.
Struggles With Messy, Unstructured Data
Unfortunately, most of that information is a mess, and the current analytics tools don’t help much outside of specific use cases. Anant Jhingran, formerly of IBM Research, estimates that 85% of business-relevant information originates in unstructured form, primarily as text. This text comes from sources such as books, database systems, websites, and web server logs that store information in different and often unspecified formats. Natural language is especially difficult to deal with because the same concept can be expressed in so many ways and linguistic context is often central to understanding. Attempts to automatically derive meaning from natural language push up against difficult problems. For every piece of text, we must first struggle with minutiae such as whether it is in the correct character encoding, has standard line endings, and contains correctly rendered HTML entities. In some cases, several character encodings might be mixed in a single file, leading to corruption of portions of the text or parsing errors. Then we need to determine which language the text is in and extract its meaning, often by splitting it into words or word pairs based on various punctuation rules and then analyzing the usage patterns. This is a complicated process, and there are many places along the way where things can go wrong.
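To make those minutiae concrete, here is a minimal cleanup-and-tokenization sketch in Python. The sample bytes, the encoding fallback order, and the crude regex tokenizer are all illustrative assumptions, not a production pipeline:

```python
import html
import re
import unicodedata

def clean_text(raw):
    # Try a couple of common encodings; real files sometimes mix
    # encodings, so a replacement fallback avoids hard failures.
    for encoding in ("utf-8", "latin-1"):
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        text = raw.decode("utf-8", errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    text = html.unescape(text)                  # render HTML entities like &amp;
    return unicodedata.normalize("NFC", text)   # normalize Unicode form

def tokenize(text):
    # Crude word splitting; real tokenizers are language-aware.
    return re.findall(r"\w+", text.lower())

sample = b"Caf\xc3\xa9 reviews &amp; ratings\r\nare a mess"
print(tokenize(clean_text(sample)))
```

Even this toy version has to make judgment calls (fallback encodings, what counts as a word) that can silently change downstream results.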
The Limits of Statistical Power
Even if we handle all the technical and natural language problems correctly, we are likely to get into trouble when doing statistical inference with such vast amounts of data. The more features that we have for each datapoint, perhaps into the millions or billions, the more likely we are to see correlations arising simply by chance. Michael Jordan, one of the foremost machine learning experts in the world, put it this way:
So it’s like having billions of monkeys typing. One of them will write Shakespeare.
I like to use the analogy of building bridges. If I have no principles, and I build thousands of bridges without any actual science, lots of them will fall down, and great disasters will occur.
Similarly here, if people use data and inferences they can make with the data without any concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re an engineer and a statistician—then you will make lots of predictions, and there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.
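Jordan’s monkeys-at-typewriters point is easy to demonstrate. The sketch below (synthetic data, arbitrary seed and sizes) measures the strongest correlation between a purely random target and growing numbers of purely random features; any correlation it finds is there by chance alone:

```python
import random
import statistics

def pearson(x, y):
    # Plain Pearson correlation, no external libraries.
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def max_spurious_corr(n_features, n_samples=30, seed=0):
    # Target and features are all independent noise, so every
    # correlation observed here is purely accidental.
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    return max(
        abs(pearson([rng.gauss(0, 1) for _ in range(n_samples)], target))
        for _ in range(n_features)
    )

for n in (10, 100, 10000):
    print(f"{n:>6} noise features -> strongest |r| = {max_spurious_corr(n):.2f}")
```

The strongest chance correlation only grows as the feature count climbs, which is exactly why millions of features demand statistical discipline rather than celebration of whatever pattern scores highest.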
More Data, More Problems?
In addition to the hazards of big data that Michael Jordan mentions, there is the reality that working with increasingly large datasets can have diminishing returns and is sometimes overkill. It can result in longer training times for models, larger model files, and resources wasted on Hadoop and other “big data” tools when the existing toolset and some sampling could have produced an adequate solution quickly. Don’t spend so much time trying to work with more data that you neglect the quality of the data you already have.
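To illustrate why sampling is often adequate, here is a small Python sketch (synthetic data, arbitrary sizes) comparing a summary statistic computed from a 1% random sample against the same statistic from the full dataset:

```python
import random
import statistics

random.seed(42)
# Stand-in for a "big" dataset: one million synthetic session durations.
full_data = [random.expovariate(1 / 120) for _ in range(1_000_000)]

sample = random.sample(full_data, 10_000)  # a 1% simple random sample

full_mean = statistics.fmean(full_data)
sample_mean = statistics.fmean(sample)
relative_error = abs(sample_mean - full_mean) / full_mean
print(f"full mean:   {full_mean:.1f}")
print(f"sample mean: {sample_mean:.1f} ({relative_error:.1%} off)")
```

If a fraction-of-a-percent error is acceptable for the question at hand, the sample answers it in a fraction of the time, with no cluster required.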
Put simply: predictive models based on messy and incomplete data don’t work as well. More data isn’t always the answer, either. It is often better to spend time cleaning, understanding, and enriching your data instead of accumulating additional noisy data. Google famously fell into this trap when they launched Flu Trends, which they claimed could predict the number of flu-related visits to doctors based on search trends. It worked very well for a little while, and it was paraded around as an example of the power of big data. But then something happened: Flu Trends’ predictions were totally wrong three years in a row. Throwing algorithms at huge volumes of search data was insufficient, and solving the problem on an ongoing basis required a different approach. In this example, the solution was integrating information from the United States Centers for Disease Control and Prevention (CDC) in order to create a richer dataset. By calibrating the magnitude of the algorithms’ predictions to the recent case results from the CDC, researchers were able to significantly reduce the error rates.
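One simple way to do the kind of calibration described above is to rescale the model’s raw predictions so that, over a recent window, they match the ground-truth counts on average. The numbers below are hypothetical, and this is a deliberately simplified sketch, not the published Flu Trends methodology:

```python
def calibrate(predictions, recent_predictions, recent_true_counts):
    # Rescale raw predictions so their recent total matches the
    # ground truth (e.g., CDC case counts) over the same window.
    scale = sum(recent_true_counts) / sum(recent_predictions)
    return [p * scale for p in predictions]

# Hypothetical numbers: the model has been running ~25% hot lately.
recent_model = [125, 130, 120]
recent_cdc = [100, 104, 96]
upcoming = calibrate([140, 135], recent_model, recent_cdc)
print([round(p, 1) for p in upcoming])
```

The richer dataset does the real work here: without an authoritative reference series to calibrate against, there is no way to know the model is drifting, let alone by how much.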
The People-Powered Approach
Using people-powered technology is a scalable way to clean and enrich the data that is needed to train an excellent model. People can proofread portions of datasets that we need for analysis and fix problems they contain, such as out-of-date information or typos that prevent items that should be identical from being matched. People can identify the sentiment of text, the locations shown in images, and the languages being spoken in audio recordings. They can categorize your data into coherent groups, match text based on meaning instead of word choice, and label more training data for your predictive model whenever you need it.
We’ve built a predictive model to identify suspicious users on the CrowdFlower platform that improves based on feedback about whether its predictions are correct. We initially pulled a list of users that appeared to be exhibiting “good” and “bad” behavior, used a portion of the data to train random forests, and then evaluated their performance on the remaining data. This worked moderately well, but we found that noise in the “good” and “bad” labels was making the resulting predictions less accurate. We realized that we needed better labels, so we decided to use specialized members of the crowd to help us collect them. After obtaining several thousand new labels, we were able to build a random forest that performed significantly better within a day of work.
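The experiment above can be sketched roughly as follows. This uses scikit-learn on synthetic data with artificially flipped labels; it is not our actual fraud model or dataset, just an illustration of comparing a forest trained on noisy labels against one trained on cleaner labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "good"/"bad" account data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulate noisy labels: flip 25% of the training labels at random.
rng = np.random.default_rng(0)
noisy = y_train.copy()
flip = rng.random(len(noisy)) < 0.25
noisy[flip] = 1 - noisy[flip]

scores = {}
for name, labels in [("noisy", noisy), ("clean", y_train)]:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, labels)
    scores[name] = clf.score(X_test, y_test)  # accuracy vs. true labels
    print(f"{name} labels -> test accuracy {scores[name]:.3f}")
```

Held-out evaluation against trustworthy labels is what makes the comparison meaningful; the crowd-collected labels play that role in our real pipeline.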
In order to track the accuracy of our model over time, we select a sample of the model’s predictions each day and ask CrowdFlower contributors to review the chosen accounts to determine whether they are actually suspicious. Each prediction is a unit in the CrowdFlower system, which we create automatically using an HTTP request to the API. Since our job has auto-order turned on, the units are automatically distributed to contributors within a few minutes, and a short time later we receive our results at our webhook URL. In other words, we are able to verify whether our predictions are getting better or worse over time by evaluating them nearly immediately with human expertise. This made it easy to get our predictive model built and tested quickly, even though we initially had very little training data and no clear process for gathering more or for evaluating which of our training data was reliable. It also allows us to compare multiple models on new data to see which generalizes more effectively.
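A unit-creation request might be assembled roughly as follows. The endpoint URL, field names, and credentials here are placeholders, not the real API contract; consult the CrowdFlower API documentation for the actual parameters:

```python
import json
from urllib import request

API_KEY = "YOUR_API_KEY"   # placeholder credential
JOB_ID = 12345             # placeholder job id
# Illustrative endpoint shape only; see the CrowdFlower API docs.
url = f"https://api.crowdflower.com/v1/jobs/{JOB_ID}/units.json?key={API_KEY}"

def build_unit(prediction):
    # Package one model prediction as a unit for contributor review.
    # The data field names are hypothetical.
    return {
        "unit": {
            "data": {
                "account_id": prediction["account_id"],
                "predicted_label": prediction["label"],
                "model_score": prediction["score"],
            }
        }
    }

payload = build_unit({"account_id": "u-1001", "label": "suspicious", "score": 0.91})
body = json.dumps(payload).encode("utf-8")
req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
# request.urlopen(req)  # left commented out: requires real credentials
print(json.dumps(payload, indent=2))
```

With auto-order enabled, posting a unit like this each day is enough to keep a rolling human-verified accuracy measurement flowing back to the webhook.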
A Ballooning Issue
Data scientists spend the majority of their time manipulating and cleaning data. With a shortage of data scientists and the amount of information growing exponentially, we must find better ways to gather, supplement, clean, and verify the data that serves as a foundation for product strategies, predictive models, and entire businesses. Big data leads us to new insights, but we need to approach it with healthy skepticism so we understand its limits before we accidentally overstep them.