Research & Insights

By Justin Tenuto, August 13, 2015

Why you should use open data to hone your machine learning models

One of the big reasons we created our Data for Everyone initiative is that there simply aren’t a ton of great open datasets out there for small businesses, startups, and academics to work with. Sure, there are plenty of small, toy-sized datasets, but those aren’t big enough to create algorithms that anyone can trust. In fact, our founder Lukas wrote as much in his post on Computerworld:

Remember when Netflix ran a contest to beat its own movie rating algorithm? Tens of thousands of solutions were submitted, all based off a single dataset of about 100,000,000 rows, and Netflix eventually awarded the million-dollar prize to a team of data scientists who beat the company’s algorithm by more than 10%.

Even now, a half decade since the prize was awarded, that Netflix data set is constantly used in computer science research — over 3,000 papers mention it. And almost all of the papers that mention it were written after the contest ended. It’s not that movie data is so important to computer science research — there just aren’t many good quality datasets available. The contest wasn’t the important thing — releasing the data was the real value to the world.

We believe that open data should become the new open source; we think shared, clean, enriched data is one of the key ingredients of real innovation.

That’s why we were incredibly happy to find out what the folks over at MonkeyLearn are doing with some of the sets we’ve shared in our Data for Everyone library. MonkeyLearn is a machine learning application that extracts and classifies information from text. They let you upload custom training sets and create your own machine learning algorithm that’s appropriate for your specific use case. After all, different organizations have different vernacular, and a one-size-fits-all solution works like a one-size-fits-all t-shirt: it doesn’t really fit anyone all that well.

Since MonkeyLearn extracts and classifies text, they do a lot of sentiment analysis work. As they write in their post, “sentiment analysis is damn hard” and “it’s one of the most complex machine learning tasks out there.” That’s true. People inherently understand opinion and feeling in text better than a computer can, but that doesn’t mean we always agree with each other. There are competing measurements of how often people concur on the sentiment of a particular piece of text, but the figure you commonly hear is around 70%. And, again, different industries have different words, slang, subcultures, in-jokes, and so on that don’t translate to other fields. To put it another way: a basketball “MVP” is a most valuable player; a tech “MVP” is a minimum viable product. Those aren’t even close to the same thing.

That’s why the folks at MonkeyLearn are so keen on custom sentiment analysis. An algorithm built for a specific use case is always going to perform better than a generic, out-of-the-box natural language processing (NLP) solution.

But don’t take our word for it. Here’s how MonkeyLearn used CrowdFlower to set a baseline for their sentiment model, then used free, open data from our Data for Everyone library to create an industry-specific algorithm that outperformed every other machine learning solution they tested.

Setting a benchmark

First, they collected about 2,000 tweets about brands, celebrities, movies, and tech. They ran a CrowdFlower job to label the tweets (meaning our pool of contributors tagged each individual tweet as positive, negative, or neutral) and compared those judgments to their sentiment algorithm’s predictions, as well as to a few other models out there.
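
Their post has the specifics of that comparison; purely as a rough sketch of the idea, here’s how you might measure agreement between a model’s predictions and CrowdFlower’s human judgments, assuming a hypothetical CSV export with crowd_sentiment and model_sentiment columns:

```python
# Sketch: how often does a sentiment model agree with the crowd labels?
# "labeled_tweets.csv" and its column names are hypothetical -- adapt
# them to however your labeled export is laid out.
import csv
from collections import Counter

agreements = 0
total = 0
misses_by_label = Counter()

with open("labeled_tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        human = row["crowd_sentiment"]  # positive / negative / neutral
        model = row["model_sentiment"]  # the classifier's prediction
        total += 1
        if human == model:
            agreements += 1
        else:
            misses_by_label[human] += 1  # which true label was missed

print(f"Agreement (accuracy): {agreements / total:.1%}")
print("Disagreements by true label:", dict(misses_by_label))
```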

[Image: accuracy of the generic sentiment model]

That 65.4% means that MonkeyLearn’s sentiment algorithm agreed with roughly two-thirds of our human-scored dataset. Interestingly, their model had more trouble with negative statements than with positive or neutral ones.

[Image: per-class F1 scores for the generic sentiment model]
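
If you want a per-class breakdown like this for your own model, scikit-learn makes it a one-liner. This is just an illustrative sketch with made-up labels, not MonkeyLearn’s evaluation code:

```python
# Sketch: per-class precision, recall, and F1 with scikit-learn.
# y_true holds the crowd labels, y_pred the model's predictions;
# the values here are made up for illustration.
from sklearn.metrics import classification_report

y_true = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
y_pred = ["positive", "neutral",  "neutral", "negative", "positive", "positive"]

# A low F1 on "negative" is exactly the weakness described above.
print(classification_report(y_true, y_pred,
                            labels=["positive", "neutral", "negative"]))
```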

And, as they point out, if you’re running sentiment analysis on your brand, odds are you’re more worried about those negatives than the positives. Hearing praise is all well and good, but if you have an issue with a recent product launch or your billing practices or what have you, knowing about it sooner rather than later lets you fix small problems before they become big ones.

So. We’re dealing with a 65.4% benchmark from MonkeyLearn on general sentiment. Here’s where the open data comes in.

Creating industry-specific models

MonkeyLearn hopped on Data for Everyone and downloaded a trio of free datasets we’ve made available for, well, this exact sort of thing. Those were:

- A set of tweets about major U.S. airlines, labeled for sentiment (the set we created internally and wrote about previously)
- A set of Apple-specific tweets, labeled for sentiment
- A set of tweets about a range of brands and products, labeled for sentiment

And here’s what they did with that data. First, they segmented each set into training and test data (this, of course, is a common practice for building and scoring algorithms). Next, they trained three separate classifiers, one on each set, and finally, they checked each model against its held-out test data. You can read their full post for details on how all the models worked, but since we created the airline set internally and wrote about it in a past post, we thought it would make sense to focus on that one.
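
MonkeyLearn’s post describes their actual models; purely as an illustration of the train/test workflow, here’s a generic scikit-learn sketch that does the same kind of thing with a simple bag-of-words classifier. The file and column names are hypothetical stand-ins for the downloaded datasets:

```python
# Sketch of the workflow: split each downloaded set into training and
# test portions, train a classifier on the training slice, and score it
# on the held-out slice. This is a generic bag-of-words pipeline, not
# MonkeyLearn's model; file and column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

for path in ["airline_tweets.csv", "apple_tweets.csv", "brand_tweets.csv"]:
    df = pd.read_csv(path)
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["sentiment"],
        test_size=0.2, random_state=42, stratify=df["sentiment"])

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(f"{path}: test accuracy = {model.score(X_test, y_test):.1%}")
```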

Now. Remember the 65.4% accuracy number? Trained on an industry-specific set and tested against an industry-specific set, their model’s accuracy improved by over 15%.

[Image: accuracy of the airline-specific sentiment model]

Notably, the negatives were especially accurate here, whereas they were the least accurate category in the original, more generic sentiment classifier built on a wider swath of data.

[Image: per-class F1 scores for the airline-specific sentiment model]

Just take a look at this negative keyword cloud. It’s basically everything you associate with a bad flying experience other than “where’s my luggage?” and “it sucked, we crashed.”

[Image: negative keyword cloud from the airline tweets]
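
A word cloud like this usually boils down to counting terms. Here’s one simple, hypothetical way to pull out the most common words in the negatively labeled airline tweets; it’s not necessarily how the cloud above was built:

```python
# Sketch: the most frequent terms in the negatively labeled tweets,
# which is roughly the information a negative keyword cloud conveys.
# File and column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("airline_tweets.csv")
negative_texts = df.loc[df["sentiment"] == "negative", "text"]

vec = CountVectorizer(stop_words="english", min_df=5)
counts = vec.fit_transform(negative_texts).sum(axis=0).A1  # total count per term
terms = vec.get_feature_names_out()

# Print the 25 most common negative terms, most frequent first.
for count, term in sorted(zip(counts, terms), reverse=True)[:25]:
    print(f"{term:<20} {int(count)}")
```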

While MonkeyLearn’s other models, based on the brand text and the Apple-specific text, showed substantive improvement, the airline sentiment classifier outperformed both. Why is that? Well, for starters, those tweets were much narrower in scope than the other datasets: they were tweets to customer service agents across a variety of major airlines.

But that’s not the important bit. The important bit is actually really simple: there’s just more data in the airline set. More rows of human-labeled training data means an algorithm has more examples to learn from, and more examples to learn from means more accuracy. Feeding just about any classifier another, say, 10,000 rows of labeled data would improve its accuracy.
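
If you want to see that effect for yourself, a quick learning-curve experiment makes the point: train the same simple classifier on progressively larger slices of the labeled data and watch the test accuracy climb. Again, this is a sketch with hypothetical file and column names, not a claim about MonkeyLearn’s exact numbers:

```python
# Sketch of a learning curve: the same simple classifier, trained on
# progressively more labeled rows, scored on a fixed test set.
# File and column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("airline_tweets.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2, random_state=42)

for n in (500, 1000, 2000, 4000, 8000):
    n = min(n, len(X_train))  # don't ask for more rows than we have
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train.iloc[:n], y_train.iloc[:n])
    print(f"{n:>5} training rows -> test accuracy {model.score(X_test, y_test):.1%}")
```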

And with all the time data scientists spend tweaking and perfecting their models, the simple efficacy of feeding those models more information is sometimes glossed over. It shouldn’t be. More data is the simple way to make data science more effective. It’s why we created our Data for Everyone library. It’s why we think open data should be the new open source. And it’s why we publish not only the most interesting datasets that come our way, but also the largest.

We’d again like to thank the folks at MonkeyLearn for sharing their findings. They have more information (and more sweet, sweet graphs) in their post, so we definitely recommend checking it out. You can get a feel for how their product works and how important custom models and custom training data can be for your application. And, of course, if you’ve used any of our Data for Everyone sets and want to let us know what you did with them, well, we’d love to know.