By Lukas Biewald, March 4, 2015

Data for Everyone: Why We Created an Open Source Data Library

We think that open data is the new open source. Data has become as important as software for many applications. And while there is plenty of open-source software available, there are almost no good sources of open data. Today, by releasing large, high-quality, real-world data sets collected by our customers and published with their permission, we’re trying to change that.

For example, say you wanted to create a search engine. You could easily find high-quality open-source software like Lucene, but where would you find the data to actually train the ranking algorithm? What if you wanted to start a new shopping site? You’d have the same problem. You could find many open-source software options, but how would you train your recommendation algorithm without data?

Open-source software levels the playing field. In the past you needed millions of dollars to launch and run a web app; today an engineer can do it on their own. Data has not caught up: to build a machine learning algorithm you need training data, and collecting that data is expensive and time-consuming. Today, with our Data for Everyone library, we’re making access to that data trivial for many applications.

We’ve long been fans of initiatives like the U.S. government’s Data.gov program and UC Irvine’s Machine Learning Repository, but as data becomes more critical to all kinds of applications, we need more open data libraries to give academics, researchers, and startups the tools they need for success. That’s why we’re proud to announce a new open data initiative we’re calling Data for Everyone.

Data for Everyone means access to open, enriched data sets, free for anyone to use. We’ve spent the last year asking some of our users about sharing their data, and we’ve launched our library with data sets ranging from sentiment analysis to biomedical imagery. We’ll be updating the library constantly, but here are a few highlights of what’s live now:


Smart phone app functionality

Working from the descriptions of thousands of Android apps, contributors selected functionalities from a list of 25 options. Those options ranged from “uses the phone’s flashlight” to “entertainment.” This data set includes the original description of each application, the assigned functionalities, and, like all our data sets, the confidence score of each judgement.

Biomedical image modalities

A large data set of labeled biomedical images. They range from x-rays to MRIs to graphs and diagrams (like you’d see in a biology text). The data set includes live image URLs and the specific modality for each image.

The relationship between real and nonce words

One of the many fascinating language data sets in our library, this one looks at the familiarity of real words and made-up ones. Contributors were given a fantastical word (like granjug) to stand in for a real one (like canine). The nonce word was used in a sentence, and contributors ranked how alike the two words felt, based on contextual clues, word sound, and so on.

Image keyword database

This is currently our largest data set, at 225,000 rows. Contributors were given an image and asked whether a given word described that picture. This data set includes live URLs for every image, the assigned descriptor, and a simple yes/no judgement (with confidence score).

Twitter airline sentiment

The data set on which we based our recent blog post. In it, we looked at major U.S. air carrier customer service handles and analyzed what sort of complaints each airline received. This data set includes tens of thousands of tweets, their corresponding carriers, the sentiment of each tweet (positive, negative, or neutral), and the specific reasons why users were negative about airlines.
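To give a feel for how a data set like this might be used, here is a minimal Python sketch that tallies complaint reasons per carrier and keeps only the judgements above a confidence threshold. The column names (airline, sentiment, sentiment_confidence, negative_reason), the sample rows, and the 0.8 cutoff are all assumptions made up for illustration; check the actual file's header before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The column names are assumptions for this
# sketch, not the published schema of any Data for Everyone file.
SAMPLE_CSV = """airline,sentiment,sentiment_confidence,negative_reason
Carrier A,negative,0.95,Late Flight
Carrier A,negative,0.40,Lost Luggage
Carrier A,positive,1.00,
Carrier B,negative,0.88,Late Flight
Carrier B,negative,0.91,Customer Service Issue
Carrier B,neutral,0.67,
"""


def negative_reasons_by_airline(csv_file, min_confidence=0.8):
    """Count negative-tweet reasons per airline, keeping confident rows only."""
    counts = {}
    for row in csv.DictReader(csv_file):
        # Only negative tweets carry a complaint reason.
        if row["sentiment"] != "negative":
            continue
        # Skip judgements the contributors were not confident about.
        if float(row["sentiment_confidence"]) < min_confidence:
            continue
        counts.setdefault(row["airline"], Counter())[row["negative_reason"]] += 1
    return counts


result = negative_reasons_by_airline(io.StringIO(SAMPLE_CSV))
for airline, reasons in sorted(result.items()):
    print(airline, dict(reasons))
```

To run this against a downloaded file, you would pass an open file handle in place of the in-memory sample, and adjust the column names and threshold to match the data.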


We’ll be publishing more data sets every month from here on out and talking about some of our favorites on the blog, but please let us know in the comments if there’s a particular kind of data you’re looking for. We’ll do our best to surface the sets our community most wants to see.