By Justin Tenuto, October 23, 2015

5 things we learned at the Rich Data Summit

Last week, we hosted the first annual Rich Data Summit in our hometown of San Francisco. And not to toot our own horn, but it went pretty damn well. We’ll be posting videos of all our great talks next week (our video folks are working on it as we speak), but we wanted to share a few themes that kept popping up at the conference in case you missed out. Here are the 5 things we learned at the Rich Data Summit:

Rich data isn’t just a buzzword

Over and over, we heard speakers talk about how one of the biggest challenges–if not the biggest challenge–in their day-to-day data science life is data quality. In other words, the difference between big data and rich data.

This distinction is one our keynote Nate Silver spelled out in an article he wrote this past February. The money quote: “By rich data, I mean data that’s accurate, precise and subjected to rigorous quality control.”

The example Silver gives to illustrate this notion revolves around baseball at-bats. The “big data” here is just the number of at-bats (somewhere over 14 million). That’s a fun factoid, but you can’t really do much with it. “Rich data,” meanwhile, is knowing the particulars of those at-bats. Stuff like how many people were on base, who was pitching, the actual blow-by-blow of how the at-bat went (how many pitches did the batter foul off, did he hit a triple), and so on. That data’s not only been meticulously collected, but more importantly, it’s data you can actually unpack and analyze.

This theme of quality, rich data came up constantly. Companies have been saving their data since before they really knew what they wanted to do with it. How we enrich this data to make it something that’s “accurate, precise, and subjected to rigorous quality control” is a challenge a lot of smart folks are grappling with.

Everyone’s talking about human-in-the-loop

From Rob Munro at Idibon to Raul Garreta from MonkeyLearn, we heard again and again how valuable people-powered data is. But not just for the sake of checking and cleaning datasets. The real value is using that human-curated data to train machine learning models.

Which, of course, makes a ton of sense. Machines excel at handling vast quantities of data while humans excel at enriching certain kinds of data. By combining the two–by using people to do things like judge search relevance and then using those judgments to tweak and perfect a search algorithm–you’re much, much more likely to create a model that behaves and makes decisions and judgments like a real person. You’ll see how constant this theme was next week when we publish all our videos.

Speaking of which, we’re launching a new product

Since CrowdFlower has long been used for exactly that reason (creating datasets with real people for human-in-the-loop machine learning, that is), we were thrilled to announce a new product we’re beta-testing this quarter: CrowdFlower AI.

What AI will allow you to do is really powerful. You can run the same CrowdFlower jobs like always, but then you’ll be able to take the data you’ve enriched and create a machine learning model tailored specifically for your unique use-case. Using your own training sets and your own models will give you a much higher level of accuracy than you would with an out-of-the-box language processor. How so? Let us explain with an (admittedly) silly example: Say you run a dairy and want to find out what people think about your cheese (we told you it was silly). If you used a generic natural language processor, chances are, the word “smelly” is going to trend towards negative. But for cheese? Well, as any lover of particularly pungent cheeses will tell you, “smelly” is actually a good thing.
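As a toy illustration of that cheese example (a pure-Python sketch with made-up data, not CrowdFlower’s actual product or API): a simple word-count classifier trained on generic reviews reads “smelly” as negative, while the same classifier trained on cheese-specific labels learns it’s praise.

```python
from collections import Counter, defaultdict

def train(labeled_reviews):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in labeled_reviews:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the label whose training data used the review's words most."""
    def score(label):
        return sum(counts[label][w] for w in text.lower().split())
    return max(counts, key=score)

# A generic sentiment training set: "smelly" shows up in negative reviews.
generic = [
    ("smelly and gross", "negative"),
    ("smelly old socks", "negative"),
    ("delicious and fresh", "positive"),
]

# A cheese-specific training set, labeled by people who know cheese.
cheese = [
    ("wonderfully smelly and ripe", "positive"),
    ("smelly in the best way", "positive"),
    ("bland and rubbery", "negative"),
]

review = "so smelly"
print(classify(train(generic), review))  # negative
print(classify(train(cheese), review))   # positive
```

Same word, opposite verdicts, purely because the training labels came from the right domain.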

Again, a fairly trivial example, but it illustrates the importance of having training sets for your use-case, not a general one. Our AI feature will allow you to train, review, improve, and tweak models. You’ll be able to use a variety of different models and find the one that works best for you. That, in turn, lets you use the enriched data you get from a CrowdFlower job and then run tons more data through your model. CrowdFlower AI will let you know where it’s confident and where it isn’t, so you can pull out low-confidence rows and get humans to judge those. You can put those judgments back into your model to further train it.
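That confidence-based routing can be sketched in a few lines (hypothetical names throughout–this is an illustration of the pattern, not CrowdFlower’s API): score each row, auto-accept the confident predictions, and queue the rest for human judgment.

```python
def route_rows(rows, predict, threshold=0.8):
    """Split rows into auto-labeled vs. needs-human-review buckets,
    based on the model's reported confidence."""
    auto, needs_review = [], []
    for row in rows:
        label, confidence = predict(row)
        if confidence >= threshold:
            auto.append((row, label))
        else:
            needs_review.append(row)
    return auto, needs_review

# A stand-in model: confident about "great", unsure about "smelly".
def toy_predict(text):
    if "great" in text:
        return ("positive", 0.95)
    return ("positive", 0.55)

rows = ["great cheddar", "smelly brie"]
auto, needs_review = route_rows(rows, toy_predict)
print(auto)          # [('great cheddar', 'positive')]
print(needs_review)  # ['smelly brie']

# The needs_review rows would go out as a human-judgment job, and the
# resulting labels would be folded back into the training set.
```

The loop closes when those human judgments retrain the model, raising its confidence on rows like the ones it just punted on.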

If you’re interested in giving AI a test run, just let us know here. We’ll be explaining it and showing it off a lot more in the coming months.

Data science is everywhere

This isn’t something we learned as much as something we were again reminded of. To say nothing of the diverse attendees, the breadth of speakers at the Rich Data Summit evidenced not just how many organizations use data science to make important decisions, but how many different kinds of organizations and decisions data science touches.

We had data scientists from tech giants like Uber, Pinterest, and eBay. We had data scientists from companies expressly in the data space, like Kaggle, Basis, and Mode. We had data scientists who’ve used data to solve important real world problems, like Catherine Bracy from Code for America, Eric Schles, who uses data to end human trafficking, and one of our keynotes, Beth Noveck, who spearheaded President Obama’s Open Government Initiative. And our morning keynote was Nate Silver of FiveThirtyEight, a well-trafficked data science blog that touches everything from sports to entertainment to politics.

The point is that while data science is still a relatively young field, it’s also everywhere. And since everything we do online (and, increasingly, offline) is fodder for some database somewhere, it’s not a field that’s going anywhere any time soon.

Data science is an open community

As our CEO Lukas mentioned during the day, we did a survey of data scientists last year and asked them what languages and programs they used. While the old mainstay of Microsoft Excel remains at the top of the list, 149 of 150 used at least some open source tools. And if you couple that with governments across the country (and the world) opening up their data, citizen science initiatives like Mark2Cure and Zooniverse, and free open source data libraries like UC Irvine’s Machine Learning Repository and (shameless plug alert) our own Data for Everyone library, you’ll see the pattern we’re highlighting here: data science is an open community. It’s full of sharing, of people piggy-backing on new advancements, of smart folks doing smart things and sharing their insights so anyone can analyze, visualize, or manipulate the data in whatever way they’re curious about.

It’s something that most often strikes non-data scientists and people from other fields. We’re so used to the idea that information is an incredibly powerful advantage and that having it means we shouldn’t share it. That’s not to say that plenty of data isn’t private–it is, of course–but a lot of the tools and models and techniques that create the best and most accurate solutions are out there for anyone to use. It’s refreshing.

In fact, the Rich Data Summit partnered with our friends at the Open Data Science Conference (ODSC) to help throw our conference. If you’re in the Bay Area (or even if you need to hop on a plane), we recommend attending their event on November 14th and 15th.

Lastly, we just want to send a huge and hearty thanks to everyone who came, attended, spoke, helped pour drinks, you name it. We had a tremendous time last week and are already excited for next October. As mentioned, we’ll have videos to share next week, but until then: again, our thanks. We had a blast.