I enjoyed hearing my friend Harper claim that “Big Data is Bullshit”. I remember meeting with him as he asked me and other Silicon Valley tech CEOs for techniques to handle the large databases of voters and donors the Obama campaign had amassed. Even then I thought that many of the proposed solutions were overkill for the actual size of his data sets. It’s sexier to advise someone to use MongoDB or Hadoop than to tell them their problem can be solved with a few lines of Python. So I’m not surprised that he feels like “Big Data” is a bunch of BS.
In the last few years, I’ve watched the criteria for calling something big data drop dramatically to the point where people who have a data set with more rows than will fit on their monitor at once tell me they’re working with “Big Data”.
The confusion comes from the fact that, like matter, data really comes in three different phases, each with its own set of tools and problems. Using the wrong set of tools for your size of data leads to an enormous waste of time and money.
Level 1: Under 20,000 Rows. “Data Sets That Can Be Opened In Excel”
Everyone loves to complain about Excel, but it’s a great tool. The graphs are a little ugly and the statistical tests are a little strange, but being able to see and manipulate your data in an unstructured way is amazing. At this size you can scan through your data by hand and eyeball outliers. This is an amazing ability! Don’t give it up.
I’ve worked with many data scientists who immediately load small data sets into R, generate summary reports, and miss obvious effects that would be instantly apparent if they looked at the actual data. Also, pivot tables are an awesome way to interactively explore data sets of this size.
That said, I think there’s room for a much better tool. If Excel added a few features, like a sensible way to make a graph with error bars, or if someone added a user-friendly GUI-based data exploration interface to R, people could work much more efficiently at this scale.
Level 2: Under 2,000,000 Rows. “Data Sets That Fit Into RAM on a Single Machine”
If your data fits into memory on a single machine, there’s no need to use new-fangled tools like NoSQL databases or Hadoop. This is a size that wouldn’t traditionally be called “Big Data”. But if you’re not a programmer, you are going to have a really hard time working on data sets at this scale. If your data is already in a perfect format, SPSS and Stata are supposed to make your life easy, but I think there is a huge hole in the market for something that helps non-technical people manipulate data sets of this size. Open Refine (formerly Google Refine) looks promising but has a long way to go.
The mistake I see a lot of companies make with data sets of this scale is to start throwing Hadoop and NoSQL databases at the problem. At this scale, a reasonably configured MySQL or Postgres database is going to be just fine and make your life a lot easier. Remember that the 2,000,000 or so row limit of this stage goes up every year as computers get more and more RAM, so even if your quantity of data is doubling every 18 months, you may be able to stay at this stage forever, and if you can avoid going to level 3, you will be glad you did!
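To make the point concrete, here’s a minimal sketch showing that a couple million rows is no trouble even for SQLite running in memory, let alone a real MySQL or Postgres server. The donations table and its values are made up for illustration:

```python
import random
import sqlite3

# An in-memory SQLite database; MySQL or Postgres would be at least
# as comfortable with this volume. Table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donations (donor_id INTEGER, amount REAL)")

# Insert two million synthetic rows.
rows = ((i, random.uniform(1, 500)) for i in range(2_000_000))
conn.executemany("INSERT INTO donations VALUES (?, ?)", rows)

# An aggregate query over all two million rows runs in well under a
# second on an ordinary laptop.
count, avg = conn.execute(
    "SELECT COUNT(*), AVG(amount) FROM donations"
).fetchone()
```

If a single table scan like this feels slow, an index or a slightly bigger machine almost always fixes it before a distributed system would.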
Level 3: Above 2,000,000 Rows. “A World of Pain”
Once your data doesn’t comfortably fit into memory on a single machine, things are going to get much harder. This is where the “Big Data” tools start to make sense and where you want to hire specialized people to set them up and run them. Having run a Hadoop cluster in the early days of Powerset and dealt with data sets of this type at Stanford and Yahoo, I try to stay away from this size at all costs, and I try to get my data out of this stage as fast as I can. Everything is trickier here: it’s hard to compute averages or look at what kinds of outliers you might have, and it’s easy to make dumb mistakes that would be obvious at smaller scales.
One great and underused technique at this scale is sampling. For most things that you want to do with data, 100,000 randomly selected rows is as good as 10,000,000 rows and working at R or SPSS scale allows for a much faster analysis cycle.
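Sampling at this scale can be done in a single streaming pass without ever holding the full data set in memory. Here’s a sketch using classic reservoir sampling; the function name and the toy stream are my own for illustration:

```python
import random

def reservoir_sample(rows, k):
    """Pick k rows uniformly at random from an iterable of unknown
    length, holding only k rows in memory at a time."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep row i with probability k / (i + 1); every row
            # ends up in the final sample with probability k / n.
            j = random.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample

# Example: draw 100,000 rows from a stream of 10,000,000 values.
# In practice `rows` would be a file or database cursor.
sample = reservoir_sample(range(10_000_000), 100_000)
```

The resulting 100,000-row sample fits easily back in Level 1 or Level 2 territory, where the fast, interactive tools work again.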