Big-data is exciting not because of its actual size (i.e., how many rows of data it can fill up). It is exciting because it is an abundant, precious, and uncharted resource, sometimes worth more than its weight in gold-plated thumb-drives. Why is it so valuable? Its value is in its ability to accurately inform our decision making. With it, businesses can examine customer behavior, predict the future, or analyze how to use their resources. Scientists, nonprofits, and academic institutions use it in very much the same way to break new ground, innovate, and be more effective.
Photo by Jhayne
Where is this big-data? It’s not just in business and government data-centers. It is all around us. It’s coming from the growing number of sensors and cameras found in buildings, smartphones, and gadgets. It is the internet. It is the trillions of recorded times, building elevator logs, text/characters, facts, numbers, images, and posts on social-media. In these raw forms, however, there is often very little meaning to be found; it’s just noise.
How does big-data become useful? Data must be aggregated, cleaned, sifted through, and enriched. Only from that point do data-scientists have something to work with! From good data, they can find new associations and correlations. These insights can be as mundane as finding that people who buy orange juice at a drugstore are twice as likely to buy manila envelopes, or they can be as exciting as finding that how people fill out their online-profiles on dating sites, like OkCupid, can actually predict the outcome of relationships!
In short, data science is the future of decision making. It really is, and it will continue to change our lives as it informs more and more of the decisions on how we are funded, insured, advertised to, and treated in general.
The Current Image of Big-Data
Photo by Arthur Caranta
As the popular mental image goes, big-data is a collection of highly educated geeks in a room, analytics and data-mining software up on the screens, towering databases humming in the distance, and sophisticated algorithms drawn all over the walls. As cool as this seems, however, this picture is incomplete. If it were accurate, there would be no need for this article! There is a missing piece, namely, the crowd. We must not overlook this! The data processing and computing revolution that is big-data works thanks not only to technology but also to human beings, millions of them, working around the clock on the internet to make discoveries through big data possible…
The Crowd: How Data Scientists Access Human Intelligence to Cleanse and Enrich Their Data
Behind much of the data being used by companies, like Facebook, and by scientific efforts, like FoldIt, is a good crowd. It’s one of the best data science tools out there. This is because, despite having the latest in technology, it is still impossible for computers (i.e., artificial intelligence and algorithms) alone to give meaning to what is seen in text, images, and sound recordings. Furthermore, fun fact, without the aid of a crowd, data scientists can spend up to two-thirds of their time just preparing their data for analysis.
What does the crowd do to save the data scientist this colossal drain on their time? The answer is a myriad of tiny tasks, ones that fall into a number of categories:
1. Tagging, Labeling, Sorting, and Summarizing: People can use their common sense to give meaning to digital information. Example: A startup, like Pandora, that streams music may want to label its tracks by genre. That way, they can play the appropriate type of music for each individual user based on their preferences. With a huge database of songs, using the crowd to label content makes a lot of sense. Many businesses are already doing this.
2. Judging: People can read or view something and then give you their impression of it. Example: Sentiment analysis is a common endeavor undertaken by companies that are trying to understand the impressions they make. Unilever is a prime example. They are one of the world’s top producers of consumer goods. When they discovered that their automated tools were not going to be enough to give them the information they needed concerning the launch of their Dove Men+Care line of products, they turned to the crowd. The result was an increase in accuracy from 30 to 90 percent when identifying negative and positive sentiments from social media streams and other sources. From there, they could reasonably predict their chances for success.
Keep in mind, however, that judgement from crowdsourcing in sentiment analysis can go way beyond identifying “negative and positive”. With the help of the crowd, data can be examined to reveal answers to custom queries as well. (It’s also worth noting that this information can be processed in near real-time by a motivated crowd, allowing for timely and well-informed responses on behalf of the businesses.)
One more important use of crowdsourced data in sentiment analysis is the training of algorithms. Through machine learning, programs can learn from the data the crowd creates; from there, algorithms can be improved to be able to catch on to what humans have known all along, how to make appropriate judgements, such as identifying sentiment. (You can learn more about this process from this article on crowdsourcing and machine learning previously posted on this blog.)
3. Just Being Human: Robots can have a hard time doing simple things, such as making peanut-butter and jelly sandwiches, so why not let humans show them how? You can use browser-based controls to take over the actions of a robot while it observes. From there, it can create a model of your behavior. To the robot, you’re a mastermind, but as a human, you’re just being yourself! For the data scientist, this means the crowd is not only enriching and cleaning your data, they are creating it.
As you can see, through this type of work completed on crowdsourcing platforms (i.e., a barrage of tiny tasks to clean and enrich), data can go from meaningless to meaningful in minutes.
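To make the idea of tiny tasks concrete, here is a minimal sketch in Python of one common way to combine several workers' judgments on the same item: a simple majority vote with an agreement score. The function name `aggregate_labels`, the item and worker IDs, and the sample judgments are all made up for illustration; this is not how any particular crowdsourcing platform does it.

```python
from collections import Counter


def aggregate_labels(judgments):
    """Majority-vote the crowd's labels for each item.

    judgments: iterable of (item_id, worker_id, label) tuples.
    Returns {item_id: (winning_label, agreement)}, where agreement is
    the fraction of that item's judgments that match the winner.
    """
    # Group every label by the item it was given to.
    by_item = {}
    for item_id, _worker_id, label in judgments:
        by_item.setdefault(item_id, []).append(label)

    results = {}
    for item_id, labels in by_item.items():
        # On a tie, most_common returns the label that was seen first.
        winner, votes = Counter(labels).most_common(1)[0]
        results[item_id] = (winner, votes / len(labels))
    return results


# Hypothetical judgments: three workers rate the sentiment of two posts.
judgments = [
    ("post-1", "w1", "positive"),
    ("post-1", "w2", "positive"),
    ("post-1", "w3", "negative"),
    ("post-2", "w1", "negative"),
    ("post-2", "w2", "negative"),
    ("post-2", "w3", "negative"),
]

consensus = aggregate_labels(judgments)
for item_id, (label, agreement) in consensus.items():
    print(f"{item_id}: {label} ({agreement:.0%} agreement)")
```

Items with low agreement are natural candidates to send back to the crowd for more judgments. The consensus labels are also the kind of data that could later be used to train an algorithm.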
Millions of Workers: The Scale of Human Computation
Photo by Michiel S.
Why does the crowd work? Through a variety of portals (Swagbucks, for example), crowd workers benefit from working on data-focused tasks by earning points, cash, contest entries, and in some cases credit for volunteering. It’s a popular model. On the other side of the coin, those who receive the work enjoy being able to get just the right amount of work done in a short time-frame, on demand, with accuracy and without any involvement of a middleman or outsourcing agency.
Around the world, the number of crowd workers available at any moment is in the millions. The company hosting this blog, CrowdFlower, has access to 5 million workers alone!
Why the Image of Big-Data Needs to Be a Little Less Cold, Hard, and Robotic and a Little More Warm, Squishy, and Human
Photo by Marco Oliani
So, we’ve seen that the crowd is an effective way to clean and enrich data and an indispensable part of the data scientist’s workflow; they can plug into the ‘hive-mind’ when needed and unplug when they are done. With this in mind, let’s get back to the main question. If data science is in large part a human process, one where millions of human eyes are sending info to human brains and then moving human hands to type and click helpful information, then why isn’t it represented this way in the media? Why are all the lifeless software programs and clever data scientists getting all the credit?
Take the latest RoboCop movie, from 2014, for example. In the film, we are to believe that Alex Murphy, the man behind the robo-suit, is able to access massive crime databases and live video-surveillance footage from the city (i.e., some seriously big-data). With these streams of information available to him (and some facial-recognition software), inside his Google Glass-like visor, he is able to make highly effective decisions; he knows where to find criminals and how to gather evidence to incriminate uncaught criminals by comparing databases alone.
My complaint? Well, not to criticise the film (it was actually great!), but it’s hard to believe that such capable software could exist. No computer program and no person could systematically search all of that data so quickly. The reality of his situation would have to include an army of people behind computer screens helping him view, identify, verify, and clean up that information if he were ever to get anywhere.
This has to do directly with another topic, the confluence of computer-vision and machine-learning. For RoboCop to sort through that much video-surveillance footage and other files, the algorithms used would have to rival or surpass the actual human vision system. In other words, the software would have to be ultra-brilliant at identifying people based on the limited cues found in imperfect video footage, i.e., blurry figures, poor lighting, confusing movement, depth and perspective, etc. To get there, assuming it ever could (I’d buy that for a dollar!), the program would first need to be trained by humans doing the work before the software could give RoboCop the confidence in the data needed to justify his jumping into so many cool action scenes.
Certainly though, RoboCop is not the point! Nevertheless, the movie’s plot should remind us that, yes, big-data can intelligently advise big decisions, but this can only happen when the data is given meaning, often something that only we humans can do. Furthermore, if we are going to continue to explore the idea that big-data is a revolutionary tool in tech, let us not forget the human crowd that powers it.