A few weeks back, we hosted a meet up at the CrowdFlower office with some friends from Kimono and Silk. The goal? To demonstrate how to scrape the web for data, enrich it, then visualize it. In other words, the whole kit and caboodle.
Here’s the video we took that evening. What follows is an explanation as to how all three tools worked in tandem to get us some really interesting insights. Let’s roll:
(Oh, and also: if you’d like to come to any of our future meet ups, you should. They’re fun. We talk about data and drink beers and have a grand old time. Join our meet up group so we can alert you about the next one. Spoiler alert: it’s with Carnegie Mellon professor Adrien Treuille and he’ll be talking about collaborative science.)
Scraping data with Kimono
While many businesses already have their own data that needs enriching, there are myriad use cases where the data exists online but we don’t have a clean and easy way of fetching it. That’s where Kimono comes in handy. It’s a simple, intuitive Chrome extension that we use frequently to scrape data from Wikipedia, Yelp, Amazon, and hundreds of other sites online. In fact, we recently used Kimono (and Silk) to run our Academy Awards demographic job.
Pratap describes the process in the video above, but essentially, you hit a web page where you want data, launch Kimono, and start digging in. You click on the fields you’d like to collect and, with a little training, Kimono can snag thousands of rows in no time. For our meet up, we scraped Amazon and Google Shopping for prices, URLs, brand names, and more. Once we downloaded that structured data set through their extension, we were ready to enrich what we had.
Enriching data with CrowdFlower
What we ended up with post-Kimono was a spreadsheet full of product images, URLs for sales pages, and product names (in fact, if you’d like to see what we had, you can download that here). Now, the names and pictures of wearables are interesting, but we wanted more information. You know, the sort that scraping and spreadsheets can’t provide on their own. In other words, we needed a human touch.
This, of course, is where CrowdFlower comes in. We (okay, our colleague Aaron) ran a series of jobs where we showed contributors the wearable and asked them to tell us information about it. We wanted to know its price, its purpose (was it for fitness? Gaming? Pets?), where on the body it was worn (wrist, ankle, torso, etc.), and more. Contributors saw an image of the product and could follow a link to the product page to tell us that information.
The jobs ran for a couple of hours, and what we ended up with was an enriched data set full of information about all manner of wearable tech (which you can download here). All that we needed to do now was show it off.
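If you’re curious how several contributors’ answers about the same wearable might be boiled down to a single value per question, here’s a minimal sketch in Python. To be clear, this is an assumption for illustration, not a description of how CrowdFlower aggregates results: it uses a plain majority vote, and the product name, question names, and answers are all made up.

```python
from collections import Counter, defaultdict


def aggregate_judgments(judgments):
    """Pick the most common answer for each (product, question) pair.

    `judgments` is an iterable of (product, question, answer) tuples.
    This is a simple majority vote, used here only as an illustration.
    """
    votes = defaultdict(Counter)
    for product, question, answer in judgments:
        # Normalize case and whitespace so "Wrist" and "wrist" count together.
        votes[(product, question)][answer.strip().lower()] += 1

    enriched = defaultdict(dict)
    for (product, question), counter in votes.items():
        answer, count = counter.most_common(1)[0]
        enriched[product][question] = {
            "answer": answer,
            # Share of contributors who gave the winning answer.
            "agreement": count / sum(counter.values()),
        }
    return dict(enriched)


# Hypothetical judgments for a made-up product.
judgments = [
    ("Example Band", "body_location", "wrist"),
    ("Example Band", "body_location", "Wrist"),
    ("Example Band", "body_location", "ankle"),
    ("Example Band", "purpose", "fitness"),
    ("Example Band", "purpose", "fitness"),
]

result = aggregate_judgments(judgments)
print(result["Example Band"]["body_location"]["answer"])
```

The agreement score is a rough signal for which rows might deserve a second look before you visualize anything.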
Visualizing data with Silk
After getting our data with Kimono and expanding and enriching it with CrowdFlower, we were still left with something that, well, wasn’t exactly visually appealing. That’s where Silk comes in.
Silk is an easy-to-use data visualization platform. You can take essentially any spreadsheet, toss it into Silk, and create maps, graphs, charts, you name it. One of the coolest parts of Silk is that everything’s interactive. You can filter any visualization to find interesting relationships, patterns, or outliers. Here’s one of the visualizations we created that evening:
You can visit the Silk we created from the project, which has a map showing where each wearable was made, where on the body each wearable is worn, and the prices of every product we ran through this process.
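Silk handles this kind of grouping interactively, but if you want a feel for what sits behind a chart like “where on the body each wearable is worn,” here’s a rough Python sketch of the same idea. The column names (`name`, `body_location`, `price`) and the sample rows are hypothetical, not the actual layout or contents of our data set.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for an enriched spreadsheet export.
SAMPLE_CSV = """name,body_location,price
Example Band,wrist,99.00
Example Watch,wrist,199.00
Example Clip,torso,49.50
"""


def summarize_by_location(csv_text):
    """Count products and average their prices per body location."""
    totals = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        bucket = totals[row["body_location"]]
        bucket["count"] += 1
        bucket["total"] += float(row["price"])

    return {
        location: {
            "count": bucket["count"],
            "avg_price": round(bucket["total"] / bucket["count"], 2),
        }
        for location, bucket in totals.items()
    }


summary = summarize_by_location(SAMPLE_CSV)
for location, stats in summary.items():
    print(location, stats["count"], stats["avg_price"])
```

A summary like this maps directly onto a bar or pie chart: one bar per body location, sized by count or by average price.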
In other words, using Kimono, CrowdFlower, and Silk together, we were able to find, analyze, and graph a data set in an afternoon. Which is pretty cool.
If you’d like to come to our next get together (or just be alerted to the next ones we’re doing) go ahead and join our Meetup page here. We’re in San Francisco and we’re easy to get to from BART. We’d love it if you stopped by.