Last May, I took a trip to Italy for two weeks. A little bit of history: my friend Jessica and I are both Italophiles, and when her mom sent us a link to a video contest where the prize was a round trip flight to Italy, we knew we had to enter. After a week of writing and editing lyrics in a Google Doc — half in Italian, half in English — the resulting music video ended up winning us a trip to the holy land of olive oil, vino, and other delectable edibles.
Apart from being a passionate eater, I’m a passionate supporter of the Slow Food movement, an organization that promotes good, clean, and fair food around the world. Each year, it publishes a guidebook to restaurants in Italy that adhere to its principles. In Italy, this usually means that each restaurant is handpicked to showcase the traditional food of a particular region, supports artisanal methods and products that might otherwise go extinct (were eaters not eating them), and serves food that is most likely naturally organic and local anyway.
But, a problem: it’s 2011, and I’m more apt to travel through interactive maps than with old-fashioned guidebooks. More importantly, I don’t plan, and I needed to know what edible options were around me at any moment on my trip. What I really needed was a version of the guidebook, in map form, that I could use on my mobile. No such version existed, of course, which left me with two options: magically visualize the restaurants around me by poring over the guidebook, or crowdsource it.
What I needed to do (and what much of our work at CrowdFlower consists of) was structure unstructured data. To create the map from the book, I went through the following steps:
- Obtained a PDF version of the book
- Split the book into pages, and uploaded each page as a separate PDF (thanks to pdftk)
- Created a CSV (comma separated values file) with each page’s PDF link and page number
- Created a crowdsourcing task to structure the data, using the previously uploaded individual PDF pages
- Geocoded the structured data
- Output the geocoded data in KML (Keyhole Markup Language) form
- Uploaded the KML file to a mapping site (e.g. Google Maps)
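For readers who want to script the CSV step, here is a minimal sketch in Python. It assumes the book was split with something like `pdftk book.pdf burst output pg_%04d.pdf` and that the pages were uploaded under a single base URL. The file name `book.pdf`, the base URL, and the column names are all placeholders for illustration, not the ones used for the original task.

```python
import csv
import io

# Hypothetical base URL where the per-page PDFs were uploaded.
BASE_URL = "https://example.com/slowfood/pages"


def build_rows(page_count, base_url=BASE_URL):
    """Return one dict per page: its page number and the link to its PDF."""
    return [
        {"page_number": n, "pdf_url": f"{base_url}/pg_{n:04d}.pdf"}
        for n in range(1, page_count + 1)
    ]


def write_csv(rows, handle):
    """Write the rows as CSV with a header line (column names are assumed)."""
    writer = csv.DictWriter(handle, fieldnames=["page_number", "pdf_url"])
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Write three example rows to an in-memory buffer and print them.
    buffer = io.StringIO()
    write_csv(build_rows(3), buffer)
    print(buffer.getvalue())
```

Each row of the resulting file then becomes one unit of work in the crowdsourcing task.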
Conveniently, each page that outlined a restaurant in the PDF was formatted in nearly the same way, which made it easy to give instructions to workers.
For each area on the page, workers were asked to copy and paste specific sections into the task. Each page, then, was split up into corresponding parts (Region, City, Directions, Restaurant Name, and the two Capture Areas). This is the essential concept here: structuring the unstructured data so that I could later geocode it properly and display it the way I needed.
Once the task finished, I downloaded the resulting CSV file and whipped out Google Refine (a.k.a. Excel on crack), which has a feature that lets you enter a template API call that changes based on specific values in each row. Using the Google Geocoding API (any geocoder will do), I constructed an API call that used the address value in each row as the “address” parameter.
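The exact call used in Google Refine is not reproduced here. As a rough sketch of the same idea in Python, the snippet below builds one Geocoding API request URL per address and pulls the coordinates out of a parsed JSON response. The `key` parameter and the helper names `geocode_url` and `extract_lat_lng` are assumptions for illustration, and no network request is made.

```python
from urllib.parse import urlencode

# Public JSON endpoint of the Google Geocoding API.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address, api_key="YOUR_API_KEY"):
    """Build the request URL for one row, URL-encoding the address."""
    query = urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def extract_lat_lng(response):
    """Return (lat, lng) from a parsed response dict, or None if no match."""
    results = response.get("results") or []
    if response.get("status") != "OK" or not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

Rows for which `extract_lat_lng` returns `None` are worth checking by hand, since the address pasted from the guidebook page may be incomplete.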
After slicing and dicing the rest of the data into bits that I wanted to display on a map, I used Google Refine’s “templating” feature to export each row as a Placemark in KML format. Finally, I uploaded the resulting KML file into several different maps, each representing one region in Italy.
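Outside of Google Refine, the same export can be approximated in a few lines of Python. This is only a sketch: the row keys (`name`, `description`, `lat`, `lng`) are assumed for illustration and do not come from the original template.

```python
from xml.sax.saxutils import escape


def placemark(name, description, lat, lng):
    """Render one row as a KML Placemark with a single point."""
    # KML writes coordinates as longitude,latitude (in that order).
    return (
        "  <Placemark>\n"
        f"    <name>{escape(name)}</name>\n"
        f"    <description>{escape(description)}</description>\n"
        f"    <Point><coordinates>{lng},{lat}</coordinates></Point>\n"
        "  </Placemark>\n"
    )


def kml_document(rows):
    """Wrap the Placemarks for all rows in a minimal KML document."""
    body = "".join(
        placemark(r["name"], r["description"], r["lat"], r["lng"]) for r in rows
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n" + body + "</Document>\n</kml>\n"
    )
```

Escaping the text fields matters here, because restaurant names and descriptions copied from the guidebook can contain characters such as `&` that would otherwise break the XML.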
Watch the video that started it all: Io Sono Balsamico by Balsamico
Aron is a Crowdsourcing Project Manager at CrowdFlower, and is the resident agriculturalist-eater. Follow him on Twitter (@aron) for more sage bits of agricultural-eating learnings.