
By Justin Tenuto, September 15, 2015

How scientists are using CrowdFlower to create a massive biomedical database

Let’s start with the good news: scientists do a ton of research. In the biomedical field alone, a million papers are published each year. And while that’s a staggering amount of knowledge we’re accumulating, it brings us to the bad news: nobody really knows what’s in all those papers.

Why is that exactly? Well, first off, biomedical papers aren’t the stuff of trade paperbacks. They’re incredibly important, but they’re dense, and they aren’t written for a lay audience. Scientists count on peers in their disciplines to read their research and to build upon or incorporate important advances in the field. But even then, just think about that one-million figure. Even for incredibly specialized researchers, keeping up with the state of the art is a job in and of itself.

If you’re imagining that’s a problem, well, it is. There are insights and advancements in those papers that could be incredibly valuable to other researchers, but with nearly 3,000 papers being published each day, no one can possibly keep up, let alone continue their own research and maintain anything approaching a sane schedule.

This is especially problematic for scientists who are studying rare diseases. The NIH has identified over 6,800 of these illnesses and, in total, they affect 25 to 30 million Americans. In other words, while the diseases themselves are rare, rare diseases as a class affect almost 10% of the U.S. population. So why is this overabundance of research especially precarious for rare disease researchers? Essentially, there could be advancements in Malady A that could jumpstart experimentation in Illness B, but the scientists involved may have no idea the two are related. 

That’s where Andrew Su and his team at Scripps Research come in. They’re attempting to curate these millions upon millions of articles and build a cohesive, searchable database that researchers can use to uncover these sorts of connections. And we’re proud to note that they’re using CrowdFlower for their project. 

It works like this: first, Andrew and his team compiled a massive textual database of biomedical research. Then they identified diseases, symptoms, genes, and other salient information in that text and isolated the sentences that contain it. This is important because it allowed them to build jobs on CrowdFlower where our contributors would only be looking at pertinent statements. In other words, instead of having laymen read massive research papers or dense paragraphs from those articles, contributors looked at one sentence at a time. And not every sentence; just the germane ones with disease, gene, and/or symptom information. Here’s what a task in one of those jobs looks like:

[Image: Scripps-SS-DFE]

Pretty straightforward, right? We especially like that they color-coded the pertinent information so that non-biomedical folks (like our contributors) can confidently find the relationships that are valuable to Andrew and his team. And it’s not just the job above; they’ve been running a ton of jobs like this, looking for all sorts of different relationships with different combinations of genes, diseases, proteins, symptoms, and more.
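For the technically curious, here’s a rough idea of what the sentence-isolation step could look like in code. To be clear, this is a minimal sketch and not the Scripps team’s actual pipeline: the term lists, the `GENE_Z` and `Disease A` placeholders, and every function name below are our own assumptions for illustration, and a real system would use a proper biomedical entity recognizer rather than simple string matching.

```python
import re

# Hypothetical placeholder vocabularies (assumptions, not real project data).
TERMS = {
    "gene": {"GENE_Z"},
    "disease": {"Disease A", "Disease B"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence):
    """Return {entity_type: sorted list of matched terms} for one sentence."""
    lowered = sentence.lower()
    found = {}
    for entity_type, terms in TERMS.items():
        hits = sorted(
            term
            for term in terms
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
        )
        if hits:
            found[entity_type] = hits
    return found


def pertinent_sentences(text, min_types=2):
    """Keep only sentences that mention at least `min_types` kinds of entity."""
    results = []
    for sentence in split_sentences(text):
        entities = find_entities(sentence)
        if len(entities) >= min_types:
            results.append((sentence, entities))
    return results


SAMPLE_TEXT = (
    "Mutations in GENE_Z cause Disease A. "
    "The study enrolled forty patients. "
    "GENE_Z is also implicated in Disease B."
)

if __name__ == "__main__":
    for sentence, entities in pertinent_sentences(SAMPLE_TEXT):
        print(sentence, entities)
```

The point of the `min_types` filter is the same as in the jobs themselves: a contributor only ever sees a sentence that plausibly states a relationship, such as one that mentions both a gene and a disease.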

One of the goals, as we mentioned above, is to create a massive, searchable database where researchers can compare their area of study with another. This is probably best explained with an example. 

Say a scientist is looking at Rare Disease A. She knows that a mutation in gene Z is the cause but her team hasn’t been able to figure out treatments that affect that particular gene. Meanwhile, a different researcher is looking at Rare Disease B and it’s also caused by a mutation in gene Z. But that scientist is making progress with a particular treatment. When this database actually exists, that first scientist could search for the gene she knows causes Disease A and learn about other research that’s been successful treating that mutation. At her fingertips, suddenly, is a bevy of knowledge that can shortcut months of experimentation. As you’d imagine, that’s really valuable. 
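If you think in code, the lookup in that example might be sketched like this. It’s a toy illustration under our own assumptions, since the real database doesn’t exist yet: the record fields, the function names, and the extra “Rare Disease C” and “gene Y” entries are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical curated records (placeholders, not real findings).
RECORDS = [
    {"disease": "Rare Disease A", "gene": "Z",
     "finding": "no effective treatment identified yet"},
    {"disease": "Rare Disease B", "gene": "Z",
     "finding": "progress with a particular treatment"},
    {"disease": "Rare Disease C", "gene": "Y",
     "finding": "unrelated line of research"},
]


def build_gene_index(records):
    """Group records by the gene they mention so lookups by gene are direct."""
    index = defaultdict(list)
    for record in records:
        index[record["gene"]].append(record)
    return index


def related_findings(index, gene, exclude_disease=None):
    """Return records for `gene`, skipping the searcher's own disease."""
    return [
        record
        for record in index.get(gene, [])
        if record["disease"] != exclude_disease
    ]


if __name__ == "__main__":
    gene_index = build_gene_index(RECORDS)
    # The scientist studying Rare Disease A searches on the gene she knows.
    for record in related_findings(gene_index, "Z", exclude_disease="Rare Disease A"):
        print(record["disease"], "->", record["finding"])
```

Indexing by gene rather than by disease is what makes the connection visible: two diseases that never cite each other still land in the same bucket.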

And actually, this is what we talk about when we talk about the difference between big data and rich data. Big data is massive amounts of information (like a million research papers every year) while rich data is massive amounts of useful and meaningful information. In other words, it’s the substantive research and relationships from those million research papers. And while it all comes from the same source data, it’s the rich data that’s far more consequential. 

Andrew’s team isn’t simply looking to build that database through CrowdFlower though. They’re also keen on expanding into the citizen science realm. We’ve talked about FoldIt and EteRNA before, games that got the public to advance protein folding and RNA experimentation at massive scale, and, indeed, those are two sterling examples of the value of citizen science. But there are plenty of others. Zooniverse, which runs Penguin Watch, Fossil Finder, and a whole host of other fascinating projects, is just one example.

Andrew’s team has their own site called Mark2Cure. Right now, they’re looking at a particular rare disease called NGLY1 deficiency, which has been diagnosed in only about 40 children. Each of them has a deficiency in a certain protein; not only is the disease difficult to diagnose, but there’s currently no effective treatment. Mark2Cure aims to help with both. You can sign up to help out here or visit their site to learn more.

The work the Scripps team is doing on Mark2Cure is invaluable for building the database we wrote about above and for advancing biomedical research. They’re hoping that the work they’re doing on CrowdFlower (building relationships, not just identifying genes, proteins, and so on) will be released on the site by the end of the year. If you’d like to see the results of some of the work that will build that next phase of Mark2Cure, a dataset you can download has been up on our Data for Everyone library for a few months.

Past that? We were just really proud to find out this sort of science was being done on our platform. If you have some time, head over to Mark2Cure and put in a little volunteering to help the project out. It’s easy and it’s important, which are two concepts that rarely get to share the same sentence.