What makes a bad survey question and why does it matter? I thought I'd use my first blog posts as Dolores Labs's friendly neighborhood social scientist to talk a little bit about question design, since it's a relevant but often overlooked area of crowdsourcing work.
You can ask “the crowd” all kinds of questions, but if you don't stop to think about the best way to ask your question, you're likely to get unexpected and unreliable results. You might call it the GIGO (garbage in, garbage out) theory of research design.
To demonstrate the point, I decided to recreate some classic survey design experiments and distribute them to the workers in Crowdflower's labor pools. For the experiments, every worker saw only one version of the questions and the tasks were posted using exactly the same title, description, and pricing. One hundred workers did each version of each question and I threw out the data from a handful of workers who failed a simple attention test question. The results are actual answers from actual people.
An Example: Response Scales
The rest of this post focuses on one example question that involved a response scale and a test to see how altering the scale would affect people's answers. Here are two versions of the same question that I posted to Crowdflower:
Low Scale Version:
About how many hours do you spend online per day?
(a) 0 – 1 hour
(b) 1 – 2 hours
(c) 2 – 3 hours
(d) More than 3 hours
High Scale Version:
About how many hours do you spend online per day?
(a) 0 – 3 hours
(b) 3 – 6 hours
(c) 6 – 9 hours
(d) More than 9 hours
Notice that both versions can accommodate any answer and that the only difference is in the range of the scale items. You can give an accurate response to either question and neither version explicitly pushes you to give any answer over another.
So what did people say? Here's a pair of histograms breaking the responses up by the two versions of the question:
I didn't label the height of the bars because the results are almost useless in this form. The only conclusion we can draw is that a lot of people in the Crowdflower worker pool tend to spend more than three hours per day online (whoa, no way...).
At the same time, it seems like the workers might have given low answers more frequently in response to the low scale (check out how big the first three blue bars are compared to just the first orange bar).
To look at that comparison more closely, let's break the answers into two categories for each scale: (1) the percentage of responses that were less than 3 hours, and (2) the percentage of responses that were more than 3 hours.
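If you want to try the same collapsing step on your own data, here's a minimal Python sketch (the plots in this post were made in R; Python is used here purely for illustration). The counts are made up, since the raw responses aren't reproduced in this post, and the `collapse` helper is a hypothetical name.

```python
# Hypothetical response counts per answer option; NOT the survey's real data.
low_scale = {"0-1": 6, "1-2": 10, "2-3": 14, "3+": 70}
high_scale = {"0-3": 12, "3-6": 42, "6-9": 30, "9+": 16}


def collapse(counts, under_three_labels):
    """Return (% under 3 hours, % over 3 hours) for one version of the question."""
    total = sum(counts.values())
    under = sum(n for label, n in counts.items() if label in under_three_labels)
    return 100 * under / total, 100 * (total - under) / total


# On the low scale, options (a)-(c) all fall under 3 hours;
# on the high scale, only option (a) does.
low_pct = collapse(low_scale, {"0-1", "1-2", "2-3"})
high_pct = collapse(high_scale, {"0-3"})

print("low scale  (under, over):", low_pct)
print("high scale (under, over):", high_pct)
```

The only design decision is which answer options count as "under 3 hours" on each scale, which is why the label sets are passed in explicitly.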
The difference between the height of the orange points (high scale) is much bigger than the corresponding difference between the height of the blue points (low scale). In other words, people who saw the high scale were much more likely to say they spent more than 3 hours online. In case you're a stats nerd, the Chi-square test showed that this variation was statistically significant.
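For the stats nerds, here's a rough sketch of a chi-square test on a collapsed 2x2 table, written in plain Python rather than the R used for this post. The counts are invented for illustration, and this is a plain Pearson test with no continuity correction, which may not match the exact test settings behind the result above.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if scale version and answer were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = (low scale, high scale),
# columns = (under 3 hours, over 3 hours). NOT the survey's real data.
table = [[30, 70], [12, 88]]
stat, p_value = chi_square_2x2(table)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

A small p-value here says that the split between "under 3 hours" and "over 3 hours" depends on which scale the respondent saw.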
But maybe collapsing the responses like this is a little too coarse and you'd still like to see how the variation worked across the scale as a whole. With that in mind, Lukas suggested another way to look at the effects – a comparison of the cumulative percentage of responses – and the differences are even clearer.
That gap between the blue and the orange line at “Less than 3 hours” – the one level that was measured explicitly on both scales – is huge!
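The cumulative comparison is easy to reproduce. Here's a minimal Python sketch under made-up counts (not the real responses, which aren't listed in this post); `cumulative_percent` is a hypothetical helper name.

```python
from itertools import accumulate


def cumulative_percent(upper_bounds, counts):
    """Map each closed bin's upper bound (hours) to the cumulative % of responses below it.

    `counts` includes the open-ended top bin ("More than ..."), which has no
    upper bound, so `upper_bounds` is one item shorter and zip() drops that bin.
    """
    total = sum(counts)
    return {
        bound: 100 * running / total
        for bound, running in zip(upper_bounds, accumulate(counts))
    }


# Hypothetical counts for options (a)-(d) on each scale; NOT the survey's real data.
low = cumulative_percent([1, 2, 3], [6, 10, 14, 70])
high = cumulative_percent([3, 6, 9], [12, 42, 30, 16])

# 3 hours is the only cutoff that appears on both scales, so it is the one
# point where the two cumulative curves can be compared directly.
gap_at_three = low[3] - high[3]
print(low, high, gap_at_three)
```

Because 3 hours is the only boundary the two scales share, it's the only place where the two curves line up exactly; everywhere else you're comparing by eye.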
Explaining the Gap
If you're thinking that the differences between the scales alone can't explain why all of these results are so skewed, that's a good thought. However, the fact that this was a randomized experiment on a relatively homogeneous group of people makes it very unlikely that anything else explains the difference. Just to be sure, I did some other tests and found no significant differences between the sets of respondents that saw the low and high scales in terms of gender, country of origin, and the amount of time they took to complete the survey. So it seems like the scale is indeed the most likely culprit.
But what explains why scale questions can bias people's responses so heavily? Survey researchers call this kind of behavior satisficing – it happens when people taking a survey use cognitive shortcuts to answer questions. In the case of questions about personal behaviors that we're not used to quantifying (like the time we spend online), we tend to shape our responses based on what we perceive as “normal.” If you don't know what normal is in advance, you define it based on the midpoint of the answer range. Since respondents didn't really differentiate between the answer options, they were more likely to have their responses shaped by the scale itself.
These results illustrate a sticky problem: it's possible that a survey question that is distributed, understood, and analyzed perfectly could give you completely inaccurate results if the scale is poorly designed.
Okay, It's Broken. Now How Do I Fix It?
So what are you supposed to do in order to figure out which scale is more accurate? One of the best ways to mitigate the problem is to do some open-ended research on your respondent population so that you can get a good sense of a reasonable range of responses. Then you can re-center your response scale around that distribution.
To try this out, I ran the survey yet again with the same question, except that this time I left the “hours online” question open-ended, allowing Crowdflower workers to type in their responses. Here's a density plot of those responses with the minimum, maximum, and mean responses highlighted (sparklines style):
While the distribution is skewed and has something of a long-ish tail, the mean (6.53 hours per day), median (6 hours per day), and mode (5 hours per day) are all close to the midpoint of the high scale in my original questions. Therefore, the responses from the high scale were probably a more accurate reflection of the workers' judgments.
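Here's one way you might turn open-ended answers into a re-centered scale, sketched in Python with invented responses (not the real ones behind the plot above). Using quartiles as the cut points is my assumption for illustration; it's one reasonable choice among several, not a standard recipe.

```python
import statistics

# Hypothetical open-ended answers in hours per day; NOT the survey's real responses.
hours = [2, 3, 4, 5, 5, 5, 6, 6, 7, 8, 10, 14]

mean = statistics.mean(hours)
median = statistics.median(hours)
mode = statistics.mode(hours)

# Quartile cut points split respondents into four roughly equal groups,
# so no single answer option looks like the obvious "normal" one.
q1, q2, q3 = statistics.quantiles(hours, n=4)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"suggested cut points (hours): {q1}, {q2}, {q3}")
```

You'd normally round the cut points to friendly numbers before putting them in front of respondents, and you'd want far more than a dozen answers before trusting them.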
Keep in mind, this technique provides no guarantee that the workers have accurate knowledge of how many hours they spend online – it's turtles all the way down. I'd be willing to bet that their best guesses are pretty good, but if a big policy decision were riding on this question, I'd try to supplement my little survey with some other data sources. No matter what, there's no perfect solution.
The point of all this has not been to undermine survey research, but to illustrate some of the problems that can happen if you're not careful with things like scale design, as well as to present some strategies for solving those problems. As crowdsourcing becomes a mainstream tool in a range of academic and commercial fields, survey and questionnaire design techniques are also becoming more widely applicable. Nevertheless, people don't usually encounter this kind of stuff outside of research methodology textbooks and the polling season of an election year.
I have a few more examples from these same experiments that I hope to follow up with in more posts soon. Meanwhile, leave a comment or email me at aaron [at] doloreslabs [dot] com with questions, comments, corrections and requests for data/code. All of these plots were created using R.