On one of those quiet, rainy afternoons that we call “Winter” here in San Francisco, I found myself in the Crowdflower offices with the legendary Mike Love.
Well, okay, maybe not that Mike Love. The other legendary Mike Love.
In any event, Mike and I were chatting about statistics and blogging (he’s exceptionally skilled at both) when he raised an interesting question: could you crowdsource Benford’s Law?
Doe-eyed, mathematical neophyte that I am, neither Benford nor his law rang any bells, so I high-tailed it over to Wolfram MathWorld. There, in an article by Eric W. Weisstein, I encountered the following histogram:
Turns out that this simple, elegant, and purple plot describes the probability distribution of the first digit in many naturally occurring datasets. In other words, about 30% of the leading digits of the numbers in a large variety of tables or listings will tend to be 1; 17.6% will tend to be 2; and so on down to around 4.6% of the numbers beginning with 9.
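For the curious, those percentages come from a one-line formula: the expected share of numbers with leading digit d is log10(1 + 1/d). Here is a minimal Python sketch of it. The function name `benford_probability` is my own label for illustration, not something from MathWorld.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit` (1 through 9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Expected share for each possible leading digit.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine shares sum to exactly 1, since the product of (1 + 1/d) for d from 1 to 9 telescopes to 10.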
The phenomenon has an interesting history (at least as far as I could tell by going a few clicks past the Wikipedia article). It was originally described in an 1881 paper (subscription required) by Simon Newcomb, an astronomer who noticed that the first few pages in books of logarithm tables were worn much more heavily than later pages. The eponymous physicist Frank Benford then re-discovered the distribution in a 1938 paper where he tested it on a number of different data sources. More recently, the mathematician Ted Hill has provided a more sophisticated proof of the origins of the phenomenon and pioneered its use as a diagnostic to discover irregularities in data such as tax and election returns (tax season pro-tip: don’t fake a bunch of receipts that start with the number 9).
In any case, at Mike’s suggestion, I set out to see whether I could re-re-discover Benford’s Law with Crowdflower. The task I designed consisted of one question: “pick any number greater than zero.” In a little over an hour, I had about 500 valid responses and was off to the races to see what the resulting distribution of first digits looked like.
The results were all over the place. Here’s what they look like in a scatter plot with a log-scaled y-axis:
And here I plotted all of the numbers sorted by magnitude and leading digit:
The height of each tower of numbers captures the frequency of that leading digit. The size of the numerals corresponds to the magnitude of the number and the color corresponds roughly to its order of magnitude (red = big). Within each digit, the numbers are sorted along the y-axis.
Even though it’s too small to really be legible, you can probably pick out a few interesting things from that graphic, such as the extraordinary frequency of the number 1 (it occurred 67 times) or the size of the biggest number (2.345 x 10^13). The mean of all submissions was 4.72 x 10^10 while the median was 7. In other words, this little experiment resulted in a seriously skewed distribution.
But what about Benford’s Law? The height of the towers of numbers suggests that this data corresponded pretty well to the histogram at the beginning of my post. To supplement the visual, here’s a table showing the raw frequency and percentages by first digit (note that the first digits are at the top and are written out as words).
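If you want to build a table like this yourself, the tallying step can be sketched in a few lines of Python. This is not the code I used for the post, and the `responses` list below is a made-up handful of numbers for illustration, not the Crowdflower data; `leading_digit` and `first_digit_shares` are names I picked for the sketch.

```python
from collections import Counter


def leading_digit(value: float) -> int:
    """Return the first non-zero digit of a positive number."""
    if value <= 0:
        raise ValueError("value must be greater than zero")
    # Scientific notation always starts with the leading digit,
    # e.g. 0.042 is formatted as '4.2...e-02'. Seventeen significant
    # digits avoid rounding 9.9999999 up to 1.0e+01.
    return int(f"{value:.16e}"[0])


def first_digit_shares(values):
    """Map each digit 1..9 to the fraction of values that start with it."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical responses, NOT the real survey data.
responses = [1, 7, 23, 150, 0.5, 1984, 12, 3, 1, 99]
shares = first_digit_shares(responses)

for digit, share in shares.items():
    print(f"{digit}: {share:.0%}")
```

Going through scientific notation means fractions like 0.5 and monsters like 2.345 x 10^13 get handled the same way, without any special cases.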
It’s still hard to tell how well that does or does not match up with the distribution in the histogram above. There are some nice statistical tests that could help us here, but for the sake of a decent blog post, I went in search of a better way to compare the distributions visually. With that in mind, another plot:
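For anyone who does want one of those statistical tests, a Pearson chi-square goodness-of-fit test is a common choice for comparing nine observed counts against Benford's expected shares. Below is a rough sketch under some loud assumptions: I did not run this for the post, the `example_counts` are invented numbers (not the table above), and the cutoff is the standard 5% critical value for 8 degrees of freedom.

```python
import math

# Standard chi-square critical value for 8 degrees of freedom at the 5% level.
CRITICAL_VALUE_5_PERCENT = 15.507


def chi_square_vs_benford(observed_counts):
    """Pearson chi-square statistic for first-digit counts of digits 1..9."""
    if len(observed_counts) != 9:
        raise ValueError("expected one count for each digit 1 through 9")
    total = sum(observed_counts)
    statistic = 0.0
    for digit, observed in enumerate(observed_counts, start=1):
        # Count Benford's Law predicts for this digit at this sample size.
        expected = total * math.log10(1 + 1 / digit)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented counts for digits 1..9, for illustration only.
example_counts = [150, 88, 62, 49, 40, 33, 29, 26, 23]
statistic = chi_square_vs_benford(example_counts)

if statistic < CRITICAL_VALUE_5_PERCENT:
    print(f"chi-square = {statistic:.2f}: no evidence against Benford at the 5% level")
else:
    print(f"chi-square = {statistic:.2f}: the counts depart from Benford at the 5% level")
```

A statistic below the cutoff only means the test found no significant departure from Benford's Law, which is weaker than proving the data follows it.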
Here I superimposed the percentage values in table 1 (as brown dots) on a density curve of the original Benford’s Law values (shaded in blue). Then I also added a brown (loess) smoothed curve along with some gray confidence interval shading (thank you, Hadley Wickham) to capture the overall trend of the Crowdflower data. As you can see, the fit looks pretty good – I think Benford would be proud. Or at least maybe Mike Love. Maybe.
Admittedly, I may have cheated a bit in creating this second plot. Since the variable presented along the x-axis (leading digit) consists of ordered categories, the smoothed lines are a bit of a representational stretch. Nevertheless, I’m satisfied with the way the resulting graphic visualizes the relationship between the probability function underlying Benford’s Law and the distribution of the Crowdflower responses. If anybody out there is inclined to pursue more interesting visualizations and/or rigorous tests to verify the fit, here’s a copy of the Crowdflower data I used.