
By Brendan O'Connor, March 18, 2008

Our color names data set is online

I just packaged and released the data set for our color names experiment. It has 10,000 color/label pairs.

This is the download link. Read on for more details:

I tried to generate the color patches in a way that would yield interesting colors, which is of course incredibly subjective. My main concern was to eliminate muddy dark grays, which are very common when sampling uniformly over standard RGB values. (Perhaps I went too far; see the big donut hole in the color wheel plots.) So the color patches were sampled in HSV space: hue uniformly, with saturation and value drawn from normal distributions biased high. The exact code and parameters for this are included in the download.
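For readers who want a feel for this sampling scheme without opening the download, here is a minimal Python sketch. It is not the released code: the means, standard deviations, and the clipping of out-of-range draws to [0, 1] are all assumptions made for illustration, and sample_patch is a hypothetical name.

```python
import colorsys
import random

def sample_patch(rng, sat_mean=0.8, sat_sd=0.2, val_mean=0.8, val_sd=0.2):
    """Return one (r, g, b) tuple in 0..255, sampled in HSV space.

    The mean/sd defaults are illustrative guesses, not the parameters
    shipped with the data set.
    """
    hue = rng.random()  # uniform over the hue circle, in [0, 1)
    # Normal draws clipped to [0, 1] (clipping is an assumption);
    # high means bias the samples toward saturated, bright colors.
    sat = min(1.0, max(0.0, rng.gauss(sat_mean, sat_sd)))
    val = min(1.0, max(0.0, rng.gauss(val_mean, val_sd)))
    r, g, b = colorsys.hsv_to_rgb(hue, sat, val)
    return tuple(round(c * 255) for c in (r, g, b))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
patches = [sample_patch(rng) for _ in range(10)]
```

Pushing the saturation and value means toward 1 is what thins out the dark, grayish region. Push them too far and you get the donut hole mentioned above.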

The plots in the post and the explorer look like a color wheel with hue as the angle, but they actually come from running PCA over the RGB values and using the first two principal components as x and y. This was a very arbitrary decision, but it seemed to make a nice visual effect. There are many other reasonable ways to plot the data.
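As a rough illustration of that projection, here is a small NumPy sketch. It assumes the RGB columns are mean-centered but not rescaled before PCA, which may differ from what the routines in R.R do. The function name pca_2d and the random stand-in colors are hypothetical.

```python
import numpy as np

def pca_2d(rgb):
    """Project an (n, 3) array of RGB values onto its first two principal components."""
    X = np.asarray(rgb, dtype=float)
    centered = X - X.mean(axis=0)  # center each channel; no rescaling (an assumption)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape (n, 2): x and y plotting coordinates

# Random colors standing in for the real data set.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(500, 3))
xy = pca_2d(rgb)
```

Plotting xy[:, 0] against xy[:, 1], with each point drawn in its own RGB color, gives the kind of picture described above.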

The data includes an anonymized identifier for each worker. (The Mechanical Turk service already makes all workers anonymous, but we anonymized again for releasing the data set.) You can see that certain workers did a large number of annotations. We have no demographic information for this data set, sorry.

The files are:

  • data.csv, which contains the color/label pairs, along with RGB and HSV representations of each color.
  • R.R, which has some routines that were used to generate and plot the data. It has examples of how to read and use the data, if you like to use R.
  • html.rb, whose write_html() creates the explorer.
  • sample-hit.html, one of the web forms used for data collection. There were 1000 forms with 10 colors each. Each form (“HIT”) was filled out by exactly one annotator, though individual annotators sometimes did multiple forms if they wanted to.
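
If you prefer Python to R, a quick way to see how the annotations are spread across workers is to count rows per worker. The sketch below runs on a tiny made-up sample rather than the real file. The column name worker_id is a guess, so check the actual header of data.csv first.

```python
import csv
import io
from collections import Counter

def annotations_per_worker(csv_file, worker_column="worker_id"):
    """Count rows per worker; `worker_column` is a hypothetical header name."""
    reader = csv.DictReader(csv_file)
    return Counter(row[worker_column] for row in reader)

# Made-up rows for illustration only; these are not from the data set.
sample = io.StringIO(
    "worker_id,label\n"
    "w1,teal\n"
    "w1,mauve\n"
    "w2,lime green\n"
)
counts = annotations_per_worker(sample)
```

On the real file you would pass open("data.csv", newline="") instead of the in-memory sample. Since each form holds 10 colors and is filled out by one annotator, the per-worker counts should come out as multiples of 10, provided every color on every form received a label.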

Let us know if this is useful, if you have any questions, or find something wrong with the download — either email or leave a comment here. And if you do anything cool with this data, we’d really love to hear about it.