By Justin Tenuto, November 6, 2015

Using machine learning to predict gender

 

This all started with a simple question: could we train an algorithm to determine if a Twitter account belonged to a man or a woman? With that in mind, we ran a simple data categorization job, fired up our brand new CrowdFlower AI feature, and tried to answer just that. What we found was, well, pretty damn interesting. But no spoilers. We’ll get to all that in a second. Let’s take a step back and start at the beginning.

Here’s how we did it

To run any CrowdFlower job, you of course need data, in this case tweets. The first challenge with a question like this is deciding exactly what sort of tweets to pull. To put it another way: if you fetch social data about, say, an especially seedy strip club, odds are you’re going to get a few more male-authored tweets than female-authored ones. So we took our thumb off the scale and pulled 10,000 tweets with the word “the” in them and another 10,000 with the word “and” in them. But, importantly, we did a little something extra: in addition to a swath of random tweets, we also captured the user’s profile description (the “about me,” if you will), their profile image, and even the colors the accounts used for their links and sidebars.
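For the curious, here is roughly what one row of that data could look like in Python. This is a minimal sketch, not the actual pipeline: the `Row` field names and the `contains_word` and `sample_rows` helpers are made up for illustration, and it assumes the tweets have already been fetched by some other means.

```python
import re
from dataclasses import dataclass


@dataclass
class Row:
    """One data row: a single tweet plus profile details (hypothetical field names)."""
    tweet_text: str
    profile_description: str
    profile_image_url: str
    link_color: str      # hex code, e.g. "2fc2ef"
    sidebar_color: str   # hex code


def contains_word(text, word):
    """True if `word` appears in `text` as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def sample_rows(rows, word, limit):
    """Keep at most `limit` rows whose tweet contains `word` as a whole word."""
    picked = [row for row in rows if contains_word(row.tweet_text, word)]
    return picked[:limit]
```

Matching on whole words matters here: a plain substring check for “the” would also catch “then” and “other.”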

With our data fetched, we ran a data categorization job where we asked our contributors to visit the profile pages of Twitter accounts and judge the gender of each. We had them bucket accounts into “male,” “female,” “brand or organization,” and gave them an option for “can’t tell” as well. Then, we ran the tweets through our AI feature.
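If more than one contributor looks at the same account, those judgments have to be rolled up into a single label. Here is a minimal sketch of a simple majority vote. It assumes several judgments per account, which this post doesn’t spell out, and the account names and the `aggregate` helper are hypothetical.

```python
from collections import Counter


def aggregate(judgments):
    """Return (winning_label, agreement) for one account's judgments.

    Agreement is the share of contributors who picked the winning label.
    Ties go to whichever label was seen first.
    """
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)


# Made-up judgments for two made-up accounts.
judgments_by_account = {
    "account_a": ["male", "male", "can't tell"],
    "account_b": ["female", "female", "female"],
}

aggregated = {
    account: aggregate(judgments)
    for account, judgments in judgments_by_account.items()
}
```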

And that’s where things got interesting.

We weren’t expecting the model to be super confident about its predictions–after all, each data row had just a single tweet, a profile, and some ancillary information to look at. But what we did manage to get were some major individual predictors (and anti-predictors) that strongly correlated to each account type. In other words: there are certain words, colors, and phrases that almost always mean an account belongs to a man or a woman. So how does CrowdFlower AI work? Here comes the science:

First, our machine learning feature looks at each data row (which in this case is a tweet, a profile, etc.) and the judgment our contributors made for each of those rows. Then, it looks for patterns. In accounts marked as male, which words come up most frequently? Which come up least frequently? And since we pulled the colors these accounts used for their links and sidebars, the model was able to look at hex codes and figure out which colors were most often associated with men, women, or brands. The model then assigns a value to how predictive a certain piece of data is. In effect, or at least for our purposes here, that shakes out to a sort of top twenty-five words that predict an account is run by a man or woman.
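To make the idea concrete, here is a toy sketch of one way to score predictors: compare how often a token shows up in rows with the target label against how often it shows up everywhere else, and take the log of that ratio. This is an illustrative stand-in, not the actual CrowdFlower AI implementation. The `predictor_scores` function, the smoothing constant, and the sample rows are all invented for the example.

```python
import math
from collections import Counter


def predictor_scores(rows, target, smoothing=1.0):
    """Score each token by the log ratio of its smoothed rate in `target`
    rows to its smoothed rate in all other rows.

    rows: list of (text, label) pairs.
    Returns {token: score}; positive leans toward `target`, negative away.
    """
    in_target, in_rest = Counter(), Counter()
    n_target = n_rest = 0
    for text, label in rows:
        tokens = set(text.lower().split())  # count each token once per row
        if label == target:
            n_target += 1
            in_target.update(tokens)
        else:
            n_rest += 1
            in_rest.update(tokens)

    scores = {}
    for token in set(in_target) | set(in_rest):
        # Add-one style smoothing keeps unseen tokens from dividing by zero.
        rate_target = (in_target[token] + smoothing) / (n_target + 2 * smoothing)
        rate_rest = (in_rest[token] + smoothing) / (n_rest + 2 * smoothing)
        scores[token] = math.log(rate_target / rate_rest)
    return scores


# Made-up rows: tweet text with a link color tacked on, plus a label.
rows = [
    ("wrestling tonight 2fc2ef", "male"),
    ("love wrestling and pizza", "male"),
    ("psych exam tomorrow ♥ f5abb5", "female"),
    ("coffee and books ♥", "female"),
    ("official newspaper account", "brand"),
]

male_scores = predictor_scores(rows, "male")
top_male = sorted(male_scores, key=male_scores.get, reverse=True)[:3]
```

Tokens with the highest scores play the role of predictors, and the ones with the most negative scores play the role of anti-predictors. Hex codes get scored like any other token.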

And with that, the findings:

Male predictors

So what data is most predictive of a man’s Twitter account? As in, what word or phrase appears most often in men’s accounts and least often in other kinds? While a few of these we expected, we must say, number one was a bit of a surprise:

[Chart: male predictors]

WRESTLING, BROTHER. Somewhere, the Macho Man is snapping into a Slim Jim and smiling approvingly.

A few words on some of the other predictors we found:

  • thisiswhyweplay is an NBA hashtag. All hail Draymond Green.
  • The @ symbol was a really interesting predictor. It suggests guys are more likely to talk to (or, in a lot of cases, talk at) another account than women or non-individual accounts are.
  • 2fc2ef is a hex code, or, for those unfamiliar with that phrase, the way a computer describes color. And yes, 2fc2ef is blue.
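If you want to check the color claim yourself, a hex code is just three two-digit hexadecimal numbers for red, green, and blue. A small sketch follows; the `hex_to_rgb` and `dominant_channel` helpers are names made up for this example.

```python
def hex_to_rgb(code):
    """Split a six-digit hex color into (red, green, blue), each 0-255."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def dominant_channel(code):
    """Name the channel with the largest value (a crude read on the hue)."""
    red, green, blue = hex_to_rgb(code)
    channels = (("red", red), ("green", green), ("blue", blue))
    return max(channels, key=lambda channel: channel[1])[0]


rgb = hex_to_rgb("2fc2ef")
strongest = dominant_channel("2fc2ef")
```

For 2fc2ef, blue is the largest of the three channels, with a good amount of green and very little red, which is why it reads as a bright sky blue.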

We’ll get to the anti-predictors in a second, but since a fair share of them predict an account belongs to a woman, let’s look at that data first.

Female predictors

An added bonus of this section is that we learned our graphing program can handle emojis. We live in exciting times.

[Chart: female predictors]

Like our male predictors, this one has a few surprises. For one, just look at how predictive that heart emoji is. It was the strongest predictor across all categories and it wasn’t even all that close. Not only that, a different heart emoji was the fifth strongest predictor. That, of course, isn’t to say that every female account we saw had one of those, but rather that if a heart appeared in a tweet or profile, our model was very confident that account belonged to a woman. As we did with the gents, here are a few others worth a comment:

  • camgirl was the second leading predictor. Do not google that at work.
  • psych is either the belated comeback of a really tubular ’90s slang word or there are a lot of women in our sample who are psych majors. We’re guessing the latter. Reluctantly.
  • f5abb5 is a hex color. And yes, before you click that, it’s pink.

A word about anti-predictors

Our model also looks at data that appears in the set but is actually unlikely to correlate to a certain account type. In other words: what phrases, colors, etc. don’t appear in men’s or women’s accounts? For men, you’ll of course see a lot of data that appears as female predictors. But there were a few interesting additions:

[Chart: male anti-predictors]

“Feminist,” for example, was the second most anti-predictive piece of data, whereas it was fairly low on the list of female predictors. “Underground” is odd, but hey, let’s roll with it. And apparently, dudes dislike using smiley faces. Tres sad. 🙁

Now, a look at female anti-predictors:

[Chart: female anti-predictors]

The “@” symbol and “wrestling” are flipped here, but, if you’ll recall, were both male predictors. You’ll notice a few new hex codes up there, but yes, those are various shades of blue and black. Also: “pizza” is anti-predictive. Apparently dudes are super into pizza. We should all be super into pizza. Pizza is good and cool.

What about non-individuals?

Of course, Twitter isn’t peopled solely by, uh, people. There are also a whole host of brand accounts, media sources, bots, and so on. Ever conscious of blasting out too many graphs, we’ll skip the predictors, because they weren’t wholly exciting. We saw words like “official,” “reddit,” “worldwide,” “newspaper,” and “association.” Mostly, that’s expected. Those are words you’d expect to see in a brand account. But we did want to lift the hood on some of the anti-predictors:

[Chart: brand anti-predictors]

Again, these are words that suggest an account belongs to a real human. It makes sense to see jobs in there like “strategist” and “writer.” “I” was another interesting finding that, when you stop to think about it for a quick second, makes a ton of sense. Non-individuals are also not “passionate,” nor do they much write about “vegetables.” This nice person (@passionatevegan) agrees with these findings.

In the end, the model is only about 60% confident it can look at an account, complete with link color, description, and a single random tweet with the word “and” or “the” in it and guess who’s behind the curtain. That makes sense. After all, we’re not all that much different. We use a lot of the same words. But in the end, we learned an important update on John Gray: Men are from wrestling, women are from heart emoji.
