It’s pretty clear to any parent that the Internet, in all its wonder and cat-filled glory, can be a dangerous place for kids. Not that we need any reminders – the media does a pretty good job of educating and alarming parents about the horrors lurking only a few clicks away. To put it mildly: NBC’s Chris Hansen wouldn’t have a job if it weren’t for the Internet.
This is where a company like Artimys Language Technologies comes in. Artimys uses machine learning to identify language patterns in online communication that indicate bullying, sexual predation or suicidal behavior. Sexual predators and bullies usually display common linguistic patterns – similar ways of phrasing sentences, excessive use of particular words, distinctive stylistic elements, and so on.
By studying and monitoring a child’s online conversations, Artimys can detect potential dangers and warn parents.
Clean Training Data is Critical to Machine Learning
Artimys’s machine learning algorithms are intelligent, but no artificial intelligence can be built in isolation from human intelligence. A major challenge for current language detection models is distinguishing true online bullying from aggressive banter among friends. For example, excessive use of profanity is a common indicator of online bullying, yet profanity is also common in friendly conversations among young people.
What Artimys needed was a scalable system that could provide a massive dataset of real bullying language. Machine learning models need clean training data fed in at the outset so they can learn to identify the relevant patterns. For Artimys, that meant initial labels separating actual bullying from banter.
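Artimys hasn’t published its model, so as a rough illustration of why labels matter, here’s a minimal supervised text classifier – a from-scratch Naive Bayes with invented training examples (all texts and labels below are hypothetical, not Artimys data). Note that the word “idiot” appears in both classes; only the labels let the model learn the difference from surrounding context words:

```python
from collections import Counter, defaultdict
import math

# Hypothetical labeled examples of the kind crowd workers would produce.
train = [
    ("you are such an idiot nobody likes you", "bullying"),
    ("everyone hates you just leave", "bullying"),
    ("go away you worthless idiot", "bullying"),
    ("haha you idiot that was hilarious", "banter"),
    ("dude you idiot you won again", "banter"),
    ("lol you are the worst, rematch tonight?", "banter"),
]

def tokenize(text):
    return text.lower().split()

# Per-class word frequencies from the labeled data.
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(tokenize(text))

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Multinomial Naive Bayes with add-one smoothing."""
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / len(train))  # log prior
        total = sum(word_counts[label].values())
        for w in tokenize(text):
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("you idiot that was so funny"))   # → banter
print(predict("nobody likes you just leave"))   # → bullying
```

With only a handful of examples this is a toy, but it shows the mechanic: the same profane word contributes to both classes, and the surrounding labeled context is what tips the prediction.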
Labeling Datasets on an Incredible Scale
Artimys started with 2 million messages drawn from Twitter and online message boards. From these, it identified 40,000 conversations with a high likelihood of aggressive behavior and fed them through CrowdFlower’s platform and on-demand workforce for labeling, which took only a few hours. In all, CrowdFlower provided 150,000 responses to label the 40,000 conversations. Artimys used this data to fine-tune its algorithm to distinguish true bullying and threatening behavior from friendly or ambiguous use of profanity.
Did It Work?
Here’s what Bob Dillon, CEO of Artimys, had to say about CrowdFlower:
Platitudes aside, data labeling through CrowdFlower helped Artimys improve several key measures of bullying-detection accuracy by four to five times. Specifically, Artimys improved its:
– F1 score by 4.2x
– Model precision by 5.2x
– Recall by 25 percent
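For readers unfamiliar with these metrics: precision measures how many flagged conversations were actually bullying, recall measures how many real bullying cases were caught, and F1 is the harmonic mean of the two. A quick sketch of the standard definitions, with invented counts (not Artimys’s actual numbers):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN),
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative only: of 100 flagged conversations, 80 were true bullying (TP)
# and 20 were banter flagged by mistake (FP); 10 real cases were missed (FN).
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # → 0.8 0.89 0.84
```

The trade-off matters here: a detector tuned only for recall would flag every profanity-laced exchange and drown parents in false alarms, which is why the precision gains above are at least as important as the recall gain.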
Which is to say, CrowdFlower helped Artimys make good on its claim of providing its customers the best protection against online bullying and sexual predation. With further training, Artimys’s algorithm can only improve from here – and let parents sleep more soundly at night.