NOTE: The following post, by Kevin Cocco, was reposted from the Dialogue Earth blog.
Can a machine be taught to determine the sentiment of a Twitter message about weather? With over 1 million crowdsourced human judgements in hand, the goal was to train a predictive model on this data and use the resulting machine learning system to make judgements. Below are the highlights from the research and development of a machine learning model in the cloud that predicts the sentiment of text regarding the weather. The major technologies used in this research are Google Prediction API, CrowdFlower, Twitter, and Google Maps.
The only person who can really determine the true sentiment of a tweet is the person who wrote it. When human crowd workers judge tweet sentiment, all 5 workers make the same judgement only 44% of the time. CrowdFlower’s crowdsourcing processes are well suited to managing the art and science of sentiment analysis. You can increase the number of crowd workers per record to improve accuracy, at a correspondingly higher cost.
The results of this study show that when all 5 crowd workers agree on the sentiment of a tweet, the predictive model makes the same judgement 90% of the time. Across all tweets, CrowdFlower and the predictive model return the same judgement 71% of the time. CrowdFlower and Google Prediction supplement rather than substitute for each other. As shown in this study, CrowdFlower can successfully be used to build a domain- or niche-specific data set to train a Google Prediction model. I see the power of integrating machine learning into crowdsourcing systems like CrowdFlower. CrowdFlower users could have the option of automatically training a predictive model as the crowd workers make their judgements. CrowdFlower could continually monitor the model’s accuracy trend and then progressively include machine workers in the worker pool. Once the model reached a target accuracy X, the majority of the data stream could be routed to predictive judgements, while a small percentage of the data continued to go to the crowd to refresh current topics and continually validate accuracy. MTurk hits may only cost pennies, but Google Prediction ‘hits’ cost even less.
Weather Sentiment Prediction Demo Application:
Note: this demo uses a throttled server-side Twitter feed; retry later if you get no results. Contact me regarding high-volume applications and integrations with the full Twitter firehose.
Match Rate / Accuracy Findings:
Below are the highlighted match rates of CrowdFlower human judgements to Google Prediction machine judgements. A match rate compares the predicted sentiment labels from one method with those from another: it is the share of tweets for which both methods return the same label.
- Google Prediction API compared with CrowdFlower sentiment analysis, all tweets = 71% match rate
- Mirroring DialogueEarth Pulse’s filtering of the lowest 22% of confidence scores = 79% match rate
- Tweet sentiment can be confusing for humans and machines alike. Google predictions for only the tweets on which all the crowd workers agreed (CrowdFlower confidence score = 1) = 90% match rate
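As a minimal sketch of how a match rate like those above can be computed, the snippet below compares two lists of labels position by position. The function name `match_rate` and the sample labels are illustrative assumptions, not code or data from the original study.

```python
def match_rate(labels_a, labels_b):
    """Return the fraction of positions where both methods give the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must have the same length")
    if not labels_a:
        raise ValueError("label lists must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical sample: five tweets labeled by the crowd and by the model.
crowd = ["positive", "negative", "neutral", "positive", "not weather related"]
machine = ["positive", "negative", "positive", "positive", "neutral"]

rate = match_rate(crowd, machine)
print(f"match rate: {rate:.0%}")
```

Note that a match rate measures agreement between the two methods, not correctness in an absolute sense, since the crowd labels themselves are not unanimous.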
About Google Prediction API:
During the May 2011 Google I/O conference, Google released a new version of its Google Prediction API with open access to Google’s machine learning systems in the cloud. The basic process for creating a predictive model is to upload training data to Google Cloud Storage and then use the Google Prediction API to train a machine learning model from that training data set. Once you have a trained model in the cloud, you can write code against the API to submit data for sub-second predictions (0.62 seconds each on average).
About The Data:
Much has been written about DialogueEarth.org’s Weather Mood Pulse system. Pulse has collected a data set of 200k+ tweets, each assigned one of five labels describing the tweet’s sentiment about the weather. This labeling is crowdsourced through CrowdFlower, which presents each tweet with a survey for the crowd workers to decide on. CrowdFlower has quality control processes in place that present the same tweet to several people in the crowd. DialogueEarth’s crowd jobs were configured so that each tweet was graded by 5 different people. CrowdFlower uses the agreement among these 5 people, together with each person’s CrowdFlower “Gold” score, to calculate a confidence score between 0 and 1 for each tweet. About 44% of the tweets have a confidence score of 1, meaning 100% of the graders agreed on the sentiment label, while other tweets have low scores such as 0.321, meaning very little agreement on sentiment plus some influence from each grader’s Gold score. The Pulse system has chosen to use only the tweets that have a CrowdFlower confidence score equal to or greater than 0.6. Dozens of models were built using various segments of the CrowdFlower confidence score range. Testing showed that the best model used the full confidence range of CrowdFlower records.
Weather Sentiment Tweet Labels From CrowdFlower:
The CrowdFlower-scored tweet data contains the tweet text, the weather sentiment label, and the CrowdFlower confidence score. The data set was randomly split into two segments: 111k rows (~90%) used to train the model and 12k rows (~10%) held out for testing it.
|Sentiment||Modeled Tweets 90%||Test Tweets 10%||% of Total|
|not weather related||34,232||3,780||30.6%|
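A random 90/10 split like the one described above can be sketched as follows. This is only an illustration: the helper `split_rows`, the fixed seed, and the placeholder rows are assumptions, and the original study may have randomized its data differently.

```python
import random


def split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows: (tweet text, sentiment label, CrowdFlower confidence).
rows = [(f"tweet {i}", "neutral", 1.0) for i in range(1000)]

train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Holding the test rows out before training is what allows the later match rates to be read as an estimate of performance on unseen tweets.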
CrowdFlower Confidence Scores Correlation to Google Prediction Match Rate / Accuracy:
Running Match Rate – Ordered by CrowdFlower Confidence, Google Prediction Confidence
- The x-axis shows the distribution of CrowdFlower (CF) confidence scores in the 10% random test data set (12k rows)
- Google is better at predicting tweets that have a higher CrowdFlower confidence score
- The Google confidence score correlates with match rate: on average, higher Google confidence means a higher match rate
Google’s Prediction Confidence Score and Correlation to Match Rate Accuracy
Running Match Rate and Google Prediction Score – Ordered by Prediction Score, Random
- Higher Google confidence scores correlate with a higher match rate.
- Keeping only results with a Google confidence score > 0.8290 yields 80% accuracy at the cost of filtering out 24.41% of the data.
|Accuracy/Match||Google Confidence||% Data Filtered||Rows Retained (of 12,390)|
|98.47%||score = 1||86.88%||1627|
|90%||score > 0.99537||55.16%||5543|
|85%||score > 0.95688||38.78%||7586|
|80%||score > 0.82900||24.41%||9367|
|78.93%||score > 0.79122||21.60% **||9715|
|75.00%||score > 0.61647||11.06%||11021|
|70.91%||score > 0.25495||0.0%||12390|
** Note: 21.6% is the share of data that Pulse currently filters out by excluding CF confidence scores <= 0.6
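The trade-off in the table above, between match rate and the share of data discarded, can be explored with a small helper like the one below. This is a sketch under stated assumptions: `filter_by_confidence` is a hypothetical name, and the sample records are invented for illustration rather than taken from the 12,390 test rows.

```python
def filter_by_confidence(records, threshold):
    """Keep predictions scoring above the threshold.

    records: list of (crowd_label, predicted_label, prediction_score) tuples.
    Returns (match_rate_of_kept_rows, fraction_of_rows_filtered_out).
    The match rate is None when no rows survive the filter.
    """
    kept = [r for r in records if r[2] > threshold]
    filtered_fraction = (len(records) - len(kept)) / len(records)
    if not kept:
        return None, filtered_fraction
    matches = sum(1 for crowd, predicted, _ in kept if crowd == predicted)
    return matches / len(kept), filtered_fraction


# Hypothetical test records, not data from the study.
records = [
    ("positive", "positive", 0.99),
    ("negative", "negative", 0.95),
    ("neutral", "positive", 0.40),
    ("positive", "positive", 0.85),
    ("negative", "neutral", 0.30),
]

for threshold in (0.0, 0.829):
    rate, dropped = filter_by_confidence(records, threshold)
    print(f"score > {threshold}: match rate {rate:.0%}, filtered {dropped:.0%}")
```

Sweeping the threshold over the sorted prediction scores would produce a running match rate curve of the kind the charts above describe.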
Effect of Model Size on Match Rate / Accuracy
- The larger the training data set, the higher the match rate
- A 95% reduction in training set size (from 111k rows to 5k) lowers the match rate by 7.1 percentage points (from roughly 71% to 64%)
- The classificationAccuracy returned from Google’s model build differed from the tested accuracy rate by between 5% (5k model) and 0.06% (111k model).
Formatting Tweet Text for Modeling:
When preparing text for modeling, Google recommends removing all punctuation because “Punctuation rarely add meaning to the training data but are treated as meaningful elements by the learning engine”. The tweet text was lowercased and stripped of all punctuation, special characters, returns, tabs, and so on. Each of these was replaced by a space to prevent two words from joining. Given Twitter’s 140-character limit and the common use of emoticons, it might be interesting to replace emoticons with words before building the model, for example replacing :’-( with a specific token such as ‘ emoticon_i_am_crying ‘ or a general one such as ‘emoticon_negative’. Here is an example of tweet hygiene:
BEFORE: 83 degrees n Atl @mention:59pm :> I LOOOOVE this …feels like flawda
AFTER: 83 degrees n atl mention 59pm i loooove this feels like flawda
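The cleaning described above can be approximated with a short function. This is a minimal sketch, not the original implementation: the name `clean_tweet`, the choice to keep only ASCII letters, digits, and underscores, and the emoticon mapping are all assumptions made for illustration. The optional emoticon step reflects the idea floated above, not something the study actually did.

```python
import re

# Hypothetical mapping for the optional emoticon idea; not used in the study.
EMOTICON_TOKENS = {":'-(": "emoticon_negative"}


def clean_tweet(text, emoticons=None):
    """Lowercase a tweet and replace punctuation and whitespace runs with one space."""
    # Optionally swap emoticons for word tokens before punctuation is stripped.
    for emoticon, token in (emoticons or {}).items():
        text = text.replace(emoticon, f" {token} ")
    text = text.lower()
    # Assumption: anything other than a-z, 0-9, or underscore counts as noise.
    return re.sub(r"[^a-z0-9_]+", " ", text).strip()


before = "83 degrees n Atl @mention:59pm :> I LOOOOVE this …feels like flawda"
print(clean_tweet(before))
print(clean_tweet("so cold :'-(", EMOTICON_TOKENS))
```

Replacing each run of stripped characters with a single space, rather than deleting it, is what keeps "@mention:59pm" from collapsing into one word.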
Training The Google Prediction Model
Here are the basic steps for training a model. Training can take a few hours for a 100k-row / 10 MB training data set; this seems to depend on server load. When Google finishes building the model, the trainedmodels.get method returns a confusion matrix along with the model’s “classificationAccuracy” score. Note that Google’s classificationAccuracy score and the match rate measured in testing below are essentially the same (0.71 vs 0.709).
Google Model building confusion matrix with a classificationAccuracy of 0.71
|SENTIMENT||negative||positive||not weather related||neutral||cannot tell|
|not weather related||239.5||163.5||268||1965.5||66|
Testing 12k hold out tweets against the model above with a match rate / accuracy of 0.709
CrowdFlower Actual
|SENTIMENT||negative||positive||not weather related||neutral||cannot tell|
|not weather related||251||200||2967||460||64|
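An overall match rate such as 0.709 is the sum of the diagonal of a confusion matrix divided by the sum of all its cells. The sketch below shows that calculation on a small invented matrix. The counts and the three-label layout are hypothetical, chosen only to keep the example short, and `accuracy_from_confusion` is an illustrative name.

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy from a nested dict laid out as matrix[predicted][actual]."""
    total = sum(sum(row.values()) for row in matrix.values())
    # Diagonal cells are the ones where predicted and actual labels coincide.
    correct = sum(row.get(label, 0) for label, row in matrix.items())
    return correct / total


# Hypothetical counts, not the study's figures.
confusion = {
    "negative": {"negative": 50, "positive": 5, "neutral": 5},
    "positive": {"negative": 4, "positive": 60, "neutral": 6},
    "neutral": {"negative": 6, "positive": 5, "neutral": 59},
}

accuracy = accuracy_from_confusion(confusion)
print(f"accuracy: {accuracy:.3f}")
```

The off-diagonal cells show which pairs of labels are most often confused, which the single accuracy figure hides.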