Two weeks ago, we were very happy to announce the launch of CrowdFlower AI in partnership with Microsoft Azure Machine Learning. We’re opening up access to machine learning for thousands of organizations that can benefit from intelligent automation, without requiring them to construct their own in-house solutions.
If you’re using (or thinking about using) CrowdFlower AI, it’s safe to assume that you’ve got some data on your hands and know how valuable it can be if you put it to work in the right ways. But you might not be a full-on data scientist, and you might be asking yourself how to tell how good your CrowdFlower AI model is. Fortunately for you, we provide a number of ways to inspect how your model is doing, right in our web interface. Here’s a quick tour of the measures we display, and what they tell you.
Measure 1: Accuracy
At the highest level, you can ask how your model is doing by asking how accurate it is. This is a pretty simple one – what percentage of the model’s predictions are correct? Accuracy numbers are displayed most prominently when you expand a row on your model listings page. Here (and throughout the rest of this post) we show an example of this, from a sentiment analysis model we built for the airline industry as part of a recent product demo (for more details, check out this talk we gave recently). There are two questions listed here: one for the sentiment of the tweet, and the other for the nature of the complaint, if the tweet was negative.
It’s worth pausing for a moment to note that these numbers are conservative – this is the accuracy you would expect if you were to use the AI model exclusively, and not employ humans in the loop to offset its less confident predictions. By combining human and machine intelligence in your task, you can push the accuracy of your results much higher.
For now though, let’s ask why it’s good to consider other measures beyond just accuracy. One reason is that bad errors can hide behind high accuracy – for example, if 95% of your support ticket data is low priority, you could be 95% accurate just by saying everything is low priority. You want to make sure your model is accurate in all the right corners of your data, so this brings us to our next measure.
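To make the support ticket example concrete, here is a minimal Python sketch. The ticket labels are made up for illustration, and the "model" is just a stand-in that always predicts the most common label; none of this is CrowdFlower AI code.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical support-ticket labels: 95 low priority, 5 high priority.
actual = ["low"] * 95 + ["high"] * 5

# A stand-in "model" that ignores its input and always predicts
# the most common label in the data.
majority_label = Counter(actual).most_common(1)[0][0]
predicted = [majority_label] * len(actual)

baseline_accuracy = accuracy(actual, predicted)

# How many of the high-priority tickets did this "model" actually catch?
high_found = sum(a == "high" and p == "high" for a, p in zip(actual, predicted))

print(f"accuracy: {baseline_accuracy:.0%}")
print(f"high-priority tickets found: {high_found} of {actual.count('high')}")
```

The accuracy looks great, yet not a single high-priority ticket is caught, which is exactly the kind of error a single accuracy number can hide.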
Measure 2: The confusion matrix
When you click “Analyze” on your model view page, you’ll be taken to a whole page of information to help you dig deeper into what your model is doing. Near the top of this page is the confusion matrix; this is a table that matches up the predictions your model makes against the actual labels in the training data. Let’s take a look at the confusion matrix for our airline sentiment question:
Here, actual labels in your data appear as the rows of the table, and the model’s predictions appear as the columns. Note that all the rows sum up to 100% – so the way to read this chart is by asking “for each of my actual labels, what percentage of them are predicted correctly, and how frequently do the other types of predictions occur?” At a glance you can now see all the different sources of accuracy and error in your data. They’re even color coded! This can also help you take action to improve the state of your model – for example, we see on the right that there are almost twice as many negative items in the training data as positive and neutral combined. Adding in more examples of these other labels will probably help the classifier recognize them better.
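If you want to see how a table like this comes together, the following Python sketch builds a row-normalized confusion matrix from a handful of made-up labels. The data and the function name `confusion_matrix_percent` are assumptions for illustration only, not the numbers or code behind the airline model.

```python
from collections import Counter


def confusion_matrix_percent(actual, predicted, labels):
    """Rows are actual labels, columns are predicted labels.

    Each cell is the percentage of items with that actual label that
    received that prediction, so every non-empty row sums to 100.
    """
    counts = Counter(zip(actual, predicted))
    matrix = {}
    for a in labels:
        row_total = sum(counts[(a, p)] for p in labels)
        matrix[a] = {
            p: (100.0 * counts[(a, p)] / row_total) if row_total else 0.0
            for p in labels
        }
    return matrix


labels = ["negative", "neutral", "positive"]

# Made-up toy data: 8 negative, 2 neutral, 2 positive items.
actual = ["negative"] * 8 + ["neutral"] * 2 + ["positive"] * 2
predicted = (
    ["negative"] * 7 + ["neutral"]   # predictions for the 8 negative items
    + ["negative", "neutral"]        # predictions for the 2 neutral items
    + ["positive", "negative"]       # predictions for the 2 positive items
)

matrix = confusion_matrix_percent(actual, predicted, labels)

# Print one row per actual label, one column per predicted label.
print("actual \\ predicted".ljust(20) + "".join(p.rjust(10) for p in labels))
for a in labels:
    print(a.ljust(20) + "".join(f"{matrix[a][p]:9.1f}%" for p in labels))
```

Reading across a row answers the question from the text: for items that are actually negative, how often does the model say negative, neutral, or positive?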
The confusion matrix is an incredibly powerful source of information, and you can calculate many other things using it. However, there are three measures in particular that are so widely used in machine learning and data science that we display them automatically on the analyze page. Let’s take a look at these in the final stop on our tour.
Measure 3: Precision, recall, and F1 score
The motivation behind precision and recall is closely tied to our reasons for inspecting the confusion matrix above: there is more than one type of error that a model can make. Precision asks the question “of all the times I predicted something was an X, how often was I correct?” Recall asks the complementary question: “of all the actual X’s in the data, how many of them did I find?”
Looking at our sentiment model, we can see that for the Negative label, our recall is awesome – we’re finding 95% of the negative tweets in the data. Our precision is somewhat lower though – the model is calling some things negative when they’re actually neutral or positive. Again, we can see that in the confusion matrix above, but these numbers are laid out very concisely for you here as well. You may, for example, care particularly about precision on your Negative label and want to minimize the chance that that label is misapplied – in this case you might seek additional examples of things that could be mistaken for negative but aren’t, and adjust your training data accordingly to better define the boundaries of when that label applies. Often, however, good precision and good recall are equally important, which is what the F1 score reflects (it’s the harmonic mean of precision and recall, a type of average that is pulled toward the lower of the two numbers).
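For readers who like to see the arithmetic, here is a small Python sketch of the standard definitions of precision, recall, and F1 for a single label. The toy labels are made up, and the helper name `precision_recall_f1` is hypothetical; the numbers are not those of the airline model.

```python
def precision_recall_f1(actual, predicted, label):
    """Precision, recall, and F1 for one label, from paired label lists."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == label and p == label for a, p in pairs)
    false_pos = sum(a != label and p == label for a, p in pairs)
    false_neg = sum(a == label and p != label for a, p in pairs)

    # Of everything predicted as `label`, how much really was `label`?
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Of everything that really was `label`, how much did we find?
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # Harmonic mean of the two; low if either one is low.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up toy data: 8 negative, 2 neutral, 2 positive items.
actual = ["negative"] * 8 + ["neutral"] * 2 + ["positive"] * 2
predicted = (
    ["negative"] * 7 + ["neutral"]   # predictions for the 8 negative items
    + ["negative", "neutral"]        # predictions for the 2 neutral items
    + ["positive", "negative"]       # predictions for the 2 positive items
)

precision, recall, f1 = precision_recall_f1(actual, predicted, "negative")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy data the Negative label shows the same pattern described above: recall is higher than precision, because a couple of neutral and positive items were labeled negative.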
How did we calculate these numbers?
One more question you might have is how we were able to calculate the accuracy of a model when all we have is training data – what did we test it on? The answer is that we use a trick known as cross-validation: we train several models, each on a different subset of the data, and check how accurately each one predicts the hidden answers for the small portion of data that was held out from it. This graphic helps make that clear:
Image source: Chris McCormick
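The general idea of cross-validation can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the number of folds, the way items are assigned to folds, and the stand-in "model" (which just predicts the most common training label) are all hypothetical choices, not a description of how CrowdFlower AI does it.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Assign each index to exactly one of k folds (round-robin).

    Real implementations usually shuffle the data first.
    """
    folds = [[] for _ in range(k)]
    for i in range(n_items):
        folds[i % k].append(i)
    return folds


def train_majority(examples):
    """Stand-in for training: remember the most common label."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def predict_majority(model, text):
    """Stand-in for prediction: ignore the text, return the stored label."""
    return model


def cross_validated_accuracy(examples, train, predict, k=5):
    """Train k models, each tested only on the fold it never saw."""
    folds = k_fold_indices(len(examples), k)
    correct = 0
    for held_out in folds:
        hidden = set(held_out)
        training = [ex for i, ex in enumerate(examples) if i not in hidden]
        model = train(training)
        for i in held_out:
            text, label = examples[i]
            correct += predict(model, text) == label
    return correct / len(examples)


# Made-up toy data: (text, label) pairs, 7 negative and 3 positive.
examples = [(f"tweet {i}", "negative") for i in range(7)] + [
    (f"tweet {i}", "positive") for i in range(7, 10)
]

folds = k_fold_indices(len(examples), 5)
cv_accuracy = cross_validated_accuracy(examples, train_majority, predict_majority, k=5)
print(f"cross-validated accuracy: {cv_accuracy:.0%}")
```

The key property is that every item is scored exactly once, by a model that was trained without it.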
We’re not bringing this up because you need to understand the technical details of how this works. Rather, the big point is that all of the numbers we show you on the Analyze page in CrowdFlower AI are calculated using your training data. If your training data is different from the data you’re trying to classify, the model might not perform as well as the numbers suggest. But if you have a good amount of high-quality training data (which, as a CrowdFlower user, you should!) then the various measures we display for you give you powerful insight into how your model will perform when put to use.