There are lots of great tools out there for building machine learning models and data processing pipelines. Most of these tools, like R, scikit-learn, spark.ml and TensorFlow, require substantial hands-on coding to produce working results. At CrowdFlower, we use many of these resources to varying degrees. However, we also recognize that many people will prefer to approach model building and deployment in a hands-on integrated environment supported by a graphical interface. To this end, we are pleased to showcase an end-to-end model construction process in Microsoft’s Azure Machine Learning Studio.
At CrowdFlower, we know a lot of our customers are interested in machine learning. Some use data collected on our platform in their own internal efforts, while others are building solutions using our recently-announced CrowdFlower AI platform. We also know, though, that there are a lot of people who are becoming interested in AI and machine learning, and want to learn more about the tools available to them. Stay tuned for more content on getting started with machine learning, in text analytics and beyond.
Maybe you want to get into machine learning or automatic text classification, but aren’t sure where to start. Maybe you’re curious to learn more about Microsoft’s Azure Machine Learning offering. Maybe you just like seeing cool graphical interfaces. If any of these statements describe you, you’ll enjoy this walkthrough of how to build a text classifier in Microsoft’s Azure Machine Learning (AML) Studio. As a use case, we’re going to build a fairly simple sentiment analysis model. You’ll see everything from the raw data used in training all the way through evaluation metrics for the completed model.
Why sentiment analysis?
In one sense, sentiment analysis is just another type of document classification. You have a set of labels, and you want to predict which of those labels best applies to a given piece of text. But unlike a lot of other classification tasks, which can be very specific to certain projects, sentiment analysis has a lot of general appeal. This data in particular, which we adapted from a dataset in our Data for Everyone library, is a fun twist on the typical sentiment analysis job. Most sentiment analysis is concerned with attitudes expressed in text (positive, negative, or neutral), but this data actually seeks to capture the emotional content of passages instead. We felt that the unique appeal of this dataset, and the fact that it motivates a much larger class of text analytics tasks, made it a great place to start looking at building models.
Getting set up
There are lots of screenshots in this blog post, but to really be able to follow along, we recommend signing up (if you haven’t already) for a Microsoft Live account, and using this to set up an account on Azure Machine Learning Studio. It’s all free to do, and pretty quick. Plus, once you have this access set up, you’ll be able to tinker with different versions of the model and see how they perform. It’s a great way to learn.
Whether you have created an account or not, you can view the page for our uploaded experiment (AML’s term for a data pipeline and model) in the Cortana Intelligence Gallery. The description contained on that page is essentially a compressed version of this blog post (without the experiment walkthrough). If you have completed the registration steps above, you can click the big green button labeled “Open in Studio” and copy this experiment into your own AML workspace to play around with.
Once you’ve done this, the view in your Studio should look something like the image below. It looks complicated (and it is, a bit) but once we break it down into individual steps it will become a lot less imposing. At a high level, though, what is shown here is a data processing pipeline, where data enters at the top, is subjected to a series of transformations and calculations, and at the bottom you get a trained model and some associated information.
Step 1: Getting data in
Let’s start at the top of the pipeline and examine these steps in more detail. At the top, you will see two boxes, one labeled “Web service input” and one containing a dataset built from a .csv file. These represent the two ways you can get data into the experiment. Here, we’re only interested in the .csv dataset. The data we’re using contains about 10,000 tweets that were labeled using the CrowdFlower platform for the emotional state they convey — happiness or sadness. This data and lots of other cool datasets are available for free download on our Data For Everyone page.
Click on the small dot at the bottom of the .csv dataset. This should bring up a menu of options:
Click “Visualize” and you will see a preview of the dataset’s contents. Here, the data contains three columns:
- id_nfpu: This is a unique identifier for each row of data. This is useful if you are only passing part of your data to the classifier, and want to be able to stitch predictions back together with other metadata later on.
- label: This is the label assigned to a row of training data. Here, labels are either “happiness” or “sadness”, representing the two emotional states of posts being classified.
- features: This column contains the text to which the label applies. It will get transformed into features used by the model during training and prediction.
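If you prefer to see the shape of the data in code, here is a minimal Python sketch that parses a couple of rows in the same three-column layout (the tweets themselves are invented for illustration; the real dataset has about 10,000 rows):

```python
import csv
import io

# A tiny sample in the same three-column layout as the dataset
# (id_nfpu, label, features); the rows are made up for illustration.
sample = """id_nfpu,label,features
1,happiness,What a beautiful morning!
2,sadness,I miss my old friends.
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["id_nfpu"], row["label"], row["features"])
```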
Step 2: Clean and prepare your data
Both of the data sources feed into a module labeled “Edit Metadata”. This is the first data transformation step; it takes the “label” column and converts it into a categorical variable. The next step, “Clean missing data”, removes any rows in the dataset that don’t contain a label, since the model can’t train on them.
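In Studio these modules do the work for you, but the same two operations are easy to picture in plain Python (the rows below are invented for illustration):

```python
# Rough stand-in for the two modules: "Edit Metadata" treats "label"
# as a categorical variable (a fixed set of category values), and
# "Clean missing data" drops any row with no label.
rows = [
    {"id_nfpu": "1", "label": "happiness", "features": "great day"},
    {"id_nfpu": "2", "label": "", "features": "unlabeled tweet"},
    {"id_nfpu": "3", "label": "sadness", "features": "rough week"},
]

labeled = [r for r in rows if r["label"]]          # drop unlabeled rows
categories = sorted({r["label"] for r in labeled})  # the category values

print(categories)     # ['happiness', 'sadness']
print(len(labeled))   # 2
```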
The final module in this screenshot showcases one of our favorite things about AML Studio, namely that users can write their own data processing scripts in R (or Python!) and incorporate them directly into the pipeline. This is great if you have data that needs to be processed in a specific way, or even if you just prefer to hand-code some or all of your data pipeline steps. R support in AML is quite full-featured, and even allows generation of graphics, but here we make use of this step more simply. Click on that module and a code editor opens showing the script. The dataset is read in and converted into an R data frame, and then two modifications are made to the “features” column: non-alphanumeric characters are scrubbed from the text, and the text is converted to lowercase. After that, the data frame is converted back into an AML dataset and passed down the pipeline. Very smooth.
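For reference, those two text transformations look like this outside of AML. The experiment itself uses R, so this is just an equivalent Python sketch (the exact characters the R script keeps may differ; here we keep letters, digits, and spaces so words stay separated):

```python
import re

def clean_text(text):
    # Mirror the two transformations applied to the "features" column:
    # scrub non-alphanumeric characters (keeping spaces), then lowercase.
    text = re.sub(r"[^A-Za-z0-9 ]", "", text)
    return text.lower()

print(clean_text("So HAPPY today!!!"))  # "so happy today"
```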
Step 3: Train your model
The next set of steps in the pipeline contains two important sequences of events. Let’s start with the “Feature Hashing” module. This receives its input from the R script just described. If you click on the module, you will see that it is operating on the “features” column in the dataset, and that there are two options you can set, labeled “N-grams” and “Hashing bitsize.”
N-grams may be a familiar term — it refers to how the text is broken up and used as features in the model. Here, with a value of 1, only so-called unigram features are used, meaning the individual words in the text. With a value of 2, both unigrams and two-word sequences would be used, and so on. The other parameter, “Hashing bitsize”, refers to the number of features that are used to build the model. Here we supply a value of 12, meaning that 2^12 = 4096 features are used.
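The hashing trick itself is easy to demonstrate. Here is a toy Python version using unigram features and a 12-bit hash; AML’s actual hash function will differ, and md5 is used here only because it gives a stable illustration:

```python
import hashlib

def hashed_features(text, bits=12):
    """Toy hashing trick: map each unigram to one of 2**bits feature
    slots and count occurrences per slot."""
    n_slots = 2 ** bits  # bitsize 12 -> 4096 feature slots
    vec = {}
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        slot = int(digest, 16) % n_slots
        vec[slot] = vec.get(slot, 0) + 1
    return vec

vec = hashed_features("so happy so glad")
print(sum(vec.values()))  # 4 tokens, counted across at most 4096 slots
```

Note that different words can collide in the same slot; with 4096 slots and short tweets, collisions are rare enough that the model still works well.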
The feature hashing module’s output is directed to two places — the first time we’ve seen this in our experiment. This allows that data to be simultaneously used both for training a logistic regression model, and for performing cross-validation to see how that model performs. Before the cross-validation step, this output data also passes through a module that generates the cross-validation folds, here set to a default of 5.
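Fold assignment is simple to picture: every row is assigned to one of 5 groups, and each group in turn serves as the held-out evaluation set while the other four are used for training. A toy sketch of the assignment step:

```python
def make_folds(n_rows, k=5):
    # Assign each row to one of k cross-validation folds round-robin;
    # each fold takes a turn as the held-out evaluation set.
    return [i % k for i in range(n_rows)]

folds = make_folds(10)
print(folds)  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
```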
In addition to featurized data, both training and cross-validation depend on the existence of a model instance. This is achieved by the two steps on the left of the above screenshot. In the first, we choose a logistic regression model. In the second step we define the model as a multiclass model (in which probabilities for any number of candidate labels will sum to 1).
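That “sum to 1” behavior comes from the softmax function at the heart of multiclass logistic regression: each candidate label gets a score, and softmax converts the scores into probabilities. A quick illustration with made-up scores for our two labels:

```python
import math

def softmax(scores):
    # Convert per-label scores into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5])  # invented scores for "happiness" vs "sadness"
print(round(sum(probs), 6))  # 1.0
```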
Step 4: Evaluate your model
Finally, after the model has been trained and cross-validated, all that remains is evaluation and making the model available outside the experiment. The trained model as well as cross-validation data can all be hooked up to web service outputs, and the cross-validation predictions can also be exported to CSV. If you click the small circle on the bottom of the “Evaluate Model” module, you will see a variety of accuracy measures as well as a confusion matrix.
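If you want to sanity-check those numbers yourself, accuracy and the confusion matrix are straightforward to recompute from the exported cross-validation predictions. A minimal Python sketch with invented labels and predictions:

```python
from collections import Counter

def confusion_and_accuracy(true_labels, predicted):
    # Tally (true, predicted) pairs -- the cells of the confusion
    # matrix -- and compute overall accuracy.
    matrix = Counter(zip(true_labels, predicted))
    correct = sum(n for (t, p), n in matrix.items() if t == p)
    return matrix, correct / len(true_labels)

true_labels = ["happiness", "sadness", "happiness", "sadness"]
predicted   = ["happiness", "happiness", "happiness", "sadness"]
matrix, acc = confusion_and_accuracy(true_labels, predicted)
print(acc)  # 0.75
```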
That concludes the walk-through of the experiment we have uploaded to Cortana Intelligence Gallery, but your exploration doesn’t have to end there. Want to see how the model performs with different sets of features? Want to try out alternatives to logistic regression (say, naive Bayes or boosted decision trees)? Want to start from scratch, and build a model using another dataset? You can do all of these things. Make some changes to the model, click “Run”, and see how it does. If you wind up with something you want to use to actually generate predictions on new data, just click the button labeled “Set up Web Service” and watch your experiment being transformed into a predictive model that you can deploy and put to use.