In this post, we’ll look at reviews from the Yelp Dataset Challenge. We’ll train a machine learning system to predict the star-rating of a review based only on its text. For example, if the text says “Everything was great! Best stay ever!!” we would expect a 5-star rating. If the text says “Worst stay of my life. Avoid at all costs”, we would expect a 1-star rating. Instead of writing a series of rules to work out whether some text is positive or negative, we can train a machine learning classifier to “learn” the difference between positive and negative reviews by giving it labelled examples.

This post follows closely from the previous one: Analyzing 4 Million Yelp Reviews with Python on AWS. You’re strongly encouraged to go through that one first. In particular, we will not be showing how to set up an EC2 Spot instance with adequate memory and processing power to handle this large dataset, but the same setup was used to run the analysis for this post.


Introduction and Overview

This post will show how to implement and evaluate a supervised machine-learning system for classifying the Yelp reviews. Specifically, this post will explain how to use the popular Python library scikit-learn to:

  • convert text data into TF-IDF vectors
  • split the data into a training and test set
  • classify the text data using a linear Support Vector Machine (scikit-learn’s LinearSVC)
  • evaluate our classifier using precision, recall and a confusion matrix

In addition, this post will explain the terms TF-IDF, SVM, precision, recall, and confusion matrix.

In order to follow along, you should have at least basic Python knowledge. As the dataset we’re working with is relatively large, you’ll need a machine with at least 32GB of RAM, and preferably more. The previous post demonstrated how to set up an EC2 Spot instance for data processing, as well as how to produce visualisations of the same dataset. You’ll also need to install scikit-learn on the machine you’re using.

Loading and Balancing the Data

To load the data from disk into memory, run the following code. You’ll need to have downloaded the Yelp dataset and untarred it so that the Yelp reviews JSON file can be read.
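The original listing isn’t reproduced here, but a minimal sketch looks like this, assuming the line-delimited reviews JSON file from the dataset archive (the exact filename depends on which round of the dataset you downloaded):

```python
import json

# Path to the reviews file from the untarred Yelp dataset archive
# (adjust the filename to match your download)
reviews_path = "yelp_academic_dataset_review.json"

texts = []
stars = []

with open(reviews_path) as f:
    for line in f:
        review = json.loads(line)      # each line holds one JSON review object
        texts.append(review["text"])
        stars.append(review["stars"])

print(len(texts), len(stars))
```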

Even on a fast machine, this code could take a couple of minutes to run.

We now have two arrays of data: the text of each review and the respective star-rating. Our task is to train a system that can predict the star-rating from looking at only the review text. This is a difficult task since different people have different standards, and as a result, two different people may write a similar review with different star ratings. For example, user Bob might write “Had an OK time. Nothing to complain about” and award 4 stars, while user Tom could write the same review and award 5 stars. This makes it difficult for our system to accurately predict the rating from the text alone.

Another complication is that our dataset is unbalanced: we have far more 5-star reviews than, say, 2-star reviews. Most machine learning classifiers are biased towards the classes they see most often during training, so we’ll get fairer predictions if we train the system on balanced data. This means that ideally we should have the same number of examples of each review type.

In machine learning, it’s common to separate our data into features and labels. In our case, the review texts (the input data) will be converted into features and the star ratings (what we are trying to predict) are the labels. You’ll often see these two categories referred to as X and Y respectively. Adding the following function to a cell will allow us to balance a dataset by removing over-represented samples from the two lists.
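A possible implementation is sketched below (the original post’s function may differ in its details); it undersamples every class down to the size of the smallest one:

```python
from collections import Counter

def balance_classes(xs, ys):
    """Undersample xs, ys so that every label in ys is equally represented."""
    freqs = Counter(ys)
    max_allowable = min(freqs.values())   # size of the smallest class
    kept = Counter()
    balanced_x, balanced_y = [], []
    for x, y in zip(xs, ys):
        if kept[y] < max_allowable:
            kept[y] += 1
            balanced_x.append(x)
            balanced_y.append(y)
    return balanced_x, balanced_y
```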

Now we can create a balanced dataset of reviews and stars by running the following code (remember that now our texts are x and the stars are y).
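For example (a sketch that reuses the texts and stars variables from the loading code and the balance_classes function above):

```python
from collections import Counter

print(Counter(stars))                  # original, unbalanced distribution
balanced_x, balanced_y = balance_classes(texts, stars)
print(Counter(balanced_y))             # balanced distribution
```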

You can see above that in the original distribution, we had 358,550 2-star reviews and 1.7 million 5-star reviews. After balancing, we have 358,550 of each class of review. We’re now ready to prepare our data for classification.

Vectorizing our Text Data

Computers deal with numbers much better than they do with text, so we need a meaningful way to convert all the text data into matrices of numbers. A straightforward (and oft-used) method for doing this is to count how often words appear in a piece of text and represent each text with an array of word frequencies. For example, the short text “the dog jumps over the dog” could be represented by the following array:
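The original array isn’t reproduced here; assuming the lookup table happens to map index 0 to dog, 1 to jumps, 2 to over, and 3 to the, it would start like this:

```python
[2, 1, 1, 2, 0, 0, 0, ...]   # one element per vocabulary word; almost all are 0
```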

The array would be quite large, containing one element for every possible word. We would store a lookup table separately, recording that (for example) the 0th element of each array represents the word “dog”. Because the word dog occurs twice in our text, we have a 2 in this position. Most of the words do not appear in our text, so most elements would contain 0. We also have a 1 to represent jumps, another 1 for over and another 2 for the.

A slightly more sophisticated approach is to use Term Frequency Inverse Document Frequency (TF-IDF) vectors. This approach is built on the idea that very common words, such as the, aren’t very important, while rarer words, such as Namibia, carry more information. TF-IDF therefore weights the count of each word in each text by how many of the texts that word occurs in. If a word occurs in nearly all of the texts, we deem it to be less significant. If it appears in only a few texts, we regard it as more important.
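As a quick illustration (a toy example, not taken from the original post), scikit-learn’s TfidfVectorizer gives the rare word namibia a higher weight than the ubiquitous word the:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "the hotel was clean and the staff were friendly",
    "the food was cold and the service was slow",
    "we loved our trip to namibia and the desert views",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_corpus)

# Compare the weights of "the" and "namibia" in the last document:
# "the" appears in every document, so it is down-weighted;
# "namibia" appears in only one, so it is up-weighted.
vocab = tfidf.vocabulary_
last_doc = weights[2].toarray()[0]
print("the:", last_doc[vocab["the"]])
print("namibia:", last_doc[vocab["namibia"]])
```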

The last thing that you need to know about text representation is the concept of n-grams. Words often mean very different things when we combine them in different ways. We expect our learning algorithm to learn that a review containing the word bad is likely to be negative, while one containing the word great is likely to be positive. However, reviews containing phrases such as “… and then they gave us a full refund. Not bad!” or “The food was not great” will trip up our system if it only considers words individually.

When we break a text into n-grams, we consider several words grouped together to be a single word. “The food was not great” would be represented using bi-grams as (the food, food was, was not, not great), and this would allow our system to learn that not great is a typically negative statement because it appears in many negative reviews.
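As a quick check (again a toy example rather than the original post’s code), we can ask scikit-learn’s CountVectorizer which unigram and bigram features it extracts from that sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
demo.fit(["The food was not great"])
print(sorted(demo.vocabulary_))
# ['food', 'food was', 'great', 'not', 'not great', 'the', 'the food', 'was', 'was not']
```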

Using progressively longer combinations of words allows our system to learn fine-grained meanings, but at the cost of processing power and data-scarcity (there are many three-word phrases that might only appear once, and therefore are not much good for learning general rules). For our analysis, we’ll stick with single words (also called unigrams) and bigrams (two words at a time).

Luckily scikit-learn implements all of this for us: the TF-IDF algorithm along with n-grams and tokenization (splitting the text into individual words). To turn all of our reviews into vectors, run the following code (which took roughly 12 minutes to complete on an r4.4xlarge EC2 instance):
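A sketch of that step, reusing the balanced_x list from the balancing code above (the exact parameters in the original post may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize into unigrams and bigrams and compute TF-IDF weights
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# fit() builds the vocabulary from all the reviews;
# transform() turns each review into a (sparse) vector of TF-IDF weights
vectors = vectorizer.fit_transform(balanced_x)

print(vectors.shape)
```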

Creating a Train/Test Split

You can find patterns in any random noise, just as you can find shapes in clouds. Machine learning algorithms are all about finding patterns, and they often find patterns that aren’t meaningful to us. In order to prove that the system is actually learning what we think it is, we’ll “train” it on one part of our data and then get it to predict the labels (star ratings) on a part of the data it didn’t see during training. If it does this with high accuracy (if it can predict the ratings of reviews it hasn’t seen during the training phase), then we’ll know the system has learned some general principles rather than just memorizing results for each specific review.

We had two arrays of data–the reviews and the ratings. Now we’ll want four arrays–features and labels for training and the same for testing. There is a train_test_split function in scikit-learn that does exactly this. Run the following code in a new cell:
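A sketch of the split, holding out a third of the data for testing (the random_state value is an arbitrary choice that just makes the split reproducible):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    vectors, balanced_y, test_size=0.33, random_state=42)
```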

We now have a third of our data in X_test and y_test. We’ll teach our system using two-thirds of the data (X_train and y_train), and then see how well it does by comparing its predictions for the reviews in X_test with the real ratings in y_test.

Fitting a Classifier and Making Predictions

The classifier we’ll use is a Linear Support Vector Machine (SVM), which has been shown to perform well on several text classification tasks. We’ll skip over the mathematics and theory behind SVMs, but essentially an SVM tries to find a way to separate the different classes of our data. Remember that we have vectorized our text, so each review is now represented as a set of coordinates in a high-dimensional space. During training, the SVM will try to find hyperplanes that separate our training examples. When we feed it the test data (minus the matching labels), it will use the boundaries it learned during training to predict the rating of each test review. You can find a more in-depth overview of SVMs on Wikipedia.

To create a Linear SVM using scikit-learn, we need to import LinearSVC and call .fit() on it, passing in our training instances and labels (X_train and y_train). Add the following code to a new cell and run it:
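A sketch of the fitting step:

```python
from sklearn.svm import LinearSVC

classifier = LinearSVC()
classifier.fit(X_train, y_train)   # learn the separating hyperplanes from the training data
```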

Fitting the classifier is faster than vectorizing the text (~ 6 minutes on an r4.4xlarge instance). Once the classifier has been fitted, it can be used to make predictions. Let’s start by predicting the rating for the first ten reviews in our test set (remember that the classifier has never seen these reviews before). Run the following code in a new cell:
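Something along these lines (a sketch):

```python
# Predict ratings for the whole test set, then compare the first ten
# predictions against the first ten true ratings.
preds = classifier.predict(X_test)

print(list(preds[:10]))
print(list(y_test[:10]))
```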

(Note that the classifier is fast once it has been trained so it should only take a couple of seconds to generate predictions for the entire test set.)

The first line of the output displays the ratings our classifier predicted for the first ten reviews in our dataset, and the second line shows the actual ratings of the same reviews. It’s not perfect, but the predictions are quite good. For example, the first review in our data set is a 5-star review, and our classifier thought it was a 4-star review. The classifier predicted that the fifth review in our dataset was a 5-star review, which was correct. We can take a quick look at each of these reviews manually. Run the following code in a new cell:
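One way to recover the raw text of those test reviews is to re-split the untransformed texts with exactly the same parameters we used for the vectors, so the two splits line up (a sketch; the original post may have done this differently):

```python
from sklearn.model_selection import train_test_split

# Re-run the split on the raw texts with the same test_size and random_state
# so that text_test lines up with X_test and y_test.
_, text_test, _, _ = train_test_split(
    balanced_x, balanced_y, test_size=0.33, random_state=42)

print(text_test[0])   # the first test review (actual: 5 stars, predicted: 4)
print(text_test[4])   # the fifth test review (actual and predicted: 5 stars)
```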

>>> If you enjoy service by someone who is as competent as he is personable, I would recommend Corey Kaplan highly. The time he has spent here has been very productive and working with him educational and enjoyable. I hope not to need him again (though this is highly unlikely) but knowing he is there if I do is very nice. By the way, I’m not from El Centro, CA. but Scottsdale, AZ.
>>> I walked in here looking for a specific piece of furniture. I didn’t find it, but what I did find were so many things that I didn’t even know I needed! So much cool stuff here, go check it out!

We can see that although the author of the first review did leave 5 stars, he uses more moderate descriptions and his review contains some neutral phrases such as “By the way, I’m not from El Centro”, while the phrasing in the fifth review is more emphatically positive (“cool stuff”). It’s clear that the prediction task would be difficult for a human as well!

Looking at the results of ten predictions is a nice sanity-check and can help us build our own intuitions about what the system is doing. We’ll want to see how it performs over a larger set before deciding how well it can perform the prediction task.

Evaluating our Classifier

There are a number of metrics that we can use to estimate the quality of our classifier. The simplest is to see what percentage of the time it predicts the correct answer. This metric is, unsurprisingly, called accuracy. We can calculate the accuracy of our system by comparing the predicted ratings with the real ratings: when they are the same, our classifier predicted that review correctly. We sum up all of the correct answers and divide by the total number of reviews in our test set. If this number is equal to 1, it means our classifier was spot on every time, while a score of 0.5 means half of its answers were correct. You can ask scikit-learn to calculate the accuracy as follows:
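A sketch, assuming the preds and y_test variables from the cells above:

```python
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, preds))
```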

The score may seem a bit low, since the classifier was only correct about 62% of the time, but keep in mind that with five rating classes, random guessing would be correct only 20% of the time.

Accuracy is a crude metric–there are of course finer-grained evaluation methods. It’s likely that some classes are ‘easier’ to predict than others, so we want to look at how well the classifier can predict each class (for example, only 5-star reviews) individually. Looking at results on a per-class level means that there are two different ways that the classifier could be wrong. For a given review and a given class, the classifier might have a false positive or a false negative classification. If we take 5-star reviews as an example, a false positive occurs when the classifier predicted that a review was a 5-star review when in fact it wasn’t. A false negative occurs when the classifier predicted a review wasn’t a 5-star review, when in fact it was.

Building on the ideas of false positives and false negatives, we introduce precision and recall. A classifier that predicts 5-star reviews with high precision almost never labels other reviews as 5-star, but it might ‘miss’ many real 5-star reviews and classify them into other classes. A classifier with high recall for 5-star reviews hardly ever mislabels a real 5-star review as something else, but it might label many other reviews as 5-star. Precision and recall can be a bit confusing at first; there is a nice Wikipedia article that explains these topics in more detail.

We would like our classifier to strike a balance between precision and recall for all of the classes, and we can measure both at once using the F1 score, which combines precision and recall into a single metric. We can get an overview of all the classes by using the classification_report from scikit-learn. Run the following code in a new cell:
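A sketch:

```python
from sklearn.metrics import classification_report

# Precision, recall, and F1 score for each star rating
print(classification_report(y_test, preds))
```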

We can see from the above that the 1- and 5-star reviews are the easiest to predict, and we get F1 Scores of 0.75 and 0.74 respectively for them. The neutral reviews are more difficult to predict, as evidenced by F1 scores that are just over 0.5.

The final evaluation metric we can consider is a confusion matrix. A confusion matrix demonstrates which predictions are most often confused. For our setup, we would hope that 1- and 5-star reviews are not confused too often by the classifier, but we don’t care too much if it mixes up 4-star and 5-star reviews.

We can examine the confusion matrix by running the following code:
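A sketch:

```python
from sklearn.metrics import confusion_matrix

# A 5x5 matrix: rows are the actual ratings, columns are the predicted ratings
print(confusion_matrix(y_test, preds))
```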

The confusion matrix itself can be a bit … confusing. In scikit-learn’s output, the rows represent the actual classes of the reviews, while the columns represent the classes our classifier predicted. Let’s consider a few examples to see what is happening:

  • The first row describes all of the reviews that are actually 1-star reviews. The top left cell means that the classifier correctly predicted 92,846 1-star reviews.
  • The cell immediately to the right (first row, second column) indicates that there were 20,722 reviews which were 1-star reviews but which our classifier thought were 2-star reviews.
  • The cell in the first column of the second row indicates that there were 27,820 2-star reviews which our classifier thought were 1-star reviews.
  • The diagonal from the top left to the bottom right represents all of the correct predictions made by our classifier. We want these numbers to be the highest.
  • The top right corner says that there were 871 1-star reviews which the classifier thought were 5-star reviews.
  • The bottom left hand corner says that there were 1,023 5-star reviews which the classifier thought were 1-star reviews.

We can see that there are clusters of high numbers towards the top left and bottom right, showing that when the classifier was wrong it mainly confused neighbouring ratings, for example 4- and 5-star reviews, or 1- and 2-star ones. The numbers towards the top right and bottom left are much lower, showing that the classifier was rarely completely wrong (thinking that a 5-star review was actually a 1-star review, or vice versa).

Remodeling our Problem

As our final task, we’ll try to model a simpler problem and run the exact same analysis. Instead of trying to predict the exact star rating, we’ll try to classify the posts into positive (4- or 5-star reviews) or negative (1- or 2-star reviews). We’ll remove all the 3-star reviews for this task.

We don’t need to recalculate the vectors since we’re using the same texts. We simply need to remove the instances and labels for all 3-star reviews and modify our labels (y_train and y_test). Because our (vectorized) reviews are stored in sparse matrices and our labels are stored in Python lists, we have to treat each of these separately. The code below strips out the 3-star reviews and their labels, then converts all the remaining labels to n for negative or p for positive:
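A sketch of that step (the 2-suffixed variable names are illustrative, not necessarily those used in the original post):

```python
# Positions of the train/test examples that are not 3-star reviews
keep_train = [i for i, y in enumerate(y_train) if y != 3]
keep_test = [i for i, y in enumerate(y_test) if y != 3]

# The vectorized reviews (sparse matrices) support row indexing with a list of positions
X_train2 = X_train[keep_train]
X_test2 = X_test[keep_test]

# 1- and 2-star reviews become 'n' (negative); 4- and 5-star reviews become 'p' (positive)
y_train2 = ["p" if y_train[i] > 3 else "n" for i in keep_train]
y_test2 = ["p" if y_test[i] > 3 else "n" for i in keep_test]
```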

Once we’ve set up these new arrays, we can run exactly the same “train-predict-evaluate” steps that we went through previously:
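For example:

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

classifier2 = LinearSVC()
classifier2.fit(X_train2, y_train2)

preds2 = classifier2.predict(X_test2)

print(accuracy_score(y_test2, preds2))
print(classification_report(y_test2, preds2))
print(confusion_matrix(y_test2, preds2))
```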

We can see that this task is much easier for our classifier, and it predicts the correct class 96% of the time! The confusion matrix tells us that there were 9,028 negative reviews that the classifier thought were positive and 9,237 positive reviews that the classifier thought were negative–everything else was correct.

Conclusion

This post covered the basics of using a Support Vector Machine to classify text. After vectorizing text and training a classifier, two prediction tasks were performed–predicting the exact rating of each review vs. predicting whether the review was positive or negative.

Why would we be interested in predicting the ratings from text–after all, we already have the correct ratings at our disposal? There are many uses for systems such as the one we just built. For example, a company might want to analyze social media in order to find out how the public feels about the company, or reach out to customers who post negative reviews of the company. Specific companies are often mentioned on Twitter, Facebook, and other social media sites, and in these situations the text descriptions are not accompanied by a star rating.

As a more concrete example, imagine that you’re the manager of the NotARealName Hotel, that you’re automatically gathering tweets which mention your hotel, and that you want:

  • a) to know whether the general sentiment regarding your hotel is going up or down over time, and
  • b) to respond to negative comments about your hotel (or share positive ones).

The following code uses our same classifier on new data. We use two fictitious tweets–the first one is obviously negative and the second one positive. Can we classify them without human intervention?
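A sketch, with two invented tweets (the hotel handle and wording are made up for illustration):

```python
# Two made-up tweets: the first is clearly negative, the second clearly positive
tweets = [
    "Worst weekend of my life. The room at @NotARealNameHotel was dirty and the staff were rude.",
    "Had such a great stay at @NotARealNameHotel -- friendly staff and an amazing breakfast!",
]

# Transform (not fit) the tweets with the vectorizer trained on the Yelp reviews,
# then classify them with the positive/negative classifier from the previous section.
tweet_vectors = vectorizer.transform(tweets)
print(classifier2.predict(tweet_vectors))   # expected: negative ('n'), then positive ('p')
```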

Our classifier determines that the first tweet is negative and the second one is positive. However, we wouldn’t want to trust it too much. Remember that it’s likely to be sensitive to the kind of language that’s used in Yelp reviews, so if we want to classify text from other areas, e.g., finance or politics, we’ll likely find the results to be less satisfying. However, with enough training data, we can easily retrain the classifier on examples that better match our desired task.



Gareth Dwyer

Gareth enjoys writing. His favourite languages are Python and English, but he's not too fussy. He is author of the book "Flask by Example", and plays around with Python, Natural Language Processing, and Machine Learning. You can follow him on Twitter as @sixhobbits.
