Practical Neural Networks with Keras: Classifying Yelp Reviews


Keras is a high-level deep learning library that makes it easy to build Neural Networks in a few lines of Python. In this post, we’ll use Keras to train a text classifier. We’ll use a subset of the Yelp Challenge Dataset, which contains over 4 million Yelp reviews, and we’ll train our classifier to discriminate between positive and negative reviews. Then we’ll compare the Neural Network classifier to a Support Vector Machine (SVM) on the same dataset, and show that even though Neural Networks are breaking records in most machine learning benchmarks, the humbler SVM is still a great solution for many problems.

Once we’re done with the classification tasks, we’ll show how to package the trained model so that we can use it for more practical purposes.


This post is aimed at people who want to learn about neural networks, machine learning, and text classification. It will help if you have used Python before, but we’ll explain all of the code in detail, so you should be able to keep up if you’re new to Python as well. We won’t be covering any of the mathematics or theory behind the deep learning concepts presented, so you’ll be able to follow even without any background in machine learning. You should have heard, and have some high-level understanding, of terms such as “Neural Network”, “Machine Learning”, “Classification” and “Accuracy”. If you’ve used SSH, or at least run commands in a shell before, some of the setup steps will be much easier.

If you want to follow along with the examples in this post, you’ll need an account with Amazon Web Services, as we’ll be using their Spot Instance GPU-compute machines for training. You can use your own machine, or a machine from any other cloud provider that offers GPU-compute virtual private servers, but then you’ll need to install and configure:

  • A Python environment with python 3, pip3, numpy, scipy, and Jupyter
  • Keras and TensorFlow
  • CUDA and CuDNN

In our example, we’ll be using the AWS Deep Learning AMI, which has all of the above pre-installed and ready to use.


In this post, we will:

  • Set up an AWS Spot Instance (pre-configured with a Tesla GPU, CUDA, cuDNN, and most modern machine learning libraries)
  • Load and parse the Yelp reviews in a Jupyter Notebook
  • Train and evaluate a simple Recurrent Neural Network with Long Short-Term Memory (LSTM-RNN) using Keras
  • Improve our model by adding a Convolutional Neural Network (CNN) layer
  • Compare the performance of the Neural Network classifier to a simpler SVM classifier
  • Show how to package all of our models for practical use
Setting up an AWS Spot Instance

Because Neural Networks need a lot of computational power to train, and greatly benefit from being run on GPUs, we’ll be running all the code in this tutorial on a Virtual Private Server (VPS) through Amazon Web Services (AWS).

AWS offers cloud machines with high processing power, lots of RAM, and modern GPUs. They auction off spare capacity by the hour through “Spot Instances”, which are designed for short-lived workloads. Because we’ll only need the instance for a couple of hours at most, Spot Instances are ideal for us. The EC2 p2.xlarge instances that we’ll be using usually cost around $1 per hour, while the same machine using Spot Pricing usually costs around $0.20 per hour (depending on current demand).

If you don’t have an AWS account, create one at aws.amazon.com. You’ll need to go through a fairly long sign-up process and have a valid credit card, but once you have an account, launching machines in the cloud can be done in a few clicks.

Once you have an account, log in to the AWS console and click on the link to “EC2”, under the “Compute” category. The first thing we need to do is to pick a region into which our instance will be launched. Pick one of “US East (N. Virginia)”, “US West (Oregon)” or “EU (Ireland)” (whichever one is closest to you), as this is where the Deep Learning pre-configured machine images are available. You can select the region from the dropdown in the top right corner of the page. Once you’ve chosen your region, click on the Spot Instances link in the left column, and hit “Request Spot Instances”.


We can accept most of the defaults in the resulting page, but we need to choose:

  • The AMI (Amazon Machine Image): We’ll pick an image with everything we need already installed and configured.
  • The Instance Type: EC2 instances come in different shapes and sizes. We’ll pick one optimized for GPU-compute tasks, specifically the p2.xlarge instance.

The easiest way to find the correct AMI is by its unique ID. Press “Search for AMI”, select “Community AMIs” from the dropdown, and paste the relevant AMI ID for your region into the search box. For example, I am using the eu-west-1 (Ireland) region, so the AMI ID is ami-c5afaaa3. Hit the select button once you’ve found the AMI, and close the pop-up, shown in the image below.


To choose the instance type, first delete the default selection by pressing the x circled below, then press the Select button.

In the pop-up, choose “GPU Compute” from the “Instance Type” dropdown, and select p2.xlarge. You can have a look at the current spot price, or press “Pricing History” for a more detailed view. It’s good to check that there haven’t been any price spikes in the past few days. Finally, press “Select” to close the window.


Under the Availability Zone section, tick all three options, as we don’t care which zone our instance is in. AWS will automatically fire up an instance in the (currently) cheapest zone. Press “Next” to get to the second page of options. By “Instance store”, tick the “Attach at launch” box, so that our disk will be ready to use the moment our instance boots.

Under “Set keypair and role”, choose a key pair. If you don’t have one yet, press “Create new key pair”, which will generate a public-private key pair. If you’re not used to working with SSH and key pairs, follow AWS’s documentation on EC2 key pairs. In particular, it’s important to change the permissions on your new private key before you can use it, by running the following command, substituting ‘my-key-pair.pem’ with the full path and name of your private key.

chmod 400 my-key-pair.pem

The last thing to configure is a security group. Large AWS machines present juicy targets to attackers who want to use them for botnets or other nefarious purposes, so by default they don’t allow any incoming traffic. As we’ll want to connect to our instance via SSH, we need to configure the firewall appropriately. Select “create new security group”, and press “Create Security Group” in the next window. You’ll see a pop-up similar to the one below. Fill in the top three fields and add an incoming rule to allow SSH traffic from your IP address.


Name the new group “allow-ssh”, and add “Allow us to SSH” as a description. Also, change the “VPC” dropdown from “No VPC” to the only other option, which is the VPC you’ll be launching the instance into. Hit the “Add Rule” button on the “Inbound” tab, choose to allow SSH traffic under “Type”, and choose “My IP” under “Source”. This will automatically whitelist your current IP address and allow you to connect to your instance. Click “Create”, and close the Security Groups tab to return to the Request Instance page where we were before.

Hit the review button at the bottom right of the page and check that all the details are correct. Then press the blue “Launch” button. Your request should be fulfilled within a few seconds, and you’ll see your instance boot up in the console under “Instances”. (Hit the Refresh button indicated below if you don’t see your new instance.) Copy the public IP address, which you might need to scroll to the right to see, to the clipboard.


If the instance status in the “Spot Requests” panel gets stuck in “pending”, it might be because your AWS account hasn’t yet been verified, or because your Spot Instance limit is set to zero. You can see your limits by clicking on “Limits” in the left-hand panel, and looking for “Spot instance requests”. If your account didn’t get properly verified, you’ll be unable to launch instances. In this case, check that you’ve completed all the account-verification steps, and contact AWS Support if necessary.

Connecting to our EC2 Instance

Now we can connect to our instance via SSH. If you’re using Mac or Linux, you can SSH simply by opening a terminal and typing the following command, replacing <public-ip> with the public IP address that you copied above, and /path/to/your/private-key.pem with the full path to your private key file.

ssh -i /path/to/your/private-key.pem -L 8888:localhost:8888 ec2-user@<public-ip>

The -i flag specifies the “identity file” (your private key). This cryptographically proves to AWS that we are who we say we are, because we hold the private key associated with our instance. The -L flag sets up an SSH tunnel so that we can access a Jupyter Notebook running on our instance as if it were running on our local machine.

If you see an error when trying to connect via SSH, you can run the command again and add a -vvv flag, which will show you verbose connection information. Depending on how your account is configured to use VPCs, you may need to use the public DNS address instead of the public IP address. See the AWS VPC documentation for some common issues and solutions regarding VPC settings.

If you’re on Windows, you can’t use the ssh command by default, but you can work around this through any of the following options:

  • If you’re on Windows 10, you can use the Windows Subsystem for Linux (WSL). Read about how to set it up and use it in Microsoft’s WSL documentation.
  • If you’re on an older version of Windows, you can use PuTTY (if you prefer having a graphical user interface) or Git Tools for Windows (if you prefer the command line). You can get PuTTY and usage instructions from the PuTTY website, and Git for Windows from the Git for Windows website. If you install Git for Windows, it’s important to select the “Use Git and optional Unix tools” option in the last step of the installer. You’ll then be able to use SSH directly from the Windows command prompt.
Running a Jupyter Notebook server

After connecting to the instance, we’ll need to run a couple of commands. First, we need to upgrade Keras to version 2, which comes with many API improvements, but which breaks compatibility in a number of ways with older Keras releases. Second, we’ll run a Jupyter Notebook server inside a tmux session so that we can easily run Python code on our powerful AWS machine.

On the server (in the same window that you connected to the server with SSH), run the following commands:

pip3 install keras --upgrade --user

Now open a tmux session by running:

tmux
This creates a virtual session in the terminal. We need this in case our SSH connection breaks for any reason (e.g. if your WiFi disconnects). Usually, breaking the connection to the server automatically halts any programs that are currently running in the foreground. We don’t want this to happen while we are a few hours into the training of our neural network! Running a tmux session will allow us to resume our session after reconnecting to the server if necessary, leaving all our code running in the background.

Inside the tmux session, run:

jupyter notebook

You should see output similar to the following. Copy the URL from the output into a web browser on your local machine to open up the notebook interface, from which you can easily run snippets of Python code. Code run in the notebook will be executed on the server, and Keras code will automatically be run in massively parallel batches on the GPU.


Hit ctrl + b and tap the d key to “detach” the tmux session — now the Jupyter session is still running, but it won’t die if the connection is interrupted. You can type tmux a -t 0 (for “tmux attach session 0”) to attach the session again if you need to stop the server or view any of the output.

Loading and parsing the Yelp dataset

Now download the Yelp dataset from the Yelp Dataset Challenge page. You’ll need to fill in a form and agree to only use the data for academic purposes. The data is compressed as a .tgz file, which you can transfer to the AWS instance by running the following command on your local machine (after you have downloaded the dataset from Yelp):

scp -i ~/path/to/your/private/key.pem yelp_dataset_challenge_round9.tgz ec2-user@<public-ip>:~

Once again, substitute the path to your private key file and the public IP address of your VPS as appropriate. Note that the command ends with :~. The colon separates the SSH connection string of your instance from the path that you want to upload the file to, and the ~ represents the home directory of your VPS.

Now, on the server untar the dataset by running tar -xvf yelp_dataset_challenge_round9.tgz. This should extract a number of large JSON files to your home directory on the server.

At this point, we’re ready to start running our Python code. Create a notebook by firing up your web browser on your local machine and visiting the URL that was displayed after you ran the jupyter notebook command. It should look similar to http://localhost:8888?token=<yourlongtoken>. For this to work, you need to be connected to your VPS via SSH with the -L 8888:localhost:8888 flag as specified earlier. In your browser, click the “new” button in the top right, and choose to create a Python 3 notebook.


If you haven’t used Jupyter notebook before, you’ll love it! You can easily run snippets of Python code in so-called “cells” of the notebook. All variables are available across all cells, so you never have to re-run earlier bits of code just to reload the same data. Create a new cell and run the following code. (You can insert new cells by using the “Insert -> Insert Cell Below” menu at the top, and you can run the code in the current cell by hitting “Cell -> Run Cells”. A useful shortcut is Shift+Enter, which will run the current cell and insert a new one below).

from collections import Counter
from datetime import datetime

import json

from keras.layers import Embedding, LSTM, Dense, Conv1D, MaxPooling1D, Dropout, Activation
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import numpy as np

The above code imports a bunch of libraries for us that we’ll be using later on. The first three are standard Python imports, while Keras and Numpy are third-party libraries that come installed with the Deep Learning AMI that we are using.

To load the reviews from disk, run the following in the next cell:


# Load the reviews and parse JSON
t1 = datetime.now()
with open("yelp_academic_dataset_review.json") as f:
    reviews = f.read().split("\n")
reviews = [json.loads(review) for review in reviews if review]
print(datetime.now() - t1)

The reviews are structured as JSON objects, one per line. This code loads all the reviews, parses them into JSON, and stores them in a list called reviews. We also print out an indication of how long this took; on the AWS machine, it should run in under a minute.

Each review in the Yelp dataset contains the text of the review and the associated star rating, left by the reviewer. Our task is to teach a classifier to differentiate between positive and negative reviews, looking only at the review text itself.
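Concretely, each parsed review is a Python dict. Here is a cut-down sketch with made-up values (the real objects contain more fields, such as business and user IDs):

```python
# A simplified review object (values are illustrative, not from the dataset)
review = {
    "text": "Great coffee and friendly staff. Will definitely come back!",
    "stars": 5,
}

# Collapse the 1-5 star rating into a binary label: 1 = positive, 0 = negative
label = 0 if review["stars"] <= 3 else 1
print(label)
# 1
```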

It is very important to have a balanced dataset for supervised learning. This means that we want the same number of positive and negative reviews when we train our neural network to tell the difference between them. If we have more positive reviews than negative reviews, the network will learn that most reviews are positive and adjust its predictions accordingly. We’ll therefore take a sample of the Yelp reviews that contains the same number of positive (four or five-star) and negative (one, two, or three-star) reviews.

# Get a balanced sample of positive and negative reviews
texts = [review['text'] for review in reviews]

# Convert our 5 classes into 2 (negative or positive)
binstars = [0 if review['stars'] <= 3 else 1 for review in reviews]
balanced_texts = []
balanced_labels = []
limit = 100000  # Change this to grow/shrink the dataset
neg_pos_counts = [0, 0]
for i in range(len(texts)):
    polarity = binstars[i]
    if neg_pos_counts[polarity] < limit:
        balanced_texts.append(texts[i])
        balanced_labels.append(polarity)
        neg_pos_counts[polarity] += 1

This gets 100 000 positive and 100 000 negative reviews. Feel free to use a higher or lower number, depending on your time constraints. For our dataset, we’ll need a couple of hours to train each of two different neural network models below. If you use less data, you’ll probably get significantly worse accuracy, as neural networks usually need a lot of data to train well. More data will result in longer training times for our neural network.

We can verify that our new dataset is balanced by using a Python Counter. In a new cell, run:

print(Counter(balanced_labels))
# >>> Counter({0: 100000, 1: 100000})
Tokenizing the texts

Machines understand numbers better than words. In order to train our neural network with our texts, we first need to split each text into words and represent each word by a number. Luckily, Keras has a preprocessing module that can handle all of this for us.

First, we’ll have a look at how Keras’ tokenization and sequence padding works on some toy data, in order to work out what’s going on under the hood. Then, we’ll apply the tokenization to our Yelp reviews.

Keras represents each word as a number, with the most common word in a given dataset represented as 1, the second most common as 2, and so on. This is useful because we often want to ignore rare words, as the neural network usually cannot learn much from these, and they only add to the processing time. If our data is tokenized with the more common words having lower numbers, we can easily train on only the N most common words in our dataset, and adjust N as necessary (for larger datasets, we would want a larger N, as even comparatively rare words will appear often enough to be useful).

Tokenization in Keras is a two step process. First, we need to calculate the word frequencies for our dataset (to find the most common words and assign these low numbers). Then we can transform our text into numerical tokens. The calculation of the word frequencies is referred to as ‘fitting’ the tokenizer, and Keras calls the numerical representations of our texts ‘sequences’.

Run the following code in a new cell:

tokenizer = Tokenizer(num_words=5)
toytexts = ["Is is a common word", "So is the", "the is common", "discombobulation is not common"]
tokenizer.fit_on_texts(toytexts)
sequences = tokenizer.texts_to_sequences(toytexts)
  • In line one, we create a tokenizer and say that it should ignore all except the five most-common words (in practice, we’ll use a much higher number).
  • In line three, we tell the tokenizer to calculate the frequency of each word in our toy dataset.
  • In line four, we convert all of our texts to lists of integers.

We can have a look at the sequences to see how the tokenization works:

print(sequences)
# >>> [[1, 1, 4, 2], [1, 3], [3, 1, 2], [1, 2]]

We can see that each text is represented by a list of integers. The first text is [1, 1, 4, 2]. By looking at the other sequences, we can infer that 1 represents the word “is”, 4 represents “a”, and 2 represents “common”. We can take a look at the tokenizer’s word_index, which stores the word-to-token mapping, to confirm this:

tokenizer.word_index
which outputs:

{'a': 4,
 'common': 2,
 'discombobulation': 7,
 'is': 1,
 'not': 8,
 'so': 6,
 'the': 3,
 'word': 5}

Rare words, such as “discombobulation”, did not make the cut of the five most common words, and are therefore omitted from the sequences. You can see the last text is represented only by [1, 2] even though it originally contained four words, because two of its words are not among the most common ones.

Finally, we’ll want to “pad” our sequences. Our neural network can train more efficiently if all of the training examples are the same size, so we want each of our texts to contain the same number of words. Keras has the pad_sequences function to do this, which will pad with leading zeros to make all the texts the same length as the longest one:

padded_sequences = pad_sequences(sequences)
print(padded_sequences)

Which outputs:

array([[1, 1, 4, 2],
       [0, 0, 1, 3],
       [0, 3, 1, 2],
       [0, 0, 1, 2]], dtype=int32)

The last text has now been transformed from [1, 2] to [0, 0, 1, 2] in order to make it as long as the longest text (the first one).

Now that we’ve seen how tokenization works, we can create the real tokenized sequences from the Yelp reviews. Run the following code in a new cell.

tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(balanced_texts)
sequences = tokenizer.texts_to_sequences(balanced_texts)
data = pad_sequences(sequences, maxlen=300)

This might take a while to run. Here, we use the most common 20000 words instead of 5. The only other difference is that we pass maxlen=300 when we pad the sequences. This means that as well as padding the very short texts with zeros, we’ll also truncate the very long ones. All of our texts will then be represented by 300 numbers.
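To see what pad_sequences is doing with maxlen, here is a simplified pure-Python stand-in (not the Keras implementation, which by default also pads and truncates at the front of each sequence):

```python
def pad_sequences_sketch(sequences, maxlen):
    """Simplified stand-in for Keras's pad_sequences: keep only the last
    `maxlen` tokens of long texts, and left-pad short texts with zeros."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncate long sequences from the front
        padded.append([0] * (maxlen - len(seq)) + seq)  # left-pad with zeros
    return padded

print(pad_sequences_sketch([[1, 2], [1, 2, 3, 4, 5]], maxlen=3))
# [[0, 1, 2], [3, 4, 5]]
```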

Building a neural network

There are different ways of building a neural network. One of the more complicated architectures, which is known to perform very well on text data, is the Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). RNNs are designed to learn from sequences of data, where there is some kind of time dependency. For example, they are used for time-series analysis, where each data point has some relation to those immediately before and after. By extension, they work very well for language data, where each word is related to those before and after it in a sentence.

The maths behind RNNs gets a bit hairy, and even more so when we add the concept of LSTMs, which allow the neural network to pay more attention to certain parts of a sequence and to largely ignore words which aren’t as useful. In spite of the internal complications, with Keras we can set up one of these networks in a few lines of code[1].

model = Sequential()
model.add(Embedding(20000, 128, input_length=300))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  • In line one, we create an empty Sequential model. We’ll use this to add several “layers” to our network. Each layer will do something slightly different in our case, so we’ll explain these separately below.
  • In line two, we add an Embedding layer. This layer lets the network expand each token to a larger vector, allowing the network to represent words in a meaningful way. We pass 20000 as the first argument, which is the size of our vocabulary (remember, we told the tokenizer to only use the 20 000 most common words earlier), and 128 as the second, which means that each token can be expanded to a vector of size 128. We give it an input_length of 300, which is the length of each of our sequences.
  • In line three, we add an LSTM layer. The first argument is 128, which matches the size of our word embeddings (the second argument from the Embedding layer). We add a 20% chance of dropout with the next two arguments. Dropout is a slightly counter-intuitive concept: during each training update, we randomly ignore 20% of the layer’s connections. This makes it more difficult for the neural network to simply memorise patterns, which results in a more robust network, as the rules the network learns are more generalisable.
  • In line four, we add a Dense layer. This is the simplest kind of Neural Network layer, where every neuron is connected to every input. This layer has an output size of 1, and thanks to the sigmoid activation it outputs a value between 0 and 1, which we can read as the predicted probability that a review is positive. We will train the network to output values close to 1 for positive reviews and close to 0 for negative ones.
  • In line five, we compile the model. This prepares the model to be run on the backend graph library (in our case, TensorFlow). We use loss=’binary_crossentropy’ because we only have two classes (1 and 0). We use the adam optimizer, which is a relatively modern learning strategy that works well in a number of different scenarios, and we specify that we are interested in the accuracy metric (how many positive/negative predictions our neural network gets correct).
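As a sanity check on the model above, we can compute each layer’s parameter count by hand using the standard formulas. Assuming the layer sizes used above, these numbers should agree with what Keras’s model.summary() reports:

```python
vocab_size, embed_dim, lstm_units = 20000, 128, 128

# Embedding: one 128-dimensional vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: one weight per LSTM unit, plus a single bias
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)
# 2560000 131584 129
```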
Training our neural network

Now we are ready to train or “fit” the network. This can be done in a single line of code. As this is where the actual learning takes place, you’ll need a significant amount of time to run this step (approximately two hours on the Amazon GPU machine).

model.fit(data, np.array(balanced_labels), validation_split=0.5, epochs=3)

Here, we pass in our padded, tokenized texts as the first argument, and the labels as the second argument. We use validation_split=0.5 to tell our neural network that it should take half of the data to learn from, and that it should test itself on the other half. This means it will take half the reviews, along with their labels, and try to find patterns in the tokens that represent positive or negative labels. It will then try to predict the answers for the other half, without looking at the labels, and compare these predictions to the real labels, to see how good the patterns it learned are.

It’s important to have validation data when training a neural network. The network is powerful enough to find patterns even in random noise, so by seeing that it’s able to get the correct answers on ‘new’ data (data that it didn’t look at during the training stage), we can verify that the patterns it is learning are actually useful to us, and not overly-specific to the data we trained it on.

The last argument we pass, epochs=3, means that the neural network should run through all of the available training data three times.

You should (slowly) see output similar to the below. The lower the loss is, the better for us, as this number indicates the errors that the network is making while learning. The acc number should be high, as this represents the accuracy of the training data. After each epoch completes, you’ll also see the val_loss and val_acc numbers appear, which represent the loss and accuracy on the held-out validation data.

Train on 100000 samples, validate on 100000 samples

Epoch 1/3

100000/100000 [==============================] - 1780s - loss: 0.3974 - acc: 0.8237 - val_loss: 0.4305 - val_acc: 0.8158

Epoch 2/3

100000/100000 [==============================] - 1764s - loss: 0.2953 - acc: 0.8758 - val_loss: 0.3167 - val_acc: 0.8745

Epoch 3/3

100000/100000 [==============================] - 1754s - loss: 0.2305 - acc: 0.9057 - val_loss: 0.3296 - val_acc: 0.8589

We can see that for the first two epochs, the acc and val_acc numbers are similar. This is good. It means that the rules that the network learns on the training data generalize well to the unseen validation data. After two epochs, our network can predict whether a review is positive or negative correctly 87.5 percent of the time!

After the third epoch, the network is starting to memorize the training examples, with rules that are too specific. We can see that it gets 90% accuracy on the training data, but only 85.9% on the held-out validation data. This means that our network has “overfitted”, and we’d want to retrain it (by running the “compile” and “fit” steps again) for only two epochs instead of three. (We could also have told it to stop training automatically when it started overfitting by using an “Early Stopping” callback. We don’t show the full setup here, but you can read how to implement it in the Keras documentation on callbacks.)
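The logic behind early stopping is simple enough to sketch in plain Python. This is an illustrative helper, not the actual Keras callback: stop as soon as the validation loss has failed to improve for patience epochs.

```python
def should_stop(val_losses, patience=1):
    """Return True once the last `patience` validation losses are all
    no better than the best loss seen before them."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return all(loss >= best_so_far for loss in val_losses[-patience:])

# With our val_loss numbers (0.4305, 0.3167, 0.3296), training would
# stop after the third epoch, when the validation loss got worse:
print(should_stop([0.4305, 0.3167]))          # False
print(should_stop([0.4305, 0.3167, 0.3296]))  # True
```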

Adding more layers

Our simple model worked well enough, but it was slow to train. One way to speed up the training time is to improve our network architecture and add a “Convolutional” layer. Convolutional Neural Networks (CNNs) come from image processing. They pass a “filter” over the data, and calculate a higher-level representation. They have been shown to work surprisingly well for text, even though they have none of the sequence processing ability of LSTMs. They are also faster, as the different filters can be calculated independently of each other. LSTMs by contrast are hard to parallelise, as each calculation depends on many previous ones.

By adding a CNN layer before the LSTM, we allow the LSTM to see sequences of chunks instead of sequences of words. For example, the CNN might learn the chunk “I loved this” as a single concept and “friendly guest house” as another concept. The LSTM stacked on top of the CNN could then see the sequence [“I loved this”, “friendly guest house”] (a “sequence” of two items) and learn whether this was positive or negative, instead of having to learn the longer and more difficult sequence of the six independent items [“I”, “loved”, “this”, “friendly”, “guest”, “house”].
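The chunking idea can be sketched as a sliding window over the tokens. This is only an illustration of the chunks a width-3 filter “sees”; the real Conv1D layer learns a weighted combination of each window rather than the words themselves:

```python
def sliding_windows(tokens, width):
    """The chunks that a width-`width` convolutional filter slides over."""
    return [tokens[i:i + width] for i in range(len(tokens) - width + 1)]

for window in sliding_windows(["I", "loved", "this", "friendly", "guest", "house"], 3):
    print(window)
# ['I', 'loved', 'this']
# ['loved', 'this', 'friendly']
# ['this', 'friendly', 'guest']
# ['friendly', 'guest', 'house']
```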

Add the definition of the new model in a new cell of the notebook:

model = Sequential()
model.add(Embedding(20000, 128, input_length=300))
model.add(Dropout(0.2))
model.add(Conv1D(64, 5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(data, np.array(balanced_labels), validation_split=0.5, epochs=3)

Here we have a slightly different arrangement of layers. We add a Dropout layer directly after the Embedding layer. Following this, we add a convolutional layer, which passes a filter over the text to learn specific chunks or windows. After this, we have a MaxPooling layer, which condenses the chunked representations by keeping only the strongest signal in each pooling window. If you’re interested in learning more about how this works (and seeing some animations which clarify the concept), look for a visual introduction to convolutional and pooling layers.

The rest of our model is the same as before. Training time should be significantly faster (about half an hour for all three epochs) and accuracy is similar. You should see output similar to the following.

Train on 100000 samples, validate on 100000 samples

Epoch 1/3

100000/100000 [==============================] - 465s - loss: 0.3309 - acc: 0.8561 - val_loss: 0.3181 - val_acc: 0.8640

Epoch 2/3

100000/100000 [==============================] - 462s - loss: 0.2325 - acc: 0.9048 - val_loss: 0.3335 - val_acc: 0.8549

Epoch 3/3

100000/100000 [==============================] - 457s - loss: 0.1666 - acc: 0.9349 - val_loss: 0.3833 - val_acc: 0.8570

Although the highest accuracy looks slightly worse, it gets there much faster. We can see that the network is already overfitting after one epoch. Adding more dropout layers after the CNN and LSTM layers might improve our network still more, but this is left as an exercise to the reader.

Comparing our results to a Support Vector Machine classifier

Neural Networks are great tools for many tasks. Sentiment analysis is quite straightforward though, and similar results can be achieved with much simpler algorithms. As a comparison, we’ll build a Support Vector Machine classifier using scikit-learn, another high-level machine learning Python library that allows us to create complex models in a few lines of code.

Run the following in a new cell:

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

t1 = datetime.now()
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=3)
classifier = LinearSVC()
Xs = vectorizer.fit_transform(balanced_texts)

print(datetime.now() - t1)

score = cross_val_score(classifier, Xs, balanced_labels, cv=2, n_jobs=-1)

print(datetime.now() - t1)
print(sum(score) / len(score))

Now, instead of converting each word to a single number and learning an Embedding layer, we use a term-frequency inverse document frequency (TF-IDF) vectorisation process. Using this vectorisation scheme, we ignore the order of the words completely, and represent each review as a large sparse matrix, with each cell representing a specific word, and how often it appears in that review. We normalize the counts by the total number of times the word appears in all of the reviews, so rare words are given a higher importance than common ones (though we ignore all words that aren’t seen in at least three different reviews.

  • Line six sets up the vectorizer. We set ngram_range to (1,2) which means we’ll consider all words on their own but also look at all pairs of words. This is useful because we don’t have a concept of word order anymore, so looking at pairs of words as single tokens allows the classifier to learn that word pairs such as “not good” are usually negative, even though “good” is positive. We also set min_df to 3, which means that we’ll ignore words that aren’t seen at least three times (in three different reviews).
  • In line seven, we set up a support vector machine classifier with a linear kernel, as this has been shown to work well for text classification tasks.
  • In line eight, we convert our reviews into vectors by calling fit_transform. Fit transform actually does two things: first it “fits” the vectorizer, similarly to how we fitted out Tokenizer for Keras, to calculate the vocabulary across all reviews. Then it “transforms” each review into a large sparse vector.
  • In line 13, we get a cross-validated score of our classifier. This means that we train the classifier twice (because we pass in cv=2). Similarly to our validation_split for Keras, we first train on half the reviews and check our score on the other half, and then vice-versa. We set n_jobs=-1 to say that the classifier should use all available CPU cores — in our case, it will only use two as scikit-learn can only do basic parallelization and run each of the two cross-validation splits on a separate core.

You should see output similar to the following.

(200000, 774090)

[ 0.86372 0.88085]
We can see that the SVM is significantly faster. We need about a minute and a half to vectorize the reviews, transforming each of our 200 000 reviews into a vector containing 774 090 features. It takes another minute and a half to train the classifier on 100 000 reviews and predict on the other 100 000.

The last line of output shows the accuracy for each of the two cross-validation runs; averaging them gives an overall accuracy of about 87%. For our task, the SVM actually performs slightly better than the neural network in terms of accuracy, and it does so in significantly less time.

However, note that we have more flexibility with the neural network, and we could probably do better if we spent more time tuning our model. Because neural networks are not as well understood as SVMs, it can be difficult to find a good model for your data. Also note that if you want to train on even more data (we were using only a small sample of the Yelp reviews), you may well find that the neural network starts outperforming the SVM. Conversely, for smaller datasets, the SVM has the advantage: try running all of the above code again with 2 000 reviews instead of 200 000 and you'll see that the neural network really struggles to find meaningful patterns from so few training examples. The SVM, on the other hand, will perform well even for much smaller datasets.
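As an aside, it can help to see what the TF-IDF vectoriser produces on a small scale. The toy corpus below is made up purely for illustration (min_df is left at its default of 1, since every word in such a tiny corpus is rare).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# three made-up "reviews" to illustrate unigram + bigram vectorisation
docs = ["not good at all", "very good food", "very bad food"]

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)

# the corpus contains 7 distinct words and 7 distinct word pairs,
# so each review becomes a sparse row with 14 possible features
print(X.shape)  # (3, 14)
```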

Packaging our models for later use

Being able to predict the sentiment of this review set is not very useful on its own — after all, we have all the labels at our disposal already. However, now that we’ve trained some models, we can easily use them on new, unlabeled data. For example, you could download thousands of news reports before an election and use our model to see whether mainly positive or mainly negative things are being said about key political figures, and how that sentiment is changing over time.

When saving our models for later use, it's important not to forget the tokenization. When we converted our raw texts into tokens, we fitted the tokenizer first. In the case of the neural network, the tokenizer learned which words were the most frequent in our dataset; for our SVM, we fitted a TF-IDF vectorizer. If we want to classify more texts, we must reuse the same fitted tokenizers without refitting them: a new dataset would have a different vocabulary and word frequencies, but our models learned representations tied to the Yelp dataset's specific word-to-token mapping, so token 1 must still refer to the same word when we process new data.

For the Keras model, we can save the tokenizer and the trained model as follows (make sure that you have h5py installed with pip3 install h5py first if you’re not using the pre-configured AWS instance).

import pickle

# save the tokenizer
with open("keras_tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

# save the trained model"yelp_sentiment_model.hdf5")

If we want to predict whether some new piece of text is positive or negative, we can load our model and get a prediction with:

from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
import pickle

# load the tokenizer and the model
with open("keras_tokenizer.pickle", "rb") as f:
   tokenizer = pickle.load(f)

model = load_model("yelp_sentiment_model.hdf5")

# replace with the data you want to classify
newtexts = ["Your new data", "More new data"]

# note that we shouldn't call "fit" on the tokenizer again
sequences = tokenizer.texts_to_sequences(newtexts)
data = pad_sequences(sequences, maxlen=300)

# get predictions for each of your new texts
predictions = model.predict(data)
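Assuming the model ends in a single sigmoid unit (as is typical for binary sentiment models), each entry of predictions is a probability that the text is positive. A minimal way to turn these into labels is to threshold at 0.5 (the cut-off is an assumption); a stand-in list is used below in place of real model output.

```python
# stand-in for the array returned by model.predict():
# one probability-of-positive per input text
predictions = [[0.93], [0.12]]

# threshold each probability at 0.5 to get a sentiment label
labels = ["positive" if p[0] >= 0.5 else "negative" for p in predictions]
print(labels)  # ['positive', 'negative']
```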

To package the SVM model, we similarly need to save both the vectoriser and the classifier. We can use scikit-learn's joblib module, which handles the large NumPy arrays inside scikit-learn models more efficiently than pickle.

from sklearn.externals import joblib

joblib.dump(vectorizer, "tfidf_vectorizer.pickle")
joblib.dump(classifier, "svm_classifier.pickle")

And to get predictions on new data, we can load our model from disk with:

from sklearn.externals import joblib

vectorizer = joblib.load("tfidf_vectorizer.pickle")
classifier = joblib.load("svm_classifier.pickle")

# replace with the data you want to classify
newtexts = ["Your new data", "More new data"]

# note that we should call "transform" here instead of the "fit_transform" from earlier
Xs = vectorizer.transform(newtexts)

# get predictions for each of your new texts
predictions = classifier.predict(Xs)
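Note that LinearSVC's predict returns class labels directly rather than probabilities; if you also want a confidence score, decision_function gives the signed distance from the separating hyperplane. Here is a self-contained sketch on made-up data (the texts and labels below are illustrative, not from the Yelp set).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# tiny made-up training set: 1 = positive, 0 = negative
texts = ["great food", "terrible food", "great service", "terrible service"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
classifier = LinearSVC()
classifier.fit(vectorizer.fit_transform(texts), labels)

new = vectorizer.transform(["great stuff", "terrible stuff"])
print(classifier.predict(new))            # class labels, e.g. [1 0]
print(classifier.decision_function(new))  # signed distances from the hyperplane
```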

We took an introductory look at using Keras for text classification and compared our results to a simpler SVM. You now know:

  • How to set up a pre-configured AWS spot instance for machine learning
  • How to preprocess raw text data for use with Keras neural networks
  • How to experiment with building your own deep learning models for text classification
  • That an SVM is still often a good choice, in spite of advances in neural networks
  • How to package your trained models, and use them in later tasks.

If you have any questions or comments, feel free to reach out to the author on Twitter or GitHub.

  1. Note that the architecture that we use for our neural networks closely follows the official Keras examples which can be found at