About the Author:

With Privacy and Data Up for Grabs, Building Secure Applications Is a Critical Skill

September 26th, 2019

Data, privacy, breaches: these topics are bandied about a lot these days, and for good reason. The damage from attacks and leaks seems to get more severe with each news story. In the aftermath, a company’s image, reputation, and even its customers’ trust (not to mention significant sources of revenue) are at stake. There’s even legislation afoot forcing companies to protect consumer data.

Now consider that today’s software applications are essentially modern-day storefronts. Businesses are expanding, which means they’re doubling down on mobile, web, or API-based applications; the volume and frequency of application production are increasing. Simultaneously, the complexity and volume of attacks on these applications are also increasing. Further, there aren’t enough people available who understand and can implement sufficient application security to handle the problem.


SQL vs NoSQL: Determining What’s Right For You

January 15th, 2018

Gone are the days of having a single SQL database to manage all of your organization’s information. In today’s data-saturated age, more storage opportunities emerge to meet these rapidly changing needs.

You may have heard the term NoSQL tossed around, but what does it mean? And what can it do for you? How can a stakeholder know when one option will be more effective over the other, and what should you choose for your business? Those are the topics we will cover in this article.

SQL Databases

The old standby of data storage since the 1970s, SQL databases store information in a relational fashion. This means the data has a relationship to other data in the database. For example, a class directory for a school might have tables for classrooms, students, teachers, and more.


SQL (or Structured Query Language) is used to describe these objects and their relations to one another. While SQL is versatile enough to create complex queries and is widely used and tested, things start to break down if you need to add more fields or a different structure down the line.

Since SQL requires predefined schemas of information, new types of information or ill-formatted data will grind the system to a halt.
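
A minimal sketch of that rigidity, using Python’s built-in sqlite3 module and a hypothetical students table from our school example: a row that matches the schema inserts cleanly, while a field the schema doesn’t define is rejected outright.

```python
import sqlite3

# In-memory SQLite database with a predefined schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, classroom TEXT)")

# Rows matching the schema insert cleanly
conn.execute("INSERT INTO students (name, classroom) VALUES (?, ?)", ("Ada", "Room 101"))

# A field the schema doesn't know about is rejected outright
try:
    conn.execute("INSERT INTO students (name, nickname) VALUES (?, ?)", ("Grace", "Gracie"))
except sqlite3.OperationalError as e:
    print("Rejected:", e)
```

Adding that nickname column for real would require an explicit schema migration, which is exactly the maintenance burden described above.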


When you start accumulating more and more data, as many companies these days are, you have to find a way to scale up. To scale a SQL database, you add more resources to the server. This is called vertical scaling.

To scale vertically, you must increase system resources such as RAM, storage space, or CPU. If you’re hosting your database on a cloud server like AWS, this can get expensive very quickly.

Use Cases

These databases are best utilized with structured data such as that from our school example. Other datasets could hold weather information, inventory management data, or stock prices.

NoSQL Databases

As the variety and type of data we produce changes, so too do the tools we use to contain that data.

NoSQL databases focus on storing collections of unstructured data. Many APIs return JSON documents that are essentially lists of key-value pairs. The structure changes over time and data is coming in rapidly, maybe even in real-time. This type of data doesn’t fit easily into a traditional relational database, but you need somewhere to easily store and access the information. So what do you do?

Enter the NoSQL database.


True to their name, NoSQL databases eschew the SQL language and format in favor of more flexible storage. Data is stored in a more amorphous fashion that allows for greater scalability and real-time data ingestion.

There are four main types of data held by NoSQL databases:

  • Key-value pairs
  • Documents
  • Graphs
  • Wide columns

The benefit of these different types is that you don’t have to have a defined schema or format before starting to ingest data. This cuts down on maintenance and upgrades down the road to add new types or structure.
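
As a sketch of that flexibility (plain Python dicts standing in here for documents in a store like MongoDB), records with different shapes can live side by side, and queries simply tolerate missing fields:

```python
# Documents with different shapes in the same collection -- no migration
# needed when a new field like "nickname" appears.
collection = []
collection.append({"name": "Ada", "classroom": "Room 101"})
collection.append({"name": "Grace", "classroom": "Room 102", "nickname": "Gracie"})

# Queries just have to tolerate missing fields
for doc in collection:
    print(doc["name"], doc.get("nickname", "(no nickname)"))
```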


NoSQL databases scale horizontally instead of vertically. This is done by a process called “sharding”, in which the database’s storage is split across multiple servers. While sharding is possible with SQL databases, it takes a lot more work and maintenance; on many NoSQL stores it comes enabled by default.
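
A toy illustration of the idea behind sharding: each record’s key determines which server (shard) stores it. Real stores use more sophisticated schemes (consistent hashing, range partitioning), but the shape is the same.

```python
# Minimal sketch of hash-based sharding over four hypothetical servers
NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # A stable hash keeps a given key on the same shard across runs
    return sum(key.encode()) % NUM_SHARDS

shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(user_id)].append(user_id)

print(shards)  # each user lands on exactly one of the four shards
```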

Use Cases

NoSQL databases work well for lots of varied, unstructured data. If you need to hold incoming sensor data or API responses, as two examples, NoSQL would be most effective. They can also be ideal for very, very large datasets (tens or hundreds of terabytes or more): while there is a practical upper limit on how much you can increase one system’s resources, you can keep adding machines.

The Big Question

So when should you stick with a relational database versus trying out a NoSQL solution? First, ask yourself how your data is structured. If it is in a fairly 2-dimensional (flat) format and has strong relations with other data in your dataset, consider a SQL database.

If you’re dealing with variable data that changes in format or a key-value store like JSON or XML, then give a NoSQL solution a try.

Here are some other basic criteria you want to look at when evaluating a new data storage solution:

  • structure of your data
  • the volume of your data
  • whether you anticipate this dataset to grow significantly in the future

In fact, the writer at TheHFTGuy has put together this handy flowchart for you to determine what kind of database best fits your data needs.

database flowchart

There are many different options out there today, but you now know the major types of data stores and where to apply them.


ETL Management with Luigi Data Pipelines

October 15th, 2017

As a data engineer, you’re often dealing with large amounts of data coming from various sources and have to make sense of them. Extracting, Transforming, and Loading (ETL) data to get it where it needs to go is part of your job, and it can be a tough one when there are so many moving parts.

Fortunately, Luigi is here to help. An open source project developed by Spotify, Luigi helps you build complex pipelines of batch jobs. It has been used to automate Spotify’s data processing (billions of log messages and terabytes of data), as well as in other well-known companies such as Foursquare, Stripe, Asana, and Red Hat.

Luigi handles dependency resolution, workflow management, and helps you recover from failures gracefully. It can be integrated into the services you already use, such as running a Spark job, dumping a table from a database, or running a Python snippet.

In this example, you’ll learn how to ingest log files, run some transformations and calculations, then store the results.

Let’s get started.


For this tutorial, we’re going to be using Python 3.6.3, Luigi, and this fake log generator. Although you could use your own production log files, not everyone has access to those. So we have a tool that can produce dummy data for us while we’re testing. Then you can try it out on the real thing.

To start, make sure you have Python 3, pip, and virtualenv installed. Then we’ll set up our environment and install the luigi package:

$ mkdir luigidemo
$ cd luigidemo
$ virtualenv luigi
$ source luigi/bin/activate
$ pip install luigi


This sets up an isolated Python environment and installs the necessary dependencies. Next, we’ll need to obtain some test data to use in our data pipeline.

Generate test data

Since this example deals with log files, we can use this Fake Apache Log Generator to generate dummy data easily. If you have your own files you would like to use, that is fine, but we will be using mocked data for the purposes of this tutorial.

To use the log generator, clone the repository into your project directory. Then navigate into the Fake-Apache-Log-Generator folder and use pip to install the dependencies:

$ pip install -r requirements.txt

After that completes, it’s pretty simple to generate a log file. You can find detailed options in the documentation, but let’s generate a 1000-line .log file:

$ python apache-fake-log-gen.py -n 1000 -o LOG

This creates a file called access_log_{timestamp}.log.

Now we can set up a workflow in Luigi to analyze this file.

Luigi Tasks

You can think of building a Luigi workflow as similar to building a Makefile. There are a series of Tasks and dependencies that chain together to create your workflow.

Let’s say we want to figure out how many unique IPs have visited our website. There are a couple of steps we need to go through to do this:

  • Read the log file
  • Pull out the IP address from each line
  • Decide how many unique IPs there are (de-duplicate them)
  • Persist that count somewhere
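
Stripped of Luigi, the steps above are just a few lines of plain Python (the sample log lines here are made up):

```python
def count_unique_ips(lines):
    # the IP is the first whitespace-separated field of each log line
    ips = {line.split()[0] for line in lines}  # a set de-duplicates for us
    return len(ips)

sample = ['10.0.0.1 - - [...] "GET / HTTP/1.1" 200',
          '10.0.0.2 - - [...] "GET /a HTTP/1.1" 200',
          '10.0.0.1 - - [...] "GET /b HTTP/1.1" 404']
print(count_unique_ips(sample))  # 3 lines, 2 unique IPs -> 2
```

Packaging each step as a Luigi Task buys us dependency tracking and failure recovery on top of this core logic.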

Each of those can be packaged as a Luigi Task. Each Task has three component parts:

  • run() — This contains the logic of your Task. Whether you’re submitting a Spark job, running a Python script, or querying a database, this function contains the actual execution logic. We break up our process, as above, into small chunks so we can run them in a modular fashion.
  • output() — This defines where the results will go. You can write to a file, update HDFS, or add new records in a database. Luigi provides multiple output targets for you to use, or you can create your own.
  • requires() — This is where the magic happens. You can define dependencies of your task here. For instance, you may need to make sure a certain dataset is updated, or wait for another task to finish first. In our example, we need to de-duplicate the IPs before we can get an accurate count.

Our first Task just needs to read the file. Since it is the first task in our pipeline, we don’t need a requires() function or a run() function. All this first task does is send the file along for processing. For the purposes of this tutorial, we’ll write the results to a local file. You can find more information on how to connect to databases, HDFS, S3, and more in the documentation.

To build a Task in Luigi, it’s just a simple Python class:

import luigi

# this class extends the Luigi base class ExternalTask
# because we’re simply pulling in an external file
# it only needs an output() function
class readLogFile(luigi.ExternalTask):

    def output(self):
        return luigi.LocalTarget('/path/to/file.log')


So that’s a fairly boring task. All it does is grab the file and send it along to the next Task for processing. Let’s write that next Task now, which will pull out the IPs and put them in a list, throwing out any duplicates:

class grabIPs(luigi.Task):  # standard Luigi Task class

    def requires(self):
        # we need to read the log file before we can process it
        return readLogFile()

    def run(self):
        ips = []

        # use the file passed from the previous task
        with self.input().open() as f:
            for line in f:
                # a quick glance at the file tells us the first
                # element is the IP. We split on whitespace and take
                # the first element
                ip = line.split()[0]
                # if we haven’t seen this ip yet, add it
                if ip not in ips:
                    ips.append(ip)

        # count how many unique IPs there are
        num_ips = len(ips)

        # write the results (this is where you could use hdfs/db/etc)
        with self.output().open('w') as f:
            f.write(str(num_ips))

    def output(self):
        # the results are written to numips.txt
        return luigi.LocalTarget('numips.txt')


Even though this Task is a little more complicated, you can still see that it’s built on three component parts: requires, run, and output. It pulls in the data, splits it up, then adds the IPs to a list. Then it counts the number of elements in that list and writes that to a file.

If you try and run the program now, nothing will happen. That’s because we haven’t told Luigi to actually start running these tasks yet. We do that by calling luigi.run (you can also run Luigi from the command line):

if __name__ == '__main__':

   luigi.run(["--local-scheduler"], main_task_cls=grabIPs)


The run function takes two arguments: a list of options to pass to Luigi and the task you want to start on. While you’re testing, it helps to pass the --local-scheduler option; this allows the processes to run locally. When you’re ready to move things into production, you’ll use the Central Scheduler. It provides a web interface for managing your workflows and dependency graphs.

If you run your Python file at this point, the Luigi process will kick off and you’ll see the progress of each task as it moves through the pipeline. At the end, you’ll see which tasks succeeded, which failed, and the status of the run overall (Luigi gives it a :) or :( face).

Check out numips.txt to see the result of this run. In my example, it returned the number 1000. That means that 1000 unique IP addresses visited the site in this dummy log file. Try it on one of your logs and see what you get!

Extending your ETL

While this is a fairly simple example, there are a lot of ways that you could easily extend this. You could:

  • Split out dates and times as well, and filter the logs to see how many visitors there were on a certain day
  • Geolocate the IPs to find out where your visitors are coming from
  • See how many errors (404s) your visitors are encountering
  • Put the data in a more robust data store
  • Build scheduling so that you can run the workflow periodically on the updated logs

Try some of these options out and let us know if you have any questions! Although this example is limited in scope, Luigi is robust enough to handle Spotify-scale datasets, so let your imagination go wild.


Machine and Deep Learning in Python: What You Need to Know

July 19th, 2017

Big Data. Deep Learning. Data Science. Artificial Intelligence.

It seems like a day doesn’t go by when we’re not bombarded with these buzzwords. But what’s with all the hype? And how can you use it in your own business?

What is Machine Learning?

At its simplest level, machine learning is simply the process of optimizing mathematical equations. There are several different kinds of machine learning, all with a different purpose. Two of the most popular forms of machine learning are supervised and unsupervised learning. We’ll go through how they work below:

  • Supervised Learning: supervised learning uses labeled examples of known data to predict future outcomes. For example, if you kept track of weather conditions and whether your favorite sports team was playing that day, you could learn from those patterns over time and predict if the game would be rained out or not based on the weather forecast. The “supervised” part means that you have to supply the system with “answers” that you already know. That is, you already knew when your team did and didn’t play, and you know what the weather was on those days. The computer reads through this information iteratively and uses it to form patterns and make predictions. Other applications of supervised learning include predicting whether people will default on their loan payments.
  • Unsupervised Learning: unsupervised learning refers to a type of machine learning where you don’t necessarily know what the “answer” is you’re looking for. Unlike our “will my sports game get rained out” example, unsupervised learning is more suitable for exploratory or clustering work. Clustering groups things that are similar or connected, so you could feed it a group of Twitter posts and have it tell you what people are most commonly talking about. Some algorithms that apply unsupervised learning are K-Means and LDA.
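
As a quick taste of unsupervised learning, here is a minimal K-Means sketch with scikit-learn on a handful of made-up 2-D points. Note that no labels are supplied; the algorithm finds the two groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups
points = np.array([[1, 1], [1.5, 2], [1, 0.5],
                   [8, 8], [8.5, 9], [9, 8]])

# Ask K-Means to find 2 clusters -- no labels supplied
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)  # the first three points share one cluster id, the last three the other
```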
What is Deep Learning?

Deep learning, despite the hype, is simply the application of multi-layered artificial neural networks to machine learning problems. It’s called “deep” learning because the neural networks contain many layers instead of just one. For example, a deep learning algorithm that wanted to classify faces in photos would first learn to classify the shape of eyes, then noses, then mouths, and then the spatial relationship of them all together, instead of trying to recognize the whole face at once. Breaking the problem down into component parts yields a better understanding.

Deep learning has been in the news a lot lately. You may remember the trippy image generation project called DeepDream that Google released in 2015. Also noteworthy was AlphaGo’s triumph over a professional Go player, also using deep learning. Before this, a computer had never been able to beat a human at a game of Go, so this marked a new milestone in artificial intelligence.


credit: jessica mullen from austin, tx – Deep Dreamscope

Machine Learning in Python

One of the best things about Python is the fact that there are so many libraries available. Since anyone can create a Python package and submit it to PyPI (Python Package Index), there are packages out there for just about everything you can think of. Machine and Deep Learning are no exception.

In fact, Python is one of the most popular languages for data scientists due to its ease of use and wealth of scientific packages available. Many Python developers, especially in the data space, like to use Jupyter Notebooks because it allows them to iterate and refine code and models without running the entire program each time.


scikit-learn is the frontrunner and longtime favorite of data scientists. It’s been around the longest and has whole books devoted to the topic. If you want a wealth of machine learning algorithms and customizations, scikit-learn likely has what you need. However, if you’re looking for something that’s more heavily stats-focused, you may want to go with StatsModels instead.


Caffe is a fast, open framework for deep learning with a Python interface. Developed by an AI research team at UC Berkeley, it performs well in image processing scenarios and is used by large companies such as Facebook, Microsoft, Pinterest, and more.


TensorFlow made waves in the machine learning community as Google’s open source deep learning offering. It currently stands as the most prominent deep learning framework in the space, with many developers participating. TensorFlow works well with object recognition and speech recognition tasks.


Theano is a Python library for fast numerical computation. Many developers use it on GPUs for data-intensive operations. It also has symbolic computation capabilities so you can calculate derivatives for functions with many variables. In fact, with GPU optimization, it can even outperform C. If you’re crunching some serious data, Theano could be your go-to.

Who’s using Machine Learning?

A better question would be: who’s not using machine learning in their business? And if not, why not?

The possibilities of data analytics at scale have been realized across industries, from healthcare to finance to oil and gas. Here are some notable firms betting on machine learning:

  • Google — Google uses machine learning across their company, from Google Translate to helping you categorize your photos to self-driving car research. Teams at Google also develop TensorFlow, a leading deep learning framework.
  • Facebook — Facebook makes heavy use of machine learning in the ad space. By looking at your interests, pages you visit, and things you ‘like’, Facebook gets a very good idea of who you are as a person and what kind of things you may be interested in buying. It uses this information to show you advertisements and posts in your newsfeed. Facebook also uses machine learning to recognize faces in your photos and help you tag them.
  • Netflix — Netflix uses the movies you watch, rate, and search for to create customized recommendations. One machine learning algorithm for product recommendations that both Netflix and Amazon employ is called collaborative filtering. Netflix also famously hosted a contest called the Netflix Prize that awarded people who could develop new and better recommendation systems.
Pros and Cons of Machine Learning in Python
  • Python is a general-purpose language, which means it can be used in a variety of scenarios and has a wealth of packages available for just about any purpose.
  • Python is easy to learn and read.
  • Developers can use Jupyter Notebooks to iteratively build their code and test it as they go.
  • There’s no industry-standard IDE for Python like there is for R. Still, many good options exist.
  • In most cases, Python’s performance cannot compare with C/C++.
  • The wealth of options in Python can be both a pro and a con. There are lots of choices, but it may take more digging and research to find what you need. In addition, setting up separate packages can be complicated if you’re a novice programmer.

The era of Big Data is here, and it’s not going away. You have learned a little more about the different types of machine learning, deep learning, and the major technologies that companies are using. Next time you have a data-intensive problem to solve, look no further than Python!


What Does AWS Lambda Do and Why Use It?

April 6th, 2017

Setting up and running a backend or server can be a very resource-intensive part of the development process for many companies and products. AWS Lambda is a serverless computing service that gives companies the computing power they would usually get from a server, but for much less time and money. This lets developers focus on writing great code and building great products. This short piece will give an introduction to AWS Lambda.


AWS Lambda is considered to be both serverless and event-driven. An event-driven platform is one that traditionally depends on input from the user to determine which process will be executed next. The input might be a mouse click, a key press, or any other sensor input. The program contains a primary loop that ‘listens’ for any event that might take place and a separate function that calls other executable code when triggered.
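
The shape of an event-driven program can be sketched in a few lines of Python: handlers are registered per event type, and a loop dispatches whatever arrives. With Lambda, AWS runs the loop and you supply only the handler. (The event names here are illustrative.)

```python
# Toy event dispatcher: register a handler per event type
handlers = {}

def on(event_type, fn):
    handlers[event_type] = fn

on("s3:ObjectCreated", lambda e: print("new object:", e["key"]))

# The "primary loop": dispatch each incoming event to its handler
incoming = [{"type": "s3:ObjectCreated", "key": "photo.jpg"}]
for event in incoming:
    handlers[event["type"]](event)
```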

Lambda is priced in 100-millisecond increments. This can decrease costs if only a short amount of computing time is needed (vs. paying for a server by the hour). While the EC2 service targets large-scale operations, AWS Lambda serves applications that are usually smaller and more dynamic.
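
To make that pricing model concrete, here is a back-of-the-envelope calculation. The rates below are illustrative examples, not current AWS prices, and the rounding up to 100 ms mirrors the billing increments described above.

```python
# Illustrative cost model (example rates, not actual AWS pricing)
GB_SECOND_PRICE = 0.00001667      # example compute price per GB-second
REQUEST_PRICE = 0.20 / 1_000_000  # example price per request

def lambda_monthly_cost(invocations, duration_ms, memory_gb):
    # duration is rounded up to the nearest 100 ms increment
    billed_ms = -(-duration_ms // 100) * 100
    gb_seconds = invocations * (billed_ms / 1000) * memory_gb
    return invocations * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

# 1 million short (120 ms) invocations at 128 MB per month
print(round(lambda_monthly_cost(1_000_000, 120, 0.125), 2))  # well under a dollar
```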

Lambda can also run alongside other AWS services such as Amazon S3. It can be programmed to respond whenever there is a change in the data contained in an S3 bucket. A container is launched when the Lambda function is invoked, and it stays active for some time after the code executes, waiting for further invocations.

Websites that rely on tasks like image uploading or simple data analytics can make use of Lambda to handle simple discrete processes. Paired with Amazon API Gateway, Lambda can also be used to host backend logic for websites over HTTPS.


Traditionally, developers would set up servers and processes ahead of time to have computing power available to execute different tasks. With AWS Lambda, they can focus on their product and buy computing power only when they need it.



Predicting Yelp Stars from Reviews with scikit-learn and Python

March 14th, 2017

In this post, we’ll look at reviews from the Yelp Dataset Challenge. We’ll train a machine learning system to predict the star-rating of a review based only on its text. For example, if the text says “Everything was great! Best stay ever!!” we would expect a 5-star rating. If the text says “Worst stay of my life. Avoid at all costs”, we would expect a 1-star rating. Instead of writing a series of rules to work out whether some text is positive or negative, we can train a machine learning classifier to “learn” the difference between positive and negative reviews by giving it labelled examples.

This post follows closely from the previous one: Analyzing 4 Million Yelp Reviews with Python on AWS. You’re strongly encouraged to go through that one first. In particular, we will not be showing how to set up an EC2 Spot instance with adequate memory and processing power to handle this large dataset, but the same setup was used to run the analysis for this post.


Introduction and Overview

This post will show how to implement and report on a (supervised) machine-learning based system of the Yelp reviews. Specifically, this post will explain how to use the popular Python library scikit-learn to:

  • convert text data into TF-IDF vectors
  • split the data into a training and test set
  • classify the text data using a LinearSVM
  • evaluate our classifier using precision, recall and a confusion matrix

In addition, this post will explain the terms TF-IDF, SVM, precision, recall, and confusion matrix.

In order to follow along, you should have at least basic Python knowledge. As the dataset we’re working with is relatively large, you’ll need a machine with at least 32GB of RAM, and preferably more. The previous post demonstrated how to set up an EC2 Spot instance for data processing, as well as how to produce visualisations of the same dataset. You’ll also need to install scikit-learn on the machine you’re using.

Loading and Balancing the Data

To load the data from disk into memory, run the following code. You’ll need to have downloaded the Yelp dataset and untarred it in order to read the Yelp reviews JSON file.

import json

# read the data from disk and split into lines
# we use .strip() to remove the final (empty) line
with open("yelp_academic_dataset_review.json") as f:
    reviews = f.read().strip().split("\n")

# each line of the file is a separate JSON object
reviews = [json.loads(review) for review in reviews] 

# we're interested in the text of each review 
# and the stars rating, so we load these into 
# separate lists
texts = [review['text'] for review in reviews]
stars = [review['stars'] for review in reviews]

Even on a fast machine, this code could take a couple of minutes to run.

We now have two arrays of data: the text of each review and the respective star-rating. Our task is to train a system that can predict the star-rating from looking at only the review text. This is a difficult task since different people have different standards, and as a result, two different people may write a similar review with different star ratings. For example, user Bob might write “Had an OK time. Nothing to complain about” and award 4 stars, while user Tom could write the same review and award 5 stars. This makes it difficult for our system to accurately predict the rating from the text alone.

Another complication is that our dataset is unbalanced. We have more examples of texts that typically have a 5-star rating than texts that typically have a 2-star rating. Because of the probabilistic models at the base of most machine learning classifiers, we’ll get less biased predictions if we train the system on balanced data. This means that ideally we should have the same number of examples of each review type.

In machine learning, it’s common to separate our data into features and labels. In our case, the review texts (the input data) will be converted into features and the star ratings (what we are trying to predict) are the labels. You’ll often see these two categories referred to as X and Y respectively. Adding the following method to a cell will allow us to balance a dataset by removing over-represented samples from the two lists.

from collections import Counter

def balance_classes(xs, ys):
    """Undersample xs, ys to balance classes."""
    freqs = Counter(ys)

    # the least common class is the maximum number we want for all classes
    max_allowable = freqs.most_common()[-1][1]
    num_added = {clss: 0 for clss in freqs.keys()}
    new_ys = []
    new_xs = []
    for i, y in enumerate(ys):
        if num_added[y] < max_allowable:
            num_added[y] += 1
            new_xs.append(xs[i])
            new_ys.append(y)
    return new_xs, new_ys

Now we can create a balanced dataset of reviews and stars by running the following code (remember that now our texts are x and the stars are y).

balanced_x, balanced_y = balance_classes(texts, stars)
print(Counter(stars))
print(Counter(balanced_y))

>>>Counter({5: 1704200, 4: 1032654, 1: 540377, 3: 517369, 2: 358550})
>>>Counter({1: 358550, 2: 358550, 3: 358550, 4: 358550, 5: 358550})

You can see above that in the original distribution, we had 358,550 2-star reviews and 1.7 million 5-star reviews. After balancing, we have 358,550 of each class of review. We’re now ready to prepare our data for classification.

Vectorizing our Text Data

Computers deal with numbers much better than they do with text, so we need a meaningful way to convert all the text data into matrices of numbers. A straightforward (and oft-used) method for doing this is to count how often words appear in a piece of text and represent each text with an array of word-frequencies. Therefore the short text “the dog jumps over the dog” could be represented by the following array:

[2, 0, 0, 0, ..., 1, 0, 0, 0, ..., 2, 0, 0, 0, ..., 1, 0, 0, 0, ...]

The array would be quite large, containing one element for every possible word. We would store a lookup table separately, recording that (for example) the 0th element of each array represents the word “dog”. Because the word dog occurs twice in our text, we have a 2 in this position. Most of the words do not appear in our text, so most elements would contain 0. We also have a 1 to represent jumps, another 1 for over and another 2 for the.
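
That bag-of-words idea fits in a few lines of plain Python. Here the lookup table is just the sorted vocabulary of our one example sentence, so the array is short enough to read in full:

```python
from collections import Counter

text = "the dog jumps over the dog"
vocab = sorted(set(text.split()))   # lookup table: position -> word
counts = Counter(text.split())
vector = [counts[word] for word in vocab]
print(vocab)   # ['dog', 'jumps', 'over', 'the']
print(vector)  # [2, 1, 1, 2]
```

With a real corpus the vocabulary spans every word seen anywhere, so each text’s vector is mostly zeros, as described above.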

A slightly more sophisticated approach would be to use Term Frequency Inverse Document Frequency (TF-IDF) vectors. This approach comes from the idea that common words, such as the, aren’t very important, while less common words, such as Namibia, are more important. TF-IDF therefore normalises the count of each word in each text by the number of times that word occurs in all of the texts. If a word occurs in nearly all of the texts, we deem it to be less significant. If it appears in only a few texts, we regard it as more important.

The last thing that you need to know about text representation is the concept of n-grams. Words often mean very different things when we combine them in different ways. We expect our learning algorithm to learn that a review containing the word bad is likely to be negative, while one containing the word great is likely to be positive. However, reviews containing phrases such as “… and then they gave us a full refund. Not bad!” or “The food was not great” will trip up our system if it only considers words individually.

When we break a text into n-grams, we consider several words grouped together to be a single word. “The food was not great” would be represented using bi-grams as (the food, food was, was not, not great), and this would allow our system to learn that not great is a typically negative statement because it appears in many negative reviews.

Using progressively longer combinations of words allows our system to learn fine-grained meanings, but at the cost of processing power and data-scarcity (there are many three-word phrases that might only appear once, and therefore are not much good for learning general rules). For our analysis, we’ll stick with single words (also called unigrams) and bigrams (two words at a time).
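
Generating n-grams by hand is straightforward, which makes the idea easy to verify before handing the job to scikit-learn:

```python
def ngrams(text, n):
    # slide a window of n words across the text
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("The food was not great", 2))
# ['The food', 'food was', 'was not', 'not great']
```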

Luckily scikit-learn implements all of this for us: the TF-IDF algorithm along with n-grams and tokenization (splitting the text into individual words). To turn all of our reviews into vectors, run the following code (which took roughly 12 minutes to complete on an r4.4xlarge EC2 instance):

from sklearn.feature_extraction.text import TfidfVectorizer
from datetime import datetime

# This vectorizer breaks text into single words and bi-grams
# and then calculates the TF-IDF representation
vectorizer = TfidfVectorizer(ngram_range=(1,2))
t1 = datetime.now()

# the 'fit' builds up the vocabulary from all the reviews
# while the 'transform' step turns each individual text into
# a matrix of numbers.
vectors = vectorizer.fit_transform(balanced_x)
print(datetime.now() - t1)
Creating a Train/Test Split

You can find patterns in any random noise, such as finding shapes in clouds. Machine learning algorithms are all about finding patterns, and they often find patterns that aren’t meaningful to us. In order to prove that the system is actually learning what we think it is, we’ll “train” it on one part of our data and then get it to predict the labels (star ratings) on a part of the data it didn’t see during training. If it does this with high accuracy (if it can predict the ratings of reviews it hasn’t seen during the training phase), then we’ll know the system has learned some general principles rather than just memorizing results for each specific review.

We have two arrays of data–the reviews and the ratings. Now we’ll want four arrays–features and labels for training, and the same for testing. There is a train_test_split function in scikit-learn that does exactly this. Run the following code in a new cell:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(vectors, balanced_y, test_size=0.33, random_state=42)

We now have a third of our data in X_test and y_test. We’ll teach our system using two-thirds of the data (X_train and y_train), and then see how well it does by comparing its predictions for the reviews in X_test with the real ratings in y_test.

Fitting a Classifier and Making Predictions

The classifier we’ll use is a Linear Support Vector Machine (SVM), which has been shown to perform well on several text classification tasks. We’ll skip over the mathematics and theory behind SVMs, but essentially the SVM will try to find a way to separate the different classes of our data. Remember that we have vectorized our text, so each review is now represented as a set of coordinates in a high-dimensional space. During training, the SVM will try to find hyperplanes that separate our training examples. When we feed it the test data (minus the matching labels), it will use the boundaries it learned during training to predict the rating of each test review. You can find a more in-depth overview of SVMs on Wikipedia.
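
To make the idea of a separating hyperplane concrete, here is a toy two-dimensional sketch with made-up weights (a real LinearSVC learns one such weight vector per class, in a space with as many dimensions as our TF-IDF vocabulary):

```python
# Hypothetical weights and bias defining a line (a 2-D "hyperplane").
# A linear classifier labels a point by which side of the line it falls on.
w = [1.0, -2.0]
b = 0.5

def predict(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if score >= 0 else "negative"

print(predict([3.0, 1.0]))  # score = 3.0 - 2.0 + 0.5 = 1.5  -> positive
print(predict([0.0, 2.0]))  # score = 0.0 - 4.0 + 0.5 = -3.5 -> negative
```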

To create a Linear SVM using scikit-learn, we need to import LinearSVC and call .fit() on it, passing in our training instances and labels (X_train and y_train). Add the following code to a new cell and run it:

from sklearn.svm import LinearSVC

# initialise the SVM classifier
classifier = LinearSVC()

# train the classifier
t1 = datetime.now()
classifier.fit(X_train, y_train)
print(datetime.now() - t1)

Fitting the classifier is faster than vectorizing the text (~ 6 minutes on an r4.4xlarge instance). Once the classifier has been fitted, it can be used to make predictions. Let’s start by predicting the rating for the first ten reviews in our test set (remember that the classifier has never seen these reviews before). Run the following code in a new cell:

preds = classifier.predict(X_test)
print(list(preds[:10]))
print(y_test[:10])

>>>[4, 1, 1, 2, 5, 4, 1, 5, 5, 1]
>>>[5, 1, 1, 3, 5, 4, 1, 5, 5, 2]

(Note that the classifier is fast once it has been trained so it should only take a couple of seconds to generate predictions for the entire test set.)

The first line of the output displays the ratings our classifier predicted for the first ten reviews in our test set, and the second line shows the actual ratings of the same reviews. It’s not perfect, but the predictions are quite good. For example, the first review in our test set is a 5-star review, and our classifier thought it was a 4-star review. The classifier predicted that the fifth review was a 5-star review, which was correct. We can take a quick look at each of these reviews manually. Run the following code in a new cell:


>>> If you enjoy service by someone who is as competent as he is personable, I would recommend Corey Kaplan highly. The time he has spent here has been very productive and working with him educational and enjoyable. I hope not to need him again (though this is highly unlikely) but knowing he is there if I do is very nice. By the way, I’m not from El Centro, CA. but Scottsdale, AZ.
>>> I walked in here looking for a specific piece of furniture. I didn’t find it, but what I did find were so many things that I didn’t even know I needed! So much cool stuff here, go check it out!

We can see that although the first reviewer did leave 5 stars, he uses more moderate descriptions and his review contains some neutral phrases such as “By the way, I’m not from El Centro”, while the phrases used in the fifth review are more extremely positive (“So much cool stuff”). It’s clear that the prediction task would be difficult for a human as well!

Looking at the results of ten predictions is a nice sanity-check and can help us build our own intuitions about what the system is doing. We’ll want to see how it performs over a larger set before deciding how well it can perform the prediction task.

Evaluating our Classifier

There are a number of metrics we can use to estimate the quality of our classifier. The simplest is to see what percentage of the time it predicts the desired answer; this metric is unsurprisingly called accuracy. We can calculate the accuracy of our system by comparing the predicted ratings with the real ratings–when they match, our classifier predicted correctly. We sum up all of the correct answers and divide by the total number of reviews in our test set. If this number is equal to 1, our classifier was spot on every time; a score of 0.5 means half of its answers were correct. You can ask scikit-learn to calculate the accuracy as follows:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, preds))


The score may seem a bit low, since the classifier was only correct about 62% of the time, but keep in mind that with five rating classes, random guessing would be correct only 20% of the time.
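
As a sanity check, we can compute the same metric by hand on the ten example predictions printed earlier:

```python
predicted = [4, 1, 1, 2, 5, 4, 1, 5, 5, 1]  # the classifier's first ten predictions
actual    = [5, 1, 1, 3, 5, 4, 1, 5, 5, 2]  # the real ratings

correct = sum(p == a for p, a in zip(predicted, actual))
print(correct / len(actual))  # 7 of 10 match -> 0.7
```

accuracy_score performs exactly this division, just over the full test set rather than ten examples.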

Accuracy is a crude metric–there are of course finer-grained evaluation methods. It’s likely that some classes are ‘easier’ to predict than others, so we want to look at how well the classifier can predict each class (for example, only 5-star reviews) individually. Looking at results on a per-class level means that there are two different ways that the classifier could be wrong. For a given review and a given class, the classifier might have a false positive or a false negative classification. If we take 5-star reviews as an example, a false positive occurs when the classifier predicted that a review was a 5-star review when in fact it wasn’t. A false negative occurs when the classifier predicted a review wasn’t a 5-star review, when in fact it was.

Building on the ideas of false-positives and false-negatives, we introduce precision and recall. A classifier that could predict 5-star reviews with high precision would almost never predict that other reviews were 5-star reviews, but it might ‘miss’ many real 5-star reviews and classify them into other classes. A classifier with high recall for 5-star reviews would hardly ever predict that a 5-star review was something else, but it might predict that many other reviews are 5-star reviews. Precision and recall can be a bit confusing at first–there is a nice Wikipedia article that explains these topics in more detail.
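
A hedged sketch of how precision and recall are computed for a single class, again using the ten example predictions from earlier:

```python
predicted = [4, 1, 1, 2, 5, 4, 1, 5, 5, 1]
actual    = [5, 1, 1, 3, 5, 4, 1, 5, 5, 2]

def precision_recall(predicted, actual, cls):
    # true positives, false positives, and false negatives for one class
    tp = sum(p == cls and a == cls for p, a in zip(predicted, actual))
    fp = sum(p == cls and a != cls for p, a in zip(predicted, actual))
    fn = sum(p != cls and a == cls for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# All three 5-star predictions were right (precision 1.0),
# but one real 5-star review was missed (recall 0.75).
print(precision_recall(predicted, actual, 5))  # (1.0, 0.75)
```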

We would like our classifier to strike a balance between precision and recall for all of the classes, and we can measure both at once using the F1 Score, which combines precision and recall into a single number (their harmonic mean). We can get an overview of all the classes by using the classification_report from scikit-learn. Run the following code in a new cell:

from sklearn.metrics import classification_report
print(classification_report(y_test, preds))

precision recall f1-score support

1 0.72 0.79 0.75 118238
2 0.56 0.52 0.54 118438
3 0.54 0.51 0.53 118370
4 0.55 0.54 0.54 118538
5 0.71 0.76 0.74 118024

avg / total 0.62 0.62 0.62 591608

We can see from the above that the 1- and 5-star reviews are the easiest to predict, and we get F1 Scores of 0.75 and 0.74 respectively for them. The neutral reviews are more difficult to predict, as evidenced by F1 scores that are just over 0.5.

The final evaluation metric we can consider is a confusion matrix. A confusion matrix shows which classes are most often confused with one another. For our setup, we would hope that 1- and 5-star reviews are not confused too often by the classifier, but we don’t care as much if it mixes up 4-star and 5-star reviews.

We can examine the confusion matrix by running the following code:

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, preds))

[[92846 20722 3075 724 871]
[27820 62166 23652 3448 1352]
[ 5478 24054 60686 23535 4617]
[ 1319 3414 21111 63503 29191]
[ 1023 763 3056 23446 89736]]

The confusion matrix itself can be a bit … confusing. Each row represents the actual class of the reviews, while each column represents the class our classifier predicted (this is the convention scikit-learn uses). Let’s consider a few examples to see what is happening:

  • The first row describes all of the reviews that actually were 1-star reviews. The top left cell means that the classifier correctly predicted 92,846 1-star reviews.
  • The cell immediately to the right (first row, second column) indicates that there were 20,722 reviews which were 1-star reviews but which our classifier thought were 2-star reviews.
  • The cell in the first column of the second row indicates that there were 27,820 2-star reviews which our classifier thought were 1-star reviews.
  • The diagonal from the top left to the bottom right represents all of the correct predictions made by our classifier. We want these numbers to be the highest.
  • The top right corner says that there were 871 1-star reviews which the classifier thought were 5-star reviews.
  • The bottom left hand corner says that there were 1,023 5-star reviews which the classifier thought were 1-star reviews.

We can see that there are clusters of high numbers towards the top left and bottom right, showing that the classifier mainly confused, for example, 4- and 5-star reviews, or 1- and 2-star ones. The numbers towards the top right and bottom left are lower, showing that the classifier rarely was completely wrong (thinking that a 5-star review was actually a 1-star, or vice-versa).
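
The matrix itself is just a table of counts. Here is a hedged sketch of what confusion_matrix computes (using the same rows-are-actual, columns-are-predicted convention) on a tiny made-up example:

```python
from collections import Counter

def confusion(actual, predicted, labels):
    # Count each (actual, predicted) pair; rows = actual, columns = predicted
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

actual    = [1, 1, 2, 5, 5, 5]
predicted = [1, 2, 2, 5, 5, 1]
print(confusion(actual, predicted, [1, 2, 5]))
# [[1, 1, 0], [0, 1, 0], [1, 0, 2]]
```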

Remodeling our Problem

As our final task, we’ll try to model a simpler problem and run the exact same analysis. Instead of trying to predict the exact star rating, we’ll try to classify the posts into positive (4- or 5-star reviews) or negative (1- or 2-star reviews). We’ll remove all the 3-star reviews for this task.

We don’t need to recalculate the vectors since we’re using the same texts. We simply need to remove the instances and labels for all 3-star reviews and modify our labels (y_train and y_test). Because our vectorized reviews are stored in a sparse matrix (which supports row indexing) and our labels are stored in Python lists, we have to treat each of these separately. The code below strips out the 3-star reviews and their labels, then converts the remaining labels to n for negative or p for positive:

keep = set([1,2,4,5])

# calculate the indices for the examples we want to keep
keep_train_is = [i for i, y in enumerate(y_train) if y in keep]
keep_test_is = [i for i, y in enumerate(y_test) if y in keep]

# convert the train set
X_train2 = X_train[keep_train_is, :]
y_train2 = [y_train[i] for i in keep_train_is]
y_train2 = ["n" if (y == 1 or y == 2) else "p" for y in y_train2]

# convert the test set
X_test2 = X_test[keep_test_is, :]
y_test2 = [y_test[i] for i in keep_test_is]
y_test2 = ["n" if (y == 1 or y == 2) else "p" for y in y_test2]

Once we’ve set up these new arrays, we can run exactly the same “train-predict-evaluate” steps that we went through previously:

classifier.fit(X_train2, y_train2)
preds = classifier.predict(X_test2)
print(classification_report(y_test2, preds))
print(confusion_matrix(y_test2, preds))

>>> precision recall f1-score support

n 0.96 0.96 0.96 236676
p 0.96 0.96 0.96 236562

avg / total 0.96 0.96 0.96 473238

[[227648 9028]
[ 9237 227325]]

We can see that this task is much easier for our classifier, and it predicts the correct class 96% of the time! The confusion matrix tells us that there were 9,028 negative reviews that the classifier thought were positive and 9,237 positive reviews that the classifier thought were negative–everything else was correct.
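
The 96% figure can be recovered directly from the four cells of the confusion matrix above:

```python
# Cells of the binary confusion matrix above; labels are in alphabetical
# order, so row/column 0 is 'n' (negative) and row/column 1 is 'p' (positive)
tn, fp = 227648, 9028   # actual negatives: predicted n, predicted p
fn, tp = 9237, 227325   # actual positives: predicted n, predicted p

accuracy = (tn + tp) / (tn + fp + fn + tp)
print(round(accuracy, 4))  # 0.9614
```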


This post covered the basics of using a Support Vector Machine to classify text. After vectorizing text and training a classifier, two prediction tasks were performed–predicting the exact rating of each review vs. predicting whether the review was positive or negative.

Why would we be interested in predicting the ratings from text–after all, we already have the correct ratings at our disposal? There are many uses for systems such as the one we just built. For example, a company might want to analyze social media in order to find out how the public feels about the company, or reach out to customers who post negative reviews of the company. Specific companies are often mentioned on Twitter, Facebook, and other social media sites, and in these situations the text descriptions are not accompanied by a star rating.

As a more concrete example, imagine that you’re the manager of the NotARealName Hotel, and you’re automatically gathering tweets which mention your hotel, and you want

  • a) to know if the general sentiment regarding your hotel is going up or down over time, and
  • b) to respond to negative comments about your hotel (or share positive ones)

The following code uses our same classifier on new data. We use two fictitious tweets–the first one is obviously negative and the second one positive. Can we classify them without human intervention?

# only two texts as an example
texts = ["I really hated my stay at The NotARealName Hotel", "Had a really really great stay at NotARealName - would recommend to everyone"]

# note that we only call .transform() here and not .fit_transform()
# as we want to keep the vocabulary from the previous experiments
vecs = vectorizer.transform(texts)

# predict a positive or negative label for each input
print(classifier.predict(vecs))

>>>['n' 'p']

Our classifier determines that the first tweet is negative and the second one is positive. However, we wouldn’t want to trust it too much. Remember that it’s likely to be sensitive to the kind of language that’s used in Yelp reviews, so if we want to classify text from other areas, e.g., finance or politics, we’ll likely find the results to be less satisfying. However, with enough training data, we can easily retrain the classifier on examples that better match our desired task.

Get Python Training for Teams

About the Author:

Running Apache Spark on the Jupyter Notebook

March 9th, 2017

The Jupyter Notebook, formerly called the IPython Notebook, is a web-based interactive development environment that works well for Spark development. Jupyter lets users write Scala, Python, or R code against Apache Spark, execute it in place, and document it using markdown syntax.

It is natural and logical to write code in an interactive web page. The user can write a few lines of code, execute them, fix errors, and add some more code (and fix that). All of this is easier than using the cursor keys to scroll through the command history, or using a text editor that has no interpreter or Spark connection. On top of all this, the Jupyter Notebook user does not need to perform any configuration or be concerned with the details of the Spark implementation.

Get Spark Training for Teams

Running and Installing

Running Jupyter is as easy as installing Docker and then running this one command to download the image from Docker Hub and start it:

docker run -d -p 8888:8888 jupyter/all-spark-notebook

Then open Jupyter by navigating to localhost:8888 in your browser.


As you can see, when you click New it gives you the opportunity to write Scala, Python 2 or 3, or R code. There exist interpreters for other languages as well.

At this point, a dialogue box opens up into which you can type. Each of these boxes is called a cell. A cell can contain code to be executed or markdown to be rendered.

Construct an RDD

Just as when you use the Spark shells, when you write code in Jupyter there is no need to create the SQLContext/SparkContext or add import statements for them, since they are already brought into scope automatically.

Now we can construct an RDD. You simply write this code into a cell and then click Cell/Run Cells.

val data = Array(1, 2, 3, 4, 5)

val distData = sc.parallelize(data)

Working with the Notebook

You can change the title of the notebook by typing over the word “Untitled” at the top of the screen. There is no Save button. Jupyter saves all your changes in a .ipynb file as you work.


Add blank cells by clicking Insert.



As you work on your program, the screen will fill with errors and run output. Click Cell/All Output/Clear to clear all output.


Markdown is the syntax used to write README.md pages on GitHub. Use it to make headers, bulleted and numbered lists, and code blocks. You can use this cheat sheet for markdown.

To change the cell from code to markdown click Cell/Cell Type/Markdown.


Jupyter may start rendering the markdown as you type. To render the cell fully, click Cell/Run Cells as usual.


Deploying Jupyter

If you want to run Jupyter over the public internet, you should configure Nginx or Apache as a reverse proxy server in front of it; the proxy exposes Jupyter on port 80, so there is no need to change your firewall rules. Be sure to give it a password, since Jupyter notebooks also let you run Bash commands. A hacker could do real damage to your machine if you left that open.

Jupyter is generally configured to work for one person, i.e., a local installation of Spark. But you can make it run atop a Spark Mesos cluster. Here are some instructions for that.

Get Spark Training for Teams

About the Author:

Analyzing 4 Million Yelp Reviews with Python on AWS

February 27th, 2017

Yelp runs a data challenge every year in which it invites people to explore its real-world datasets for unique insights. In this post, we’ll show how to load the dataset into a Jupyter Notebook running on a powerful but cheap AWS spot instance, and produce some initial explorations and visualizations.

This post is aimed at people who:

  • Have some existing Python knowledge
  • Are interested in learning more about how to process and visualise large-scale data with Python

If you are interested in taking part in the Yelp challenge, this tutorial will leave you in a good place to start more interesting analyses.


In this post, we’ll be looking at the Yelp data from the Yelp Dataset Challenge. This is an annual competition that Yelp runs where it asks participants to come up with new insights from its real-world data. We will:

  • Launch an AWS EC2 Spot instance with enough power to process the dataset (4 million reviews) quickly.
  • Configure the EC2 instance and install Jupyter Notebook as well as some data processing libraries.
  • Display some basic analysis of the data, along with visualisations using Matplotlib.

If you have a high end desktop or laptop (with at least 32GB RAM), you can probably run most of the analyses locally. However, learning how to process data in the cloud is a useful skill, so I’d recommend following along with the entire tutorial. Even if the Yelp data is small enough for your local machine, you may well want to process larger datasets in the future. And considering that AWS offer instances with up to 2TB of RAM, the method described here will work for even larger datasets.

Creating an AWS EC2 Spot Instance

Amazon Web Services (AWS) offer Elastic Compute Cloud (EC2) instances. These are on-demand servers that you can rent by the hour. They tend to be fairly expensive, especially for the more beefy machines, but luckily AWS also offer so-called ‘spot’ instances. These are instances that they currently have in excess supply, and they auction them off temporarily to the highest bidder, normally at much lower prices than their regular instances. This is very useful for short-term needs (such as data analysis) because the chance of someone else outbidding you while you still need the machine is comparatively low. To fire up a spot EC2 instance, follow these steps:

Get AWS Training for Teams

  • Visit aws.amazon.com and sign up for an account (assuming that you don’t have an account with them already). It’s a somewhat complicated signup process, and it requires a credit card, even for their free trial, so this step might take some time. You can instead use Microsoft Azure, Google Cloud Compute, Rackspace, Linode, Digital Ocean, or any of a number of cloud providers for this step if you want, but all require a credit card for sign up and they don’t all offer the same variety of instances or the same discount pricing structure as AWS.
  • Visit the AWS Console. Pick a region using the dropdown in the top right-hand corner. For latency reasons, it’s nice to pick a region close to you, but some regions have more instances available and have cheaper spot instances. For example, even though I’m in The Netherlands, I chose the Oregon region (us-west-2) while making this post, as there were low-priced spot instances available there. (If you really need to save every cent you can, this Mozilla Python script can help you find the cheapest instance currently available worldwide.)
  • Click Services in the top left-hand corner, and choose EC2 from the list. You’ll now be taken to the main EC2 page.
  • In the left-hand column, select Spot Requests and then click Request Spot Instances

There are many options we can modify when creating a launch request for an instance. Luckily, we can leave almost all the defaults as they are. The ones we will change are:

  • The AMI — this is the “Amazon Machine Image.” It defaults to Amazon Linux, but we’ll be using Ubuntu-Server instead. Choose Ubuntu Server 16.04 LTS (HVM) from the AMI drop down.
  • The instance type–we’ll want an instance with lots of RAM (at least 30GB, but preferably 60+ GB) and at least some SSD space. Click the x next to the default selected instance to remove it, then click the Select button to choose a new one. You’ll see a popup listing the available instance types; you can use the column headings to sort by a specific column. To choose an instance, I sorted by price and then found the first instance with 30GB of RAM and some SSD space, which was an m3.2xlarge. The m instances aim to balance CPU, RAM, and disk; the r instances focus on RAM and are also good for data analysis if you find a cheap one.


At the bottom of the screen, click Next to get to the second (and last) page of settings for your instance.

  • Under Set Keypair and Role, click Create a new key pair. This will open a new tab and take you to the EC2 key management page. Choose to create a new key pair again, give it a name, and download the private key when prompted. Save it in your home directory as ec2key.pem.
  • Under Manage firewall rules select default. This will create an inbound firewall rule that allows the instance to accept SSH connections (we’ll be connecting to the instance via SSH).

Now click Review at the bottom of the page, check that everything looks as expected, and click Launch. This creates a spot instance “request,” and you might have to wait a bit for it to be “fulfilled” (meaning an instance became available that matched your request). You can see the state of the request under the Spot Requests tab (the one you used to create the request). When the request has been labeled fulfilled (and given a green icon), you’ll see the instance under the Instances section (you sometimes need to reload the Instances section to see the new instance).

Note: Be aware that the prices of spot instances can rise unpredictably. By default, your bid is capped at the on-demand price (the price you usually pay), so you won’t be billed more than that, but if the spot price rises above your cap you can lose your instance (and your unsaved work) suddenly.


Scroll to the right in the Instances window to find the Public DNS of your instance, and copy this to your clipboard.


Connecting to Our EC2 Spot Instance

Now open up a terminal or command prompt on your local machine. If you’re using Windows, you won’t be able to use SSH by default. Most people use PuTTy to SSH from Windows, but if you have a modern version of Windows it’s easier to enable WSL (Windows Subsystem for Linux). Once you’ve set that up, you can use SSH exactly as described in this post. As an alternative, you could install Git Tools for Windows. In the last step of the installation process, you’ll be asked if you want to Use Git and optional Unix tools from the Windows Command Prompt. Select yes, and you’ll be able to use SSH from the Windows CMD prompt.

Before we can connect to the instance, we need to change the permissions for the .pem key file that you downloaded earlier. Assuming that your key was saved in your home directory as ec2key.pem, run the following command:

chmod 600 ~/ec2key.pem

Now you can use it to connect to the instance. Make sure you still have the Public DNS name for your instance in your clipboard, and run the following command:

ssh -i ~/ec2key.pem -L 8888:localhost:8888 \
    ubuntu@<your-instance-public-dns>

This connects to your instance, allowing you to run commands on it via SSH. The -i flag points to your key file, which proves that you’re the owner of the instance and the -L flag sets up port forwarding. Here we specify that port 8888 on our local machine should be forwarded to port 8888 on the remote instance. We’ll need this in a bit so that we can view a Jupyter notebook locally and have it execute on our instance.

Configuring Our EC2 Instance

To set up our instance, we only need to configure pip and install some Python libraries for data processing. Run the following commands on the instance:

export LC_ALL="en_US.UTF-8"
sudo apt update
sudo apt install python3-pip
pip3 install pip matplotlib jupyter --user

The first command sets the LC_ALL environment variable, which specifies the locale. By default, Ubuntu Server often does not set this, and pip needs locale information to function correctly. The next two commands install pip using Ubuntu’s apt package manager; we then use pip to upgrade itself, as the apt versions of software sometimes lag behind the current releases. We also install jupyter, which is the notebook we’ll use, and matplotlib for plotting.

If you chose an instance with an SSD, you’ll have to mount it. Run:

lsblk

to see the available disks. You’ll probably see the SSD listed as /dev/xvdb (though it might be called something else). Run the following commands to mount the SSD, substituting xvdb if necessary:

sudo mkdir /mnt/ssd
sudo mount /dev/xvdb /mnt/ssd
sudo chown -R ubuntu /mnt/ssd

If you picked a machine with about 30GB of RAM, you can still run into some issues while loading and manipulating some of the Yelp data. I created another 20GB of swap space (virtual RAM on the hard drive) just in case (this step takes a while to run):

sudo dd if=/dev/zero of=/mnt/ssd/swapfile bs=1G count=20
sudo mkswap /mnt/ssd/swapfile
sudo swapon /mnt/ssd/swapfile
Getting the Yelp Data onto Our Machine

Currently, Yelp requires that you fill out an online form to get a link to access the data. This link is then tied to the machine where you filled out the form. There may be a workaround, but I had to download the data locally and then transfer it across to AWS, which took quite a while with my slow uplink connection. Fill out the form and obtain the download link here: https://www.yelp.com/dataset_challenge/dataset

Once you’ve downloaded the approximately 1.8GB tar file, you can scp it to your instance with the following command (assuming that you saved the tar file to your Downloads folder; if not, replace ~/Downloads/ with the path to the Yelp file). You’ll also need to substitute the DNS string for your own. Note that this command needs to be run on your local machine, not from the EC2 instance.

scp -i ~/ec2key.pem ~/Downloads/yelp_dataset_challenge_round9.tar \
    ubuntu@<your-instance-public-dns>:/mnt/ssd/

Now, on the instance, you can untar the data with the following commands:

cd /mnt/ssd
tar -zxvf yelp_dataset_challenge_round9.tar

This should create a bunch of large .json files. We’ll be opening these directly in Python, so our command line work is nearly done.

Starting and Accessing the Jupyter Notebook

Now start the Jupyter notebook server on the instance by running:

jupyter notebook

You should see output saying that no web browser was detected, and giving you a URL with a token, similar to the following:


Copy the URL to your clipboard and paste it into a browser on your local machine. You’ll see the default Jupyter Notebook page. Create a new Python 3 notebook by selecting New in the top right-hand corner and then choosing Python 3.


If you’ve never used Jupyter Notebooks before, take a few moments to get acquainted with how things work. You can insert cells, delete cells, or run specific cells. Cells are useful because you can run specific blocks of code after making a change without having to rerun all the code above. Each cell shares a namespace with any previously run cell, so you can always access your variables and imports from new cells. The most useful keyboard shortcut is Ctrl + Enter, which runs the code in the currently selected block and displays the output.

The default working directory is the directory from which you launched Jupyter. If you followed the commands as laid out above, this will have been /mnt/ssd/ on your instance, so the JSON Yelp data files should be in the current working directory. To check, you can type !ls into a cell and run it. This will output all the filenames in the current directory.

Starting our Data Analysis

Now it’s finally time to load the data into Python and play around with it. The Yelp data is in a bit of a strange format–although the files have a .json extension, each line of a file is a separate JSON object, rather than the whole file being one single JSON object.
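
A minimal example of this line-delimited format and how to parse it (the two sample reviews are made up, but the stars and text fields do exist in the real data):

```python
import json

# Two reviews, one JSON object per line, as in the Yelp files
raw = '{"stars": 5, "text": "Great!"}\n{"stars": 1, "text": "Awful."}'

reviews = [json.loads(line) for line in raw.strip().split("\n")]
print(len(reviews))         # 2
print(reviews[0]["stars"])  # 5
```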

In the first cell, we’ll import the libraries that we’ll be using. We’ll set matplotlib to work in notebook mode, which makes our plots interactive (mousing over them shows the X-Y coordinates). Put the following code into the first cell of the notebook, and run it.

%matplotlib notebook
from matplotlib import pyplot as plt
import json
from collections import Counter
from datetime import datetime

Now we’ll read in the entire review file and split it into a list of individual (string) reviews. This takes about half a minute, even on a powerful machine.

t1 = datetime.now()
with open("yelp_academic_dataset_review.json", encoding="utf8") as f:
    reviews = f.read().strip().split("\n")
print(datetime.now() - t1)
print(len(reviews))

Note that you can use tab completion for the filename which is super useful to prevent typos and speed up your coding in general (e.g., type yelp_a and press tab instead of typing out the whole name).

You should see a printout showing that the code took about 20-30 seconds to run, and that there are a little over 4 million reviews in the dataset. (All lines of script output in this post will be prefixed with >>>, but you won’t see the prefix in the actual notebook).

>>> 0:00:21.302640
>>> 4153150

In the next cell, let’s convert all the reviews to JSON. This takes a bit longer than reading them in from the file (about 45 seconds on the machine I was using).

reviews = [json.loads(review) for review in reviews]
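
To make the format concrete, here is a tiny, self-contained sketch of the same line-delimited pattern (the field names mirror the Yelp schema, but the two records are invented for illustration):

```python
import json

# A miniature of the line-delimited format Yelp uses: one JSON object
# per line (the sample data here is made up for illustration).
raw = '{"stars": 5, "text": "Great food"}\n{"stars": 2, "text": "Too slow"}\n'

# Same pattern as the notebook: split on newlines, then parse each line.
sample_reviews = [json.loads(line) for line in raw.strip().split("\n")]
```

Parsing line by line like this is exactly what the notebook does at full scale.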

And it’s always nice to have one review printed in full, so that we have an easy reference for how to access the pieces of each review. Add the following to a new cell and run it (a bare expression as the last line of a cell makes Jupyter display its value):

reviews[0]

We can get a basic distribution of the star ratings that users usually leave by using a Python Counter:

stars = Counter([review['stars'] for review in reviews])
stars

>>> Counter({5: 1704200, 4: 1032654, 1: 540377, 3: 517369, 2: 358550})

We can see that there are more 5-star reviews than any other rating, but a visualisation would make the distribution much clearer. Let’s create a basic bar graph of these numbers. Note that we normalize by the total number of reviews, so the Y-axis shows the fraction of total reviews in each category. This post was partly inspired by one that used an Amazon review dataset, which you can find at http://minimaxir.com/2017/01/amazon-spark/. It’s interesting that those reviews followed a similar star distribution.

Xs = sorted(list(stars.keys()))
Ys = [stars[key]/len(reviews) for key in Xs]
plt.bar(Xs, Ys)

This produces the following graphic:

[Figure: bar chart of the star-rating distribution, normalized to the fraction of total reviews]

In notebook mode, Jupyter will keep track of whether or not a specific plot is “active.” This is useful as it allows you to plot different points onto the same image. However, it can also get in the way if you’re trying to create a new plot and the output keeps going to the previous one. After creating each plot, you’ll see it has a header that looks like this:

[Screenshot: the header of an interactive notebook plot, with the blue deactivate button in the top right]

Press the blue button in the top right after creating each plot to deactivate it. New calls to plt.plot, etc., will then be sent to new graphs instead of being added to the previous one.

Finding the Most Prolific Reviewers

Let’s find the users who have left the most reviews. We’ll create a Counter object keyed by user_id (note that Yelp has anonymized the user IDs in this dataset, so they all look a bit strange). Add the following code to a new cell and run it.

users = Counter([review['user_id'] for review in reviews])
users.most_common(10)

>>> [('CxDOIDnH8gp9KXzpBHJYXw', 3327), ('bLbSNkLggFnqwNNzzq-Ijw', 1795), ('PKEzKWv_FktMm2mGPjwd0Q', 1509), ('QJI9OSEn6ujRCtrX06vs1w', 1316), ('DK57YibC5ShBmqQl97CKog', 1266), ('d_TBs6J3twMy9GChqUEXkg', 1091), ('UYcmGbelzRa0Q6JqzLoguw', 1074), ('ELcQDlf69kb-ihJfxZyL0A', 1055), ('U4INQZOPSUaj8hMjLlZ3KA', 1028), ('hWDybu_KvYLSdEFzGrniTw', 988)]

We can see that one user has left an impressive 3327 Yelp reviews. Let’s name this user Mx Prolific and create a collection of only their reviews.

mx_prolific = [review for review in reviews if review['user_id'] == 'CxDOIDnH8gp9KXzpBHJYXw']
mp_stars = Counter([review['stars'] for review in mx_prolific])
mp_stars

>>> Counter({3: 1801, 4: 1036, 2: 390, 1: 53, 5: 47})

Note that Mx Prolific’s ratings diverge strongly from the overall distribution we saw before. While 5-star reviews are the most common overall, Mx Prolific has awarded only 47 of them (perhaps those establishments are worth checking out!), alongside nearly 2,000 3-star reviews.

Now we can create a second-level Counter to count the frequencies of the number of reviews left by each individual user. We summarized our original data by combining all the reviews left by the same user into a single record. Now we want to summarize further and combine, for example, all the users who have left exactly 12 reviews into a single record.

num_reviews_left = Counter([x[1] for x in users.most_common()])
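
To see what this second-level Counter produces, here is a toy version with made-up user IDs (note that Counter(users.values()) counts the same numbers as iterating over x[1] in users.most_common()):

```python
from collections import Counter

# Made-up example: four users; user "a" left two reviews, the rest one each.
users = Counter(["a", "b", "a", "c", "d"])   # user id -> number of reviews
num_reviews_left = Counter(users.values())   # number of reviews -> number of users
# Three users left exactly one review; one user left exactly two.
```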

This allows us to visualise how many reviews most users leave. Because nearly all users have left only a handful of reviews, we’ll visualise the drop-off only up to 20 reviews. (Change line 3 below to plt.bar(Xs, Ys) to plot all the records and see how plotting more data can sometimes produce a less informative result.)

Xs = [x[0] for x in num_reviews_left.most_common()]
Ys = [x[1] for x in num_reviews_left.most_common()]
plt.bar(Xs[:20], Ys[:20])
plt.xlabel("Number of Reviews")
plt.ylabel("Number of Users")

This produces the following output. We can see that a huge number of users leave just one review, and that the dropoff over 2, 3, and 4 reviews is pretty steep.

[Figure: bar chart of number of users by number of reviews left (1 to 20)]

We can do the same to see how many reviews are typically received by a single business.

businesses = Counter([review['business_id'] for review in reviews])
num_reviews_by_business = Counter([x[1] for x in businesses.most_common()])
Xs = [x[0] for x in num_reviews_by_business.most_common()]
Ys = [x[1] for x in num_reviews_by_business.most_common()]
Xs = Xs[:18]
Ys = Ys[:18]
plt.bar(Xs, Ys)
plt.xlabel("Number of Reviews")
plt.ylabel("Number of Businesses")

This produces the following output and image. A full 21,908 businesses have received exactly three reviews, the minimum possible, since Yelp only included businesses with at least three reviews in this dataset. The drop-off in the number of reviews a business receives is slower than the drop-off for reviews left by users, but low review counts are still by far the most common.

>>> [(3, 21908), (4, 15473), (5, 11498), (6, 9012), (7, 7475), (8, 6061), (9, 5208), (10, 4308), (11, 3857), (12, 3508)]
>>> 947

[Figure: bar chart of number of businesses by number of reviews received]

Lastly, we’ll determine whether good or bad reviews tend to have more words. The next post will focus on natural language processing, and we’ll be covering far more sophisticated text processing techniques, but for now we’ll simply use Python’s split() function to split each review into words, and then look at averages by number of stars.

import statistics
review_lengths_by_star = [[], [], [], [], []]
for review in reviews:
    length = len(review['text'].split())
    idx = review['stars'] - 1
    review_lengths_by_star[idx].append(length)
print([statistics.mean(review_lengths) for review_lengths in review_lengths_by_star])

>>> [146.6973890450556, 146.11345697950077, 135.06209687863014, 118.64652826600197, 93.93472010327426]

We can see that negative reviews tend to be a bit longer, with 1-star reviews having an average of 147 words, while 5-star reviews have a lower average of 94 words. We’ll make a final plot to visualise this:

plt.bar([1, 2, 3, 4, 5], [statistics.mean(rlength) for rlength in review_lengths_by_star])
plt.xlabel("Stars")
plt.ylabel("Word length")

[Figure: bar chart of average review word count by star rating]

Let’s recap what we did. We set up a powerful data processing environment, and took a cursory look at some of the Yelp data–but we’ve only just scratched the surface in terms of insights we can draw from these data. In a later post, we’ll be using the same dataset to introduce some machine learning and natural language processing concepts.

Get AWS Training for Teams

About the Author:

Comparing and Contrasting Apache Flink vs. Spark

February 21st, 2017

Apache Flink and Spark are major technologies in the Big Data landscape. There is some overlap (and confusion) about what each does and how they differ. This post will compare Spark and Flink, looking at what they do, how they are different, what people use them for, and what streaming is.

Streaming Data

Streaming data means data that comes in continuously. Processing streaming data poses technical complications not inherent in batch systems. Providing the ability to do that is the technological breakthrough that makes streaming products like Flink, Spark, Storm, and Kafka so important. It lets organizations make decisions based on what is happening right now.

When people learn about streaming data, the first use case they usually look at is Twitter. Twitter has APIs that let programmers query tweets in real time. Streaming data examples also include traffic of any kind, stock ticker prices, weather readings coming from sensors, and temperature and vibration gauges on industrial machines.

What do Flink and Spark Do?

Apache Spark is considered a replacement for the batch-oriented Hadoop system. But it includes a component called Apache Spark Streaming as well. Contrast this to Apache Flink, which is specifically built for streaming.

Flink and Spark are in-memory systems that do not persist their data to storage by default. They can write their data to permanent storage, but the whole point of streaming is to keep data in memory and use it right now; there’s little reason to write it to storage first when the goal is to analyze current data.

Spark and Flink are far more sophisticated than other real-time systems, like the syslog connection over TCP built into every Linux system. This is because Flink and Spark are big data systems: they are fault tolerant and built to scale to immense amounts of data.

All of this lets programmers write big data programs (like MapReduce functions) with streaming data. They can take data in whatever format it is in, join different sets, map it to key-value pairs, and then reduce those pairs to produce some final calculated value. They can also plug these data items into machine learning algorithms to make projections (predictive models) or discover patterns (classification models).
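
As a rough illustration of that map/reduce shape, here is a toy word count in plain Python; this is not Spark or Flink code, just the pattern itself:

```python
from collections import defaultdict

# Two toy input records (lines of text).
records = ["spark flink", "spark kafka"]

# Map: emit a (key, value) pair for every word.
pairs = [(word, 1) for line in records for word in line.split()]

# Reduce: combine the values that share a key.
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value
```

In Spark or Flink the same two steps run in parallel across a cluster; the logic is identical, only the scale changes.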

How is Apache Flink different from Apache Spark, and, in particular, Apache Spark Streaming?

At first glance, Flink and Spark would appear to be the same. The main difference is Flink was built from the ground up as a streaming product. Spark added Streaming onto their product later.

Let’s take a look at the technical details of both.

Spark Micro Batches

Spark divides streaming data into discrete chunks called micro batches, then repeats that in a continuous loop. Flink takes a checkpoint on streaming data to break it into finite sets. Incoming data is not lost when taking this checkpoint, as it is preserved in a queue in both products.

Either way you do it, there will always be some lag time when processing live data, so dividing it into sets should not matter much. After all, when a program runs a MapReduce operation, the reduce operation runs on a map dataset that was created a few seconds earlier.
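
The micro-batch idea can be sketched in a few lines of plain Python; this is only a conceptual illustration of slicing a stream into finite chunks, not how Spark is actually implemented:

```python
# Toy micro-batcher: group a continuous stream of records into
# fixed-size batches and hand each batch off as a finite unit of work.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # a finite chunk for the batch engine
            batch = []
    if batch:                    # flush any trailing partial batch
        yield batch

batches = list(micro_batches(range(7), 3))
```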

Using Flink for Batch Operations

Spark was built, like Hadoop, to run over static data sets. Flink can do this by just stopping the streaming source.

Flink processes data the same way, whether it is finite or infinite. Spark does not: it uses DStreams (Discretized Streams) for streaming data and RDD (Resilient distributed dataset) for batch data.

The Flink Streaming Model

To say that a program processes streaming data means it opens a file and never closes it. That is like keeping a TCP socket open, which is how syslog works. Batch programs, on the other hand, open a file, process it, then close it.

Flink says it has developed a way to checkpoint streams without having to open and close them. To checkpoint means to note where one has left off and then resume from that point. Flink also runs a type of iteration that lets its machine learning algorithms run faster than Spark’s. That is not insignificant, as ML routines can take many hours to run.

Flink versus Spark in Memory Management

Flink has a different approach to memory management. Flink pages out to disk when memory is full, which is what happens with Windows and Linux too. Spark crashes that node when it runs out of memory. But it does not lose data, since it is fault tolerant. To fix that issue, Spark has a new project called Tungsten.

Flink and Spark Machine Learning

What Spark and Flink do well is take machine learning algorithms and make them run over a distributed architecture. That, for example, is something that the Python scikit-learn framework and CRAN (Comprehensive R Archive Network) cannot do. Those are designed to work on one file, on one system. Spark and Flink can scale and process enormous training and testing data sets over a distributed system. It is worth noting, however, that the Flink graph processing and ML libraries are still in beta test, as of January 2017.

Command Line Interface (CLI)

Spark has CLIs in Scala, Python, and R. Flink does not really have a CLI, but the distinction is subtle.

To have a Spark CLI means a user can start up Spark, obtain a SparkContext, and write programs one line at a time. That makes walking through data and debugging easier. Walking through data and running map and reduce processes, and doing that in stages, is how data scientists work.

Flink has a Scala CLI too, but it is not exactly the same. With Flink, you write code and then run print() to submit it in batch mode and wait for the output.  

Again, this might be a matter of semantics. You could argue that spark-shell (Scala), pySpark (Python), and sparkR (R) are batch too. Spark is said to be “lazy”: when you create an object, Spark only creates a pointer to it, and does not run any operation until you ask it to do something like count(), which requires materializing the object in order to measure it. Only then does it submit the work to its batch engine.
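
Laziness of this kind can be illustrated with a plain Python generator (an analogy only; Spark’s evaluation model is more involved than this):

```python
# Nothing is computed when the generator is created; work only happens
# when a terminal operation (here, sum) pulls results through it.
def lazy_map(data, fn):
    for item in data:
        yield fn(item)

squares = lazy_map(range(5), lambda x: x * x)  # no computation yet
total = sum(squares)                           # computation happens here
```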

Both frameworks, of course, support batch jobs, which is how most people would run their code once they have written and debugged it. With Flink you can run Java and Scala code in batch. With Spark you can run Java, Scala, and Python code in batch.

Support for Other Streaming Products

Both Flink and Spark work with Kafka, the streaming platform originally developed at LinkedIn. Flink also works with Storm topologies.

Cluster Operations

Spark or Flink can run locally or on a cluster. To run Hadoop or Flink on a cluster, one usually uses YARN. Spark is often run with Mesos. If you want to use YARN with Spark, you have to download a version of Spark that has been compiled with YARN support.

The Flink End Game

What Spark has that Flink does not is a large installed base of production users. If you look on the Flink website you can see examples of what some companies have built with it, including Capital One and Alibaba. So it’s certainly going to move beyond the beta stage and into the mainstream. Which one will have more users? No one can say, although more funding is pouring into Spark right now. Which one is more suitable for streaming data? That depends on requirements and merits further study.

About the Author:

A Forest-Level View of the Big Data Ecosystem

February 13th, 2017

Rich Morrow teaches Big Data, DevOps and AWS courses for DevelopIntelligence. He is a 20+ year veteran of IT and has done everything from being a trench-level core developer to VP of engineering. He is currently a developer evangelist, instructor, and prolific writer on Cloud, Big Data, DevOps/Agile, Mobile, and IoT topics.

Rich works with many companies and developers each year and this gives him a forest-level view of how technology is evolving and the pros/cons of many different tools. For this interview, we focused on the Big Data tools and the Big Data technology landscape.

DevelopIntelligence: Tell us a little bit about yourself, your background in technology, and what you’re working on.

Rich: Sure, I’ve been a software developer, by trade, for about 20 years. I did fingers-on-the-keyboard for many years. I’ve worked for Fortune 500 companies, government, startups, and have founded my own consultancy. I’ve kind of seen the IT world from every angle…but my real passion is startups.

I started my own consultancy about seven or eight years ago. At first, I was basically doing fingers-on-the-keyboard software dev for the first two or three years of consulting, and then got into training on public cloud technologies. Instructor-led live training is about 50-60% of what I do, and the rest is a mishmash of things like training videos for O’Reilly, training for DevelopIntelligence, webinars, blog posts, white papers, and speaking engagements. I did a couple of keynotes last year, and got to present at the AWS conference in November.

Get Big Data Training for Teams

DevelopIntelligence: Oh cool. What do you enjoy about training?

Rich: I enjoy getting to work with about 30 to 40 really cool companies for like a week each throughout the course of a year.

The thing that I love about the training is it really gives me a broad perspective, because when I am teaching, I am learning.

Students are telling me things. They’ll say, “Hey, we had a problem with this technology in our organization and here’s why,” or, “Hey, we found this other thing really useful.”

I get to pick up little tidbits, little nuggets of valuable stuff from a whole bunch of different folks. So when I go back to write thought leadership pieces and speak at events, I can come with a broader perspective.

DevelopIntelligence: Tell us a bit about the evolution of the big data ecosystem. Where could it be going and what are you bullish about?

Rich: Yes, totally great question. So, let’s go back about 50, 60 years maybe to really start talking about this. Big data analytics was confined to the big companies because in order to do big data analysis you had to plunk down a few million to buy (and then maintain) some Oracle or EMC hardware and software. Then you’d have to go back every few years, renew your license, and shovel another few million at Larry Ellison. This really limited who was able to do big data analysis.

The democratization of technology in the last 10-15 years is probably the most exciting thing to happen in IT in my lifetime. It’s reduced cost and complexity of even traditionally expensive systems like Data Warehouses to the point that everyone now has access to these tools. Open source technologies like Spark combined with low-cost, pay-as-you-go cloud platforms like AWS means you can rent cloud hardware and software to run these analyses for less than $100. And even if you’re a mom and pop shop and you just want to analyze your weblogs, you can do that in Amazon for like five, ten bucks a pop once you’ve figured out what you need to do.

This is really exciting because just like in any kind of democracy, you get a whole bunch of ideas flowing. You need a large community of people sharing code, sharing ideas, sharing architectural patterns coming up with new and novel ways to do things. And when this happens, everybody benefits.

I’m also really excited about the Internet of things (IoT) as it’s the marriage of our physical and digital worlds. This is going to be much bigger and impactful than anything that’s come before. To keep up on the developments, I attend as many IoT conferences as I can.

There are all these cool stories coming out of the IoT world. One that sticks in my mind is the company doing “IoC” (Internet of Cattle), where they literally affix a sensor to cows and track their temperature, how many times they got sick, whether they got a shot or immunization, all that detail. The goal was that when you go into Whole Foods and buy some steak, it would literally list out all the important aspects of your meat. You could tell if the refrigerated shipping truck ever failed and took the meat over a certain temperature; you could see what it was fed and what antibiotics it was injected with. You could see all that data. That’s just one powerful example that you look at and say, “Wow, this is going to change everything.” We’re starting to put IoT sensors on our parking structures so that when you drive in, it just automatically tells you which spaces are open on which floors. If it’s a driverless car, it could even park itself there. With IoT, there are all kinds of ways we can help optimize businesses, lives, all kinds of cool stuff.

DevelopIntelligence: Are big data and the Internet of Things evolving together?

Rich: Absolutely. It is really interesting, because technologies from totally different disciplines are now influencing each other, like how genome sequencing work coming out of biology produces new mathematical algorithms that we can plug back into unrelated systems, such as cascading network data failures.

Technology is the common catalyst for all this. I’m constantly bringing up mobile, IoT, cloud, and Big Data in a single conversation because they are so interconnected. If you are trying to build a mobile app these days and you want it to scale, you are going to go to public cloud. When it scales, it’s going to kick out a bunch of data, which you then want to extract value from, and that’s going to lead you down the big data rabbit hole.

DevelopIntelligence: With specific technologies in those spaces that you’ve been working with over the last five to fifteen years… how have they been evolving and which ones are you bullish about?

Rich: The biggest thing that happened, probably right around that timeline of 15 years ago, was when open source became valid in ways it wasn’t before. Steve Ballmer (from Microsoft) hated Linux because Linux started eating their lunch. Recent versions of Windows include Linux. You know you’ve won when they mimic you.

Hadoop was the first open source project to really shake up the big data world. It kind of came out of nowhere, and it provided a low-cost storage and processing framework that could scale like mad. Because it makes no assumptions about the underlying hardware, a lot of companies would just throw it on five or ten or twenty old machines lying around and boom! All of a sudden you’ve got something that folks could start using as an analytics engine. They could build a NoSQL database with HBase, a data warehouse with Hive, and so forth. A whole complete, rich ecosystem of projects grew up out of that. Years and years of growth, innovation, and adoption of the Hadoop platform resulted in a large user base that attracted many other projects.

Probably the hottest project in any tech space right now is Apache Spark, an in-memory analytics engine that came out of a group called the Berkeley Data Analytics Stack (or BDAS), the same folks who developed Apache Mesos. Lots of people would lampoon me for this, but Spark is basically Hadoop on steroids, at least the processing half of Hadoop (Spark brings no native storage, but it can and frequently does use Hadoop’s storage engine, HDFS, the Hadoop Distributed File System). Spark lets you do all the same types of parallel data analytics you did with Hadoop MapReduce, but it does them much, much faster, lending itself to real-time analytics, interactive analytics, and SQL-based queries. It ships with a module called Spark SQL that can essentially turn your Spark cluster into a high-speed data warehouse, which you can plug your visualization interfaces (like Tableau, Looker, MicroStrategy) right into and start doing point-and-click analytics. And the SQL actually runs as a first-class citizen, so it doesn’t need to compile down like Hive does in Hadoop. It just runs crazy fast: 10 to 50 times faster, depending on the query.

Spark gets even more attractive when you pair it with public cloud. The high-memory machines Spark requires are relatively expensive, because memory is still one of the most expensive parts of a server. Rather than buying these expensive clusters of several hundred or thousand machines, you can just run them on Amazon on an hourly basis. And like a lot of organizations, you may do your analytics for less than an hour once a day, or a few hours a month or quarter.

If you were doing these analytics with on-premise systems, you’d waste a lot of cash on licenses, consulting fees, power, cooling, data center real estate, replacing failed hardware and other maintenance costs. All these hidden costs just go straight out the window when you look at something like Hadoop in AWS. There’s a lot of companies out there, big organizations, big enterprises that do like their weekly analytics for 20 to 50 bucks a month on Amazon.

DevelopIntelligence: Some people say there’s too much big data hype. Do you ever worry or think about the hype cycle?

Rich: Hype definitely exists, but maybe less so in this space. “Big Data” has already passed through Gartner’s trough of disillusionment because it’s been out there for so long. Many hot technologies get overhyped, but once they come through the “disillusionment” phase, you know the hype has leveled off.

In Big Data, something like streaming or Hadoop or Spark will come out and be followed by a big six-to-eight-month hype phase. Everybody goes, “This is the greatest thing ever! The old stuff is no longer cool.” Then you go through this realization phase, where you start seeing that some parts don’t quite work out the way you expected and the technology has some limitations you didn’t know about. I would say it takes about a year for all that to settle down, until you have a realistic understanding of what exactly the technology is, where it shines, and where it doesn’t.

I think all the technologies we’ve talked about have gone beyond the disillusionment phase. Certain cloud-enabled data warehouses and NoSQL have gone way beyond it. IoT is just starting to get there, and machine learning will come next.

DevelopIntelligence: What would you recommend that someone young or new to the field focus on learning?

Rich: First thing I would say is understand distributed system design. When you look at all these systems that do big data analytics, whether it’s Hadoop or Spark or Redshift or whatever else is out there, even some of the NoSQL engines we use for big data analytics, they are all distributed systems. So, understand how these systems work and what their limitations are. Things like the CAP Theorem, horizontal linear scale, MapReduce, and eventual consistency underpin all of the modern distributed systems. Then find one you believe in and become an expert; the big one right now is probably Spark. A good language to learn is R, via its Spark binding SparkR, since R is the statistical language that came out of the data sciences specifically for analytics. Learning public clouds is probably important, as well as streaming.

I think the recipe for success these days with any technology is to become a jack of all trades and a master of some, so that you don’t go deer-in-the-headlights when somebody on your team talks about another technology, and you are the go-to guy or gal when they have questions about the two or three technologies in your wheelhouse.

The challenge with both cloud and big data is that if you are not paying attention to the space for even a year, it completely leapfrogs what was available the year before. You don’t want to be a company shipping last year’s design when all your competitors have something faster, cheaper, and more feature-rich.

Get Big Data Training for Teams