
What machine learning is today, and what it could be soon

February 18th, 2019

If AI is a broad umbrella that includes the likes of sci-fi movies, the development of robots, and all sorts of technology that fuels legacy companies and startups, then machine learning is one of the metal ribs (perhaps the strongest) that holds the AI umbrella up and open.

So, what is machine learning offering us today? And what could it offer us soon? Let’s explore the potential for ML technologies.

Intro to machine learning

Machine learning is the process of machines sorting through large amounts of data, looking for patterns that humans can't spot on their own. The theory has been around for decades, but putting it into practice requires two major components: machines that can handle the amount of processing necessary, plus a lot (a lot!) of gathered, cleaned data.

Thanks to cloud computing, we finally have both. With cloud computing, we can speed through data processing. With cloud storage, we can collect huge amounts of data to actually sort through. Before all this, machines had to be explicitly programmed to accomplish a specific task. Now, however, computers can learn to find patterns, and perhaps act on them, without such programming. The more data, the more precise machine learning can be.

Current examples of machine learning

Unless you are a complete Luddite, machine learning has already worked its way into the folds of your life. Choosing a Netflix title based on prompted recommendations? Browsing similar titles for your Kindle based on the book you just finished? These recommendations are actually tailor-made for you. (In the recent past, they relied on an elementary version of "if you liked x, you may like y", culled from a list that was put together manually.)

Today, companies have developed proprietary algorithms that machine learning models train on, looking for patterns in your data combined with the data of millions of other customers. This is why your Netflix queue may be chock full of action flicks and superhero movies while your partner's leans heavily on crime dramas and period pieces.

But machine learning is doing more than just serving up entertainment. Credit companies and banks are getting more sophisticated with credit scores. Traditionally, credit companies relied on a long-established pattern of credit history, debt and loan amounts, and timely payments. That meant if you failed to pay off a loan over a decade ago, your credit score likely still reflects that story, even if you're all paid up now. This made it very difficult to change your credit score over time; in fact, time often felt like the only way to improve it.

Now, however, machine learning is changing how credit bureaus like Equifax determine your score. Rather than leaning on your distant payment history, data from the very near past, say the last few months, can actually better predict what you're likely to do in the future. With machine learning, history alone no longer decides; current trends in your data can predict your creditworthiness.
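To make the idea concrete, here is a toy sketch in Python (emphatically not any credit bureau's actual model) of training a pattern-finding classifier on hypothetical recent-behavior features; every number and feature name below is invented purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per applicant: on-time payments in the last 3 months,
# current card utilization (%), and months since the last missed payment.
X = np.array([
    [3, 25, 24],
    [2, 80, 1],
    [3, 10, 36],
    [0, 95, 0],
    [1, 60, 2],
    [3, 40, 12],
])
y = np.array([1, 0, 1, 0, 0, 1])  # 1 = repaid on time, 0 = defaulted (made-up labels)

model = LogisticRegression().fit(X, y)
# Estimated probability that a new applicant with recent, mostly on-time behavior repays.
print(model.predict_proba([[2, 30, 6]])[0, 1])

The point isn't the particular algorithm; it's that the model weighs recent behavior directly, rather than letting a decade-old default dominate the score.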

What the future holds for machine learning

Machine learning is just getting started. When we think about the future of machine learning, one example we hear about constantly is the elusive self-driving car, also known as the autonomous vehicle.

In this case, machine learning is able to understand how to respond to particular traffic situations based on reviewing millions of examples: videos of car crashes compared to accident-free traffic, how human-driven cars respond to traffic signs or signals, and watching how, where, and when pedestrians cross streets.

Machine learning is also beginning to change how we see images and videos: computers are using neural networks trained on thousands of images from the internet to fill in blanks in your own pictures.

Take, for instance, the photo you snapped on your holiday in London. You have a perfect shot of Big Ben, except for a pesky pedestrian sneaking by along a wall. You can remove the person from your image, but then you need to fill the space on the wall that the walker left behind. Adobe Photoshop and other image editors now offer content-aware fill tools that draw on other images of walls (that specific wall, perhaps, as well as walls that look similar) and blend the patch in so that it looks natural and organic.

 

This type of machine learning is advancing rapidly and it could soon be as easy as an app on our phones. Imagine how this can affect the veracity of a video – is the person actually doing what the video shows?

Problems with machine learning

We are at a pivotal point where we can see a lot of potential for machine learning, but we can also see a lot of potential problems. Solutions are harder to grasp as the technology forges forward.

The future of machine learning seems inevitable; the question is when. Predictions indicate that nearly every kind of AI will include machine learning, no matter its size or use. And as cloud computing grows and the world amasses ever more data, machines will be able to learn continuously from that constant stream of new information and content, instead of training only on fixed data sets.

This future comes with challenges. First, hardware vendors will have to make their computers and servers more powerful and faster to cope with these increased demands.

As for experts in AI, there looks to be a steep and sudden shortage of professionals who can keep up with what AI will be able to do. Outside the private and pricey walls of Amazon, Google, Apple, Uber, and Facebook, most small- and medium-sized businesses (SMBs) aren't stepping more than a toe or two into the world of machine learning. While this is due in part to a lack of money or resources, the lack of expert knowledge is actually the biggest reason that SMBs aren't deeper into ML. But as ML technologies normalize, they'll cost less and become a lot more accessible. If your company doesn't have experts who know how you could be using ML to help your business, you're missing out.

On a global level, machine learning provides some cause for concern. There’s the idea that we’ll all be replaced in our jobs by specific machines or robots – which may or may not come to fruition.

More immediate and troubling, however, is the idea that images can be faked. The trick described above is certainly impressive for an amateur photographer, but it raises an important question: how much longer can we truly believe everything that we see? Perhaps "seeing is believing" has a limited window as a standard truthbearer in our society.

 


Reaching the Cloud: Is Everything Serverless?

February 18th, 2019

As it goes in technology, as soon as we all adopt a new term, there will assuredly be another one ready to take its place. As we embrace cloud technology, migrating functions and software for organization, AI potential, timeliness, and flexibility, we are now encountering yet another buzzword: serverless.

Serverless and the cloud may sound similar, both floating off in some distant place, existing beyond your company's cool server room. But are the cloud and serverless the same? Not quite. This article explores how serverless technology relates to the cloud, as well as, and more importantly, whether you have to adopt a serverless culture.

What is serverless?

Serverless is shorthand for two terms: serverless architecture and serverless computing.

Once we get past the name, serverless is a way of building and deploying software and apps on cloud computers. For all your developers and engineers who are tired of coping with server and infrastructure issues because they’d rather be coding, serverless could well be the answer.

Serverless architecture is the foundation of serverless computing. Generally, three types of software services can function well on serverless architecture: function-as-a-service (FaaS), backend-as-a-service (BaaS), and databases.

Serverless code, then, relies on serverless architecture to develop stand-alone apps or microservices without provisioning servers, as is required in traditional (server-necessary) coding. Of course, serverless coding can also be used in tandem with traditional coding. An app or software that runs on serverless code is triggered by events and its overall execution is managed by the cloud provider. Pricing varies but is generally based on the number of executions (as opposed to a pre-purchased compute capacity that other cloud services you use may rely on).
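To make this concrete, here is a minimal sketch of what event-triggered serverless code can look like, using AWS Lambda's Python handler convention; the function name and event shape are illustrative rather than drawn from any particular application:

import json

# A minimal Lambda-style handler: the cloud provider invokes this function in
# response to an event (an HTTP request, a file upload, a queue message) and
# manages the underlying servers, scaling, and per-execution billing for you.
def lambda_handler(event, context):
    # The payload's shape depends on the event source; we assume a simple dict here.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

There is no server to provision or patch in that snippet; you upload the function, wire it to an event source, and the provider runs it on demand.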

As for the name itself: calling something "serverless" is a bit of a misnomer, because truly serverless computing isn't possible. Serverless software and apps still rely on a server; it's just not one that you maintain in-house. Instead, your cloud provider, such as Google, AWS, Azure, or IBM, acts as your server and your server manager, allocating your machine resources.

The cloud vs. serverless

While the cloud and serverless are certainly related, there's a simpler reason why we are hearing about serverless technologies ad nauseam: cloud leaders like AWS, Google, Azure, and IBM are investing heavily in serverless (and that's a ton of money, to be sure).

Just as these companies spearheaded a global effort to convince companies their apps and data can perform and store better in the cloud, they are now encouraging serverless coding and serverless architecture so that you continue to use their cloud services.

Serverless benefits

Is everything serverless? Will everything be serverless soon? In short, no and no.

The longer answer is that serverless architecture and serverless computing are good for simple applications. In serverless coding, your cloud provider takes care of the server-side infrastructure, freeing up your developers to focus on your business goals.

Your developers may already be working on serverless code – or they want to be. That’s because it frees them from the headache of maintaining infrastructure. They can dispense with annoying things like provisioning a server, ensuring its functionality, creating test environments, and maintaining server uptime, which means they are focused primarily on actual developing.

As long as the functionality is appropriate, serverless can provide the following benefits:

  • Efficient use of resources
  • Rapid testing and deployment, as multiple environments are a breeze to set up
  • Reduced cost (server maintenance, team support, etc.)
  • Focus on coding – may result in increased productivity around business goals
  • Familiar programming languages and environment
  • Increased scalability

Traditional code isn’t going anywhere (yet)

While focusing on your core business is always a good goal, the reality is that serverless isn’t a silver bullet for your coding or your infrastructure.

Depending on your business, it's likely that some products and apps require more complex functions. For these, serverless may be the wrong move. Traditional coding still offers many benefits, even though it needs fixed resources that require provisioning, state management, and human maintenance. Networking is easier because everything lives within your usual environment. And, let's face it: unless you're a brand-new startup, you probably already have the servers and tech staff to support traditional coding and architecture.

Computationally, serverless has strict limits. Most cloud providers price serverless options based on time: how many seconds or minutes does an execution take? Unfortunately, the more complex your execution, the more likely you are to exceed the maximum time allowed, which hovers around 300 seconds (five minutes). With a traditional environment, however, there is no timeout limit. Your servers are dedicated to your executions, no matter how long they take or how many external databases they have to reference. Serverless timeouts can make activities like long-running tests and external calls harder or impossible to accomplish.

From a business perspective, you have to decide what you value more: paying only for what you use (caveat emptor), with decreased opex costs, or keeping control, if you are skeptical of the trust and security risks that come with using a third party. Plus, not all developers work the same. While some devs want to use cutting-edge technology that allows them to focus on front-end logic, others prefer the control and holistic access that traditional architecture and coding provide.


When Technology Moves Faster Than Training, Bad Things Happen

February 18th, 2019

Technology is changing how we design training, and it should. Unfortunately, many instructional designers are not producing the learning programs and products that today’s technical talent needs. Not because they don’t want to, but because many companies don’t support their efforts to advance their work technologically or financially.

That’s a mistake. Technology has already changed learning design. Those who don’t acknowledge this appropriately are doing their organizations – and their technical talent – a disservice.

Bob Mosher, chief learning evangelist for Apply Synergies, a learning and performance solutions company, said we can now embed technology in training in ways we never could before. E-learning, for instance, has been around in some form or another, but it always sat in an LMS or otherwise outside of the technology or subject matter it was created to support. That's no longer the case.

“Now I don’t have to leave the CRM or ERP software, or cognitively leave my workflow,” Mosher explained. “I get pop ups, pushes, hints, lessons when I need them, while I’m staring at what I’m doing. These things guide me through steps; they take over my machine, they watch me perform and tell me when and where I go wrong. Technology has allowed us to make all of those things more adaptive.”

Of course, not all learning design affected by technology is adaptive, but before adaptive learning came on the scene, training was more pull than push, which can be problematic. If you don't know what you don't know, you may proceed blindly, thinking "oh, I'm doing great" when you're really not. Mosher said adaptive learning technologies, which monitor learner behavior and quiz and train based on an individual's answers and tactics, can be extremely powerful.

But – there's almost always a but – many instructional designers are struggling with this because they're more familiar with event-based training design. Designing training for the workflow is a very different animal.

The Classroom Is Now a Learning Lab

“It’s funny, for years we’ve been talking about personalized learning, but we’ve misunderstood it thinking we have to design the personalized experience for every learner,” Mosher said. “But how do I design something personalized for you? I can give you the building blocks, but in the end, no one can personalize better than the learners themselves. Designing training for the workflow is a very different animal.”

In other words, new and emerging technologies are brilliant because they enable learners to customize the learning experience and adapt it to the work they do every day. But it’s one thing to have these authoring technologies and environments; it’s something else for an instructional designer to make the necessary shift and use them well.

Further, learning leaders will have to use the classroom differently, leveraging the different tools at their disposal appropriately. “If I know I have this embedded technology in IT, that these pop ups are going to guide people through, say, filling out a CRM, why spend an hour of class teaching them those things? I can skip that,” Mosher said. “Then my class becomes more about trying those things out.”

That means learning strategies that promote peer learning, labs and experiential learning move to the forefront, with adaptive training technology as the perfect complement. Antiquated and frankly ineffective technical training methods filled with clicking, learning by repetition through menus, and procedural drilling should be retired post haste in favor of context-rich learning fare.

Then instructors can move beyond the sage-on-the-stage role, and act as knowledge resources and performance support partners, while developers and engineers write code and metaphorically get their hands dirty. “If I have tools that help me with the procedures when I’m not in class, in labs I can do scenarios, problem solving, use cases, have people bounce ideas and help me troubleshoot when I screw up,” Mosher said. “I’m not taking a lesson to memorize menus.”

Learning Leaders, Act Now

Learning leaders who want to adapt to technology changes in training design must first secure an appropriate budget. Basically, you can't use cool technology for training unless you actually buy said cool technology. Budget must be allocated and experimentation encouraged, and instructional designers have to have the time and latitude to upgrade their skills as well, because workflow learning is a new way of looking at design.

“Everyone wants agile instructional design, but they want to do it the old way,” Mosher said. “You’re not going to get apples from oranges. Leadership has to loosen the rope a little bit so instructional designers (IDs) can change from the old way of designing to the new way.

“IT’s been agile for how long now? Yet we still ask IDs to design in a waterfall, ADDIE methodology. That’s four versions behind. Leadership has to understand that to get to the next platform, there’s always a learning curve. There’s an investment that you don’t get a return on right away – that’s what an investment is.”

For learning leaders who want to get caught up quickly and efficiently, Mosher said it can be advantageous to use a vendor. They’re often on target with the latest instructional design approaches and have made the most up to date training technology investments. But leadership must communicate with instructional designers to avoid resistance.

“Good vendors aren’t trying to put anybody out of a job, or call your baby ugly,” he explained. “It’s more like, look. You’ve done great work and will continue to do great work, but you’re behind. You deserve to be caught up.”

The relationship should be a partnership where vendor and client work closely together. “Right,” Mosher said. “If you choose the right vendor.”


Analyzing 4 Million Yelp Reviews with Python on AWS

February 27th, 2017

Yelp runs a data challenge every year in which it invites people to explore its real-world datasets for unique insights. In this post, we'll show how to load the dataset into a Jupyter Notebook running on a powerful but cheap AWS spot instance, and produce some initial explorations and visualizations.

This post is aimed at people who:

  • Have some existing Python knowledge
  • Are interested in learning more about how to process and visualise large-scale data with Python

If you are interested in taking part in the Yelp challenge, this tutorial will leave you in a good place to start more interesting analyses.

Overview

In this post, we’ll be looking at the Yelp data from the Yelp Dataset Challenge. This is an annual competition that Yelp runs where it asks participants to come up with new insights from its real-world data. We will:

  • Launch an AWS EC2 Spot instance with enough power to process the dataset (4 million reviews) quickly.
  • Configure the EC2 instance and install Jupyter Notebook as well as some data processing libraries.
  • Display some basic analysis of the data, along with visualisations using Matplotlib.

If you have a high end desktop or laptop (with at least 32GB RAM), you can probably run most of the analyses locally. However, learning how to process data in the cloud is a useful skill, so I’d recommend following along with the entire tutorial. Even if the Yelp data is small enough for your local machine, you may well want to process larger datasets in the future. And considering that AWS offer instances with up to 2TB of RAM, the method described here will work for even larger datasets.

Creating an AWS EC2 Spot Instance

Amazon Web Services (AWS) offer Elastic Compute Cloud (EC2) instances. These are on-demand servers that you can rent by the hour. They tend to be fairly expensive, especially for the beefier machines, but luckily AWS also offer so-called 'spot' instances. These are instances that they currently have in excess supply, and they auction them off temporarily to the highest bidder, normally at much lower prices than their regular instances. This is very useful for short-term needs (such as data analysis) because the chance of someone else outbidding you while you still need the machine is comparatively low. To fire up a spot EC2 instance, follow these steps:


  • Visit aws.amazon.com and sign up for an account (assuming that you don’t have an account with them already). It’s a somewhat complicated signup process, and it requires a credit card, even for their free trial, so this step might take some time. You can instead use Microsoft Azure, Google Cloud Compute, Rackspace, Linode, Digital Ocean, or any of a number of cloud providers for this step if you want, but all require a credit card for sign up and they don’t all offer the same variety of instances or the same discount pricing structure as AWS.
  • Visit the AWS Console. Pick a region using the dropdown in the top right-hand corner. For latency reasons, it’s nice to pick a region close to you, but some regions have more instances available and have cheaper spot instances. For example, even though I’m in The Netherlands, I chose the Oregon region (us-west-2) while making this post, as there were low-priced spot instances available there. (If you really need to save every cent you can, this Mozilla Python script can help you find the cheapest instance currently available worldwide.)
  • Click Services in the top left-hand corner, and choose EC2 from the list. You’ll now be taken to the main EC2 page.
  • In the left-hand column, select Spot Requests and then click Request Spot Instances

There are many options we can modify when creating a launch request for an instance. Luckily, we can leave almost all the defaults as they are. The ones we will change are:

  • The AMI — this is the “Amazon Machine Image.” It defaults to Amazon Linux, but we’ll be using Ubuntu-Server instead. Choose Ubuntu Server 16.04 LTS (HVM) from the AMI drop down.
  • The instance type–we’ll want an instance with lots of RAM (at least 30GB but preferably 60+ GB) and at least some SSD space. Click the x next to the default selected instance to remove it and then click the Select button next to that. You should see a popup similar to the one shown below. You can use the column headings to sort by a specific column. To choose an instance, I sorted by price and then found the first instance with 30GB RAM and some SSD space, which was an m3.2xlarge. The m instances aim to balance CPU, RAM and hard disk. The r instances focus on RAM and are also good for data analysis if you find a cheap one.

[Screenshot: the instance type selection popup]

At the bottom of the screen, click Next to get to the second (and last) page of settings for your instance.

  • Under Set Keypair and Role, click Create a new key pair. This will open a new tab and take you to the EC2 key management page. Choose to create a new key pair again, give it a name, and download the private key when prompted. Save it in your home directory as ec2key.pem.
  • Under Manage firewall rules select default. This will create an inbound firewall rule that allows the instance to accept SSH connections (we’ll be connecting to the instance via SSH).

Now click Review at the bottom of the page, check that everything looks as expected, and click Launch. This creates a spot instance “request,” and you might have to wait a bit for it to be “fulfilled” (meaning an instance became available that matched your request). You can see the state of the request under the Spot Requests tab (the one you used to create the request). When the request has been labeled fulfilled (and given a green icon), you’ll see the instance under the Instances section (you sometimes need to reload the Instances section to see the new instance).
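All of the console steps above can also be scripted. As a hedged sketch using boto3 (the AWS SDK for Python), with the region, AMI ID, key pair name, and instance type below standing in as placeholders for whatever you chose in the console:

import boto3

# Hedged sketch: request a single one-time spot instance programmatically.
# Every value here is a placeholder; substitute your own region, AMI ID,
# key pair name, and instance type.
ec2 = boto3.client("ec2", region_name="us-west-2")
response = ec2.request_spot_instances(
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder Ubuntu Server 16.04 AMI
        "InstanceType": "m3.2xlarge",
        "KeyName": "ec2key",
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])

Either way, the result is the same kind of spot request you would then see under the Spot Requests tab.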

Note: Be aware that spot prices can change unpredictably. By default, your maximum price is capped at the on-demand price (the price you usually pay), so the bigger risk isn't a billing surprise: if the spot price rises above that cap, you can lose your instance (and your work) suddenly.


Scroll to the right in the Instances window to find the Public DNS of your instance, and copy this to your clipboard.


Connecting to Our EC2 Spot Instance

Now open up a terminal or command prompt on your local machine. If you're using Windows, you won't be able to use SSH by default. Most people use PuTTY to SSH from Windows, but if you have a modern version of Windows, it's easier to enable WSL (the Windows Subsystem for Linux). Once you've set that up, you can use SSH exactly as described in this post. As an alternative, you could install Git for Windows. In the last step of the installation process, you'll be asked if you want to use Git and optional Unix tools from the Windows Command Prompt. Select yes, and you'll be able to use SSH from the Windows CMD prompt.

Before we can connect to the instance, we need to change the permissions for the .pem key file that you downloaded earlier. Assuming that your key was saved in your home directory as ec2key.pem, run the following command:

chmod 600 ~/ec2key.pem

Now you can use it to connect to the instance. Make sure you still have the Public DNS name for your instance in your clipboard, and run the following command:

ssh -i ~/ec2key.pem -L 8888:localhost:8888 \
ubuntu@ec2-52-33-47-198.us-west-2.compute.amazonaws.com

This connects to your instance, allowing you to run commands on it via SSH. The -i flag points to your key file, which proves that you’re the owner of the instance and the -L flag sets up port forwarding. Here we specify that port 8888 on our local machine should be forwarded to port 8888 on the remote instance. We’ll need this in a bit so that we can view a Jupyter notebook locally and have it execute on our instance.

Configuring Our EC2 Instance

To set up our instance, we only need to configure pip and install some Python libraries for data processing. Run the following commands on the instance:

export LC_ALL="en_US.UTF-8"
sudo apt update
sudo apt install python3-pip
pip3 install pip matplotlib jupyter --user

The first command sets the LC_ALL environment variable, which specifies the locale. By default, Ubuntu Server often does not set this, and pip needs locale information to function correctly. The next commands install pip using Ubuntu's apt package manager, then use pip to reinstall itself, since the apt versions of software sometimes lag behind the current releases. We also install Jupyter, which provides the notebook we'll use, and matplotlib for plotting.

If you chose an instance with an SSD, you’ll have to mount that. Run:

lsblk

to see the available disks. You’ll probably see the SSD listed as /dev/xvdb (though it might be called something else). Run the following commands to mount the SSD, substituting the xvdb if necessary:

sudo mkdir /mnt/ssd
sudo mount /dev/xvdb /mnt/ssd
sudo chown -R ubuntu /mnt/ssd

If you picked a machine with about 30GB of RAM, you can still run into some issues while loading and manipulating some of the Yelp data. I created another 20GB of swap space (virtual RAM on the hard drive) just in case (this step takes a while to run):

sudo dd if=/dev/zero of=/mnt/ssd/swapfile bs=1G count=20
sudo mkswap /mnt/ssd/swapfile
sudo swapon /mnt/ssd/swapfile

Getting the Yelp Data onto Our Machine

Currently, Yelp requires that you fill out an online form to get a link to access the data. This link is then tied to the machine where you filled out the form. There may be a workaround, but I had to download the data locally and then transfer it across to AWS, which took quite a while with my slow uplink connection. Fill out the form and obtain the download link here: https://www.yelp.com/dataset_challenge/dataset

Once you've downloaded the approximately 1.8GB tar file, you can scp it to your instance with the following command (assuming that you saved the tar file to your Downloads folder; if not, replace ~/Downloads/ with the path to the Yelp file). You'll also need to substitute your instance's Public DNS for the one shown. Note that this command needs to be run on your local machine, not from the EC2 instance.

scp -i ~/ec2key.pem ~/Downloads/yelp_dataset_challenge_round9.tar \
ubuntu@ec2-52-33-47-198.us-west-2.compute.amazonaws.com:/mnt/ssd

Now, on the instance, you can untar the data with the following commands:

cd /mnt/ssd
tar -xvf yelp_dataset_challenge_round9.tar

This should create a bunch of large .json files. We’ll be opening these directly in Python, so our command line work is nearly done.

Starting and Accessing the Jupyter Notebook

Now start the Jupyter notebook server on the instance by running:

jupyter-notebook

You should see output saying that no web browser was detected, and giving you a URL with a token, similar to the following:

[Screenshot: Jupyter startup output showing the notebook URL with its token]

Copy the URL to your clipboard and paste it into a browser on your local machine. You’ll see the default Jupyter Notebook page. Create a new Python 3 notebook by selecting New in the top right-hand corner and then choosing Python 3.


If you’ve never used Jupyter Notebooks before, take a few moments to get acquainted with how things work. You can insert cells, delete cells, or run specific cells. Cells are useful because you can run specific blocks of code after making a change without having to rerun all the code above. Each cell shares a namespace with any previously run cell, so you can always access your variables and imports from new cells. The most useful keyboard shortcut is Ctrl + Enter, which runs the code in the currently selected block and displays the output.

The default working directory is the directory from which you launched Jupyter. If you followed the commands as laid out above, this would have been /mnt/ssd/ on your instance, so the JSON Yelp data files should be in the current working directory. To check, you can type !ls into the first cell and run it. This will output all the filenames in the current directory.

Starting our Data Analysis

Now it’s finally time to load the data into Python and play around with it. The Yelp data is in a bit of a strange format–although they provide JSON files, the files contain a separate JSON object on each line, instead of one single JSON object.
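As an aside, because each line is its own JSON object, you don't have to read the whole file at once. Below is a small, hedged sketch of a lazier, more memory-friendly way to iterate over the reviews (it assumes the same filename used later in this post); the rest of the post sticks with the simpler read-everything approach, since our instance has plenty of RAM:

import json

# Memory-friendlier alternative: parse one JSON object per line as we go,
# instead of holding the whole file in memory as a single string first.
def iter_reviews(path):
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

count = 0
for review in iter_reviews("yelp_academic_dataset_review.json"):
    count += 1
print(count)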

In the first cell, we'll import the libraries that we'll be using. We'll also set matplotlib to work in notebook mode, which makes our plots interactive (mousing over them will show the X-Y coordinates). Put the following code into the first cell of the notebook, and run it.

%matplotlib notebook
from matplotlib import pyplot as plt
import json
from collections import Counter
from datetime import datetime

Now we'll read in the entire review file and split it into a list of individual (string) reviews. This takes about half a minute, even on a nice machine.

t1 = datetime.now()
with open("yelp_academic_dataset_review.json", encoding="utf8") as f:
    reviews = f.read().strip().split("\n")
print(datetime.now() - t1)
print(len(reviews))

Note that you can use tab completion for the filename which is super useful to prevent typos and speed up your coding in general (e.g., type yelp_a and press tab instead of typing out the whole name).

You should see a printout showing that the code took about 20-30 seconds to run, and that there are a little over 4 million reviews in the dataset. (All lines of script output in this post will be prefixed with >>>, but you won’t see the prefix in the actual notebook).

>>> 0:00:21.302640
>>> 4153150

In the next cell, let's parse all the review strings into Python dictionaries. This takes a bit longer than reading them in from the file (about 45 seconds on the machine I was using).

reviews = [json.loads(review) for review in reviews]

And it’s always nice to have one review printed in full so that we have an easy reference on how to access pieces of each review. Add the following code to a new cell and run it.

print(reviews[0])

We can get a basic distribution of the star ratings that users usually leave by using a Python Counter:

stars = Counter([review['stars'] for review in reviews])
print(stars)

>>> Counter({5: 1704200, 4: 1032654, 1: 540377, 3: 517369, 2: 358550})

We can see that there are more 5-star reviews than any other rating, but a visualisation would make the distribution much clearer. Let's create a basic bar graph of these numbers. Note that we normalize by the total number of reviews, so the Y-axis shows the proportion of total reviews in each category. This post was partly inspired by one that used an Amazon review dataset, which you can find here: http://minimaxir.com/2017/01/amazon-spark/. It's interesting that those reviews followed a similar star distribution.

Xs = sorted(list(stars.keys()))
Ys = [stars[key]/len(reviews) for key in Xs]
plt.bar(Xs, Ys)
plt.show()

This produces the following graphic:

[Bar chart: proportion of reviews for each star rating]

In notebook mode, Jupyter will keep track of whether or not a specific plot is “active.” This is useful as it allows you to plot different points onto the same image. However, it can also get in the way if you’re trying to create a new plot and the output keeps going to the previous one. After creating each plot, you’ll see it has a header that looks like this:

[Screenshot: the interactive figure header shown above each plot in notebook mode]

Press the blue button in the top right after creating each plot to deactivate it. New calls to plt.plot, etc will then be sent to new graphs, instead of being added to the previous one.
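If you'd rather handle this in code than with the notebook widget, calling plt.figure() before your next plotting commands has the same effect of starting a fresh figure:

# Start a new figure explicitly so the next plt.bar/plt.plot calls draw on a
# fresh plot instead of being added to the previous interactive one.
plt.figure()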

Finding the Most Prolific Reviewers

Let’s find the users who have left the most reviews. We’ll create a Counter object by User_ID (note that Yelp has encrypted the User IDs in this dataset, so they all look a bit strange). Add the following code to a new cell and run it

users = Counter([review['user_id'] for review in reviews])
print(users.most_common(10))

>>> [('CxDOIDnH8gp9KXzpBHJYXw', 3327), ('bLbSNkLggFnqwNNzzq-Ijw', 1795), ('PKEzKWv_FktMm2mGPjwd0Q', 1509), ('QJI9OSEn6ujRCtrX06vs1w', 1316), ('DK57YibC5ShBmqQl97CKog', 1266), ('d_TBs6J3twMy9GChqUEXkg', 1091), ('UYcmGbelzRa0Q6JqzLoguw', 1074), ('ELcQDlf69kb-ihJfxZyL0A', 1055), ('U4INQZOPSUaj8hMjLlZ3KA', 1028), ('hWDybu_KvYLSdEFzGrniTw', 988)]

We can see that one user has left an impressive 3327 Yelp reviews. Let’s name this user Mx Prolific and create a collection of only their reviews.

mx_prolific = [review for review in reviews if review['user_id'] ==
"CxDOIDnH8gp9KXzpBHJYXw"]
mp_stars = Counter([review['stars'] for review in mx_prolific])
print(mp_stars)

>>> Counter({3: 1801, 4: 1036, 2: 390, 1: 53, 5: 47})

Note that Mx Prolific's ratings diverge strongly from the overall distribution we saw before. While 5-star reviews are the most common overall, Mx Prolific has awarded only 47 of them (perhaps these are establishments worth checking out!) and nearly 2,000 3-star reviews.

Now we can create a second-level Counter to count the frequencies of the number of reviews left by each individual user. We summarized our original data by combining all the reviews left by the same user into a single record. Now we want to summarize further and combine, for example, all the users who have left exactly 12 reviews into a single record.

num_reviews_left = Counter([x[1] for x in users.most_common()])

This allows us to visualise how many reviews are left by most users. Because nearly all users have left only very few reviews, we’ll visualise the drop off only up to 20. (Change line 3 below to plt.bar(Xs, Ys) to plot all the records and see how plotting more data can sometimes produce a less informative result).

Xs = [x[0] for x in num_reviews_left.most_common()]
Ys = [x[1] for x in num_reviews_left.most_common()]
plt.bar(Xs[:20], Ys[:20])
plt.xticks(range(1,21))
plt.xlabel("Number of Reviews")
plt.ylabel("Number of Users")
plt.show()

This produces the following output. We can see that a huge number of users leave just one review, and that the dropoff over 2, 3, and 4 reviews is pretty steep.

[Bar chart: number of users by number of reviews left (1 through 20)]

We can do the same to see how many reviews are typically received by a single business.

businesses = Counter([review['business_id'] for review in reviews])
num_reviews_by_business = Counter([x[1] for x in businesses.most_common()])
print(num_reviews_by_business.most_common(10))
print(len(num_reviews_by_business))
Xs = [x[0] for x in num_reviews_by_business.most_common()]
Ys = [x[1] for x in num_reviews_by_business.most_common()]
Xs = Xs[:18]
Ys = Ys[:18]
plt.bar(Xs, Ys)
plt.xticks(range(3,21))
plt.xlabel("Number of Reviews")
plt.ylabel("Number of Businesses")
plt.show()

This produces the following output and image. The most common case by far is a business with exactly three reviews: 21,908 businesses fall into that bucket. The dropoff in the number of reviews a business receives is slower than the dropoff for reviews left by users, but a low number of reviews is still much more common than a high one. Note that Yelp has only included businesses with at least three reviews in this dataset.

>>> [(3, 21908), (4, 15473), (5, 11498), (6, 9012), (7, 7475), (8, 6061), (9, 5208), (10, 4308), (11, 3857), (12, 3508)]
>>> 947

[Bar chart: number of businesses by number of reviews received]

Lastly, we’ll determine whether good or bad reviews tend to have more words. The next post will focus on natural language processing, and we’ll be covering far more sophisticated text processing techniques, but for now we’ll simply use Python’s split() function to split each review into words, and then look at averages by number of stars.

import statistics
review_lengths_by_star = [[],[],[],[],[]]
for review in reviews:
    length = len(review['text'].split())
    idx = review['stars'] - 1
    review_lengths_by_star[idx].append(length)
print([statistics.mean(review_lengths) for review_lengths in review_lengths_by_star])

>>> [146.6973890450556, 146.11345697950077, 135.06209687863014, 118.64652826600197, 93.93472010327426]

We can see that negative reviews tend to be a bit longer, with 1-star reviews having an average of 147 words, while 5-star reviews have a lower average of 94 words. We’ll make a final plot to visualise this:

plt.bar([1, 2, 3, 4, 5], [statistics.mean(rlength) for rlength in
review_lengths_by_star])
plt.xlabel("Stars")
plt.ylabel("Average words per review")
plt.show()

[Bar chart: average review length in words for each star rating]

Let’s recap what we did. We set up a powerful data processing environment, and took a cursory look at some of the Yelp data–but we’ve only just scratched the surface in terms of insights we can draw from these data. In a later post, we’ll be using the same dataset to introduce some machine learning and natural language processing concepts.



Six Trends Shaping Cloud Computing in 2017

January 18th, 2017

During the first quarter, we’re going to take a deeper dive into cloud computing training. We’ll be featuring the latest cloud computing news, learning programs for your teams, and tips and tricks in the cloud computing space.

To kick things off, we created the infographic below, modeled off a great article recently featured on CIO.

[Infographic: Six Trends Shaping Cloud Computing in 2017]