ETL Management with Luigi Data Pipelines

October 15th, 2017

As a data engineer, you’re often dealing with large amounts of data coming from various sources and have to make sense of them. Extracting, Transforming, and Loading (ETL) data to get it where it needs to go is part of your job, and it can be a tough one when there are so many moving parts.

Fortunately, Luigi is here to help. An open source project developed by Spotify, Luigi helps you build complex pipelines of batch jobs. It has been used to automate Spotify’s data processing (billions of log messages and terabytes of data), as well as in other well-known companies such as Foursquare, Stripe, Asana, and Red Hat.

Luigi handles dependency resolution, workflow management, and helps you recover from failures gracefully. It can be integrated into the services you already use, such as running a Spark job, dumping a table from a database, or running a Python snippet.

In this example, you’ll learn how to ingest log files, run some transformations and calculations, then store the results.

Let’s get started.


For this tutorial, we’re going to be using Python 3.6.3, Luigi, and this fake log generator. Although you could use your own production log files, not everyone has access to those. So we have a tool that can produce dummy data for us while we’re testing. Then you can try it out on the real thing.

To start, make sure you have Python 3, pip, and virtualenv installed. Then we’ll set up our environment and install the luigi package:

$ mkdir luigidemo
$ cd luigidemo
$ virtualenv luigi
$ source luigi/bin/activate
$ pip install luigi


This sets up an isolated Python environment and installs the necessary dependencies. Next, we’ll need to obtain some test data to use in our data pipeline.

Generate test data

Since this example deals with log files, we can use this Fake Apache Log Generator to generate dummy data easily. If you have your own files you would like to use, that is fine, but we will be using mocked data for the purposes of this tutorial.

To use the log generator, clone the repository into your project directory. Then navigate into the Fake-Apache-Log-Generator folder and use pip to install the dependencies:

$ pip install -r requirements.txt

After that completes, it’s pretty simple to generate a log file. You can find detailed options in the documentation, but let’s generate a 1000-line .log file:

$ python apache-fake-log-gen.py -n 1000 -o LOG

This creates a file called access_log_{timestamp}.log.

Now we can set up a workflow in Luigi to analyze this file.

Luigi Tasks

You can think of building a Luigi workflow as similar to building a Makefile. There are a series of Tasks and dependencies that chain together to create your workflow.

Let’s say we want to figure out how many unique IPs have visited our website. There are a couple of steps we need to go through to do this:

  • Read the log file
  • Pull out the IP address from each line
  • Decide how many unique IPs there are (de-duplicate them)
  • Persist that count somewhere
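Those four steps can be sketched in plain Python before we wrap them in Luigi Tasks (the log lines below are made up for illustration):

```python
# The four steps above in plain Python, using made-up log lines.
log_lines = [
    '10.0.0.1 - - [15/Oct/2017] "GET / HTTP/1.0" 200',
    '10.0.0.2 - - [15/Oct/2017] "GET /about HTTP/1.0" 200',
    '10.0.0.1 - - [15/Oct/2017] "GET /faq HTTP/1.0" 404',
]

# the IP is the first whitespace-separated field on each line;
# a set de-duplicates as we collect them
unique_ips = {line.split()[0] for line in log_lines}

# persist the count (here, just print it)
num_ips = len(unique_ips)
print(num_ips)  # 2
```

Luigi’s value is in wrapping each of these steps as a Task so they can be scheduled, retried, and chained together.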

Each of those can be packaged as a Luigi Task. Each Task has three component parts:

  • run() — This contains the logic of your Task. Whether you’re submitting a Spark job, running a Python script, or querying a database, this function contains the actual execution logic. We break up our process, as above, into small chunks so we can run them in a modular fashion.
  • output() — This defines where the results will go. You can write to a file, update HDFS, or add new records in a database. Luigi provides multiple output targets for you to use, or you can create your own.
  • requires() — This is where the magic happens. You can define dependencies of your task here. For instance, you may need to make sure a certain dataset is updated, or wait for another task to finish first. In our example, we need to de-duplicate the IPs before we can get an accurate count.

Our first Task just needs to read the file. Since it is the first task in our pipeline, we don’t need a requires() function or a run() function. All this first task does is send the file along for processing. For the purposes of this tutorial, we’ll write the results to a local file. You can find more information on how to connect to databases, HDFS, S3, and more in the documentation.

To build a Task in Luigi, it’s just a simple Python class:

import luigi

# this class extends the Luigi base class ExternalTask
# because we're simply pulling in an external file,
# it only needs an output() function
class readLogFile(luigi.ExternalTask):

    def output(self):
        return luigi.LocalTarget('/path/to/file.log')


So that’s a fairly boring task. All it does is grab the file and send it along to the next Task for processing. Let’s write that next Task now, which will pull out the IPs and put them in a list, throwing out any duplicates:

class grabIPs(luigi.Task): # standard Luigi Task class

    def requires(self):
        # we need to read the log file before we can process it
        return readLogFile()

    def run(self):
        ips = []

        # use the file passed from the previous task
        with self.input().open() as f:
            for line in f:
                # a quick glance at the file tells us the first
                # element is the IP. We split on whitespace and take
                # the first element
                ip = line.split()[0]
                # if we haven't seen this ip yet, add it
                if ip not in ips:
                    ips.append(ip)

        # count how many unique IPs there are
        num_ips = len(ips)

        # write the results (this is where you could use hdfs/db/etc)
        with self.output().open('w') as f:
            f.write(str(num_ips))

    def output(self):
        # the results are written to numips.txt
        return luigi.LocalTarget('numips.txt')


Even though this Task is a little more complicated, you can still see that it’s built on three component parts: requires, run, and output. It pulls in the data, splits it up, then adds the IPs to a list. Then it counts the number of elements in that list and writes that to a file.

If you try to run the program now, nothing will happen. That’s because we haven’t told Luigi to actually start running these tasks yet. We do that by calling luigi.run (you can also run Luigi from the command line):

if __name__ == '__main__':
    luigi.run(["--local-scheduler"], main_task_cls=grabIPs)


The run function takes two arguments: a list of options to pass to Luigi and the task you want to start on. While you’re testing, it helps to pass the --local-scheduler option; this allows the processes to run locally. When you’re ready to move things into production, you’ll use the Central Scheduler. It provides a web interface for managing your workflows and dependency graphs.

If you run your Python file at this point, the Luigi process will kick off and you’ll see the progress of each task as it moves through the pipeline. At the end, you’ll see which tasks succeeded, which failed, and the status of the run overall (Luigi gives it a :) or :( face).

Check out numips.txt to see the result of this run. In my example, it returned the number 1000. That means that 1000 unique IP addresses visited the site in this dummy log file. Try it on one of your logs and see what you get!

Extending your ETL

While this is a fairly simple example, there are a lot of ways that you could easily extend this. You could:

  • Split out dates and times as well, and filter the logs to see how many visitors there were on a certain day
  • Geolocate the IPs to find out where your visitors are coming from
  • See how many errors (404s) your visitors are encountering
  • Put the data in a more robust data store
  • Build scheduling so that you can run the workflow periodically on the updated logs

Try some of these options out and let us know if you have any questions! Although this example is limited in scope, Luigi is robust enough to handle Spotify-scale datasets, so let your imagination go wild.


Spotlight on Sphinx: Python Docs For Everyone

October 13th, 2017

If you’ve ever looked at the Python documentation, you’ve seen Sphinx in action.

Sphinx is an open-source project that allows people to automatically generate static websites for Python documentation. Besides code-heavy documentation, it can also be used as a static site generator.

What is Sphinx?

Sphinx is a Python project that takes in reStructuredText and outputs static web pages. It is most commonly used to host documentation. With Sphinx installed, you can write comments in your code similar to how you would with JavaDoc, and it will pull in all those comments to provide a big picture of your functions and classes. This can be extremely helpful as a programming reference, and since it pulls directly from the code, you don’t have to worry about it getting out of sync.
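For example, a function documented with reStructuredText field lists is exactly what Sphinx’s autodoc extension pulls into your generated pages (this function is a hypothetical illustration, not from any particular project):

```python
def fetch_user(user_id):
    """Return the user record for the given ID.

    :param user_id: unique integer ID of the user
    :returns: a dict describing the user
    """
    return {"id": user_id}
```

Because the docstring lives next to the code, regenerating the site picks up any changes automatically.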

Who’s using it?

Sphinx was originally created to host the official Python documentation, but it’s only grown from there. Many popular libraries use Sphinx to host their documentation, and it’s become something of an industry standard among Python developers.

In addition, the popular documentation site Read The Docs makes this process even easier by allowing developers to host and update their Sphinx docs by connecting the repository and building the docs just like you would code. This “docs-as-code” approach helps ensure maximum compatibility between the code and the documentation, and helps mitigate documentation debt.

Many notable companies and libraries use Sphinx to host their websites or documentation.

It doesn’t stop at just documentation. Some people have written their personal sites, courses, or even whole books using Sphinx.

How can Sphinx help me?

Now that you know about all the great things Sphinx can do, I bet you’re wondering how you can use it in your work. Sphinx is adaptable enough to work for many use cases, but it really shines at documenting code, Python code in particular.

If you’re writing Python software as part of your job and having trouble maintaining the docs (or God forbid, you don’t have any docs!), Sphinx is definitely worth a try. It’s free, open source, and there are a variety of resources and tutorials out there to help you customize it to your needs.

Sphinx is great when you have structured information. In this way, Sphinx might not be such a great choice if you’re trying to host your latest novel, but it is a good idea for technical manuals with a complex table of contents that people will need to navigate. Another great feature of Sphinx is that it comes with search built-in, so you don’t have to worry about pulling in another package to do the heavy lifting for you.
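That complex, navigable table of contents is driven by the toctree directive in your root document; the page names below are hypothetical:

```rst
.. toctree::
   :maxdepth: 2

   installation
   usage
   api
```

Each entry points to another reStructuredText file, and Sphinx builds the navigation (and the search index) from this tree.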

Set up your first Sphinx site

Ready to get started? Let’s go through the basics of installing and setting up your first Sphinx site.

You can install Sphinx from PyPI (Python Package Index) by running this command:

$ pip install Sphinx

Once Sphinx is installed, run sphinx-quickstart from your project folder to initialize a project there:

$ sphinx-quickstart
    1. Sphinx-quickstart will go through and ask you a bunch of questions. This sets up some initial configuration values, but you can always go back and change them later. For most projects the defaults will suffice. Be sure to enter your project’s name and the project author when prompted.
    2. If you want to generate docs from your Python code, be sure to enable the autodoc extension (it’s disabled by default).
    3. When sphinx-quickstart is finished running, you should have several new files and folders used to configure and manage your site. If you need to change any of the configuration values in the future, you can do that in conf.py.
    4. The Makefile is what allows us to build the documentation and package it into HTML for the web. To build the example skeleton project, run:
$ make html

The output files should be in _build/html. Navigate there now:

$ cd _build/html

The home page for our site is index.html. Open that file in a web browser to see the example project:

$ open index.html

You should see the basic layout of your new Sphinx site.


Congratulations! You have Sphinx up and running.

For next steps on how to add posts and customize Sphinx, I recommend Brandon’s Sphinx Tutorial (PDF). It’s both informative and easy to follow.

Now that you know about Sphinx, go out there and Write The Docs!


Who’s That Star? Recognize Celebrities With Computer Vision

September 21st, 2017

We’re constantly being bombarded with images of celebrities and athletes on TV and in social media. If you’re anything like me, you probably have a hard time keeping all the names straight.

Never fear, machine learning is here to save the day once again! Clarifai provides image recognition APIs that allow you to categorize, moderate, and predict the subjects in photos and even videos with computer vision and artificial intelligence. This tutorial will walk you through using the Clarifai API to tag images of celebrities so you can keep up with all the latest office gossip — no memorizing names required.


For this tutorial, we’re going to be using the Clarifai API. If you haven’t heard of Clarifai before, go over and check it out. They provide several machine learning models for you to make predictions against, including celebrities, demographics, and food. To use the API, you’ll need to sign up for a free account.

Once you’ve signed up, you’ll need to create an application and generate an API key. You can do so on the sidebar menu once you’ve logged in.

First, create an application:

I selected ‘whosthatstar’ as my app name, but you can use whatever is easy to remember. I left the Base Workflow as ‘General’ so we’ll have access to a wide variety of machine learning models.

Select ‘Create App’ and it initializes the application and generates an API key for you.

Take note of your key in the ‘API Keys’ section; we’ll need it shortly.

Install Clarifai

For the purposes of this tutorial, we’re going to use the Python package for the Clarifai API. They have bindings for JS, Python, Java, Objective-C, and cURL, so you should be able to adapt it to your preferred language.

Ensure you have Python 3.x installed, then proceed to install the package using virtualenv and pip:

$ mkdir clarifai
$ cd clarifai
$ virtualenv whosthatstar
$ source whosthatstar/bin/activate
$ pip install clarifai

These commands create a folder to store your work, set up an isolated environment, and install the Clarifai library.

Now we’re ready to get started coding.

Setup and Auth

Since we’re using Python for this tutorial, the first thing we need to do is create a Python file for our code. I created whosthatstar.py to go along with our application’s name.

We’ll need to import the Clarifai libraries and let it know about our API key:

from clarifai import rest
from clarifai.rest import ClarifaiApp
from clarifai.rest import Image as ClImage

# create and authenticate your Clarifai app
app = ClarifaiApp(api_key='YOUR_API_KEY')

Remember, you can access your API key from the developer dashboard on Clarifai’s website.

Apply the Model and Predictions

Next, we need to tell Clarifai what model to use. You can see a full list of their Models on their site, and you can even build and train your own. For the purposes of this tutorial, I am going to use the Celebrity model. Hopefully the variety of different models on Clarifai’s website gives you some ideas about models that you could build with your own data.

model = app.models.get('celeb-v1.3')

Now we need some images to test with. You can find your own images on social media, IMDB, or Google, just make sure they have some relatively well-known celebrities included. Clarifai’s Celebrity Model includes over 10,000 people, so you should be able to find matches for most stars you throw at it.

For my example, I’ll use this picture of cast members from Game of Thrones, but you can use any photo you like. You can use video as well, but that is beyond the scope of this computer vision tutorial.

image = ClImage(url='https://i.imgur.com/lgh9E9i.jpg')
# tell Clarifai to predict who's in the image
result = model.predict([image])

# for easier reading
import pprint

Make sure you point to a publicly accessible URL for your image. I uploaded the sample image to Imgur but you can host it or link it from wherever is convenient.

The model.predict line is what actually runs the code on the backend and returns results to us. We can manipulate these results for whatever the next step in the application is, but right here we’re just going to print out the full response so you can see what the API returns.

result is a JSON object; to make it easier to read at the command line, we import the pprint (pretty print) library and use it to print the result JSON blob.

You can read more about the full output of the API methods in the documentation, but here’s a snippet of what’s returned when we run it against that Game of Thrones image:

I’ve truncated a lot of the lower-probability matches, but for each person in the photo you can see the top match (which turns out to be correct for all four actors) and the confidence the system has in its guess. As you can see, it’s pretty sure about these particular stars!

Also in the Clarifai Celebrity model are bounding boxes so you can mark off where certain faces occur. An interesting project would be to replace all detected faces in an image with a funny sticker or overlay. With the Clarifai API, this is definitely doable.

Here’s the reference page for the Celebrity model again if you want to learn more about what you can do with the API. Once you’re ready to level up your skills, you can use the Explorer on Clarifai’s site and interactively test and create new models.

Have fun, play around with the JSON responses, and let us know what you create!


Plotting Climate Data with Matplotlib and Python

August 17th, 2017

Is it getting hot in here, or is it just me? You’ve no doubt seen the barrage of coverage discussing climate change over the last century. But how do you separate the hype from the facts?

Let’s go straight to the source.

Today we’re going to use a dataset sourced directly from NOAA (National Oceanic and Atmospheric Administration) and plot that data in Python using Matplotlib.

NOAA has a wide variety of datasets tracking all kinds of things, some of them reaching back hundreds of years. For this tutorial, we’re going to use a dataset tracking global land and ocean temperature anomalies each June. The dataset reaches all the way back to 1880, so that gives us a lot to work with.

Let’s see what the data has to say.

Access the Dataset

The first thing you need to do is access the proper dataset from NOAA. They have a whole data gallery you can browse, but for this example we’ll be using the Climate at a Glance dataset. It comes in CSV format and shows a date, the mean temperature, and the variation of that mean from the average temperature from 1901 to 2000. That way we can see how much higher or lower the mean temperature is than the “average” temperature across the last century.

Set up Dev Environment

Once you’ve downloaded the dataset, we need to get our development environment set up.

You’ll need to have Python 3.6 installed on your machine for this tutorial. We’ll begin by setting up a virtual environment to manage the dependencies. This uses the Python package virtualenv. If you don’t have it installed, you can access it by entering pip install virtualenv at your command line.

$ mkdir climate_data
$ cd climate_data
$ virtualenv -p /usr/local/bin/python3 climate
$ source climate/bin/activate

This creates and activates a Python environment within the climate_data folder, so you can install your dependencies and not deal with conflicts from other Python versions or libraries. Your shell prompt should look something like this now:

(climate) Als-MacBook-Pro:climate_data alnelson$

The next thing we need to do is install matplotlib, which will help us plot the data on a graph.

$ pip install matplotlib

Once that’s done, we’re ready to move on to the coding part of this tutorial.

Import the Data

Create a Python file called climate.py and open it in your favorite text editor. Then import the necessary libraries:

import matplotlib as mpl

import numpy as np
import matplotlib.pyplot as plt

Note: if you’re on Mac OSX, then you may see an error when you try to import pyplot. This is a known issue with matplotlib and virtualenv. Luckily, you can use this workaround. Enter these lines right after the numpy import if you’re getting errors:

mpl.use('TkAgg')
import matplotlib.pyplot as plt

The next thing we need to do is load in the CSV data file. We do that using NumPy’s genfromtxt function, like so:

data = np.genfromtxt('global_data.csv', delimiter=',', dtype=None, skip_header=5, names=('date', 'value', 'anomaly'))

We’ll break this down a little at a time. First, you need to enter the name and path of your data file. In my case, I have the dataset in the same directory as my Python file. Make sure you’re pointing to the correct file location. Next, I specify the delimiter which is ‘,’ since it’s a CSV. dtype=None tells the interpreter to automatically assign data types based on the data that appears in the columns. skip_header tells it to skip the first 5 rows, because if you look at the dataset in a text editor, you’ll see that the first 5 rows are description. Finally, we tell NumPy what each column is called and save it to a variable called data.
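Here is the same call run against a made-up stand-in for the NOAA file, so you can see what genfromtxt produces (the header lines and rows below are invented, not real NOAA data):

```python
import io
import numpy as np

# four description lines plus a header row (five lines skipped in total),
# then (date, value, anomaly) rows, mimicking the layout described above
csv_text = """Global Land and Ocean Temperature Anomalies
June
Units: Degrees Fahrenheit
Base Period: 1901-2000
Date,Value,Anomaly
188006,59.77,-0.31
188106,59.89,-0.19
188206,59.94,-0.14
"""

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=None,
                     skip_header=5, names=('date', 'value', 'anomaly'))

# each named column is now addressable on its own
print(data['date'])     # the three dates as integers
print(data['anomaly'])  # the three anomalies as floats
```

The names argument is what lets us index columns by name ('date', 'value') when we plot below.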

Graph the Data

Now that we’ve got our data loaded in, we need to set up matplotlib to receive it.

plt.title("Global Land and Ocean Temperature Anomalies, June")
plt.ylabel('degrees F +/- from average')
plt.bar(data['date'], data['value'], color="blue")

# display the chart
plt.show()

You should now have a plot that looks like this:


What does this mean for our Earth? Well, you have the data. I’ll let you be the judge.

Next Steps

If you wanted to do some other interesting experiments, you could look up the weather on the day you were born, or on major election days. Perhaps you could combine the weather with polling data to see if there was any correlation. NOAA has lots of datasets for you to play with, so go and check it out.


Cleaning Dirty Data with Pandas & Python

August 10th, 2017

Pandas is a popular Python library used for data science and analysis. Used in conjunction with other data science toolsets like SciPy, NumPy, and Matplotlib, a modeler can create end-to-end analytic workflows to solve business problems.

While you can do a lot of really powerful things with Python and data analysis, your analysis is only ever as good as your dataset. And many datasets have missing, malformed, or erroneous data. It’s often unavoidable; anything from incomplete reporting to technical glitches can cause “dirty” data.

Thankfully, Pandas provides a robust library of functions to help you clean up, sort through, and make sense of your datasets, no matter what state they’re in. For our example, we’re going to use a dataset of 5,000 movies scraped from IMDB. It contains information on the actors, directors, budget, and gross, as well as the IMDB rating and release year. In practice, you’ll be using much larger datasets consisting of potentially millions of rows, but this is a good sample dataset to start with.

Unfortunately, some of the fields in this dataset aren’t filled in and some of them have default values such as 0 or NaN (Not a Number).

No good. Let’s go through some Pandas hacks you can use to clean up your dirty data.

Getting started

To get started with Pandas, first you will need to have it installed. You can do so by running:

$ pip install pandas

Then we need to load the data we downloaded into Pandas. You can do this with a few Python commands:

import pandas as pd

data = pd.read_csv('movie_metadata.csv')

Make sure you have your movie dataset in the same folder as you’re running the Python script. If you have it stored elsewhere, you’ll need to change the read_csv parameter to point to the file’s location.

Look at your data

To check out the basic structure of the data we just read in, you can use the head() command to print out the first five rows. That should give you a general idea of the structure of the dataset.


When we look at the dataset either in Pandas or in a more traditional program like Excel, we can start to note down the problems, and then we’ll come up with solutions to fix those problems.

Pandas has some selection methods which you can use to slice and dice the dataset based on your queries. Let’s go through some quick examples before moving on:

  • Look at some basic stats for the ‘imdb_score’ column: data.imdb_score.describe()
  • Select a column: data['movie_title']
  • Select the first 10 rows of a column: data['duration'][:10]
  • Select multiple columns: data[['budget','gross']]
  • Select all movies over two hours long: data[data['duration'] > 120]

Deal with missing data

One of the most common problems is missing data. This could be because it was never filled out properly, the data wasn’t available, or there was a computing error. Whatever the reason, if we leave the blank values in there, it will cause errors in analysis later on. There are a couple of ways to deal with missing data:

  • Add in a default value for the missing data
  • Get rid of (delete) the rows that have missing data
  • Get rid of (delete) the columns that have a high incidence of missing data

We’ll go through each of those in turn.

Add default values

First of all, we should probably get rid of all those nasty NaN values. But what to put in its place? Well, this is where you’re going to have to eyeball the data a little bit. For our example, let’s look at the ‘country’ column. It’s straightforward enough, but some of the movies don’t have a country provided so the data shows up as NaN. In this case, we probably don’t want to assume the country, so we can replace it with an empty string or some other default value.

data.country = data.country.fillna('')

This replaces the NaN entries in the ‘country’ column with the empty string, but we could just as easily tell it to replace with a default name such as “None Given”. You can find more information on fillna() in the Pandas documentation.

With numerical data like the duration of the movie, a calculation like taking the mean duration can help us even the dataset out. It’s not a great measure, but it’s an estimate of what the duration could be based on the other data. That way we don’t have crazy numbers like 0 or NaN throwing off our analysis.

data.duration = data.duration.fillna(data.duration.mean())
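On a toy DataFrame (column names mirror the movie dataset, values invented), those two fills look like this:

```python
import pandas as pd

# invented rows with the same kinds of holes as the movie dataset
df = pd.DataFrame({
    'country': ['USA', None, 'UK'],
    'duration': [100.0, None, 140.0],
})

# blank out missing countries, impute missing durations with the mean
df.country = df.country.fillna('')
df.duration = df.duration.fillna(df.duration.mean())

print(df.duration.tolist())  # [100.0, 120.0, 140.0]
```

The missing duration gets the mean of the two known values (120.0), so later aggregate calculations aren’t skewed by zeros or NaNs.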

Remove incomplete rows

Let’s say we want to get rid of any rows that have a missing value. It’s a pretty aggressive technique, but there may be a use case where that’s exactly what you want to do.

Dropping all rows with any NA values is easy:

data.dropna()
Of course, we can also drop rows that have all NA values:

data.dropna(how='all')
We can also put a limitation on how many non-null values need to be in a row in order to keep it (in this example, the data needs to have at least 5 non-null values):

data.dropna(thresh=5)
Let’s say for instance that we don’t want to include any movie that doesn’t have information on when the movie came out:

data.dropna(subset=['title_year'])
The subset parameter allows you to choose which columns you want to look at. You can also pass it a list of column names here.
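A small invented DataFrame makes the difference between how, thresh, and subset easy to see:

```python
import pandas as pd

# invented rows: one complete, one mostly empty, one entirely empty
df = pd.DataFrame({
    'movie_title': ['A', 'B', None],
    'title_year': [2001.0, None, None],
    'gross': [1.0, None, None],
})

all_na = df.dropna(how='all')                # drops only the fully-empty row
thresh2 = df.dropna(thresh=2)                # keeps rows with >= 2 non-null values
has_year = df.dropna(subset=['title_year'])  # keeps rows with a release year

print(len(all_na), len(thresh2), len(has_year))  # 2 1 1
```

None of these calls modify df in place; each returns a new DataFrame, so assign the result if you want to keep it.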

Deal with error-prone columns

We can apply the same kind of criteria to our columns. We just need to use the parameter axis=1 in our code. That means to operate on columns, not rows. (We could have used axis=0 in our row examples, but it is 0 by default if you don’t enter anything.)

Drop the columns that are all NA values:

data.dropna(axis=1, how='all')

Drop all columns with any NA values:

data.dropna(axis=1, how='any')

The same threshold and subset parameters from above apply as well. For more information and examples, visit the Pandas documentation.

Normalize data types

Sometimes, especially when you’re reading in a CSV with a bunch of numbers, some of the numbers will read in as strings instead of numeric values, or vice versa. Here’s a way you can fix that and normalize your data types:

data = pd.read_csv('movie_metadata.csv', dtype={'duration': int})

This tells Pandas that the column ‘duration’ needs to be an integer value. Similarly, if we want the release year to be a string and not a number, we can do the same kind of thing:

data = pd.read_csv('movie_metadata.csv', dtype={'title_year': str})

Keep in mind that this reads the CSV from disk again, so make sure you either normalize your data types first or dump your intermediary results to a file before doing so.
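Note that forcing a dtype at read time will raise an error if the column contains values that can’t be converted. One alternative (sketched here with invented values) is pd.to_numeric with errors='coerce', which turns unparseable entries into NaN so you can handle them with the techniques above:

```python
import pandas as pd

# an invented duration column that was read in as strings,
# with one unparseable entry
raw = pd.Series(['142', '97', 'n/a', '111'])

# coerce anything unparseable to NaN instead of raising an error
duration = pd.to_numeric(raw, errors='coerce')

print(duration.isna().sum())  # 1
```

From there you can fillna() or dropna() the coerced values just like any other missing data.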

Change casing

Columns with user-provided data are ripe for corruption. People make typos, leave their caps lock on (or off), and add extra spaces where they shouldn’t.

To change all our movie titles to uppercase:

data['movie_title'] = data['movie_title'].str.upper()
Similarly, to get rid of trailing whitespace:

data['movie_title'] = data['movie_title'].str.strip()
We won’t be able to cover correcting spelling mistakes in this tutorial, but you can read up on fuzzy matching for more information.

Rename columns

Finally, if your data was generated by a computer program, it probably has some computer-generated column names, too. Those can be hard to read and understand while working, so if you want to rename a column to something more user-friendly, you can do it like this:

data.rename(columns={'title_year':'release_date', 'movie_facebook_likes':'facebook_likes'})

Here we’ve renamed ‘title_year’ to ‘release_date’ and ‘movie_facebook_likes’ to simply ‘facebook_likes’. Since this is not an in-place operation, you’ll need to save the DataFrame by assigning it to a variable.

data = data.rename(columns={'title_year':'release_date', 'movie_facebook_likes':'facebook_likes'})

Save your results

When you’re done cleaning your data, you may want to export it back into CSV format for further processing in another program. This is easy to do in Pandas:

data.to_csv('cleanfile.csv', encoding='utf-8')

More resources

Of course, this is only the tip of the iceberg. With variations in user environments, languages, and user input, there are many ways that a potential dataset may be dirty or corrupted. At this point you should have learned some of the most common ways to clean your dataset with Pandas and Python.

For more resources on Pandas and data cleaning, the official Pandas documentation is a good place to continue.


Sending Secret Messages with Numpy

August 3rd, 2017

Numpy is a prominent Python library used for numeric computation. It is very popular with data scientists and analysts who need to do some heavy number crunching. One of the defining features of Numpy is the use of ndarrays, which perform faster than standard Python lists and can be used as mathematical matrices for computation.

Now all of that might sound a little complicated, but there are some fun things you can do with Numpy, too. We’re going to learn how you can use Numpy to store and encode messages to send to a friend, which they can then decrypt and read. No scary math involved.

Let’s get started.

Getting started

To follow along with this tutorial, you will need Python 3, pip, and virtualenv installed.

If you don’t have Python 3 installed, you can run brew install python3 at the command line to install using Homebrew. For more information, see this guide for OSX.

Let’s set up an environment to work in:

$ mkdir numpy_project
$ cd numpy_project
$ virtualenv -p /usr/local/bin/python3 secretmsg
$ source secretmsg/bin/activate

This sets up a project folder, initializes Python 3, and activates the virtual environment. To install Numpy, we run:

$ pip install numpy

Once that finishes, we’re ready to go. Let’s make some secret messages.

Encode a secret message

Now that we’ve got everything set up, the real fun begins. How are we going to use Numpy’s arrays to encode a secret message? Well, first we need a message to encode.

Grab some data

You can make up your own message here, or use this sample message:

Keep it secret. Keep it safe.

To start out with, don’t make an awfully long message. A few words or a phrase will do. Once you learn how to do it, you can scale this example to longer messages.

Load it into a numpy array

This is where we start using Numpy. We need to get our message into a Numpy array, right? The first thing we need to do is translate this text into Numpy’s data format. We can do that by splitting up the message by each letter. This way, each “slot” in the array will have one character.

Let’s do that now (this assumes you’ve created a file called message.py in your project directory):

import numpy as np

message = "Keep it secret. Keep it safe."
orig_array = np.array(list(message))

What we’ve done here is twofold. First, we use the list function to split up the string character-by-character, then we pass it to np.array. The output looks something like this:

['K' 'e' 'e' 'p' ' ' 'i' 't' ' ' 's' 'e' 'c' 'r' 'e' 't' '.' ' ' 'K' 'e' 'e' 'p' ' ' 'i' 't' ' ' 's' 'a' 'f' 'e' '.']

Encode the message

The text is now in a Numpy array. Great! But…whoever we send this to can still read the message pretty easily. Let’s change that.

The ROT-13 Cipher

ROT-13 is one of the best-known substitution ciphers and can be used to “scramble” the letters in a message quickly and easily. It works by “rotating” each letter in the message by thirteen places, so the letter ‘a’ turns into ‘n’, ‘b’ into ‘o’, and so on. This is a very simple cipher for the sake of example, so don’t go sending any really sensitive secrets using this method.
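To see what “rotating by thirteen places” means in code, here is a minimal hand-rolled version of the cipher. This is for illustration only; below we’ll use the ready-made codec from Python’s codecs module instead.

```python
# A minimal hand-rolled ROT-13, to show the "rotate by 13" idea.
def rot13_char(c):
    if 'a' <= c <= 'z':
        return chr((ord(c) - ord('a') + 13) % 26 + ord('a'))
    if 'A' <= c <= 'Z':
        return chr((ord(c) - ord('A') + 13) % 26 + ord('A'))
    return c  # spaces, punctuation, and digits pass through unchanged

def rot13(text):
    return "".join(rot13_char(c) for c in text)

print(rot13("Keep it secret. Keep it safe."))  # Xrrc vg frperg. Xrrc vg fnsr.
```

Because 13 is exactly half the alphabet, applying the same function twice gets you back the original message, which is why the same code both encodes and decodes.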

To encode our message using ROT-13, there’s a function in the codecs package to help us do that. You’ll just need to add the import statement at the top of your file (the library is included with Python, so you don’t need to install anything extra).

import codecs

encoded = codecs.encode(message, 'rot_13')

The encoded variable now holds the message in an “encrypted” form. Our example message now looks like this:

Xrrc vg frperg. Xrrc vg fnsr.

Sure, you could probably figure it out with a pen and paper fairly easily, but it definitely isn’t the same message we started with. That should be enough to throw any casually wandering eyes off the trail.

We can get it back into a Numpy array using the same code we used before:

encoded_array = np.array(list(encoded))

Next up — let’s export and send the message.

Send it to our friend

In order to send this message to a friend, we need to export it from our Python program. Numpy has a special function for exporting arrays called numpy.save. This serializes the data and stores it in a .npy file format. That way, our friend can load it up in Python and start using the array just as we had.

The .npy format also makes the message a little harder to read in transit, since it’s a binary Numpy file rather than plain text. Keep in mind, though, that this is obscurity rather than real security.

To export our encoded array, use the np.save function:

with open('secret.npy', 'wb') as outfile:
    np.save(outfile, encoded_array)

Decode the message

Now you can send that .npy file to the friend of your choice.

To decode the message, we do a similar process. Let’s start a new Python file called decoder.py:

import numpy as np
import codecs
encoded_array = np.load('secret.npy')

Just like we used np.save to dump the array to .npy format, we can use np.load to bring the data back. If you print the contents of encoded_array, it looks like this:

['X' 'r' 'r' 'c' ' ' 'v' 'g' ' ' 'f' 'r' 'p' 'e' 'r' 'g' '.' ' ' 'X' 'r' 'r' 'c' ' ' 'v' 'g' ' ' 'f' 'n' 's' 'r' '.']

There, it’s back, just as we left it.

But our job isn’t done yet — we have to put the message back together and figure out what it says!

To join the message back together as a string, we can use the join function:

encoded_string = "".join(encoded_array)

What this does is join all the elements in encoded_array together into one string. The empty string ("") means that we don’t want any characters in between each element of the array.

Now we’re back to:

Xrrc vg frperg. Xrrc vg fnsr.

There’s only one more step: decode the message using the codecs library.

decoded = codecs.encode(encoded_string, 'rot_13')

Now, all our friend needs to do is print the decoded string:

Keep it secret. Keep it safe.

And that’s that.

Going further

At this point, you should have learned how to use Numpy arrays to store data, export it, and load it from a file. You’ve learned a basic cryptographic technique for obscuring simple messages, and now you can use these programs to exchange notes with your friends.
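For reference, the entire round trip from this tutorial fits in a dozen lines (assuming numpy is installed and the script can write to the current directory):

```python
# End-to-end round trip: encode with ROT-13, save as .npy, load, decode.
import codecs
import numpy as np

message = "Keep it secret. Keep it safe."
encoded_array = np.array(list(codecs.encode(message, "rot_13")))

with open("secret.npy", "wb") as outfile:
    np.save(outfile, encoded_array)

loaded = np.load("secret.npy")
decoded = codecs.encode("".join(loaded), "rot_13")
print(decoded)  # Keep it secret. Keep it safe.
```

In practice you would run the first half, send secret.npy to your friend, and they would run the second half.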

If you wanted to expand on this project, here are some ideas and resources:

About the Author:

Building a Serverless Chatbot w/ AWS, Zappa, Telegram, and api.ai

August 2nd, 2017

If you’ve ever had to set up and maintain a web server before, you know the hassle of keeping it up-to-date, installing security patches, renewing SSL certificates, dealing with downtime, rebooting when things go wrong, rotating logs and all of the other ‘ops’ that come along with managing your own infrastructure. Even if you haven’t had to manage a web server before, you probably want to avoid all of these things.

For those who want to focus on building and running code, serverless computing provides fully-managed infrastructure that takes care of all of the nitty-gritty operations automatically.

In this tutorial, we’ll show you how to build a chatbot which performs currency conversions. We’ll make the chatbot available to the world via AWS Lambda, meaning you can write the code, hit deploy, and never worry about maintenance again. Our bot’s brain will be powered by api.ai, a natural language understanding platform owned by Google.


In this post we’ll walk you through building a Telegram Bot. We’ll write the bot in Python, wrap it with Flask and use Zappa to host it on AWS Lambda. We’ll add works-out-the-box AI to our bot by using api.ai.

By the end of this post, you’ll have a fully-functioning Chatbot that will respond to Natural Language queries. You’ll be able to invite anyone in the world to chat with your bot and easily edit your bot’s “brain” to suit your needs.

Before We Begin

To follow along with this tutorial, you’ll have to have a valid phone number and credit card (we’ll be staying within the free usage limits of all services we use, so you won’t be charged). Specifically, you’ll need:

  • …to sign up with Amazon Web Services. The signup process can be a bit long, and requires a valid credit card. AWS offers a million free Lambda requests per month, and our usage will stay within this free limit.
  • …to sign up with api.ai. Another lengthy sign-up process, as it requires integration with the Google Cloud Platform. You’ll be guided through this process when you sign up with api.ai. Usage is currently free.
  • …to sign up with Telegram, a chat platform similar to the more popular WhatsApp. You’ll need to download one of their apps (for Android, iPhone, Windows Phone, Windows, MacOS, or Linux) in order to register, but once you have an account you can also use it from web.telegram.org. You’ll also need a valid phone number. Telegram is completely free.
  • …basic knowledge of Python and a working Python environment (that is, you should be able to run Python code and install new Python packages). Preferably, you should have used Python virtual environments before, but you should be able to keep up even if you haven’t. All our code examples use Python 3, but most things should be Python 2 compatible.

If you’re aiming to learn how to use the various services covered in this tutorial, we suggest you follow along step by step, creating each component as it’s needed. If you’re impatient and want to get a functioning chatbot set up as fast as possible, you can clone the GitHub repository with all the code presented here and use that as a starting point.

Building an Echo Bot

When learning a new programming language, the first program you write is one which outputs the string “Hello, World!” When learning to build chatbots, the first bot you build is one that repeats everything you say to it.

Achieving this proves that your bot is able to accept and respond to user input. After that, it’s simple enough to add the logic to make your bot do something more interesting.

Getting a Token for Our New Bot

The first thing you need is a bot token from Telegram. You can get this by talking to the @BotFather bot through the Telegram platform.

In your Telegram app, open a chat with the official @BotFather Chatbot, and send the command /newbot. Answer the questions about what you’ll use for your new bot’s name and username, and you’ll be given a unique token similar to 14438024:AAGI6Kh8ew4wUf9-vbqtb3S4sIM7nDlcXj3. We’ll use this token to prove ownership of our new bot, which allows us to send and receive messages through the Bot.

We can now control our new bot via Telegram’s HTTP API. We’ll be using Python to make calls to this API.
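Every call to the Bot API is just an HTTPS request to a URL that embeds the token. As a quick sketch of how such URLs are built (the token below is the example token shown above, not a live one):

```python
# Build the URL for a Telegram Bot API method call.
# BOT_TOKEN is the placeholder example from above; substitute your own.
BOT_TOKEN = "14438024:AAGI6Kh8ew4wUf9-vbqtb3S4sIM7nDlcXj3"

def api_url(method):
    return "https://api.telegram.org/bot{}/{}".format(BOT_TOKEN, method)

print(api_url("getMe"))
# https://api.telegram.org/bot14438024:AAGI6Kh8ew4wUf9-vbqtb3S4sIM7nDlcXj3/getMe
```

We’ll use exactly this URL shape in our bot’s code below, with sendMessage as the method.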

Writing the First Code for Our New Bot

Create a new directory called currencybot to house the code we need for our bot’s logic, and create three Python files in this directory named config.py, currencybot.py, and bot_server.py. The structure of your project should be as follows:


In config.py we need a single line of code defining the bot token, as follows (substitute the token you received from BotFather):

bot_token = "14438024:AAGI6Kh8ew4wUf9-vbqtb3S4sIM7nDlcXj3"

In currencybot.py we need to put the logic for our bot, which revolves around receiving a message, handling the message, and sending a message. That is, our bot receives a message from some user, works out how to respond to this message, and then sends the response. For now, because we are building an echo bot, the handling logic will simply return any input passed to it back again.

Add the following code to currencybot.py:

import requests
import config

# The main URL for the Telegram API with our bot's token
BASE_URL = "https://api.telegram.org/bot{}".format(config.bot_token)

def receive_message(msg):
    """Extract the text and chat ID from a raw Telegram message"""
    try:
        message = str(msg["message"]["text"])
        chat_id = msg["message"]["chat"]["id"]
        return message, chat_id
    except Exception as e:
        print(e)
        return None, None

def handle_message(message):
    """Calculate a response to the message"""
    return message

def send_message(message, chat_id):
    """Send a message to the Telegram chat defined by chat_id"""
    url = BASE_URL + "/sendMessage"
    data = {"text": message.encode("utf8"), "chat_id": chat_id}
    try:
        response = requests.post(url, data).content
    except Exception as e:
        print(e)

def run(message):
    """Receive a message, handle it, and send a response"""
    try:
        message, chat_id = receive_message(message)
        response = handle_message(message)
        send_message(response, chat_id)
    except Exception as e:
        print(e)

Finally, bot_server.py is a thin wrapper for our bot that will allow it to receive messages via HTTP. Here we’ll run a basic Flask application. When our bot receives new messages, Telegram will send these via HTTP to our Flask app, which will pass them on to the code we wrote above. In bot_server.py, add the following code:

from flask import Flask
from flask import request
from currencybot import run

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def receive():
    try:
        run(request.get_json())
        return ""
    except Exception as e:
        print(e)
        return ""

This is a minimal Flask app that imports the main run() function from our currencybot script. It uses Flask’s request object (distinct from the requests library we used earlier, though the names are similar enough to be confusing) to grab the POST data from the HTTP request and convert it to JSON. We pass the JSON along to our bot, which extracts the text of the message and responds to it.

Deploying Our Echo Bot

We’re now ready to deploy our bot onto AWS Lambda so that it can receive messages from the outside world.

We’ll be using the Python library Zappa to deploy our bot, and Zappa will interact directly with our Amazon Web Services account. In order to do this, you’ll need to set up command line access for your AWS account as described here: https://aws.amazon.com/blogs/security/a-new-and-standardized-way-to-manage-credentials-in-the-aws-sdks/.

To use Zappa, it needs to be installed inside a Python virtual environment. Depending on your operating system and Python environment, there are different ways of creating and activating a virtual environment. You can read more about how to set one up here. If you’re using MacOS or Linux and have used Python before, you should be able to create one by running the following command.

virtualenv ~/currencybotenv

You should see output similar to the following:

~/git/currencybot g$ virtualenv ~/currencybotenv
Using base prefix '/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6'
New python executable in /Users/g/currencybotenv/bin/python3.6
Also creating executable in /Users/g/currencybotenv/bin/python
Installing setuptools, pip, wheel...done.

The result is that a clean Python environment has been created, which is important so Zappa will know exactly what dependencies to install on AWS Lambda. We’ll install the few dependencies we need for our bot (including Zappa) inside this environment.

Activate the environment by running:

source ~/currencybotenv/bin/activate

You should see your Terminal’s prompt change to indicate that you’re now working inside that environment. Mine looks like this:

(currencybotenv) ~/git/currencybot g$

Now we need to install the dependencies for our bot using pip. Run:

pip install zappa requests flask

At this point, we need to initialize our Zappa project. We can do this by running:

zappa init

This will begin an interactive process of setting up options with Zappa. You can accept all of the defaults by pressing Enter at each prompt. Zappa should figure out that your Flask application is inside bot_server.py and prompt to use bot_server.app as your app’s function.

You’ve now initialized the project and Zappa has created a zappa_settings.json file in your project directory. Next, deploy your bot to Lambda by running the following command (assuming you kept the default environment name of ‘dev’):

zappa deploy dev

This will package up your bot and all of its dependencies, and put them in an AWS S3 bucket, from which it can be run via AWS Lambda. If everything went well, Zappa will print out the URL where your bot is hosted. It should look something like https://l19rl52bvj.execute-api.eu-west-1.amazonaws.com/dev. Copy this URL because you’ll need to instruct Telegram to post any messages sent to our bot to this endpoint.

In your web browser, change the settings of your Telegram bot by using the Telegram API and your bot’s token. To set the URL to which Telegram should send your bot’s messages, visit a setWebhook URL of the following form, with your bot’s token and your AWS Lambda URL in place of the placeholders:

https://api.telegram.org/bot<your-bot-token>/setWebhook?url=<your-lambda-url>

For example, your URL should look something like this:

https://api.telegram.org/bot14438024:AAGI6Kh8ew4wUf9-vbqtb3S4sIM7nDlcXj3/setWebhook?url=https://l19rl52bvj.execute-api.eu-west-1.amazonaws.com/dev
Note that the string bot must appear directly before the token.

Testing Our Echo Bot

Visit your bot in the Telegram client by navigating to t.me/<your-bot's-username>. You can find a link to your bot in the last message sent by BotFather when you created the bot. Open a chat with your bot in the Telegram client and press the /start button.

Now you can send your bot messages and you should receive the same message as a reply.


If you don’t, it’s likely that there’s a bug in your code. You can run zappa tail dev in your Terminal to view the output of your bot’s code, including any error messages.

Teaching Our Bot About Currencies

You’ll probably get bored of chatting to your echo bot pretty quickly. To make it more useful, we’ll teach it how to send us currency conversions.

Add the following two functions to the currencybot.py file. These functions allow us to use the Fixer API to get today’s exchange rates and do some basic calculations.

def get_rate(frm, to):
    """Get the raw conversion rate between two currencies"""
    url = "http://api.fixer.io/latest?base={}&symbols={}".format(frm, to)
    try:
        response = requests.get(url)
        js = response.json()
        rates = js['rates']
        return rates.popitem()[1]
    except Exception as e:
        print(e)
        return 0

def get_conversion(quantity=1, frm="USD", to="GBP"):
    rate = get_rate(frm.upper(), to.upper())
    to_amount = quantity * rate
    return "{} {} = {} {}".format(quantity, frm, to_amount, to)

We’ll now expect the user to send currency conversion queries for our bot to compute. For example, if a user sends “5 USD GBP” we should respond with a calculation of how many British Pounds are equivalent to 5 US Dollars. We need to change our handle_message() function to split the message into appropriate parts and pass them to our get_conversion() function. Update handle_message() in currencybot.py to look like this:

def handle_message(message):
    """Calculate a response to a message"""
    try:
        qty, frm, to = message.split(" ")[:3]
        qty = int(qty)
        response = get_conversion(qty, frm, to)
    except Exception as e:
        response = "I couldn't parse that"
    return response

This function now parses messages that match the required format into the three parts. If the message doesn’t match what we were expecting, we inform the user that we couldn’t deal with their input.
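You can check the parsing step on its own, outside of Telegram. Here is a standalone sketch of the same split-and-convert logic (without the network call to fixer.io):

```python
# How the "5 USD GBP" input format splits into its three parts.
def parse_simple_query(message):
    qty, frm, to = message.split(" ")[:3]  # take at most three tokens
    return int(qty), frm.upper(), to.upper()

print(parse_simple_query("5 usd gbp"))  # (5, 'USD', 'GBP')
```

Anything that isn’t three space-separated tokens starting with a number (say, “five dollars please”) raises an exception here, which is exactly what the try/except in handle_message() catches.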

Save the code and update the bot by running the following command (make sure you are still within your Python virtual environment, in your project directory).

zappa update dev

Testing Our Currency Converter Bot

After the update has completed, you’ll be able to chat with your bot and get currency conversions. You can see an example of the bot converting US Dollars to South African Rands and US Dollars to British Pounds below:


Adding AI to Our Bot

Our bot is more useful now, but it’s not exactly smart. Users have to remember the correct input format and any slight deviations will result in the “I couldn’t parse that” error. We want our bot to be able to respond to natural language queries, such as “How much is 5 dollars in pounds?” or “Convert 3 USD to pounds”. There are an infinite number of ways that users might ask these questions, and extracting the three pieces of information (the quantity, from-currency, and to-currency) is a non-trivial task.

This is where Artificial Intelligence and Machine Learning can help us out. Instead of writing rules to account for each variation of the same question, machine learning lets us learn patterns from existing examples. Using machine learning, we can teach a program to extract the pieces of information that we want by ‘teaching’ it with a number of existing examples. Luckily, someone else has already done this for us, so we don’t need to start from scratch.

Create an account with api.ai, and go through their setup process. Once you get to the main screen, select the “Prebuilt Agents” tab, as shown below


Select the “Currency Converter” agent from the list of options, and choose a Google Cloud Project (or create a new one) to host this agent. Now you can test your agent by typing in a query in the top right-hand corner of the page, as indicated below:


Hit the “Copy Curl” link, which will copy a URL with the parameters you need to programmatically make the same request you just made manually through the web page. It should have copied a string that looks similar to the following into your clipboard.

curl 'https://api.api.ai/api/query?v=20150910&query=convert%201%20usd%20to%20zar&lang=en&sessionId=fed2f39e-6c38-4d42-aa97-0a2076de5c6b&timezone=2017-07-15T18:12:03+0200' -H 'Authorization:Bearer a5f2cc620de338048334f68aaa1219ff'

The important part is the Authorization argument, which we’ll need to make the same request from our Python code. Copy the whole token, including Bearer, into your config.py file, which should now look similar to the following:

bot_token = "14438024:AAGI6Kh8ew4wUf9-vbqtb3S4sIM7nDlcXj3"

apiai_bearer = "Bearer a5f2cc620de338048334f68aaa1219ff"

Add the following line to the top of your currencybot.py file:

from datetime import datetime

And add a parse_conversion_query() function below in the same file, as follows:

def parse_conversion_query(query):
    url_template = "https://api.api.ai/api/query?v=20150910&query={}&lang=en&sessionId={}"
    url = url_template.format(query, datetime.now())
    headers = {"Authorization":  config.apiai_bearer}
    response = requests.get(url, headers=headers)
    js = response.json()
    currency_to = js['result']['parameters']['currency-to']
    currency_from = js['result']['parameters']['currency-from']
    amount = js['result']['parameters']['amount']
    return amount, currency_from, currency_to

This reconstructs, in Python, the cURL command that we copied from the api.ai site. Note that the v=20150910 in url_template is a fixed API version identifier and should not be updated to the current date; it selects the current version of the api.ai API. We omit the optional timezone argument but use datetime.now() as a unique sessionId.
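To make the JSON handling concrete, here is the shape of the fields our code reads from the response. The sample payload below is illustrative and trimmed to just the keys we use; real api.ai responses contain many more fields.

```python
# Illustrative api.ai response shape, reduced to the keys our bot reads.
sample_js = {
    "result": {
        "parameters": {
            "amount": 5,
            "currency-from": "USD",
            "currency-to": "GBP",
        }
    }
}

# The same lookups parse_conversion_query() performs on the real response.
params = sample_js["result"]["parameters"]
print(params["amount"], params["currency-from"], params["currency-to"])  # 5 USD GBP
```

If api.ai can’t find one of these entities in the user’s sentence, the corresponding parameter comes back empty, which again lands us in the except branch of handle_message().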

Now we can pass a natural language query to the api.ai API (if you think that’s difficult to say, just look at the url_template, which contains api.api.ai/api/!) It will work out what the user wants in terms of quantity, from-currency, and to-currency, and return structured JSON for our bot to parse. Remember that api.ai doesn’t do the actual conversion; its only role is to extract the components we need from a natural language query, so we’ll pass these pieces to the fixer.io API as before. Update the handle_message() function to use our new NLU parser. It should look as follows:

def handle_message(message):
    """Calculate a response to a message"""
    try:
        qty, frm, to = parse_conversion_query(message)
        qty = int(qty)
        response = get_conversion(qty, frm, to)
    except Exception as e:
        response = "I couldn't parse that"
    return response

Make sure you’ve saved all your files, and update your deployment again with:

zappa update dev

Testing Our Bot’s AI

Now our bot should be able to convert between currencies based on Natural Language queries such as “How much is 3 usd in Indian Rupees”.


If this doesn’t work, run zappa tail dev again to look at the error log and figure out what went wrong.

Our bot is by no means perfect, and you should easily be able to find queries that break it and cause unexpected responses, but it can handle a lot more than the strict input format we started with! If you want to teach it to handle queries in specific formats, you can use the api.ai web page to improve your bot’s understanding and pattern recognition.


Serverless computing and Chatbots are both growing in popularity, and in this tutorial you learned how to use both of them.

We showed you how to set up a Telegram Chatbot, make it accessible to the world, and plug in a prebuilt brain.

You can now easily do the same using the other pre-built agents offered by api.ai, or start building your own. You can also look at the other Bot APIs offered by Facebook Messenger, Skype, and many similar platforms to make your Bots accessible to a wider audience.

About the Author:

Machine and Deep Learning in Python: What You Need to Know

July 19th, 2017

Big Data. Deep Learning. Data Science. Artificial Intelligence.

It seems like a day doesn’t go by when we’re not bombarded with these buzzwords. But what’s with all the hype? And how can you use it in your own business?

What is Machine Learning?

At its simplest level, machine learning is simply the process of optimizing mathematical equations. There are several different kinds of machine learning, all with a different purpose. Two of the most popular forms of machine learning are supervised and unsupervised learning. We’ll go through how they work below:

  • Supervised Learning: supervised learning uses labeled examples of known data to predict future outcomes. For example, if you kept track of weather conditions and whether your favorite sports team was playing that day, you could learn from those patterns over time and predict if the game would be rained out or not based on the weather forecast. The “supervised” part means that you have to supply the system with “answers” that you already know. That is, you already knew when your team did and didn’t play, and you know what the weather was on those days. The computer reads through this information iteratively and uses it to form patterns and make predictions. Other applications of supervised learning include predicting whether people will default on their loan payments.
  • Unsupervised Learning: unsupervised learning refers to a type of machine learning where you don’t necessarily know what the “answer” is that you’re looking for. Unlike our “will my sports game get rained out” example, unsupervised learning is more suitable for exploratory or clustering work. Clustering groups things that are similar or connected, so you could feed it a group of Twitter posts and have it tell you what people are most commonly talking about. Some algorithms that apply unsupervised learning are K-Means and LDA.

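As a toy illustration of supervised learning, here is a one-nearest-neighbor classifier in plain Python: it “learns” from labeled examples and predicts the label of a new, unseen point. The weather data below is made up for illustration.

```python
# Toy supervised learning: 1-nearest-neighbor on (temperature, rain_mm)
# predicting "played" vs "rained out". Training data is invented.
def predict(train, point):
    def dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # Pick the label of the closest known example.
    nearest = min(train, key=lambda ex: dist(ex[0], point))
    return nearest[1]

train = [
    ((75, 0), "played"),       # warm, no rain
    ((68, 2), "played"),       # mild drizzle
    ((60, 25), "rained out"),  # heavy rain
    ((55, 30), "rained out"),
]

print(predict(train, (72, 1)))   # played
print(predict(train, (58, 28)))  # rained out
```

The labeled training list is the “answers you already know”; the prediction step is the computer using those patterns on new data.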
What is Deep Learning?

Deep learning, despite the hype, is simply the application of multi-layered artificial neural networks to machine learning problems. It’s called “deep” learning because the networks stack many layers of representation rather than using a single layer. For example, a deep learning algorithm classifying faces in photos would first learn to recognize the shapes of eyes, noses, and mouths, and then the spatial relationships between them, instead of trying to recognize the whole face at once. Breaking the problem into component parts yields a better understanding.

Deep learning has been in the news a lot lately. You may remember the trippy image-generation project called DeepDream that Google released in 2015. Also noteworthy was AlphaGo’s triumph over a professional Go player, again using deep learning. Before this, a computer had never beaten a professional human player at Go, so the match marked a new milestone in artificial intelligence.


credit: jessica mullen from austin, tx – Deep Dreamscope

Machine Learning in Python

One of the best things about Python is the fact that there are so many libraries available. Since anyone can create a Python package and submit it to PyPI (Python Package Index), there are packages out there for just about everything you can think of. Machine and Deep Learning are no exception.

In fact, Python is one of the most popular languages for data scientists due to its ease of use and wealth of scientific packages available. Many Python developers, especially in the data space, like to use Jupyter Notebooks because it allows them to iterate and refine code and models without running the entire program each time.


scikit-learn is the frontrunner and longtime favorite of data scientists. It’s been around the longest and has whole books devoted to the topic. If you want a wealth of machine learning algorithms and customizations, scikit-learn likely has what you need. However, if you’re looking for something that’s more heavily stats-focused, you may want to go with StatsModels instead.


Caffe is a fast, open-source framework for deep learning with a Python interface (its core is written in C++). Developed by an AI research team at UC Berkeley, it performs well in image processing scenarios and is used by large companies such as Facebook, Microsoft, and Pinterest.


TensorFlow made waves in the machine learning community as Google’s open source deep learning offering. It currently stands as the most prominent deep learning framework in the space, with many developers participating. TensorFlow works well with object recognition and speech recognition tasks.


Theano is a Python library for fast numerical computation. Many developers use it on GPUs for data-intensive operations. It also has symbolic computation capabilities so you can calculate derivatives for functions with many variables. In fact, with GPU optimization, it can even outperform C. If you’re crunching some serious data, Theano could be your go-to.

Who’s using Machine Learning?

A better question would be: who’s not using machine learning in their business? And if not, why not?

The possibilities of data analytics at scale have been realized across industries, from healthcare to finance to oil and gas. Here are some notable firms betting on machine learning:

  • Google — Google uses machine learning across their company, from Google Translate to helping you categorize your photos to self-driving car research. Teams at Google also develop TensorFlow, a leading deep learning framework.
  • Facebook — Facebook makes heavy use of machine learning in the ad space. By looking at your interests, pages you visit, and things you ‘like’, Facebook gets a very good idea of who you are as a person and what kind of things you may be interested in buying. It uses this information to show you advertisements and posts in your newsfeed. Facebook also uses machine learning to recognize faces in your photos and help you tag them.
  • Netflix — Netflix uses the movies you watch, rate, and search for to create customized recommendations. One machine learning algorithm for product recommendations that both Netflix and Amazon employ is called collaborative filtering. In fact, Netflix once ran a contest, the Netflix Prize, that awarded teams who could develop better recommendation systems.

Pros and Cons of Machine Learning in Python

Pros:

  • Python is a general-purpose language, which means it can be used in a variety of scenarios and has a wealth of packages available for just about any purpose.
  • Python is easy to learn and read.
  • Developers can use Jupyter Notebooks to iteratively build their code and test it as they go.

Cons:

  • There’s no industry-standard IDE for Python like there is for R. Still, many good options exist.
  • In most cases, Python’s performance cannot compare with C/C++.
  • The wealth of options in Python can be both a pro and a con. There are lots of choices, but it may take more digging and research to find what you need. In addition, setting up separate packages can be complicated if you’re a novice programmer.

The era of Big Data is here, and it’s not going away. You have learned a little more about the different types of machine learning, deep learning, and the major technologies that companies are using. Next time you have a data-intensive problem to solve, look no further than Python!

About the Author:

Python 2 vs. Python 3 Explained in Simple Terms

July 13th, 2017

Python is a high level, versatile, object-oriented programming language. Python is simple and easy to learn while also being powerful and highly effective. These advantages make it suitable for programmers of all backgrounds, and Python has become one of the most widely used languages across a variety of fields.

Python differs from most other programming languages in that two incompatible versions, Python 2 and Python 3, are both widely used. This article presents a brief overview of a few of the differences between Python 2 and Python 3 and is primarily aimed at a less-technical audience.

Python 2 (aka Python 2.x)

The second version of Python, Python 2.0, arrived in 2000. It introduced many features that improved on the previous version, notably Unicode support and garbage collection for better memory management. The way the language itself was developed also changed: the development process became more open and included input from the community.

Python 2.7 is the latest (and final) Python 2 release. One feature included in this version is the ordered dictionary. The ordered dictionary enables the user to create dictionaries in an ordered manner, i.e., they remember the order in which their elements are inserted, and therefore it is possible to print the elements in that order. Another feature added in Python 2.7 is set literals; previously, one had to create a set from another type, such as a list, resulting in slower and more cumbersome code.
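A quick sketch of the two features just mentioned (this code runs on Python 3 as well):

```python
from collections import OrderedDict

# An ordered dictionary remembers insertion order.
d = OrderedDict()
d["first"] = 1
d["second"] = 2
d["third"] = 3
print(list(d.keys()))  # ['first', 'second', 'third']

# Set literals: no need to build a set from a list first.
primes = {2, 3, 5, 7}
print(3 in primes)  # True
```

Before 2.7, the set above would have been written as set([2, 3, 5, 7]), constructing a throwaway list along the way.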

While these are some prominent features that were included with Python 2.7, there are other features in this release. For instance, the input/output (io) module, used for reading and writing files, is faster than before. All the aforementioned features are also present in Python 3.1 and later versions.

Python 3 (aka Python 3.x)

Even though Python 2.x had matured considerably, many issues remained. The print statement was complicated to use and did not behave like Python functions, resulting in more verbose code than in other programming languages. In addition, Python strings were not Unicode by default, which meant that programmers needed to invoke functions to convert strings to Unicode (and back) when manipulating non-ASCII characters (i.e., characters outside the basic English character set, such as accented letters).
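To see the difference, consider this Python 3 snippet; in Python 2, the same literal would be a byte string unless explicitly prefixed with u'':

```python
# In Python 3, every string literal is Unicode by default.
s = "naïve café"
print(len(s))        # 10 -- counted as characters, not bytes
print(s.upper())     # NAÏVE CAFÉ

# Converting to bytes is an explicit, deliberate step.
encoded = s.encode("utf-8")
print(len(encoded))  # 12 -- each accented character takes two bytes in UTF-8
```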

Python 3, which was launched in 2008, was created to solve these problems and bring Python into the modern world. Nine years in, let’s consider how the adoption of Python 3 (which is currently at version 3.6) has fared against the latest Python 2.x release.

The most notable change in Python 3 is that print is now a function rather than a statement, as it was in Python 2. Since print is now a function, it is more versatile than it was in Python 2. This was perhaps the most radical change in the entire Python 3.0 release, and as a result, ruffled the most feathers. Users are now required to write print() instead of print, and programmers naturally object to having to type two additional characters and learn a new syntax. To be fair, the print() function can write to external text files far more cleanly than the old statement could, and being a function brings other advantages as well.
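A quick sketch of what the function form buys you (the file name here is just an example):

```python
# As a function, print() accepts keyword arguments the old statement lacked.
print("cart", "total", 42, sep=" | ")  # cart | total | 42
print("no trailing newline", end="")
print()

# Writing to a file is just one more keyword argument.
with open("log.txt", "w") as f:
    print("saved to disk", file=f)
```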

You might think that print becoming a function is a small change and having to type two more characters is not a big issue. But it is one of multiple changes that make Python 3 incompatible with Python 2. The problem of compatibility becomes complicated by the fact that organizations and developers may in fact have large amounts of Python 2 code that needs to be converted to Python 3.

Python 3.6 adds to these changes by allowing optional underscores in numeric literals for better readability (e.g., 1_000_000 vs. 1000000), and in addition extends Python’s support for asynchronous programming. (Note that the new features which appear in each successive version of Python 3 are not “backported” to Python 2.7, and as a result, Python 3 will continue to diverge from Python 2 in terms of functionality.)
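The underscores are purely cosmetic; the interpreter ignores them, so the values compare equal (this requires Python 3.6 or later):

```python
# Underscores may appear between digits to group them for readability.
budget = 1_000_000
print(budget == 1000000)  # True

# They work in other bases too, such as binary literals.
flags = 0b_1010_1010
print(flags == 170)  # True
```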

Should You Care?

It depends. If you are a professional developer who already works with Python, you should consider moving to Python 3 if you haven’t already. In order to make the transition easier, Python 3 includes a tool called 2to3 which is used to transform Python 2 code to Python 3. 2to3 will prove helpful to organizations which are already invested in Python 2.x, as it will help them convert their Python 2 code base to Python 3 as smoothly as possible.

If you are just starting out with Python, your best strategy is to embrace Python 3, although you should be aware that it is incompatible with Python 2, as you may encounter Python 2 code on websites such as Stack Overflow and perhaps at your current (or future) workplace.

Whether to use Python 3 or Python 2 in 2017 ultimately depends on the intended use. Python 2.7 will continue to receive support, including updated packages, until 2020. According to py3readiness.org, which tracks how many of the most popular libraries are compatible with Python 3, 345 out of 360 already support it, and that number will keep growing as support for 2.7 winds down. While Python 2.7 is sufficient for now, Python 3 is definitely the future of the language and is here to stay.

Takeaway: Python 2 is still widely used. Python 3 introduced several features that were not backward compatible with Python 2. It took a while for some popular libraries to support Python 3, but most major libraries now support Python 3, and support for Python 2 will eventually be phased out. Python 2 is still here in 2017 but is gradually on the way out.

About the Author:

An Overview of Python Web Development Options

July 6th, 2017

Python is a powerful language that supports many of the largest sites on the web. There are several prominent Python web development frameworks, each with their own use cases and features.

In fact, you probably visit a website powered by Python every day. Heard of Reddit? Instagram? Yelp? They all use Python.

It can be a little overwhelming to hear all the jargon being thrown around (Django? Flask? Pyramid?), so we’re going to break things down step-by-step here. Who’s using Python for web development, why, and what are the options out there as a developer?


Django is the most robust and full-featured of the pack. It has also been around the longest. Its motto is “The web framework for perfectionists with deadlines.” Because of this, the framework is very pragmatic and structured, but it can be quite opinionated at times. If you’re doing something that fits into its “way” of doing things, great. But if you have a more off-the-wall project, it may be more difficult to work around Django’s design constraints.

While it interfaces with databases quite well, it can be a lot of overhead if you just want to build a small project.

Here is a selection of popular sites that use Django:


If you’re starting out and Django is a little too complicated, look no further than Flask. It bills itself as a “microframework” and can set up a running web server in less than 10 lines of code. It’s lightweight, fast, and very customizable.
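For a sense of scale, a complete Flask application really does fit in a handful of lines. This sketch assumes Flask is installed (pip install Flask):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # Flask turns the return value into an HTTP response.
    return "Hello, World!"

# To serve it locally, call app.run(), which listens on
# http://127.0.0.1:5000 by default.
```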

However, extra libraries or configuration may be needed for more complex sites on Flask. That’s the downside of having creative freedom within the framework. It doesn’t enforce standards like Django, which can be both a pro and a con depending on your use case.

If you’re just looking for a small web server or a personal web site, Flask is a good option.

Sites that use Flask include:


Pyramid seeks to bridge the gap between “megaframeworks” like Django and “microframeworks” like Flask. Its motto is “start small, finish big, stay finished.” This means that you can get a web service up and running easily (similar to Flask), but Pyramid provides more resources and libraries to support scaling your site as well.

Companies using Pyramid include:

Static Site Generators

Static site generators are the new kids on the web development block. Instead of describing your website in a programming language you may or may not fully understand, static site generators allow you to write posts in (more or less) plain text. Many static site generators let you write in Markdown, which is basically just plain text with a little extra seasoning for formatting text and links. They then use a rendering engine to make your text appear on the web page in a structured and styled form. These sites are even more lightweight than Flask, which means there’s very little if any overhead to learn and set up.

The word “static” here means that the site’s pages are generated ahead of time and served as-is, with no database behind them. Features like user registration and dynamic code execution are therefore not possible within this model. Some people see this as a perk rather than a limitation; many of the most nefarious web security vulnerabilities come from leaving a database exposed. If there’s no database to begin with, many of those vulnerabilities simply don’t exist for your site.

If you’re looking for something that you can update easily and doesn’t have all the security worries of something like WordPress, give a static site generator a try.

Some of the most popular static site generators for Python include:

Pros and Cons of Python for Web Development

Because Python is a pretty simple and intuitive language to pick up, it’s accessible to coders and non-coders alike. The thriving Python development community ensures there’s a wealth of packages available to help you program just about anything you can imagine. Put a couple libraries together with one of the frameworks mentioned above, and the possibilities are limitless.

That being said, it’s not perfect. Here are some situations where you might not want to use Python:

  • Mobile development
  • Memory-intensive calculations
  • Performance-critical applications

Despite the shortcomings, Python is a strong choice for web developers old and new.