Pandas is a popular Python library used for data science and analysis. Used in conjunction with other data science tools like SciPy, NumPy, and Matplotlib, it lets a modeler build end-to-end analytic workflows to solve business problems.
While you can do a lot of really powerful things with Python and data analysis, your analysis is only ever as good as your dataset. And many datasets have missing, malformed, or erroneous data. It’s often unavoidable: anything from incomplete reporting to technical glitches can cause “dirty” data.
Thankfully, Pandas provides a robust library of functions to help you clean up, sort through, and make sense of your datasets, no matter what state they’re in. For our example, we’re going to use a dataset of 5,000 movies scraped from IMDB. It contains information on the actors, directors, budget, and gross, as well as the IMDB rating and release year. In practice, you’ll be using much larger datasets consisting of potentially millions of rows, but this is a good sample dataset to start with.
Unfortunately, some of the fields in this dataset aren’t filled in and some of them have default values such as 0 or NaN (Not a Number).
No good. Let’s go through some Pandas hacks you can use to clean up your dirty data.
To get started with Pandas, first you will need to have it installed. You can do so by running:
$ pip install pandas
Then we need to load the data we downloaded into Pandas. You can do this with a few Python commands:
import pandas as pd
data = pd.read_csv('movie_metadata.csv')
Make sure the movie dataset is in the same folder as the Python script you’re running. If it’s stored elsewhere, you’ll need to change the read_csv argument to point to the file’s location.
To check out the basic structure of the data we just read in, you can use the head() command to print out the first five rows. That should give you a general idea of the structure of the dataset.
data.head()
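It’s also worth running info(), which lists each column’s data type and non-null count, a quick way to spot which columns need cleaning:

data.info()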
When we look at the dataset, either in Pandas or in a more traditional program like Excel, we can start to note down the problems and then come up with solutions to fix them.
Pandas has some selection methods which you can use to slice and dice the dataset based on your queries. Let’s go through some quick examples before moving on:
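A few quick sketches, using columns from this dataset:

# Select a single column as a Series
data['movie_title']

# Select several columns as a new DataFrame
data[['movie_title', 'duration']]

# Filter rows with a boolean condition, e.g. movies longer than two hours
data[data['duration'] > 120]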
One of the most common problems is missing data. This could be because it was never filled out properly, the data wasn’t available, or there was a computing error. Whatever the reason, if we leave the blank values in there, they will cause errors in analysis later on. There are a couple of ways to deal with missing data:

- Add in a default value for the missing data
- Get rid of (delete) the rows that have missing data
- Get rid of (delete) the columns that have missing data

We’ll go through each of those in turn.
First of all, we should probably get rid of all those nasty NaN values. But what should we put in their place? Well, this is where you’re going to have to eyeball the data a little bit. For our example, let’s look at the ‘country’ column. It’s straightforward enough, but some of the movies don’t have a country provided, so the data shows up as NaN. In this case, we probably don’t want to assume the country, so we can replace it with an empty string or some other default value.
data.country = data.country.fillna('')
This replaces the NaN entries in the ‘country’ column with the empty string, but we could just as easily tell it to replace with a default name such as “None Given”. You can find more information on fillna() in the Pandas documentation.
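For example, to use a visible placeholder instead of an empty string:

data.country = data.country.fillna('None Given')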
With numerical data like the duration of the movie, a calculation like taking the mean duration can help us even the dataset out. It’s not a great measure, but it’s an estimate of what the duration could be based on the other data. That way we don’t have crazy numbers like 0 or NaN throwing off our analysis.
data.duration = data.duration.fillna(data.duration.mean())
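If the column has outliers, the median is a more robust estimate than the mean, and the same pattern works:

# Median is less sensitive to extreme values than the mean
data.duration = data.duration.fillna(data.duration.median())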
Let’s say we want to get rid of any rows that have a missing value. It’s a pretty aggressive technique, but there may be a use case where that’s exactly what you want to do.
Dropping all rows with any NA values is easy:
data.dropna()
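Keep in mind that dropna() returns a new DataFrame rather than modifying data in place, so assign the result if you want to keep it:

data = data.dropna()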
Of course, we can also drop rows that have all NA values:
data.dropna(how='all')
We can also set a threshold for how many non-null values a row needs in order to be kept (in this example, a row needs at least 5 non-null values):
data.dropna(thresh=5)
Let’s say, for instance, that we don’t want to include any movie that’s missing its release year:
data.dropna(subset=['title_year'])
The subset parameter lets you choose which columns to check for missing values; pass it a list of one or more column names.
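For instance, to require both a release year and a director (assuming the dataset’s ‘director_name’ column):

# Keep only rows where both columns are non-null
data.dropna(subset=['title_year', 'director_name'])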
We can apply the same kind of criteria to our columns. We just need to use the parameter axis=1 in our code. That means to operate on columns, not rows. (We could have used axis=0 in our row examples, but it is 0 by default if you don’t enter anything.)
Drop the columns that are all NA values:
data.dropna(axis=1, how='all')
Drop all columns with any NA values:
data.dropna(axis=1, how='any')
The same threshold and subset parameters from above apply as well. For more information and examples, visit the Pandas documentation.
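For example, a sketch that keeps only the columns that are at least 90% populated (thresh must be an integer count of non-null values):

# Keep only columns with non-null values in at least 90% of rows
data.dropna(axis=1, thresh=int(len(data) * 0.9))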
Sometimes, especially when you’re reading in a CSV with a bunch of numbers, some of the numbers will read in as strings instead of numeric values, or vice versa. Here’s a way you can fix that and normalize your data types:
data = pd.read_csv('movie_metadata.csv', dtype={'duration': int})
This tells Pandas that the ‘duration’ column needs to be read as an integer. (Note that a column containing NaN values can’t be loaded as a plain int; fill or drop the missing values first, or use pandas’ nullable ‘Int64’ dtype.) Similarly, if we want the release year to be a string and not a number, we can do the same kind of thing:
data = pd.read_csv('movie_metadata.csv', dtype={'title_year': str})
Keep in mind that this re-reads the CSV from disk, so any cleaning you’ve already done in memory will be lost. Make sure you either normalize your data types up front or dump your intermediate results to a file before re-reading.
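If you’d rather not re-read the file, you can also convert a column you already have in memory with astype(); a minimal sketch:

# Convert in place on the existing DataFrame
# (note: NaN entries become the string 'nan' under a plain str conversion)
data['title_year'] = data['title_year'].astype(str)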
Columns with user-provided data are ripe for corruption. People make typos, leave their caps lock on (or off), and add extra spaces where they shouldn’t.
To change all our movie titles to uppercase:
data['movie_title'] = data['movie_title'].str.upper()
Similarly, to get rid of trailing whitespace:
data['movie_title'] = data['movie_title'].str.strip()
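These string methods return new Series and can be chained, so one line can trim whitespace and normalize case together:

data['movie_title'] = data['movie_title'].str.strip().str.upper()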
We won’t be able to cover correcting spelling mistakes in this tutorial, but you can read up on fuzzy matching for more information.
Finally, if your data was generated by a computer program, it probably has some computer-generated column names, too. Those can be hard to read and understand while working, so if you want to rename a column to something more user-friendly, you can do it like this:
data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})
Here we’ve renamed ‘title_year’ to ‘release_date’ and ‘movie_facebook_likes’ to simply ‘facebook_likes’. Since this is not an in-place operation, you’ll need to save the DataFrame by assigning it to a variable.
data = data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})
When you’re done cleaning your data, you may want to export it back into CSV format for further processing in another program. This is easy to do in Pandas:
data.to_csv('cleanfile.csv', encoding='utf-8')
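By default, to_csv() also writes the DataFrame’s index as an extra column; pass index=False if you don’t want it in the output file:

data.to_csv('cleanfile.csv', index=False, encoding='utf-8')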
Of course, this is only the tip of the iceberg. With variations in user environments, languages, and user input, there are many ways that a potential dataset may be dirty or corrupted. At this point you should have learned some of the most common ways to clean your dataset with Pandas and Python.
For more on Pandas and data cleaning, see the official Pandas documentation.