In this article we will be describing the use Apache’s Airflow project to manage ETL (Extract, Transform, Load) processes in a Business Intelligence Analytics environment. First we will describe the history of Airflow, some context around its uses, and why it is fast becoming an important tool in the DevOps pipeline for managing the Extraction, Transformation, and Loading of data from large scale data warehouses. Second, we will provide a practical guide for its integration into a DevOps environment. Finally we will provide some installation instructions and describe some pitfalls when deploying with Airflow.

Key Airflow concepts

Airflow was developed by engineers at AirBnB to provide a standardized way to handle multiple ETL processes around an Enterprise Data Warehouse system. As of the time of this article, it is undergoing incubation with the Apache Software project. Airflow is written in Python but is language agnostic. It utilizes rabbitMQ, Jinja,

The idea behind Airflow is that the user will create DAGs or Directed Acyclic Graphs which are really just a visual representation of how each of the things that you are asking your ETL to do relate to each other. It also acts as a job scheduler and allows the developer or ops manager to check the status of several tasks utilizing the web interface. It is scalable to infinity, elegant, and dynamically capable of providing a layer of abstraction around multiple possible environments. Finally, it provides several built-in connectors, which means that your developers don’t need to spend time writing connection codes to various databases.

Get Airflow Training for Teams

Use Case:

Let’s use a hypothetical to demonstrate: let’s say that you want to create an ETL that would:

  1. Extract data from an hadoop cluster utilizing HIVE
  2. Transform the data
  3. Email it to a user

Breaking each of these steps into tasks the visual representation of your DAG might look something like this:

screen-shot-2017-02-10-at-2-32-49-pm
From the developer perspective

So let’s go through each step from the developer perspective:

From the developer perspective their job is now to write these DAGs, which refer to each of these tasks in code. Utilizing Airflow to do this means that the developer avoids the need to write new connectors and the systems engineer gets an easy-to-follow standardized road map of what the code is doing in each step and what to follow if there are issues.

The first thing we want to do is create our DAG, which we will do by importing the DAG object from the Airflow library and entering some parameters:

My first job, as a developer, is to connect to the HIVE database. Fortunately I can find a pre-existing HIVEServer2Hook() in the standard Airflow library by looking up my “connections” in my web ui:

Utilizing the Airflow webserver UI (found at localhost:8080 locally) I can go in and add my own connection arguments with a point and click interface. Now that I have added these it’s simply a matter of:

…and my connection is right there, pre-written, and re-usable by any other developers who are also working on the project.

Now that the connection is good, let’s create an OPERATOR to call some code and do the work!

So we have taken a “task”, here called “do_work” written in python and placed that task in a “Python Operator” (which basically means “use python to run this task”. There are all sorts of these operators, which allows for Airflow to be language agnostic).

For our second task, let’s write a simple python function that multiplies the data by 15:

Okay, so we now know that we want to run task one (called ‘get_data’) and then run task two (‘transform data’).

One quick note: ‘xcom’ is a method available in airflow to pass data in between two tasks. The function is simple to use: you “push” data from one task (xcom_push) and “pull” data from a second task (xcom_pull). The data is stored in a key->value store and is accessed via the task_id (which you can see above).

Finally let’s add in our last step, email out this number:

What we are doing here is simply adding h1 html tags to our outputs for the email and creating a EMAIL_CONTENT constant and then creating an operator to email our data out (all of the SMTP details are contained in the airflow environment. Your developers will not need to re-create these each time.

Our final step is to create an orderly method to run these tasks by looking at basic dependencies. In the below examples, we are saying “run callHook, then run transform, and finally email out the results”

email.set_upstream(transform)

This is a fairly straightforward example. Some of the things that have to go on “behind the scenes” include: setting up the connections, variables, and sub-dags.

Hopefully this has helped you see how useful Airflow can be in managing your ETL processes.

Get Airflow Training for Teams

The following two tabs change content below.
mm

Fernando Pombeiro

Fernando (“Fern”) Pombeiro is a Manager of data engineering at Fandango.com. Following his military career he transitioned to primarily data engineering roles, spending four years with Amazon.com and several years working in the Los Angeles startup scene in Facebook gaming as Data Engineer, Hadoop administrator, and Senior Data Analyst. He currently works for Fandango.com as a manager of BI Engineering in Santa Monica. Fern spends most of his day coding in Python, Java, Hive, Spark, SQL and Scala.
mm

Latest posts by Fernando Pombeiro (see all)

Comments

comments