About the Author:

How Are You Dealing with the Shift to DevSecOps?

March 9th, 2020

DevSecOps gets a lot of industry attention, but inside most companies the reaction – and adoption – is a bit more muted. DevOps, by comparison, is widely accepted in application development. For instance, 85% of respondents in Deloitte’s 2019 future of cyber survey of C-suite executives with security responsibilities said they use DevOps practices.

DevSecOps, however, is hard on its heels. According to data from Research and Markets, the DevSecOps market is expected to grow from $1.5 billion in 2018 to $5.9 billion by 2023 due to “the growing need for highly secure continuous application delivery and the increased focus on security and compliance.”

The DevSecOps approach, a combination of development, security and operations, prioritizes and integrates security into the software application lifecycle as early as the development phase. This represents a significant paradigm shift for the DevOps community.

The Need for Speed Often Predates Change

Traditionally in application development, the development team with its programming expertise made a product while the server- and operating-systems-savvy operations team deployed it. DevOps merged the two teams into one so that everyone has to understand both areas in order to survive in today’s rapid software development environment.

Given DevOps’ popularity, the momentum is there for DevSecOps adoption, but progress has been slow on an enterprise level in part because it requires a fairly significant philosophical shift. Change can be tough. Companies are used to having an engineering team deal with code – DevOps – and a separate security team handle compliance audits, perform security scans, etc. Merging isn’t always as seamless as the DevSecOps model requires because it calls on DevOps technical talent to understand another set of skills: how to run and analyze security scans.

As is often the case with progress, some professionals will find themselves obsolete without careful career planning and additional training. For instance, security professionals who only know how to run scans will find themselves replaced by talent who can act in a more consultative fashion for DevSecOps teams: identifying best practices and vulnerabilities and then providing expert advice on how to prevent or solve those problems.

But there’s another reason DevSecOps adoption has been slower than it otherwise might have been considering the need for it in the marketplace – there is a lack of skilled talent available to take on DevSecOps roles. Companies looking for – and not finding – this kind of talent will have to develop it themselves. They can’t do without, and they can’t afford to wait until market supply catches up with market demand. Companies need agile, rapid software development as well as strong cybersecurity defense capabilities now.

Train to Build the DevSecOps Talent You Need

The only realistic option to handle this talent/skills gap is to train DevOps people how to integrate vulnerability scanning and insights into the application development process. Initially, there may be some grumbles. Some will want to know why the security team can’t simply do what it’s always done and conduct the necessary compliance tests, audits and scans. Leaders will have to clarify for them that when a company only releases code every few months, the old setup works fine; there is no need to change. That’s no longer the case, however.

Now teams release new code every week or even every day to keep up with increasingly accelerated development cycles. Calling on the security team in these circumstances takes too long, and in many cases, there may not be enough security talent to handle the workload. Training is the most reasonable option to ensure an organization can efficiently handle fast code deployment.

For anyone who initially rebels against DevSecOps training, it might also help to relay some of the other benefits that a move from a traditional DevOps approach to DevSecOps brings. In addition to increased operational efficiency and effectiveness, organizations will find:

·     It’s easier to identify system and application vulnerabilities.

·     Teams are stronger and collaborate better.

·     Security teams are more agile.

·     Tech talent will have more time to focus on more strategic, high-value projects.

·     The environment as a whole will enjoy more transparency.


Today’s increasingly agile software development environment requires change if organizations and the technical talent that support them are to succeed. The shift from DevOps to DevSecOps is one of those changes. It leads to greater ROI, more innovation and a dynamic DevSecOps environment that supports high speed, high level new software development – if organizations start now to prepare their talent for what lies ahead and, in many cases, is already in play.


Stop Torturing Your Technical Talent

March 23rd, 2018

There are three ways companies consistently torture their developers and engineers. If you recognize any of these, stop it immediately. Your technical talent will thank you.

I know the training industry can be slow to change. But seriously, enough already.

We need developers. We need technical talent of all kinds, and supply does not match demand in most industries. But many companies can’t pull – and keep – the talent they need to grow their business because they refuse to give them the training they need to thrive. Instead, companies torture them, forcing them to attend training events that fit like bad suits – ill-fitting and cheap.

Don’t Make Them Sit Through a Lecture

Too many learning leaders just assume that developers want to learn everything virtually, online, or mobile. Nothing could be farther from the truth. Most developers don’t like those types of training delivery methods at all. They prefer instructor-led training, with their peers in groups.

Developers want to learn while using technology, but they want to do it in a lab, not online. Labs offer them a chance to physically interact with new technology before they get back to work. Yet labs are often not included in traditional training activities. And if they are, they’re not nearly as long as lectures, which is completely backwards. Lectures and lengthy presentations are torture for developers, and chances are if they’re sitting through one they’re not paying attention.

Don’t Deny Them Real World Experience

Let’s say you have an engineer working at a credit card company. He – or she – doesn’t want to be trained in some cookie-cutter, fake environment unlike the environment he’ll work in once back on the job. It needs to be familiar, personalized and specific to the challenges he will face in that industry.

Take DevOps, for instance. DI’s DeveloperAcademy doesn’t just offer different levels of Python, Chef and Docker; each course is customized to a company’s specific needs. DevOps combines development and operations to make it easier to develop, deploy and test apps, so things will naturally look different in a retail vs. a professional services organization. It’s why we’ve made customization a core pillar of our business.

Don’t Ignore Their Future Training Needs

Engineers are offered new jobs every day, and most feel no compunction about jumping ship when they find a better opportunity. That’s not because they’re disloyal, easily bored, or any other stereotype you may have heard; it’s because they’re smart.  

Technology is always changing. That means they must constantly re-skill and constantly be learning so that they can change and adapt along with it. To not think this way is to quickly become irrelevant.

Developers want to know their company will take care of all their future training needs. So, by word and deed, make sure they know they can count on you to help them grow their technical skills long-term.

Developers have distinct training preferences that learning leaders need to pay attention to. If they don’t, they’re wasting time and money, and likely torturing their technical talent with ineffective training that won’t stick.


Docker Container and Image Cleanup on Minikube

February 28th, 2018

There are more elegant ways to do this for Docker images and containers with Docker Daemon 1.25+, but the Docker daemon on Minikube (0.25.0 at the time of writing) is 1.23, so it does not support those commands.

So for now, we can use these two gems from Jim Hoskins’ website to remove untagged images (mostly failed builds) and stopped containers.

Removing untagged images

docker rmi $(docker images | grep "^<none>" | awk '{print $3}')

Removing all stopped containers

docker rm $(docker ps -a -q)


SQL vs NoSQL: Determining What’s Right For You

January 15th, 2018

Gone are the days of having a single SQL database to manage all of your organization’s information. In today’s data-saturated age, new storage options keep emerging to meet rapidly changing needs.

You may have heard the term NoSQL tossed around, but what does it mean? And what can it do for you? How can a stakeholder know when one option will be more effective over the other, and what should you choose for your business? Those are the topics we will cover in this article.

SQL Databases

The old standby of data storage since the 1970s, SQL databases store information in a relational fashion. This means the data has a relationship to other data in the database. For example, a class directory for a school might have tables for classrooms, students, teachers, and more.


SQL (or Structured Query Language) is used to describe these objects and their relations to one another. While SQL is versatile enough to create complex queries and is widely-used and tested, things start to break down if you need to add more fields or a different structure down the line.

Since SQL requires predefined schemas of information, new types of information or ill-formatted data will grind the system to a halt.
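To see this rigidity in action, here’s a minimal sketch using Python’s built-in sqlite3 module (the students table and its columns are invented for illustration): a row that references a column outside the predefined schema is rejected outright.

```python
import sqlite3

# in-memory database with a predefined schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, classroom TEXT)")

# a well-formed row fits the schema
conn.execute("INSERT INTO students (name, classroom) VALUES (?, ?)", ("Ada", "101"))

# a row with an extra, undeclared field is rejected
try:
    conn.execute(
        "INSERT INTO students (name, classroom, locker) VALUES (?, ?, ?)",
        ("Grace", "102", "B7"),
    )
except sqlite3.OperationalError as e:
    print("rejected:", e)  # the 'locker' column does not exist
```

Only the first row makes it in; adding a `locker` field for real would mean an explicit schema migration.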


When you start accumulating more and more data, as many companies these days are, you have to find a way to scale up. To scale a SQL database, you add more resources to the server. This is called vertical scaling.

To scale vertically, you must increase system resources such as RAM, storage space, or CPU. If you’re hosting your database on a cloud server like AWS, this can get expensive very quickly.

Use Cases

These databases are best utilized with structured data such as that from our school example. Other datasets could hold weather information, inventory management data, or stock prices.

Example Products

Popular relational databases include MySQL, PostgreSQL, Microsoft SQL Server, and Oracle Database.

NoSQL Databases

As the variety and type of data we produce changes, so too do the tools we use to contain that data.

NoSQL databases focus on storing collections of unstructured data. Many APIs return JSON documents that are essentially lists of key-value pairs. The structure changes over time and data is coming in rapidly, maybe even in real-time. This type of data doesn’t fit easily into a traditional relational database, but you need somewhere to easily store and access the information. So what do you do?

Enter the NoSQL database.


True to their name, NoSQL databases eschew the SQL language and format in favor of more flexible storage. Data is stored in a more amorphous fashion that allows for greater scalability and real-time data ingestion.

There are four main types of data held by NoSQL databases:

  • Key-value pairs
  • Documents
  • Graphs
  • Wide columns

The benefit of these different types is that you don’t have to define a schema or format before starting to ingest data. This cuts down on maintenance or upgrades down the road to add new types or structure.
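As a rough illustration (the sensor data below is made up), documents in such a store are just JSON-like objects, and each one can carry a different shape without any upfront schema migration:

```python
import json

# a "collection" of documents: no schema is declared up front,
# so each document can carry a different set of fields
sensor_readings = [
    {"sensor": "t-01", "temp_c": 21.5},
    {"sensor": "t-02", "temp_c": 19.8, "humidity": 0.41},         # new field appears
    {"sensor": "gps-1", "location": {"lat": 40.7, "lon": -74.0}}  # nested structure
]

# documents serialize straight to JSON, as most NoSQL stores expect
payload = json.dumps(sensor_readings)

# querying tolerates missing fields instead of failing
warm = [d["sensor"] for d in sensor_readings if d.get("temp_c", 0) > 20]
print(warm)  # ['t-01']
```

A relational table would have forced every row into the same columns; here the humidity and location fields simply appear when the data does.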


NoSQL databases scale horizontally instead of vertically. This is done by a process called “sharding”, where the database’s storage is split over multiple servers. Sharding is possible with SQL databases, but it takes far more work and maintenance; on many NoSQL stores it comes enabled by default.
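The routing half of sharding can be sketched in a few lines of Python. This is a toy illustration, not any particular database’s algorithm; the shard names are invented, and real systems use more sophisticated schemes such as consistent hashing:

```python
import hashlib

SHARDS = ["server-0", "server-1", "server-2"]  # hypothetical shard servers

def shard_for(key: str) -> str:
    """Route a record to a shard by hashing its key."""
    # md5 gives a stable hash across runs (Python's built-in hash() does not)
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# every record with the same key always lands on the same shard,
# and adding capacity means adding servers, not growing one machine
print(shard_for("user:1001"), shard_for("user:1002"))
```

The key property is that routing is deterministic, so reads find the same shard that writes used.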

Use Cases

NoSQL databases work well for lots of varied, unstructured data. If you need to hold incoming sensor data or API responses, as two examples, NoSQL would be most effective. They can also be ideal for very, very large datasets (tens or hundreds of terabytes or more) because, while there is a practical upper limit on how far you can increase a single system’s resources, you can keep adding machines.

Example Products

Widely used NoSQL stores include MongoDB and CouchDB (documents), Redis (key-value), Neo4j (graphs), and Cassandra (wide columns).

The Big Question

So when should you stick with a relational database versus trying out a NoSQL solution? First, ask yourself how your data is structured. If it is in a fairly 2-dimensional (flat) format and has strong relations with other data in your dataset, consider a SQL database.

If you’re dealing with variable data that changes in format or a key-value store like JSON or XML, then give a NoSQL solution a try.

Here are some other basic criteria you want to look at when evaluating a new data storage solution:

  • structure of your data
  • the volume of your data
  • whether you anticipate this dataset to grow significantly in the future

In fact, the writer at TheHFTGuy has put together this handy flowchart for you to determine what kind of database best fits your data needs.


There are many different options out there today, but you now know the major types of data stores and where to apply them.


Does Your DevOps Department Need More Attention?

November 15th, 2017

How fully do you understand your DevOps department? Do you know when it’s running smoothly, and do you understand when it needs help? Knowing the latter is essential for you to help identify and fix problems as they come up.


ETL Management with Luigi Data Pipelines

October 15th, 2017

As a data engineer, you’re often dealing with large amounts of data coming from various sources and have to make sense of them. Extracting, Transforming, and Loading (ETL) data to get it where it needs to go is part of your job, and it can be a tough one when there are so many moving parts.

Fortunately, Luigi is here to help. An open source project developed by Spotify, Luigi helps you build complex pipelines of batch jobs. It has been used to automate Spotify’s data processing (billions of log messages and terabytes of data), as well as in other well-known companies such as Foursquare, Stripe, Asana, and Red Hat.

Luigi handles dependency resolution, workflow management, and helps you recover from failures gracefully. It can be integrated into the services you already use, such as running a Spark job, dumping a table from a database, or running a Python snippet.

In this example, you’ll learn how to ingest log files, run some transformations and calculations, then store the results.

Let’s get started.


For this tutorial, we’re going to be using Python 3.6.3, Luigi, and this fake log generator. Although you could use your own production log files, not everyone has access to those. So we have a tool that can produce dummy data for us while we’re testing. Then you can try it out on the real thing.

To start, make sure you have Python 3, pip, and virtualenv installed. Then we’ll set up our environment and install the luigi package:

$ mkdir luigidemo
$ cd luigidemo
$ virtualenv luigi
$ source luigi/bin/activate
$ pip install luigi


This sets up an isolated Python environment and installs the necessary dependencies. Next, we’ll need to obtain some test data to use in our data pipeline.

Generate test data

Since this example deals with log files, we can use this Fake Apache Log Generator to generate dummy data easily. If you have your own files you would like to use, that is fine, but we will be using mocked data for the purposes of this tutorial.

To use the log generator, clone the repository into your project directory. Then navigate into the Fake-Apache-Log-Generator folder and use pip to install the dependencies:

$ pip install -r requirements.txt

After that completes, it’s pretty simple to generate a log file. You can find detailed options in the documentation, but let’s generate a 1000-line .log file:

$ python apache-fake-log-gen.py -n 1000 -o LOG

This creates a file called access_log_{timestamp}.log.

Now we can set up a workflow in Luigi to analyze this file.

Luigi Tasks

You can think of building a Luigi workflow as similar to building a Makefile. There are a series of Tasks and dependencies that chain together to create your workflow.

Let’s say we want to figure out how many unique IPs have visited our website. There are a couple of steps we need to go through to do this:

  • Read the log file
  • Pull out the IP address from each line
  • Decide how many unique IPs there are (de-duplicate them)
  • Persist that count somewhere

Each of those can be packaged as a Luigi Task. Each Task has three component parts:

  • run() — This contains the logic of your Task. Whether you’re submitting a Spark job, running a Python script, or querying a database, this function contains the actual execution logic. We break up our process, as above, into small chunks so we can run them in a modular fashion.
  • output() — This defines where the results will go. You can write to a file, update HDFS, or add new records in a database. Luigi provides multiple output targets for you to use, or you can create your own.
  • requires() — This is where the magic happens. You can define dependencies of your task here. For instance, you may need to make sure a certain dataset is updated, or wait for another task to finish first. In our example, we need to de-duplicate the IPs before we can get an accurate count.

Our first Task just needs to read the file. Since it is the first task in our pipeline, we don’t need a requires() function or a run() function. All this first task does is send the file along for processing. For the purposes of this tutorial, we’ll write the results to a local file. You can find more information on how to connect to databases, HDFS, S3, and more in the documentation.

To build a Task in Luigi, it’s just a simple Python class:

import luigi

# this class extends the Luigi base class ExternalTask
# because we’re simply pulling in an external file
# it only needs an output() function
class readLogFile(luigi.ExternalTask):

    def output(self):
        return luigi.LocalTarget('/path/to/file.log')


So that’s a fairly boring task. All it does is grab the file and send it along to the next Task for processing. Let’s write that next Task now, which will pull out the IPs and put them in a list, throwing out any duplicates:

class grabIPs(luigi.Task): # standard Luigi Task class

    def requires(self):
        # we need to read the log file before we can process it
        return readLogFile()

    def run(self):
        ips = []

        # use the file passed from the previous task
        with self.input().open() as f:
            for line in f:
                # a quick glance at the file tells us the first
                # element is the IP. We split on whitespace and take
                # the first element
                ip = line.split()[0]
                # if we haven't seen this ip yet, add it
                if ip not in ips:
                    ips.append(ip)

        # count how many unique IPs there are
        num_ips = len(ips)

        # write the results (this is where you could use hdfs/db/etc)
        with self.output().open('w') as f:
            f.write(str(num_ips))

    def output(self):
        # the results are written to numips.txt
        return luigi.LocalTarget('numips.txt')


Even though this Task is a little more complicated, you can still see that it’s built on three component parts: requires, run, and output. It pulls in the data, splits it up, then adds the IPs to a list. Then it counts the number of elements in that list and writes that to a file.

If you try and run the program now, nothing will happen. That’s because we haven’t told Luigi to actually start running these tasks yet. We do that by calling luigi.run (you can also run Luigi from the command line):

if __name__ == '__main__':

   luigi.run(["--local-scheduler"], main_task_cls=grabIPs)


The run function takes two arguments: a list of options to pass to Luigi and the task you want to start on. While you’re testing, it helps to pass the --local-scheduler option; this allows the processes to run locally. When you’re ready to move things into production, you’ll use the Central Scheduler. It provides a web interface for managing your workflows and dependency graphs.

If you run your Python file at this point, the Luigi process will kick off and you’ll see the process of each task as it moves through the pipeline. At the end, you’ll see which tasks succeeded, which failed, and the status of the run overall (Luigi gives it a :) or :( face).

Check out numips.txt to see the result of this run. In my example, it returned the number 1000. That means that 1000 unique IP addresses visited the site in this dummy log file. Try it on one of your logs and see what you get!

Extending your ETL

While this is a fairly simple example, there are a lot of ways that you could easily extend this. You could:

  • Split out dates and times as well, and filter the logs to see how many visitors there were on a certain day
  • Geolocate the IPs to find out where your visitors are coming from
  • See how many errors (404s) your visitors are encountering
  • Put the data in a more robust data store
  • Build scheduling so that you can run the workflow periodically on the updated logs

Try some of these options out and let us know if you have any questions! Although this example is limited in scope, Luigi is robust enough to handle Spotify-scale datasets, so let your imagination go wild.


What Does AWS Lambda Do and Why Use It?

April 6th, 2017

Setting up and running a backend or server can be a very resource-intensive part of the development process for many companies and products. Amazon’s AWS Lambda is a serverless computing service that gives companies the computing power they would usually get from a server, but for much less time and money. This lets developers focus on writing great code and building great products. This short piece gives an introduction to AWS Lambda.

Get AWS Training for Teams

AWS Lambda is considered to be both serverless and event driven. An event-driven platform is one that traditionally depends on input from the user to determine which process will be executed next. The input might be a mouse click, a key press, or any other sensor input. The program contains a primary loop that ‘listens’ for any event that might take place and separate functions that execute code when triggered.

Lambda is priced in 100 millisecond increments. This can decrease costs if only a short amount of computing time is needed (vs. paying for a server by the hour). While the EC2 service targets large scale operations, AWS Lambda serves applications that are usually smaller and more dynamic.

Lambda can also be run alongside other AWS services such as S3. It can be programmed to respond whenever there is a change in the data contained in an S3 bucket. A container is launched when the Lambda function is invoked, and it stays active for some time after the code executes, waiting for further invocations.

Websites that rely on tasks like image uploading or simple data analytics can make use of Lambda to handle simple discrete processes. Paired with Amazon API Gateway, Lambda can also be used to host backend logic for websites over HTTPS.
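A minimal Python handler gives a feel for the model. The event shape below follows the S3 notification format, but the bucket and object names are invented, and the function makes no real AWS calls, so it can be exercised anywhere:

```python
def lambda_handler(event, context):
    """React to an S3 'object created' notification by reporting
    which objects changed (a real handler might resize images,
    run analytics, etc.)."""
    keys = [
        record["s3"]["object"]["key"]
        for record in event.get("Records", [])
    ]
    return {"processed": keys}

# simulate the event AWS would deliver when a file lands in a bucket
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "uploads"}, "object": {"key": "photo.jpg"}}}
    ]
}
print(lambda_handler(fake_event, None))  # {'processed': ['photo.jpg']}
```

Deployed to Lambda with an S3 trigger, this function would run (and be billed) only for the moments each upload takes to process.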


Traditionally, developers would set up servers and processes to have computing power available to execute different tasks. With AWS Lambda, they can focus on their product and buy computing power from Lambda only when they need it.

Get AWS Training for Teams


Using Airflow to manage your DevOps ETLs

February 10th, 2017

In this article we will describe the use of Apache’s Airflow project to manage ETL (Extract, Transform, Load) processes in a Business Intelligence Analytics environment. First we will describe the history of Airflow, some context around its uses, and why it is fast becoming an important tool in the DevOps pipeline for managing the extraction, transformation, and loading of data from large-scale data warehouses. Second, we will provide a practical guide for its integration into a DevOps environment. Finally, we will provide some installation instructions and describe some pitfalls when deploying with Airflow.

Key Airflow concepts

Airflow was developed by engineers at Airbnb to provide a standardized way to handle multiple ETL processes around an Enterprise Data Warehouse system. As of the time of this article, it is undergoing incubation with the Apache Software Foundation. Airflow is written in Python but is language agnostic. Among other components, it can utilize RabbitMQ (through Celery) to queue tasks and uses Jinja for templating.

The idea behind Airflow is that the user creates DAGs, or Directed Acyclic Graphs, which are really just a visual representation of how each of the things you are asking your ETL to do relates to the others. It also acts as a job scheduler and allows the developer or ops manager to check the status of several tasks via the web interface. It is highly scalable, elegant, and dynamically capable of providing a layer of abstraction around multiple possible environments. Finally, it provides several built-in connectors, which means your developers don’t need to spend time writing connection code for various databases.

Get Airflow Training for Teams

Use Case:

Let’s use a hypothetical to demonstrate: let’s say that you want to create an ETL that would:

  1. Extract data from a Hadoop cluster utilizing Hive
  2. Transform the data
  3. Email it to a user

Breaking each of these steps into tasks, the visual representation of your DAG might look something like this:

From the developer perspective

So let’s go through each step from the developer perspective:

Their job is now to write these DAGs, which refer to each of these tasks in code. Utilizing Airflow to do this means that the developer avoids the need to write new connectors, and the systems engineer gets an easy-to-follow, standardized road map of what the code is doing in each step and what to check if there are issues.

The first thing we want to do is create our DAG, which we will do by importing the DAG object from the Airflow library and entering some parameters:

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
   'owner': 'fpombeiro',
   'depends_on_past': True,
   'start_date': datetime(2017, 2, 1),
   'email': ['fpombeiro@airflow.com'],
   'email_on_failure': True,
   'email_on_retry': False,
   'retries': 1,
   'retry_delay': timedelta(minutes=5),
   # 'queue': 'bash_queue',
   # 'pool': 'backfill',
   # 'priority_weight': 10,
   # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('read_hive', default_args=default_args)

My first job, as a developer, is to connect to the Hive database. Fortunately I can find a pre-existing HiveServer2Hook() in the standard Airflow library by looking up my “connections” in my web UI:

Utilizing the Airflow webserver UI (found at localhost:8080 locally) I can go in and add my own connection arguments with a point and click interface. Now that I have added these it’s simply a matter of:

from airflow.hooks import HiveServer2Hook

…and my connection is right there, pre-written, and re-usable by any other developers who are also working on the project.

Now that the connection is good, let’s create an OPERATOR to call some code and do the work!

from airflow.operators.python_operator import PythonOperator

def do_work():
   hiveserver = HiveServer2Hook()
   hql = "SELECT COUNT(*) FROM foo.bar"
   row_count = hiveserver.get_records(hql, schema='foo')
   return row_count[0][0]

callHook = PythonOperator(
   task_id='get_data',
   python_callable=do_work,
   dag=dag)

So we have taken a “task”, here called “do_work”, written it in Python, and placed it in a PythonOperator (which basically means “use Python to run this task”). There are all sorts of these operators, which is what allows Airflow to be language agnostic.

For our second task, let’s write a simple python function that multiplies the data by 15:

def multiply_count(**context):
   value = context['ti'].xcom_pull(task_ids='get_data')
   return value * 15

transform = PythonOperator(
   task_id='transform',
   python_callable=multiply_count,
   provide_context=True,
   dag=dag)
Okay, so we now know that we want to run task one (called ‘get_data’) and then run task two (‘transform data’).

One quick note: ‘xcom’ is a mechanism available in Airflow to pass data between two tasks. It is simple to use: you “push” data from one task (xcom_push) and “pull” data from a second task (xcom_pull). The data is stored in a key->value store and is accessed via the task_id (which you can see above).

Finally let’s add in our last step, email out this number:

from airflow.operators.email_operator import EmailOperator

EMAIL_CONTENT = """
<h1>{{ ti.xcom_pull(task_ids='transform') }}</h1>
"""

email_report = EmailOperator(
   task_id='email_report',
   to='fpombeiro@airflow.com',
   subject='Airflow processing report',
   html_content=EMAIL_CONTENT,
   dag=dag)

What we are doing here is simply wrapping our output in h1 HTML tags, creating an EMAIL_CONTENT constant, and then creating an operator to email our data out. (All of the SMTP details are contained in the Airflow environment; your developers will not need to re-create these each time.)

Our final step is to create an orderly method to run these tasks by declaring their basic dependencies. Below we are saying “run callHook, then run transform, and finally email out the results”:

callHook.set_downstream(transform)
transform.set_downstream(email_report)
This is a fairly straightforward example. Some of the things that have to go on “behind the scenes” include: setting up the connections, variables, and sub-dags.

Hopefully this has helped you see how useful Airflow can be in managing your ETL processes.

Get Airflow Training for Teams


What does Kubernetes actually do and why use it?

February 7th, 2017

Kubernetes is a vendor-agnostic cluster and container management tool, open-sourced by Google in 2014. It provides a “platform for automating deployment, scaling, and operations of application containers across clusters of hosts”. Above all, this lowers the cost of cloud computing expenses and simplifies operations and architecture.

This article will explain Kubernetes from a high-level and answer the questions:

  • What is Kubernetes and what does it do? Why should people use it?
  • What does orchestration mean?
  • Why do people use containers?
  • Why should IT people care about this – in other words, what would they have to do if they did not use Kubernetes and containers?

Kubernetes and the Need for Containers

Before we explain what Kubernetes does, we need to explain what containers are and why people are using them.

A container is a mini-virtual machine. It is small, as it does not have device drivers and all the other components of a regular virtual machine. Docker is by far the most popular container format, and it was built for Linux. Because containers have become so popular, Microsoft has added them to Windows as well.

The best way to illustrate why this is useful and important is to give an example.

Suppose you want to install the nginx web server on a Linux server. You have several ways to do that. First, you could install it directly on the physical server’s OS. But most people use virtual machines now, so you would probably install it there.

But setting up a virtual machine requires some administrative effort and cost as well. And a machine will be underutilized if you dedicate it to just one task, which is how people typically use VMs. It would be better to load that one machine up with nginx, messaging software, a DNS server, etc.

The people who invented containers thought through these issues and reasoned that since nginx or any other application just needs some bare minimum operating system to run, then why not make a stripped down version of an OS, put nginx inside, and run that. Then you have a self-contained, machine-agnostic unit that can be installed anywhere.

Containers are now so popular that, some people say, they threaten to make VMs obsolete.

Docker Hub

But making the container small is not the only advantage. The container can be deployed just like a VM template, meaning an application that is ready to go that requires little or no configuration.

There are thousands of preconfigured Docker images at the Docker Hub public repository. There, people have taken the time to assemble open source software configurations that might take someone else hours or days to put together. Everyone benefits, because you can install nginx, or even far more complicated stacks, simply by downloading an image.

For example, this one-line command will download, install, and start Apache Spark with Jupyter notebooks (IPython):

docker run -d -p 8888:8888 jupyter/all-spark-notebook

As you can see, it is running on port 8888. So you could install something else on another port, or even run a second instance of Spark and Jupyter.

On the Need for Orchestration

Now, there is an inherent problem with containers, just as there is with virtual machines: the need to keep track of them. When public cloud companies bill you for CPU time or storage, you need to make sure you do not have any orphaned machines spinning out there doing nothing. There is also the need to automatically spin up more containers when an application needs more memory, CPU, or storage, and to shut them down when the load lightens.

Orchestration tackles these problems. This is where Kubernetes comes in.


Google built Kubernetes on more than a decade of experience running its own massive container-based systems, which is one of its key selling points. Google released Kubernetes as open source in 2014.

Kubernetes is a cluster and container management tool. It lets you deploy containers to clusters, meaning a network of virtual machines. It works with different containers, not just Docker.

Kubernetes Basics

The basic idea of Kubernetes is to further abstract machines, storage, and networks away from their physical implementation. So it is a single interface to deploy containers to all kinds of clouds, virtual machines, and physical machines.

Here are a few Kubernetes concepts to help you understand what it does.


Node

A node is a physical or virtual machine. It is not created by Kubernetes. You create nodes with a cloud operating system, like OpenStack or Amazon EC2, or install them manually. So you need to lay down your basic infrastructure before you use Kubernetes to deploy your apps. But from that point on, Kubernetes can define virtual networks, storage, etc. For example, you could use OpenStack Neutron or Romana to define networks and push those out from Kubernetes.


Pods

A pod is one or more containers that logically belong together. Pods run on nodes and run together as a logical unit, so they share context: the containers in a pod share the same IP address, can reach each other via localhost, and can share storage. All of the containers in a pod are scheduled onto the same node, but one node can run multiple pods.
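As a sketch of what a pod definition looks like, here is a minimal YAML manifest for a single-container nginx pod. The names, labels, and image version are illustrative only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-nginx          # illustrative pod name
  labels:
    app: my-nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.19     # illustrative image version
    ports:
    - containerPort: 80   # port the container listens on
```

Applying this with kubectl asks Kubernetes to schedule the pod onto one of the cluster's nodes.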

Pods are cloud-aware. For example, you could spin up two nginx instances and assign them a public IP address on Google Compute Engine (GCE). To do that, you would start the Kubernetes cluster, configure the connection to GCE, and then type something like:

kubectl expose deployment my-nginx --port=80 --type=LoadBalancer


Deployments

A deployment is a set of pods. A deployment ensures that a sufficient number of pods are running at any one time to service the app, and shuts down pods that are not needed. It can decide this by looking at, for example, CPU utilization.
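For illustration, a minimal deployment manifest might look like the following; the replicas field tells Kubernetes how many copies of the pod to keep running. All names and versions here are examples, not from the original article:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  replicas: 3             # keep three pods running at all times
  selector:
    matchLabels:
      app: my-nginx
  template:               # pod template the deployment stamps out
    metadata:
      labels:
        app: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.19
        ports:
        - containerPort: 80
```

If a pod crashes or a node dies, the deployment notices the shortfall and starts a replacement pod elsewhere in the cluster.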

Vendor Agnostic

Kubernetes works with many cloud and server products, and the list is always growing, as many companies contribute to the open source project. Even though it was invented by Google, Google does not dominate its development.

To illustrate: the OpenStack block storage service is called Cinder, and OpenStack orchestration is called Heat. You can use Heat with Kubernetes to manage storage with Cinder.

Kubernetes works with Amazon EC2, Azure Container Service, Rackspace, GCE, IBM Software, and other clouds. It works with bare metal (using something like CoreOS), Docker, and vSphere. And it works with libvirt and KVM, which turn Linux machines into hypervisors (i.e., platforms to run virtual machines).

Use Cases

So why would you use Kubernetes on, for example, Amazon EC2, when EC2 has its own orchestration tool (CloudFormation)? Because with Kubernetes you can use the same orchestration tool and command-line interface for all your different systems. Amazon CloudFormation only works with EC2. With Kubernetes, you could push containers to the Amazon cloud, to your in-house virtual and physical machines, and to other clouds.

Wrapping Up

So we have answered the question: what is Kubernetes? It is an orchestration tool for containers. What are containers? They are small virtual machines that run ready-to-go applications on top of virtual machines or any host OS. They greatly simplify deploying applications and help ensure machines are fully utilized. All of this lowers cloud computing costs, further abstracts the data center, and simplifies operations and architecture. To get started learning, you can install Minikube to run everything on one machine and play around with it.


How to Create Persistent Storage (e.g., Databases) in Docker

January 23rd, 2017

Docker containers were created with dynamic, ephemeral data in mind. This meant that, out of the box, Docker containers did not know how to deal with persistent data such as large databases. Two workarounds were initially used to make Docker containers work with databases; the Docker volume API was later introduced to handle persistent data natively. This article contains a brief introduction to working with the Docker volume API.


First, let’s review the workarounds. The first workaround to the Docker/database problem is to store the database itself elsewhere, such as in the cloud or on a separate virtual machine, and have the container reach it over a network port; this essentially treats the database as an external service, as legacy applications often do. Another workaround is to back the container's data up on, say, Amazon S3, so it can be retrieved should the container go belly up. Given that databases are typically large files, this can be a cumbersome process.

Both of these workarounds have their drawbacks. For this reason, many developers and companies use Docker data volumes instead. A data volume is a directory, mounted into one or more containers, that bypasses the Union File System. It is initialized when the container is created, can be shared across containers on a host, and stores its data directly on the host, which means the same volume can be reused by multiple containers. All changes are made directly to the volume, and images are updated independently of it, so even if the containers are lost for some reason, the data persists.
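As a sketch of this sharing, a Docker Compose file (a hypothetical example, not from the original article) can declare one named volume and mount it into two containers, so both see the same data:

```yaml
version: "3"
services:
  db:
    image: postgres:13          # illustrative database image
    volumes:
      - shared-data:/var/lib/postgresql/data
  backup:
    image: alpine:3.12          # illustrative helper container
    command: ls /data           # sees the same files as the db service
    volumes:
      - shared-data:/data
volumes:
  shared-data: {}               # the named volume lives on the host
```

Either container can be removed and recreated, and the data in shared-data survives on the host.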

In a scenario where Network Attached Storage (NAS) is being used, we have to know which host has the access to different mount points/volumes on the storage and map those points within the container itself.

Docker specifies a volume API explicitly created for this purpose.

docker volume create --name hello

docker run -d -v hello:/container/path/for/volume container_image my_command


The snippet above creates a named volume and then starts a container with that volume mounted at the given path.

This approach is different from the data-only container pattern.

For example, consider a volume created with:

-v volume_name:/container/fs/path

A named volume created this way:

  1. Can be listed via docker volume ls and inspected via docker volume inspect volume_name.
  2. Behaves like a normal directory on the host.
  3. Can be shared with other containers via a --volumes-from connection.

With the data-only container pattern, one would have to arrange each of these behaviors individually.

Dangling volumes are volumes that are no longer referenced by any container; they simply consume storage space, so they can be removed. But identifying them can be difficult. To do so, we can use:

docker volume ls -f dangling=true

Using the command below, we can delete a volume by name:

docker volume rm <volume name>

If the number of dangling volumes is large, a single one-liner can delete them all in a batch:

docker volume rm $(docker volume ls -f dangling=true -q)

Another method of achieving persistent storage with Docker is bind mounting. With this method, Docker does not create a new path for the volume; instead, it mounts a path specified on the host itself. The path can point to either a file or a directory.

The command for bind mounting Docker files is:

docker run -v /host/path:/some/path …

Docker will check whether anything exists at the specified host path and, if not, will create it.

However, bind-mounted volumes differ in some respects from normal volumes, since Docker best practices dictate avoiding changes to host attributes that Docker did not create.

Firstly, if the container image already contains data at the mount path, that data will not be copied into the host directory automatically, unlike with run-of-the-mill named volumes.

Secondly, if the following command is used:

docker rm -v my_container

it will result in the Docker container being removed, but the bind-mounted volumes will still exist. This is where the command for removing dangling volumes can be used, in the event that the data is not required.
