Apache Flink and Spark are major technologies in the Big Data landscape. There is some overlap (and confusion) about what each does and how they differ. This post compares Spark and Flink: what they do, how they are different, what people use them for, and what streaming is.
Streaming data means data that comes in continuously. Processing streaming data poses technical complications not inherent in batch systems. Providing the ability to do that is the technological breakthrough that makes streaming products like Flink, Spark, Storm, and Kafka so important. It lets organizations make decisions based on what is happening right now.
When people learn about streaming data, the first use case they usually look at is Twitter. Twitter has APIs that let programmers query tweets in real time as they are posted. Streaming data examples also include traffic of any kind, stock ticker prices, weather readings coming from sensors, and temperature and vibration gauges on industrial machines.
What Do Flink and Spark Do?
Apache Spark is considered a replacement for the batch-oriented Hadoop system, but it also includes a component called Apache Spark Streaming. Contrast this with Apache Flink, which is specifically built for streaming.
Flink and Spark are in-memory processing engines rather than databases: they do not persist the data they process unless told to. They can write results to permanent storage, but the whole point of streaming is to keep the data in memory and use it right now. There’s little reason to write it to storage first when you want to analyze current data.
Spark and Flink are far more sophisticated than other real-time systems, such as the syslog connection over TCP built into every Linux system. This is because Flink and Spark are big data systems. They are fault tolerant and built to scale to immense amounts of data.
All of this lets programmers write big data programs (like MapReduce functions) with streaming data. They can take data in whatever format it is in, join different sets, reduce it to key-value pairs (map), and then run calculations on adjacent pairs to produce some final calculated value. They also can plug these data items into machine learning algorithms to make some projection (predictive models) or discover patterns (classification models).
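To make the map and reduce steps described above concrete, here is a minimal sketch in plain Python. It does not use the Spark or Flink APIs; the function names map_phase and reduce_phase and the sample log lines are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: turn every word of every line into a (word, 1) pair."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def reduce_phase(pairs):
    """Reduce step: group the pairs by key, then add up the values per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


# Hypothetical input: a few log lines standing in for a much larger data set.
events = ["error disk full", "warning disk slow", "error network down"]
counts = reduce_phase(map_phase(events))
print(counts)
```

In a real cluster the map and reduce steps would run on many machines at once. The sketch only shows the shape of the computation.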
How is Apache Flink different from Apache Spark, and, in particular, Apache Spark Streaming?
At first glance, Flink and Spark would appear to be the same. The main difference is that Flink was built from the ground up as a streaming product, while Spark added Streaming onto its product later.
Let’s take a look at the technical details of both.
Spark Micro Batches
Spark divides streaming into discrete chunks of data called micro batches. Then it repeats that in a continuous loop. Flink takes a checkpoint on streaming data to break it into finite sets. Incoming data is not lost when taking this checkpoint, as it is preserved in a queue in both products.
Either way you do it there will always be some lag time processing live data, so dividing it into sets should not matter. After all, when a program runs a MapReduce operation, the reduce operation is run on the map dataset that was created a few seconds ago.
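The micro-batch idea can be sketched in a few lines of plain Python. This is an illustration only, not Spark code: it cuts batches by record count, whereas Spark Streaming cuts them by time interval, and the sensor_readings source and its values are invented for the example.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut an incoming stream into successive lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the source has ended; a live stream would never reach this
        yield batch


def sensor_readings():
    # Stand-in for a live temperature feed; finite here so the example ends.
    for value in [21.0, 21.5, 22.0, 23.5, 24.0, 22.5, 21.0]:
        yield value


# Process each small batch as if it were an ordinary, finite data set.
averages = [sum(batch) / len(batch) for batch in micro_batches(sensor_readings(), 3)]
print(averages)
```

Each batch is handled with ordinary batch logic, which is why the results always trail the live data by roughly one batch.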
Using Flink for Batch Operations
Spark was built, like Hadoop, to run over static data sets. Flink can do this by just stopping the streaming source.
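One way to picture a batch job as a stream whose source stops is the plain Python sketch below. It is not the Flink API; running_total is a hypothetical operator written for this example.

```python
from itertools import count, islice


def running_total(source):
    """Emit the running sum after every record; works on any iterator."""
    total = 0
    for value in source:
        total += value
        yield total


# Unbounded source: count(1) never ends, so we only look at the first results.
stream_results = list(islice(running_total(count(1)), 4))

# Bounded source: the same operator, but the input simply stops.
batch_results = list(running_total([1, 2, 3, 4]))

print(stream_results, batch_results)
```

The operator itself does not change between the two cases. The only difference is whether the source ever ends.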
The Flink Streaming Model
To say that a program processes streaming data means it opens a file and never closes it. That is like keeping a TCP socket open, which is how syslog works. Batch programs, on the other hand, open a file, process it, then close it.
Flink says it has developed a way to checkpoint streams without having to open and close them. To checkpoint means to note where one has left off and then resume from that point. Flink also runs a type of iteration that it says lets its machine learning algorithms run faster than Spark's. That is not insignificant, as ML routines can take many hours to run.
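The general idea of checkpointing, noting where you left off and resuming from there, can be sketched as follows. This is a deliberately simplified, single-process illustration and not Flink's actual mechanism; the class name, the checkpoint interval of three records, and the simulated failure are all assumptions made for the example.

```python
class CheckpointedCounter:
    """Sums records while periodically saving (offset, total) as a checkpoint."""

    def __init__(self, checkpoint_every=3):
        self.offset = 0            # index of the next record to read
        self.total = 0             # running state built from records seen so far
        self.checkpoint_every = checkpoint_every
        self._checkpoint = (0, 0)  # last saved (offset, total)

    def process(self, records, fail_at=None):
        while self.offset < len(records):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated crash")
            self.total += records[self.offset]
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                self._checkpoint = (self.offset, self.total)

    def restore(self):
        """Roll back to the last checkpoint instead of starting from zero."""
        self.offset, self.total = self._checkpoint


records = [1, 2, 3, 4, 5, 6, 7, 8]
counter = CheckpointedCounter()
try:
    counter.process(records, fail_at=5)  # crash before reading the sixth record
except RuntimeError:
    counter.restore()                    # back to the checkpoint after record 3
    resumed_from = counter.offset
    counter.process(records)             # replay from the checkpoint onward

print(resumed_from, counter.total)
```

Because the source can be re-read from the saved offset, no record is lost or counted twice, even though part of the work is repeated.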
Flink versus Spark in Memory Management
Flink has a different approach to memory management. Flink pages out to disk when memory is full, which is what happens with Windows and Linux too. Spark crashes that node when it runs out of memory. But it does not lose data, since it is fault tolerant. To fix that issue, Spark has a new project called Tungsten.
Flink and Spark Machine Learning
What Spark and Flink do well is take machine learning algorithms and make them run over a distributed architecture. That, for example, is something that the Python scikit-learn framework and CRAN (Comprehensive R Archive Network) cannot do. Those are designed to work on one file, on one system. Spark and Flink can scale and process enormous training and testing data sets over a distributed system. It is worth noting, however, that the Flink graph processing and ML libraries are still in beta test, as of January 2017.
Command Line Interface (CLI)
Spark has CLIs in Scala, Python, and R. Flink does not really have a CLI, but the distinction is subtle.
To have a Spark CLI means a user can start up Spark, obtain a SparkContext, and write programs one line at a time. That makes walking through data and debugging easier. Walking through data and running map and reduce processes, and doing that in stages, is how data scientists work.
Flink has a Scala CLI too, but it is not exactly the same. With Flink, you write code and then run print() to submit it in batch mode and wait for the output.
Again, this might be a matter of semantics. You could argue that spark-shell (Scala), pySpark (Python), and sparkR (R) are batch too. Spark is said to be “lazy.” That means when you create an object it only creates a pointer to it. It does not run any operation until you ask it to do something like count(), which would require creating the object in order to measure it. Only then does it submit the work to its batch engine.
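Lazy evaluation of this kind can be imitated in plain Python. The sketch below is not Spark's implementation; LazyDataset is a hypothetical class that only records the requested steps and reads the source when count() is called.

```python
class LazyDataset:
    """Records map/filter steps and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source        # a function that returns an iterator
        self._steps = tuple(steps)   # recorded but not yet executed

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _run(self):
        for item in self._source():
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def count(self):
        # The action: this is the first point at which the source is read.
        return sum(1 for _ in self._run())


reads = []


def source():
    reads.append("read")  # record every time the data is actually touched
    return iter(range(10))


even_squares = LazyDataset(source).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
reads_before = len(reads)  # nothing has been read yet
n = even_squares.count()   # now the whole pipeline runs
reads_after = len(reads)

print(reads_before, n, reads_after)
```

Defining the pipeline costs nothing. All of the work is deferred to the call that needs a concrete answer.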
Both products, of course, support batch jobs, which is how most people would run their code once they have written and debugged it. With Flink you can run Java and Scala code in batch. With Spark you can run Java, Scala, and Python code in batch.
Support for Other Streaming Products
Spark or Flink can run locally or on a cluster. To run Hadoop or Flink on a cluster one usually uses Yarn. Spark is usually run with Mesos. If you want to use Yarn with Spark then you have to download a version of Spark that has been compiled with Yarn support.
The Flink End Game
What Spark has that Flink does not is a large install base of production users. If you look on the Flink website you can see some examples of what companies have built using it, including Capital One and Alibaba. So it’s certainly going to move beyond the beta stage and into the mainstream. Which one will have more users? No one can say, although more funding is pouring into Spark right now. Which one is more suitable for streaming data? That depends on requirements and merits further study.