Why Spark Structured Streaming Could Be The Best Choice
There are many stream-processing frameworks, both proprietary and open source. They differ in their capabilities, APIs, and the trade-offs they strike between latency and throughput.
Following the “right tool for the job” principle, streaming stacks should be compared and contrasted against the requirements of each new project in order to make the best choice.
In this blog post, I’ll discuss Spark’s stream-processing model and what differentiates it from other streaming engines out there.
What is Spark Structured Streaming?
Apache Spark Streaming – an extension of the core Spark API – is a scalable and fault-tolerant system that natively supports batch and streaming workloads.
Data is ingested from sources such as Amazon Kinesis, Apache Kafka, or TCP sockets, and processed using complex algorithms. The processed data is then pushed out to file systems, databases, and live dashboards. Since the original Spark Streaming API, an easier-to-use engine, Apache Spark Structured Streaming, has been introduced for streaming applications and pipelines.
Apache Spark Structured Streaming is a high-level API for stream processing that lets you take batch operations written with Spark’s structured APIs and run them in a streaming fashion. What are the benefits? Reduced latency, incremental processing, and rapid time to value with virtually no code changes, as the sketch below illustrates.
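To make this concrete, here’s a minimal sketch, assuming a hypothetical directory of JSON events with an event_type column: the aggregation logic is identical in both modes, and only the entry points change from read/write to readStream/writeStream.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch version: read a static directory of JSON files and aggregate.
batch_df = spark.read.json("/data/events")                      # hypothetical path
batch_counts = batch_df.groupBy("event_type").count()
batch_counts.write.mode("overwrite").parquet("/data/event_counts")

# Streaming version: the same aggregation, run incrementally over new files.
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events")
stream_counts = stream_df.groupBy("event_type").count()
query = (
    stream_counts.writeStream
    .outputMode("complete")          # emit full updated counts each micro-batch
    .format("console")
    .start()
)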
Traditional processing models
Generally, stream processing is known as the continuous processing of endless streams of data. With the advent of big data, stream processing systems transitioned from single-node processing engines to distributed processing engines. Traditionally, stream processing stacks use a record-at-a-time processing model, as shown below.
In this model, each node continuously receives one record at a time, processes it, and then forwards the generated data to the next node. It can achieve very low latencies – input data can be processed by the pipeline and generate output within milliseconds.
However, this model isn’t very efficient at recovering from node failures and nodes that are lagging. It can either recover from a failure very fast with a lot of extra failover resources, or use minimal extra resources but recover slowly. For a more detailed explanation, see the original research paper by Matei Zaharia et al. (2013).
Spark Structured Streaming: micro-batch stream processing
Apache Spark challenged this traditional approach with Spark Structured Streaming. It introduced the idea of micro-batch stream processing, where the streaming computation is modeled as a continuous series of small, map/reduce-style batch processing jobs (hence, “micro-batches”) on small chunks of the stream data. This is illustrated below.
As shown above, the Spark Structured Streaming engine divides the data from the input stream into, say, one-second micro-batches. Each batch is processed in the Spark cluster in a distributed manner with small deterministic tasks that generate the output in micro-batches. Breaking down the streaming computation into these small tasks gives us two advantages over the traditional, continuous-operator model:
- Spark’s smart scheduling system can quickly and efficiently recover from failures by rescheduling one or more copies of a task on other executors.
- The deterministic nature of the tasks ensures the output remains the same no matter how many times they are executed.
But this efficient fault tolerance comes at a cost: latency. The micro-batch model can’t achieve millisecond-level latencies – rather, from half a second to a few seconds. However, it can still be argued that in the majority of stream processing use cases, the benefit of micro-batch processing outweighs the drawback of second-scale latencies.
For example, a streaming output read by an hourly job isn’t affected by subsecond latencies. Also, larger delays are often introduced elsewhere in the pipeline – for example, when ingested data is batched into Kafka to increase throughput – which makes Spark’s micro-batching delay insignificant when you consider the end-to-end latency.
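As a small illustration of how the micro-batch interval can be tuned, here’s a sketch using Spark’s built-in rate test source and a processing-time trigger; the one-second interval is just an example value.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("micro-batch-interval").getOrCreate()

# The "rate" source continuously emits (timestamp, value) rows, so no external
# system is needed for this experiment.
stream_df = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# The processing-time trigger starts a new micro-batch roughly every second;
# without a trigger, the next micro-batch starts as soon as the previous one finishes.
query = (
    stream_df.writeStream
    .format("console")
    .trigger(processingTime="1 second")
    .start()
)
query.awaitTermination()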
Exactly-once guarantee
Spark’s micro-batching offers strong consistency because batch boundaries are deterministic – each batch’s beginning and end are known and recorded. This guarantees that all transformations produce the same result if they are rerun.
For example, let's assume we’re reading data from a single partition Kafka source, as shown in the figure below.
Each event is considered to have an ever-increasing offset. Spark Structured Streaming knows whether there is data to process by asking the source (the Kafka partition) for its current offset and comparing it with the last processed offset. This tells Spark which batch of data to process next.
The source (Kafka in this case) is informed that data has been processed by committing a given offset. This guarantees that all data with an offset less than or equal to the committed offset has been processed, and that subsequent requests will only include offsets greater than the committed offset. This process repeats continuously, ensuring a steady acquisition of streaming data. To recover from an eventual failure, offsets are checkpointed to external storage. The cycle breaks down into three steps:
- At t, the system calls getOffset and obtains the current offset for the source.
- At t + 1, the system obtains the batch up to the last known offset by calling getBatch(start, end). Note: New data might have arrived in the meantime.
- At t + 2, the system commits the offset, and the source drops the corresponding records.
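Here’s a minimal sketch of this offset tracking in practice, assuming a placeholder broker address and topic name, with the spark-sql-kafka connector package on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-offset-tracking").getOrCreate()

# Each micro-batch covers a deterministic (start, end) offset range per partition.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                       # placeholder topic
    .option("startingOffsets", "earliest")               # where to begin on the very first run
    .load()
)

# The Kafka source exposes the partition and offset of every record; processed
# offsets are recorded in the checkpoint directory so a restarted query knows
# exactly where to resume.
query = (
    events.selectExpr("CAST(value AS STRING) AS value", "partition", "offset")
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/kafka-demo")
    .start()
)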
Apache Flink also offers snapshots to guard computation against failure, but unlike Spark, it lacks a synchronous batch boundary.
Micro-batch optimization
It’s been established that micro-batching provides strong consistency because the beginning and end of each batch are recorded deterministically. This allows any kind of computation to be redone and produce the same results the second time.
Treating each micro-batch as a bounded set of data that can be inspected at its start allows us to perform different optimizations or infer the best way to process that batch of data.
Exploiting this, each micro-batch can be handled according to its specific contents rather than with the general processing used for all possible input. For example, as in the figure below, we could take a sample or compute a statistical measure before deciding to process or drop each micro-batch.
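One way to express this kind of per-batch decision making is foreachBatch, where every micro-batch arrives as an ordinary bounded DataFrame. This is only a sketch: the rate source, the threshold, and the output path are stand-ins.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("per-batch-decisions").getOrCreate()

stream_df = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 100)
    .load()
)

def process_batch(batch_df, batch_id):
    # Inspect the micro-batch before deciding how (or whether) to process it.
    stats = batch_df.agg(F.count("*").alias("n"), F.avg("value").alias("avg_value")).first()
    if stats["n"] == 0:
        return                                   # nothing to do for an empty batch
    if stats["avg_value"] > 500:                 # hypothetical threshold
        batch_df = batch_df.sample(0.1)          # down-sample this batch
    batch_df.write.mode("append").parquet("/tmp/output/sampled")   # placeholder sink

query = stream_df.writeStream.foreachBatch(process_batch).start()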
Same programming API for both batch and stream processing
Compared to alternative frameworks, Spark retains the advantage of a tightly integrated, high-level API for data processing, with minimal changes between batch and streaming. To understand this better, let’s look at a data architecture called Lambda, introduced by Nathan Marz in the blog post “How to beat the CAP Theorem.” This approach combines a batch-processing layer with a stream-processing layer, as shown below.
In this architecture, data follows two paths through the system. The real-time “hot” path prepares data for querying with low latency in seconds or minutes. The hot path has access to the most recent data; therefore, its calculations are accurate over a short time window but may not be accurate overall.
The “cold” batch path prepares data over the entire data window. It has access to all the data before the batch execution, so the calculations are accurate up to the time of the last batch.
Typically, a batch-processing framework like Hadoop handles the batch layer, while a stream-processing framework like Apache Storm, Flink, or Samza handles the stream layer. This requires writing the same processing logic twice, in two different programming APIs.
Apache Spark, as a “Unified Engine for Big Data Processing”, replaces the separate batch and stream processing stacks with a unified set of components that addresses diverse workloads on a single fast, distributed engine.
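In practice, this means the same processing logic can serve both layers. A sketch, assuming a hypothetical events dataset with a timestamp column:

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unified-logic").getOrCreate()

def hourly_counts(df: DataFrame) -> DataFrame:
    # The business logic is written once against the DataFrame API.
    return df.groupBy(F.date_trunc("hour", F.col("timestamp")).alias("hour")).count()

# Cold path: the function applied to the full, bounded dataset...
batch_events = spark.read.parquet("/data/events")        # hypothetical path
batch_view = hourly_counts(batch_events)

# ...hot path: the same function applied to the unbounded stream, with no duplicated logic.
stream_events = spark.readStream.schema(batch_events.schema).parquet("/data/events")
streaming_view = hourly_counts(stream_events)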
Fault tolerance guarantee
To grasp Spark’s fault tolerance guarantee, we have to understand how Spark Structured Streaming works. At a higher level, let's imagine we want to build a data pipeline that involves ingesting a single batch of streaming data, aggregating the data, and saving it in a database. The image below gives a high-level overview of Spark’s streaming architecture.
A Spark application runs as an independent process on a cluster, and it drives the cluster. This process is coordinated by an object in our application called a SparkSession. Your application, or driver, connects to a Spark cluster (led by a master), much like in Apache Storm – another stream-processing tool.
The driver instructs the master (cluster manager) to load the data, transform it, and save it to a database. Of course, the master doesn’t do this by itself; instead, it delegates the work to the workers (also known as executors). The master allocates the resources (memory, CPU, etc.) required for execution and sends the application code to the workers to execute.
Who are the workers? Spark is a distributed system, so in our case, Spark will instruct the workers to ingest our data in a distributed way into partitions in the workers’ memory. A partition is just a dedicated area in a worker’s memory, as shown in the diagram below.
The workers create tasks to read the data – a task is a unit of execution. A task maps to a single core on the worker and works on a single partition of the data.
An executor with two cores can have two tasks working on two partitions in parallel, making the execution of Spark’s tasks highly parallel and fast. Each worker has access to the node’s memory and assigns a memory partition to each task. Each task reads its part of the dataset into its assigned partition and saves it to the database individually.
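A quick way to see the partition/task relationship is to inspect and change the number of partitions of a simple dataset; this is a local sketch rather than a production setup.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-and-tasks").getOrCreate()

df = spark.range(0, 1_000_000)            # a simple in-memory dataset
print(df.rdd.getNumPartitions())          # number of partitions = number of tasks in a stage

# Repartitioning changes how the data is split across executor memory;
# each resulting partition is processed by one task running on one core.
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())         # 8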
There are two takeaways from my explanation above that raise critical questions. These are summarized below.
- A task is a unit of execution that processes the data in a single partition. What happens when a task fails due to an out-of-memory error, a network error, or a data quality problem such as a parsing or syntax error?
- The application is the driver, and the data doesn’t have to come to the driver – the job can be driven remotely. The driver connects to a master and gets a session. The driver hosts the block manager master and knows where each block of data resides in the cluster – it’s where the DAGs, logs, metadata, and scheduling state of the job live. If the driver fails, the stage of the computation, what the computation consists of, and where the data can be found are all lost at once. How do we mitigate this?
Task failure recovery
To mitigate task failures, Spark lets you call cache() or persist() to keep a copy of a computed dataset in the cluster. If a task fails, Spark can fall back on that copy instead of recomputing the input by replaying the lineage recorded in the DAG of user-specified computations.
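A minimal sketch of this, with a stand-in dataset and transformation:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

df = spark.range(0, 1_000_000).selectExpr("id", "id * 2 AS doubled")

# Persisting materializes the result on the executors (memory first, spilling to
# disk). A retried task can reuse this copy instead of replaying the lineage in the DAG.
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()    # the first action populates the cache
df.count()    # later actions (and retried tasks) read from it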
Driver failure recovery
As opposed to driving the job remotely from a user’s personal computer (client mode), Spark also allows the application (driver) to be hosted on the cluster (cluster mode). This adds the ability to automatically restart the driver when a failure of any kind is detected. However, this restart is from scratch, as any temporary state of the computation will have been lost.
To avoid losing the intermediate state in the case of a driver crash, Spark also offers checkpointing, which saves a snapshot of the intermediate state to disk.
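For a streaming query, this means pointing the query at a checkpoint directory. A sketch using the built-in rate source and a placeholder path:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("driver-recovery").getOrCreate()

events = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

counts = events.groupBy(F.window("timestamp", "1 minute")).count()

# The checkpoint directory stores the source offsets and the intermediate
# aggregation state. Restarting this query with the same checkpointLocation
# after a driver crash resumes from that state instead of starting from scratch.
query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/driver-recovery")
    .start()
)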
Query optimization
Spark Structured Streaming offers a unified API in addition to the traditional batch mode. It fully benefits from the performance optimizations introduced by Spark SQL, such as query optimization and Tungsten code generation, while using fewer resources. All you need to do is write the code; the optimizer takes the query and converts it into an execution plan. The query goes through four transformation phases, highlighted below:
- Analysis
- Logical optimization
- Physical planning
- Code generation
To illustrate these query optimization phases, let us consider this PySpark code:
customer_data = spark.read.parquet("./customer.parquet")
sales_data = spark.read.parquet("./sales.parquet")

joined_df = (
    customer_data
    .join(sales_data, customer_data["id"] == sales_data["id"])
    .filter(sales_data["date"] > "2015-01-01")
)
This code joins the customer and sales datasets on a common column called id and then filters for when the date of the sale is greater than “2015-01-01”.
After going through an initial analysis phase, the query plan is transformed and rearranged by the catalyst optimizer as shown below.
Analysis
The analysis phase resolves column and table names by consulting an internal catalog – a programmatic interface to Spark SQL that holds the names of columns, data types, functions, tables, databases, and so on. Once the names are resolved, the query proceeds to the next phase.
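That catalog is also exposed programmatically; for instance (with “customers” as a hypothetical registered table):

spark.catalog.listTables()                  # tables the analyzer can resolve names against
spark.catalog.listColumns("customers")      # columns and data types of one table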
Logical optimization
The logical optimization phase constructs a set of candidate plans using a standard rule-based approach and then, using its cost-based optimizer (CBO), assigns a cost to each plan. These plans are laid out as operator trees (see above) and may include rules such as pushing filters earlier, projection pruning (reducing columns), and Boolean expression simplification. The resulting logical plan is the input to physical planning.
Physical planning
In this phase, Spark SQL generates an optimal physical plan for the selected logical plan, using physical operators that match those available in the Spark execution engine.
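You can inspect the output of these phases yourself. Assuming the joined_df from the snippet above, explain(True) prints the parsed and analyzed logical plans, the optimized logical plan produced by Catalyst, and the physical plan selected for execution:

joined_df.explain(True)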
Armed with this optimizer, a developer only needs to concentrate on writing the code, while Spark takes care of the rest. Apache Flink also has an optimizer similar to Apache Spark’s Catalyst optimizer.
Scalable and fault-tolerant Spark Structured Streaming
Spark’s stream-processing model natively supports batch and streaming workloads. It challenges traditional stream-processing systems with the idea of micro-batch stream processing. Alongside quick and efficient fault tolerance, Spark Structured Streaming offers strong consistency and a tightly integrated, high-level API for data processing with minimal changes between batch and streaming.
Implementing streaming projects with Spark Structured Streaming requires expertise. To learn more, get in touch – we’d love to help.