The Spark Ecosystem and Its Rarely Elaborated Components
An ecosystem, in the true sense, defines how harmony is achieved among its components. We've spotlighted those heroes, with an interesting example for you.
The article's title is justified because, when I was studying Spark's ecosystem, I had a lot of questions about which components are siblings of each other and which are nested inside one another, along with the occasional "Does such a component actually exist?" moment.
For the uninitiated, Apache Spark is the modern, better alternative to good old Hadoop MapReduce.
It was a roller coaster trying to find one universal diagram that encapsulated all of the necessary pieces, so I decided to grab a FigJam board and make it happen myself. That's my story, but what's yours, or rather, what should it be?
Let's take a step back and understand why you should study the ecosystem at all. One of the reasons is that I've written this article, but the others are as follows:
Studying Spark's ecosystem helps you design better data pipelines and architectures that can handle real-world data challenges effectively.
Familiarity with this ecosystem allows you to see how different data processing tasks interconnect, helping you create comprehensive solutions that leverage multiple functionalities without switching between disparate systems.
Knowledge of how Spark operates allows you to make informed decisions about data architecture and technology stacks, leading to cost savings through optimized resource allocation and improved performance of data processing workflows.
What makes this Article different?
This article will also help you answer a lot of interview-style questions that are not so easily found on the web, because we go beyond roles and responsibilities.
We've also covered what the Spark Engine actually consists of and what its nested components actually do, which is something most articles unfortunately miss out on.
We'll understand how things interconnect, in the true sense required for the ecosystem to actually be called an ecosystem. Let's go!
#0 — Introduction to Apache Spark Eco-System / Spark Stack
The Apache Spark Ecosystem is the collection of components that make up the entire open-source platform for big data processing.
The components are often stacked on top of each other, representing the levels of abstraction exposed to the user / programmer, and they also communicate with each other to make distributed computing happen.
The components are listed as follows:
- High-Level APIs
- Spark Core
  - Low-Level APIs
  - Spark Engine
    - DAG Scheduler
    - Spark Driver
- Cluster Manager
In many of the diagrams found on the web, you'll notice differences like the ones listed below:
The Cluster Manager component is skipped many times.
Spark Core and the Spark Engine are shown as separate components.
The DAG Scheduler and the Spark Driver are not shown as part of the Spark Engine, even though they play a huge role in making distributed computing "efficient" in the true sense by fulfilling their responsibilities.
Also note that Spark does not ship with a storage layer of its own; hence it is often paired with storage solutions like Amazon's Redshift / S3, Azure Blob Storage, Google Cloud Storage, etc.
You might come across a lot of new and different terms that support the explanations of each of the components, so buckle up!
#1 — High Level APIs
High-Level APIs are the entry point for programmers to access distributed big-data processing.
They act as the user-friendly interface, providing the libraries and components for the programmer to interact with Apache Spark while abstracting away all the logical complexities that one should not have to bother about.
This way, programmers can productively focus on writing the code and logic for the required processing, and the code will be translated by Apache Spark itself, which makes the magic happen.
Programmers have the liberty to code in any of the following languages: Python, Java, Scala, or R, though again there are a few exceptions.
Commonly talked about API Components:
Spark SQL: for querying structured data.
Spark Streaming: for handling real-time stream data.
MLlib: for Machine Learning and Data Science related tasks.
GraphX: for Graph processing related activities.
If you're interested in the more diversified operations that Spark can conduct, this table of the other API components could be helpful; right after it, a short sketch shows two of these high-level entry points in action.
| Components | High-Level APIs | Low-Level APIs |
| --- | --- | --- |
| PySpark | DataFrame API, Dataset API | RDD API |
| Spark SQL | SQL API, DataFrame API | |
| Spark Streaming | Structured Streaming API | DStream API |
| MLlib | DataFrame API | RDD API |
| GraphX | GraphFrames API | GraphX API (Scala only) |
| Delta Lake | Delta Table APIs (via DataFrame API, SQL API) | |
| Koalas (now Pandas API on Spark) | Pandas API on Spark | |
| PySpark.ML (ML Pipelines) | Pipeline API for creating machine learning workflows | |
| Structured Streaming | Structured Streaming API (DataFrame-based) | |
| Spark Connect | Remote Spark Session API | |
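To make the "high-level" part concrete before we move on, here's a minimal sketch of the same aggregation expressed through two of these entry points, the DataFrame API and Spark SQL. The SparkSession setup, the tiny in-memory dataset, and the column names are all assumptions made purely for illustration.

from pyspark.sql import SparkSession

# Assumed setup: a SparkSession and a made-up, in-memory dataset
spark = SparkSession.builder.appName("high-level-api-demo").getOrCreate()
employees = spark.createDataFrame(
    [("Asha", "Engineering", 85000), ("Ravi", "Sales", 60000), ("Mei", "Engineering", 92000)],
    ["name", "department", "salary"],
)

# 1) DataFrame API: average salary per department
employees.groupBy("department").avg("salary").show()

# 2) Spark SQL: the same result, expressed declaratively
employees.createOrReplaceTempView("employees")
spark.sql("SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department").show()

Both snippets compile down to the same kind of logical plan; which one you pick is largely a matter of taste and team convention.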
#2 — Spark Core
The Spark Core component, as the name suggests, is the heart of the Spark ecosystem, where the under-the-hood execution of distributed computing takes place. It is abstracted from the user; to some extent, depending on your needs [ and I'll explain why ].
The responsibilities of the Spark Core component are as follows (a small sketch of where some of them surface follows the list):
Task Scheduling & Execution
Memory Management
Fault Tolerance
Data Handling
RDD Implementation
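Most of these responsibilities surface to the programmer only indirectly. Here's a small, hedged sketch of where two of them, memory management and fault tolerance, become visible; the SparkSession named spark is assumed to already exist (e.g., in a notebook), and the data is made up.

# Memory management: ask Spark Core to keep this dataset in executor memory
events_df = spark.range(10_000_000).withColumnRenamed("id", "event_id")
events_df.cache()    # storage is handled by Spark Core's memory manager

events_df.count()    # first action: computes the data and populates the cache
events_df.count()    # second action: served from the cache, no recomputation

# Fault tolerance: if an executor holding cached partitions is lost,
# Spark Core recomputes only the lost partitions from the recorded lineage.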
The two sub-components within the Spark Core component are the Low-Level APIs and the Spark Engine, and we'll learn more about them soon.
#2.1 — Low-Level APIs
Low-Level APIs are interfaces that give programmers direct access to the foundational abstractions of Spark Core, like RDDs (Resilient Distributed Datasets) and DStreams.
You could say, in some sense, that by using Low-Level APIs you're bypassing the abstractions provided by the High-Level APIs to get access to the internal complexities of Spark Core.
What do Low-Level APIs do, and when should you be using them? (A minimal RDD sketch follows at the end of this list.)
When High-Level APIs fall short
This is ideal for scenarios where no high-level abstraction like DataFrame or SQL can meet the requirements.
Maybe you need to conduct complex transformations requiring iterative or recursive computations, apply custom partitioning logic, or implement specific distributed algorithms.
When Fine-Grained Control is required
- If the use case demands direct manipulation, Low-Level APIs allow developers to define how data is partitioned, persisted, and transformed across the Spark cluster, to define storage levels, or to build computation pipelines.
Fault-Tolerance and Performance Tuning
RDDs are inherently fault-tolerant and ensure lineage tracking, making it easier to recompute data in case of node failures.
If high-level APIs are introducing unnecessary overhead, switching to RDDs or DStreams can sometimes offer better performance.
Learning and Understanding Internals
- Low-Level APIs are a great way to learn how Spark works under the hood, such as task scheduling, data partitioning, and DAG execution.
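To make the "fine-grained control" point tangible, here is a minimal RDD sketch; the data, the two-partition split, and the partitioning rule are assumptions made for illustration, not a recommendation.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("low-level-api-demo").getOrCreate()
sc = spark.sparkContext

# Build a pair RDD directly, bypassing the DataFrame / SQL abstractions
orders = sc.parallelize([("EU", 120.0), ("US", 80.0), ("EU", 45.5), ("APAC", 60.0)])

# Fine-grained control: decide the partitioning rule and the storage level yourself
partitioned = orders.partitionBy(2, lambda region: 0 if region == "EU" else 1) \
                    .persist(StorageLevel.MEMORY_AND_DISK)

# Inspect how the records landed in each partition
print(partitioned.glom().collect())

# A classic low-level pipeline: transformation (lazy) followed by an action
totals = partitioned.reduceByKey(lambda a, b: a + b)
print(totals.collect())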
#2.2 — Spark Engine
The Spark Engine is just like the engine of any vehicle. Without it, all the body kit and parts (the High-Level APIs, the Low-Level APIs, and the rest of Spark Core) are nothing.
In very plain, layman's terms, the primary task of the Spark Engine is to perform the physical execution of the application's tasks, a.k.a. the Spark job.
Let's understand how the execution of a job happens with the example of a PySpark job where we're simply loading some dummy data of 50 employees.
Consider you're executing the following Spark job in a Databricks notebook:
# employees_50_file_path is assumed to be defined in an earlier cell,
# pointing at the location of the 50-employee CSV file

# Read the CSV file into a DataFrame
employees_50__df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(employees_50_file_path)

# Show the contents of the DataFrame
# employees_50__df.show()
As programmers, we have used a High-Level API [ PySpark ] to interact with Apache Spark and its brain, which is Spark Core together with the Spark Engine, using PySpark's methods or functions, and the above code is submitted to Spark.
The methods / functions of PySpark (and of any other High-Level API) can be categorized into the following 3 categories, and their roles will come into the picture soon:
Transformations [ define a new dataset from an existing one and create RDDs ]
Actions [ trigger execution, create the DAG, and return a result ]
Neither of the two [ create RDDs but are neither transformations nor actions ]
When the above code is submitted to Spark, you will get no output despite an error-free execution.
But behind the scenes, Spark has created a "logical plan" of the job, which can be thought of as the transformations' sequence OR the RDD lineage, but no actual execution has happened yet.
In the provided code, spark.read, .format(), .option() and .load() are, unfortunately, neither transformations nor actions.
For the successful execution of the Spark job as per our intentions, Spark requires the generation of a DAG (Directed Acyclic Graph), which represents the physical plan and which is initiated by an action ONLY.
In the provided code, .show() is an action.
And since we have commented out the .show() call, the action that should trigger the physical execution, the conversion of the logical plan (transformations' sequence OR RDD lineage) into the physical plan (the DAG of the Spark job) never happens.
If we un-comment the .show() action, the physical execution will begin after the creation of the final RDD, followed by the creation of the DAG. This is where the main responsibilities of the Spark Engine start.
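To see the categories in action against the same DataFrame, consider the following sketch; the department column and its values are hypothetical, used only to show what actually triggers execution.

# Transformation: only extends the logical plan / RDD lineage, nothing runs yet
engineers_df = employees_50__df.filter(employees_50__df["department"] == "Engineering")

# Still no job has been launched at this point -- evaluation is lazy

# Action: this line makes Spark build the DAG and physically execute the job
engineers_df.show()

# Another action: triggers a separate job of its own
print(engineers_df.count())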
The Spark Engine in turn consists of 2 sub-components, which are the DAG Scheduler and the Spark Driver.
#2.2.1 — DAG Scheduler
The DAG Scheduler is responsible for managing the execution of a Spark application by transforming the sequence of RDD transformations & actions into a DAG of stages and tasks. Its responsibilities are listed below, followed by a small sketch:
- Stage Division
It divides the DAG into multiple stages based on the dependencies between tasks.
Each stage consists of tasks that can be executed in parallel.
- Task Scheduling
The DAG Scheduler generates TaskSets, which are collections of tasks corresponding to each stage.
These TaskSets are then passed on to the Task Scheduler, which launches them for execution on available worker nodes (executors).
- Dependency Management
It manages dependencies between stages, ensuring that tasks are executed in the correct order based on their dependencies.
This is crucial for maintaining data integrity and correctness during execution.
- Monitoring and Fault Tolerance
The DAG Scheduler monitors task execution and can handle failures by rescheduling tasks if necessary, leveraging the lineage information stored in the DAG to recompute lost data.
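You can get a feel for what the DAG Scheduler works with by printing the physical plan. In the sketch below (the department column is again hypothetical), the groupBy forces a shuffle, and the Exchange operator in the printed plan marks the boundary between two stages:

# A groupBy introduces a wide dependency, so the job is split into stages
dept_counts = employees_50__df.groupBy("department").count()

# Print the physical plan; the Exchange operator is the shuffle / stage boundary
dept_counts.explain()

# Action: the DAG Scheduler builds the stages, emits TaskSets for each one,
# and hands them to the Task Scheduler to run on the executors
dept_counts.show()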
#2.2.2 — Spark Driver
The Spark Driver, which is part of the Spark Engine, is a Java process that runs the main application code (the main() method) and coordinates the execution of tasks in a Spark application.
It acts as the central point for controlling and managing the entire workflow of the application.
If you're using Python to interact with Apache Spark (PySpark), your Application Master will have PySpark's main() function as the application driver rather than the JVM's main() function, though the internal conversion to the JVM eventually happens.
But if you're using Scala, then the JVM's main() function will directly be the application driver, and PySpark's main() does not come into the picture at all.
The role of the Spark Driver is as follows (a minimal driver-script sketch follows the list):
- Resource Management:
- It negotiates with the cluster manager (such as YARN, Mesos, or Standalone Scheduler) to allocate resources needed for executing tasks.
- Task Distribution:
- The driver divides the application into smaller tasks and distributes them across worker nodes (executors) in the cluster.
- Monitoring Execution:
- It monitors the execution of tasks, handles failures by re-launching tasks if necessary, and aggregates results from executors to provide final output to the user.
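As a rough sketch of what "the driver runs your main()" means in practice, here is a minimal standalone PySpark script; everything in it runs in the driver process except the distributed computation itself, and the app name is of course made up.

from pyspark.sql import SparkSession

def main():
    # This process is the driver: it builds the plans, negotiates resources
    # with the cluster manager, distributes tasks to executors, and receives results
    spark = SparkSession.builder.appName("driver-demo").getOrCreate()

    df = spark.range(1_000_000)                                      # computed on executors
    total = df.selectExpr("sum(id) AS total").collect()[0]["total"]  # result returns to the driver
    print(f"Sum of ids: {total}")

    spark.stop()

if __name__ == "__main__":
    main()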
#3 — Cluster Manager
The Cluster Manager in Apache Spark is a critical component responsible for managing resources across a cluster to ensure efficient execution of Spark applications.
The cluster manager may be Spark's own standalone scheduler, YARN (Yet Another Resource Negotiator, from the Apache Hadoop ecosystem), Mesos, or even a Kubernetes cluster.
Support for Apache Mesos was deprecated in Apache Spark 3.2.0 and will be removed in a future version.
The responsibilities of the Cluster Manager are as follows (a small configuration sketch follows the list):
- Resource Allocation:
The primary function of the Cluster Manager is to allocate resources (CPU, memory) to various Spark applications running on the cluster.
It ensures that each application has the necessary resources to execute effectively.
- Management of Nodes:
The Cluster Manager oversees both master and worker nodes within the cluster.
It manages the allocation of tasks to worker nodes, ensuring that they operate efficiently and effectively.
- Execution Coordination:
- It coordinates the execution of Spark applications by managing the lifecycle of executors, which are processes that run computations and store data for the application.
- Handling Failures:
- The Cluster Manager monitors the health of nodes and can recover from failures by restarting failed tasks or reallocating resources as needed.
- User Interaction:
Users submit their Spark applications to the Cluster Manager, specifying resource requirements for drivers and executors.
The Cluster Manager then allocates these resources accordingly.
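The choice of cluster manager and the resource requests are typically declared when the SparkSession (or the spark-submit command) is created. A hedged sketch in PySpark, where the master URL and the numbers are placeholders rather than recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    # Which cluster manager to talk to: "yarn", "k8s://https://<host>:<port>",
    # "spark://<host>:7077" for standalone, or "local[*]" for local testing
    .master("yarn")
    # Resource requests the Cluster Manager will try to satisfy
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.memory", "2g")
    .getOrCreate()
)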
Now that we've understood the ecosystem with utmost clarity, learning about the internal architecture of Apache Spark will be a piece of cake.
Do not mistake the ecosystem's diagram for the internal architecture's diagram; you'll soon have an article on that too.
Till then, see ya.