Apache Spark Architecture

In this blog post I will explain the Apache Spark architecture.

Spark Architecture Terminology

Driver Program : The driver program is the main entry point of a Spark application; it runs the main() method and is where the SparkContext is created.

SparkContext : The SparkContext is the gateway to all Spark functionality. It is similar to a database connection: just as every command you run against a database goes through that connection, every Spark operation goes through the SparkContext.
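
To make these two terms concrete, here is a minimal driver sketch in Scala (the application name and master URL are illustrative assumptions, not part of the original post):

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        // main() is the driver program's entry point.
        val conf = new SparkConf()
          .setAppName("MyApp")    // illustrative application name
          .setMaster("local[*]")  // illustrative; on a real cluster this would be a YARN or standalone master URL
        // Creating the SparkContext opens the "gateway" to Spark,
        // much like opening a database connection.
        val sc = new SparkContext(conf)

        // ... every Spark operation from here on goes through sc ...

        sc.stop() // close the gateway and release cluster resources
      }
    }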

Cluster Manager : An external service that manages and allocates resources on the cluster (for example Spark Standalone, YARN, or Mesos).

Worker Node : A node in the cluster that runs the application code.

Executors : Executors are processes launched on the worker nodes that run the individual tasks of a given Spark job.
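
To see how these three pieces fit together, here is a hedged example of submitting an application with spark-submit (the class name, JAR, and resource numbers are illustrative; --num-executors applies when YARN is the cluster manager):

    # --master selects the cluster manager; the cluster manager then
    # launches the requested executors on the worker nodes.
    spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --num-executors 4 \
      --executor-cores 2 \
      --executor-memory 2g \
      my-app.jar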

Task : A task is the smallest unit of work sent to an executor. Each stage consists of a set of tasks, one task per partition, so the same computation runs in parallel over the different partitions of an RDD.
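
The one-task-per-partition rule is easy to observe. In this sketch (the numbers are illustrative), parallelizing a collection into four partitions means any stage over that RDD runs as four tasks:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("PartitionDemo").setMaster("local[*]"))

        // Split 1..100 into 4 partitions: a stage over this RDD runs as 4 tasks.
        val rdd = sc.parallelize(1 to 100, numSlices = 4)
        println(rdd.getNumPartitions) // 4

        // The same map function is applied to each partition, one task each.
        val squared = rdd.map(x => x * x)
        println(squared.sum()) // action: launches the 4 tasks

        sc.stop()
      }
    }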

Step-by-Step Working of Spark

  • Step 01 : The user submits application code containing transformations and actions (see the sketch after this list).
  • Step 02 : The driver converts this code into a DAG (Directed Acyclic Graph) and applies optimizations such as pipelining transformations.
  • Step 03 : The driver then converts the DAG into a physical execution plan, which consists of a set of stages.
  • Step 04 : The driver then breaks the physical plan into execution units called tasks; a set of tasks is created under each stage.
  • Step 05 : The driver program initializes the SparkContext.
  • Step 06 : Through the SparkContext, the driver talks to the cluster manager and asks for resources.
  • Step 07 : The cluster manager in turn launches executors on the worker nodes on behalf of the driver.
  • Step 08 : The driver sends tasks to the executors based on data placement (data locality).
  • Step 09 : When the executors start, they register themselves with the driver, so the driver has a complete view of all executors.
  • Step 10 : The executors start executing the tasks assigned by the driver program, which monitors them throughout execution.
  • Step 11 : When sc.stop() is called, the executors are terminated and the resources are released back to the cluster manager.
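
As a concrete illustration of Steps 01, 02, and 11, here is a minimal sketch in Scala (the log file and its format are illustrative assumptions): the transformations only build up the DAG, the action triggers the actual job, and sc.stop() releases the resources:

    import org.apache.spark.{SparkConf, SparkContext}

    object LogErrorCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("LogErrorCount").setMaster("local[*]"))

        // Transformations: nothing executes yet; Spark only records the DAG.
        val lines  = sc.textFile("app.log")             // illustrative input file
        val errors = lines.filter(_.contains("ERROR"))  // pipelined with the next step
        val pairs  = errors.map(line => (line, 1))

        // Action: the driver now finalizes the DAG, splits it into stages and
        // tasks, and ships the tasks to the executors.
        val total = pairs.count()
        println(s"Error lines: $total")

        sc.stop() // Step 11: terminate executors, release cluster resources
      }
    }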

Looking to learn Big Data? Join our training program, the Best Big Data, Hadoop & Spark Developer Training Program, to learn these concepts in depth according to industry standards.

--

Naveen - (Founder & Trainer @ NPN Training)

A software training institute which believes that technology has to be learnt under experienced practitioners — www.npntraining.com