Apache Spark Fundamentals
I am a data engineer at Tesco and this blog is part of a mentoring process to track the progress of my career development journey.
a) Why Spark is a better choice over MapReduce?
Spark is faster and is solution for various data processing (batch, stream, real-time..). MapReduce is writing data on disk and Spark is using memory so the data processing can be done quickly. MapReduce is read write intensive and this makes bottleneck becose data is sharing via HDFS. ref_MapReduce
b) RDD and its characteristics
RDD - Resilient Distributed Dataset, data structure distributed along nodes in distributed computing environment. It is a basic data structure, how we store data in memory for spark and how data is stored in memory in spark (even for spark_sql - internaly everything is converted to RDD).
c) RDD vs Dataframe vs Dataset[https://www.databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html]
RDD and DataFrame are two storage organization strategies used in Apache Spark. RDD is a collection of data objects across nodes in an Apache Spark cluster, while a DataFrame is similar to a standard database table where the schema is laid out into columns and rows. A Dataset is a distributed collection of data. Dataset provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A DataFrame is a Dataset organized into named columns. ref_dataset
RDD is beneficial for applications using unstructured data. You would use RDD in Apache Spark for analytics and machine learning. A DataFrame uses structured data, so it’s best used when you know the data type for each column and can fit data into a predefined column. ref_dataframe
d) Transformations vs Actions
Transformation is operation on RDD, not executed immediately, transformation builds logical execution plan, it is used by Spark's Scheduler to optimize execution. Transformations can be Narrow (within same partition, we dont need shuffle, characteristics: locality, efficiency, examples: reduceByKey(), map(), filter(), flatMap()) and Wide (we need to shuffle between partitions, example: groupByKey() - do not do mapping first).
Action is operation that trigger execution (e.g. collect(), last()).
Partition is the smallest part in spark program / job, task where spark operations are applied, spark.default.paralelism -> how many partitions (tasks) will be calculated - it will be distributed to X nodes. A task is the smallest unit of work in Spark, representing a unit of computation that can be performed on a single partition of data.
e) Spark Cluster Architecture
Spark cluster has Master - Slave architecture. The driver runs master node and on worker nodes are running many executors. Driver is the process responsible for coordinating the execution of the Spark application and creates SparkContext and connects it to cluster manager. The driver program divides the Spark job into tasks and assigns them to the executor nodes for execution. Executor is worker process responsible for executing tasks in Spark application. Executors are launched on worker nodes and communicate with driver program and cluster manager. The cluster manager is responsible for allocating resources and managing the cluster. Example is Hadoop YARN or Kubernetes. ref_SparkArchit
f) Spark Job Lifecycle - Logical Plan (DAG) vs Physical Plan
Logical plan is created when transformations are used. The logical plan represents the abstract, high-level representation of the Spark job’s computation. It describes the sequence of operations to be performed on the input data to achieve the desired result. The logical plan describes what computations need to be performed, while the physical plan outlines how those computations will be executed efficiently on the Spark cluster. The the physical plan outlines the actual execution steps that Spark will perform to process the data. These operations are executed on the distributed data across the Spark cluster. ref_logical
g) How does Spark execute the generated physical plan?
The physical plan represents the actual sequence of steps that Spark will take to process the data and compute the desired result. Operations are orchestrated by Spark’s Catalyst optimizer and executed on the Spark cluster’s worker nodes.
h) Spark Client mode vs cluster mode
In the cluster mode is the Spark application run as independent sets of processes on a cluster with bigger performance, coordinated by the SparkContext object in the main program (the driver program). SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos, YARN or Kubernetes), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run. In client mode are driver and workers nodes created localy on the machine, the driver program runs on the client machine where the application is submitted (Edge Node). This means the client machine initiates the SparkContext and submits tasks to the Spark cluster for execution. ref_client ref_cluster
What is DAG? - Directed Acyclic Graph: DAG is the scheduling layer of the Apache Spark architecture that implements stage-oriented scheduling. Compared to MapReduce which creates a graph in two stages, Map and Reduce, Apache Spark can create DAGs that contain many stages.
Yarn Cluster: hdp....:8090/cluster/apps (to see Execution plan)
job -> stage -> tasks for wide transformations it will be broken to 2 stages (each for X tasks / paralelism)