Spark memory


I am a data engineer at Tesco and this blog is part of a mentoring process to track the progress of my career development journey.

a) Summarize Spark Memory Allocation

The executor's JVM heap is sized by the spark.executor.memory parameter. The heap is divided into Reserved memory, User memory, and Spark memory, which is shared between storage and execution. Memory outside the JVM heap consists of Memory Overhead, Spark Off-Heap memory, and PySpark executor memory, which is where the Python worker processes run.
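As a rough worked example of the heap split, the sketch below computes the size of each region for a given heap. It assumes Spark's documented defaults (300 MB of reserved memory, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, and an overhead of 10% of the heap with a 384 MB minimum). The function name and the 4 GB heap are made up for illustration, and off-heap and PySpark memory are left out.

```python
RESERVED_MB = 300  # reserved memory, fixed by Spark's unified memory manager


def executor_memory_breakdown(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate the on-heap regions for a heap of `heap_mb` megabytes.

    Hypothetical helper: the defaults mirror spark.memory.fraction and
    spark.memory.storageFraction, but the real sizes also depend on the JVM.
    """
    usable = heap_mb - RESERVED_MB            # what is left after reserved memory
    spark_mem = usable * memory_fraction      # unified storage + execution region
    storage = spark_mem * storage_fraction    # storage share protected from eviction
    overhead = max(384, 0.10 * heap_mb)       # assumed default memory overhead
    return {
        "reserved_mb": RESERVED_MB,
        "user_mb": usable - spark_mem,
        "spark_mb": spark_mem,
        "storage_mb": storage,
        "execution_mb": spark_mem - storage,
        "overhead_mb": overhead,
    }


# Example: spark.executor.memory=4g
breakdown = executor_memory_breakdown(4096)
for region, size in breakdown.items():
    print(f"{region}: {size:.1f}")
```

Under these assumptions, a 4 GB heap leaves about 2.2 GB of Spark memory, initially split evenly between storage and execution.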

b) Explain the different measures taken by Spark team to handle contention of memory

Since Spark 1.6, Spark has used a unified memory management model. Execution and storage share a single region, and each side can borrow free memory from the other. Execution can evict cached blocks when it needs room, down to a protected storage threshold, while storage cannot evict execution memory. This dynamic sharing helps prevent out-of-memory errors and uses the heap more efficiently than the fixed split used before 1.6.

The executor memory overhead parameter (spark.executor.memoryOverhead) reserves additional non-heap memory for each executor, covering JVM overheads, interned strings, and other native allocations. Sizing it properly helps prevent executors from being killed for exceeding their container's memory limit.

Spark lets users configure how much of the heap goes to Spark memory (spark.memory.fraction) and how much of that is protected for storage (spark.memory.storageFraction). By tuning these parameters, users can match memory usage to their workload, for example favouring execution for shuffle-heavy jobs or storage for cache-heavy ones.

Efficient serializers such as Kryo reduce the memory footprint of serialized data objects. This matters most when caching or shuffling large datasets.

Spark also provides options to tune the Java garbage collector, which can reduce the time spent in garbage collection pauses and improve performance.
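To make these tuning knobs concrete, here is a minimal sketch that gathers the related settings into one place and renders them as spark-submit arguments. The configuration keys are real Spark properties, but the helper names and the chosen values (1024 MB of overhead, G1 garbage collector) are illustrative assumptions, not recommendations.

```python
def build_memory_conf(overhead_mb=1024, memory_fraction=0.6,
                      storage_fraction=0.5, use_kryo=True,
                      gc_flags="-XX:+UseG1GC"):
    """Collect memory-related Spark settings in a dict (hypothetical helper)."""
    conf = {
        # Extra non-heap memory per executor
        "spark.executor.memoryOverhead": f"{overhead_mb}m",
        # Share of (heap - reserved) given to storage + execution
        "spark.memory.fraction": str(memory_fraction),
        # Share of Spark memory protected from eviction by execution
        "spark.memory.storageFraction": str(storage_fraction),
        # Garbage collector options passed to the executor JVMs
        "spark.executor.extraJavaOptions": gc_flags,
    }
    if use_kryo:
        conf["spark.serializer"] = "org.apache.spark.serializer.KryoSerializer"
    return conf


def to_submit_args(conf):
    """Render the settings as a list of `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


conf = build_memory_conf()
submit_args = to_submit_args(conf)
print(" ".join(submit_args))
```

The same key and value pairs could instead be set on a SparkSession builder or in spark-defaults.conf. Which values work best depends on the workload, so they are worth validating against the Spark UI rather than copying as-is.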

c) Explain Data spill in detail

A data spill occurs when Spark runs out of memory and starts moving data from memory to disk. This process can significantly slow down performance because disk I/O is much slower than memory access. Data spills are most common during data shuffling operations, where data is redistributed across different nodes in the cluster.

Data spills in Spark are typically caused by insufficient memory, large shuffle operations, or data skew.

Insufficient memory: the memory allocated to a Spark job is not enough to hold all the data being processed.

Large shuffle operations: joins, groupBy, and other aggregations shuffle large amounts of data across the network, which leads to spills when that data cannot fit into memory.

Data skew: an uneven distribution of data across partitions makes some partitions much larger than others, so the tasks handling them overflow their memory and spill.
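The effect of data skew can be illustrated with a small calculation. The sketch below takes a list of partition sizes and a per-task memory budget, then reports how uneven the partitions are and which ones would probably spill. The function, the sizes, and the 300 MB budget are all hypothetical. In a real job the sizes would come from the Spark UI or from counting records per partition.

```python
from statistics import mean


def skew_report(partition_sizes_mb, budget_mb):
    """Flag partitions larger than a per-task memory budget (illustrative only)."""
    avg = mean(partition_sizes_mb)
    return {
        # A ratio well above 1 means one partition dominates the others
        "skew_ratio": max(partition_sizes_mb) / avg,
        # Indexes of partitions that exceed the budget and may spill to disk
        "likely_to_spill": [
            i for i, size in enumerate(partition_sizes_mb) if size > budget_mb
        ],
    }


# Three similar partitions and one oversized partition
report = skew_report([100, 120, 110, 900], budget_mb=300)
print(report)
```

In this example only the last partition exceeds the budget, so a single slow, spilling task would hold up the whole stage even though the other three fit comfortably in memory.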
