Skip to main content

Command Palette

Search for a command to run...

Read performance concepts

Published
2 min read
P

I am a data engineer at Tesco and this blog is part of a mentoring process to track the progress of my career development journey.

1. Optimizing Schema Interference Overhead

Schema inference can add overhead to read operations, especially when dealing with large datasets. To optimize this:

  • Explicit Schema Definition: Define the schema explicitly instead of relying on Spark to infer it. This reduces the time spent on schema inference.

      from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
      schema = StructType([
          StructField("id", IntegerType(), True),
          StructField("name", StringType(), True)
      ])
    
      df = spark.read.schema(schema).csv("path/to/csv")
    
  • Schema Caching: Cache the schema if it needs to be inferred once and used multiple times. This avoids repeated schema inference overhead.

2. Caching

Caching frequently accessed data (if you will use it 3 times adn more) can significantly improve read performance by reducing the need to repeatedly read from disk.

  • In-Memory Caching: Use Spark's cache() or persist() methods to store DataFrames in memory: df.cache()

  • Efficient Storage Levels: Choose appropriate storage levels based on your use case (e.g., MEMORY_ONLY, MEMORY_AND_DISK): df.persist(StorageLevel.MEMORY_AND_DISK)

3. Column Elimination

Column elimination helps reduce I/O by only reading the necessary columns from disk. Always select only the columns you need for your operations: df.select("id", "name")

  • Parquet Format: Use columnar storage formats like Parquet, which support efficient column pruning.

      df = spark.read.parquet("path/to/parquet")
    

4. Row Elimination or Predicate Pushdown

Predicate pushdown allows Spark to filter data at the source, reducing the amount of data read into memory.

  • Filter Early: Apply filters as early as possible in your query to take advantage of predicate pushdown.

      df = df.filter(df["age"] > 30)
    
  • Use Supported Formats: Use data formats that support predicate pushdown, such as Parquet and ORC.

      df = spark.read.parquet("path/to/parquet").filter("age > 30")
    

By implementing these strategies, you can optimize read performance in Spark, making your data processing more efficient and faster.