ORC format


I am a data engineer at Tesco and this blog is part of a mentoring process to track the progress of my career development journey.

Questions:

What is compression?

What is schema evolution?

What is ORC format?

What is the difference between row and columnar formats?

Answers:

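  1. Compression reduces the size of stored data by encoding it more compactly. Parquet compresses very efficiently because values of the same column are stored together. Avro is a row-based binary format; it supports block compression codecs (for example deflate and Snappy), but it usually compresses less well than columnar formats.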

  2. Schema evolution is a feature used to accommodate data as its structure changes over time. In a dataset, the schema is the set of column names and types. Schema evolution lets users automatically adapt the schema, for example to add new columns, during an append or overwrite operation.

  3. ORC format (Optimized Row Columnar) stores data in a series of stripes, where each stripe holds a group of rows (64 MB by default). Within a stripe, the data is laid out column by column, so the values of each column are stored together, alongside index data and a stripe footer. Column values are encoded with techniques such as dictionary encoding and run-length encoding, and then compressed with a general-purpose codec such as zlib or Snappy. The min/max statistics kept in the indexes let readers skip data that cannot match a filter (predicate pushdown). ORC also stores metadata about the file, such as the schema, in a footer at the end of the file. Difference between ORC and Parquet:

    ORC (the successor of RCFile) is optimized for reads and writes in Hive, offers high compression, and is good for heavy reads of high data volumes. Parquet was designed to handle writing large datasets with many columns, is very effective for “write-once, read-many” workloads, and has flexible support for complex nested data structures. Parquet is a good fit for analytical tasks that need to read only a few columns.

  4. Row formats store data row by row. The benefits are fast writes, fast reads of complete records (good for OLTP, Online Transaction Processing, systems), and simplicity. The drawbacks are low compression (values of different types are stored next to each other) and that a query always reads whole rows, even when it needs only a few columns (bad for analytics).

    Columnar formats store all values of one column together. The benefits are high compression (each column can use an encoding suited to its data type) and fast reads of individual columns (good for OLAP, Online Analytical Processing, systems). The drawbacks are slower writes and a more complex implementation.
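To make the link between columnar layout and compression more concrete, here is a minimal Python sketch of dictionary encoding and run-length encoding applied to one column. The sample records and the helper names (`to_columns`, `dictionary_encode`, `run_length_encode`, `run_length_decode`) are made up for illustration. This is not how ORC or Parquet implement their encoders, only the general idea.

```python
from itertools import groupby

# Hypothetical sample records, stored row by row: (country, status, amount).
rows = [
    ("UK", "open", 10),
    ("UK", "open", 12),
    ("UK", "closed", 7),
    ("CZ", "closed", 7),
    ("CZ", "open", 7),
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return [list(column) for column in zip(*records)]


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, length in runs for _ in range(length)]


country, status, amount = to_columns(rows)

# Columnar layout: equal values sit next to each other, so runs are long.
country_dictionary, country_codes = dictionary_encode(country)
country_runs = run_length_encode(country_codes)

# Row layout: values of different columns are interleaved, so no runs form.
interleaved = [value for record in rows for value in record]
interleaved_runs = run_length_encode(interleaved)

print(country_dictionary, country_runs)             # 2 runs for 5 values
print(len(interleaved), len(interleaved_runs))      # as many runs as values
```

In this toy data the `country` column shrinks to two runs, while the same values read row by row produce no runs at all, which is the intuition behind "different types of data stored together" hurting compression in row formats.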

Joins

BroadcastHashJoin (when one table is small) .hint("broadcast")

ShuffledHashJoin (hashing technique; good when we are aware of data skewness, but sort merge is preferred) .hint("shuffle_hash")

SortMergeJoin (sorting technique) .hint("merge")

CartesianProduct (inner join without join keys) .hint("shuffle_replicate_nl")

Broadcast Nested Loop Join
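The difference between the hashing and sorting techniques can be sketched in plain Python. This is a single-machine illustration of the two ideas, not Spark's actual implementation (Spark also partitions and shuffles the data across executors). The function names and the `orders` and `stores` sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample tables, joined on "store_id".
orders = [
    {"store_id": 1, "item": "milk"},
    {"store_id": 2, "item": "bread"},
    {"store_id": 2, "item": "eggs"},
    {"store_id": 4, "item": "tea"},
]
stores = [
    {"store_id": 2, "city": "Leeds"},
    {"store_id": 1, "city": "London"},
    {"store_id": 3, "city": "Brno"},
]


def hash_join(left, right, key):
    """Inner equi-join: build a hash table on one side, probe with the other."""
    table = defaultdict(list)
    for row in right:  # build phase, ideally on the smaller side
        table[row[key]].append(row)
    # Probe phase: look up each left row's key in the hash table.
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def sort_merge_join(left, right, key):
    """Inner equi-join: sort both sides by key, then merge with two cursors."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the block of right rows that share this key ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # ... and pair it with every left row that has the same key.
            while i < len(left) and left[i][key] == left_key:
                result.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return result


hashed = hash_join(orders, stores, "store_id")
merged = sort_merge_join(orders, stores, "store_id")
print(hashed)
print(merged)
```

Both functions return the same matches. The hash join needs the build side to fit in memory, while the sort-merge join pays for sorting but then only walks each side once. In Spark itself, the strategy is requested with a hint on one side of the join, for example calling `.hint("broadcast")` on the smaller DataFrame before joining.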

