ORC format

Questions:

What is compression?

What is schema evolution?

What is ORC format?

What is the difference between row and columnar formats?

Answers:

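Compression reduces the size of stored data by encoding it more compactly. Parquet compresses very efficiently because values of the same column and type are stored together. Avro is a row-based binary format; compression is optional there (codecs such as deflate or snappy can be applied per block) and usually achieves a lower ratio than a columnar format.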
Schema evolution is a feature used to accommodate data as it changes over time. The schema of a dataset is the set of its column names and types. Schema evolution lets users automatically adapt the schema, for example to add new columns, during an append or overwrite operation.
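
A minimal sketch of the idea in plain Python, not tied to any particular engine: an appended batch brings a new column, the schemas are merged, and older rows get None for the column they never had. The helper names merge_schemas and conform are hypothetical and exist only for this illustration.

```python
def merge_schemas(old, new):
    """Return a schema with the columns of both inputs (old columns first)."""
    merged = dict(old)
    for name, dtype in new.items():
        if name in merged and merged[name] != dtype:
            # Conflicting types are not resolved in this sketch.
            raise TypeError(f"column {name!r}: {merged[name]} vs {dtype}")
        merged.setdefault(name, dtype)
    return merged


def conform(rows, schema):
    """Give every row all columns of the schema; missing values become None."""
    return [{name: row.get(name) for name in schema} for row in rows]


old_schema = {"id": "int", "name": "string"}
new_schema = {"id": "int", "name": "string", "country": "string"}

old_rows = [{"id": 1, "name": "Ann"}]
new_rows = [{"id": 2, "name": "Bob", "country": "CZ"}]

# Appending the new batch evolves the schema by one column.
evolved_schema = merge_schemas(old_schema, new_schema)
table = conform(old_rows, evolved_schema) + conform(new_rows, evolved_schema)
```

Real engines apply the same principle at the file or table level; type changes and column removals need extra rules that this sketch leaves out.
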
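ORC (Optimized Row Columnar) stores data in a series of stripes, where each stripe is a collection of rows (about 64 MB by default). Inside a stripe the data is laid out column by column: index data, then the row data as separate streams per column, then a stripe footer. Column values are encoded with techniques such as dictionary encoding and run-length encoding, and the streams can additionally be compressed with a general-purpose codec such as zlib or Snappy. Predicate filtering is not a compression technique: ORC keeps min/max statistics in its indexes so that readers can skip stripes and row groups that cannot match a filter. ORC also stores metadata about the file, such as the schema and column statistics, in a footer at the end of the file.

Difference between ORC and Parquet: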

ORC (the successor of RCFile) is optimized for reads and writes in Hive, offers high compression, and suits read-heavy workloads over large data volumes. Parquet was designed for writing large datasets with many columns, is very effective for “write-once, read-many” workloads, and has flexible support for complex nested data structures. Parquet is well suited to analytical tasks that need to read only a few columns.
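
The dictionary and run-length encodings mentioned for ORC can be sketched in a few lines of Python. This is a toy illustration of the two ideas, not ORC's actual byte-level encoding; the function names are made up for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index_of, codes = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


country = ["CZ", "CZ", "CZ", "DE", "DE", "CZ"]

dictionary, codes = dictionary_encode(country)  # ["CZ", "DE"], [0, 0, 0, 1, 1, 0]
runs = run_length_encode(codes)                 # [(0, 3), (1, 2), (0, 1)]

# Decoding reverses both steps and restores the column.
restored = [dictionary[code] for code in run_length_decode(runs)]
```

Both encodings pay off because a column holds values of one type that often repeat, which is exactly what a columnar layout puts next to each other.
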
A row format stores data row by row. Its benefits are simplicity and fast writes and reads of complete records (good for OLTP, Online Transaction Processing, systems). Its drawbacks are low compression (values of different types are stored together) and that a query must read entire rows even when it needs only a few columns (bad for analysis).

A columnar format stores all values of one column together. Its benefits are high compression (each column can use an encoding suited to its data type) and fast reads of individual columns (good for OLAP, Online Analytical Processing, systems). Its drawbacks are slower writes and a more complex implementation.
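
As a rough in-memory analogy (real formats work on bytes on disk, so this only illustrates the access patterns), the same three records can be held in a row layout and a columnar layout. The data and the helper to_columns are invented for the example.

```python
rows = [
    {"id": 1, "name": "Ann", "amount": 10.0},
    {"id": 2, "name": "Bob", "amount": 25.5},
    {"id": 3, "name": "Eve", "amount": 4.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# OLTP-style access: fetch one complete record. The row layout has it in one place.
record = rows[1]

# OLAP-style access: aggregate a single column. The columnar layout touches only
# the "amount" values and never looks at "id" or "name".
total = sum(columns["amount"])

# Writing a new record is one append in the row layout, but one append per
# column in the columnar layout, which is why columnar writes cost more.
```
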

Joins

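BroadcastHashJoin (when one table is small enough to be copied to every executor) .hint("broadcast")

ShuffledHashJoin (hashing technique; usable when we know the data is not heavily skewed, because each partition of the smaller side is held in memory as a hash table; sort merge is preferred by default) .hint("shuffle_hash")

SortMergeJoin (sorting technique; the default for equi-joins of two large tables) .hint("merge")

CartesianProduct (inner join without an equality condition, i.e. a cross join) .hint("shuffle_replicate_nl")

Broadcast Nested Loop Join (fallback for joins without an equality condition when one side can be broadcast)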
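
The hashing and sorting techniques behind these strategies can be sketched on plain Python lists. This is a single-process illustration of the two algorithms, not Spark's implementation, which also distributes the rows across partitions; the table and column names are made up.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner equi-join: build a hash table on `right`, then probe it with `left`."""
    table = defaultdict(list)
    for rrow in right:
        table[rrow[key]].append(rrow)
    return [{**lrow, **rrow} for lrow in left for rrow in table.get(lrow[key], [])]


def sort_merge_join(left, right, key):
    """Inner equi-join: sort both sides by key, then merge with two cursors."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][key], right[j][key]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lkey:
                j_end += 1
            # Pair every left row carrying this key with the whole right run.
            while i < len(left) and left[i][key] == lkey:
                out.extend({**left[i], **rrow} for rrow in right[j:j_end])
                i += 1
            j = j_end
    return out


orders = [
    {"cust": 2, "item": "pen"},
    {"cust": 1, "item": "ink"},
    {"cust": 2, "item": "cap"},
    {"cust": 3, "item": "pad"},  # no matching customer, dropped by the inner join
]
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]

hashed = hash_join(orders, customers, "cust")
merged = sort_merge_join(orders, customers, "cust")
```

Both functions return the same set of joined rows, only in a different order. The hash join needs the build side to fit in memory, while the sort-merge join pays for sorting but then streams through both inputs.
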
