Distributed computing


I am a data engineer at Tesco and this blog is part of a mentoring process to track the progress of my career development journey.

a) What is a Distributed System?
b) How is a Distributed System different from a centralized system?
c) Characteristics of Distributed Systems:
- Resource Sharing
- Parallel Processing
- Scalability (Vertical vs Horizontal)
- Fault Tolerance (Point of Failures)
d) Challenges of a Distributed System
- Proper Distribution of Data
- Maintaining Data Consistency across all nodes
- Learning Curve
- Anything else?
e) CAP Theorem
f) Distributed Database - ACID vs BASE databases
g) Distributed Data Processing - MapReduce
h) Types of Distributed Data Processing - Batch vs Stream

What is a Distributed System?

A distributed system performs parallel computation by dividing data processing work among several different computers. One computer, called the driver, distributes the work to the other nodes (workers, executors). When one of the nodes stops working, its work can be rerun on another node. The worker computers are on the same level, so any of them can take over a task.
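The driver and worker idea above can be sketched on a single machine. This is only a minimal illustration, not how a real cluster framework is built: threads stand in for worker nodes, and the names `process_chunk`, `run_on_cluster` and `failed_once` are hypothetical. The first attempt on chunk 1 fails on purpose to imitate a lost node, and the driver reruns that chunk.

```python
from concurrent.futures import ThreadPoolExecutor

# Remembers which chunks have already failed once (simulation only).
failed_once = set()

def process_chunk(chunk_id, numbers):
    """Worker task: sum one chunk. Chunk 1 fails on its first attempt."""
    if chunk_id == 1 and chunk_id not in failed_once:
        failed_once.add(chunk_id)
        raise RuntimeError("worker lost")
    return sum(numbers)

def run_on_cluster(chunks, workers=3, max_attempts=2):
    """Driver: hand each chunk to a worker and rerun chunks whose worker failed."""
    results = {}
    pending = dict(enumerate(chunks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_attempts):
            futures = {
                cid: pool.submit(process_chunk, cid, nums)
                for cid, nums in pending.items()
            }
            pending = {}
            for cid, future in futures.items():
                try:
                    results[cid] = future.result()
                except RuntimeError:
                    # The work is not lost: schedule it again on any worker.
                    pending[cid] = chunks[cid]
            if not pending:
                break
    if pending:
        raise RuntimeError(f"chunks still failing: {sorted(pending)}")
    return sum(results.values())

chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
total = run_on_cluster(chunks)
print(total)
```

Because the workers are interchangeable, the driver does not care which one picks up the rerun chunk.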

How is a Distributed System different from a Centralized System?

In a centralized system, it is not possible to replace one node with another. Every part has a specific function.

Characteristics of Distributed Systems:

Resource sharing: nodes run on a cluster of computers and share its resources.

Parallel processing: the data is processed in parallel, so nodes do not wait for one another.

CAP Theorem

The CAP theorem (Eric Brewer, 1999) describes three properties of a distributed data store: Consistency, Availability and Partition Tolerance. Only two of them can be guaranteed at once. ref_CAP1

The CAP theorem states that a distributed system can only provide two of the three characteristics of Consistency, Availability, and Partition Tolerance. ref_CAP2

Consistency means every node holds the same data. Availability means you will always get a response with data. A partition is a lost or delayed connection between nodes, and partition tolerance means the system keeps working despite it.
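The trade-off can be shown with a toy model. The sketch below is an assumption-heavy illustration, not a real database: `TinyStore` and `Replica` are hypothetical names, there are only two replicas, and the partition is a simple flag. In "CP" mode the store refuses writes during a partition to stay consistent; in "AP" mode it accepts them and the replicas diverge.

```python
class Replica:
    """One node holding a copy of the data."""
    def __init__(self):
        self.data = {}

class TinyStore:
    """Two replicas, a and b; clients write through replica a."""
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True = a and b cannot talk to each other

    def write(self, key, value):
        if not self.partitioned:
            # Healthy network: both replicas receive the write.
            self.a.data[key] = value
            self.b.data[key] = value
            return True
        if self.mode == "CP":
            # Keep the replicas identical by giving up availability.
            return False
        # AP: stay available, accept that b now holds stale data.
        self.a.data[key] = value
        return True

    def read_from_b(self, key):
        return self.b.data.get(key)

cp = TinyStore("CP")
cp.write("x", 1)
cp.partitioned = True
cp_accepted = cp.write("x", 2)   # rejected, both replicas still hold 1

ap = TinyStore("AP")
ap.write("x", 1)
ap.partitioned = True
ap_accepted = ap.write("x", 2)   # accepted, a holds 2 while b still holds 1

print(cp_accepted, cp.read_from_b("x"))
print(ap_accepted, ap.a.data["x"], ap.read_from_b("x"))
```

In both modes the system tolerates the partition; the difference is whether it sacrifices availability or consistency while the partition lasts.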

It is similar to project management, where we have money, time and quality, and these three properties cannot all be optimized at the same time.

Distributed Database - ACID vs. BASE databases

ACID databases are transactional databases with the properties: Atomicity, Consistency, Isolation and Durability. ref_ACID

ACID databases prioritize consistency over availability. In contrast, BASE databases prioritize availability over consistency. Instead of failing the transaction, a BASE database lets users access inconsistent data temporarily. Data consistency is still achieved, but not immediately. ref_BASE

A transaction is the smallest unit of work performed by the database. It can be compared to a wire transfer: money leaves your account, and the transaction is finished only after the money has been transferred to the other account.
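The wire transfer comparison can be tried out with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the `accounts` table, the account names and the `transfer` helper are made up for illustration. The `CHECK (balance >= 0)` constraint makes an overdraft fail, and the transaction then rolls back the credit that was already applied, so the money never exists in only one of the two accounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would overdraw the sender, so the credit is undone too.
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

first = transfer(conn, "alice", "bob", 30)
after_first = balances(conn)
second = transfer(conn, "alice", "bob", 500)
after_second = balances(conn)
print(first, after_first)
print(second, after_second)
```

The failed second transfer leaves the balances exactly as they were after the first one, which is the atomicity property in ACID.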
