MIT6.824 Chain Replication

February 8, 2023 · 463 words · 3 min · Distributed System MIT6.824 ChainReplication

This post provides a brief overview of the Chain Replication (CR) paper, which introduces a simple but effective algorithm for providing linearizable consistency in storage services. For those interested in the detailed design, it’s best to refer directly to the original paper.

Introduction

In short, the Chain Replication (CR) paper presents a replicated state machine algorithm designed for storage services that require linearizable consistency. It uses a chain replication method to improve throughput and relies on multiple replicas to ensure service availability.

Chain Replication

The design of the algorithm is both simple and elegant. CR splits the replication workload across all nodes in the chain, with each node responsible for forwarding updates to its successor. Write requests are propagated from the head node to the tail, while read requests are served by the tail node.

To maintain relationships between nodes in the chain, Chain Replication introduces a Master service responsible for managing node configurations and handling node failures.

Failure Handling

Head Failure: If the head node fails, any pending or unprocessed requests are lost, but linearizable consistency remains unaffected. The second node in the chain is promoted to the new head.
Tail Failure: If the tail node fails, the second-to-last node becomes the new tail, and pending requests from the original tail are committed.
Middle Node Failure: When a middle node fails, the chain is reconnected in a manner similar to linked list operations. The previous node (Node_pre) is linked directly to the next node (Node_next). To ensure that no requests are lost during this failure, each CR node maintains a SendReqList that records all requests forwarded to its successor. Since requests are propagated from head to tail, Node_pre only needs to send Node_next any missing data. When the tail node receives a request, it marks it as committed, and an acknowledgment (Ack(req)) is sent back from the tail to the head, removing the request from each node’s SendReqList as the acknowledgment propagates.

Pros and Cons

The main advantages of Chain Replication include:

High Throughput: By distributing the workload across all nodes, CR effectively increases the throughput of a single node.
Balanced Load: Each node has a similar workload, resulting in balanced utilization.
Simplicity: The overall design is clean and straightforward, making it easier to implement.

However, there are some clear disadvantages:

Bottlenecks: If a node in the chain processes requests slowly, it will delay the entire chain’s processing.
Read Limitations: Only the head and tail nodes can serve requests efficiently. The data in the middle nodes is mostly there for replication purposes and not directly utilized for serving requests. However, the CRAQ (Chain Replication with Asynchronous Queries) variant allows middle nodes to serve read-only requests, similar to Raft’s Read Index, which can help alleviate this limitation.

References

MIT6.824 AuroraDB MIT6.824-ZooKeeper