MIT6.824-ZooKeeper

January 3, 2023 · 399 words · 2 min · Distributed System MIT6.824 ZooKeeper

This article mainly discusses the design and practical considerations of the ZooKeeper system, such as wait-free and lock mechanisms, consistency choices, system-provided APIs, and specific semantic decisions. These trade-offs are the most insightful aspects of this article.

Positioning

ZooKeeper is a wait-free, high-performance coordination service for distributed applications. It supports the coordination needs of distributed applications by providing coordination primitives (specific APIs and data models).

Design

Keywords

There are two key phrases in ZooKeeper’s positioning: high performance and distributed application coordination service.

ZooKeeper’s high performance is achieved through wait-free design, local reads from multiple replicas, and the watch mechanism:

Wait-free requests are handled asynchronously, which may lead to request reordering, making the state machine different from the real-time sequence. ZooKeeper provides FIFO client order guarantees to manage this. Additionally, asynchronous handling is conducive to batch processing and pipelining, further improving performance.
The watch mechanism notifies clients of updates when a znode changes, reducing the overhead of clients querying local caches.
Local reads from multiple replicas: ZooKeeper uses the ZAB protocol to achieve data consensus, ensuring that write operations are linearizable. Read requests, however, are served locally from replicas without going through the ZAB consensus protocol, which provides serializability and might return stale data, improving performance.

The distributed application coordination service refers to the data model and API semantics provided by ZooKeeper, allowing distributed applications to freely use them to fulfill coordination needs such as group membership and distributed locking.

Data Model and API

ZooKeeper provides an abstraction of data nodes called znodes, which are organized through a hierarchical namespace. ZooKeeper offers two types of znodes: regular and ephemeral. Each znode stores data and is accessed using standard UNIX filesystem paths.

In practice, znodes are not designed for general data storage. Instead, znodes map to abstractions in client applications, often corresponding to metadata used for coordination.

In other words, when coordinating through ZooKeeper, utilize the metadata associated with znodes instead of treating them as mere data storage. For example, znodes associate metadata with timestamps and version counters, allowing clients to track changes to the znodes and perform conditional updates based on the znode version.

Essentially, this data model is a simplified file system API that supports full data reads and writes. Users implement distributed application coordination using the semantics provided by ZooKeeper.

The difference between regular and ephemeral znodes is that ephemeral nodes are automatically deleted when the session ends.

Clients interact with ZooKeeper through its API, and ZooKeeper manages client connections through sessions. In a session, clients can observe state changes that reflect their operations.

CAP Guarantees

ZooKeeper provides CP (Consistency and Partition Tolerance) guarantees. For instance, during leader election, ZooKeeper will stop serving requests until a new leader is elected, ensuring consistency.

Implementation

ZooKeeper uses multiple replicas to achieve high availability.

In simple terms, ZooKeeper’s upper layer uses the ZAB protocol to handle write requests, ensuring linearizability across replicas. Reads are processed locally, ensuring sequential consistency. The underlying data state machine is stored in the replicated database (in-memory) and Write-Ahead Log (WAL) on ZooKeeper cluster machines, with periodic snapshots to ensure durability. The entire in-memory database uses fuzzy snapshots and WAL replay to ensure crash safety and fast recovery after a crash.

The advantage of fuzzy snapshots is that they do not block online requests.

Interaction with Clients

Update operations will notify and clear the relevant znode’s watch.
Read requests are processed locally, and the partial order of write requests is defined by zxid. Sequential consistency is ensured, but reads may be stale. ZooKeeper provides the sync operation, which can mitigate this to some extent.
When a client connects to a new ZooKeeper server, the maximum zxid is compared. The outdated ZooKeeper server will not establish a session with the client.
Clients maintain sessions through heartbeats, and the server handles requests idempotently.

References

ZooKeeper Paper

MIT6.824-ZooKeeper FAQ

MIT6.824 Chain Replication Flink-Iceberg-Connector Write Process