Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications

September 28, 2023 · 1135 words · 3 min · Distributed System Transaction

It has been a while since I last studied, and I wanted to learn something interesting. This time, I’ll be covering Percolator, a distributed transaction system. I won’t translate the paper or delve into detailed algorithms; I’ll just document my understanding.

Percolator and 2PC

2PC

The Two-Phase Commit (2PC) protocol involves two types of roles: Coordinator and Participant. The coordinator manages the entire process to ensure multiple participants reach a unanimous decision. Participants respond to the coordinator’s requests, completing prepare operations and commit/abort operations based on those requests.

The 2PC protocol ensures the atomicity (ACD) of a transaction but does not implement isolation (I), relying instead on single-node transactions for ACD. The coordinator is clearly a critical point, which can become a bottleneck or cause blocking if it fails.

    Coordinator                                          Participant
                              QUERY TO COMMIT
                -------------------------------->
                              VOTE YES/NO           prepare*/abort*
                <-------------------------------
commit*/abort*                COMMIT/ROLLBACK
                -------------------------------->
                              ACKNOWLEDGMENT        commit*/abort*
                <--------------------------------
end

Percolator

Percolator can be seen as an optimized version of 2PC, with some improvements such as:

  • Optimizing the use of locks by introducing primary-secondary dual-level locks, which eliminates the reliance on a coordinator.
  • Providing full ACID semantics and supporting MVCC (Multi-Version Concurrency Control) through a timestamp service.

Percolator Protocol Details

The Percolator system consists of three main components:

  • Client: The client initiating a transaction. It acts as the control center for the entire protocol and is the coordinator of the two-phase commit process.

  • TO (Time Observer): Responsible for assigning timestamps, providing unique and incrementing timestamps to implement MVCC.

  • Bigtable: Provides single-row transactions, storing data as well as some attributes for transactional control.

    lock + write + data: for transactions, where lock indicates that a cell is held by a transaction, and write represents the data visibility.

    notify + ack: for watcher or notifier mechanisms.

    https://raw.githubusercontent.com/noneback/images/picgo/20230927163910.png

Externally, Percolator is provided to businesses through an SDK, offering transactions and R/W operations. The model is similar to Begin Txn → Sets of RW Operations → Commit or Abort or Rollback. Bigtable acts as the persistent component, hiding details about Tablet Server data sharding. Each write operation (including read-then-write) in the transaction is treated as a participant in a distributed transaction and may be dispatched to multiple Tablet Server nodes.

Algorithm Workflow

All writes in a transaction are cached on the client before being written during the commit phase. The commit phase itself is a standard two-phase commit consisting of prewrite and commit stages.

Prewrite

  1. Obtain a timestamp from TO as the start time of the transaction.
  2. Lock the data, marking it as held by the current transaction. If locking fails, it means the data is held by another transaction, and the current transaction fails.

The locking process utilizes the primary-secondary mechanism, where one write is chosen as the primary and all others as secondary. The secondary locks point to the primary.

https://raw.githubusercontent.com/noneback/images/picgo/202309271613141.png

Clearly, data in the prewrite phase is invisible to other transactions.

Commit

  1. Attempt to commit the data prewritten. The commit starts by committing the primary record, whose commit time will serve as the commit time for the entire transaction. First, the lock record is checked. If the lock does not exist, it indicates that the lock from the prewrite phase has been cleaned by another transaction, causing the current transaction to fail. If the lock exists, the write column is updated to indicate that the data is visible to the system.

In an asynchronous network, single-node failures and network delays are common. The algorithm must detect and clean up these locks to avoid deadlocks. Therefore, in the commit phase, if a lock is found to be missing, it means that an issue occurred with a participant, and the current transaction must be cleaned.

  1. After successfully committing, clean up the lock record. Lock cleanup can be done asynchronously.

These designs eliminate the dependency on a centralized coordinator. Previously, a centralized service was required to maintain information about all transaction participants. In this algorithm, the primary-secondary lock and the write column achieve the same goal. The write column indicates the visibility and version chain of the data, while the lock column shows which transaction holds the data. The primary-secondary locks record the logical relationship among participants. Thus, committing the primary record becomes the commit point for the entire transaction. Once the primary is committed, all secondary records can be asynchronously committed by checking the corresponding primary record’s write column.

Snapshot Isolation

Two-phase commit ensures the atomicity of a transaction. On top of that, Percolator also provides snapshot isolation. In simple terms, snapshot isolation requires that committed transactions do not cause data conflicts and that read operations within a transaction satisfy snapshot reads. By leveraging the transaction start time and the primary commit time, a total ordering among transactions can be maintained, solving these issues naturally.

Deadlock Issues in Asynchronous Networks

As mentioned earlier, in an asynchronous network, single-node failures and network delays are common. The algorithm must clean up locks to prevent deadlocks when such failures are detected. The failure detection strategy can be as simple as a timeout, causing the current transaction to fail. When a node fails and then recovers, its previous transaction has already failed, and the relevant lock records must be cleaned up. Lock cleanup can be asynchronous; for example, during the prewrite phase, if a record’s lock column is found to be non-empty, its primary lock can be checked. If the primary lock is not empty, it means the transaction is incomplete, and the lock can be cleaned up; if empty, the transaction has committed, and the data should be committed and the lock cleaned (RollForward).

Notification Mechanism

A notification mechanism is crucial for state observation and linkage in asynchronous systems, but it is not the focus of this article.

Percolator in TiDB

Based on our analysis above, Percolator is an optimized 2PC distributed transaction implementation, relying on a storage engine that supports single-node transactions.

Let’s briefly look at how TiDB uses Percolator to implement distributed transactions.

https://download.pingcap.com/images/docs-cn/tidb-architecture-v6.png

The architecture of TiDB and TiKV is shown above. Data from relational tables in TiDB is ultimately mapped to KV pairs in TiKV. TiKV is a distributed KV store based on Raft and RocksDB. RocksDB supports transactional operations on KV pairs.

https://download.pingcap.com/images/docs/tikv-rocksdb.png

Thus, the transaction path in TiDB is as follows: a relational table transaction is converted into a set of KV transactions, which are executed based on Percolator to achieve relational table transaction operations.

Of course, it cannot provide the same transactional semantics and performance guarantees as a single-node TP database. However, a shared-nothing architecture has its own advantages, which may make this trade-off acceptable.

References

Engineering Practice of Two-Phase Commit

PolarDB Database Kernel Monthly Report

Percolator: Online Incremental Processing System (Chinese Translation)

Percolator: Online Incremental Processing System (Chinese Translation) | A Small Bird

Two-Phase Commit - Wikipedia

Percolator and TiDB Transaction Algorithm

Two-Phase Commit | OceanBase Learning Guide

TiDB Architecture Overview