MIT6.824 Bigtable
September 16, 2021 · 1908 words · 9 min · Paper Reading MIT6.824 DFS Distributed System
I recently found a translated version of the Bigtable paper online and saved it, but hadn’t gotten around to reading it. Lately, I’ve noticed that Bigtable shares many design similarities with a current project in our group, so I took some time over the weekend to read through it.
This is the last of Google’s three foundational distributed system papers, and although it wasn’t originally part of the MIT6.824 reading list, I’ve categorized it here for consistency.
As with previous notes, I won’t dive deep into the technical details but will instead focus on the design considerations and thoughts on the problem.
Introduction
Bigtable is a distributed structured data storage system built on top of GFS, designed to store large amounts of structured and semi-structured data. It is a NoSQL data store that emphasizes scalability and performance, as well as reliable fault tolerance through GFS.
Design Goals: Wide Applicability, Scalability, High Performance, and High Availability
Data Model
Bigtable’s data model is schema-less and deliberately simple. It treats all data as uninterpreted strings, with encoding and decoding handled by the application layer.
Bigtable is essentially a sparse, distributed, persistent multidimensional sorted Map. The index of the Map is composed of Row Key, Column Key, and TimeStamp, and the value is an unstructured byte array.
// Mapping abstraction
(row:string, column:string, time:int64) -> string
// The Map key is a multi-dimensional structure composed of {Row Key, Column Key, Timestamp}; the value is an uninterpreted byte array.
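As a rough illustration (my own sketch, not Bigtable code), the whole model can be thought of as an ordered map keyed by a (row, column, timestamp) tuple; `std::map` keeps the keys sorted, and cells that were never written simply have no entry, which is where the sparseness comes from:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// (row key, column key, timestamp) -> value. std::map keeps keys sorted,
// and cells that were never written simply have no entry (sparseness).
using Cell = std::tuple<std::string, std::string, int64_t>;
using ToyBigtable = std::map<Cell, std::string>;

int main() {
    ToyBigtable t;
    t[{"com.cnn.www", "contents:", 3}] = "<html>...</html>";
    t[{"com.cnn.www", "anchor:cnnsi.com", 9}] = "CNN";
    // A different row may have a completely different set of columns;
    // the missing ones cost nothing.
    t[{"com.example.www", "contents:", 5}] = "<html>hi</html>";
}
```

One simplification worth flagging: Bigtable stores multiple versions of a cell in decreasing timestamp order, whereas the default tuple ordering here is ascending.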
The paper describes the data model as follows:
A Bigtable is a sparse, distributed, persistent multidimensional sorted map.
Sparse means that columns in the same table can be null, which is quite common.
| Row  | Columns                    |
|------|----------------------------|
| Row1 | {ID, Name, Phone}          |
| Row2 | {ID, Name, Phone, Address} |
| Row3 | {ID, Name, Phone, Email}   |
Distributed refers to scalability and fault tolerance, i.e., replication and sharding. Bigtable leverages GFS replicas for fault tolerance and partitions data into Tablets for scalability.
Persistent, multidimensional, and sorted mean that data is eventually persisted (to GFS) and kept in sorted order; Bigtable uses a WAL plus an LSM-style structure to keep write and read latency low.
The best-known open-source implementation of the Bigtable design is HBase, a wide-column store.
Rows
Bigtable organizes data using lexicographic order of row keys. A Row Key can be any string, and read and write operations are atomic at the row level.
Lexicographic ordering helps keep related rows stored close together. (For comparison, MySQL achieves atomic row operations using its undo log.)
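The paper's webtable example makes this concrete: row keys are reversed hostnames, so pages from the same domain sort next to each other. A tiny self-contained illustration (the URLs are made up):

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Reversed hostnames (www.cnn.com -> com.cnn.www) make pages from the
    // same domain lexicographically adjacent, so they tend to end up in the
    // same tablet and can be scanned together.
    std::vector<std::string> rowKeys = {
        "com.google.maps/index.html",
        "com.cnn.www/index.html",
        "com.google.mail/index.html",
        "com.cnn.www/world/index.html",
    };
    std::sort(rowKeys.begin(), rowKeys.end());  // lexicographic order
    for (const auto& k : rowKeys) std::cout << k << "\n";
    // Output groups the com.cnn.www pages together, then the com.google.* ones.
}
```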
Column Family
A set of column keys forms a Column Family; the data stored in one family is usually of the same type.
A column key is composed of Column Family : Qualifier. The column family’s name must be a printable string, whereas the qualifier can be any string.
The paper mentions:
Access control and both disk and memory accounting are performed at the column-family level.
This is because business users tend to retrieve data by columns, e.g., reading webpage content. In practice, column data is often compressed for storage. Thus, the Column Family level is a more suitable level for access control and resource accounting than rows.
TimeStamp
The timestamp is used to maintain different versions of the same data, serving as a logical clock. It is also used as an index to query data versions.
Typically, timestamps are sorted in reverse chronological order. When the number of versions is low, a pointer to the previous version is used to maintain data versioning; when the number of versions increases, an index structure is needed. TimeStamp indexing inherently requires range queries, so a sortable data structure is appropriate for indexing. Extra version management increases maintenance overhead, usually handled by limiting the number of data versions and garbage collecting outdated versions.
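A minimal sketch of per-cell versioning under these rules, using my own names and a plain `std::map` ordered by descending timestamp (this is not Bigtable's actual storage layout):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <map>
#include <string>

// Versions of a single cell, newest timestamp first (std::greater).
using Versions = std::map<int64_t, std::string, std::greater<int64_t>>;

// Garbage-collect old versions, keeping only the newest maxVersions entries.
void PruneVersions(Versions& v, std::size_t maxVersions) {
    while (v.size() > maxVersions) {
        v.erase(std::prev(v.end()));  // the oldest version sorts last here
    }
}

int main() {
    Versions contents;
    contents[3] = "<html>v1</html>";
    contents[5] = "<html>v2</html>";
    contents[9] = "<html>v3</html>";
    PruneVersions(contents, 2);      // keep timestamps 9 and 5, drop 3
    auto newest = contents.begin();  // reads default to the newest version (timestamp 9)
    (void)newest;
}
```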
Tablet
Bigtable uses a range-based data sharding strategy, and Tablet is the basic unit for data sharding and load balancing.
A tablet is a contiguous range of rows managed by a Tablet Server. Every row ultimately lives in exactly one tablet, and tablets are split or merged to balance load among Tablet Servers.
Compared with hash-based sharding, range-based sharding keeps adjacent row keys together, which makes range scans efficient.
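A small sketch of what range-based placement buys you, assuming tablets are indexed by the last row key they contain (the `TabletInfo`/`FindTablet` names are mine, not Bigtable's):

```cpp
#include <map>
#include <string>

// Tablets indexed by the last row key they contain; the previous entry's key
// implies the (exclusive) start of the range.
struct TabletInfo {
    std::string server;  // which Tablet Server currently serves it
};

using TabletIndex = std::map<std::string, TabletInfo>;  // last row key -> tablet

const TabletInfo* FindTablet(const TabletIndex& idx, const std::string& row) {
    auto it = idx.lower_bound(row);  // first tablet whose last row key >= row
    return it == idx.end() ? nullptr : &it->second;
}

int main() {
    TabletIndex idx = {
        {"com.cnn.zzzz", {"ts-1"}},     // rows up to and including com.cnn.zzzz
        {"com.google.zzzz", {"ts-2"}},  // the next contiguous range
        {"\xff", {"ts-3"}},             // everything after that
    };
    const TabletInfo* t = FindTablet(idx, "com.cnn.www");  // -> the tablet on ts-1
    (void)t;  // a scan over com.cnn.* stays inside one tablet, which
              // hash-based sharding could not guarantee.
}
```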
SSTable
SSTable is a persistent, sorted, immutable Map. Both keys and values are arbitrary byte arrays.
A tablet in Bigtable is stored in the form of SSTable files.
SSTable is organized into data blocks (typically 64KB each) plus a block index used to locate blocks. The index is loaded into memory when the SSTable is opened, so a lookup is a binary search in the in-memory index followed by reading a single data block.
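A sketch of that lookup path, under the assumption of a simple last-key-per-block index (`BlockHandle`/`LocateBlock` are illustrative names, not the real on-disk format):

```cpp
#include <cstdint>
#include <map>
#include <string>

// The block index maps the last key of each ~64KB block to its position in
// the file. It is loaded into memory when the SSTable is opened, so a lookup
// is one in-memory search plus at most one read of a single block from GFS.
struct BlockHandle {
    uint64_t offset = 0;
    uint64_t size = 0;
};

using BlockIndex = std::map<std::string, BlockHandle>;  // last key in block -> handle

// Returns the only block that could contain `key`, or nullptr if the key is
// past the end of the SSTable.
const BlockHandle* LocateBlock(const BlockIndex& index, const std::string& key) {
    auto it = index.lower_bound(key);  // first block whose last key >= key
    return it == index.end() ? nullptr : &it->second;
}
```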
API
The paper provides API examples that highlight the differences from an RDBMS.
// Writing to Bigtable
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
// Reading from Bigtable
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
    printf("%s %s %lld %s\n",
           scanner.RowName(),
           stream->ColumnName(),
           stream->MicroTimestamp(),
           stream->Value());
}
Architecture Design
External Components
Bigtable is built on top of other components in Google’s ecosystem, which significantly simplifies Bigtable’s design.
GFS
GFS is Bigtable’s underlying storage, providing replication and fault tolerance.
Refer to the previous notes for details.
Chubby
Chubby is a highly available distributed lock service that provides a namespace, where directories and files can serve as distributed locks.
High availability means maintaining multiple service replicas, with consistency ensured via Paxos. A lease mechanism prevents defunct Chubby clients from holding onto locks indefinitely.
Why Chubby? What is its role?
- Stores Column Family information
- Stores ACL (Access Control List)
- Stores the Root Tablet location, the bootstrap point for all tablet addressing and essential for Bigtable startup.
Bigtable uses a three-layer B+ tree-like structure for metadata. The Root Tablet location is in Chubby, which helps locate other metadata tablets, which in turn store user Tablet locations.
- Tablet Server lifecycle monitoring
Each Tablet Server creates a unique file in a designated directory in Chubby and acquires an exclusive lock on it. The server is considered offline if it loses the lock.
In summary, Chubby’s functionality can be categorized into two parts. One is to store critical metadata as a highly available node, while the other is to manage the lifecycle of storage nodes (Tablet Servers) using distributed locking.
In GFS, these responsibilities are handled by the Master. By offloading them to Chubby, Bigtable simplifies the Master design and reduces its load.
Conceptually, Chubby can be seen as part of the Master node.
Internal Components
Master
Bigtable follows a Master-Slave architecture, similar to GFS and MapReduce. Unlike the GFS master, however, the Bigtable Master does not store tablet locations itself; metadata lives in Chubby and on Tablet Servers, and the Master is responsible only for coordination.
Responsibilities include Tablet allocation, garbage collection, monitoring Tablet Server health, load balancing, and metadata updates. The Master requires:
- All Tablet information to determine allocation and distribution.
- Tablet Server status information to decide on allocations.
Tablet Server
Tablet Servers manage tablets, handling reads and writes, splitting and merging tablets when necessary.
Metadata is not stored by the Master; clients interact directly with Chubby and Tablet Servers to locate and read data. Tablets are split by Tablet Servers, so the Master may not be notified immediately; logging plus retries ensure the split is not lost.
Client SDK
The client SDK is the entry point for businesses to access Bigtable. To minimize metadata lookup overhead, caching and prefetching are used to reduce the frequency of network interactions, making use of temporal and spatial locality.
Caching can leave the client with stale tablet locations. A stale entry is discovered when the contacted Tablet Server no longer serves the requested row; the client then invalidates the entry, re-resolves the location through METADATA, and retries.
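A toy version of such a client-side location cache, with invalidation on stale entries (all names here are hypothetical; the real SDK also prefetches neighboring tablet locations):

```cpp
#include <map>
#include <optional>
#include <string>

// A toy client-side cache of tablet locations: last row key of the tablet ->
// Tablet Server address. On a miss, or when a cached server turns out not to
// serve the row any more (a stale entry), the client falls back to the
// METADATA lookup, updates the cache, and retries the request.
class LocationCache {
 public:
    std::optional<std::string> Lookup(const std::string& row) const {
        auto it = cache_.lower_bound(row);
        if (it == cache_.end()) return std::nullopt;
        return it->second;
    }
    void Update(const std::string& lastRow, const std::string& server) {
        cache_[lastRow] = server;
    }
    void Invalidate(const std::string& row) {  // called after a stale hit
        auto it = cache_.lower_bound(row);
        if (it != cache_.end()) cache_.erase(it);
    }

 private:
    std::map<std::string, std::string> cache_;
};
```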
Storage Design
Mapping and Addressing
Bigtable data is uniquely determined by a (Table, Row, Column) tuple. Data is stored in tablets, which are in turn stored as SSTable files on GFS.
Tablets are the logical representation of Bigtable’s on-disk data and are managed by Tablet Servers.
Bigtable uses the Root Tablet plus the METADATA Table for addressing. The Root Tablet location is stored in Chubby, while the METADATA Table is maintained by Tablet Servers.
The Root Tablet stores the location of METADATA Tablets, and each METADATA Tablet contains the location of user tablets.
METADATA Table Row:
(TableID, encoding of last row in Tablet) => Tablet Location
The system uses a B+ tree-like three-layer structure to maintain tablet location information.
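To make the three hops concrete, here is a toy, in-memory walk-through of the lookup chain (the key encoding, location strings, and sentinel entries are purely illustrative, not the real METADATA format):

```cpp
#include <map>
#include <string>

// Toy in-memory stand-ins: a "Chubby" file map and a map from tablet location
// to that tablet's rows. Lookups rely on the "\xff" sentinel entries in main.
std::map<std::string, std::string> chubby;  // file path -> contents
std::map<std::string, std::map<std::string, std::string>> tablets;  // location -> rows

std::string ResolveTablet(const std::string& tableId, const std::string& row) {
    // 1. Chubby stores the location of the Root Tablet.
    const std::string& root = chubby.at("/bigtable/root-tablet-location");
    // 2. The Root Tablet maps METADATA row keys to METADATA tablet locations.
    const std::string& metaTablet =
        tablets.at(root).lower_bound("METADATA:" + row)->second;
    // 3. That METADATA tablet maps (table id, last row) to user tablet locations.
    return tablets.at(metaTablet).lower_bound(tableId + ":" + row)->second;
}

int main() {
    chubby["/bigtable/root-tablet-location"] = "root@ts-0";
    tablets["root@ts-0"] = {{"METADATA:\xff", "meta1@ts-1"}};
    tablets["meta1@ts-1"] = {{"webtable:\xff", "user7@ts-2"}};
    ResolveTablet("webtable", "com.cnn.www");  // -> "user7@ts-2"
}
```

With a warm client cache none of these hops are needed; the paper notes that an empty cache costs three network round trips, and stale entries at most six.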
Scheduling and Monitoring
Scheduling
Scheduling involves Tablet allocation and load balancing.
A Tablet can only be assigned to one Tablet Server at any given time. The Master maintains Tablet Server states and sends allocation requests as needed.
The Master does not maintain addressing information but holds Tablet Server states (including tablet count, status, and available resources) for scheduling.
Monitoring
Monitoring is carried out by Chubby and the Master.
Each Tablet Server creates a unique file in a Chubby directory and acquires an exclusive lock. When the Tablet Server disconnects and loses its lease, the lock is released.
The unique file determines whether a Tablet Server is active, and the Master may delete the file as needed. In cases of network disconnection, the Tablet Server will try to re-acquire the exclusive lock if the file still exists. If the file doesn’t exist, the disconnected Tablet Server should automatically leave the cluster.
The Master ensures its uniqueness by acquiring an exclusive lock on a unique file in Chubby, and monitors a specific directory for Tablet Server files.
Once it detects a failure, it deletes the Tablet Server’s Chubby file and reallocates its tablets to other Tablet Servers.
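A sketch of the two sides of this protocol, written against a hypothetical Chubby client interface (Chubby's real API is internal to Google; only the shape of the logic is taken from the paper):

```cpp
#include <string>

// Hypothetical Chubby client interface -- the method names are my own.
struct ChubbyClient {
    virtual bool TryAcquireExclusiveLock(const std::string& file) = 0;
    virtual bool FileExists(const std::string& file) = 0;
    virtual void DeleteFile(const std::string& file) = 0;
    virtual ~ChubbyClient() = default;
};

// Master side: when a Tablet Server stops responding, try to grab its lock.
// Succeeding proves the server has lost its Chubby session (and that Chubby
// itself is alive), so the Master deletes the file and reassigns the tablets.
void HandleUnresponsiveServer(ChubbyClient& chubby, const std::string& serverFile) {
    if (chubby.TryAcquireExclusiveLock(serverFile)) {
        chubby.DeleteFile(serverFile);
        // ...then reassign that server's tablets to live Tablet Servers.
    }
}

// Tablet Server side after losing its session: keep serving only if its file
// still exists and the lock can be re-acquired; if the file is gone, exit.
bool ShouldKeepServing(ChubbyClient& chubby, const std::string& myFile) {
    return chubby.FileExists(myFile) && chubby.TryAcquireExclusiveLock(myFile);
}
```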
Compaction
Bigtable provides read and write services and uses an LSM-like structure to optimize write performance. For each write, permissions are first verified against ACL information stored in Chubby (usually a hit in the Chubby client cache); the mutation is then appended to the WAL and applied to the Memtable before eventually being persisted as SSTables.
When Memtable grows to a certain size, it triggers a Minor Compaction to convert Memtable to SSTable and write it to GFS.
Memtable is first converted into an immutable Memtable before becoming SSTable. This intermediate step ensures that Minor Compaction does not interfere with incoming writes.
Bigtable uses this LSM-style write path to accelerate writes, converting random writes into sequential ones and deferring reorganization to background compactions (a toy sketch of the write path follows the list below). Compaction comes in three types:
- Minor Compaction: converts a frozen Memtable into an SSTable and writes it to GFS, shrinking memory usage and the amount of log that must be replayed on recovery.
- Merging Compaction: merges the Memtable and a few SSTables into one new SSTable, bounding the number of SSTables per tablet.
- Major Compaction: rewrites all of a tablet's SSTables into a single SSTable, discarding deleted entries and superseded versions.
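Here is the toy write path promised above: WAL append, Memtable insert, and a minor-compaction flush once the Memtable grows past a threshold (class and constant names are mine, and the real system swaps in a fresh Memtable rather than blocking writes):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A toy LSM-style write path: append to the log, apply to the memtable, and
// flush the memtable as a new immutable sorted file once it grows too large.
class ToyTablet {
 public:
    void Write(const std::string& key, const std::string& value) {
        log_.push_back(key + "=" + value);  // 1. WAL: a cheap sequential append
        memtable_[key] = value;             // 2. sorted in-memory buffer
        if (memtable_.size() >= kFlushThreshold) MinorCompaction();
    }

 private:
    void MinorCompaction() {
        // Freeze the memtable and persist it as an immutable SSTable. The real
        // system swaps in a fresh memtable first so incoming writes never block.
        sstables_.push_back(memtable_);
        memtable_.clear();
        log_.clear();  // the real system advances the log's redo point instead
    }
    // Merging/major compactions (not shown) would fold sstables_ back together,
    // with major compaction also dropping deleted and superseded entries.

    static constexpr std::size_t kFlushThreshold = 4;
    std::vector<std::string> log_;                        // stand-in for the commit log on GFS
    std::map<std::string, std::string> memtable_;         // sorted in memory
    std::vector<std::map<std::string, std::string>> sstables_;  // immutable "files"
};
```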
For reads, data must be aggregated across the Memtable and multiple SSTables, since a row's cells may be spread over these structures. Two levels of caching plus Bloom filters are used to speed up reads.
Tablet Servers have two levels of caching:
- Scan Cache: Caches frequently read key-value pairs.
- Block Cache: Caches SSTable blocks.
Bloom filters are also employed to reduce the number of SSTable lookups by indicating whether a key is not present.
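A sketch of that read path, with a `std::set` standing in for the Bloom filter (a real Bloom filter can give false positives but never false negatives, so skipping a file on "not present" is always safe):

```cpp
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

// One on-disk file plus its per-file filter. A plain std::set stands in for
// the Bloom filter; the real filter is probabilistic but never reports a
// present key as absent.
struct ToySSTable {
    std::set<std::string> filter;
    std::map<std::string, std::string> data;
};

// Read path: newest data wins, so check the memtable first, then the SSTables
// from newest to oldest, skipping any file whose filter rules the key out.
std::optional<std::string> Read(const std::map<std::string, std::string>& memtable,
                                const std::vector<ToySSTable>& sstables,  // newest first
                                const std::string& key) {
    if (auto it = memtable.find(key); it != memtable.end()) return it->second;
    for (const auto& sst : sstables) {
        if (sst.filter.count(key) == 0) continue;  // definitely absent: skip the disk read
        if (auto it = sst.data.find(key); it != sst.data.end()) return it->second;
    }
    return std::nullopt;
}
```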
Optimization
Locality
Column families that clients frequently access together can be grouped into a locality group, which is stored as its own SSTable, so fetching related data touches less unrelated data.
Space is traded for time, leveraging locality principles.
Compression
SSTable blocks are compressed to reduce network bandwidth and latency during transfers.
Compression is applied block by block; this keeps encoding/decoding cheap and allows individual blocks to be read without decompressing the entire SSTable.
CommitLog Design
Each Tablet Server maintains a single Commit Log rather than one per Tablet, which reduces disk seeks and enables group commit. During recovery, log entries are first sorted by (Table, Row, Log Sequence Number) so that each tablet's mutations become contiguous and can be read efficiently by whichever server takes over that tablet.
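A minimal sketch of that recovery-time sort (struct and field names are mine):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Entries recovered from one shared commit log, which interleaves mutations
// for many tablets. Sorting by (table, row, sequence number) makes each
// tablet's mutations contiguous, so a server recovering a single tablet reads
// one contiguous range instead of scanning the entire log.
struct LogEntry {
    std::string table;
    std::string row;
    int64_t seq = 0;
    std::string mutation;
};

void SortForRecovery(std::vector<LogEntry>& entries) {
    std::sort(entries.begin(), entries.end(),
              [](const LogEntry& a, const LogEntry& b) {
                  return std::tie(a.table, a.row, a.seq) <
                         std::tie(b.table, b.row, b.seq);
              });
}
```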
Summary
- Keep it simple: Simple is better than complex.
- Cluster monitoring is crucial for distributed services. Google’s three papers emphasize cluster monitoring and scheduling.
- Do not make assumptions about other systems in your design. Issues may range from common network issues to unexpected operational problems.
- Leverage background operations to accelerate user-facing actions, such as making writes fast and using background processes for cleanups.