December 31, 2024 · 4802 words · 10 min
·
Prometheus
TSDB
Recently got promoted, I took a moment to summarize some of my previous work. A significant part of my job was building large-scale database observability systems, which are quite different from cloud-native monitoring solutions like Prometheus. Now, I’m diving into the standard open-source monitoring system. This article mainly discusses the built-in single-node time series database (TSDB) of Prometheus, outlining its TSDB design without delving into source code analysis. Analyzing the
February 19, 2024 · 557 words · 3 min
·
Borg
K8s
Cluster Management
Borg is a cluster management system, similar to the closed-source version of Kubernetes (k8s).
It achieves high utilization through admission control, efficient task packing, overcommitment, machine sharing, and process-level performance isolation. It provides runtime features to reduce failure recovery time for high-availability applications and scheduling policies that reduce the probability of correlated failures. It offers a declarative job description language, DNS integration, real-time job monitoring, and tools for analyzing and simulating system behavior, simplifying usage for end-users.
September 28, 2023 · 1135 words · 3 min
·
Distributed System
Transaction
It has been a while since I last studied, and I wanted to learn something interesting. This time, I’ll be covering Percolator, a distributed transaction system. I won’t translate the paper or delve into detailed algorithms; I’ll just document my understanding. Percolator and 2PC 2PC The Two-Phase Commit (2PC) protocol involves two types of roles: Coordinator and Participant. The coordinator manages the entire process to ensure multiple participants reach a
August 1, 2023 · 425 words · 2 min
·
Distributed System
Storage
An old paper by AWS, Dynamo has been in the market for a long time, and the architecture has likely evolved since the paper’s publication. Despite this, the paper was selected as one of the SIGMOD best papers of the year, and there are still many valuable lessons to learn.
Design Dynamo is a NoSQL product that provides a key-value storage interface. It emphasizes high availability rather than consistency, which leads to differences in architectural design and technical choices compared to other systems.
August 1, 2023 · 524 words · 2 min
·
Distributed System
Database
Cloud-Native
MIT6.824
This article introduces the design considerations of AWS’s database product, Aurora, including storage-compute separation, single-writer multi-reader architecture, and quorum-based NRW consistency protocol. The article also mentions how PolarDB was inspired by Aurora, with differences in addressing network bottlenecks and system call overhead. Aurora is a database product provided by AWS, primarily aimed at OLTP business scenarios. In terms of design, there are several aspects worth noting: The design premise of
February 8, 2023 · 463 words · 3 min
·
Distributed System
MIT6.824
ChainReplication
This post provides a brief overview of the Chain Replication (CR) paper, which introduces a simple but effective algorithm for providing linearizable consistency in storage services. For those interested in the detailed design, it’s best to refer directly to the original paper.
Introduction In short, the Chain Replication (CR) paper presents a replicated state machine algorithm designed for storage services that require linearizable consistency. It uses a chain replication method to improve throughput and relies on multiple replicas to ensure service availability.
January 3, 2023 · 399 words · 2 min
·
Distributed System
MIT6.824
ZooKeeper
This article mainly discusses the design and practical considerations of the ZooKeeper system, such as wait-free and lock mechanisms, consistency choices, system-provided APIs, and specific semantic decisions. These trade-offs are the most insightful aspects of this article.
Positioning ZooKeeper is a wait-free, high-performance coordination service for distributed applications. It supports the coordination needs of distributed applications by providing coordination primitives (specific APIs and data models).
Design Keywords There are two key phrases in ZooKeeper’s positioning: high performance and distributed application coordination service.
October 10, 2022 · 1056 words · 5 min
·
Big Data
Lake House
Stream Compute
Storage
The Iceberg community provides an official Flink Connector, and this chapter’s source code analysis is based on that.
Overview of the Write Submission Process Flink writes data through RowData -> distributeStream -> WriterStream -> CommitterStream. Before data is committed, it is stored as intermediate files, which become visible to the system after being committed (through writing manifest, snapshot, and metadata files).
private <T> DataStreamSink<T> chainIcebergOperators() { Preconditions.checkArgument(inputCreator != null, "Please use forRowData() or forMapperOutputType() to initialize the input DataStream.
May 10, 2022 · 712 words · 4 min
·
LSM
LevelDB
This is the second chapter of my notes on reading the LevelDB source code, focusing on the write flow of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts.
Main Process The main write logic of LevelDB is relatively simple. First, the write operation is encapsulated into a WriteBatch, and then it is executed.
Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) { WriteBatch batch; batch.
April 15, 2022 · 1039 words · 5 min
·
Raft
Distributed System
Consensus
MIT6.824
Earlier, I looked at the code of Casbin-Mesh because I wanted to try GSOC. Casbin-Mesh is a distributed Casbin application based on Raft. This RaftKV in MIT6.824 is quite similar, so I took the opportunity to write this blog.
Lab Overview Lab 03 involves building a distributed KV service based on Raft. We need to implement the server and client for this service.
The structure of RaftKV and the interaction between its modules are shown below:
April 9, 2022 · 1312 words · 7 min
·
LSM
LevelDB
This is the first chapter of my notes on reading the LevelDB source code, focusing on the startup process of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts.
A code repository with annotations will be shared on GitHub later for those interested in studying it.
Prerequisites Database Files For now, I won’t delve into the encoding and naming details of these files (as I haven’t reached that part yet).
February 21, 2022 · 953 words · 5 min
·
Paper Reading
Consensu
Distributed System
MIT6.824
Finally, I managed to complete Lab 02 during this winter break, which had been on hold for quite some time. I was stuck on one of the cases in Test 2B for a while. During the winter break, I revisited the implementations from experts, and finally completed all the tasks, so I decided to document them briefly.
Algorithm Overview The basis of consensus algorithms is the replicated state machine, which means that executing the same deterministic commands in the same order will eventually lead to a consistent state.
November 21, 2021 · 705 words · 4 min
·
DataStructure
SkipList
Some time ago, I decided to implement a simple LSM storage engine model. As part of that, I implemented a basic SkipList and BloomFilter with BitSet. However, due to work demands and after-hours laziness, the project was put on hold. Now that I’m thinking about it again, I realize I’ve forgotten some of the details, so I’m writing it down for future reference.
What is SkipList? SkipList is an ordered data structure that can be seen as an alternative to balanced trees.
October 6, 2021 · 1284 words · 7 min
·
DFS
Paper Reading
Distributed System
The primary project in my group is a distributed file system (DFS) that provides POSIX file system semantics. The approach to handle “lots of small files” (LOSF) is inspired by Haystack, which is specifically designed for small files. I decided to read through the Haystack paper and take some notes as a learning exercise.
These notes are not an in-depth analysis of specific details but rather a record of my thoughts on the problem and design approach.
September 16, 2021 · 1908 words · 9 min
·
Paper Reading
MIT6.824
DFS
Distributed System
I recently found a translated version of the Bigtable paper online and saved it, but hadn’t gotten around to reading it. Lately, I’ve noticed that Bigtable shares many design similarities with a current project in our group, so I took some time over the weekend to read through it.
This is the last of Google’s three foundational distributed system papers, and although it wasn’t originally part of the MIT6.824 reading list, I’ve categorized it here for consistency.
September 9, 2021 · 1121 words · 6 min
·
GFS
MIT6.824
Paper Reading
This article introduces the Google File System (GFS) paper published in 2003, which proposed a distributed file system designed to store large volumes of data reliably, meeting Google’s data storage needs. This write-up reflects on the design goals, trade-offs, and architectural choices of GFS.
Introduction GFS is a distributed file system developed by Google to meet the needs of data-intensive applications, using commodity hardware to provide a scalable and fault-tolerant solution.
August 15, 2021 · 834 words · 4 min
·
OS
Linux
Network
IO
Let’s start with epoll.
epoll is an I/O event notification mechanism in the Linux kernel, designed to replace select and poll. It aims to efficiently handle large numbers of file descriptors and supports the system’s maximum file open limit, providing excellent performance.
Usage API epoll has three primary system calls:
/** epoll_create * Creates an epoll instance and returns a file descriptor for it. * Needs to be closed afterward, as epfd also consumes the system's fd resources.
May 2, 2021 · 566 words · 3 min
·
CPU
Cache
The motivation for this post comes from an interview question I was asked: What is CPU false sharing?
CPU Cache Let’s start by discussing CPU cache.
CPU cache is a type of storage medium introduced to bridge the speed gap between the CPU and main memory. In the pyramid-shaped storage hierarchy, it is located just below CPU registers. Its capacity is much smaller than that of main memory, but its speed can be close to the processor’s frequency.
February 21, 2021 · 564 words · 3 min
·
Network
HTTPS
HTTP
HTTPS (HTTP over SSL) was introduced to address the security vulnerabilities of HTTP, such as eavesdropping and identity spoofing. It uses SSL or TLS to encrypt communication between the client and the server.
Problems with HTTP Communication uses plain text, making it susceptible to eavesdropping. Unable to verify the identity of the communication party, making it vulnerable to spoofing (e.g., Denial of Service attacks). Cannot guarantee message integrity, making it possible for messages to be altered (e.
January 22, 2021 · 1541 words · 8 min
·
MIT6.824
Distributed System
Paper Reading
The third year of university has been quite intense, leaving me with little time to continue my studies on 6.824, so my progress stalled at Lab 1. With a bit more free time during the winter break, I decided to continue. Each paper or experiment will be recorded in this article.
This is the first chapter of my Distributed System study notes.
About the Paper The core content of the paper is the proposed MapReduce distributed computing model and the approach to implementing the Distributed MapReduce System, including the Master data structure, fault tolerance, and some refinements.