NoneBack

CPU Profiling: What, How, and When

March 10, 2025 · 881 words · 5 min · Performance Analysis

What: What is CPU Profiling A technique for analyzing program CPU performance. By collecting detailed data during program execution (such as function call frequency, time consumption, call stacks, etc.), it helps developers identify performance bottlenecks and optimize code efficiency. Typically used in performance analysis and root cause diagnosis scenarios. How: How Profiling Data is Collected Common tools like perf are used to collect process stack information. These tools use sampling statistics to capture stack samples executing on the CPU for performance analysis.

LevelDB MVCC

February 8, 2025 · 502 words · 3 min · LevelDB MVCC

LevelDB implements concurrent sstable read/write operations and snapshot reads through MVCC. Let’s examine its implementation. Sequence Number LevelDB uses Sequence Numbers as logical clocks to maintain a total order of KV write operations. The Sequence Number is encoded in the last few bytes of the InternalKey. This encoding ensures data ordering during memory writes. Versioning Every change to the sstable file collection triggers a version upgrade in LevelDB. Each Version represents the database state at a specific moment, containing sstable metadata and compaction-related information.

Prometheus--TSDB

December 31, 2024 · 4802 words · 10 min · Prometheus TSDB

Recently got promoted, I took a moment to summarize some of my previous work. A significant part of my job was building large-scale database observability systems, which are quite different from cloud-native monitoring solutions like Prometheus. Now, I’m diving into the standard open-source monitoring system. This article mainly discusses the built-in single-node time series database (TSDB) of Prometheus, outlining its TSDB design without delving into source code analysis. Analyzing the

Borg: Large-scale Cluster Management at Google with Borg

February 19, 2024 · 557 words · 3 min · Borg K8s Cluster Management

Borg is a cluster management system, similar to the closed-source version of Kubernetes (k8s). It achieves high utilization through admission control, efficient task packing, overcommitment, machine sharing, and process-level performance isolation. It provides runtime features to reduce failure recovery time for high-availability applications and scheduling policies that reduce the probability of correlated failures. It offers a declarative job description language, DNS integration, real-time job monitoring, and tools for analyzing and simulating system behavior, simplifying usage for end-users.

Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications

September 28, 2023 · 1135 words · 3 min · Distributed System Transaction

It has been a while since I last studied, and I wanted to learn something interesting. This time, I’ll be covering Percolator, a distributed transaction system. I won’t translate the paper or delve into detailed algorithms; I’ll just document my understanding. Percolator and 2PC 2PC The Two-Phase Commit (2PC) protocol involves two types of roles: Coordinator and Participant. The coordinator manages the entire process to ensure multiple participants reach a

Dynamo: Amazon’s Highly Available Key-value Store

August 1, 2023 · 425 words · 2 min · Distributed System Storage

An old paper by AWS, Dynamo has been in the market for a long time, and the architecture has likely evolved since the paper’s publication. Despite this, the paper was selected as one of the SIGMOD best papers of the year, and there are still many valuable lessons to learn. Design Dynamo is a NoSQL product that provides a key-value storage interface. It emphasizes high availability rather than consistency, which leads to differences in architectural design and technical choices compared to other systems.

MIT6.824 AuroraDB

August 1, 2023 · 524 words · 2 min · Distributed System Database Cloud-Native MIT6.824

This article introduces the design considerations of AWS’s database product, Aurora, including storage-compute separation, single-writer multi-reader architecture, and quorum-based NRW consistency protocol. The article also mentions how PolarDB was inspired by Aurora, with differences in addressing network bottlenecks and system call overhead. Aurora is a database product provided by AWS, primarily aimed at OLTP business scenarios. In terms of design, there are several aspects worth noting: The design premise of

MIT6.824 Chain Replication

February 8, 2023 · 463 words · 3 min · Distributed System MIT6.824 ChainReplication

This post provides a brief overview of the Chain Replication (CR) paper, which introduces a simple but effective algorithm for providing linearizable consistency in storage services. For those interested in the detailed design, it’s best to refer directly to the original paper. Introduction In short, the Chain Replication (CR) paper presents a replicated state machine algorithm designed for storage services that require linearizable consistency. It uses a chain replication method to improve throughput and relies on multiple replicas to ensure service availability.

MIT6.824-ZooKeeper

January 3, 2023 · 399 words · 2 min · Distributed System MIT6.824 ZooKeeper

This article mainly discusses the design and practical considerations of the ZooKeeper system, such as wait-free and lock mechanisms, consistency choices, system-provided APIs, and specific semantic decisions. These trade-offs are the most insightful aspects of this article. Positioning ZooKeeper is a wait-free, high-performance coordination service for distributed applications. It supports the coordination needs of distributed applications by providing coordination primitives (specific APIs and data models). Design Keywords There are two key phrases in ZooKeeper’s positioning: high performance and distributed application coordination service.

Flink-Iceberg-Connector Write Process

October 10, 2022 · 1056 words · 5 min · Big Data Lake House Stream Compute Storage

The Iceberg community provides an official Flink Connector, and this chapter’s source code analysis is based on that. Overview of the Write Submission Process Flink writes data through RowData -> distributeStream -> WriterStream -> CommitterStream. Before data is committed, it is stored as intermediate files, which become visible to the system after being committed (through writing manifest, snapshot, and metadata files). private <T> DataStreamSink<T> chainIcebergOperators() { Preconditions.checkArgument(inputCreator != null, "Please use forRowData() or forMapperOutputType() to initialize the input DataStream.

LevelDB Write

May 10, 2022 · 712 words · 4 min · LSM LevelDB

This is the second chapter of my notes on reading the LevelDB source code, focusing on the write flow of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts. Main Process The main write logic of LevelDB is relatively simple. First, the write operation is encapsulated into a WriteBatch, and then it is executed. Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) { WriteBatch batch; batch.

MIT6.824-RaftKV

April 15, 2022 · 1039 words · 5 min · Raft Distributed System Consensus MIT6.824

Earlier, I looked at the code of Casbin-Mesh because I wanted to try GSOC. Casbin-Mesh is a distributed Casbin application based on Raft. This RaftKV in MIT6.824 is quite similar, so I took the opportunity to write this blog. Lab Overview Lab 03 involves building a distributed KV service based on Raft. We need to implement the server and client for this service. The structure of RaftKV and the interaction between its modules are shown below:

LevelDB Startup

April 9, 2022 · 1312 words · 7 min · LSM LevelDB

This is the first chapter of my notes on reading the LevelDB source code, focusing on the startup process of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts. A code repository with annotations will be shared on GitHub later for those interested in studying it. Prerequisites Database Files For now, I won’t delve into the encoding and naming details of these files (as I haven’t reached that part yet).

MIT6.824-Raft

February 21, 2022 · 953 words · 5 min · Paper Reading Consensu Distributed System MIT6.824

Finally, I managed to complete Lab 02 during this winter break, which had been on hold for quite some time. I was stuck on one of the cases in Test 2B for a while. During the winter break, I revisited the implementations from experts, and finally completed all the tasks, so I decided to document them briefly. Algorithm Overview The basis of consensus algorithms is the replicated state machine, which means that executing the same deterministic commands in the same order will eventually lead to a consistent state.

How to Implement SkipList

November 21, 2021 · 705 words · 4 min · DataStructure SkipList

Some time ago, I decided to implement a simple LSM storage engine model. As part of that, I implemented a basic SkipList and BloomFilter with BitSet. However, due to work demands and after-hours laziness, the project was put on hold. Now that I’m thinking about it again, I realize I’ve forgotten some of the details, so I’m writing it down for future reference. What is SkipList? SkipList is an ordered data structure that can be seen as an alternative to balanced trees.

DFS-Haystack

October 6, 2021 · 1284 words · 7 min · DFS Paper Reading Distributed System

The primary project in my group is a distributed file system (DFS) that provides POSIX file system semantics. The approach to handle “lots of small files” (LOSF) is inspired by Haystack, which is specifically designed for small files. I decided to read through the Haystack paper and take some notes as a learning exercise. These notes are not an in-depth analysis of specific details but rather a record of my thoughts on the problem and design approach.

MIT6.824 Bigtable

September 16, 2021 · 1908 words · 9 min · Paper Reading MIT6.824 DFS Distributed System

I recently found a translated version of the Bigtable paper online and saved it, but hadn’t gotten around to reading it. Lately, I’ve noticed that Bigtable shares many design similarities with a current project in our group, so I took some time over the weekend to read through it. This is the last of Google’s three foundational distributed system papers, and although it wasn’t originally part of the MIT6.824 reading list, I’ve categorized it here for consistency.

MIT6.824 GFS

September 9, 2021 · 1121 words · 6 min · GFS MIT6.824 Paper Reading

This article introduces the Google File System (GFS) paper published in 2003, which proposed a distributed file system designed to store large volumes of data reliably, meeting Google’s data storage needs. This write-up reflects on the design goals, trade-offs, and architectural choices of GFS. Introduction GFS is a distributed file system developed by Google to meet the needs of data-intensive applications, using commodity hardware to provide a scalable and fault-tolerant solution.

Epoll and IO Multiplexing

August 15, 2021 · 834 words · 4 min · OS Linux Network IO

Let’s start with epoll. epoll is an I/O event notification mechanism in the Linux kernel, designed to replace select and poll. It aims to efficiently handle large numbers of file descriptors and supports the system’s maximum file open limit, providing excellent performance. Usage API epoll has three primary system calls: /** epoll_create * Creates an epoll instance and returns a file descriptor for it. * Needs to be closed afterward, as epfd also consumes the system's fd resources.

CPU False Sharing

May 2, 2021 · 566 words · 3 min · CPU Cache

The motivation for this post comes from an interview question I was asked: What is CPU false sharing? CPU Cache Let’s start by discussing CPU cache. CPU cache is a type of storage medium introduced to bridge the speed gap between the CPU and main memory. In the pyramid-shaped storage hierarchy, it is located just below CPU registers. Its capacity is much smaller than that of main memory, but its speed can be close to the processor’s frequency.

HTTPS Introduction

February 21, 2021 · 564 words · 3 min · Network HTTPS HTTP

HTTPS (HTTP over SSL) was introduced to address the security vulnerabilities of HTTP, such as eavesdropping and identity spoofing. It uses SSL or TLS to encrypt communication between the client and the server. Problems with HTTP Communication uses plain text, making it susceptible to eavesdropping. Unable to verify the identity of the communication party, making it vulnerable to spoofing (e.g., Denial of Service attacks). Cannot guarantee message integrity, making it possible for messages to be altered (e.

MIT6.824-MapReduce

January 22, 2021 · 1541 words · 8 min · MIT6.824 Distributed System Paper Reading

The third year of university has been quite intense, leaving me with little time to continue my studies on 6.824, so my progress stalled at Lab 1. With a bit more free time during the winter break, I decided to continue. Each paper or experiment will be recorded in this article. This is the first chapter of my Distributed System study notes. About the Paper The core content of the paper is the proposed MapReduce distributed computing model and the approach to implementing the Distributed MapReduce System, including the Master data structure, fault tolerance, and some refinements.

NoneBack

Featured Posts