Prometheus--TSDB

December 31, 2024 · 4802 words · 10 min · Prometheus TSDB

Having recently been promoted, I took a moment to summarize some of my previous work. A significant part of my job was building large-scale database observability systems, which are quite different from cloud-native monitoring solutions like Prometheus. Now I’m diving into this standard open-source monitoring system. This article mainly discusses Prometheus’s built-in single-node time series database (TSDB), outlining its design without delving into source code analysis. Analyzing the

Borg: Large-scale Cluster Management at Google with Borg

February 19, 2024 · 557 words · 3 min · Borg K8s Cluster Management

Borg is Google’s cluster management system, in effect a closed-source predecessor of Kubernetes (k8s). It achieves high utilization through admission control, efficient task packing, overcommitment, machine sharing, and process-level performance isolation. It provides runtime features that reduce failure recovery time for high-availability applications, along with scheduling policies that reduce the probability of correlated failures. It also offers a declarative job description language, DNS integration, real-time job monitoring, and tools for analyzing and simulating system behavior, all of which simplify usage for end users.

Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications

September 28, 2023 · 1135 words · 3 min · Distributed System Transaction

It has been a while since I last studied anything, and I wanted to learn something interesting. This time, I’ll be covering Percolator, a distributed transaction system. I won’t translate the paper or delve into detailed algorithms; I’ll just document my understanding. Percolator and 2PC 2PC The Two-Phase Commit (2PC) protocol involves two types of roles: Coordinator and Participant. The coordinator manages the entire process to ensure multiple participants reach a

Dynamo: Amazon’s Highly Available Key-value Store

August 1, 2023 · 425 words · 2 min · Distributed System Storage

Dynamo, described in an old paper from AWS, has been on the market for a long time, and its architecture has likely evolved since the paper’s publication. Even so, the paper was selected as one of the SIGMOD best papers of the year, and it still offers many valuable lessons. Design Dynamo is a NoSQL product that provides a key-value storage interface. It emphasizes high availability over consistency, which leads to differences in architectural design and technical choices compared to other systems.

MIT6.824 AuroraDB

August 1, 2023 · 524 words · 2 min · Distributed System Database Cloud-Native MIT6.824

This article introduces the design considerations of AWS’s database product, Aurora, including storage-compute separation, a single-writer multi-reader architecture, and a quorum-based NRW consistency protocol. The article also mentions how PolarDB was inspired by Aurora, and how the two differ in addressing network bottlenecks and system call overhead. Aurora is a database product provided by AWS, primarily aimed at OLTP workloads. In terms of design, there are several aspects worth noting: The design premise of