Prometheus TSDB

December 31, 2024 · 4802 words · 10 min · Prometheus TSDB

Having recently been promoted, I took a moment to summarize some of my previous work. A significant part of my job was building large-scale database observability systems, which are quite different from cloud-native monitoring solutions like Prometheus. Now I'm diving into the standard open-source monitoring system.

This article mainly discusses the built-in single-node time series database (TSDB) of Prometheus, outlining its TSDB design without delving into source code analysis.

Analyzing the source code of such projects is often of low value unless you specialize in TSDBs: the details are easily forgotten, and the code itself may not be exceptional.

Data + Query Model

  1. A single monitoring metric is described as a structure of time-dependent data, a timeseries.

    $$ timeseries = \lbrace\ metric\ (\text{with an attached set of labels}) \Rightarrow (t_0,\ v_0),\ (t_1,\ v_1),\ (t_2,\ v_2),\ \ldots,\ (t_n,\ v_n)\ \rbrace $$

  2. Queries use the ${identifier}$ (metric + the selected label values) to retrieve the corresponding timeseries.

series
  ^   
  │   . . . . . . . . . . . . . . . . .   . . . . .   {__name__="request_total", method="GET"}
  │     . . . . . . . . . . . . . . . . . . . . . .   {__name__="request_total", method="POST"}
  │         . . . . . . .
  │       . . .     . . . . . . . . . . . . . . . .                  ... 
  │     . . . . . . . . . . . . . . . . .   . . . .   
  │     . . . . . . . . . .   . . . . . . . . . . .   {__name__="errors_total", method="POST"}
  │           . . .   . . . . . . . . .   . . . . .   {__name__="errors_total", method="GET"}
  │         . . . . . . . . .       . . . . .
  │       . . .     . . . . . . . . . . . . . . . .                  ... 
  │     . . . . . . . . . . . . . . . .   . . . . 
  v
    <-------------------- time --------------------->

The choice of identifier is crucial. Poor labeling (e.g., labels that carry container IDs or other high-cardinality values) can cause the number of timeseries to explode, especially in scenarios where containers are frequently rebuilt.
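
To make the model concrete, here is a minimal Go sketch of a timeseries: an identifier made of label pairs (the metric name is just the special label __name__) plus its timestamped samples. The type and field names are mine, not Prometheus's actual API.

package main

import "fmt"

// Label is a name/value pair; the metric name itself is stored as the
// special label __name__.
type Label struct {
	Name, Value string
}

// Sample is one (timestamp, value) data point.
type Sample struct {
	T int64 // milliseconds since the epoch
	V float64
}

// Series is a timeseries: an identifier (the full label set) plus its samples.
type Series struct {
	Labels  []Label
	Samples []Sample
}

func main() {
	s := Series{
		Labels: []Label{
			{Name: "__name__", Value: "request_total"},
			{Name: "method", Value: "GET"},
		},
		Samples: []Sample{{T: 1602237600000, V: 1}, {T: 1602237615000, V: 2}},
	}
	fmt.Printf("%d samples for %v\n", len(s.Samples), s.Labels)
}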

Data Organization

For cloud-native scenarios, what characteristics do monitoring data have?

  1. Short data lifecycle. The lifespan of individual containers is brief (e.g., in scaling scenarios or during extensive temporary tasks), leading to rapid timeseries growth along certain time dimensions.

  2. Vertical writes, horizontal reads. Each scrape appends the newest sample to many series at once (a vertical write across series), while a query typically reads a longer time range of only a few series (a horizontal read).

With these issues in mind, let’s look at how the data files are organized to address or sidestep these problems.

First, examine the logical structure:

The entire database consists of blocks plus a head. Each block is further broken down into chunks, while the head serves as a read-write buffer composed of in-memory data and a write-ahead log (WAL). A block's chunks files hold chunk data from many timeseries, although each individual chunk belongs to exactly one series.

The on-disk directory structure (one block, plus the head chunks and the WAL) looks like this:

├── 01BKGV7JC0RY8A6MACW02A2PJD  // the block's ULID
│   ├── chunks
│   │   └── 000001
│   ├── tombstones
│   ├── index
│   └── meta.json
├── chunks_head
│   └── 000001
└── wal
    ├── 000000002
    └── checkpoint.00000001
        └── 00000000
  • Block: Contains all data for a given time period (default 2 hours) and is read-only, named using a ULID. Each block includes:
    • Chunks: segment files holding the raw chunk data, each capped at a maximum segment size (512MB by default)
    • Index: index file mainly containing inverted index information
    • meta.json: metadata including block’s minTime and maxTime for data skipping during queries.
  • Chunks Head: the memory-mapped chunk files of the head, i.e. the part currently being written. A head chunk holds at most 120 samples; once it is full (or the head's 2-hour range is crossed), it is written out here and becomes read-only.
  • WAL: the write-ahead log, which guarantees that buffered head data can be recovered after a crash.

This layout already reveals a lot: the WAL ensures durability, while the head acts much like a buffer pool in a TSDB, accumulating in-memory data for batched flushing to disk. When certain conditions are met (e.g., a time threshold or a data-size threshold), the head's data becomes an immutable block and is flushed to disk.
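
A heavily simplified sketch of that write path, under my own assumptions (the real head keeps per-series chunks and flushes once it spans well over a block range; a single threshold stands in for those conditions here): every append is logged to the WAL first, then buffered in the head, and once the head covers a full block range it is flushed and the WAL is truncated.

package main

import "fmt"

type Sample struct {
	T int64
	V float64
}

// wal is a stand-in for the write-ahead log: a record is made durable
// before it is applied to the in-memory head.
type wal struct{ records []string }

func (w *wal) Log(id string, s Sample) {
	w.records = append(w.records, fmt.Sprintf("%s %d %g", id, s.T, s.V))
}

// head buffers the most recent samples in memory, much like a buffer pool.
type head struct {
	wal     *wal
	series  map[string][]Sample
	minT    int64
	blockMs int64 // block time range, e.g. 2h = 7_200_000 ms
}

func (h *head) Append(id string, s Sample) {
	h.wal.Log(id, s) // durability first
	if len(h.series) == 0 || s.T < h.minT {
		h.minT = s.T // track the earliest buffered timestamp
	}
	h.series[id] = append(h.series[id], s)
	if s.T-h.minT >= h.blockMs { // head now spans a full block range
		fmt.Println("flush head to an immutable block, then truncate the WAL")
		h.series = map[string][]Sample{} // in reality this becomes a block on disk
		h.wal.records = nil
	}
}

func main() {
	h := &head{wal: &wal{}, series: map[string][]Sample{}, blockMs: 7_200_000}
	h.Append(`request_total{method="GET"}`, Sample{T: 0, V: 1})
	h.Append(`request_total{method="GET"}`, Sample{T: 7_200_000, V: 2})
}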

Overall, many design concepts in data organization resemble the LSM storage structure, which indeed suits TSDB well.

Prometheus’s design approach can be summarized as follows:

  • Using time-based data partitioning to resolve the issue of short data lifecycles.
  • Using in-memory batching to handle scenarios where only the latest data is written.

Setting aside similar aspects with LevelDB, let’s outline the differences.

First, the underlying models are different. LevelDB is a key-value store, while TSDB focuses on timeseries with a strong temporal connection, where time is monotonically increasing. It rarely writes historical data. Additionally, the query models differ; TSDB provides diverse query options, such as filtering timeseries based on various label set operations, necessitating more metadata for efficient querying.

Due to these requirements, new structures and functions are introduced: inverted indexes, checkpoints, tombstones, retention policies, and a compaction design distinct from the LSM key-value model. These will be analyzed in relation to the corresponding file formats.

File Organization Format

Let’s examine the components; the specifics of the organizational method are not the focus of this article.

meta.json

This file describes the block; the most useful pieces are the minTime/maxTime timestamps and the compaction section.

minTime and maxTime record the block's time range, which allows whole blocks to be skipped during queries.

The compaction section records the block's history, such as how many compaction rounds it has gone through (level) and which source blocks it was merged from. Its precise use isn't spelled out, but it likely helps compaction and retention jobs avoid duplicating work.

{
    "ulid": "01EM6Q6A1YPX4G9TEB20J22B2R", 
    "minTime": 1602237600000,
    "maxTime": 1602244800000,
    "stats": {
        "numSamples": 553673232,
        "numSeries": 1346066,
        "numChunks": 4440437
    },
    "compaction": {
        "level": 1,
        "sources": [
            "01EM65SHSX4VARXBBHBF0M0FDS",
            "01EM6GAJSYWSQQRDY782EA5ZPN"
        ]
    },
    "version": 1
}
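
A hedged sketch of how minTime and maxTime enable data skipping: the struct mirrors the fields of meta.json shown above, while the overlap check is my own illustration rather than Prometheus's query code.

package main

import "fmt"

// blockMeta mirrors the meta.json fields that matter for query planning.
type blockMeta struct {
	ULID    string
	MinTime int64 // inclusive, milliseconds
	MaxTime int64 // exclusive, milliseconds
}

// overlaps reports whether the block can contain samples in [qMin, qMax).
// Blocks failing this check are skipped entirely during a query.
func overlaps(b blockMeta, qMin, qMax int64) bool {
	return b.MinTime < qMax && qMin < b.MaxTime
}

func main() {
	b := blockMeta{ULID: "01EM6Q6A1YPX4G9TEB20J22B2R", MinTime: 1602237600000, MaxTime: 1602244800000}
	fmt.Println(overlaps(b, 1602240000000, 1602241000000)) // true: the query range falls inside the block
	fmt.Println(overlaps(b, 1602250000000, 1602260000000)) // false: the block can be skipped
}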

chunks

These are standard data files, with their indexes stored in the index file. Note that a chunk can only belong to one timeseries, and a timeseries consists of multiple chunks.

┌──────────────────────────────┐
│  magic(0x85BD40DD) <4 byte>  │
├──────────────────────────────┤
│    version(1) <1 byte>       │
├──────────────────────────────┤
│    padding(0) <3 byte>       │
├──────────────────────────────┤
│ ┌──────────────────────────┐ │
│ │         Chunk 1          │ │
│ ├──────────────────────────┤ │
│ │          ...             │ │
│ ├──────────────────────────┤ │
│ │         Chunk N          │ │
│ └──────────────────────────┘ │
└──────────────────────────────┘

Every Chunk:
┌───────────────┬───────────────────┬──────────────┬────────────────┐
│ len <uvarint> │ encoding <1 byte> │ data <bytes> │ CRC32 <4 byte> │
└───────────────┴───────────────────┴──────────────┴────────────────┘
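
As a rough illustration of the layout above, the following sketch walks a chunks segment: it checks the magic, skips the header, then reads each chunk's uvarint length, encoding byte, data, and CRC32. It is an illustrative parser, not the real chunkenc reader.

package main

import (
	"encoding/binary"
	"fmt"
)

// walkChunks iterates a chunks segment laid out as in the diagram above:
// magic <4B> | version <1B> | padding <3B> | repeated (len <uvarint> | encoding <1B> | data | CRC32 <4B>).
func walkChunks(buf []byte) error {
	if len(buf) < 8 {
		return fmt.Errorf("segment too small")
	}
	if magic := binary.BigEndian.Uint32(buf[0:4]); magic != 0x85BD40DD {
		return fmt.Errorf("bad magic %#x", magic)
	}
	pos := 8 // skip magic, version and padding
	for pos < len(buf) {
		dataLen, n := binary.Uvarint(buf[pos:]) // length of the chunk data
		if n <= 0 {
			return fmt.Errorf("bad length at offset %d", pos)
		}
		pos += n
		end := pos + 1 + int(dataLen) + 4 // encoding byte + data + CRC32
		if end > len(buf) {
			return fmt.Errorf("truncated chunk at offset %d", pos)
		}
		enc := buf[pos]
		crc := binary.BigEndian.Uint32(buf[end-4 : end])
		fmt.Printf("chunk: encoding=%d dataLen=%d crc=%#x\n", enc, dataLen, crc)
		pos = end
	}
	return nil
}

func main() {
	// A tiny hand-built segment: header plus one chunk (len=2, encoding=1, two data bytes, dummy CRC).
	seg := []byte{
		0x85, 0xBD, 0x40, 0xDD, 1, 0, 0, 0,
		2, 1, 0xAA, 0xBB, 0, 0, 0, 0,
	}
	if err := walkChunks(seg); err != nil {
		fmt.Println(err)
	}
}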

tombstone

Tombstones mark deleted data. A TSDB can see delete operations in scenarios such as transient jobs or container destruction, where business logic requires removal. Tombstones let deletes be recorded as appends rather than in-place modifications; the disk space is reclaimed later when blocks are compacted.

Of course, never explicitly deleting data is also fine; retention (TTL) will eventually remove obsolete data anyway.

┌────────────────────────────┬─────────────────────┐
│ magic(0x0130BA30) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                Tombstone 1                   │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                Tombstone N                   │ │
│ ├──────────────────────────────────────────────┤ │
│ │                  CRC<4b>                     │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

Every Tombstone:
┌────────────────────────┬─────────────────┬─────────────────┐
│ series ref <uvarint64> │ mint <varint64> │ maxt <varint64> │
└────────────────────────┴─────────────────┴─────────────────┘
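
A minimal sketch (types and function names are mine) of how such an interval is applied at read time: samples of the referenced series that fall inside [mint, maxt] are filtered out until compaction physically removes them.

package main

import "fmt"

// tombstone marks the interval [MinT, MaxT] of one series as deleted.
type tombstone struct {
	SeriesRef  uint64
	MinT, MaxT int64
}

type sample struct {
	T int64
	V float64
}

// filterDeleted drops the samples of a series that are covered by a tombstone.
func filterDeleted(ref uint64, samples []sample, stones []tombstone) []sample {
	var out []sample
	for _, s := range samples {
		deleted := false
		for _, t := range stones {
			if t.SeriesRef == ref && s.T >= t.MinT && s.T <= t.MaxT {
				deleted = true
				break
			}
		}
		if !deleted {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	samples := []sample{{T: 10, V: 1}, {T: 20, V: 2}, {T: 30, V: 3}}
	stones := []tombstone{{SeriesRef: 7, MinT: 15, MaxT: 25}}
	fmt.Println(filterDeleted(7, samples, stones)) // [{10 1} {30 3}]
}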

index file

This file contains all information needed for reading, such as inverted indexes and the mapping of timeseries to chunks.

Notable structures include Series and Postings.

The Series section records, for each series in the block, its label set and the chunks that hold its data.

The Postings Offset Table records where each label pair's postings list (inverted index) is located; the actual inverted index content lives in the Postings sections.

By combining these inverted indexes with set operations (intersection, union), you can quickly filter down to the timeseries that match the specified criteria.

┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                 Symbol Table                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                    Series                    │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index 1                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                 Label Index N                │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings 1                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      ...                     │ │
│ ├──────────────────────────────────────────────┤ │
│ │                   Postings N                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │               Label Offset Table             │ │
│ ├──────────────────────────────────────────────┤ │
│ │             Postings Offset Table            │ │
│ ├──────────────────────────────────────────────┤ │
│ │                      TOC                     │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

A Series:
┌──────────────────────────────────────────────────────┐
│ len <uvarint>                                        │
├──────────────────────────────────────────────────────┤
│ ┌──────────────────────────────────────────────────┐ │
│ │            labels count <uvarint64>              │ │
│ ├──────────────────────────────────────────────────┤ │
│ │  ┌────────────────────────────────────────────┐  │ │
│ │  │ ref(l_i.name) <uvarint32>                  │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ ref(l_i.value) <uvarint32>                 │  │ │
│ │  └────────────────────────────────────────────┘  │ │
│ │                       ...                        │ │
│ ├──────────────────────────────────────────────────┤ │
│ │            chunks count <uvarint64>              │ │
│ ├──────────────────────────────────────────────────┤ │
│ │  ┌────────────────────────────────────────────┐  │ │
│ │  │ c_0.mint <varint64>                        │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ c_0.maxt - c_0.mint <uvarint64>            │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ ref(c_0.data) <uvarint64>                  │  │ │
│ │  └────────────────────────────────────────────┘  │ │
│ │  ┌────────────────────────────────────────────┐  │ │
│ │  │ c_i.mint - c_i-1.maxt <uvarint64>          │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ c_i.maxt - c_i.mint <uvarint64>            │  │ │
│ │  ├────────────────────────────────────────────┤  │ │
│ │  │ ref(c_i.data) - ref(c_i-1.data) <varint64> │  │ │
│ │  └────────────────────────────────────────────┘  │ │
│ │                       ...                        │ │
│ └──────────────────────────────────────────────────┘ │
├──────────────────────────────────────────────────────┤
│ CRC32 <4b>                                           │
└──────────────────────────────────────────────────────┘
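
Note how the chunk metadata inside a series entry is delta-encoded: only the first chunk stores an absolute mint; later chunks store mint relative to the previous chunk's maxt, the span as maxt - mint, and the data reference as a delta. The sketch below follows the layout in the diagram (labels and CRC omitted); the type and helper names are mine.

package main

import (
	"encoding/binary"
	"fmt"
)

type chunkMeta struct {
	MinT, MaxT int64  // time range covered by the chunk
	Ref        uint64 // position of the chunk data in the chunks file
}

// encodeChunkMetas encodes chunk metadata the way the series entry above lays
// it out: absolute values for the first chunk, deltas for the rest.
func encodeChunkMetas(chks []chunkMeta) []byte {
	var buf []byte
	tmp := make([]byte, binary.MaxVarintLen64)
	putUvarint := func(v uint64) { buf = append(buf, tmp[:binary.PutUvarint(tmp, v)]...) }
	putVarint := func(v int64) { buf = append(buf, tmp[:binary.PutVarint(tmp, v)]...) }

	putUvarint(uint64(len(chks))) // chunks count
	for i, c := range chks {
		if i == 0 {
			putVarint(c.MinT)                   // c_0.mint is absolute
			putUvarint(uint64(c.MaxT - c.MinT)) // span
			putUvarint(c.Ref)                   // absolute data reference
			continue
		}
		prev := chks[i-1]
		putUvarint(uint64(c.MinT - prev.MaxT))    // mint as delta to the previous maxt
		putUvarint(uint64(c.MaxT - c.MinT))       // span
		putVarint(int64(c.Ref) - int64(prev.Ref)) // reference delta (may be negative)
	}
	return buf
}

func main() {
	out := encodeChunkMetas([]chunkMeta{
		{MinT: 0, MaxT: 1000, Ref: 8},
		{MinT: 1000, MaxT: 2000, Ref: 140},
	})
	fmt.Printf("encoded 2 chunk references into %d bytes\n", len(out))
}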

A Postings:
┌────────────────────┬────────────────────┐
│ len <4b>           │ #entries <4b>      │
├────────────────────┴────────────────────┤
│ ┌─────────────────────────────────────┐ │
│ │ ref(series_1) <4b>                  │ │
│ ├─────────────────────────────────────┤ │
│ │ ...                                 │ │
│ ├─────────────────────────────────────┤ │
│ │ ref(series_n) <4b>                  │ │
│ └─────────────────────────────────────┘ │
├─────────────────────────────────────────┤
│ CRC32 <4b>                              │
└─────────────────────────────────────────┘

Accelerating Disk Queries

Let’s focus on how a query locates the relevant data:

  • First, it looks up the Postings Offset Table to find the position of the postings list for each label pair in the selector.
  • It reads those Postings (intersecting them when there are multiple matchers) to get the matching series references, then follows each reference into the Series section to find the series' chunk locations.
  • Finally, it reads the corresponding chunks for each timeseries from the chunks files.

(Figure: query path from the Postings Offset Table to Postings, Series, and finally the chunks: https://img.alicdn.com/imgextra/i4/581166664/O1CN01HsQNy31z6A3HT9l5B_!!581166664.jpg)
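
The label filtering in the first two steps boils down to set operations on sorted postings lists: each label pair maps to a sorted list of series references, and a selector such as {__name__="request_total", method="GET"} is the intersection of the corresponding lists. A minimal sketch of that intersection:

package main

import "fmt"

// intersect merges two sorted postings lists (series references) and keeps
// only the references present in both, which is what a selector with two
// label matchers needs.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	requestTotal := []uint64{3, 7, 12, 45} // postings for __name__="request_total"
	methodGET := []uint64{7, 9, 45, 80}    // postings for method="GET"
	fmt.Println(intersect(requestTotal, methodGET)) // [7 45]
}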

Compaction

Similar to LevelDB, Prometheus utilizes both major and minor compaction processes, termed Compaction and Head Compaction.

Head Compaction is essentially the process of persisting the head into an on-disk block (chunks plus index); data marked for deletion in the head is dropped at this point instead of being written out.

Compaction is the merging of blocks, accomplishing multiple aims:

  1. Reclaiming disk resources used by marked deletions.
  2. Consolidating duplicate information scattered across multiple blocks, such as shared chunks and inverted index records.
  3. Improving query speed by resolving data overlap across blocks; handling this once during compaction is cheaper than merging in memory on every read.

When does compaction occur?

The official blog doesn't spell this out well, mostly mentioning that compaction happens when blocks overlap. In practice several triggers exist, much as in LevelDB: periodic time-based checks, a check after each head (minor) compaction, the share of a block covered by tombstones, and manual triggers.
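
To make the overlap-driven case concrete, here is a hedged sketch (not Prometheus's actual planner) that sorts blocks by their time range and returns one group of overlapping blocks as a compaction candidate:

package main

import (
	"fmt"
	"sort"
)

type blockMeta struct {
	ULID             string
	MinTime, MaxTime int64 // [MinTime, MaxTime)
}

// overlappingGroup returns one group of blocks whose time ranges overlap,
// a simplified stand-in for picking a compaction candidate.
func overlappingGroup(blocks []blockMeta) []blockMeta {
	sort.Slice(blocks, func(i, j int) bool { return blocks[i].MinTime < blocks[j].MinTime })
	group := []blockMeta{}
	var groupMax int64
	for _, b := range blocks {
		if len(group) > 0 && b.MinTime < groupMax {
			group = append(group, b) // overlaps the running group: merge it in
			if b.MaxTime > groupMax {
				groupMax = b.MaxTime
			}
			continue
		}
		if len(group) > 1 {
			return group // found an overlapping group worth compacting
		}
		group = []blockMeta{b}
		groupMax = b.MaxTime
	}
	if len(group) > 1 {
		return group
	}
	return nil
}

func main() {
	got := overlappingGroup([]blockMeta{
		{ULID: "A", MinTime: 0, MaxTime: 7_200_000},
		{ULID: "B", MinTime: 3_600_000, MaxTime: 10_800_000}, // overlaps A
		{ULID: "C", MinTime: 14_400_000, MaxTime: 21_600_000},
	})
	fmt.Println(got) // blocks A and B
}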

Retention

This is straightforward: blocks are expired by a time-based or size-based TTL. Folding this into the compaction process would also be a viable approach.
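
A sketch of the time-based variant under my own assumptions: the newest block's maxTime anchors "now", and any block whose entire range falls outside the retention window is dropped.

package main

import "fmt"

type block struct {
	ULID             string
	MinTime, MaxTime int64
}

// expired returns the blocks whose entire time range falls outside the
// retention window, measured back from the newest data on disk.
func expired(blocks []block, retentionMs int64) []block {
	var newest int64
	for _, b := range blocks {
		if b.MaxTime > newest {
			newest = b.MaxTime
		}
	}
	var drop []block
	for _, b := range blocks {
		if b.MaxTime < newest-retentionMs { // the whole block is older than the window
			drop = append(drop, b)
		}
	}
	return drop
}

func main() {
	blocks := []block{
		{ULID: "old", MinTime: 0, MaxTime: 7_200_000},
		{ULID: "new", MinTime: 86_400_000, MaxTime: 93_600_000},
	}
	fmt.Println(expired(blocks, 3_600_000)) // only the "old" block is dropped
}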

Snapshot

This dumps the in-memory data to disk. It is likely meant to balance the cost of heavy metric write traffic against data safety; otherwise its functionality would largely duplicate the WAL's.
