LevelDB MVCC
February 8, 2025 · 502 words · 3 min · LevelDB MVCC
LevelDB implements concurrent sstable read/write operations and snapshot reads through MVCC. Let’s examine its implementation.
Sequence Number LevelDB uses Sequence Numbers as logical clocks to maintain a total order of KV write operations. The Sequence Number is encoded in the last few bytes of the InternalKey. This encoding ensures data ordering during memory writes.
Versioning Every change to the sstable file collection triggers a version upgrade in LevelDB. Each Version represents the database state at a specific moment, containing sstable metadata and compaction-related information. VersionEdit records version changes.
Version1 ---VersionEdit--> Version2
The VersionSet is represented as an ordered linked list of Versions, reflecting the database’s current and historical states. The LastSeq (last sequence number) and Version linked list are critical components.
The Version linked list tracks all stored Versions and their changes, with reference counting (RC) used for garbage collection. The Version Chain describes the total order of sstable write operations across different times.
WAL + Manifest ensures atomic LastSeq updates. The Manifest file acts as a WAL for version changes, working with Version Commit operations to guarantee atomic updates of the latest version in VersionSet.
During version transitions:
- Write operations generate VersionEdit records in memory
- Changes are written to the Manifest
- VersionSet updates to the new Version
This process ensures version consistency: compaction-induced version changes won’t affect ongoing read operations, and readers never access intermediate sstable states.
class VersionEdit {
/** Other code */
typedef std::set<std::pair<int, uint64_t>> DeletedFileSet;
std::string comparator_;
uint64_t log_number_;
uint64_t prev_log_number_;
uint64_t next_file_number_;
SequenceNumber last_sequence_;
bool has_comparator_;
bool has_log_number_;
bool has_prev_log_number_;
bool has_next_file_number_;
bool has_last_sequence_;
std::vector<std::pair<int, InternalKey>> compact_pointers_;
DeletedFileSet deleted_files_;
std::vector<std::pair<int, FileMetaData>> new_files_;
};
class Version {
VersionSet*vset_;
Version* next_;
Version* prev_;
int refs_;
std::vector<FileMetaData*> files_[config::kNumLevels];
FileMetaData* file_to_compact_;
int file_to_compact_level_;
double compaction_score_;
int compaction_level_;
};
class VersionSet {
Env*const env_;
const std::string dbname_;
const Options* const options_;
TableCache* const table_cache_;
const InternalKeyComparator icmp_;
uint64_t next_file_number_;
uint64_t manifest_file_number_;
uint64_t last_sequence_;
uint64_t log_number_;
uint64_t prev_log_number_;
WritableFile* descriptor_file_;
log::Writer* descriptor_log_;
Version dummy_versions_;
Version* current_;
std::string compact_pointer_[config::kNumLevels];
}
MVCC & Snapshot Read
MVCC primarily resolves concurrent read and write conflicts on SSTables.
Memtable Operations: Reads and writes to the Memtable use a skip list, which inherently introduces conflicts. SSTable Operations: Reads and writes (compaction and read operations) do not interfere with each other. Each write operation is associated with a Sequence Number, and SSTables are only appended to. Compaction merely merges SSTables into new files.
Each read operation is associated with a Sequence Number and Version. The Sequence Number ensures that subsequent writes are invisible to the read operation, and the associated Version ensures that the SSTables used during the read are not garbage collected. This guarantees that the read operation does not encounter changes to the SSTables it uses.
If a read and write operate within the same Version, the compaction must complete first, and the Version does not change. This ensures that the compaction process does not interfere with the read. For other cases, reads always precede writes, making writes invisible to the read and eliminating conflicts.
Reference
https://leveldb-handbook.readthedocs.io/en/latest/index.html