Apache-ORC Quick Investigation

October 5, 2022 · 565 words · 3 min · Column Store Big Data Storage

Iceberg supports both ORC and Parquet columnar formats. Compared to Parquet, ORC offers advantages in query performance and ACID support. Considering the future data lakehouse requirements for query performance and ACID compliance, we are researching ORC to support a future demo involving Flink, Iceberg, and ORC. Research Focus: ORC file encoding, file organization, and indexing support. File Layout: An ORC file can be divided into three main sections. Header: Identifies the file type.

Apache-Iceberg Quick Investigation

October 5, 2022 · 1208 words · 6 min · Lake House Storage Big Data

Apache Iceberg is a table format for large-scale analysis of datasets: a specification for organizing data files and metadata files, and a schema semantic abstraction between storage and computation. It was developed and open-sourced by Netflix to enhance scalability, reliability, and usability. Background: Issues encountered when migrating Hive to the cloud: Dependency on List and Rename semantics makes it impossible to replace HDFS with cheaper OSS. Scalability issues: Schema information in Hive is centrally stored in the metastore, which can become a performance bottleneck.

LevelDB Write

May 10, 2022 · 712 words · 4 min · LSM LevelDB

This is the second chapter of my notes on reading the LevelDB source code, focusing on the write flow of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts. Main Process: The main write logic of LevelDB is relatively simple. First, the write operation is encapsulated into a WriteBatch, and then it is executed: Status DB::Put(const WriteOptions& opt, const Slice& key, const Slice& value) { WriteBatch batch; batch.

MIT6.824-RaftKV

April 15, 2022 · 1039 words · 5 min · Raft Distributed System Consensus MIT6.824

Earlier, I looked at the code of Casbin-Mesh because I wanted to try GSoC. Casbin-Mesh is a distributed Casbin application based on Raft. The RaftKV lab in MIT6.824 is quite similar, so I took the opportunity to write this blog post. Lab Overview: Lab 03 involves building a distributed KV service based on Raft. We need to implement the server and client for this service. The structure of RaftKV and the interaction between its modules are shown below:

LevelDB Startup

April 9, 2022 · 1312 words · 7 min · LSM LevelDB

This is the first chapter of my notes on reading the LevelDB source code, focusing on the startup process of LevelDB. This article is not a step-by-step source code tutorial, but rather a learning note that records my questions and thoughts. A code repository with annotations will be shared on GitHub later for those interested in studying it. Prerequisites: Database Files. For now, I won’t delve into the encoding and naming details of these files (as I haven’t reached that part yet).