Apache-ORC Quick Investigation
October 5, 2022 · 565 words · 3 min · Column Store Big Data Storage
Iceberg supports both ORC and Parquet columnar formats. Compared to Parquet, ORC offers advantages in query performance and ACID support. Considering the future data lakehouse requirements for query performance and ACID compliance, we are researching ORC to support a future demo involving Flink, Iceberg, and ORC.
Research Focus: ORC file encoding, file organization, and indexing support.
File Layout
An ORC file can be divided into three main sections:
- Header: Identifies the file type.
- Body: Contains row data and indexes, as shown below.
- Tail: Contains top-level file information.
Figure: ORC file layout (source: ORC Specification v1)
File Tail
Because distributed storage generally supports only appends, ORC keeps its top-level file information in a tail section at the end of the file.
The tail contains:
- Postscript: Contains essential information for parsing the footer and metadata, such as the length of each section and compression method.
- Footer: Stores schema information, row count, column statistics, and more.
- Stripe Statistics and Metadata: Includes column-level statistics for each stripe.
Postscript
The postscript is uncompressed and contains:
- Footer length
- Compression type
- Metadata length
- File identifier (“ORC”)
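Per the ORC specification, the last byte of the file holds the length of the postscript, so a reader can work backwards from the end of the file. The sketch below is only an illustration of that idea, not the ORC library's reader: it decodes just the varint and length-delimited Protobuf fields, ignores encrypted stripe statistics, and the function names (`read_varint`, `parse_postscript`, `tail_layout`) are hypothetical. The field numbers used (1 = footerLength, 2 = compression, 5 = metadataLength, 8000 = magic) are taken from the specification's `PostScript` message.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one Protobuf varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def parse_postscript(ps: bytes) -> dict:
    """Map field number -> value for varint (0) and length-delimited (2) fields."""
    fields = {}
    pos = 0
    while pos < len(ps):
        tag, pos = read_varint(ps, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            fields[number], pos = read_varint(ps, pos)
        elif wire_type == 2:
            length, pos = read_varint(ps, pos)
            fields[number] = ps[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return fields


def tail_layout(file_bytes: bytes) -> dict:
    """Locate the tail sections, working backwards from the last byte."""
    ps_len = file_bytes[-1]                      # last byte = postscript length
    ps_start = len(file_bytes) - 1 - ps_len
    ps = parse_postscript(file_bytes[ps_start:-1])
    if ps.get(8000) != b"ORC":                   # file identifier
        raise ValueError("not an ORC file")
    footer_start = ps_start - ps[1]              # field 1: footerLength
    metadata_start = footer_start - ps.get(5, 0)  # field 5: metadataLength
    return {
        "postscript_start": ps_start,
        "footer_start": footer_start,
        "metadata_start": metadata_start,
        "compression": ps.get(2, 0),             # enum value, 0 = NONE
    }


# Synthetic file (made-up sizes): header, 10-byte body, 20-byte metadata,
# 100-byte footer, hand-encoded postscript, and the postscript length byte.
postscript = b"\x08\x64\x10\x01\x28\x14\x82\xf4\x03\x03ORC"
fake_file = b"ORC" + bytes(10) + bytes(20) + bytes(100) + postscript + bytes([len(postscript)])
print(tail_layout(fake_file))
```

In a real file the footer and metadata are usually compressed with the codec named in the postscript, so they would need to be decompressed before being parsed as Protobuf.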
Footer
The footer includes the schema, row count, column-level statistics, and a list of stripes that make up the file body.
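message Footer {
  optional uint64 headerLength = 1;            // length of the file header in bytes
  optional uint64 contentLength = 2;           // length of the header plus body in bytes
  repeated StripeInformation stripes = 3;      // location of each stripe
  repeated Type types = 4;                     // schema, as a flattened type tree
  repeated UserMetadataItem metadata = 5;      // user-supplied key/value metadata
  optional uint64 numberOfRows = 6;            // total rows in the file
  repeated ColumnStatistics statistics = 7;    // file-level statistics per column
  optional uint32 rowIndexStride = 8;          // maximum rows per row index entry
  optional uint32 writer = 9;                  // code identifying the writer implementation
  optional Encryption encryption = 10;         // encryption information for the file
  optional uint64 stripeStatisticsLength = 11; // bytes of encrypted stripe statistics
}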
- Stripe Information: Data in the body is organized into multiple stripes. Each stripe contains index data, row data, and a stripe footer, and the data within a stripe is stored column by column.
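message StripeInformation {
  optional uint64 offset = 1;        // start of the stripe within the file
  optional uint64 indexLength = 2;   // length of the index data in bytes
  optional uint64 dataLength = 3;    // length of the row data in bytes
  optional uint64 footerLength = 4;  // length of the stripe footer in bytes
  optional uint64 numberOfRows = 5;  // number of rows in the stripe
}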
- Type Information: ORC uses a tree structure to represent nested data types, and all rows in a file must share the same schema.
create table Foobar (
  myInt int,
  myMap map<string, struct<myString : string, myDouble: double>>,
  myTime timestamp
);
- Column Statistics: Simple statistics for each column are available to support coarse-grained filtering.
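The ORC specification flattens the type tree into a list with a pre-order traversal, giving each node the next column ID and the root struct ID 0. The sketch below applies that rule to the `Foobar` table above. `TypeNode` and `assign_column_ids` are hypothetical names used only for illustration, and the `key`/`value` labels for the map's children are my own, not taken from the ORC format.

```python
from dataclasses import dataclass, field


@dataclass
class TypeNode:
    name: str
    kind: str
    children: list = field(default_factory=list)


def assign_column_ids(root: TypeNode) -> list:
    """Return (column_id, path, kind) tuples in pre-order; the root gets ID 0."""
    result = []

    def visit(node: TypeNode, path: str) -> None:
        # The next free ID is simply the number of nodes visited so far.
        result.append((len(result), path, node.kind))
        for child in node.children:
            child_path = f"{path}.{child.name}" if path else child.name
            visit(child, child_path)

    visit(root, "")
    return result


# The Foobar table as a type tree; the table itself is the root struct.
foobar = TypeNode("", "struct", [
    TypeNode("myInt", "int"),
    TypeNode("myMap", "map", [
        TypeNode("key", "string"),
        TypeNode("value", "struct", [
            TypeNode("myString", "string"),
            TypeNode("myDouble", "double"),
        ]),
    ]),
    TypeNode("myTime", "timestamp"),
])

for column_id, path, kind in assign_column_ids(foobar):
    print(column_id, path or "<root>", kind)
```

With this numbering `myInt` is column 1, `myMap` is column 2 with its children in columns 3 to 6, and `myTime` is column 7. Statistics and index streams refer to columns by these IDs.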
Stripes
The body of an ORC file is split into stripes, which are large chunks of data (typically ~200MB) that contain:
- Index Data
- Row Data
- Stripe Footer
The Stripe Footer holds the encoding of each column and a directory of the stripe's streams (the kind, column, and length of each stream), along with any stripe-level encryption information.
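Because the three parts of a stripe are stored back to back, the `StripeInformation` fields in the file footer are enough to compute where each part begins and ends. A minimal sketch, assuming that contiguous layout; `StripeInfo` and `stripe_ranges` are hypothetical names, and the numbers in the example are made up.

```python
from typing import NamedTuple


class StripeInfo(NamedTuple):
    """Mirrors the fields of the StripeInformation message."""
    offset: int
    index_length: int
    data_length: int
    footer_length: int
    number_of_rows: int


def stripe_ranges(s: StripeInfo) -> dict:
    """Return half-open (start, end) byte ranges for each part of a stripe."""
    index_start = s.offset
    data_start = index_start + s.index_length
    footer_start = data_start + s.data_length
    footer_end = footer_start + s.footer_length
    return {
        "index": (index_start, data_start),
        "data": (data_start, footer_start),
        "footer": (footer_start, footer_end),
    }


# Made-up stripe that starts right after the 3-byte file header.
stripe = StripeInfo(offset=3, index_length=1_000, data_length=50_000,
                    footer_length=200, number_of_rows=10_000)
print(stripe_ranges(stripe))
```

A reader would typically fetch the footer range first, since the stream directory in the stripe footer says which bytes of the index and data ranges belong to the columns it needs.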
Index Support
ORC supports three levels of indexing:
Level | Location | Data Content |
---|---|---|
File Level | File Footer | Column-level statistics for the entire file |
Stripe Level | File Footer | Column-level statistics for each stripe |
Row Level | Beginning of Stripe | Statistics for each row group and their start position |
Row Level Index
The row-level index consists of a row group index and a Bloom filter index.
Row Group Index
Each primitive column has a ROW_INDEX stream that contains one RowIndexEntry per row group.
Default row group size: 10,000 rows
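message RowIndex {
  repeated RowIndexEntry entry = 1;                // one entry per row group
}
message RowIndexEntry {
  repeated uint64 positions = 1 [packed=true];     // stream positions of the row group's start
  optional ColumnStatistics statistics = 2;        // statistics for the row group
}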
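The statistics in each RowIndexEntry let a reader skip row groups whose value range cannot satisfy a predicate. The sketch below shows that pruning for an integer column and an equality predicate. It is a simplified illustration rather than the ORC reader's logic: `RowGroupStats`, `candidate_row_groups`, and `row_range` are hypothetical names, and the statistics are made up.

```python
from dataclasses import dataclass

ROW_INDEX_STRIDE = 10_000  # default row group size


@dataclass
class RowGroupStats:
    """Min/max of one column within one row group (a subset of ColumnStatistics)."""
    minimum: int
    maximum: int


def candidate_row_groups(stats: list, value: int) -> list:
    """Indexes of row groups whose [min, max] range could contain value."""
    return [i for i, s in enumerate(stats) if s.minimum <= value <= s.maximum]


def row_range(group: int, stride: int = ROW_INDEX_STRIDE) -> tuple:
    """Half-open range of stripe-relative row numbers covered by a row group."""
    return group * stride, (group + 1) * stride


# Made-up statistics for three row groups of one column.
stats = [RowGroupStats(0, 99), RowGroupStats(100, 250), RowGroupStats(200, 400)]
for group in candidate_row_groups(stats, 220):
    print("must read rows", row_range(group))
```

Ranges can overlap when the data is not sorted on the column, which is why the value 220 selects both the second and third row groups here. The last row group of a stripe may hold fewer rows than the stride.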
Bloom Filter Index
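A column for which the writer has Bloom filters enabled gets a BLOOM_FILTER stream holding one Bloom filter per row group, which lets a reader skip row groups that cannot contain a searched value.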
message BloomFilter {
  optional uint32 numHashFunctions = 1;  // number of hash functions applied per value
  repeated fixed64 bitset = 2;           // the filter's bits, packed into 64-bit words
}
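To make the two fields concrete, here is a toy Bloom filter that stores its bits in 64-bit words and sets `num_hash_functions` bits per value. This is a sketch under stated assumptions, not ORC's implementation: it derives bit positions from SHA-256 with double hashing purely for convenience, so it will not produce bitsets compatible with real ORC files, and `ToyBloomFilter` is a hypothetical name.

```python
import hashlib


class ToyBloomFilter:
    def __init__(self, num_bits: int, num_hash_functions: int):
        self.num_bits = num_bits
        self.num_hash_functions = num_hash_functions
        # One Python int per 64-bit word, analogous to `repeated fixed64 bitset`.
        self.bitset = [0] * ((num_bits + 63) // 64)

    def _positions(self, value: bytes):
        # Assumption: SHA-256 stands in for the hash function a real writer uses.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        for i in range(1, self.num_hash_functions + 1):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bitset[pos // 64] |= 1 << (pos % 64)

    def might_contain(self, value: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        return all((self.bitset[pos // 64] >> (pos % 64)) & 1
                   for pos in self._positions(value))


bloom = ToyBloomFilter(num_bits=1024, num_hash_functions=3)
for name in (b"alice", b"bob"):
    bloom.add(name)
print(bloom.might_contain(b"alice"), bloom.might_contain(b"carol"))
```

A value that was added always tests positive, so a negative answer lets the reader skip the row group safely. A positive answer may be a false positive, in which case the row group is read and filtered as usual.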
Data Access Path
- Postscript -> Footer -> Retrieve Stripe Information -> Stripe Footer -> Stripe Index -> Row Group -> Column