Apache-ORC Quick Investigation

October 5, 2022 · Column Store · Big Data · Storage

Iceberg supports both ORC and Parquet columnar formats. Compared to Parquet, ORC offers advantages in query performance and ACID support. Considering the future data lakehouse requirements for query performance and ACID compliance, we are researching ORC to support a future demo involving Flink, Iceberg, and ORC.

Research Focus: ORC file encoding, file organization, and indexing support.

File Layout

An ORC file can be divided into three main sections:

  • Header: Identifies the file type.
  • Body: Contains row data and indexes, as shown below.
  • Tail: Contains top-level file information.

[Figure: ORC file layout (source: ORC Specification v1)]

File Tail

Because distributed file systems such as HDFS are generally append-only and do not allow data to be modified once written, ORC stores its top-level file information in a tail at the end of the file rather than in a header.

The tail contains:

  • Postscript: Contains essential information for parsing the footer and metadata, such as the length of each section and compression method.
  • Footer: Stores schema information, row count, column statistics, and more.
  • Metadata (Stripe Statistics): Holds column-level statistics for each stripe.

Postscript

The postscript is uncompressed and contains:

  • Footer length
  • Compression type
  • Metadata length
  • File identifier (“ORC”)
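
As a rough illustration of how a reader might locate the tail sections, here is a minimal Python sketch. It relies on two facts from the ORC specification: the last byte of the file is the postscript length, and the postscript is an uncompressed Protocol Buffers message. The field numbers used (1 = footerLength, 2 = compression, 5 = metadataLength, 8000 = magic) are taken from the specification's PostScript message. The helper names and the synthetic demo file are hypothetical, and in a real file the footer and metadata would still have to be decompressed with the codec named in the postscript.

```python
def read_varint(buf, pos):
    """Decode one protobuf base-128 varint; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


def write_varint(value):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def parse_fields(message):
    """Decode varint and length-delimited fields into {field_number: value}."""
    fields, pos = {}, 0
    while pos < len(message):
        key, pos = read_varint(message, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            fields[number], pos = read_varint(message, pos)
        elif wire_type == 2:  # length-delimited (bytes, string, packed)
            length, pos = read_varint(message, pos)
            fields[number] = message[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return fields


def locate_tail(data):
    """Return byte ranges (start, end) of the tail sections of an ORC file."""
    ps_end = len(data) - 1
    ps_start = ps_end - data[-1]  # final byte = postscript length
    ps = parse_fields(data[ps_start:ps_end])
    footer_start = ps_start - ps[1]  # field 1: footerLength
    metadata_start = footer_start - ps.get(5, 0)  # field 5: metadataLength
    return {
        "compression": ps.get(2, 0),  # field 2: CompressionKind (0 = NONE)
        "magic": ps.get(8000),  # field 8000: file identifier
        "metadata": (metadata_start, footer_start),
        "footer": (footer_start, ps_start),
        "postscript": (ps_start, ps_end),
    }


def build_demo_file():
    """Synthetic stand-in for an ORC file; footer and metadata are opaque bytes."""
    footer, metadata = b"F" * 40, b"M" * 25
    ps = (b"\x08" + write_varint(len(footer))  # 1: footerLength
          + b"\x10" + write_varint(0)  # 2: compression = NONE
          + b"\x28" + write_varint(len(metadata))  # 5: metadataLength
          + write_varint((8000 << 3) | 2) + write_varint(3) + b"ORC")  # 8000: magic
    return b"ORC" + b"B" * 100 + metadata + footer + ps + bytes([len(ps)])


tail = locate_tail(build_demo_file())
print(tail["footer"])  # (128, 168)
```

Reading backwards from the final byte is what lets a reader find everything else with a single small read at the end of the file.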

Footer

The footer includes the schema, row count, column-level statistics, and the list of stripes that make up the file body.

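message Footer {
  optional uint64 headerLength = 1;            // length of the file header in bytes
  optional uint64 contentLength = 2;           // length of the file header and body in bytes
  repeated StripeInformation stripes = 3;      // one entry per stripe in the body
  repeated Type types = 4;                     // schema, as a flattened type tree
  repeated UserMetadataItem metadata = 5;      // user-supplied key/value metadata
  optional uint64 numberOfRows = 6;            // total number of rows in the file
  repeated ColumnStatistics statistics = 7;    // file-level statistics, one per column
  optional uint32 rowIndexStride = 8;          // maximum number of rows per index entry
  optional uint32 writer = 9;                  // code identifying the writer implementation
  optional Encryption encryption = 10;         // encryption information for this file
  optional uint64 stripeStatisticsLength = 11; // bytes of encrypted stripe statistics
}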
  • Stripe Information: Data in the body is organized into multiple stripes. Each stripe contains index data, row data, and a stripe footer, and the data within a stripe is stored column by column.
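message StripeInformation {
  optional uint64 offset = 1;        // start of the stripe within the file
  optional uint64 indexLength = 2;   // length of the index section in bytes
  optional uint64 dataLength = 3;    // length of the data section in bytes
  optional uint64 footerLength = 4;  // length of the stripe footer in bytes
  optional uint64 numberOfRows = 5;  // number of rows in the stripe
}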
  • Type Information: ORC represents nested data types as a tree, which is flattened into the footer's types list by a pre-order traversal. All rows in a file must share the same schema. For example:
create table Foobar (
  myInt int,
  myMap map<string, struct<myString : string, myDouble: double>>,
  myTime timestamp
);
  • Column Statistics: For each column, the writer records the value count and, depending on the type, simple statistics such as minimum, maximum, and sum. Readers use these for coarse-grained filtering.
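
To make the type tree concrete, the following Python sketch assigns column IDs to the Foobar table above using a pre-order traversal, which is how the ORC specification numbers columns. The TypeNode class and the child names "key" and "value" for the map are hypothetical labels chosen for illustration, not ORC identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class TypeNode:
    kind: str
    name: str = ""
    children: list = field(default_factory=list)


def assign_column_ids(root):
    """Walk the type tree in pre-order; return (column_id, path, kind) tuples."""
    out = []

    def visit(node, path):
        out.append((len(out), path, node.kind))  # ID = position in pre-order
        for child in node.children:
            visit(child, f"{path}.{child.name}" if path else child.name)

    visit(root, "")
    return out


# Type tree for: create table Foobar (myInt, myMap, myTime)
foobar = TypeNode("struct", children=[
    TypeNode("int", "myInt"),
    TypeNode("map", "myMap", [
        TypeNode("string", "key"),  # hypothetical label for the map key
        TypeNode("struct", "value", [  # hypothetical label for the map value
            TypeNode("string", "myString"),
            TypeNode("double", "myDouble"),
        ]),
    ]),
    TypeNode("timestamp", "myTime"),
])

for column_id, path, kind in assign_column_ids(foobar):
    print(column_id, path or "<root>", kind)
```

The root struct takes column 0, so the three top-level fields of Foobar expand to eight columns in total (IDs 0 through 7).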

Stripes

The body of an ORC file is split into stripes, which are large chunks of data (typically ~200MB) that contain:

  • Index Data
  • Row Data
  • Stripe Footer

The stripe footer holds the encoding of each column and a directory of the stripe's streams (each stream's kind, column, and length), along with any stripe-level encryption information.
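
The three stripe sections are laid out back to back, so their byte ranges follow directly from the StripeInformation fields. A small Python sketch, where the dataclass is a hypothetical mirror of the protobuf message and the numbers are made up:

```python
from dataclasses import dataclass


@dataclass
class StripeInformation:
    offset: int
    index_length: int
    data_length: int
    footer_length: int
    number_of_rows: int


def stripe_sections(stripe):
    """Return (start, end) byte ranges for the index, data, and footer sections."""
    index_start = stripe.offset
    data_start = index_start + stripe.index_length
    footer_start = data_start + stripe.data_length
    stripe_end = footer_start + stripe.footer_length
    return {
        "index": (index_start, data_start),
        "data": (data_start, footer_start),
        "footer": (footer_start, stripe_end),
    }


# Made-up stripe that starts right after the 3-byte "ORC" header.
sections = stripe_sections(StripeInformation(3, 100, 1000, 50, 20_000))
print(sections["footer"])  # (1103, 1153)
```

A reader would typically fetch the footer range first, since the stream directory it contains is needed to interpret the index and data ranges.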

Index Support

ORC supports three levels of indexing:

Level        | Location            | Data Content
-------------|---------------------|-------------------------------------------------------
File Level   | File Footer         | Column-level statistics for the entire file
Stripe Level | File Footer         | Column-level statistics for each stripe
Row Level    | Beginning of Stripe | Statistics for each row group and its start positions
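
The point of having three levels is that a reader can rule out data at the coarsest level first and only descend where a match is still possible. The Python sketch below illustrates this for a range predicate on a single column, assuming each level exposes only a (min, max) pair. The data layout (dicts with "stats" and "row_groups" keys) is hypothetical and chosen for brevity; real readers work on the protobuf statistics messages.

```python
def may_contain(stats, lo, hi):
    """True if the column range [min, max] overlaps the query range [lo, hi]."""
    col_min, col_max = stats
    return col_min <= hi and col_max >= lo


def plan_read(file_stats, stripes, lo, hi):
    """Return the (stripe_index, row_group_index) pairs that must be read."""
    if not may_contain(file_stats, lo, hi):  # file level: skip the whole file
        return []
    selected = []
    for s, stripe in enumerate(stripes):
        if not may_contain(stripe["stats"], lo, hi):  # stripe level
            continue
        for g, group_stats in enumerate(stripe["row_groups"]):  # row level
            if may_contain(group_stats, lo, hi):
                selected.append((s, g))
    return selected


# Made-up statistics for one column across two stripes.
file_stats = (0, 99)
stripes = [
    {"stats": (0, 49), "row_groups": [(0, 19), (20, 49)]},
    {"stats": (50, 99), "row_groups": [(50, 74), (75, 99)]},
]

print(plan_read(file_stats, stripes, 60, 70))  # [(1, 0)]
```

Min/max pruning can only prove that a range of rows does not match; the rows that survive still have to be read and filtered.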

Row Level Index

The row-level index consists of a row group index and a bloom filter index.

Row Group Index

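Each primitive column has a ROW_INDEX stream that holds one RowIndexEntry per row group. An entry records the positions needed to seek to the start of the row group, plus the column statistics for that group.

Default row group size: 10,000 rows (recorded as rowIndexStride in the footer)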

message RowIndex {
  repeated RowIndexEntry entry = 1;
}

message RowIndexEntry {
  repeated uint64 positions = 1 [packed=true];
  optional ColumnStatistics statistics = 2;
}
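
Because entry i of a RowIndex covers a fixed stride of rows, the row groups that survive statistics-based filtering translate directly into row ranges within the stripe. The following Python sketch covers only that arithmetic and merges adjacent groups into one range. The function name is hypothetical, and the use of the positions field to seek within each stream is deliberately left out.

```python
def row_ranges(selected_groups, stripe_rows, stride=10_000):
    """Map selected row-group indexes to merged [start, end) row ranges."""
    ranges = []
    for group in sorted(selected_groups):
        start = group * stride
        end = min(start + stride, stripe_rows)  # the last group may be short
        if ranges and ranges[-1][1] == start:  # adjacent: extend previous range
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


# A 35,000-row stripe has four row groups; suppose groups 0, 1, and 3 survive.
print(row_ranges([0, 1, 3], stripe_rows=35_000))  # [(0, 20000), (30000, 35000)]
```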

Bloom Filter Index

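Each column for which bloom filters are enabled has a BLOOM_FILTER stream, which records one bloom filter per row group. Readers use it to skip row groups that cannot contain the value in an equality predicate.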

message BloomFilter {
  optional uint32 numHashFunctions = 1;
  repeated fixed64 bitset = 2;
}
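
The two fields of the BloomFilter message map onto a very small data structure: a bit array packed into 64-bit words, and the number of bit positions set per value. The Python sketch below shows the idea only. Its hashing (SHA-256 split into two halves and combined by double hashing) is a stand-in assumption and is not the hash scheme ORC writers use, so it cannot read real ORC bloom filters.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hash_functions):
        self.num_bits = num_bits
        self.num_hash_functions = num_hash_functions
        self.bitset = [0] * ((num_bits + 63) // 64)  # bits packed in 64-bit words

    def _positions(self, value):
        # Stand-in hashing (assumption): derive k positions from two base hashes.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        for i in range(1, self.num_hash_functions + 1):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bitset[pos // 64] |= 1 << (pos % 64)

    def might_contain(self, value):
        """False means definitely absent; True means possibly present."""
        return all((self.bitset[pos // 64] >> (pos % 64)) & 1
                   for pos in self._positions(value))


bf = BloomFilter(num_bits=1024, num_hash_functions=3)
for name in (b"alice", b"bob"):
    bf.add(name)

print(bf.might_contain(b"alice"))
```

A negative answer lets the reader skip the row group entirely, while a positive answer may be a false positive, so the row group must still be read and checked.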

Data Access Path

  • Postscript -> Footer -> Stripe Information -> Stripe Footer -> Stripe Index -> Row Group -> Column
