Apache-Iceberg Quick Investigation

October 5, 2022 · Lake House, Storage, Big Data

  • A table format for large-scale analysis of datasets.
  • A specification for organizing data files and metadata files.
  • A schema semantic abstraction between storage and computation.
  • Developed and open-sourced by Netflix to enhance scalability, reliability, and usability.


Issues encountered when migrating HIVE to the cloud:

  • Hive's dependency on list and rename semantics makes it impossible to replace HDFS with a cheaper object storage service (OSS).
  • Scalability issues: Hive stores schema information centrally in the metastore, which can become a performance bottleneck.
  • Unsafe operations, unfriendliness to the cost-based optimizer (CBO), etc.


  • Supports safe and efficient schema and partition evolution, a self-defined schema, and hidden partitioning.
    • Iceberg abstracts its own schema, which is not tied to any computation engine's schema, and partitioning is maintained at the schema level. Partitions and sort orders are defined through transform functions, such as date(timestamp).
  • Supports object storage with minimal dependency on FS semantics.
  • ACID semantics, with parallel reads and serialized writes:
    • Separation of read and write snapshots.
    • Optimistic concurrency for conflicting parallel writes: a writer that loses a conflict retries until its write is committed.
  • Snapshot support:
    • Data rollback and time travel.
    • Supports snapshot expiration (by default, data files are not deleted, but customizable deletion behavior is available) (related API doc).
    • Incremental reading can be achieved by comparing snapshot differences.
  • Query optimization-friendly: predicate pushdown, data file statistics. Currently, compaction is not supported, but invalid files can be deleted during snapshot expiration (deleteWith).
  • High abstraction level, which makes it easy to modify, optimize, and extend. The catalog, read/write paths, file formats, and storage dependencies are all pluggable. Iceberg’s design goal is to define a standard, open, and general data organization format that hides differences in the underlying data storage formats and provides a unified API through which different engines connect.
  • Others: file-level encryption and decryption.
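The hidden partitioning mentioned in the list above can be illustrated with a minimal Python sketch. It is not Iceberg code: `day_transform`, `write_rows`, and `scan` are hypothetical names, and the in-memory dictionary stands in for partitioned data files. The point is that the partition value is derived from a regular column by a transform, so writers never supply a partition column and readers filter on the original column only.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Hypothetical transform: derive the partition value from a timestamp."""
    return ts.date()


def write_rows(rows):
    """Group rows by the transformed value; no explicit partition column is needed."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[day_transform(row["event_time"])].append(row)
    return dict(partitions)


def scan(partitions, start: datetime, end: datetime):
    """Filter on event_time in [start, end); skip partitions that cannot match."""
    first_day, last_day = day_transform(start), day_transform(end)
    result = []
    for day, rows in partitions.items():
        if day < first_day or day > last_day:
            continue  # partition pruned without looking at its rows
        result.extend(r for r in rows if start <= r["event_time"] < end)
    return result


rows = [
    {"id": 1, "event_time": datetime(2022, 10, 4, 23, 30)},
    {"id": 2, "event_time": datetime(2022, 10, 5, 8, 0)},
    {"id": 3, "event_time": datetime(2022, 10, 6, 9, 15)},
]
partitions = write_rows(rows)
hits = scan(partitions, datetime(2022, 10, 5), datetime(2022, 10, 6))
print([r["id"] for r in hits])  # [2]
```

Because the query only mentions `event_time`, the partition layout could change (for example from days to hours) without rewriting the query, which is the idea behind seamless partition changes.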


  • Community support for OSS, Flink, Spark, and Presto.
  • Integration with other components:
    • Integration with lower storage layers: relies on only three semantics (in-place write, seekable reads, and deletes) and supports AliOSS (PR #3689).
    • Integration with other file formats: High abstraction level, currently supports Avro, Parquet, ORC.
    • Catalog: Customizable (Doc: Custom Catalog Implementation), currently supports JDBC, Hive Metastore, Hadoop, etc.
    • Integration with computation layer: Provides native Java & Python APIs with a high level of abstraction, supporting most computation engines.
  • Open and neutral community, which allows contributors to gain influence through their contributions.

Table Specification

Specification for organizing data files and metadata files.


Case: Spark + Iceberg + Local FS

Iceberg supports Parquet, Avro, ORC file formats.

# Storage organization
├── data
│   ├── 00000-1-ccff6767-12cc-481c-93fc-db9f1a57438c-00001.parquet
│   └── 00001-2-6c1e5a0b-89fe-4e77-b90a-1773a7fbbcc8-00001.parquet
└── metadata
    ├── 2c1dc0e8-1843-4cb9-9c55-ae43f800bf3f-m0.avro // manifest file
    ├── snap-8512048775051875497-1-2c1dc0e8-1843-4cb9-9c55-ae43f800bf3f.avro // manifest list file
    ├── v1.metadata.json // metadata file
    ├── v2.metadata.json
    └── version-hint.text // catalog


Data files in columnar format: Parquet, ORC.

There are three types of Data Files: data file, position delete file, equality delete file.

Manifest File

Indexes data files, including statistics and partition information.
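To make the role of these statistics concrete, here is a minimal Python sketch of file pruning from manifest entries. The field names `record_count`, `lower_bounds`, and `upper_bounds` follow the Iceberg spec, but keying the bounds by column name (the spec uses column IDs) and the helpers `may_contain` and `plan_files` are simplifying assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataFileEntry:
    """One manifest entry describing a data file (simplified)."""
    path: str
    partition: str
    record_count: int
    lower_bounds: dict = field(default_factory=dict)  # column -> min value
    upper_bounds: dict = field(default_factory=dict)  # column -> max value


def may_contain(entry: DataFileEntry, column: str, value) -> bool:
    """Return False only when the stats prove the value cannot be in the file."""
    lo = entry.lower_bounds.get(column)
    hi = entry.upper_bounds.get(column)
    if lo is None or hi is None:
        return True  # no stats for this column: the file must be read
    return lo <= value <= hi


def plan_files(manifest, column: str, value):
    """Pick the data files an equality predicate `column = value` has to read."""
    return [e.path for e in manifest if may_contain(e, column, value)]


manifest = [
    DataFileEntry("data/a.parquet", "2022-10-04", 10, {"id": 1}, {"id": 10}),
    DataFileEntry("data/b.parquet", "2022-10-05", 10, {"id": 11}, {"id": 20}),
]
print(plan_files(manifest, "id", 15))  # ['data/b.parquet']
```

The pruning decision is made from metadata alone, so no data file is opened and no directory is listed, which is what reduces the dependency on file system list operations.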




  • A Snapshot represents the state of a Table at a specific point in time, saved via a Manifest List File.
  • A new Snapshot is generated every time a data change is made to the Table.
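The following Python sketch shows how a log of immutable snapshots makes time travel and incremental reads possible. It is a toy model: `Table`, `commit`, `as_of`, and `added_between` are hypothetical names, and each snapshot holds its set of data files directly instead of going through manifest lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: frozenset  # full set of live data files in this snapshot


class Table:
    def __init__(self):
        self.snapshots = []  # append-only, ordered by commit time

    def commit(self, timestamp_ms, added=(), removed=()):
        """Every data change produces a new snapshot; old ones stay readable."""
        current = self.snapshots[-1].data_files if self.snapshots else frozenset()
        files = (current | frozenset(added)) - frozenset(removed)
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms, files)
        self.snapshots.append(snap)
        return snap

    def as_of(self, timestamp_ms):
        """Time travel: latest snapshot committed at or before the timestamp."""
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not candidates:
            raise ValueError("no snapshot at or before the given timestamp")
        return candidates[-1]

    def added_between(self, from_id, to_id):
        """Incremental read: files present in `to_id` but not in `from_id`."""
        by_id = {s.snapshot_id: s for s in self.snapshots}
        return by_id[to_id].data_files - by_id[from_id].data_files


table = Table()
table.commit(100, added=["a.parquet"])
table.commit(200, added=["b.parquet"])
table.commit(300, added=["c.parquet"], removed=["a.parquet"])
print(sorted(table.as_of(250).data_files))  # ['a.parquet', 'b.parquet']
print(sorted(table.added_between(1, 3)))    # ['b.parquet', 'c.parquet']
```

Rollback in this model is just pointing the table back at an earlier snapshot, since no snapshot is modified after it is committed.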

Manifest List File

  • Contains information about all Manifest files in a Snapshot, as well as partition stats and data file count.
  • One Snapshot corresponds to one Manifest List File, and each commit generates a new manifest list file.
  • Optimistic concurrency: when concurrent Snapshot commits conflict, the later commit retries until it succeeds.
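The optimistic concurrency described above can be sketched as a compare-and-swap loop. This is a simplified Python model under stated assumptions: `Catalog`, `compare_and_swap`, and `commit_with_retry` are hypothetical names, and a lock-protected version counter stands in for whatever atomic operation a real catalog provides.

```python
import threading


class Catalog:
    """Holds the current table version; the swap is the only atomic step."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current_version = 0
        self.snapshots = []  # committed snapshot payloads

    def compare_and_swap(self, expected_version, snapshot):
        """Commit only if nobody else committed since `expected_version`."""
        with self._lock:
            if self.current_version != expected_version:
                return False  # conflict: another writer won the race
            self.snapshots.append(snapshot)
            self.current_version += 1
            return True


def commit_with_retry(catalog, make_snapshot, max_attempts=10):
    """Build a snapshot against the current base and retry on conflict."""
    for attempt in range(1, max_attempts + 1):
        base = catalog.current_version
        snapshot = make_snapshot(base)  # redo the work on top of the new base
        if catalog.compare_and_swap(base, snapshot):
            return attempt
    raise RuntimeError("commit failed after retries")


catalog = Catalog()
writers = [
    threading.Thread(
        target=commit_with_retry,
        args=(catalog, lambda base, i=i: f"writer-{i}-on-v{base}"),
    )
    for i in range(4)
]
for w in writers:
    w.start()
for w in writers:
    w.join()
print(catalog.current_version)  # 4
```

Readers never take the lock in this model: they read whichever version was current when they started, which is why reads can run in parallel while writes are serialized.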

Each manifest list stores metadata about manifests, including partition stats and data file counts.


Metadata File

Tracks the state of the table. When the state changes, a new metadata file is generated and replaces the previous one, ensuring atomicity.

The table metadata file tracks the table schema, partitioning config, custom properties, and snapshots of the table contents.





The catalog records the location of the latest metadata file (version-hint.text in the storage organization above).



  • ACID semantics guarantee: Atomic table state changes + snapshot-based reads and writes.
  • Flexible partition management: hidden partition, seamless partition changes.
  • Supports incremental reads: incremental read of each change using snapshots.
  • Multi-version data: beneficial for data rollback.
  • Safe schema and partition changes with no side effects.

Data Types

Data files in different formats define different types.

  • Nested Types:

    • struct: A tuple of typed values.
    • list: A collection of values with an element type.
    • map: A collection of key-value pairs with a key type and a value type.
  • Primitive Types:

Primitive type | Description | Requirements
boolean | True or false |
int | 32-bit signed integers | Can promote to long
long | 64-bit signed integers |
float | 32-bit IEEE 754 floating point | Can promote to double
double | 64-bit IEEE 754 floating point |
decimal(P,S) | Fixed-point decimal; precision P, scale S | Scale is fixed [1], precision must be 38 or less
date | Calendar date without timezone or time |
time | Time of day without date, timezone | Microsecond precision [2]
timestamp | Timestamp without timezone | Microsecond precision [2]
timestamptz | Timestamp with timezone | Stored as UTC [2]
string | Arbitrary-length character sequences | Encoded with UTF-8 [3]
uuid | Universally unique identifiers | Should use 16-byte fixed
fixed(L) | Fixed-length byte array of length L |
binary | Arbitrary-length byte array |
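The promotion rules in the Requirements column can be expressed as a small check. This Python sketch is an illustration, not an Iceberg API: `can_promote` is a hypothetical helper, and the rule that a decimal's precision may only be widened is taken from the Iceberg spec's schema evolution rules rather than from the table above.

```python
import re

# Promotions stated in the table: int -> long, float -> double.
_PROMOTIONS = {("int", "long"), ("float", "double")}
_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def can_promote(src: str, dst: str) -> bool:
    """Return True if a column of type `src` may be changed to type `dst`."""
    if src == dst:
        return True
    if (src, dst) in _PROMOTIONS:
        return True
    m_src, m_dst = _DECIMAL.fullmatch(src), _DECIMAL.fullmatch(dst)
    if m_src and m_dst:
        p_src, s_src = map(int, m_src.groups())
        p_dst, s_dst = map(int, m_dst.groups())
        # Scale is fixed; precision may only widen and must stay <= 38.
        return s_src == s_dst and p_src <= p_dst <= 38
    return False


print(can_promote("int", "long"))                    # True
print(can_promote("long", "int"))                    # False
print(can_promote("decimal(9,2)", "decimal(18,2)"))  # True
print(can_promote("decimal(9,2)", "decimal(18,3)"))  # False
```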

Read & Write Paths

select: catalog -> manifest list file -> manifest file -> data file -> data group.

insert: the reverse order (data group -> data file -> manifest file -> manifest list file -> catalog).
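The select path can be sketched as a walk over the metadata tree. In this Python toy model, the `catalog` and `files` dictionaries and the `plan_scan` helper are hypothetical stand-ins for real files, and the metadata file described in the Metadata File section is included as the step between the catalog and the manifest list.

```python
# Catalog: table name -> latest metadata file.
catalog = {"db.events": "metadata/v2.metadata.json"}

# Stand-in for file contents, keyed by path (all paths are made up).
files = {
    "metadata/v2.metadata.json": {
        "current-snapshot-id": 2,
        "snapshots": {1: "metadata/snap-1.avro", 2: "metadata/snap-2.avro"},
    },
    # Manifest list: one per snapshot, listing manifest files.
    "metadata/snap-1.avro": ["metadata/m0.avro"],
    "metadata/snap-2.avro": ["metadata/m0.avro", "metadata/m1.avro"],
    # Manifest file: lists data files.
    "metadata/m0.avro": ["data/00000.parquet"],
    "metadata/m1.avro": ["data/00001.parquet"],
}


def plan_scan(table, snapshot_id=None):
    """catalog -> metadata file -> manifest list -> manifests -> data files."""
    metadata = files[catalog[table]]
    if snapshot_id is None:
        snapshot_id = metadata["current-snapshot-id"]
    manifest_list = files[metadata["snapshots"][snapshot_id]]
    data_files = []
    for manifest in manifest_list:
        data_files.extend(files[manifest])
    return data_files


print(plan_scan("db.events"))     # ['data/00000.parquet', 'data/00001.parquet']
print(plan_scan("db.events", 1))  # ['data/00000.parquet']
```

Note that snapshot 2 reuses the manifest `m0.avro` from snapshot 1, so a commit only has to write manifests for the files it changed.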

update: delete & insert, written as data file + position delete file + equality delete file.

Position delete files are used within a transaction to handle the case where the same row is inserted and deleted repeatedly in that transaction.

delete: row-level delete.
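How the delete files above are applied at read time can be sketched as follows. The Iceberg spec calls the two kinds position deletes (file path plus row position) and equality deletes (column values). This is a simplified Python model: `read_with_deletes` and the data layout are hypothetical, and the sequence number rule (an equality delete only applies to data files with a strictly lower sequence number) follows the Iceberg spec.

```python
def read_with_deletes(data_files, position_deletes, equality_deletes):
    """Merge data files with delete files and return the live rows.

    data_files:       list of {"path", "seq", "rows"}
    position_deletes: set of (path, row_position)
    equality_deletes: list of (seq, {column: value})
    """
    live = []
    for f in data_files:
        for pos, row in enumerate(f["rows"]):
            if (f["path"], pos) in position_deletes:
                continue  # removed by a position delete
            removed = any(
                f["seq"] < seq and all(row.get(k) == v for k, v in cond.items())
                for seq, cond in equality_deletes
            )
            if not removed:
                live.append(row)
    return live


data_files = [
    {"path": "f1", "seq": 1, "rows": [
        {"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"},
    ]},
    # Update of id=2 in commit 2: new row written, old row equality-deleted.
    {"path": "f2", "seq": 2, "rows": [{"id": 2, "v": "B"}]},
]
equality_deletes = [(2, {"id": 2})]
position_deletes = {("f1", 2)}  # row-level delete of id=3 by position

print(read_with_deletes(data_files, position_deletes, equality_deletes))
# [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'B'}]
```

In this model the equality delete from commit 2 does not touch the row written in the same commit, so a row that is inserted and then deleted again within one transaction has to be removed by position instead.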

