DFS-Haystack

October 6, 2021 · 1284 words · 7 min · DFS · Paper Reading · Distributed System

The primary project in my group is a distributed file system (DFS) that provides POSIX file system semantics. The approach to handle “lots of small files” (LOSF) is inspired by Haystack, which is specifically designed for small files. I decided to read through the Haystack paper and take some notes as a learning exercise.

These notes are not an in-depth analysis of specific details but rather a record of my thoughts on the problem and design approach.

Introduction

Haystack is a storage system designed by Facebook for small files. In a traditional DFS, each file carries its own addressing metadata, so the total volume of metadata grows with the number of files. To keep lookups fast under high concurrency, that metadata is cached in memory, reducing disk I/O during addressing.

With a very large number of small files, the metadata itself becomes huge, and keeping it all in memory becomes impractical. Haystack was therefore designed specifically for small files, with the core idea of aggregating many small files into one larger file to reduce metadata.

Background

The “small files” in the paper specifically refer to image data.

Facebook, as a social media company, deals heavily with image uploads and retrieval. As the business scaled, it became necessary to have a dedicated service to handle the massive, high-concurrency requests for image reads and writes.

In the social networking context, this type of data is characterized as written once, read often, never modified, and rarely deleted. Based on this, Facebook developed Haystack to support image sharing services.

Design

Traditional Design

The paper describes two historical designs: CDN-based and NAS-based solutions.

CDN-based Solution

The core of this solution is to use CDN (Content Delivery Network) to cache hot image data, reducing network transmission.

This approach speeds up access to hot images, but it has two problems. First, CDNs are expensive and have limited capacity; they are built to serve static content and are often pre-warmed before events, which makes them a poor fit as a general-purpose image cache. Second, photo sharing produces a long tail of less popular images that never make it into the CDN, so those requests stay slow.

NAS-based Solution

This was Facebook’s initial design and is essentially a variation of the CDN-based solution.

They introduced NAS (Network Attached Storage) for horizontal storage expansion while keeping file system semantics, but disk I/O remained a problem. As with a local file system, reading an uncached file requires at least three disk I/O operations:

  • Read directory metadata into memory
  • Load the inode into memory
  • Read the content of the file

A Photo Store layer acted as a cache for metadata such as file handles, shortening the addressing path.

The NAS-based design did not solve the fundamental issue of excessive metadata that could not be fully cached. When the number of files reaches a certain threshold, disk I/O becomes inevitable.

The fundamental issue is the one-to-one relationship between files and addressing metadata, which makes the volume of metadata grow with the number of files.

Thus, the key optimization is to break that one-to-one relationship so that addressing rarely has to touch the disk.

Haystack-based Solution

The core idea of Haystack is to aggregate multiple small files into a larger one, maintaining a single piece of metadata for the large file. This changes the mapping between metadata and files, making it feasible to keep all metadata in memory.

Filesystem metadata is kept only for the aggregated file, while the positions of the small files inside it are tracked separately in memory.
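
As a rough illustration of this idea (not Haystack's actual on-disk format), here is a minimal Python sketch: many small files are appended to one large file, and a single in-memory dictionary records where each of them lives, so no per-file filesystem metadata is needed. The class name and file path are made up for the example.

```python
import os

class PackedFile:
    """Toy sketch: many small files packed into one large append-only file."""

    def __init__(self, path):
        self.path = path
        self.index = {}            # name -> (offset, size), kept entirely in memory
        open(path, "ab").close()   # make sure the backing file exists

    def put(self, name, data):
        offset = os.path.getsize(self.path)   # append position
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[name] = (offset, len(data))

    def get(self, name):
        offset, size = self.index[name]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)

# One seek + read per lookup, no matter how many small files are stored.
store = PackedFile("/tmp/packed.dat")
store.put("photo-1.jpg", b"...jpeg bytes...")
print(store.get("photo-1.jpg"))
```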

Implementation

Haystack mainly consists of three components: Haystack Directory, Haystack Cache, and Haystack Store.

File Mapping and Storage

File data is ultimately stored on logical volumes, each of which corresponds to multiple physical volumes across machines.

Users first access the Directory to obtain access paths and then use the URL generated by the Directory to access other components to retrieve the required data.

Components

Haystack Directory

This is Haystack’s access layer, responsible for file addressing and access control.

Read and write requests first go through the Directory. For a read request, the Directory generates an access URL of the form http://{CDN}/{Cache}/{Machine id}/{Logical volume, Photo}. For a write request, it chooses a write-enabled logical volume to write into.

The Directory has four main functions:

  1. Load balancing of read and write requests.
  2. Determining the access path for a request (e.g., via the CDN or directly to the Cache) and generating the access URL.
  3. Metadata and mapping management, e.g., the mapping from logical volumes to physical volumes.
  4. Logical volume read/write management: each volume is either read-only or write-enabled.

This design follows the data characteristics ("written once, read often"): keeping older volumes read-only while directing new writes to write-enabled volumes improves concurrency.

The Directory stores metadata such as file-to-volume mappings, logical-to-physical mappings, and volume attributes (size, owner, etc.). It relies on a distributed key-value store and a cache service to ensure low latency and high availability.

In short: request proxying, metadata mapping, and access control.
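
A hedged sketch of these responsibilities follows; the volume layout, hostnames, and method names are illustrative assumptions, not the paper's API.

```python
import random

class Directory:
    """Toy Directory: volume metadata plus URL construction for reads and writes."""

    def __init__(self):
        # logical volume id -> {"writable": bool, "machines": [store machine ids]}
        self.volumes = {
            1: {"writable": False, "machines": ["store-1", "store-2"]},
            2: {"writable": True,  "machines": ["store-3", "store-4"]},
        }
        self.photo_to_volume = {}  # photo key -> logical volume id

    def pick_write_volume(self):
        """Load-balance writes across the write-enabled logical volumes."""
        writable = [v for v, meta in self.volumes.items() if meta["writable"]]
        return random.choice(writable)

    def read_url(self, photo_key, use_cdn=True):
        """Build a path shaped like http://{CDN}/{Cache}/{Machine id}/{Logical volume, Photo}."""
        vol = self.photo_to_volume[photo_key]
        machine = random.choice(self.volumes[vol]["machines"])  # spread reads across replicas
        host = "cdn.example.com" if use_cdn else "cache.example.com"
        return f"http://{host}/cache/{machine}/{vol},{photo_key}"

d = Directory()
d.photo_to_volume["42"] = 1
print(d.pick_write_volume())  # a write-enabled logical volume id
print(d.read_url("42"))       # URL the browser (or CDN) will follow
```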

Haystack Cache

The Cache layer optimizes addressing and image retrieval. The core design is the Cache Rule, which determines what data should be cached and how to handle cache misses.

Images are cached if they meet these criteria:

  1. The request is directly from a user, not from a CDN.
  2. The photo is retrieved from a write-enabled store machine.

If a cache miss occurs, the Cache fetches the image from the Store and returns it to the requester, whether that is the user's browser or the CDN.

The caching policy follows the typical access pattern: photos are read most heavily soon after they are uploaded (i.e., while they still sit on a write-enabled machine), and a request that has already missed in the CDN is unlikely to benefit from yet another cache layer.
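
A minimal sketch of that cache rule (the class shape and the store.read helper are assumptions for illustration):

```python
class HaystackCache:
    """Toy cache applying the two caching criteria on a miss."""

    def __init__(self, store):
        self.store = store   # any object with a read(photo_key) method
        self.cached = {}     # photo key -> photo bytes

    def get(self, photo_key, from_cdn, volume_is_writable):
        if photo_key in self.cached:
            return self.cached[photo_key]

        photo = self.store.read(photo_key)   # cache miss: go to the Store

        # Cache only if the request came straight from a browser (not a CDN)
        # and the photo lives on a write-enabled Store machine, i.e., it was
        # uploaded recently and is likely to be read again soon.
        if not from_cdn and volume_is_writable:
            self.cached[photo_key] = photo
        return photo
```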

Haystack Store

The Store layer is responsible for data storage operations.

The addressing abstraction is: photo id => logical volume id (from the Directory) => offset within the physical volume file (from the Store's in-memory mapping) => data.

A logical volume is made up of multiple physical volumes on different machines. Within the Store, each small file is encapsulated as a Needle stored inside a physical volume file.

Needle Abstraction

A Needle encapsulates a single small file together with bookkeeping fields (key, alternate key, flags, cookie, size, checksum); needles are laid out one after another within a physical volume file.

Store data is accessed at the Needle level. To speed up addressing, each volume keeps an in-memory mapping: (key, alternate key) => the needle's flags, offset, and size.

These maps are persisted in Index Files on disk to provide a checkpoint for quick metadata recovery after a crash.
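
A toy sketch of what one entry of that in-memory mapping might look like (the field names are simplifications, not the paper's exact needle layout):

```python
from dataclasses import dataclass

@dataclass
class NeedleInfo:
    flags: int    # e.g., a "deleted" bit
    offset: int   # byte offset of the needle inside the volume file
    size: int     # size of the needle's data

# (key, alternate key) -> NeedleInfo: one small record per photo, cheap enough
# to keep entirely in memory even for a very large number of needles.
needle_map = {}
needle_map[(1234, 0)] = NeedleInfo(flags=0, offset=4096, size=48213)
```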

Index File

Volume Mapping

Each volume maintains its own in-memory mapping and index file.

When the in-memory mapping is updated (e.g., a new file is appended), the index file is updated asynchronously. Deletions are not written to the index file at all: a deleted needle is only flagged as deleted, and a stale index record is detected and corrected the next time the needle is read.

The index is only a lookup aid: a needle that has no index record can still be found by scanning the volume file, which is what makes the asynchronous updates and the tolerance of stale index records feasible.
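
A rough sketch of how recovery could work under these rules (JSON stands in for the real binary index format, and scan_tail is a hypothetical helper that scans the volume file past the last indexed offset):

```python
import json

def recover_mapping(index_path, scan_tail):
    """Rebuild the in-memory mapping from the index checkpoint, then pick up any
    needles that were appended but never indexed before the crash."""
    with open(index_path) as f:
        entries = json.load(f)   # [{"key": ..., "offset": ..., "size": ...}, ...]
    mapping = {e["key"]: (e["offset"], e["size"]) for e in entries}
    last = max((e["offset"] + e["size"] for e in entries), default=0)
    mapping.update(scan_tail(last))   # orphaned needles found by a sequential scan
    return mapping
```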

Workloads

Read

(Logical Volume ID, key, alternate key, cookies) => photo

For a read request, Store queries the in-memory mapping for the corresponding Needle. If found, it fetches the data from the volume and verifies the cookie and integrity; otherwise, it returns an error.

Cookies are random values embedded in each needle and in its URL; a mismatched cookie causes the request to be rejected, which prevents attackers from guessing valid image URLs.
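
A hedged sketch of that read path, using a plain dict for the in-memory record and a file for the volume (for brevity the cookie is kept in the in-memory record here, whereas Haystack stores it in the needle itself):

```python
def read_photo(volume_path, needle_map, key, alt_key, cookie):
    """Look up the needle in memory, verify the cookie, read it from the volume file."""
    info = needle_map.get((key, alt_key))
    if info is None or info["deleted"]:
        raise FileNotFoundError("no such needle")
    if info["cookie"] != cookie:
        raise PermissionError("cookie mismatch")   # guards against guessed URLs
    with open(volume_path, "rb") as f:
        f.seek(info["offset"])
        return f.read(info["size"])
```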

Write

(Logical Volume ID, key, alternate key, cookies, data) => result

Haystack only supports appending data, never overwriting in place. When a write request arrives, the Store synchronously appends the needle to the volume file and updates its in-memory mapping (the index file is updated asynchronously). If the photo already existed and the new write lands in a different logical volume, the Directory updates its metadata so that reads go to the latest version.

If the new write lands in the same volume, the append-only layout means the needle at the larger offset is the newer version; older volumes that fill up are frozen as read-only and new writes go to write-enabled volumes.
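
A sketch of that append-only write path under the same toy representation (field names and the existence of the volume file are assumptions):

```python
import os

def write_photo(volume_path, needle_map, key, alt_key, cookie, data):
    """Append-only write: the needle goes at the end of the volume file and the
    in-memory mapping is pointed at the new, larger offset."""
    open(volume_path, "ab").close()         # toy setup: ensure the volume file exists
    offset = os.path.getsize(volume_path)   # append position
    with open(volume_path, "ab") as f:
        f.write(data)
    needle_map[(key, alt_key)] = {
        "offset": offset, "size": len(data), "cookie": cookie, "deleted": False,
    }
    # The index file would be updated asynchronously; that part is omitted here.
```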

Delete

Deletion is handled lazily with mark-delete plus compaction: the needle is flagged as deleted in the in-memory mapping and in the volume file, and the space is reclaimed later by a compaction pass (a form of GC that copies live needles into a new volume file).
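
Continuing the same toy layout, a sketch of mark-delete and of a compaction pass that copies only live needles into a fresh volume file:

```python
def delete_photo(needle_map, key, alt_key):
    """Logical delete: flip a flag; the bytes stay on disk until compaction."""
    needle_map[(key, alt_key)]["deleted"] = True

def compact(volume_path, needle_map, new_path):
    """Copy live needles into a new volume file and rebuild the mapping."""
    new_map = {}
    with open(volume_path, "rb") as src, open(new_path, "wb") as dst:
        for needle_key, info in needle_map.items():
            if info["deleted"]:
                continue                     # reclaim space of deleted needles
            src.seek(info["offset"])
            data = src.read(info["size"])
            new_map[needle_key] = dict(info, offset=dst.tell())
            dst.write(data)
    return new_map                           # swapped in for the old volume atomically in practice
```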

Fault Tolerance

Store ensures fault tolerance through background monitoring plus replication: each logical volume is backed by multiple physical volumes on different machines, and a failed machine can be rebuilt from its replicas. Directory and Cache rely on replicated underlying storage for availability.

Optimization

The main optimizations include compaction (reclaiming the space of deleted and duplicate needles), batch uploads, and trimming the per-needle in-memory metadata footprint.

Summary

  • Key abstraction optimizations include asynchronous processing, batch operations, and caching.
  • Identifying the core issue, here the metadata management burden of a very large number of small files, is crucial.

References

D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. Finding a needle in Haystack: Facebook's photo storage. OSDI '10.