DFS-Haystack
October 6, 2021 · 1284 words · 7 min · DFS · Paper Reading · Distributed System
The primary project in my group is a distributed file system (DFS) that provides POSIX file system semantics. The approach to handle “lots of small files” (LOSF) is inspired by Haystack, which is specifically designed for small files. I decided to read through the Haystack paper and take some notes as a learning exercise.
These notes are not an in-depth analysis of specific details but rather a record of my thoughts on the problem and design approach.
Introduction
Haystack is a storage system designed by Facebook for small files. In a traditional DFS, file addressing typically relies on caching metadata to reduce disk interaction and speed up lookups. Each file carries its own set of metadata, so the total volume of metadata grows with the number of files. Under high concurrency, this metadata must be cached in memory to keep disk I/O low.
With a large number of small files, the volume of metadata becomes significant. Considering the maintenance overhead of in-memory metadata, this approach becomes impractical. Therefore, Haystack was developed specifically for small files, with the core idea of aggregating multiple small files into a larger one to reduce metadata.
Background
The “small files” in the paper specifically refer to image data.
Facebook, as a social media company, deals heavily with image uploads and retrieval. As the business scaled, it became necessary to have a dedicated service to handle the massive, high-concurrency requests for image reads and writes.
In the social networking context, this type of data is characterized as written once, read often, never modified, and rarely deleted. Based on this, Facebook developed Haystack to support its image sharing services.
Design
Traditional Design
The paper describes two historical designs: CDN-based and NAS-based solutions.
CDN-based Solution
The core of this solution is to use CDN (Content Delivery Network) to cache hot image data, reducing network transmission.
This approach optimizes access to hot images but has some issues. First, CDN capacity is expensive and limited. Second, image sharing generates many requests for less popular images; since CDNs are generally used to serve static data and are often pre-warmed before an event, these long-tail images never enter the CDN, and serving them stays slow. A CDN alone is therefore unsuitable as the image cache service.
NAS-based Solution
This was Facebook’s initial design and is essentially a variation of the CDN-based solution.
They introduced NAS (Network Attached Storage) for horizontal storage expansion, incorporating file system semantics, but disk I/O remained an issue. Similar to local files, reading uncached data requires at least three disk I/O operations:
- Read directory metadata into memory
- Load the inode into memory
- Read the content of the file
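To make the cost concrete, here is a rough sketch of that uncached lookup path on a local file system; the directory path and file name are placeholders of my own, and the point is only to show where the three disk reads come from.

```python
import os

def read_photo_traditional(directory: str, name: str) -> bytes:
    """Roughly the three disk I/Os counted for an uncached read."""
    # 1. Read directory metadata to translate the name into an inode.
    entries = os.listdir(directory)          # directory block read
    if name not in entries:
        raise FileNotFoundError(name)
    path = os.path.join(directory, name)
    # 2. Load the inode (size, permissions, block pointers) into memory.
    os.stat(path)                            # inode read
    # 3. Read the file contents themselves.
    with open(path, "rb") as f:              # data block read(s)
        return f.read()
```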
PhotoStore was used as a caching layer to store some metadata like file handles to speed up the addressing process.
The NAS-based design did not solve the fundamental issue of excessive metadata that could not be fully cached. When the number of files reaches a certain threshold, disk I/O becomes inevitable.
The fundamental issue is the one-to-one relationship between files and addressing metadata, causing the volume of metadata to change with the number of files.
Thus, the key to optimization is changing the one-to-one relationship between files and metadata, reducing the frequency of disk I/O during addressing.
Haystack-based Solution
The core idea of Haystack is to aggregate multiple small files into a larger one, maintaining a single piece of metadata for the large file. This changes the mapping between metadata and files, making it feasible to keep all metadata in memory.
Metadata is maintained only for the aggregated file, while the positions of the small files inside it are tracked separately.
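As a minimal illustration of that idea (my own sketch, not code from the paper): many photos are appended to one large file, and only a compact in-memory table maps each photo key to an offset and size inside it.

```python
from typing import Dict, Tuple

class AggregatedFile:
    """Toy version of 'many small files packed into one large file'."""

    def __init__(self, path: str):
        self.path = path
        # key -> (offset, size): one small in-memory entry per photo,
        # instead of a full set of file system metadata per photo.
        self.offsets: Dict[str, Tuple[int, int]] = {}
        open(path, "ab").close()  # ensure the backing file exists

    def put(self, key: str, data: bytes) -> None:
        with open(self.path, "ab") as f:
            offset = f.tell()     # append-only: new data goes at the end
            f.write(data)
        self.offsets[key] = (offset, len(data))

    def get(self, key: str) -> bytes:
        offset, size = self.offsets[key]
        with open(self.path, "rb") as f:
            f.seek(offset)        # one disk read, no per-file metadata lookup
            return f.read(size)
```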
Implementation
Haystack mainly consists of three components: Haystack Directory, Haystack Cache, and Haystack Store.
File Mapping and Storage
File data is ultimately stored on logical volumes, each of which corresponds to multiple physical volumes across machines.
Users first access the Directory to obtain access paths and then use the URL generated by the Directory to access other components to retrieve the required data.
Components
Haystack Directory
This is Haystack’s access layer, responsible for file addressing and access control.
Read and write requests first go through the Directory. For read requests, the Directory generates an access URL of the form http://{cdn}/{cache}/{machine id}/{logical volume, photo}. For write requests, it provides a logical volume to write into.
The Directory has four main functions:
- Load balancing for read and write requests.
- Determine request access paths (e.g., CDN or direct access) and generate access URLs.
- Metadata and mapping management, e.g., the logical-to-physical volume mapping.
- Logical volume read/write management, where volumes can be read-only or write-enabled.
This design reflects the data characteristics ("write once, read often") and improves concurrency.
The Directory stores metadata such as file-to-volume mappings, logical-to-physical mappings, and volume attributes (size, owner, etc.). It relies on a distributed key-value store and a cache service to ensure low latency and high availability.
In short, the Directory acts as a proxy that manages metadata mappings and access control.
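A hedged sketch of how such a Directory could behave, using the URL shape quoted above; the volume IDs, machine names, and the random load-balancing policy are placeholders of my own, not the paper's implementation.

```python
import random
from typing import Dict, List

class Directory:
    def __init__(self,
                 logical_to_physical: Dict[int, List[str]],
                 writable_volumes: List[int]):
        # logical volume id -> machines holding its physical replicas
        self.logical_to_physical = logical_to_physical
        # only some logical volumes accept writes; the rest are read-only
        self.writable_volumes = writable_volumes

    def read_url(self, logical_volume: int, photo_id: int,
                 cdn: str = "cdn.example.com",
                 cache: str = "cache01") -> str:
        # Pick one replica machine, spreading reads across replicas.
        machine = random.choice(self.logical_to_physical[logical_volume])
        # Shape from the paper: http://{cdn}/{cache}/{machine id}/{logical volume, photo}
        return f"http://{cdn}/{cache}/{machine}/{logical_volume},{photo_id}"

    def pick_write_volume(self) -> int:
        # Writes only go to write-enabled logical volumes.
        return random.choice(self.writable_volumes)
```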
Haystack Cache
The Cache layer optimizes addressing and image retrieval. The core design is the Cache Rule, which determines what data should be cached and how to handle cache misses.
Images are cached if they meet these criteria:
- The request is directly from a user, not from a CDN.
- The photo is retrieved from a write-enabled store machine.
If a cache miss occurs, the Cache fetches the image from the Store and pushes it to both the user and the CDN.
The caching policy is based on typical access patterns.
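Those two criteria translate into a simple predicate; the function and parameter names below are mine, not the paper's.

```python
def should_cache(request_from_cdn: bool, store_is_writable: bool) -> bool:
    """Cache a photo only if the request came directly from a user (not a CDN)
    and the photo was fetched from a write-enabled Store machine."""
    return (not request_from_cdn) and store_is_writable
```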
Haystack Store
The Store layer is responsible for data storage operations.
The addressing abstraction is: filename + offset => logical volume id + offset => data.
Multiple physical volumes back one logical volume. Within the Store, each small file is encapsulated as a Needle, and a physical volume is essentially one large file containing many Needles.
Store data is accessed at the Needle level. To speed up addressing, an in-memory mapping is kept: key / alternate key => the needle's flags, offset, and other attributes.
These maps are persisted in Index Files on disk to provide a checkpoint for quick metadata recovery after a crash.
Each volume maintains its own in-memory mapping and index file.
When updating the in-memory mapping (e.g., adding or modifying a file), the index file is updated asynchronously. Deleted files are only marked as deleted, not removed from the index file.
The index serves as a lookup aid. Needles without an index can still be addressed, making the asynchronous update and index retention strategy feasible.
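A sketch of what the per-volume in-memory mapping and its index-file checkpoint might look like; the field names and the pickle-based persistence are simplifications of my own, not Haystack's actual on-disk format.

```python
import pickle
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class NeedleMeta:
    flags: int      # e.g. a deleted bit
    offset: int     # byte offset of the needle inside the volume file
    size: int

class VolumeIndex:
    """In-memory map per volume: (key, alternate key) -> needle metadata."""

    def __init__(self):
        self.needles: Dict[Tuple[int, int], NeedleMeta] = {}

    def checkpoint(self, index_path: str) -> None:
        # Persist the map asynchronously so it can be reloaded quickly
        # after a crash instead of scanning the whole volume file.
        with open(index_path, "wb") as f:
            pickle.dump(self.needles, f)

    @classmethod
    def recover(cls, index_path: str) -> "VolumeIndex":
        idx = cls()
        with open(index_path, "rb") as f:
            idx.needles = pickle.load(f)
        # Needles appended after the last checkpoint would be found by
        # scanning the tail of the volume file (omitted here).
        return idx
```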
Workloads
Read
(Logical Volume ID, key, alternate key, cookies) => photo
For a read request, Store queries the in-memory mapping for the corresponding Needle. If found, it fetches the data from the volume and verifies the cookie and integrity; otherwise, it returns an error.
Cookies are randomly generated values embedded in the URL; a mismatch rejects the request, which prevents attackers from fetching photos by guessing URLs.
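Putting the read path together, under a simplified needle layout of my own (an assumed 8-byte cookie prefix before the payload, which is not the paper's exact record format):

```python
from typing import BinaryIO, Dict, NamedTuple, Tuple

class Needle(NamedTuple):
    offset: int
    size: int
    deleted: bool

Index = Dict[Tuple[int, int], Needle]

def read_photo(index: Index, volume: BinaryIO,
               key: int, alternate_key: int, cookie: bytes) -> bytes:
    """Sketch of the Store read: one in-memory lookup, then one disk read."""
    meta = index.get((key, alternate_key))
    if meta is None or meta.deleted:
        raise FileNotFoundError("needle not found or deleted")
    volume.seek(meta.offset)
    record = volume.read(meta.size)
    stored_cookie, data = record[:8], record[8:]   # assumed 8-byte cookie prefix
    if stored_cookie != cookie:
        raise PermissionError("cookie mismatch")   # blocks guessed URLs
    return data
```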
Write
(Logical Volume ID, key, alternate key, cookies, data) => result
Haystack only supports appending data rather than overwriting. When a write request arrives, the Store appends the data as a new Needle to the volume file and updates its in-memory mapping (the index file is updated asynchronously). If the photo already exists, the Directory updates its metadata to point to the latest version.
Older volumes are frozen as read-only, and new writes are appended, so a larger offset indicates a newer version.
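A sketch of the append-only write under the same toy layout; a real needle also carries checksums, flags, and padding that are omitted here.

```python
from typing import BinaryIO, Dict, NamedTuple, Tuple

class Needle(NamedTuple):
    offset: int
    size: int
    deleted: bool

Index = Dict[Tuple[int, int], Needle]

def write_photo(index: Index, volume: BinaryIO,
                key: int, alternate_key: int, cookie: bytes, data: bytes) -> None:
    """Append a new needle and update the in-memory map; never overwrite in place."""
    volume.seek(0, 2)                       # seek to end: writes only append
    offset = volume.tell()
    record = cookie + data                  # simplified needle: cookie then payload
    volume.write(record)
    # The map now points at the newest copy; for an updated photo the larger
    # offset is the newer version, and the old needle becomes garbage.
    index[(key, alternate_key)] = Needle(offset, len(record), deleted=False)
```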
Delete
Deletion is handled using Mark Delete + Compact GC.
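Mark Delete plus Compact GC can be sketched as follows, reusing the same toy Needle record; a real compaction pass also skips superseded duplicates and swaps in the new volume file, which is omitted here.

```python
from typing import BinaryIO, Dict, NamedTuple, Tuple

class Needle(NamedTuple):
    offset: int
    size: int
    deleted: bool

Index = Dict[Tuple[int, int], Needle]

def delete_photo(index: Index, key: int, alternate_key: int) -> None:
    # Mark delete: flip the flag in memory (and asynchronously in the index file);
    # the needle's bytes stay in the volume until compaction.
    index[(key, alternate_key)] = index[(key, alternate_key)]._replace(deleted=True)

def compact(index: Index, old_volume: BinaryIO, new_volume: BinaryIO) -> Index:
    # Compact GC: copy live needles into a fresh volume file, dropping deleted ones.
    new_index: Index = {}
    for ident, meta in index.items():
        if meta.deleted:
            continue                        # reclaim the space of deleted needles
        old_volume.seek(meta.offset)
        record = old_volume.read(meta.size)
        new_index[ident] = meta._replace(offset=new_volume.tell())
        new_volume.write(record)
    return new_index
```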
Fault Tolerance
Store ensures fault tolerance through monitoring + hot backup. Directory and Cache use Raft-like consistency algorithms for data replication and availability.
Optimization
The main optimizations include: Compaction, Batch Load, and In-Memory processing.
Summary
- Key optimization techniques include asynchronous processing, batch operations, and caching.
- Identifying the core issues, such as metadata management burden for a large number of small files, is crucial.