Borg: Large-scale Cluster Management at Google with Borg

February 19, 2024 · 557 words · 3 min · Borg K8s Cluster Management

Borg is a cluster management system, similar to the closed-source version of Kubernetes (k8s).

It achieves high utilization through admission control, efficient task packing, overcommitment, machine sharing, and process-level performance isolation.
It provides runtime features to reduce failure recovery time for high-availability applications and scheduling policies that reduce the probability of correlated failures.
It offers a declarative job description language, DNS integration, real-time job monitoring, and tools for analyzing and simulating system behavior, simplifying usage for end-users.

The paper aims to introduce the system design and share the experiences Google has gained behind it. This blog mainly focuses on system design, specifically the services Borg offers in terms of SLA, its abstraction of workloads, resources, and scheduling.

System Abstraction

Borg manages two primary workloads: long-running services and batch jobs, corresponding to two types of jobs (prod/non-prod). A job consists of several tasks, and different jobs have different priorities.

In terms of deployment architecture, a Borg cluster consists of several cells, each containing multiple machines.

For task scheduling, all physical or logical units on machines are treated as resources, including CPU, memory, IO, etc.

System Architecture

Borg uses a master-slave architecture, consisting of a BorgMaster and several Borglet nodes. The scheduler is an independent service.

BorgMaster is a logical node responsible for interacting with both external components and Borglets, as well as maintaining the internal state of the cluster. It uses Paxos to achieve multi-replication and high availability.

Borglet is the Borg proxy on each machine in the cell. It is responsible for starting/stopping tasks, managing node physical resources, and reporting status.

Scheduler is the service responsible for task scheduling. It uses the state recorded by the master to asynchronously handle task scheduling and informs the master for a secondary check.

Resource Scheduling

The scheduler is a key service in Borg. The quality of the scheduling algorithm directly affects resource utilization and is closely related to cost efficiency.

Basic Process

The scheduling algorithm has two parts:

Feasibility Check: Finds a set of machines capable of running the task.
Scoring: Selects the most suitable machine from that set.

During the feasibility check, the scheduler finds a set of machines that meet task constraints and have enough available resources. Available resources include those already allocated to lower-priority tasks that can be preempted.

During the scoring phase, the scheduler determines the suitability of each feasible machine. Scoring considers user-specific preferences but primarily depends on built-in criteria, such as minimizing the number and priority of preempted tasks, selecting machines that already have the task package, distributing tasks across different power and failure domains, and optimizing packing quality (mixing high- and low-priority tasks on a single machine to allow high-priority tasks to expand during load spikes).

The scheduler uses a cached copy of the cell state and performs the following steps repeatedly:

Retrieves state changes (including assigned and pending jobs) from the elected master and updates its local copy.
Runs a round of scheduling to assign tasks and sends assignment information to the master.
The master accepts and applies the assignments, but if they are unsuitable (e.g., based on outdated state), it waits for the scheduler’s next round.

Additional Aspects

The paper also discusses how to provide oversubscription and handle performance contention, though these are not the focus of this blog. Readers can refer to the original paper for more details.

References

https://www.cnblogs.com/hellojamest/p/16526159.html

Prometheus--TSDB Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications