This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Multi Cluster Load Balance #4929

Closed
hzy46 opened this issue Sep 22, 2020 · 2 comments · Fixed by #5082


hzy46 commented Sep 22, 2020

Motivation

A user may have access to multiple clusters, and those clusters can vary significantly in availability, cost, and congestion. For example, an expensive, powerful cluster with V100 GPUs may always be busy, while a cheaper K80 cluster sits mostly idle. It is reasonable to balance workloads (jobs) onto the most appropriate cluster to reduce total waiting time or cost. Since different customers have diverse needs, the policies that direct the load balancing must be highly customizable. (Note: only waiting jobs can be balanced; running jobs will not be touched.)

  • Scenario A: The user has an on-prem cluster (limited resources) and an on-cloud cluster (scalable). If the on-prem cluster becomes too congested (too many long-waiting jobs), some of those long-waiting jobs may be balanced to the on-cloud cluster to reduce the total waiting time.

  • Scenario A+ (generalization of A): The user has access to multiple clusters. If some clusters are too congested (too many long-waiting jobs), some of the long-waiting jobs can be balanced to other (freer) clusters to reduce the total waiting time.

Proposal - multi-cluster load-balance service

Design Goal: Provide a mechanism (service) that empowers users to take full advantage of multiple clusters automatically, thus meeting users' custom requirements (e.g. reducing completion time, waiting time, or cost).

The service provides the following functions to users:

  1. monitor the congestion of multiple clusters, and filter the balanceable jobs according to job-selection policies (e.g. jobs that have been waiting in a cluster longer than a given time)

  2. for each selected job, find the most appropriate target cluster according to cluster-selection policies (e.g. available and most powerful, or cheapest) and clone the job from the current cluster to the target cluster

  3. after cloning the job to the target cluster, either stop the original job immediately or let the old and new jobs compete (until one of them starts), according to post-action policies
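The three steps above can be sketched as the loop below. All names (`Job`, `Cluster`, the policy functions) are hypothetical stand-ins, not part of the proposal; a real service would query each cluster's REST API rather than in-memory lists. The sketch assumes a "waiting longer than a threshold" job-selection policy, a "cheapest cluster with free capacity" cluster-selection policy, and a "stop the original immediately" post-action policy:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    state: str            # e.g. "WAITING", "RUNNING", "STOPPED"
    waiting_minutes: int

@dataclass
class Cluster:
    name: str
    jobs: list
    free_gpus: int
    cost_per_gpu_hour: float

def select_jobs(cluster, max_wait_minutes=30):
    """Job-selection policy: WAITING jobs stuck longer than a threshold."""
    return [j for j in cluster.jobs
            if j.state == "WAITING" and j.waiting_minutes > max_wait_minutes]

def select_target(clusters, source):
    """Cluster-selection policy: cheapest other cluster with free capacity."""
    candidates = [c for c in clusters if c is not source and c.free_gpus > 0]
    return min(candidates, key=lambda c: c.cost_per_gpu_hour, default=None)

def balance(clusters):
    """One balancing pass over all clusters; returns (job, from, to) moves."""
    moves = []
    for source in clusters:
        for job in select_jobs(source):
            target = select_target(clusters, source)
            if target is not None:
                target.jobs.append(Job(job.name, "WAITING", 0))  # clone job
                job.state = "STOPPED"  # post-action: stop original at once
                moves.append((job.name, source.name, target.name))
    return moves
```

Each policy is an ordinary function here, so swapping in a different customer-specific policy only means passing a different callable; a "let old and new compete" post-action would simply skip the `job.state = "STOPPED"` line and watch both jobs until one starts.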

Notes:

  • All the policies mentioned above will be customizable by users (admins); there may also be other policies, such as

    • historical policies to set up historical (or stateful) constraints for a job (e.g. the maximum number of migrations per job, or the maximum number of clones allowed at the same time)
    • policies are defined when the service starts; dynamic changes are not supported at first. Per-job policy settings could be an option for further discussion
  • To support the above policies, the scheduling service may communicate with each cluster for

    • cluster status, such as available SKU types and the number of SKUs in each VC ...
    • job status, such as job state, submission time, job tags ...
    • (future) job execution statistics and estimations for advanced scheduling, e.g. estimated data throughput, estimated job execution time, ...
  • A job is balanceable if it is in the WAITING state and is defined by a template (plus cluster-specific context). The format of the template and context will be covered in a separate issue

  • The balancing attempts (history) will be traced in a database or via job tagging

  • This service is backend-only. A web portal to view and manage multiple clusters, nodes, and jobs will be covered in a separate issue
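As a sketch of how the historical policy and the job tagging mentioned in the notes might fit together, the snippet below (all names are hypothetical, not part of the actual design) limits how many times a single job may be migrated by counting tags recorded on previous balancing attempts:

```python
# Hypothetical historical (stateful) constraint: cap migrations per job,
# tracing each attempt as a tag on the job rather than in a database.
MAX_MIGRATIONS = 2
TAG_PREFIX = "balanced-to-"

def migration_count(tags):
    """Count prior balancing attempts recorded as tags on the job."""
    return sum(1 for t in tags if t.startswith(TAG_PREFIX))

def may_migrate(tags):
    """Historical policy: allow at most MAX_MIGRATIONS moves per job."""
    return migration_count(tags) < MAX_MIGRATIONS

def record_migration(tags, target_cluster):
    """Trace this attempt by appending a tag, e.g. 'balanced-to-cloud'."""
    return tags + [TAG_PREFIX + target_cluster]
```

Storing the history in tags keeps the constraint visible on the job itself; a database-backed variant would replace `record_migration` with an insert and `migration_count` with a query.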


@mydmdm mydmdm changed the title Multi Cluster Scheduling Proposal Multi Cluster Load Balance Sep 24, 2020

hzy46 commented Oct 26, 2020

Final Design:

  1. A section on the user profile page lets users register bounded clusters (the name will be changed to "Bounded Clusters" in the screenshot below).

(screenshot omitted)

  2. The user can click a transfer button on the job detail page to transfer the job.

  3. The transfer page looks like:

(screenshot omitted)

  4. After the job is transferred, a message is shown to the user:

(screenshot omitted)
