[Refactor] Overhaul of scheduler component #1169

Merged
merged 1 commit into from
Jul 17, 2024

Commits on Jul 16, 2024

  1. [Refactor] Overhaul of scheduler component

    This is a significant overhaul of NativeLink's scheduler
    component. The new scheduler design is intended to enable a
    distributed scheduling system.
    
    The new components & definitions:
    * AwaitedActionDb - An interface that is easier to work with when
      dealing with key-value storage systems.
    * MemoryAwaitedActionDb - An in-memory set of hashmaps & btrees used
      to satisfy the requirements of AwaitedActionDb interface.
    * ClientStateManager - A minimal interface required to satisfy the
      requirements of a client-facing scheduler.
    * WorkerStateManager - A minimal interface required to satisfy the
      requirements of a worker-facing scheduler.
    * MatchingEngineStateManager - A minimal interface required to
      satisfy an engine that matches queued jobs to workers.
    * SimpleSchedulerStateManager - An implementation that satisfies
      ClientStateManager, WorkerStateManager & MatchingEngineStateManager
      with all the logic of the previous "SimpleScheduler" moved
      behind each interface.
    * ApiWorkerScheduler - A component that handles all knowledge about
      worker state, implements the WorkerScheduler interface, and
      translates its calls into the WorkerStateManager interface.
    * SimpleScheduler - Translates calls of the ClientScheduler
      interface into ClientStateManager & MatchingEngineStateManager.
      This component currently always forwards calls to
      SimpleSchedulerStateManager and then to MemoryAwaitedActionDb.
      Future changes will make these inner components dynamic via config.
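
The layering described in the list above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this PR: the trait and struct names come from the commit message, but every method name and signature, the `Stage` enum, and the `u64` operation ID are assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for an operation identifier.
type OperationId = u64;

/// Simplified operation lifecycle (assumed for this sketch).
#[derive(Clone, Debug, PartialEq)]
enum Stage {
    Queued,
    Executing { worker: String },
    Completed,
}

/// Storage-facing interface; the method set is an assumption.
trait AwaitedActionDb {
    fn insert_queued(&mut self, id: OperationId, priority: i32);
    fn stage(&self, id: OperationId) -> Option<Stage>;
    fn set_stage(&mut self, id: OperationId, stage: Stage);
    fn first_queued(&self) -> Option<OperationId>;
}

/// In-memory backend: a hashmap for lookups, a btree for queue order.
#[derive(Default)]
struct MemoryAwaitedActionDb {
    entries: HashMap<OperationId, (i32, Stage)>,
    queue: BTreeSet<(Reverse<i32>, OperationId)>,
}

impl AwaitedActionDb for MemoryAwaitedActionDb {
    fn insert_queued(&mut self, id: OperationId, priority: i32) {
        self.entries.insert(id, (priority, Stage::Queued));
        self.queue.insert((Reverse(priority), id));
    }
    fn stage(&self, id: OperationId) -> Option<Stage> {
        self.entries.get(&id).map(|entry| entry.1.clone())
    }
    fn set_stage(&mut self, id: OperationId, stage: Stage) {
        if let Some(entry) = self.entries.get_mut(&id) {
            // Keep the queue index in sync with the stored stage.
            let key = (Reverse(entry.0), id);
            if stage == Stage::Queued {
                self.queue.insert(key);
            } else {
                self.queue.remove(&key);
            }
            entry.1 = stage;
        }
    }
    fn first_queued(&self) -> Option<OperationId> {
        // Highest priority first; ties go to the lowest (oldest) id.
        self.queue.iter().next().map(|entry| entry.1)
    }
}

/// Client-facing view: submit work and observe its stage.
trait ClientStateManager {
    fn add_operation(&mut self, priority: i32) -> OperationId;
    fn stage_of(&self, id: OperationId) -> Option<Stage>;
}

/// Worker-facing view: report results.
trait WorkerStateManager {
    fn complete_operation(&mut self, id: OperationId);
}

/// Matching-engine view: pick queued work and assign it to a worker.
trait MatchingEngineStateManager {
    fn next_queued(&self) -> Option<OperationId>;
    fn assign_operation(&mut self, id: OperationId, worker: &str);
}

/// One implementation behind all three interfaces, generic over storage.
struct SimpleSchedulerStateManager<Db: AwaitedActionDb> {
    db: Db,
    next_id: OperationId,
}

impl<Db: AwaitedActionDb> SimpleSchedulerStateManager<Db> {
    fn new(db: Db) -> Self {
        Self { db, next_id: 0 }
    }
}

impl<Db: AwaitedActionDb> ClientStateManager for SimpleSchedulerStateManager<Db> {
    fn add_operation(&mut self, priority: i32) -> OperationId {
        self.next_id += 1;
        self.db.insert_queued(self.next_id, priority);
        self.next_id
    }
    fn stage_of(&self, id: OperationId) -> Option<Stage> {
        self.db.stage(id)
    }
}

impl<Db: AwaitedActionDb> WorkerStateManager for SimpleSchedulerStateManager<Db> {
    fn complete_operation(&mut self, id: OperationId) {
        self.db.set_stage(id, Stage::Completed);
    }
}

impl<Db: AwaitedActionDb> MatchingEngineStateManager for SimpleSchedulerStateManager<Db> {
    fn next_queued(&self) -> Option<OperationId> {
        self.db.first_queued()
    }
    fn assign_operation(&mut self, id: OperationId, worker: &str) {
        self.db.set_stage(id, Stage::Executing { worker: worker.to_string() });
    }
}
```

Because the state manager is generic over the storage trait, a different `AwaitedActionDb` backend could in principle be substituted without touching the three role-specific interfaces, which is the property the commit message points to for future distributed schedulers.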
    
    In addition we have hardened the interactions of different kinds of
    IDs in NativeLink. Most relevant is the separation & introduction of:
    * OperationId - Represents an individual operation being requested
      to be executed that is unique across all of time.
    * ClientOperationId - An ID issued to the client when the client
      requests to execute a job. This ID will point to an OperationId
      internally, but the client is never exposed to the OperationId.
    * AwaitedActionHashKey - A key used to uniquely identify an action,
      but that is not unique across time. This means that multiple
      OperationIds may have executed this key at different points in
      time. The key is used as a "fingerprint" of an operation that the
      client wants to execute, and the scheduler may decide to join the
      stream onto an existing operation if this key has a hit.
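
The relationship between the three ID kinds above can be illustrated with a small sketch. It is an assumption-laden example rather than NativeLink's implementation: the `IdRegistry` type, its methods, and the `u64` representation of each ID are hypothetical, and only the three ID names are taken from the commit message.

```rust
use std::collections::HashMap;

/// Unique for all time; never shown to clients.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct OperationId(u64);

/// Handed to a client for each execute request.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ClientOperationId(u64);

/// Fingerprint of an action; may be reused by later operations.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct AwaitedActionHashKey(u64);

/// Hypothetical bookkeeping that ties the three ID kinds together.
#[derive(Default)]
struct IdRegistry {
    counter: u64,
    client_to_operation: HashMap<ClientOperationId, OperationId>,
    active_by_key: HashMap<AwaitedActionHashKey, OperationId>,
}

impl IdRegistry {
    fn fresh(&mut self) -> u64 {
        self.counter += 1;
        self.counter
    }

    /// A client asks to execute the action identified by `key`.
    /// On a hit the client joins the existing operation; otherwise a new
    /// OperationId is minted. The client always gets its own ID.
    fn subscribe(&mut self, key: AwaitedActionHashKey) -> ClientOperationId {
        let operation_id = match self.active_by_key.get(&key).copied() {
            Some(existing) => existing,
            None => {
                let created = OperationId(self.fresh());
                self.active_by_key.insert(key, created);
                created
            }
        };
        let client_id = ClientOperationId(self.fresh());
        self.client_to_operation.insert(client_id, operation_id);
        client_id
    }

    /// The operation for `key` finished; later requests start a new one.
    fn finish(&mut self, key: AwaitedActionHashKey) {
        self.active_by_key.remove(&key);
    }

    /// Internal lookup from the client-visible ID to the real operation.
    fn resolve(&self, client_id: ClientOperationId) -> Option<OperationId> {
        self.client_to_operation.get(&client_id).copied()
    }
}
```

In this sketch, two clients that subscribe with the same key while an operation is active receive distinct `ClientOperationId` values that resolve to one `OperationId`; after `finish`, the same key yields a new `OperationId`, matching the "not unique across time" property described above.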
    
    Overall, these changes pave the way for more robust scheduler
    implementations. Most notably, distributed scheduler implementations
    will be easier to implement and will be introduced in follow-up PRs.
    
    This commit was developed on a side branch and consisted of the
    following commits with corresponding code reviews:
    54ed73c
        Add scheduler metrics back (#1171)
    50fdbd7
        fix formatting (#1170)
    8926236
        Merge in main and format (#1168)
    9c2c7b9
        key as u64 (#1166)
    0192051
        Cleanup unused code and comments (#1165)
    080df5d
        Add versioning to AwaitedAction (#1163)
    73c19c4
        Fix sequence bug in new memory store manager (#1162)
    6e50d2c
        New AwaitedActionDb implementation (#1157)
    18db991
        Fix test on running_actions_manager_test (#1141)
    e50ef3c
        Rename workers to `worker_scheduler`
    1fdd505
        SimpleScheduler now uses config for action pruning (#1137)
    eaaa872
        Change encoding for items that are cachable (#1136)
    d647056
        Errors are now properly handles in subscription (#1135)
    7c3e730
        Restructure files to be more appropriate (#1131)
    5e98ec9
        ClientAwaitedAction now uses a channel to notify drops happened (#1130)
    52beaf9
        Cleanup unused structs (#1128)
    e86fe08
        Remove all uses of salt and put under ActionUniqueQualifier (#1126)
    3b86036
        Remove all need for workers to know about ActionId (#1125)
    5482d7f
        Fix bazel build and test on dev (#1123)
    ba52c7f
        Implement get_action_info to all ActionStateResult impls (#1118)
    2fa4fee
        Remove MatchingEngineStateManager::remove_operation (#1119)
    34dea06
        Remove unused proto field (#1117)
    3070a40
        Remove metrics from new scheduler (#1116)
    e95adfc
        StateManager will now cleanup actions on client disconnect (#1107)
    6f8c001
        Fix worker execution issues (#1114)
    d353c30
        rename set_priority to upgrade_priority (#1112)
    0d93671
        StateManager can now be notified of noone listeneing (#1093)
    cfc0cf6
        ActionScheduler will now use ActionListener instead of tokio::watch (#1091)
    d70d31d
        QA fixes for scheduler-v2 (#1092)
    f2cea0c
        [Refactor] Complete rewrite of SimpleScheduler
    34d93b7
        [Refactor] Move worker notification in SimpleScheduler under Workers
    b9d9702
        [Refactor] Moves worker logic back to SimpleScheduler
    7a16e2e
        [Refactor] Move scheduler state behind mute
    allada committed Jul 16, 2024
    Commit 2954bfa