StateManager will now cleanup actions on client disconnect #1107

Merged

Conversation

allada
Member

@allada allada commented Jul 8, 2024

StateManager will now properly remove items from the maps once a client has been disconnected for a set amount of time. Currently these values are hard-coded, but it will be easy to transition them to config variables once we design that out.
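
For readers unfamiliar with this kind of cleanup, here is a minimal sketch of time-based eviction. It does not use NativeLink's `EvictingMap`; the `ExpiringMap` type, its methods, the string keys and the 60-second timeout are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the maps mentioned in the description:
/// an entry is dropped once it has not been touched within `ttl`.
struct ExpiringMap<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V> ExpiringMap<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Inserts a value and records `now` as its last-seen time.
    fn insert(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    /// Refreshes the last-seen time; returns false if the key is unknown.
    fn touch(&mut self, key: &str, now: Instant) -> bool {
        match self.entries.get_mut(key) {
            Some((last_seen, _)) => {
                *last_seen = now;
                true
            }
            None => false,
        }
    }

    /// Removes every entry last seen at least `ttl` ago and returns
    /// how many entries were removed.
    fn evict_expired(&mut self, now: Instant) -> usize {
        let before = self.entries.len();
        let ttl = self.ttl;
        self.entries
            .retain(|_, (last_seen, _)| now.duration_since(*last_seen) < ttl);
        before - self.entries.len()
    }

    fn contains(&self, key: &str) -> bool {
        self.entries.contains_key(key)
    }
}

fn main() {
    let start = Instant::now();
    // The 60 second timeout is an assumed value, not the one in the PR.
    let mut map = ExpiringMap::new(Duration::from_secs(60));
    map.insert("client-a", 1u32, start);
    map.insert("client-b", 2u32, start);

    // client-a keeps sending keepalives; client-b disconnects.
    map.touch("client-a", start + Duration::from_secs(50));
    let removed = map.evict_expired(start + Duration::from_secs(70));
    println!("removed {removed}, client-a kept: {}", map.contains("client-a"));
}
```

Passing `now` into each call keeps the sketch deterministic, which is also why the example never sleeps.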


@allada allada force-pushed the cleanup-actions-on-client-disconnect branch from 7ed4501 to b60e10f Compare July 8, 2024 05:27
Member Author

@allada allada left a comment

+@zbirenbaum

Reviewable status: 0 of 1 LGTMs obtained, and 1 discussions need to be resolved (waiting on @zbirenbaum)


nativelink-scheduler/src/scheduler_state/workers.rs line 165 at r1 (raw file):

                .err_tip(|| "in update_operation on SimpleScheduler::update_action");
            if let Err(err) = update_operation_res {
                let result = Result::<(), _>::Err(err.clone())

note: A worker can now send updates about an action that the state manager does not know about. This is ok, since the Workers struct now owns the platform properties and resource management of the node instead of state manager.

Member Author

@allada allada left a comment

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: vale, and 1 discussions need to be resolved (waiting on @zbirenbaum)


nativelink-scheduler/src/simple_scheduler.rs line 57 at r2 (raw file):

/// Default time in seconds before a client is evicted.
// TODO!(make this a config and documented)

fyi: My system is to use todo! anywhere that we need to clean up code before merging into main.

Member

@adam-singer adam-singer left a comment

:lgtm:

Reviewed 8 of 12 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained, and 8 discussions need to be resolved (waiting on @zbirenbaum)


nativelink-scheduler/src/scheduler_state/state_manager.rs line 48 at r2 (raw file):

/// How often the owning database will have the AwaitedAction touched
/// to keep it from being evicted.
const KEEPALIVE_TIMER: Duration = Duration::from_secs(10);

nit: It's not really a timer. Maybe KEEPALIVE_GRACE_PERIOD or KEEPALIVE_DURATION?


nativelink-scheduler/src/scheduler_state/state_manager.rs line 97 at r2 (raw file):

    // ClientAwaitedAction is dropped, but will never actually be
    // None except during the drop.
    client_operation_id: Option<ClientOperationId>,

Would it be overly pedantic to build a custom type for holding a client_operation_id that allows it to be set once and dropped? That way we are not implying state around Option in corner conditions. Also, Option doesn't encode whether it was set multiple times.

pub struct SetVar<T> {
    value: Option<T>,
    set_count: u8,
    set_count_max: u8,
    // Tracks the drop separately so "not set yet" and "already dropped"
    // can be told apart.
    dropped: bool,
}

impl<T> SetVar<T> {
    pub fn new(set_count_max: u8) -> Self {
        Self {
            value: None,
            set_count: 0,
            set_count_max,
            dropped: false,
        }
    }

    /// Stores `value` unless the set limit was reached or the value was dropped.
    pub fn set(&mut self, value: T) -> Result<(), &'static str> {
        if self.dropped {
            Err("Value has already been dropped")
        } else if self.set_count >= self.set_count_max {
            Err("Value can only be set once")
        } else {
            self.value = Some(value);
            self.set_count += 1;
            Ok(())
        }
    }

    /// Clears the value. After this, no further `set` call succeeds.
    pub fn drop_value(&mut self) -> Result<(), &'static str> {
        if self.dropped {
            Err("Value has already been dropped")
        } else if self.set_count == 0 {
            Err("Value has not been set yet")
        } else {
            self.value = None;
            self.dropped = true;
            Ok(())
        }
    }

    pub fn get(&self) -> Option<&T> {
        self.value.as_ref()
    }

    /// True once the value has been set at least once, even if it was dropped later.
    pub fn is_set(&self) -> bool {
        self.set_count >= 1
    }
}

// Example:
let mut client_operation_id = SetVar::<ClientOperationId>::new(1);
client_operation_id
    .set(a_client_operation_id)
    .expect("first set should succeed");

nativelink-scheduler/src/scheduler_state/state_manager.rs line 140 at r2 (raw file):

}

/// Trit to be able to use the [`EvictingMap`] with [`ClientAwaitedAction`].

nit: Trit?


nativelink-scheduler/src/scheduler_state/state_manager.rs line 179 at r2 (raw file):

#[allow(clippy::mutable_key_type)]
impl AwaitedActionDb {
    /// Touches the client operation id to keep it from being evicted.

nit: /// Refreshes/Updates the time to live of the [`ClientOperationId`] in the [`EvictingMap`] by touching the key. ?


nativelink-scheduler/src/scheduler_state/state_manager.rs line 209 at r2 (raw file):

        if !did_remove {
            event!(
                Level::ERROR,

nit: Leave a simple comment about downgrading this event Level if it proves to be noisy and not an error?


nativelink-scheduler/src/scheduler_state/state_manager.rs line 254 at r2 (raw file):

        let sort_info = awaited_action.get_sort_info();
        let sort_key = sort_info.get_previous_sort_key();
        let btree = self.get_sort_map_for_state(&awaited_action.get_current_state().stage);

nit: sorted_awaited_action_btree


nativelink-scheduler/src/scheduler_state/state_manager.rs line 523 at r2 (raw file):

                        client_operation_to_awaited_action: EvictingMap::new(
                            config,
                            SystemTime::now(),

nit: For better control in tests, would it make sense to pass SystemTime::now() into StateManager::new(), since it could be a fixed or mocked value later? That puts the burden on the caller, though not having control of time under test generally tends to be a pita.
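
A minimal sketch of the idea, assuming a heavily simplified manager. `TimedManager`, its fields and `is_expired` are hypothetical names, not the real `StateManager` API; the only point is that the constructor receives the time instead of reading the clock itself.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical, heavily simplified manager: it only remembers the
/// anchor time it was built with, which the caller supplies.
struct TimedManager {
    anchor_time: SystemTime,
    max_age: Duration,
}

impl TimedManager {
    /// The caller passes `now` instead of the constructor calling
    /// `SystemTime::now()` itself.
    fn new(now: SystemTime, max_age: Duration) -> Self {
        Self { anchor_time: now, max_age }
    }

    /// True once `now` is at least `max_age` past the anchor time.
    fn is_expired(&self, now: SystemTime) -> bool {
        match now.duration_since(self.anchor_time) {
            Ok(elapsed) => elapsed >= self.max_age,
            // `now` is earlier than the anchor (the clock went backwards).
            Err(_) => false,
        }
    }
}

fn main() {
    // Production callers would pass the real clock.
    let manager = TimedManager::new(SystemTime::now(), Duration::from_secs(60));
    println!("expired right away: {}", manager.is_expired(SystemTime::now()));

    // Tests can pass a fixed time and advance it by hand.
    let fixed = UNIX_EPOCH + Duration::from_secs(1_000);
    let manager = TimedManager::new(fixed, Duration::from_secs(60));
    assert!(!manager.is_expired(fixed + Duration::from_secs(59)));
    assert!(manager.is_expired(fixed + Duration::from_secs(60)));
}
```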

Member Author

@allada allada left a comment

Reviewable status: 0 of 1 LGTMs obtained, and 3 discussions need to be resolved (waiting on @zbirenbaum)


nativelink-scheduler/src/scheduler_state/state_manager.rs line 48 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: It's not really a timer. Maybe KEEPALIVE_GRACE_PERIOD or KEEPALIVE_DURATION?

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 97 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

Would it be overly pedantic to build a custom type for holding a client_operation_id that allows it to be set once and dropped? That way we are not implying state around Option in corner conditions. Also, Option doesn't encode whether it was set multiple times.

Hmmm, the way this is normally handled is to create an Inner struct, but that feels like overkill for such a small struct.

Other libraries do it worse; they use: https://doc.rust-lang.org/std/mem/fn.forget.html
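
For context, a minimal sketch of the `Option` approach under discussion: the field is `Some` for the whole lifetime of the value and is only taken inside `drop`. `ClientHandle`, its fields and the channel used to report the drop are assumptions for illustration, not the code in this PR.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical client handle. `id` is `Some` for the whole lifetime of
/// the value and only becomes `None` inside `drop`.
struct ClientHandle {
    id: Option<String>,
    drop_tx: Sender<String>,
}

impl ClientHandle {
    fn new(id: String, drop_tx: Sender<String>) -> Self {
        Self { id: Some(id), drop_tx }
    }

    fn id(&self) -> &str {
        self.id.as_deref().expect("id is only None during drop")
    }
}

impl Drop for ClientHandle {
    fn drop(&mut self) {
        // `drop` only gets `&mut self`, so an owned id cannot be moved
        // out of a plain `String` field; `Option::take` makes it possible.
        if let Some(id) = self.id.take() {
            // Ignore send errors: the receiver may already be gone.
            let _ = self.drop_tx.send(id);
        }
    }
}

fn main() {
    let (tx, rx) = channel();
    let handle = ClientHandle::new("client-op-1".to_string(), tx);
    println!("live handle: {}", handle.id());
    drop(handle);
    println!("after drop: {:?}", rx.try_recv());
}
```

The cost of this approach is the `expect` in the accessor, which is the implied state the review comment above objects to.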


nativelink-scheduler/src/scheduler_state/state_manager.rs line 140 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: Trit?

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 179 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: /// Refreshes/Updates the time to live of the [`ClientOperationId`] in the [`EvictingMap`] by touching the key. ?

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 209 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: Leave a simple comment about downgrading this event Level if it proves to be noisy and not an error?

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 254 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: sorted_awaited_action_btree

Done.


nativelink-scheduler/src/scheduler_state/state_manager.rs line 523 at r2 (raw file):

Previously, adam-singer (Adam Singer) wrote…

nit: For better control in tests, would it make sense to pass SystemTime::now() into StateManager::new(), since it could be a fixed or mocked value later? That puts the burden on the caller, though not having control of time under test generally tends to be a pita.

Normally I'd 100% agree. However, in this case, because of all the Weak calls, it would require a lot of changes and no tests would use it currently.

I started to do it, but realized a lot of structs would need to be templated. Keep in mind in the long run I plan on deprecating AwaitedActionDb and moving to MemoryStore, but that's further down the line.

@allada allada force-pushed the cleanup-actions-on-client-disconnect branch from b60e10f to a964c30 Compare July 8, 2024 23:14
Member Author

@allada allada left a comment

-@zbirenbaum

Reviewable status: :shipit: complete! 1 of 1 LGTMs obtained

@allada allada merged commit e95adfc into TraceMachina:dev Jul 8, 2024
3 checks passed
@allada allada deleted the cleanup-actions-on-client-disconnect branch July 8, 2024 23:16
allada added a commit to allada/nativelink-fork that referenced this pull request Jul 16, 2024
This is an extremely significant overhaul to NativeLink's scheduler
component. The new scheduler design is meant to enable a distributed
scheduling system.

The new components & definitions:
* AwaitedActionDb - An interface that is easier to work with when
  dealing with key-value storage systems.
* MemoryAwaitedActionDb - An in-memory set of hashmaps & btrees used
  to satisfy the requirements of AwaitedActionDb interface.
* ClientStateManager - A minimal interface required to satisfy the
  requirements of a client-facing scheduler.
* WorkerStateManager - A minimal interface required to satisfy the
  requirements of a worker-facing scheduler.
* MatchingEngineStateManager - A minimal interface required to
  satisfy the engine that matches queued jobs to workers.
* SimpleSchedulerStateManager - An implementation that satisfies
  ClientStateManager, WorkerStateManager & MatchingEngineStateManager
  with all the logic of the previous "SimpleScheduler" moved
  behind each interface.
* ApiWorkerScheduler - A component that handles all knowledge about
  worker state, implements the WorkerScheduler interface and
  translates its calls into the WorkerStateManager interface.
* SimpleScheduler - Translates calls of the ClientScheduler
  interface into ClientStateManager & MatchingEngineStateManager.
  This component currently always forwards calls to
  SimpleSchedulerStateManager and then to MemoryAwaitedActionDb.
  Future changes will make these inner components dynamic via config.
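
To illustrate the split into narrow interfaces, here is a rough sketch. Only the trait and struct names come from the list above; every method name, signature and field is hypothetical, and the real interfaces are not shown here.

```rust
use std::collections::{HashMap, VecDeque};

// All method names below are hypothetical; only the trait and struct
// names come from the commit message.

/// Client-facing view: submit work.
trait ClientStateManager {
    fn add_action(&mut self, action: &str) -> u64;
}

/// Worker-facing view: report that an operation finished.
trait WorkerStateManager {
    fn complete_operation(&mut self, operation_id: u64) -> bool;
}

/// Matching-engine view: hand the next queued operation to a worker.
trait MatchingEngineStateManager {
    fn assign_next(&mut self, worker: &str) -> Option<u64>;
}

/// One in-memory type that satisfies all three views.
#[derive(Default)]
struct SimpleSchedulerStateManager {
    next_id: u64,
    queued: VecDeque<(u64, String)>,
    running: HashMap<u64, String>,
}

impl ClientStateManager for SimpleSchedulerStateManager {
    fn add_action(&mut self, action: &str) -> u64 {
        self.next_id += 1;
        self.queued.push_back((self.next_id, action.to_string()));
        self.next_id
    }
}

impl MatchingEngineStateManager for SimpleSchedulerStateManager {
    fn assign_next(&mut self, worker: &str) -> Option<u64> {
        let (id, _action) = self.queued.pop_front()?;
        self.running.insert(id, worker.to_string());
        Some(id)
    }
}

impl WorkerStateManager for SimpleSchedulerStateManager {
    fn complete_operation(&mut self, operation_id: u64) -> bool {
        self.running.remove(&operation_id).is_some()
    }
}

/// Client-facing code only needs the client view.
fn submit(client: &mut dyn ClientStateManager, action: &str) -> u64 {
    client.add_action(action)
}

fn main() {
    let mut state = SimpleSchedulerStateManager::default();
    let op = submit(&mut state, "build //foo:bar");
    let assigned = state.assign_next("worker-1");
    let finished = state.complete_operation(op);
    println!("op={op} assigned={assigned:?} finished={finished}");
}
```

Each caller depends on one small trait, so a different backing implementation could be swapped in behind the same three views.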

In addition we have hardened the interactions of the different kinds of
IDs in NativeLink. Most relevant is the separation & introduction of:
* OperationId - Represents an individual operation being requested to
  be executed that is unique across all of time.
* ClientOperationId - An ID issued to the client when the client
  requests to execute a job. This ID will point to an OperationId
  internally, but the client is never exposed to the OperationId.
* AwaitedActionHashKey - A key used to uniquely identify an action
  that is not unique across time. This means that this key might
  have multiple OperationIds that have executed it at different
  points in time. This key is used as a "fingerprint" of an operation
  that the client wants to execute, and the scheduler may decide to
  join the stream onto an existing operation if this key has a hit.
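
The relationship between the three IDs can be sketched as follows. The newtype names come from the list above; their contents, the `Scheduler` type and its methods are assumptions for illustration only.

```rust
use std::collections::HashMap;

// Newtypes named after the commit message; their contents and the
// `Scheduler` type below are hypothetical.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct OperationId(u64);
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ClientOperationId(String);
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct AwaitedActionHashKey(String);

#[derive(Default)]
struct Scheduler {
    next_operation: u64,
    next_client: u64,
    // Fingerprint -> operation currently executing that action.
    by_fingerprint: HashMap<AwaitedActionHashKey, OperationId>,
    // Client-visible ID -> internal operation.
    by_client: HashMap<ClientOperationId, OperationId>,
}

impl Scheduler {
    /// Every request gets a fresh ClientOperationId. A new OperationId is
    /// only minted when no live operation matches the fingerprint.
    fn execute(&mut self, key: AwaitedActionHashKey) -> ClientOperationId {
        let existing = self.by_fingerprint.get(&key).cloned();
        let operation_id = match existing {
            Some(id) => id,
            None => {
                self.next_operation += 1;
                let fresh = OperationId(self.next_operation);
                self.by_fingerprint.insert(key, fresh.clone());
                fresh
            }
        };
        self.next_client += 1;
        let client_id = ClientOperationId(format!("client-{}", self.next_client));
        self.by_client.insert(client_id.clone(), operation_id);
        client_id
    }

    /// Ends the operation, so the same fingerprint maps to a new
    /// OperationId the next time it is requested.
    fn finish(&mut self, key: &AwaitedActionHashKey) {
        self.by_fingerprint.remove(key);
    }

    fn resolve(&self, client_id: &ClientOperationId) -> Option<&OperationId> {
        self.by_client.get(client_id)
    }
}

fn main() {
    let mut scheduler = Scheduler::default();
    let key = AwaitedActionHashKey("sha256:abc".to_string());
    let first = scheduler.execute(key.clone());
    let second = scheduler.execute(key.clone());
    // Different client IDs, same underlying operation.
    println!("{:?} -> {:?}", first, scheduler.resolve(&first));
    println!("{:?} -> {:?}", second, scheduler.resolve(&second));

    // Once the operation ends, the same fingerprint starts a new one.
    scheduler.finish(&key);
    let third = scheduler.execute(key);
    println!("{:?} -> {:?}", third, scheduler.resolve(&third));
}
```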

Overall these changes pave the way for more robust scheduler
implementations. Most notably, distributed scheduler implementations
will be easier to implement and will be introduced in follow-up PRs.

This commit was developed on a side branch and consisted of the
following commits with corresponding code reviews:
54ed73c
    Add scheduler metrics back (TraceMachina#1171)
50fdbd7
    fix formatting (TraceMachina#1170)
8926236
    Merge in main and format (TraceMachina#1168)
9c2c7b9
    key as u64 (TraceMachina#1166)
0192051
    Cleanup unused code and comments (TraceMachina#1165)
080df5d
    Add versioning to AwaitedAction (TraceMachina#1163)
73c19c4
    Fix sequence bug in new memory store manager (TraceMachina#1162)
6e50d2c
    New AwaitedActionDb implementation (TraceMachina#1157)
18db991
    Fix test on running_actions_manager_test (TraceMachina#1141)
e50ef3c
    Rename workers to `worker_scheduler`
1fdd505
    SimpleScheduler now uses config for action pruning (TraceMachina#1137)
eaaa872
    Change encoding for items that are cachable (TraceMachina#1136)
d647056
    Errors are now properly handles in subscription (TraceMachina#1135)
7c3e730
    Restructure files to be more appropriate (TraceMachina#1131)
5e98ec9
    ClientAwaitedAction now uses a channel to notify drops happened (TraceMachina#1130)
52beaf9
    Cleanup unused structs (TraceMachina#1128)
e86fe08
    Remove all uses of salt and put under ActionUniqueQualifier (TraceMachina#1126)
3b86036
    Remove all need for workers to know about ActionId (TraceMachina#1125)
5482d7f
    Fix bazel build and test on dev (TraceMachina#1123)
ba52c7f
    Implement get_action_info to all ActionStateResult impls (TraceMachina#1118)
2fa4fee
    Remove MatchingEngineStateManager::remove_operation (TraceMachina#1119)
34dea06
    Remove unused proto field (TraceMachina#1117)
3070a40
    Remove metrics from new scheduler (TraceMachina#1116)
e95adfc
    StateManager will now cleanup actions on client disconnect (TraceMachina#1107)
6f8c001
    Fix worker execution issues (TraceMachina#1114)
d353c30
    rename set_priority to upgrade_priority (TraceMachina#1112)
0d93671
    StateManager can now be notified of noone listeneing (TraceMachina#1093)
cfc0cf6
    ActionScheduler will now use ActionListener instead of tokio::watch (TraceMachina#1091)
d70d31d
    QA fixes for scheduler-v2 (TraceMachina#1092)
f2cea0c
    [Refactor] Complete rewrite of SimpleScheduler
34d93b7
    [Refactor] Move worker notification in SimpleScheduler under Workers
b9d9702
    [Refactor] Moves worker logic back to SimpleScheduler
7a16e2e
    [Refactor] Move scheduler state behind mute
allada added a commit that referenced this pull request Jul 16, 2024
allada added a commit that referenced this pull request Jul 17, 2024