Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dispatch ML task to ML node first #346

Merged
merged 1 commit into from
Jun 17, 2022

Conversation

ylwu-amzn
Copy link
Collaborator

@ylwu-amzn ylwu-amzn commented Jun 15, 2022

Signed-off-by: Yaliang Wu ylwu@amazon.com

Description

We plan to support dynamic role in OpenSearch. Refer to this PR opensearch-project/OpenSearch#3436 and issue opensearch-project/OpenSearch#2877

This PR is to change the task dispatcher to support ML node. Will check if cluster has any node with "ml" role first. If yes, will dispatch ML task to "ml" nodes; otherwise will dispatch to data nodes.

Issues Resolved

close #79

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Yaliang Wu <ylwu@amazon.com>
@@ -115,14 +113,7 @@ public class MachineLearningPlugin extends Plugin implements ActionPlugin {
private ClusterService clusterService;
private ThreadPool threadPool;

public static final Setting<Boolean> IS_ML_NODE_SETTING = Setting.boolSetting("node.ml", false, Setting.Property.NodeScope);
Copy link
Collaborator

@jackiehanyang jackiehanyang Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curiosity question: what is the purpose of IS_ML_NODE_SETTING before and why we don't need it now? I saw this part of logic was moved to TestHelper class, what's the reason for that?

Copy link
Collaborator Author

@ylwu-amzn ylwu-amzn Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That part was from prototype when we tried to support ML node. But actually it's not being used in our formal release as OpenSearch core doesn't support ML role and we have to postpone that. Now OpenSearch plan to support ML role with dynamic role feature in 2.1. We can add this back but we don't need this prototype/experiment code any more. Just move it to test part.

DiscoveryNode[] mlNodes = eligibleMLNodes.toArray(new DiscoveryNode[0]);
log.debug("Find {} dedicated ML nodes: {}", eligibleMLNodes.size(), Arrays.toString(mlNodes));
return mlNodes;
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: This "else" should be redundant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's code style preference. People have different preference for "No-else-after-return" or not, check https://stackoverflow.com/questions/46875442/unnecessary-else-after-return-no-else-return.

For me, I feel the code is more readable to keep else to make the returns have same indentation.

Comment on lines +110 to +116
/**
* Get eligible node to run ML task. If there are nodes with ml role, will return all these
* ml nodes; otherwise return all data nodes.
*
* @return array of discovery node
*/
protected DiscoveryNode[] getEligibleNodes() {
Copy link
Collaborator

@Zhangxunmt Zhangxunmt Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only preferable to run ML tasks in ml node? I assume ml-common can run in data node as well. Also is there any logic in the ClusterState.nodes() to evaluate if any ml node is overloaded, etc? I just wonder, in the future, if we want to add more priority based strategy here to prioritize ML node, but still use data node if ML node is heavy loaded, etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only preferable to run ML tasks in ml node? I assume ml-common can run in data node as well.

Check the comment "If there are nodes with ml role, will return all these ml nodes; otherwise return all data nodes."

Also is there any logic in the ClusterState.nodes() to evaluate if any ml node is overloaded, etc?

Yes, we check JVM heap usage and how many ML task running on a node. If exceeds limit, will not dispatch new ML task to that node.

I just wonder, in the future, if we want to add more priority based strategy here to prioritize ML node, but still use data node if ML node is heavy loaded, etc.

I think we'd better ask user to scale the cluster by adding more ML node or switch to more powerful node type if ML node is heavy/over loaded. But this is not the one way door, we can always tune the code if cx really needs to run model on data nodes if ML node overloaded.

@ylwu-amzn ylwu-amzn merged commit 6cbb626 into opensearch-project:main Jun 17, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jun 17, 2022
Signed-off-by: Yaliang Wu <ylwu@amazon.com>
(cherry picked from commit 6cbb626)
ylwu-amzn added a commit that referenced this pull request Jun 17, 2022
Signed-off-by: Yaliang Wu <ylwu@amazon.com>
(cherry picked from commit 6cbb626)

Co-authored-by: Yaliang Wu <ylwu@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support dedicated ml node
3 participants