Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CPU ranges in desired nodes #86434

Merged
merged 27 commits into from
May 20, 2022

Conversation

fcofdez
Copy link
Contributor

@fcofdez fcofdez commented May 4, 2022

This commit adds support for CPU ranges in the desired nodes API.

This aligns better with environments where administrators/orchestrators
can define lower and upper bounds for the amount of CPUs that the
desired node would get once deployed.

This allows to provide information about the expected CPU and possible
allowed overcommit that the desired node will run on.

This was the previous expected body for the desired nodes API (we still support it):

PUT /_internal/desired_nodes/history/1
{
    "nodes" : [
        {
            "settings" : {
                 "node.name" : "instance-000187",
                 "node.external_id": "instance-000187",
                 "node.roles" : ["data_hot", "master"],
                 "node.attr.data" : "hot",
                 "node.attr.logical_availability_zone" : "zone-0"
            },
            "processors" : 8, 
            "memory" : "58gb",
            "storage" : "1700gb",
            "node_version" : "8.3.0"
        }
    ]
}

Now it's possible to define processors or processors_range as in:

PUT /_internal/desired_nodes/history/1
{
    "nodes" : [
        {
            "settings" : {
                 "node.name" : "instance-000187",
                 "node.external_id": "instance-000187",
                 "node.roles" : ["data_hot", "master"],
                 "node.attr.data" : "hot",
                 "node.attr.logical_availability_zone" : "zone-0"
            },
            "processors_range" : {"min": 8.0, "max": 16.0},
            "memory" : "58gb",
            "storage" : "1700gb",
            "node_version" : "8.3.0"
        }
    ]
}

Note that max in processors_range is optional.

This commit also moves from representing CPUs as integers to
accept floating point numbers.

Note: I disabled the bwc yamlRestTests for versions < 8.3 since we introduced
a few "breaking changes" but since this is an internal API it should be fine.

@fcofdez fcofdez marked this pull request as ready for review May 5, 2022 10:07
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Copy link
Collaborator

Hi @fcofdez, I've created a changelog YAML for you.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, two questions:

  • Are the memory and storage fields going to also have this change eventually?
  • Can the processors.min be set without processors.max or is the value always required?

@fcofdez
Copy link
Contributor Author

fcofdez commented May 5, 2022

Thanks for the review @sethmlarson

Are the memory and storage fields going to also have this change eventually?

It might change as more internal users adopt this API.

Can the processors.min be set without processors.max or is the value always required?

When processors is represented as an object, min and max are required.
The API still supports the old version, where you can specify the number of processors as a number. In that case min and max are the supplied value.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial read and left a few comments.

@@ -77,7 +77,7 @@ PUT /_internal/desired_nodes/Ywkh3INLQcuPT49f6kcppA/100
"node.attr.data" : "hot",
"node.attr.logical_availability_zone" : "zone-0"
},
"processors" : 8,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we would not want to keep the processors: 8 case in the documentation (perhaps using 8.0 instead). I think the best world is if we have just the one number rather than multiple and having that case be simpler for those orchestrations that can supply it seems desirable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us also mention that processors can be specified as a range, that makes it easier to see which releases the range syntax is supported in (and documents it).

@@ -774,3 +767,135 @@ teardown:
- { settings: { "node.external_id": "instance-000187"}, processors: 64, memory: "1b", storage: "1b", node_version: null }
- match: { status: 400 }
- match: { error.type: x_content_parse_exception }
---
"Test update desired nodes using legacy processors field":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not call this legacy, the min/max is not really the desired way to use this.

"Test update desired nodes using legacy processors field":
- skip:
reason: "contains is a newly added assertion"
features: contains
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we are not using contains here (as well as the rest of the new tests)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy-paste... 🤦

}

public int roundedMin() {
return Math.round(min);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should round this down and round max up? That seems more intuitive, since then the actual quota of processors is still within the range.

@@ -86,21 +91,23 @@ private static Version parseVersion(String version) {
}

private final Settings settings;
private final int processors;
private final Processors processors;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this will work in a mixed cluster, since we are changing the type of a field stored in cluster state (regardless of whether the new API has been used) and an old version cannot read it. I think we need a rolling upgrade test. Also, I wonder if we should simply make it two fields instead like processors and processors_range - to simplify parsing both for ourselves and clients? We could still make the two fields mutually exclusive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the assumption that we would be able to break this since we don't have known users of this API, but that might be a risky assumption. I'm wondering if we should prevent upgrades during mixed-version clusters (at least when there are 8.2 and 8.3 master nodes) as we should be able to accept a processor range but the older nodes won't have that notion and we would have to send them the lower bound, maybe it's better to reject the updates in that phase to simplify things? wdty @henningandersen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that might be a risky assumption

Yes, it can be a bit hard to control that it is never called on an 8.2 cluster. And if so, it will essentially break the cluster during upgrade, which sounds bad.

we would have to send them the lower bound, maybe it's better to reject the updates in that phase to simplify things?

It sounds reasonable to me to reject a range specification during an upgrade to the release allowing the range specification, since otherwise we would be throwing away data.

I think we need to still accept a single processors specification in a mixed cluster though, i.e., as long as the client conforms to the old version API, everything should still work.


static {
PROCESSORS_PARSER.declareFloat(ConstructingObjectParser.constructorArg(), MIN_FIELD);
PROCESSORS_PARSER.declareFloat(ConstructingObjectParser.constructorArg(), MAX_FIELD);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be optional? I think k8s might only have requests set, leading to no max cpu. In that case we would still like to know the minimum value.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a few tests using a dev version of the ECK operator (from this PR) and it works great 👍

Some comments/questions:

  • Setting a value of 0 for the max CPU capacity does not seem to clear a previous max value. It also does not raise an error as I think it is the case in the current implementation. I'm wondering if setting a value of 0 should raise an error as it is the case now? (and also maybe explain it in the documentation?)

  • I also want to double check the units: if a memory capacity of 2333Mi is set in the K8S resource, it is translated by the ECK operator as 2446327808b:

In K8S:

  resources:
    requests:
      memory: 2333Mi
      cpu: 2222m
    limits:
      memory: 2333Mi
      cpu: 3141m

This the content of the API request by the ECK operator:

PUT /_internal/desired_nodes/history/version

"processors_range": {
	"min": 2.222,
	"max": 3.141
},
"memory": "2446327808b"

It is however then stored in Elasticsearch as 2.2gb in the desired node state:

GET _internal/desired_nodes/_latest

      "processors_range" : {
        "min" : 2.222,
        "max" : 3.141
      },
      "memory" : "2.2gb"

I just want to be sure it is the expected result given that it should be either 2.2783203125 GiB or 2.446327808 GB. I think this is because "2.2gb" actually means "2.2GiB"?

Thanks!

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks mostly good. My only real concerns is with how ProcessorsRange.max() returns the min value if not specified, that looks wrong to me.

@@ -77,7 +77,7 @@ PUT /_internal/desired_nodes/Ywkh3INLQcuPT49f6kcppA/100
"node.attr.data" : "hot",
"node.attr.logical_availability_zone" : "zone-0"
},
"processors" : 8,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us also mention that processors can be specified as a range, that makes it easier to see which releases the range syntax is supported in (and documents it).

public class DesiredNodesUpgradeIT extends AbstractRollingTestCase {
public void testUpgradeDesiredNodes() throws Exception {
// Desired nodes was introduced in 8.2
if (UPGRADE_FROM_VERSION.before(Version.V_8_2_0)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we introduced it in 8.1, so we should test with upgrade from version = 8.1+?


@Override
public Float max() {
return max == null ? min : max;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have thought we would make this return null if max is null? Or infinity.

IllegalArgumentException.class,
() -> new DesiredNode(
settings,
new DesiredNode.ProcessorsRange(randomFloat() + 1, randomInvalidFloatProcessor()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
new DesiredNode.ProcessorsRange(randomFloat() + 1, randomInvalidFloatProcessor()),
new DesiredNode.ProcessorsRange(randomFloat() + 0.1, randomInvalidFloatProcessor()),

IllegalArgumentException.class,
() -> new DesiredNode(
settings,
new DesiredNode.ProcessorsRange(3.0f, 2.0f),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would pass even if comparing ints. Perhaps we can pick the lower bound randomly between 0.1-10 and the upper bound between 0.01 and lower bound-Math.ulp(lower bound)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about ulp. Thanks for the tip!

}

private Float randomInvalidFloatProcessor() {
return randomFrom(-1.0f, Float.NaN, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also include 0.0f?

case 1 -> randomDesiredNode(version, randomIntBetween(1, 256), settingsProvider);
case 2 -> randomDesiredNode(version, randomIntBetween(1, 256) + randomFloat(), settingsProvider);
default -> throw new IllegalStateException();
};
}

public static DesiredNode randomDesiredNode(Version version, int processors, Consumer<Settings.Builder> settingsProvider) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simply remove this overload?

@fcofdez
Copy link
Contributor Author

fcofdez commented May 18, 2022

Thanks for testing the PR @barkbay!

Setting a value of 0 for the max CPU capacity does not seem to clear a previous max value. It also does not raise an error as I think it is the case in the current implementation. I'm wondering if setting a value of 0 should raise an error as it is the case now? (and also maybe explain it in the documentation?)

This is strange, it should fail (I even added a couple of tests that cover that case). When you mention that it does not clear the previous max value, do you mean that the Desired nodes remain in the previous version?

I also want to double check the units: if a memory capacity of 2333Mi is set in the K8S resource, it is translated by the ECK operator as 2446327808b:

Yes, it's GiB but it states that the unit is gb. Is that problematic?

@barkbay
Copy link
Contributor

barkbay commented May 19, 2022

This is strange, it should fail (I even added a couple of tests that cover that case). When you mention that it does not clear the previous max value, do you mean that the Desired nodes remain in the previous version?

I'm not able to reproduce, it might be a problem on my side.

Yes, it's GiB but it states that the unit is gb. Is that problematic?

No, just want to be sure I was understanding the units correctly.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Francisco.

…dNode.java

Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
@fcofdez
Copy link
Contributor Author

fcofdez commented May 20, 2022

@elasticmachine upgrade branch

@fcofdez fcofdez merged commit e91e7e6 into elastic:master May 20, 2022
@fcofdez
Copy link
Contributor Author

fcofdez commented May 20, 2022

Thanks all for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Autoscaling >enhancement Team:Clients Meta label for clients team Team:Distributed Meta label for distributed team v8.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants