Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix capacity response in mm-less ingestion #14888

Merged
merged 4 commits into from
Aug 25, 2023

Conversation

georgew5656
Copy link
Contributor

@georgew5656 georgew5656 commented Aug 21, 2023

Fixing bug where capacity for mm-less ingestion was not being properly reported

Description

This change: georgew5656@3954685 to report additional metrics for mm-less ingestion changed the key of the capacity hint returned in getTotalTaskSlotCount and didn't change it in OverlordResource, causing capacity to be returned as 0.

Initial plan was:
I decided to just update the hardcoded value in OverlordResource because we can't ref classes from the optional extension in core and I couldn't really find a good constants file to put the constant in for OverlordResource. Open to changing this if someone has a strong opinion here.

After some discussion I decided to move the logic for calculating capacity to the existing totalWorkerCapacity API instead of the /workers api. This cleans up the logic in the workers api (no need to pretend the k8s task runner has a worker), and lets us delete the isK8sTaskRunner method on the taskRunner.

Release note

  • Fix capacity response in mm-less ingestion.
  • Adds a new usedClusterCapacity to the GET /totalWorkerCapacity response. This API should be used to get the total ingestion capacity on the overlord.
Key changed/added classes in this PR
  • OverlordResource
  • TaskRunner,RemoteTaskRunner,HttpRemoteTaskRunner,KubernetesTaskRunner

K8s Task Runner with capacity of 100
Screenshot 2023-08-24 at 9 52 15 AM

Http Remote Task runner with a worker with capacity of 5
Screenshot 2023-08-24 at 9 53 29 AM

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [] added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@YongGang YongGang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -947,7 +947,7 @@ public Response apply(TaskRunner taskRunner)
"http",
"host",
"8100",
taskRunner.getTotalTaskSlotCount().getOrDefault("taskQueue", 0L).intValue(),
taskRunner.getTotalTaskSlotCount().getOrDefault("_k8s_worker_category", 0L).intValue(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update the comment above to indicate why this specific value is required? Also, please check if the value can be brought out into a constant somewhere.

@kfaraz
Copy link
Contributor

kfaraz commented Aug 22, 2023

@georgew5656 , @YongGang , I am a little confused on why we should be using this API rather than the getTotalWorkerCapacity or getTotalTaskSlots API for this purpose.

@YongGang
Copy link
Contributor

YongGang commented Aug 22, 2023

@kfaraz I believe this is only used by Druid console. This bug caused the following issue when run MSQ.
Screenshot 2023-08-18 at 1 54 28 PM

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for now. The correct fix would be to update the API used by web-console so that we don't need special handling for K8STaskRunner.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, @georgew5656 , the TaskRunner interface makes much more sense now. I have left some minor suggestions.

Could you please share a web-console screenshot after your changes to demonstrate the capacity is being populated correctly?

workers = ImmutableList.of();
currentCapacity = -1;
}
Collection<ImmutableWorkerInfo> workers = taskRunner instanceof WorkerTaskRunner ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the list of workers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for capacity with autoscaling, it looks like if capacity hint is not passed it tries to assume each worker has the same capacity when calculating max capacity, so it needs at least a single worker as an example.

@@ -154,5 +146,14 @@ default void updateLocation(Task task, TaskLocation location)
// do nothing
}

default int getTotalCapacity()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best add javadocs to these methods esp to point out that they can return -1 when not implemented or can't be determined.

@kfaraz
Copy link
Contributor

kfaraz commented Aug 24, 2023

@georgew5656 , also, please update the release note to reflect that a new field has been added in the API response and that this API should now be used to get total and used cluster capacity instead of the getWorkers API.

@kfaraz kfaraz merged commit ad32f84 into apache:master Aug 25, 2023
78 checks passed
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
krishnanand5 pushed a commit to krishnanand5/druid that referenced this pull request Nov 13, 2023
Changes:
- Fix capacity response in mm-less ingestion.
- Add field usedClusterCapacity to the GET /totalWorkerCapacity response.
This API should be used to get the total ingestion capacity on the overlord.
- Remove method `isK8sTaskRunner` from interface `TaskRunner`
abhishekrb19 pushed a commit that referenced this pull request Nov 15, 2023
…n` by changing string to key:value pair (#15207)

* Fix capacity response in mm-less ingestion (#14888)

Changes:
- Fix capacity response in mm-less ingestion.
- Add field usedClusterCapacity to the GET /totalWorkerCapacity response.
This API should be used to get the total ingestion capacity on the overlord.
- Remove method `isK8sTaskRunner` from interface `TaskRunner`

* Using Map to perform comparison

* Minor Change

---------

Co-authored-by: George Shiqi Wu <george.wu@imply.io>
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
…n` by changing string to key:value pair (apache#15207)

* Fix capacity response in mm-less ingestion (apache#14888)

Changes:
- Fix capacity response in mm-less ingestion.
- Add field usedClusterCapacity to the GET /totalWorkerCapacity response.
This API should be used to get the total ingestion capacity on the overlord.
- Remove method `isK8sTaskRunner` from interface `TaskRunner`

* Using Map to perform comparison

* Minor Change

---------

Co-authored-by: George Shiqi Wu <george.wu@imply.io>
writer-jill pushed a commit to writer-jill/druid that referenced this pull request Nov 20, 2023
…n` by changing string to key:value pair (apache#15207)

* Fix capacity response in mm-less ingestion (apache#14888)

Changes:
- Fix capacity response in mm-less ingestion.
- Add field usedClusterCapacity to the GET /totalWorkerCapacity response.
This API should be used to get the total ingestion capacity on the overlord.
- Remove method `isK8sTaskRunner` from interface `TaskRunner`

* Using Map to perform comparison

* Minor Change

---------

Co-authored-by: George Shiqi Wu <george.wu@imply.io>
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
…n` by changing string to key:value pair (apache#15207)

* Fix capacity response in mm-less ingestion (apache#14888)

Changes:
- Fix capacity response in mm-less ingestion.
- Add field usedClusterCapacity to the GET /totalWorkerCapacity response.
This API should be used to get the total ingestion capacity on the overlord.
- Remove method `isK8sTaskRunner` from interface `TaskRunner`

* Using Map to perform comparison

* Minor Change

---------

Co-authored-by: George Shiqi Wu <george.wu@imply.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants