-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix capacity response in mm-less ingestion #14888
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -947,7 +947,7 @@ public Response apply(TaskRunner taskRunner) | |||
"http", | |||
"host", | |||
"8100", | |||
taskRunner.getTotalTaskSlotCount().getOrDefault("taskQueue", 0L).intValue(), | |||
taskRunner.getTotalTaskSlotCount().getOrDefault("_k8s_worker_category", 0L).intValue(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please update the comment above to indicate why this specific value is required? Also, please check if the value can be brought out into a constant somewhere.
@georgew5656 , @YongGang , I am a little confused on why we should be using this API rather than the getTotalWorkerCapacity or getTotalTaskSlots API for this purpose. |
@kfaraz I believe this is only used by Druid console. This bug caused the following issue when run MSQ. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving for now. The correct fix would be to update the API used by web-console so that we don't need special handling for K8STaskRunner
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix, @georgew5656 , the TaskRunner
interface makes much more sense now. I have left some minor suggestions.
Could you please share a web-console screenshot after your changes to demonstrate the capacity is being populated correctly?
workers = ImmutableList.of(); | ||
currentCapacity = -1; | ||
} | ||
Collection<ImmutableWorkerInfo> workers = taskRunner instanceof WorkerTaskRunner ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need the list of workers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for capacity with autoscaling, it looks like if capacity hint is not passed it tries to assume each worker has the same capacity when calculating max capacity, so it needs at least a single worker as an example.
@@ -154,5 +146,14 @@ default void updateLocation(Task task, TaskLocation location) | |||
// do nothing | |||
} | |||
|
|||
default int getTotalCapacity() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best add javadocs to these methods esp to point out that they can return -1 when not implemented or can't be determined.
@georgew5656 , also, please update the release note to reflect that a new field has been added in the API response and that this API should now be used to get total and used cluster capacity instead of the getWorkers API. |
Changes: - Fix capacity response in mm-less ingestion. - Add field usedClusterCapacity to the GET /totalWorkerCapacity response. This API should be used to get the total ingestion capacity on the overlord. - Remove method `isK8sTaskRunner` from interface `TaskRunner`
…n` by changing string to key:value pair (#15207) * Fix capacity response in mm-less ingestion (#14888) Changes: - Fix capacity response in mm-less ingestion. - Add field usedClusterCapacity to the GET /totalWorkerCapacity response. This API should be used to get the total ingestion capacity on the overlord. - Remove method `isK8sTaskRunner` from interface `TaskRunner` * Using Map to perform comparison * Minor Change --------- Co-authored-by: George Shiqi Wu <george.wu@imply.io>
…n` by changing string to key:value pair (apache#15207) * Fix capacity response in mm-less ingestion (apache#14888) Changes: - Fix capacity response in mm-less ingestion. - Add field usedClusterCapacity to the GET /totalWorkerCapacity response. This API should be used to get the total ingestion capacity on the overlord. - Remove method `isK8sTaskRunner` from interface `TaskRunner` * Using Map to perform comparison * Minor Change --------- Co-authored-by: George Shiqi Wu <george.wu@imply.io>
…n` by changing string to key:value pair (apache#15207) * Fix capacity response in mm-less ingestion (apache#14888) Changes: - Fix capacity response in mm-less ingestion. - Add field usedClusterCapacity to the GET /totalWorkerCapacity response. This API should be used to get the total ingestion capacity on the overlord. - Remove method `isK8sTaskRunner` from interface `TaskRunner` * Using Map to perform comparison * Minor Change --------- Co-authored-by: George Shiqi Wu <george.wu@imply.io>
…n` by changing string to key:value pair (apache#15207) * Fix capacity response in mm-less ingestion (apache#14888) Changes: - Fix capacity response in mm-less ingestion. - Add field usedClusterCapacity to the GET /totalWorkerCapacity response. This API should be used to get the total ingestion capacity on the overlord. - Remove method `isK8sTaskRunner` from interface `TaskRunner` * Using Map to perform comparison * Minor Change --------- Co-authored-by: George Shiqi Wu <george.wu@imply.io>
Fixing bug where capacity for mm-less ingestion was not being properly reported
Description
This change: georgew5656@3954685 to report additional metrics for mm-less ingestion changed the key of the capacity hint returned in getTotalTaskSlotCount and didn't change it in OverlordResource, causing capacity to be returned as 0.
Initial plan was:
I decided to just update the hardcoded value in OverlordResource because we can't ref classes from the optional extension in core and I couldn't really find a good constants file to put the constant in for OverlordResource. Open to changing this if someone has a strong opinion here.
After some discussion I decided to move the logic for calculating capacity to the existing totalWorkerCapacity API instead of the /workers api. This cleans up the logic in the workers api (no need to pretend the k8s task runner has a worker), and lets us delete the isK8sTaskRunner method on the taskRunner.
Release note
Key changed/added classes in this PR
OverlordResource
TaskRunner,RemoteTaskRunner,HttpRemoteTaskRunner,KubernetesTaskRunner
K8s Task Runner with capacity of 100
Http Remote Task runner with a worker with capacity of 5
This PR has: