Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making tq stats visible from cli #638

Merged
merged 17 commits into from
Aug 29, 2024
Merged

Making tq stats visible from cli #638

merged 17 commits into from
Aug 29, 2024

Conversation

Shivs11
Copy link
Member

@Shivs11 Shivs11 commented Aug 19, 2024

What was changed

  • Adds support to view TQ stats from the CLI.
  • Sample input and output for the command:

temporal task-queue describe --task-queue xxx

Task Queue Statistics

BuildID TaskQueueType ApproximateBacklogCount ApproximateBacklogAge BacklogIncreaseRate TasksAddRate TasksDispatchRate
UNVERSIONED workflow 0 0s 0 0 0
UNVERSIONED activity 0 0s 0 0 0

temporal task-queue describe --task-queue xxx -o json

"stats": [ { "buildId": "UNVERSIONED", "taskQueueType": "workflow", "approximateBacklogCount": 0, "approximateBacklogAge": 0, "backlogIncreaseRate": 0, "tasksAddRate": 0, "tasksDispatchRate": 0 }, { "buildId": "UNVERSIONED", "taskQueueType": "activity", "approximateBacklogCount": 0, "approximateBacklogAge": 0, "backlogIncreaseRate": 0, "tasksAddRate": 0, "tasksDispatchRate": 0 } ]

Why?

  • Shall help in worker fleet management.

Checklist

  1. Closes

  2. How was this tested:

  • Modified an existing unit test
  • Ensured all other unit tests pass
  1. Any docs updates needed?
  • Updated docs.

@CLAassistant
Copy link

CLAassistant commented Aug 19, 2024

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@@ -15,7 +15,7 @@ require (
github.com/stretchr/testify v1.9.0
github.com/temporalio/ui-server/v2 v2.29.2
go.temporal.io/api v1.36.0
go.temporal.io/sdk v1.27.0
go.temporal.io/sdk v1.28.2-0.20240816033637-2a02c483658a
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgraded sdk to point to the version where TQ stats were introduced in DescribeTaskQueueEnhanced (#1553)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be merged to main until it is a stable tag on the dependency

@Shivs11 Shivs11 requested a review from ShahabT August 19, 2024 20:48
temporalcli/commands.taskqueue.go Show resolved Hide resolved
temporalcli/commands.taskqueue_test.go Outdated Show resolved Hide resolved
temporalcli/commands.gen.go Outdated Show resolved Hide resolved
@Shivs11 Shivs11 marked this pull request as ready for review August 19, 2024 21:16
@cretz cretz requested a review from a team August 20, 2024 12:34
Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show what the output of this command w/ stats for a normal task queue (no versioning) looks like in text and JSON form? Can just be as a comment here in the PR. It's hard to get a feel for what the user sees from the code alone.

@@ -15,7 +15,7 @@ require (
github.com/stretchr/testify v1.9.0
github.com/temporalio/ui-server/v2 v2.29.2
go.temporal.io/api v1.36.0
go.temporal.io/sdk v1.27.0
go.temporal.io/sdk v1.28.2-0.20240816033637-2a02c483658a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be merged to main until it is a stable tag on the dependency

temporalcli/commands.gen.go Outdated Show resolved Hide resolved
@Shivs11 Shivs11 requested a review from ShahabT August 20, 2024 17:15
@Shivs11 Shivs11 requested a review from ShahabT August 26, 2024 15:48
@ShahabT
Copy link
Contributor

ShahabT commented Aug 26, 2024

@antlai-temporal @Shivs11 I think it's better to change the short and long descriptions of the describe command. Right now it seems we're over-focusing on the reachability info. e.g. the short description is "Provides task reachability and pollers information for Workers on this Task Queue.".

I'd suggest we mention poller info and stats as the main output. Reachability can be mentioned as a experimental part of the API with proper warning that it may change in the future. Note that there is a chance that we want to move reachability API under temporal versioning command, and in that case it's not clear if we'd still keep it under temporal task-queue describe too.

@Shivs11
Copy link
Member Author

Shivs11 commented Aug 27, 2024

@antlai-temporal @Shivs11 I think it's better to change the short and long descriptions of the describe command. Right now it seems we're over-focusing on the reachability info. e.g. the short description is "Provides task reachability and pollers information for Workers on this Task Queue.".

I'd suggest we mention poller info and stats as the main output. Reachability can be mentioned as a experimental part of the API with proper warning that it may change in the future. Note that there is a chance that we want to move reachability API under temporal versioning command, and in that case it's not clear if we'd still keep it under temporal task-queue describe too.

I was not aware that the reachability side of things could be moved over later and knowing this now, it makes sense to not over-focus on it. I am for the act of keeping pollers and stats as the main output but shall wait for @antlai-temporal to give us the green light 🚥

@cretz
Copy link
Member

cretz commented Aug 28, 2024

Can you show what the output of this command w/ stats for a normal task queue (no versioning) looks like in text and JSON form?

I saw the markdown table in PR description but I think we want to see what the raw table looks like in text (and will wait until after y'all's discussion on text/json format to review what latest output looks like and approve).

@Shivs11
Copy link
Member Author

Shivs11 commented Aug 28, 2024

Can you show what the output of this command w/ stats for a normal task queue (no versioning) looks like in text and JSON form?

I saw the markdown table in PR description but I think we want to see what the raw table looks like in text (and will wait until after y'all's discussion on text/json format to review what latest output looks like and approve).

Ah, my apologies for the misunderstanding. Pasting the output as seen from the CLI here:

Scenario - Adding a task in the backlog with no pollers:

  1. JSON
image
  1. Non-JSON
image

note - the backlog increase rate becomes 0 in the second picture because we are fetching the rate over a window of 30 seconds which were elapsed between the two screenshots.

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but would like to wait for other approvers because I think they may have stronger opinions on the output structure. And please don't merge until we have a tagged SDK dependency.

@antlai-temporal
Copy link
Contributor

I was not aware that the reachability side of things could be moved over later and knowing this now, it makes sense to not
over-focus on it. I am for the act of keeping pollers and stats as the main output but shall wait for @antlai-temporal to give us the green light

The main reason to over focus on reachability was the feedback that it was hard to find which command would give us that info. I think that once we move reachaibility to versioning there will be no issue, but until we do, it makes sense to give it priority in the short/long description.

Comment on lines +768 to 769
* `--disable-stats` (bool) - Disable task queue statistics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for reachability we use report-reachability as opposed to disable-reachability because it is expensive, and we wanted to avoid calls.
Is getting the stats also expensive? In that case we should use also use report-stats...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stats is not the cheapest thing (because it need partition fan-out) but still relatively cheap. I'm comfortable with having it enabled by default. wdyt @dnr ?

We may want to add caching for stats in the root partition in the future (if SDK starts to query it for poller scaling).

Comment on lines 727 to 728
information for the requested versions and all task types, which can be used to safely retire Workers with old code versions,
provided that they were assigned a Build ID.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part of the sentence is only about reachability and it should be removed.

Copy link
Member Author

@Shivs11 Shivs11 Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, however we should have the part : the requested versions and all task types.

All in all, the sentence should look like:
The temporal task-queue describe command provides poller information, backlog statistics and task reachability (experimental) information for the requested versions and task types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task reachability (experimental) information

"information" seems redundant?

temporalcli/commandsmd/commands.md Show resolved Hide resolved
@Shivs11 Shivs11 requested a review from ShahabT August 28, 2024 22:28
@Shivs11 Shivs11 merged commit 0120a01 into nexus Aug 29, 2024
5 of 6 checks passed
@Shivs11 Shivs11 deleted the shivam/cli-tq-stats branch August 29, 2024 17:37
bergundy added a commit that referenced this pull request Aug 29, 2024
…"_" (#641)

Also [Bring back nexus workflow view
tests](83b70c2)
now that the SDK is released.

Note that since #638 was merged into the nexus branch, it is no longer
in a releasable state since `go.mod` is referencing an unreleased Go
SDK.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants