Move key telemetry spans into app #1171

Closed
Tracked by #1172
taylordowns2000 opened this issue Oct 4, 2023 · 7 comments · Fixed by #1175
@taylordowns2000
Member

Right now, our load_testing script (see BENCHMARKING.md) adds spans around key functions in an .exs file. This ticket is to experiment with adding those spans to the app itself so we can monitor them first in Live Dashboard and later in Grafana or something like it.

The proposed spans are:

  1. rolling average, time to store new HTTP request bodies in the DB when sent to a webhook trigger URL
  2. rolling average, time to start executing a new workorder for a webhook triggered attempt
  3. rolling average, how long is it taking for an attempt to be claimed off the queue? (this helps us understand if worker apps are being auto-scaled efficiently, or if they are being overworked)
  4. rolling average, how long is it taking for users to access the history page? (historically, this is the heaviest read from the DB)
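As a rough illustration of what moving a span into the app could look like, the sketch below wraps a function in `:telemetry.span/3` and listens for the resulting stop event. It is a standalone script, not Lightning code: the event prefix `[:lightning, :webhook, :store_request]`, the `SpanSketch` module, and the sleeping stand-in function are all assumptions made for the example.

```elixir
# Standalone sketch; run with `elixir span_sketch.exs`.
Mix.install([{:telemetry, "~> 1.0"}])

defmodule SpanSketch do
  # Hypothetical event prefix, not taken from the Lightning codebase.
  @event [:lightning, :webhook, :store_request]

  def event, do: @event

  # Wraps `fun` so that :telemetry emits start and stop events around it.
  def timed(fun) when is_function(fun, 0) do
    :telemetry.span(@event, %{}, fn ->
      result = fun.()
      # span/3 expects a {result, stop_metadata} tuple.
      {result, %{}}
    end)
  end

  # Handler for the stop event; `duration` arrives in native time units.
  def handle_stop(_event, %{duration: duration}, _metadata, pid) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    send(pid, {:span_stopped, ms})
  end
end

:ok =
  :telemetry.attach(
    "span-sketch",
    SpanSketch.event() ++ [:stop],
    &SpanSketch.handle_stop/4,
    self()
  )

# Stand-in for the real work (for example, persisting a request body).
:stored =
  SpanSketch.timed(fn ->
    Process.sleep(5)
    :stored
  end)

receive do
  {:span_stopped, ms} -> IO.puts("store_request took ~#{ms} ms")
end
```

In the app, the handler would presumably be replaced by a `Telemetry.Metrics` definition (for example a `summary/2` on the stop event's `duration`) so that Live Dashboard can chart it; whether that gives the exact "rolling average" described above is an open question this sketch does not answer.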
@rorymckinley
Collaborator

@taylordowns2000 I am not fully grokking the difference between (2) & (3):

I think this addresses 2 (it exposes the Oban telemetry relating to the queue time for jobs on the 'runs' queue):

https://github.com/OpenFn/Lightning/blob/edec03c3da20e2592f2975075c3a2a2f72d166ad/lib/lightning_web/telemetry.ex#L79-L85
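The linked lines are not reproduced here. As a hedged sketch of how a queue-time metric for one Oban queue can be declared with `Telemetry.Metrics`, the script below defines a `summary/2` on the `queue_time` measurement of Oban's `[:oban, :job, :stop]` event and keeps only events for the `runs` queue. The `QueueTimeSketch` module is hypothetical, and the assumption that the event metadata carries a `:queue` key should be checked against the Oban version in use.

```elixir
# Standalone sketch; run with `elixir queue_time_sketch.exs`.
Mix.install([{:telemetry_metrics, "~> 0.6 or ~> 1.0"}])

defmodule QueueTimeSketch do
  import Telemetry.Metrics

  # Metric definitions in the shape a Phoenix telemetry module returns.
  def metrics do
    [
      summary("oban.job.stop.queue_time",
        # Convert from native time units to milliseconds for display.
        unit: {:native, :millisecond},
        # Only keep events whose metadata points at the runs queue.
        keep: &runs_queue?/1
      )
    ]
  end

  # Assumption: the job's queue name is available under :queue in the
  # metadata. to_string/1 lets the check accept "runs" or :runs.
  def runs_queue?(metadata) do
    to_string(Map.get(metadata, :queue)) == "runs"
  end
end

[metric] = QueueTimeSketch.metrics()
IO.inspect(metric.event_name)
# prints [:oban, :job, :stop]
```

Pointing the same definition at a different queue only means changing the predicate, which is why re-applying it after a queue restructure should be cheap.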

Following the code, I end up at: https://github.com/OpenFn/Lightning/blob/1920e834c81cc1921384dbf92c9d992582b7e544/lib/lightning/runtime/handler.ex#L50

Is this where I should be looking for (3) - or am I completely lost?

@rorymckinley
Collaborator

@taylordowns2000 Questions, part deux: For (4) - would I be correct in assuming that we would want to instrument both the initial load of the history page as well as the filters (Workflow, Status, Time) and Search page loads?

@taylordowns2000
Member Author

Good questions @rorymckinley .

2 - I think I had this wrong earlier. I think the function from the initial set of load tests that was getting spanned was to add a workorder to the queue, rather than to actually start a workorder.
3 - This one, we're trying to understand how long stuff is waiting in the queue before it gets claimed by a worker. There, the name of the queue you're looking to monitor would be the :runs queue, but since we're only a few days away from a radically new enqueue/dequeue structure, we might actually put a pause on this one for now.
4 - It would be useful to know about each time a new set of filters is sent to the backend: https://github.com/OpenFn/Lightning/blob/extend_telemetry_metrics/lib/lightning_web/live/run_live/index.ex#L118-L135

@rorymckinley
Collaborator

Thanks @taylordowns2000 .

2 - Ok, I will inspect and adapt.
3 - I think the solution for (2) then applies to 3 - and no worries if that is going to change - to apply the same logic to another queue is pretty cheap in the grand scheme of things, so I can circle back to that.
4 - Thanks - will take a look.

@rorymckinley
Collaborator

@taylordowns2000 Wrt (2) - I believe the code you are referencing in the load tests is this:

https://github.com/OpenFn/Lightning/blob/5e2da96faea24c06d6cd80a9803f22460b471f09/benchmarking/load_tests.exs#L136-L139

This is currently included within the telemetry span that I built to address (1). If my above understanding is correct, I would suggest that we do not include a lower-level span as well until we have a reason to drill down?

@rorymckinley
Collaborator

@taylordowns2000 Pending an answer to my question above, this is what I have at the moment (note: lightning.repo.query.idle.time was already there):

[image]

I have some questions relating to the code itself, but if you are happy with this as a first pass, I think I am going to move on to the load testing. Hopefully it will generate more diverse data, which would give us a better opportunity to see how the LiveDashboard performs.

@taylordowns2000
Member Author

That's perfect. Thank you, Rory. I'll also connect you with @josephjclark toward the end of next week to discuss some key metrics for the new JavaScript Worker--the thing that pulls work order attempts off the queue and... does work.

This issue was closed.