Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm #872

Closed
7 tasks done
stradig opened this issue Mar 22, 2024 · 11 comments

@stradig
Contributor

stradig commented Mar 22, 2024

We want to be able to experiment with the algorithm to see which of those values can improve performance for autoscaled computes.

@skyzh
Member

skyzh commented Mar 25, 2024

Need to investigate how to export data using SQL statements. This does not seem to be supported by vector.dev.

@sharnoff
Member

IIRC the existing metrics are exposed by sql-exporter — I think vector could just pull from there, if we want to expose them via vector.

@skyzh
Member

skyzh commented Mar 25, 2024

yep, I found https://vector.dev/docs/reference/configuration/sources/prometheus_scrape/ that directly scrapes exporter data.

skyzh added a commit to neondatabase/neon that referenced this issue Mar 29, 2024
ref neondatabase/autoscaling#878
ref neondatabase/autoscaling#872

Add `approximate_working_set_size` to sql exporter so that autoscaling
can use it in the future.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
@Omrigan
Contributor

Omrigan commented Apr 11, 2024

So we have 4 possible ways to go forward:

  1. Fetch from vector (vm-builder: add SQL exporter to vector #878)
    • Disadvantage: adds an additional delay between sql-exporter and vector
  2. Fetch from sql-exporter (agent: Support fetching LFC metrics (but don't use them yet) #895)
    • Disadvantage: sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s
  3. Fetch from vm-monitor (vm-monitor: collect lfc stats from vm-monitor neon#7302 (comment))
    • Disadvantage: one more place to implement working with metrics
  4. Fetch directly from postgres
    • Disadvantage: breaks abstraction layers, and needs some way to get credentials into the autoscaler-agent

@skyzh @sharnoff does this sound correct? Which ones do you prefer?

@sharnoff
Member

My thoughts — I want to avoid adding tech debt by linking together components that weren't previously linked.

  1. Fetch from vector — modifies vector here to support sql-exporter in neondatabase/neon, adding a new link. Also has the downside of repeating metric values because the autoscaler-agent fetch frequency would be greater than vector's refresh frequency.
  2. Fetch from sql-exporter — mostly doesn't add a new link beyond what's required for this issue; the autoscaler-agent already fetches prometheus metrics from the VM. That's why I went with this approach.
  3. Fetch from vm-monitor — adds a new responsibility to vm-monitor, and would also require additional support in the autoscaler-agent. All work done on the autoscaler-agent <-> vm-monitor protocol should be approached with hazmat suits for now. It does what we need it to, but it needs a lot of work, and I'm hesitant to add more responsibilities to it until after some refactoring has taken place.
  4. Fetch directly from postgres — adds a new link between autoscaler-agent and postgres, like you said @Omrigan. And yeah, credentials would be quite tricky, requiring help from other components we don't currently rely on.

re:

sql-exporter fetches a number of things and it might overload the database if we fetch it every 5-15s

The current state of #895 is to have a configurable port and frequency — we can fetch as slowly as we need to. For the ext-metrics datasources, we already query every 15s (or maybe even more frequently?). Once a secondary sql-exporter is added with just the cheap metrics, we can e.g. gradually roll out fetching from the different port at a higher frequency, eventually switching everything over once old VMs restart.
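
For context on what option 2 looks like mechanically, here's a minimal Go sketch of an agent-side scrape: fetch a Prometheus endpoint inside the VM and pull out the LFC metric families. The port and metric names below are illustrative placeholders, not what #895 actually ships.

```go
// Sketch of option 2 from the autoscaler-agent side: scrape a Prometheus
// endpoint exposed in the VM and pick out the LFC metrics. Endpoint, port,
// and metric names are placeholders for illustration only.
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/common/expfmt"
)

// scrapeLFCMetrics fetches one round of metrics from the given URL and returns
// the values of the requested metric families, keyed by metric name.
func scrapeLFCMetrics(url string, names []string) (map[string]float64, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		return nil, err
	}

	out := make(map[string]float64)
	for _, name := range names {
		mf, ok := families[name]
		if !ok || len(mf.GetMetric()) == 0 {
			continue // metric not exposed (e.g. older VM image)
		}
		// Take the first sample; handle both gauge- and counter-typed metrics.
		m := mf.GetMetric()[0]
		switch {
		case m.GetGauge() != nil:
			out[name] = m.GetGauge().GetValue()
		case m.GetCounter() != nil:
			out[name] = m.GetCounter().GetValue()
		}
	}
	return out, nil
}

func main() {
	// Hypothetical secondary exporter port dedicated to the cheap LFC metrics.
	metrics, err := scrapeLFCMetrics(
		"http://localhost:9499/metrics",
		[]string{"lfc_hits", "lfc_misses", "lfc_approximate_working_set_size"},
	)
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Println(metrics)
}
```

With something like this, the fetch frequency is just how often the function is called, which is what makes the "configurable port and frequency" part cheap to support.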

@Omrigan
Contributor

Omrigan commented Apr 16, 2024

@skyzh Can you share your opinion on options 2 vs 3?

@skyzh
Member

skyzh commented Apr 16, 2024

If we want to have a second sql-exporter, I'm fine with either option 2 or 3. Otherwise, there needs to be a place to fetch these metrics from, and vm-monitor is the easier place for that.

@skyzh
Member

skyzh commented Apr 16, 2024

...to be specific, I assume the autoscaler-agent will at some point scrape this data at a high frequency, and I don't want these SQL queries to be executed every time we scrape sql-exporter:

https://github.com/neondatabase/neon/blob/2d5a8462c8093fb7db7e15cea68c6d740818c39c/vm-image-spec.yaml#L161-L188

Therefore, I'm proposing that we not put the autoscaling metrics into the normal metrics sql-exporter.

@sharnoff
Member

sharnoff commented Apr 22, 2024

Discussed briefly with @skyzh — tl;dr:

  • Medium-term, we want to avoid having the autoscaler-agent pull LFC metrics from the main sql-exporter
  • Short-term:
    1. We can have the autoscaler-agent pull metrics from the existing sql-exporter, just with a low frequency so we don't overload postgres
    2. We can set up a second sql-exporter to just report LFC metrics
  • Then, we can have the control plane set an annotation on new VMs to tell the autoscaler-agent to fetch LFC metrics at a higher frequency from the new port — giving the desired end state while retaining support for older VMs. (Rough sketch of the selection logic below.)
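
A minimal Go sketch of that annotation-driven switch, assuming a hypothetical annotation key and placeholder ports/intervals — the real autoscaler-agent config plumbing would look different:

```go
// Sketch of annotation-driven LFC scrape selection. The annotation key,
// ports, and intervals are hypothetical placeholders.
package main

import (
	"fmt"
	"strconv"
	"time"
)

// lfcScrapeConfig says where and how often to pull LFC metrics for one VM.
type lfcScrapeConfig struct {
	Port     int
	Interval time.Duration
}

// Hypothetical annotation that the control plane would set on new VMs.
const lfcMetricsPortAnnotation = "autoscaling.neon.tech/lfc-metrics-port"

// lfcScrapeConfigFor picks the scrape target for a VM: old VMs (no annotation)
// keep the existing sql-exporter at a low frequency; annotated VMs use the
// dedicated exporter port at a higher frequency.
func lfcScrapeConfigFor(annotations map[string]string) lfcScrapeConfig {
	if raw, ok := annotations[lfcMetricsPortAnnotation]; ok {
		if port, err := strconv.Atoi(raw); err == nil {
			return lfcScrapeConfig{Port: port, Interval: 15 * time.Second}
		}
	}
	// Fallback for older VMs: existing sql-exporter, scraped slowly so we
	// don't overload postgres. Port number is a placeholder.
	return lfcScrapeConfig{Port: 9399, Interval: time.Minute}
}

func main() {
	cfg := lfcScrapeConfigFor(map[string]string{lfcMetricsPortAnnotation: "9499"})
	fmt.Printf("scrape port %d every %s\n", cfg.Port, cfg.Interval)
}
```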

@sharnoff
Member

Status:

  • agent: Support fetching LFC metrics (but don't use them yet) #895 is ready to merge; it was just waiting to avoid interfering with a patch release
  • We found out the new metrics weren't exposed. PR to fix is neondatabase/cloud#14245
  • Remaining work after that is actually using the metrics (design + implementation of new scaling algorithm, maybe?)

sharnoff added a commit to neondatabase/neon that referenced this issue Jul 5, 2024
In general, rename:

- lfc_approximate_working_set_size to
- lfc_approximate_working_set_size_seconds

For the "main" metrics that are actually scraped and used internally,
the old one is just marked as deprecated.
For the "autoscaling" metrics, we're not currently using the old one, so
we can get away with just replacing it.

Also, for the user-visible metrics we'll only store & expose a few
different time windows, to avoid making the UI overly busy or bloating
our internal metrics storage.

But for the autoscaling-related scraper, we aren't storing the metrics,
and it's useful to be able to programmatically operate on the trendline
of how WSS increases (or doesn't!) with window size. So there, we can just
output datapoints for each minute.

Part of neondatabase/autoscaling#872.
See also #7466.
sharnoff added a commit that referenced this issue Jul 7, 2024
Part of #872.
This builds on the metrics that will be exposed by neondatabase/neon#8298.

For now, we only look at the working set size metrics over various time
windows.

The algorithm is somewhat straightforward to implement (see wss.go), but
unfortunately it seems difficult to understand *why* it's expected to work.

See also: https://www.notion.so/neondatabase/874ef1cc942a4e6592434dbe9e609350
sharnoff added a commit that referenced this issue Jul 7, 2024
sharnoff added a commit that referenced this issue Jul 10, 2024
sharnoff added a commit that referenced this issue Jul 19, 2024
Part of #872.
This builds on the metrics that will be exposed by neondatabase/neon#8298.

For now, we only look at the working set size metrics over various
evenly-spaced windows (all 1 minute apart).

The algorithm is somewhat straightforward to implement (see wss.go), but
unfortunately it seems difficult to understand *why* it's expected to work.

For more context, refer to the RFC here:
https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
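
The "trendline" reasoning in these commit messages can be made concrete with a small sketch. To be clear, this is not the algorithm in wss.go or the RFC's formula — the growth threshold and units are made up — it just shows why having WSS estimates at evenly-spaced window sizes is enough to detect where the working set stops growing:

```go
// Simplified illustration of why per-minute working set size (WSS) datapoints
// are useful: if WSS stops growing as the window widens, the working set is
// roughly that size. NOT the actual wss.go algorithm — a plateau sketch with
// an arbitrary 5% growth threshold.
package main

import "fmt"

// estimateWorkingSet takes WSS estimates for windows of 1..len(wssByWindow)
// minutes (evenly spaced) and returns the estimate at the first window where
// widening by one more minute grows WSS by less than growthThreshold
// (fractional, e.g. 0.05).
func estimateWorkingSet(wssByWindow []float64, growthThreshold float64) float64 {
	if len(wssByWindow) == 0 {
		return 0
	}
	for i := 1; i < len(wssByWindow); i++ {
		prev, cur := wssByWindow[i-1], wssByWindow[i]
		if prev > 0 && (cur-prev)/prev < growthThreshold {
			return cur // the trendline has flattened out
		}
	}
	// Never flattened: the working set may be larger than the longest window shows.
	return wssByWindow[len(wssByWindow)-1]
}

func main() {
	// e.g. WSS over 1..6 minute windows: grows, then plateaus around 5000.
	wss := []float64{3000, 4200, 4800, 4950, 5000, 5010}
	fmt.Println(estimateWorkingSet(wss, 0.05)) // prints 4950
}
```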
sharnoff added a commit to neondatabase/neon that referenced this issue Jul 22, 2024
In general, replace:

* 'lfc_approximate_working_set_size' with
* 'lfc_approximate_working_set_size_windows'

For the "main" metrics that are actually scraped and used internally,
the old one is just marked as deprecated.
For the "autoscaling" metrics, we're not currently using the old one, so
we can get away with just replacing it.

Also, for the user-visible metrics we'll only store & expose a few
different time windows, to avoid making the UI overly busy or bloating
our internal metrics storage.

But for the autoscaling-related scraper, we aren't storing the metrics,
and it's useful to be able to programmatically operate on the trendline
of how WSS increases (or doesn't!) with window size. So there, we can
just output datapoints for each minute.

Part of neondatabase/autoscaling#872
See also https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
sharnoff added a commit that referenced this issue Jul 23, 2024
Part of #872.
This builds on the metrics exposed by neondatabase/neon#8298.

For now, we only look at the working set size metrics over various time
windows.

The algorithm is somewhat straightforward to implement (see wss.go), but
unfortunately it seems difficult to understand *why* it's expected to work.

See also: https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
lubennikovaav pushed a commit to neondatabase/neon that referenced this issue Jul 25, 2024
@sharnoff
Member

sharnoff commented Aug 9, 2024

Earlier this week, LFC-aware scaling was completely rolled out to all regions. Closing this :)

@sharnoff sharnoff closed this as completed Aug 9, 2024