Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: optimize SelectMergeProfile #2762

Merged
merged 12 commits into from
Dec 1, 2023
Merged

Conversation

kolesnikovae
Copy link
Collaborator

@kolesnikovae kolesnikovae commented Nov 27, 2023

The PR:

  • Enables compression for SelectMergeProfile and SelectMergeStacktrace handlers.
  • Optimizes reading of deduplicated blocks SelectMergeProfile (skips filtering).
  • Enables query split by interval for SelectMergeProfile.
  • Minor improvements, such as use of VTProto marshal/unmarshal methods instead of the default ones.
  • Adds ability to truncate pprof profiles (max_node parameter added to SelectMergeProfile).

My initial goal was to optimize SelectMergeProfile and use it instead of SelectMergeStacktraces in order to get location information (such as line number, file name, etc) in the flamegraph.

However, this does not seem possible due to the vast overhead pprof brings: while a tree / famegraph has compression ratio 10 (gzip), with pprof we can barely expect compression ratio 2. This makes it virtually impossible to switch to pprof without a performance degradation. Moreover, without truncation, we easily hit the default gRPC message size limit (100MB):

image

Besides, there are certain limitations of the pprof processing flow:

  • In a single querier/query-frontend, we can handle ~50 pprof profiles per second (max_nodes is 8K, ca. 2MB raw uncompressed pprof) per core. With the tree representation (used for building flamegraphs), the rate is appx. 350 profiles per second. Optimisation of this procedure is not trivial and will likely require us to use an alternative format.
  • A single querier frontend merges responses from all the queriers, therefore we can't scale them (querier and query-frontend services) horizontally efficiently.

This makes me thinking that we probably first have to introduce "function selector" that allows to scope a flamegraph to a specific location – basically, request a new flamegraph with more details where the selected node is the root (not to confuse with the existing "Focus block" view). This "drill-down" effect will allow us to set a lower max_nodes limit (2-4-8K should be fine) as it mitigates the negative impact of truncation.

Nevertheless, exporting data in pprof will still be limited to the max message size (on the querier / query-frontend side). I don't think this is a problem, because I know no tools capable of handling this big profiles; but we should probable handle this more gracefully.

@kolesnikovae kolesnikovae changed the title WIP: feat: pprof in SelectMergeStacktraces feat: use pprof in SelectMergeStacktraces Nov 28, 2023
@kolesnikovae kolesnikovae changed the title feat: use pprof in SelectMergeStacktraces perf: optimize SelectMergeProfile Nov 29, 2023
@kolesnikovae kolesnikovae marked this pull request as ready for review November 29, 2023 10:49
@kolesnikovae kolesnikovae requested a review from a team as a code owner November 29, 2023 10:49
@kolesnikovae kolesnikovae linked an issue Nov 29, 2023 that may be closed by this pull request
Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@aleks-p aleks-p left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well! I left a few low priority questions / comments.

api/ingester/v1/ingester.proto Show resolved Hide resolved
api/querier/v1/querier.proto Outdated Show resolved Hide resolved
pkg/model/flamegraph.go Outdated Show resolved Hide resolved
@kolesnikovae kolesnikovae merged commit a27512a into main Dec 1, 2023
19 checks passed
@kolesnikovae kolesnikovae deleted the feat/flamegraph-from-pprof branch December 1, 2023 04:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for max_nodes parameter to pprof query
3 participants