Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize repeated row iterator #2572

Merged
merged 18 commits into from
Oct 25, 2023
Merged

Optimize repeated row iterator #2572

merged 18 commits into from
Oct 25, 2023

Conversation

kolesnikovae
Copy link
Collaborator

@kolesnikovae kolesnikovae commented Oct 24, 2023

The PR is aimed at improving performance of the parquet rows iterator:

  1. Refactored repeated row iterator, fixed excessive CPU time consumption.
  2. Asynchronous batching of fetched rows.
  3. Iterator tee. Instead of copying row iterators into memory, we now can stream them.

Benchmarking on synthetic data does not show any substantial improvements. However, with the real life data and queries, the difference is quite prominent:

goos: darwin
goarch: arm64
pkg: github.com/grafana/pyroscope/pkg/phlaredb
                             │  before.txt  │              after.txt              │
                             │    sec/op    │   sec/op     vs base                │
_MergeSampleByStacktraces-10   1278.8m ± 6%   463.1m ± 2%  -63.79% (p=0.000 n=10)

                             │  before.txt  │              after.txt              │
                             │     B/op     │     B/op      vs base               │
_MergeSampleByStacktraces-10   318.4Mi ± 2%   307.2Mi ± 2%  -3.52% (p=0.000 n=10)

                             │  before.txt  │              after.txt              │
                             │  allocs/op   │  allocs/op   vs base                │
_MergeSampleByStacktraces-10   1193.3k ± 0%   884.8k ± 0%  -25.85% (p=0.000 n=10)

image

Flamegraph

In terms of query latency, in some cases it is reduced by up to 25-30%, with the impact becoming more noticeable the more profiles that need to be retrieved.

@kolesnikovae kolesnikovae changed the title Perf/chunk iterator Optimize repeated row iterator Oct 24, 2023
@kolesnikovae kolesnikovae marked this pull request as ready for review October 25, 2023 04:56
@kolesnikovae kolesnikovae requested a review from a team as a code owner October 25, 2023 04:56
pkg/iter/tee.go Outdated Show resolved Hide resolved
// NumRows return the number of row in the page
// not counting skipped ones (because of SeekToRow).
// The implementation is quite expensive, therefore
// we should call it once per page.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100%

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stellar !!!! This is going to be perfect for the future.

LGTM

@cyriltovena cyriltovena merged commit 4e8439d into main Oct 25, 2023
16 checks passed
@cyriltovena cyriltovena deleted the perf/chunk_iterator branch October 25, 2023 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants