[bug] query-fronted: concurrent query has a bottleneck #5378

Closed
liguozhong opened this issue Feb 12, 2022 · 6 comments
Labels
stale A stale issue or PR that will automatically be closed.

Comments

@liguozhong
Contributor

liguozhong commented Feb 12, 2022

A query over 7 days (2 TB) takes 89s.
image

Bug: the "interval" spans have different start times.
image

tracing top
image

No CPU throttling:
image

@liguozhong
Contributor Author

liguozhong commented Feb 12, 2022

Is this 89s duration caused by the query task load being unbalanced across the queriers?

I would expect the query-scheduler's distribution algorithm to use a more balanced implementation, for example:

[1, 1, 1, 1, 1]
->
[2, 2, 1, 1, 1]
->
[2, 2, 2, 1, 1]

With CPU limit:
image
Without CPU limit:
image
The query-frontend has 10000+ goroutines (query tasks):
image

@liguozhong
Contributor Author

Loki deployment (v2): querier path with the scheduler
image

@cyriltovena
Contributor

The difference in the starting span could be max_query_parallelism

@liguozhong
Contributor Author

liguozhong commented Feb 13, 2022

Thanks, @cyriltovena!
Tuning max_query_parallelism is effective: the query time drops to 64s~74s (fetching chunks from Redis), but the results differ between test runs.
We guess this is related to load imbalance across the queriers.

Is it reasonable to ask for the queriers to be fed by a more balanced distribution algorithm?

Our cluster: 2 TB / 64s ≈ 33 GB/s
That is still well short of the top-level figure of 100 GB/s, which we hope to reach.
image
#5062

fetch chunk from redis

image
image
image
image
image
image
image
image
image
image
image

querier cpu memory metrics
image

query-scheduler cpu memory metrics
image

fetch chunk from s3
image

@cyriltovena
Contributor

I was mostly referring to the gaps you have between the interval spans; those appear because you reach max_query_parallelism.

For the unbalanced question, can you elaborate?

@stale

stale bot commented Apr 17, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Apr 17, 2022
@stale stale bot closed this as completed Apr 28, 2022