
Logql/parallel binop #5317

Merged: 7 commits into grafana:main on Feb 3, 2022
Conversation

@owen-d owen-d (Member) commented Feb 3, 2022

This PR does a few things:

  1. Runs both legs of a binary operation in parallel.
  2. Introduces a Clone() method for our Exprs so that we run into fewer mutability bugs.
  3. Uses (2) in our sharding code to prevent a mutability bug, in this case one which prevented binary operations from being sharded. This bug has existed since 2020 😭 and of course I'm the original author 😱
  4. Increases the Downstreamer concurrency. Now that we have the LimitedRoundTripper, this structure is used mainly to prevent goroutine explosions from malicious queries, so it feels safe to raise the limit since it no longer needs to gate access to our tenant queues.

Running this in one of our clusters resulted in sharded binary operations running ~10x faster 🎉
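For illustration of point (1), here is a minimal sketch of evaluating both legs of a binary operation concurrently with golang.org/x/sync/errgroup (the PR's final commit switches to errgroup). The Expr, Result, evaluate, and evalBinOp names below are placeholders for the example, not Loki's actual types or API:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Expr and Result are placeholders for Loki's real LogQL types; only the
// concurrency pattern is the point here.
type Expr string
type Result string

func evaluate(ctx context.Context, e Expr) (Result, error) {
	// ... dispatch the (possibly sharded) leg downstream ...
	return Result("result of " + string(e)), nil
}

// evalBinOp evaluates the left and right legs in parallel rather than
// sequentially; if either leg fails, the errgroup's derived context
// cancels the other.
func evalBinOp(ctx context.Context, op string, lhs, rhs Expr) (Result, Result, error) {
	var left, right Result
	g, ctx := errgroup.WithContext(ctx)

	g.Go(func() error {
		var err error
		left, err = evaluate(ctx, lhs)
		return err
	})
	g.Go(func() error {
		var err error
		right, err = evaluate(ctx, rhs)
		return err
	})

	err := g.Wait()
	return left, right, err
}

func main() {
	l, r, err := evalBinOp(context.Background(), "/",
		Expr(`sum(rate({app="a"}[1m]))`),
		Expr(`sum(rate({app="b"}[1m]))`))
	fmt.Println(l, r, err)
}
```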

@owen-d owen-d requested a review from a team as a code owner February 3, 2022 15:36
@kavirajk kavirajk (Contributor) left a comment

LGTM! Happy that all my ratio queries (exp1)/(exp2) are going to run faster now :)

I left one minor clarification question, though.

@@ -18,7 +18,7 @@ import (
 )

 const (
-	DefaultDownstreamConcurrency = 32
+	DefaultDownstreamConcurrency = 128
Contributor:

Q: I'm failing to understand why this was increased (even after reading the comment below on Downstreamer()).

Contributor:

Is there any way we can line it up with max_query_parallelism? Because 128 is half of our biggest tenant at 256, so that tenant will get limited by this.

Contributor:

Or maybe cap it higher?

owen-d (Member Author):

This was historically used to limit how many downstream queries (query-frontend -> querier) could be dispatched in parallel by a single time-split LogQL query. Now that this part is controlled by the LimitedRoundTripper instead, we only want to use the Downstreamer concurrency to prevent us from creating unbounded goroutines. Increasing the limit to 128 still seems reasonable to that effect but is also high enough to not pre-limit anything the LimitedRoundTripper would limit anyway. Basically, this is a crude attempt to prevent us from blowing up goroutines due to malicious queries without introducing a bottleneck in our query path.
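Not Loki's actual implementation, but a minimal illustration of the role the Downstreamer limit plays after this change: a fixed-size semaphore that caps how many downstream dispatches (and therefore goroutines) can be in flight at once, without otherwise throttling the query path. The Query, Result, dispatch, and runDownstream names are invented for the sketch:

```go
package main

import (
	"context"
	"sync"
)

// Crude upper bound on in-flight downstream queries per Downstreamer.
const defaultDownstreamConcurrency = 128

// Query and Result stand in for the real downstream request/response types.
type Query struct{}
type Result struct{}

func dispatch(ctx context.Context, q Query) Result { return Result{} }

// runDownstream fans the split/sharded queries out in parallel, but the
// buffered-channel semaphore keeps at most defaultDownstreamConcurrency
// goroutines alive, so a malicious query can't cause a goroutine explosion.
func runDownstream(ctx context.Context, queries []Query) []Result {
	var (
		wg      sync.WaitGroup
		sem     = make(chan struct{}, defaultDownstreamConcurrency)
		results = make([]Result, len(queries))
	)
	for i, q := range queries {
		wg.Add(1)
		sem <- struct{}{} // blocks once the cap is reached
		go func(i int, q Query) {
			defer wg.Done()
			defer func() { <-sem }()
			results[i] = dispatch(ctx, q)
		}(i, q)
	}
	wg.Wait()
	return results
}

func main() {
	_ = runDownstream(context.Background(), make([]Query, 512))
}
```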

Contributor:

Thanks for clarifying, Owen. 👍

Two things:

  1. Can we add the following as a comment to the const itself? "Increasing the limit to 128 still seems reasonable to that effect but is also high enough to not pre-limit anything the LimitedRoundTripper would limit anyway." (A sketch of what that might look like follows below.)

  2. Also, +1 to making this value the same as max_query_parallelism, because a max_query_parallelism of 256 will happily be allowed by the LimitedRoundTripper but can still get limited by this Downstreamer.
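A sketch of what such a doc comment might look like; the wording and the package name are illustrative only, not what actually landed in the PR:

```go
// Package name is assumed for illustration.
package queryrange

// DefaultDownstreamConcurrency is a crude bound on goroutine creation per
// Downstreamer. It no longer gates access to tenant queues (the
// LimitedRoundTripper handles that); 128 is simply high enough not to
// pre-limit anything the LimitedRoundTripper would allow anyway.
const DefaultDownstreamConcurrency = 128
```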

@owen-d owen-d (Member Author) commented Feb 3, 2022

> Is there any way we can line it up with max_query_parallelism? Because 128 is half of our biggest tenant at 256, so that tenant will get limited by this.

Yes, we could, but this is also post-split code, meaning each split would need to schedule more than 128 queries before this limits it. We may ultimately want to thread it into the MaxQueryParallelism code, but that felt like overengineering for the moment and could be done in another PR if needed.

Contributor:

Sounds good!

pkg/logql/evaluator.go: review thread (outdated, resolved)
@owen-d owen-d merged commit 2af3ca0 into grafana:main Feb 3, 2022
KMiller-Grafana pushed a commit to KMiller-Grafana/loki that referenced this pull request Feb 4, 2022
* adds justification for keeping Downstreamer parallelism

* loads binop legs in parallel

* increases downstreamer default concurrency

* astmapper spanlogger

* always clone expr during mapping to prevent mutability bugs

* Revert "astmapper spanlogger"

This reverts commit 23f6b55.

* cleanup + use errgroup
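The "always clone expr during mapping to prevent mutability bugs" commit above is the fix for point (3) in the description. Here is a toy illustration of the failure mode that cloning avoids; the types are invented for the example and are not Loki's real AST:

```go
package main

import "fmt"

// BinOpExpr is a toy stand-in for a LogQL binary-operation node whose legs
// a sharding mapper rewrites.
type BinOpExpr struct {
	Op       string
	LHS, RHS string
}

// Clone returns a copy so that mapping one leg can't mutate an expression
// that other code still holds a pointer to.
func (e *BinOpExpr) Clone() *BinOpExpr {
	c := *e
	return &c
}

// shardLeg rewrites a leg in place; without cloning first, the caller's
// original expression would be silently modified too.
func shardLeg(e *BinOpExpr, shards int) *BinOpExpr {
	e.LHS = fmt.Sprintf("downstream<%s, shards=%d>", e.LHS, shards)
	return e
}

func main() {
	orig := &BinOpExpr{
		Op:  "/",
		LHS: `sum(rate({app="a"}[1m]))`,
		RHS: `sum(rate({app="b"}[1m]))`,
	}

	mapped := shardLeg(orig.Clone(), 4) // mapping operates on a copy
	fmt.Println("mapped:  ", mapped.LHS)
	fmt.Println("original:", orig.LHS) // unchanged, because we cloned
}
```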