Adapt queryTimeout to be a per-tenant configuration #6835
Conversation
- Adds context timeout to metadata calls based on the new per-tenant query timeout
- Adds span logs to calls missing them
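To make the first bullet concrete, here is a minimal, self-contained sketch of attaching a per-tenant timeout to a call's context. This is not the PR's actual code: the limits lookup and the metadata call are illustrative stand-ins.

package main

import (
	"context"
	"fmt"
	"time"
)

// queryTimeoutFor stands in for the per-tenant limits lookup; in the real
// code this value would come from the tenant overrides.
func queryTimeoutFor(tenantID string) time.Duration {
	if tenantID == "big-tenant" {
		return 5 * time.Minute
	}
	return 30 * time.Second
}

// fetchSeries stands in for a metadata call (labels, series, index stats)
// that respects context cancellation.
func fetchSeries(ctx context.Context) error {
	select {
	case <-time.After(100 * time.Millisecond): // simulated work
		return nil
	case <-ctx.Done():
		return ctx.Err() // interrupted once the tenant's timeout elapses
	}
}

func main() {
	// Attach the tenant's timeout to the context before making the call.
	ctx, cancel := context.WithTimeout(context.Background(), queryTimeoutFor("tenant-a"))
	defer cancel()
	fmt.Println(fetchSeries(ctx))
}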
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
- querier            -2%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0%
+ loki                0%
- Use per-tenant query timeout everywhere instead of the old engine timeout configuration

Force-pushed from 8da2d32 to 80c4f15 (Compare)
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
- querier            -2%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0%
+ loki                0%
Still missing updating the docs and testing a few things, but I'm moving it to ready in case someone is willing to review it / help me find problems with my approach here 😄
pkg/logql/engine.go
Outdated

@@ -226,7 +219,14 @@ func (q *query) Exec(ctx context.Context) (logqlmodel.Result, error) {
}

func (q *query) Eval(ctx context.Context) (promql_parser.Value, error) {
	ctx, cancel := context.WithTimeout(ctx, q.timeout)
	queryTimeout := time.Minute * 2
I think it'd be better to move this to a package-level const (e.g. DefaultQueryTimeout)? And a question: what's the rationale behind 2 minutes as the default? (Previously the query engine was using 5 * time.Minute as the default timeout. Is this to standardize it? Then we may need to add it to the upgrade guide as well, I think.)
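For reference, the suggested package-level constant might look like this. This is a sketch of the reviewer's suggestion, not the final code; the package placement and value mirror the discussion above.

package logql

import "time"

// DefaultQueryTimeout is the fallback applied when no per-tenant query
// timeout has been configured.
const DefaultQueryTimeout = 2 * time.Minute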
Hmm, where did you find that the engine had a 5-minute default timeout?
pkg/querier/http.go
Outdated

@@ -376,14 +408,23 @@ func (q *QuerierAPI) TailHandler(w http.ResponseWriter, r *http.Request) {
// SeriesHandler returns the list of time series that match a certain label set.
// See https://prometheus.io/docs/prometheus/latest/querying/api/#finding-series-by-label-matchers
func (q *QuerierAPI) SeriesHandler(w http.ResponseWriter, r *http.Request) {
	log, ctx := spanlogger.New(r.Context(), "query.Series")
Seeing that this is repeated for almost all the handlers, should we make it a middleware?
Hmm, I think it shouldn't be, as some handlers won't want this (e.g. the tail handler).
I definitely agree about the middleware; we can apply the middleware selectively, which is already done in pkg/loki/modules.go:
queryHandlers := map[string]http.Handler{
"/loki/api/v1/query_range": httpMiddleware.Wrap(http.HandlerFunc(t.querierAPI.RangeQueryHandler)),
"/loki/api/v1/query": httpMiddleware.Wrap(http.HandlerFunc(t.querierAPI.InstantQueryHandler)),
"/loki/api/v1/label": http.HandlerFunc(t.querierAPI.LabelHandler),
"/loki/api/v1/labels": http.HandlerFunc(t.querierAPI.LabelHandler),
"/loki/api/v1/label/{name}/values": http.HandlerFunc(t.querierAPI.LabelHandler),
"/loki/api/v1/series": http.HandlerFunc(t.querierAPI.SeriesHandler),
"/loki/api/v1/index/stats": http.HandlerFunc(t.querierAPI.IndexStatsHandler),
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
- querier            -1.9%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0%
+ loki                0%
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
- querier            -0.2%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0%
+ loki                0%
Use timeout variable instead of hardcoded value.

Co-authored-by: Danny Kopping <dannykopping@gmail.com>
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
- querier            -0.2%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0%
+ loki                0%
- Interrupt requests when the tenantID couldn't be fetched.
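A rough sketch of what "interrupting" means here, assuming the dskit tenant helper and an illustrative handler body rather than the PR's exact code:

package querier

import (
	"net/http"

	"github.com/grafana/dskit/tenant"
)

// QuerierAPI is an illustrative stub for the sketch below.
type QuerierAPI struct{}

// SeriesHandler aborts early when the tenant ID can't be resolved, since
// per-tenant limits can't be applied without it.
func (q *QuerierAPI) SeriesHandler(w http.ResponseWriter, r *http.Request) {
	if _, err := tenant.TenantID(r.Context()); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// ... proceed with the series lookup using the tenant's limits.
}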
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
- querier            -0.4%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0%
+ loki                0%
LGTM!
Improve deprecation notice.

Co-authored-by: Danny Kopping <dannykopping@gmail.com>
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
- querier            -0.4%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0%
+ loki                0%
What this PR does / why we need it:

On PR #6835 I mistakenly removed the querier:query_timeout YAML configuration. This PR brings it back and uses it if it is still configured, while logging a warning message to alert users of the upcoming change.
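The compatibility behavior described above might reduce to something like the following; the field and function names are assumptions for illustration, not the PR's exact code.

package loki

import (
	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
)

// effectiveQueryTimeout honors the deprecated global querier timeout if it
// is still set, logging a warning, and otherwise uses the per-tenant value.
func effectiveQueryTimeout(deprecatedGlobal, perTenant time.Duration, logger log.Logger) time.Duration {
	if deprecatedGlobal != 0 {
		level.Warn(logger).Log("msg", "querier:query_timeout is deprecated; configure the per-tenant query timeout in limits_config instead")
		return deprecatedGlobal
	}
	return perTenant
}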
What this PR does / why we need it:

* Add a new per-tenant query timeout
* Add new middleware to query calls; this middleware will time out requests based on the new per-tenant query timeout
* Add span logs to query calls missing them
* Deprecate the `engine.timeout` configuration

The motivation for changing this configuration to be per-tenant instead of global is that it makes sense to relax the timeout for particular tenants or make it stricter for others. This is especially useful when a Loki client doesn't handle context cancellations correctly: since Loki would otherwise keep processing expensive queries that were already canceled, per-tenant timeouts allow small values for most tenants, mitigating unnecessary work, without imposing a timeout that is too short for important tenants.
Which issue(s) this PR fixes:
N/A