Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved exception handling in case of query timeouts #10464

Merged
merged 2 commits into from
Nov 3, 2020

Conversation

a2l007
Copy link
Contributor

@a2l007 a2l007 commented Oct 2, 2020

This PR separates out query timeout exceptions and avoids it from being thrown as a QueryInterruptedException. Timeout exceptions are now handled using a new QueryTimeoutException. Timed out queries will now return HTTP 504 status code instead of the default HTTP 500 status. This makes it easier for clients to identify timeout exceptions and handle it accordingly.
To preserve the query metrics behavior, timeout exceptions are still being clubbed under query/interrupted/count, but I feel
there should be a separate metric for query timeouts which can be done in a separate PR.

Fixes #10028
Upgrade Notes:
Cluster operators please note that with this patch, queries that have failed as a result of timeouts will return HTTP 504 status code instead of HTTP 500 status code. If there are any configured workflows that depend on HTTP 500 for timeout queries, please adjust them accordingly.


This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • QueryTimeoutException

@capistrant
Copy link
Contributor

Awesome, this is something my team's users have been asking for. I'm going to review code and test locally today 👍

Copy link
Contributor

@capistrant capistrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems good to go in my opinion. Can you create an issue for adding a new metric specifically for timeouts and reference this PR so someone can pick that work up?

I pulled down your branch and have ran it locally to test out the functionality and everything looks good with the new 504 response.

No complaints about the code in general. The one gotcha could be if we missed a spot in the code where a timeout is handled, but I have not been able to find such a place. And even if it were missed, there should not be breakage, just undesirable behavior that requires a follow on patch.

All in all, I'm comfortable with this going in. Not sure if we want a second opinion or not, so I will not merge until, let's say, early next week; Unless someone comes in and requests changes.

@a2l007
Copy link
Contributor Author

a2l007 commented Oct 30, 2020

@capistrant Thanks for taking a look.

This seems good to go in my opinion. Can you create an issue for adding a new metric specifically for timeouts and reference this PR so someone can pick that work up?

I already have the changes for this which is being tested internally. I'll raise a follow-up PR with the metric addition once it is ready.

@capistrant
Copy link
Contributor

This PR has been approved for about 4 days with no further feedback so I am going to go ahead and merge. Thank you for this patch @a2l007 ... I know that my team is going to really enjoy this new API response dedicated to time outs.

@capistrant capistrant merged commit 6ccdded into apache:master Nov 3, 2020
@jihoonson
Copy link
Contributor

jihoonson commented Nov 3, 2020

Hi @a2l007 and @capistrant, thanks for the nice PR and review! We usually treat silence as acceptance, so thanks for merging the PR too. However, for this particular PR, I think it needs to get "Design Review" which needs 3 +1s from committers (including the author himself) because it changes some user-facing behavior (https://github.com/apache/druid/blob/master/dev/committer-instructions.md). We have this policy because it's hard to undo once the changed user-facing behavior appears in some release. When we do design review, we usually have one committer reviewing the whole PR line by line, so that others can focus on only the design part. From this aspect, I quickly reviewed only the design part. I have 3 comments so far.

  • The purpose of the PR sounds good to me because the timeout error reporting was inconsistent before. However this change is incompatible which means it can break user applications working based on the previous behavior. Someone can argue that no one could do it because the previous behavior was inconsistent. IMO, this change looks worth since the previous timeout error handling was easy to break, but should be called out in the release notes.
  • Looking at the code, the status code is defined as 504 which means gateway timeout. I'm wondering 408 request timeout is more appropriate in this case.
  • All query errors should be documented. Check out https://druid.apache.org/docs/latest/querying/querying.html#query-errors.

@capistrant
Copy link
Contributor

capistrant commented Nov 3, 2020

Hi @a2l007 and @capistrant, thanks for the nice PR and review! We usually treat silence as acceptance, so thanks for merging the PR too. However, for this particular PR, I think it needs to get "Design Review" which needs 3 +1s from committers (including the author himself) because it changes some user-facing behavior (https://github.com/apache/druid/blob/master/dev/committer-instructions.md). We have this policy because it's hard to undo once the changed user-facing behavior appears in some release.

Apologies for the pre-mature merge. I will keep this in mind in future to make sure I loop in more committers during review process for something client facing like this.

When we do design review, we usually have one committer reviewing the whole PR line by line, so that others can focus on only the design part. From this aspect, I quickly reviewed only the design part. I have 3 comments so far.

  • The purpose of the PR sounds good to me because the timeout error reporting was inconsistent before. However this change is incompatible which means it can break user applications working based on the previous behavior. Someone can argue that no one could do it because the previous behavior was inconsistent. IMO, this change looks worth since the previous timeout error handling was easy to break, but should be called out in the release notes.

Good point. How does something like this work for release notes? Does the patch author write up release notes for their change that the release manager will use later on?

Could you add more to your thoughts that 408 may be a better error code? I thought that 504 sounded ideal since the Broker API is kind of a proxy to the druid query engine. The engine did not produce a response in time for the Broker to return to client so it responds with 504. I also worry about the implications of using a 408 code in relation to the potential for middleware auto-retrying the request when the end client may not want that (something like a browser doing auto retries N times before actually surfacing that 408 to client if it continues). And I've always had the perception of 5XX being a server error and 4XX being more of a client error, like the client made a mistake. But in this case the server isn't coming up with a response in time.

shoot I had a feeling there was a doc like this out there, @a2l007 do you want to start a follow on PR for this? I suppose it may be a part of a larger follow on with whatever else we uncover in discussions with Jihoon.

@a2l007
Copy link
Contributor Author

a2l007 commented Nov 3, 2020

@jihoonson and @capistrant Thank you for the feedback.

Looking at the code, the status code is defined as 504 which means gateway timeout. I'm wondering 408 request timeout is more appropriate in this case.

Concurring with @capistrant 's response, I feel that categorizing this as a server gateway timeout(https://tools.ietf.org/html/rfc2068#section-10.5.5 )makes more sense than returning a 408 (https://tools.ietf.org/html/rfc2068#section-10.4.9) here. I'm interested on hearing your opinion on this.

I can start a follow up PR once the conversation here is resolved.

@jihoonson
Copy link
Contributor

Good point. How does something like this work for release notes? Does the patch author write up release notes for their change that the release manager will use later on?

We usually ask authors to add some summary and major changes in the PR description so that the release manager can easily pick up and add them in the release notes.

Could you add more to your thoughts that 408 may be a better error code? I thought that 504 sounded ideal since the Broker API is kind of a proxy to the druid query engine. The engine did not produce a response in time for the Broker to return to client so it responds with 504. I also worry about the implications of using a 408 code in relation to the potential for middleware auto-retrying the request when the end client may not want that (something like a browser doing auto retries N times before actually surfacing that 408 to client if it continues). And I've always had the perception of 5XX being a server error and 4XX being more of a client error, like the client made a mistake. But in this case the server isn't coming up with a response in time.

Good point. The reason I mentioned 408 is that the broker does not always act as a gateway, but also performs query execution especially when you have subqueries on top of non-groupBy or multiple joins in the query. I think these queries cause timeout more often because processing subqueries or multiple joins are not parallelizable for now. But I do see why you chose 504 instead of 408 which also seems reasonable.

@a2l007
Copy link
Contributor Author

a2l007 commented Nov 4, 2020

FYI, I have modified the PR description to add a more release-note friendly description. Raised a PR for the documentation fix as well.

@a2l007 a2l007 mentioned this pull request Nov 9, 2020
4 tasks
@jihoonson jihoonson added this to the 0.21.0 milestone Jan 4, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
* Separate timeout exceptions

* Add more tests

Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Return consistent HTTP 504 error responses for query timeouts
3 participants