-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved exception handling in case of query timeouts #10464
Conversation
Awesome, this is something my team's users have been asking for. I'm going to review code and test locally today 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good to go in my opinion. Can you create an issue for adding a new metric specifically for timeouts and reference this PR so someone can pick that work up?
I pulled down your branch and have ran it locally to test out the functionality and everything looks good with the new 504 response.
No complaints about the code in general. The one gotcha could be if we missed a spot in the code where a timeout is handled, but I have not been able to find such a place. And even if it were missed, there should not be breakage, just undesirable behavior that requires a follow on patch.
All in all, I'm comfortable with this going in. Not sure if we want a second opinion or not, so I will not merge until, let's say, early next week; Unless someone comes in and requests changes.
@capistrant Thanks for taking a look.
I already have the changes for this which is being tested internally. I'll raise a follow-up PR with the metric addition once it is ready. |
This PR has been approved for about 4 days with no further feedback so I am going to go ahead and merge. Thank you for this patch @a2l007 ... I know that my team is going to really enjoy this new API response dedicated to time outs. |
Hi @a2l007 and @capistrant, thanks for the nice PR and review! We usually treat silence as acceptance, so thanks for merging the PR too. However, for this particular PR, I think it needs to get "Design Review" which needs 3 +1s from committers (including the author himself) because it changes some user-facing behavior (https://github.com/apache/druid/blob/master/dev/committer-instructions.md). We have this policy because it's hard to undo once the changed user-facing behavior appears in some release. When we do design review, we usually have one committer reviewing the whole PR line by line, so that others can focus on only the design part. From this aspect, I quickly reviewed only the design part. I have 3 comments so far.
|
Apologies for the pre-mature merge. I will keep this in mind in future to make sure I loop in more committers during review process for something client facing like this.
Good point. How does something like this work for release notes? Does the patch author write up release notes for their change that the release manager will use later on?
Could you add more to your thoughts that 408 may be a better error code? I thought that 504 sounded ideal since the Broker API is kind of a proxy to the druid query engine. The engine did not produce a response in time for the Broker to return to client so it responds with 504. I also worry about the implications of using a 408 code in relation to the potential for middleware auto-retrying the request when the end client may not want that (something like a browser doing auto retries N times before actually surfacing that 408 to client if it continues). And I've always had the perception of 5XX being a server error and 4XX being more of a client error, like the client made a mistake. But in this case the server isn't coming up with a response in time.
shoot I had a feeling there was a doc like this out there, @a2l007 do you want to start a follow on PR for this? I suppose it may be a part of a larger follow on with whatever else we uncover in discussions with Jihoon. |
@jihoonson and @capistrant Thank you for the feedback.
Concurring with @capistrant 's response, I feel that categorizing this as a server gateway timeout(https://tools.ietf.org/html/rfc2068#section-10.5.5 )makes more sense than returning a 408 (https://tools.ietf.org/html/rfc2068#section-10.4.9) here. I'm interested on hearing your opinion on this. I can start a follow up PR once the conversation here is resolved. |
We usually ask authors to add some summary and major changes in the PR description so that the release manager can easily pick up and add them in the release notes.
Good point. The reason I mentioned 408 is that the broker does not always act as a gateway, but also performs query execution especially when you have subqueries on top of non-groupBy or multiple joins in the query. I think these queries cause timeout more often because processing subqueries or multiple joins are not parallelizable for now. But I do see why you chose 504 instead of 408 which also seems reasonable. |
FYI, I have modified the PR description to add a more release-note friendly description. Raised a PR for the documentation fix as well. |
* Separate timeout exceptions * Add more tests Co-authored-by: Atul Mohan <atulmohan@yahoo-inc.com>
This PR separates out query timeout exceptions and avoids it from being thrown as a
QueryInterruptedException
. Timeout exceptions are now handled using a newQueryTimeoutException
. Timed out queries will now return HTTP 504 status code instead of the default HTTP 500 status. This makes it easier for clients to identify timeout exceptions and handle it accordingly.To preserve the query metrics behavior, timeout exceptions are still being clubbed under
query/interrupted/count
, but I feelthere should be a separate metric for query timeouts which can be done in a separate PR.
Fixes #10028
Upgrade Notes:
Cluster operators please note that with this patch, queries that have failed as a result of timeouts will return HTTP 504 status code instead of HTTP 500 status code. If there are any configured workflows that depend on HTTP 500 for timeout queries, please adjust them accordingly.
This PR has:
Key changed/added classes in this PR
QueryTimeoutException