
Add a hard time limit for the entire search request #30897

Closed
fantapsody opened this issue May 28, 2018 · 11 comments
Labels
>enhancement · :Search/Search (Search-related issues that do not fall into other categories) · Team:Search (Meta label for search team)

Comments

@fantapsody

I wonder if there is any way to limit the service time of the entire search request, respond to the client with a timeout error, and cancel all remaining shard-level search requests automatically. I have done some research on the official documentation, issues, discussions, and even the source code, and found that the search timeout parameter only limits the execution of the search on each shard, not the entire service time of the search request (time in the queue, time parsing the query, time rendering the response, etc.).

This feature is extremely important for online services using Elasticsearch as their data engine. Imagine an online service serving tens of thousands of requests per second, where each request makes a search request to the Elasticsearch cluster. The service request comes with a timeout, say 5 seconds, and if fulfilling the request takes longer than that, another request is made as a retry. In this case, even if only 1% of search requests run longer than the timeout (say, 10 seconds), it can cause serious trouble for the whole service cluster. If we limit the number of outstanding search requests on each application server, the search request pool may quickly be saturated by retries, since the slow requests never succeed and their outstanding search requests take a long time to release, even though their results are useless. If we don't limit the number of outstanding search requests on each application server, heavy search requests may accumulate in the Elasticsearch cluster, which can impact normal search requests or even bring down Elasticsearch nodes. In fact, this is what happened to our cluster. I think this can be improved by a hard search time limit mechanism that releases all resources of the search contexts after the timeout.

So, do you have any plans for this, a hard time limit for the search request? Thanks!

@cbuescher
Member

@fantapsody thanks for raising this point, it makes a lot of sense to me. However, it looks very similar to an ongoing discussion about default search timeouts that we are already having in issues like #26308. Can you take a look at that discussion and see if it matches your needs? If so, I would suggest closing this issue as a duplicate, since the topic is already tracked.

@cbuescher cbuescher added the :Search/Search label May 28, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@fantapsody
Author

@cbuescher thanks for your reply. I have read the original and related issues (#26238, #26258, #25776) you mentioned above, and briefly read the patch in #25776. It seems to me that the timeout discussed in those topics is the execution time on each shard, while the time limit I am suggesting is the total service time of the entire search request.

For example, if the search queue of a node has hundreds of pending requests, it may take tens of seconds before the search executes on that node. Even if the search execution takes only 1ms on the shard, it may still take tens of seconds before the client receives the response. As a result, the search timeout seems meaningless to users, since it makes no promise about when the response will return.

@cbuescher
Member

@fantapsody thanks for checking, I will leave this open and discuss it further with the team then.

@jpountz
Contributor

jpountz commented Jul 2, 2018

@simonw suggested that taking queue time into account should be easy by computing the current time when receiving the shard-level request.
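
A minimal sketch of that idea, assuming the coordinator's configured timeout is passed along with the shard-level request and the shard charges its queue time against it (all names here are illustrative, not Elasticsearch internals):

```java
import java.util.concurrent.TimeUnit;

class ShardRequestTiming {
    // Recorded when the shard-level request is received, before it is queued.
    private final long receivedNanos = System.nanoTime();

    // Remaining budget once the request is dequeued for execution:
    // time spent waiting in the search queue counts against the timeout.
    long remainingTimeoutMillis(long configuredTimeoutMillis) {
        long queuedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - receivedNanos);
        return Math.max(0L, configuredTimeoutMillis - queuedMillis);
    }
}
```

If the remaining budget reaches zero, the shard can fail fast with a timeout instead of executing a query whose result the coordinator no longer wants.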

@jpountz jpountz added the help wanted and adoptme labels and removed the team-discuss label Jul 2, 2018
@fantapsody
Author

Hi @jpountz @cbuescher, after having been bothered by this problem for quite a while, we decided to improve the query timeout mechanism ourselves. Since deploying our solution to our production environment, it has worked quite well. So I would like to share our work and experience with you, and discuss whether it could be a general, long-term solution to the problem.

The key change in our solution is that the timer for a search starts when the request first reaches the coordinating node. It therefore covers many more phases than just the shard queries, possibly including pre-filter, dfs, shard queries (including queue time), fetch, reduce, etc. It also includes the time spent transmitting messages between nodes. We believe this provides more meaningful search timeout semantics for users.

The mechanism is designed on a best-effort basis: the timeout is guaranteed not to trigger before the given timeout value has elapsed, but the actual trigger can be delayed significantly by various factors such as GC, the network, etc.

The key ideas of the implementation are:

  • Extend CancellableTask into an AutoCancellableTask, which can schedule a runnable to cancel the task after the timeout (see the sketch after this list).
  • Add checks for task cancellation at key points in the control flow of the search. The existing cancellation checks during shard queries are fully reused, especially the low-level cancellation.
  • The timeout propagates from the parent task to the shard tasks. For example, if the total timeout of the search is 3s and 0.5s elapses before a shard request is sent, then the timeout for that shard request is 2.5s.
  • Only relative time is used, to prevent uncertainty from clock skew.
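
As a rough illustration of the first two points, here is a minimal sketch of the AutoCancellableTask idea. The scheduler usage is standard JDK; everything else is an assumption for illustration, not Elasticsearch's actual CancellableTask code:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class AutoCancellableTask {
    private volatile boolean cancelled = false;
    private volatile String reason;
    private ScheduledFuture<?> timeoutHandle;

    // Schedule cancellation relative to "now": only relative time is used,
    // so clock skew between nodes cannot cause spurious timeouts.
    void startTimeout(ScheduledExecutorService scheduler, long timeoutMillis) {
        timeoutHandle = scheduler.schedule(
            () -> cancel("search timed out after " + timeoutMillis + "ms"),
            timeoutMillis, TimeUnit.MILLISECONDS);
    }

    void cancel(String reason) {
        this.reason = reason;
        this.cancelled = true;
        // The real implementation would also propagate cancellation to
        // child (shard-level) tasks, each carrying the remaining budget.
    }

    // Invoked at key points in the search control flow.
    void ensureNotCancelled() {
        if (cancelled) {
            throw new IllegalStateException("task cancelled: " + reason);
        }
    }

    // Cancel the pending timer when the search completes normally.
    void onCompletion() {
        if (timeoutHandle != null) {
            timeoutHandle.cancel(false);
        }
    }
}
```

A ScheduledThreadPoolExecutor keeps its pending timers in a binary heap, which is where the O(log n) add/remove cost mentioned below comes from.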

Compared with the current query timeout implementation from the community, requests do not accumulate in queues under our solution, the cluster has a better chance of recovering from a burst of anomalously slow requests, and users experience clearer search timeout behaviour.

Of course, the timer registered by each search task introduces extra cost. However, since the number of search requests per second is at most a few thousand, and the time complexity of adding or removing a task in the scheduler is O(log n), the impact on performance is negligible.

Nevertheless, users with stricter demands on timeout accuracy, such as services with a timeout on each RPC request, may have to release the query context on the client side before a successful or timed-out search response returns. In that case, the server-side timeout mechanism is basically used as a way to release server-side search contexts asynchronously.

@fantapsody
Author

Hi @jpountz @cbuescher @jimczi,
I hope you have had a chance to read what I posted above; I would like to know your opinions on it. Since several issues about search timeout & cancellation have been open for a long time, I think it would be great to see an official feature with good support for this: from my experience and observations in production environments, task timeout & cancellation is a really important capability for massive online query workloads.

@cbuescher
Member

Thanks @fantapsody for the detailed description of your solution. We agree that search timeout & cancellation is an important topic with some ongoing discussions.
Adding some thoughts from #26308 before closing it in favour of continuing the discussion here:

@javanna
Member

javanna commented May 12, 2020

From #54056:

Currently the timeout is only applied at the shard level. Shards are expected to wrap up ongoing work as soon as they hit the timeout and return a partial response. This is usually good enough, unless shard requests spend time in the search queue. For instance if a timeout of 1s is configured but the shard search request waits for 10s in the search queue, then Elasticsearch wouldn't return a response in less than 10s despite the timeout.

This hasn't been much of an issue until now, but I believe that the introduction of slower options like searchable snapshots and schema on read is going to make this issue worse, as it'll make it more likely to have shard requests waiting in the queue.

@lizozom

lizozom commented Sep 2, 2020

We at the @elastic/kibana-app-arch team are working on adding a dedicated search timeout to Kibana.

However, due to performance improvements in Elasticsearch, we are going to set that timeout to quite a high value in 7.10 (TBD), and support disabling it altogether (which will be the default starting with 7.11).

@javanna javanna added the >enhancement label and removed the help wanted and adoptme labels Jul 8, 2022
@javanna
Member

javanna commented Feb 10, 2023

Several of our APIs, including the search API, now support automatic cancellation on connection close. That means the client can set a timeout on its request and close the connection when it expires. Elasticsearch reacts by cancelling all the ongoing shard requests caused by the cancelled request. This achieves the original goal of this issue, which was to kill a search request if it takes longer than a certain timeout.
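
For illustration, a minimal sketch of such a client-side hard limit using the JDK's built-in HTTP client; the host, index name, and query body are placeholders, and the server-side cancellation relies on the connection actually being closed when the timeout fires:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class SearchWithClientTimeout {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/my-index/_search"))
            .timeout(Duration.ofSeconds(5)) // hard client-side limit
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(
                "{\"query\":{\"match_all\":{}}}"))
            .build();
        try {
            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        } catch (HttpTimeoutException e) {
            // Abandoning the exchange closes the connection; Elasticsearch
            // detects the close and cancels the ongoing shard requests.
            System.err.println("search abandoned after 5s: " + e.getMessage());
        }
    }
}
```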
