[Proposal] Query Profiling #539
Having this as an external cluster or self-contained node would create additional maintenance overhead when managing the cluster (esp. if it's not managed with a k8s operator, i.e. either run on bare metal or on k8s but w/o an operator). Thus I'd vote for this being part of the nodes (be that as a built-in feature or a plugin). Storing this in OpenSearch vs. exposing it to Prometheus comes down to whether you want run-time profiling (then it'd probably be something for Prometheus, but that'd require people to actually have Prometheus, which you do in a big setup, esp. with k8s) or performance metrics of a single query you're currently working on.
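If the Prometheus route were taken, a node could expose per-context query metrics in the Prometheus text exposition format. A minimal stdlib-only sketch (the metric name `opensearch_query_latency_seconds_sum` and the `context` label are hypothetical, not an existing exporter):

```python
def render_metrics(samples: dict) -> str:
    """Render per-context latency sums in Prometheus text exposition format.

    `samples` maps a query context (e.g. a dashboard id) to a list of
    observed latencies in seconds. Names here are illustrative only.
    """
    lines = [
        "# HELP opensearch_query_latency_seconds_sum Total query latency per context.",
        "# TYPE opensearch_query_latency_seconds_sum counter",
    ]
    for context, latencies in sorted(samples.items()):
        lines.append(
            f'opensearch_query_latency_seconds_sum{{context="{context}"}} {sum(latencies)}'
        )
    return "\n".join(lines) + "\n"

print(render_metrics({"dashboard-98765": [0.25, 0.5], "api": [0.125]}))
```

A real implementation would more likely register histograms with an exporter library rather than format the text by hand; this only illustrates the labeled-by-context shape of the data.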
@rursprung I agree with your point regarding the extra node idea proposed. I was thinking this could simplify the architecture by separation of concerns, but as you correctly point out:
Initially, I had been aiming for run-time profiling, aggregating data by … WDYT about exposing the metrics this way?
A couple of questions/suggestions:
Would it be better to have this as a query param? E.g., there is already a
num-shards, latency, doc-counts, and frequency will provide a nice high-level view. Will this include time spent on queues and shard-level latency? Is the use case here more analytics of the kind of queries running on the cluster, or is the end goal to leverage this for query planning? If it is the latter, then is the long-term plan here to add more granular metrics (e.g. memory footprint), and should this proposal include granular metrics? If it's the former, then I would prefer something more generic, like a netty interceptor that could be used to track not just search but all requests, like bulk.
As stated at the top of the doc in the link you provided:
So the idea is to have the context of the query added to the query request (perhaps as a query param, but that is already more of an implementation detail)
Is that something you would like monitored per query context provided?
The first couple of phases are adding the basic metrics required to track queries by context and not by their actual query. So it won't be for query planning or for analytics on the kind of queries, but rather analytics on the query contexts running in the cluster. For example, you could say that a query context is the origin of the request, and then scrape the metrics and see how the queries marked with that context performed. An origin could be a dashboard and its id, or a direct API call... this way, if the cluster is really slow and you see that there is a massive spike in queries from a certain dashboard, you can act on this information in some way, and then fix that dashboard. Later phases focus on the ability to derive part of the context from the query itself (how to categorize them is an open question here).
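As a sketch of what a context-labeled request could look like on the client side, the existing Search DSL `stats` field can carry the label (the context value `dashboard-98765` is a hypothetical example; whether `stats` ends up being the actual carrier is an implementation detail still under discussion):

```python
import json

def build_search_body(query: dict, context: str) -> str:
    """Build a search request body that carries an origin context.

    Uses the Search DSL's existing top-level `stats` field as the carrier;
    the context value (e.g. a dashboard id) is illustrative.
    """
    body = {"query": query, "stats": [context]}
    return json.dumps(body)

# A match_all query tagged with a hypothetical dashboard origin:
print(build_search_body({"match_all": {}}, "dashboard-98765"))
```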
Thanks for the examples, now I understand the use case better. So the target audience for this feature is primarily multi-tenant/multi-user clusters where queries come in from different dashboards/tools, and this feature is to allow labeling queries/tracking the source referrer 'context', and in the first phase collect shard counts and overall latency per 'context'. My guess is this is to give insights into which dashboards/referrers are popular/slow? Is it more apt to update the issue name to something like 'query source analytics'?
The reason I asked about time spent on queues (at a shard level) and other finer metrics is that e2e time could show up as slow but might actually be due to other queries, as currently search queues are FCFS.
This is a pain point for these audiences. However, labeling the searches performed by the alerting system is also a good idea, since this will allow you to see what alerts are causing strain on the cluster. But generally speaking -- yes, this is mostly useful for monitoring multi-tenant/multi-user clusters.
I'm fine with that idea, @jkowall WDYT? ☝️
Thanks for this input, I'll mention this in the phase breakdown 👍
@AmiStrn I think the idea is to make some additional issues off this master issue as we break things up. The capability to analyze the source of a query is merely one way the data is to be used. We are also going to create metrics that summarize the performance, utilization, and errors of specific queries or the OpenSearch engine as a whole, which are far outside of "query source". This is more the core of a query profiler, which is why we gave the feature that name. We could go deeper and generate more data that can help optimize or tune queries, or show where time should be spent to optimize the queries.
Breaking this feature down into several phases:
(This feature is optional: cluster settings will contain a flag to enable it, as well as definitions regarding sampling.)

Future features that can build off this one: the ability to reject queries based on their context (i.e. reject all queries from dashboard-98765). This will be highly beneficial for managed environments.

*Edited based on comments by @malpani*
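The future "reject by context" idea above could be as simple as a cluster-level blocklist consulted before a query is executed. A hedged sketch (all names are hypothetical; the real mechanism would live in cluster settings, not a module-level set):

```python
from typing import Optional

# Hypothetical blocklist of query contexts; in practice this would come
# from a dynamic cluster setting rather than a hard-coded constant.
BLOCKED_CONTEXTS = {"dashboard-98765"}

def should_reject(context: Optional[str]) -> bool:
    """Return True if a query carrying this context should be rejected.

    Queries with no context attached are allowed through.
    """
    return context is not None and context in BLOCKED_CONTEXTS

print(should_reject("dashboard-98765"))
print(should_reject(None))
```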
Search DSL today already provides a
Thanks! To be honest, I never used this tag. After reading the docs, it looks like it is way better than adding a new header. The idea is to enhance this and also, in the 3rd phase, to add the capability to add stats on the fly.
@malpani oddly enough the |
Having a quick look at this: query profiling typically has to disable the query cache to ensure valid results (e.g., you're profiling the actual query and not measuring a cache hit). This is one of the reasons "profiling adds significant overhead". Is the same intended here? Can you edit the original issue and include the phased approach as a checklist in the description? It'll make it easier to convert this to a meta issue.
@nknize Depends on your use case for the query profiler. Sometimes this is used as an "explain" use case for developing efficient queries, while other times it's used for observability and debugging. In the latter case, I don't think any type of caching changes or disabling caching is required. In fact, you wouldn't want to do that at all. I would think of this as more of what we were looking at, especially with the Prometheus integration we want to do for scraping the data. Being observability people, I think this is the use case we are looking to fulfill.
@nknize I am going to use the indices search stats groups for the measurements. Cache hits and misses are part of the way the query is being sent (if the same query repeats you get hits, so it usually isn't a query that should worry us in terms of cluster health). What would you suggest regarding making this clearer in the issue description?
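Per-group measurements from the indices search stats can then be turned into the per-context numbers discussed above. A sketch of the consuming side, assuming a response from `GET /<index>/_stats/search?groups=<group>` (the JSON snippet below is abbreviated and hand-written, not real cluster output):

```python
import json

# Abbreviated, hand-written stand-in for an indices stats response with a
# search stats group named after a hypothetical dashboard context.
sample = json.loads("""
{
  "_all": {
    "total": {
      "search": {
        "groups": {
          "dashboard-98765": {"query_total": 40, "query_time_in_millis": 1200}
        }
      }
    }
  }
}
""")

def avg_query_millis(stats: dict, group: str) -> float:
    """Average query latency in milliseconds for one stats group."""
    g = stats["_all"]["total"]["search"]["groups"][group]
    return g["query_time_in_millis"] / g["query_total"]

print(avg_query_millis(sample, "dashboard-98765"))
```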
So this is a great point. I think what you're after here for the "observability" use case is either

or
You can take your "Breaking the feature down into several phases" comment above and either:
Or,
I like 1, so everything is in one place, but that's just my opinion.
Just a minor correction: Profiling is part of the OSS codebase. Were you referring to something different?
This is basically what the

Regarding an enhancement to the
Thanks, I also like this option.
True, the thing that is not OSS is the Kibana side of this API. |
We decided that this issue was not easily solvable, at least not to get the outcome we wanted. I am going to close this out for now. We are going to use some of the query naming capabilities in the code to provide more control. We will be opening an issue/proposal on query control soon. |
Requirements - What kind of business use case are you trying to solve?
Problem - What blocks you from solving this today?
Today the profiling capabilities are deep but basic. They are also not part of the open source codebase. The only way to profile OpenSearch is via Java/JVM profiling which is highly inefficient and expensive from a performance perspective.
Proposal - what do you suggest to solve the problem or improve the existing situation?
We propose a basic profiler capability within OpenSearch. This would allow for basic data collection and a pluggable architecture that lets the profiler be extended with additional capabilities. Some future capabilities would be to allow queries to be acted upon, for example blocking, throttling, or prioritizing the queries being executed.
Selection of queries to profile
Data collected on queries
Data storage
Assumptions
We assume the user is running OpenSearch and we can define a new index for the data storage.
Any open questions to address
What did we miss? We don’t want this feature to get over-engineered, but I am sure more things should be added.
One open item is where this should live. This could be an internal plugin, or it could be an external self-contained node or cluster itself. There are pros and cons to each design decision.
We appreciate your input on this concept before design and coding start.
Thank you from @jkowall and @AmiStrn