
[RFC] Isolating CPU for specific workloads(search or others) #12483

Open
sgup432 opened this issue Feb 27, 2024 · 27 comments
Labels
enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes Roadmap:Stability/Availability/Resiliency Project-wide roadmap label Search:Resiliency

Comments

@sgup432
Contributor

sgup432 commented Feb 27, 2024

Is your feature request related to a problem? Please describe

I want to propose this idea and am looking for opinions from the community. Hopefully it doesn't sound terrible.

Context:
As of now, search and indexing workloads (and others as well) run on the same node, in the same process, each with its own thread pool. A lot of the time, certain "expensive" search queries spike CPU to 100% on some or all nodes in a cluster and cause cluster instability. This can happen when the search threads occupy the majority (or all) of the CPU cores, especially since we deliberately size the search pool larger than the number of available cores for performance reasons (formula: ((allocatedProcessors * 3) / 2) + 1).
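To make the sizing heuristic concrete, here is the formula above as code (a sketch; `searchPoolSize` is a hypothetical helper name, not an OpenSearch API):

```java
// Sketch of the search thread pool sizing heuristic described above:
// ((allocatedProcessors * 3) / 2) + 1. The pool is deliberately larger
// than the core count, so nothing stops search from consuming 100% CPU.
public class SearchPoolSizing {
    static int searchPoolSize(int allocatedProcessors) {
        return ((allocatedProcessors * 3) / 2) + 1;
    }

    public static void main(String[] args) {
        // A 16-core node gets a 25-thread search pool: more threads than cores.
        System.out.println(searchPoolSize(16)); // 25
    }
}
```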

Today we don't have a good way for users to cap a workload (search, for example) so that it takes no more than, say, 70% of CPU, which would provide resiliency and avoid node drops.

We have had discussions around defining search-only nodes, but that is complex and may be expensive for users, who would have to maintain a different node type. The existing search backpressure feature doesn't solve this holistically either, and it relies on a cancellation mechanism (for search tasks once CPU goes beyond X%) which has its own limitations and may not always work.

Unfortunately there is no way in Java to isolate heap for a specific group of threads, otherwise this could have been extended to the JVM heap as well.

Use case:
A user could say they want to allocate 70% of CPU to search workloads via some cluster setting, and we would accordingly map the search thread pool to X% of the CPU cores, or something along those lines.

Describe the solution you'd like

In Java, a concept called thread affinity exists whereby you can map a group of threads to certain CPU cores. It is mostly used in cases where the OS's thread scheduling does not provide optimal performance, but we could use it to restrict a group of threads to run on only X of the Y cores on a node (for example).
There is also a library around this which provides a way to assign a group of threads to certain CPU cores (by reading /proc/cpuinfo) using a ThreadFactory.
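A rough sketch of how such a factory could be wired into an executor. The actual pinning call (`AffinityLock.acquireLock()` from the OpenHFT Java-Thread-Affinity library) is left as a comment since it is an external dependency whose exact behavior I haven't verified; everything else is plain JDK:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Sketch: a ThreadFactory whose threads could be pinned to a subset of cores.
public class PinnedSearchPool {
    static ThreadFactory pinnedFactory(String name) {
        return runnable -> new Thread(() -> {
            // With the OpenHFT library (assumption, not verified here):
            // try (AffinityLock lock = AffinityLock.acquireLock()) {
            //     runnable.run();  // runs pinned to the acquired core
            // }
            runnable.run();
        }, name); // all threads share one name in this simplified sketch
    }

    public static void main(String[] args) {
        ExecutorService searchPool =
            Executors.newFixedThreadPool(4, pinnedFactory("search-pinned"));
        searchPool.submit(() -> System.out.println(Thread.currentThread().getName()));
        searchPool.shutdown();
    }
}
```

The idea would be to pass such a factory where the search pool's executor is built today.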

I haven't done a POC yet, so I am not sure of the feasibility.

Related component

Search:Resiliency

Describe alternatives you've considered

Have different processes for search/indexing, but this seems to be far more complex.

Additional context

No response

@sgup432 sgup432 added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 27, 2024
@andrross
Member

we define the number of search threads greater than available CPU cores on a node for performance reasons

Can you expand on these reasons? It would probably be helpful context. Naively, if all search tasks were 100% CPU-bound then you could accomplish "User can say that they want to allocate 70% of CPU to search workloads" by defining the search thread pool size as allocated processors * 0.7.

@sgup432
Contributor Author

sgup432 commented Feb 27, 2024

@andrross

Can you expand on these reasons? It would probably be helpful context.

Sure, I can try. When a system like OpenSearch does a lot of I/O-bound work (network, file I/O), many threads may be doing nothing but waiting, and in that case other threads (which need to do some computation) can be scheduled onto the CPU by the operating system instead. This achieves much better CPU utilization. If we had a service that only did heavy CPU work (pure computation), then it would make sense to keep the number of threads equal to the number of cores.

So I don't think we should solve this by reducing the thread pool size to allocated processors * 0.7; it would degrade performance compared to the approach I have mentioned.
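The intuition above matches the standard back-of-envelope sizing rule for mixed CPU/I-O workloads, threads ≈ cores * (1 + waitTime / computeTime). A sketch (the method name and numbers are illustrative, not from OpenSearch):

```java
// Rule-of-thumb pool sizing: the more time threads spend blocked on I/O,
// the more threads you can productively run per core. With zero wait it
// collapses to threads = cores.
public class PoolSizeRuleOfThumb {
    static int idealThreads(int cores, double waitMs, double computeMs) {
        return (int) (cores * (1 + waitMs / computeMs));
    }

    public static void main(String[] args) {
        System.out.println(idealThreads(16, 0, 10)); // CPU-bound: 16
        System.out.println(idealThreads(16, 5, 10)); // 50% wait per unit compute: 24
    }
}
```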

@andrross
Member

Thanks @sgup432. The question behind my question is whether there is opportunity to structure the code such that we aren't doing blocking operations on "search" threads, which would make your scaling decisions about that thread pool much simpler and avoid needing to over-provision relative to the physical compute resources.

@jainankitk
Collaborator

The question behind my question is whether there is opportunity to structure the code such that we aren't doing blocking operations on "search" threads

@andrross - The blocking operations on "search" threads are for reading data from Lucene segment files on disk. Hence IMO, unless we fundamentally change Lucene or the way OpenSearch works, we cannot avoid blocking threads.

@sgup432
Contributor Author

sgup432 commented Feb 28, 2024

In addition to what @jainankitk said, we can never fully avoid threads being interrupted or making way for others unless we have an ideal system. This Stack Overflow thread probably explains it better.

@peternied
Member

[Triage]
@sgup432 Thanks for creating this RFC; it might be useful to tie this to the #11061 effort, which looks related.

@peternied peternied added the RFC Issues requesting major changes label Feb 28, 2024
@reta
Collaborator

reta commented Feb 28, 2024

Unfortunately there is no way in JAVA to isolate heap for specific group of threads, otherwise this could have been extended to JVM heap as well.

There are some interesting options here that GraalVM explores, please see https://medium.com/graalvm/isolates-and-compressed-references-more-flexible-and-efficient-memory-management-for-graalvm-a044cc50b67e

In addition to what @jainankitk said, we can never avoid a situation where thread won't be interrupted or made space for another until unless we have an ideal system. This stackoverflow thread probably explains it better.

This picture is a bit simplified and does not reflect where we are now. We have had concurrent search in GA since 2.12: there is a shared thread pool that manages segment traversals in parallel. On the happy path, search threads do not do any I/O anymore, but index searcher threads do.
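A minimal, self-contained sketch of the shape described here (not OpenSearch's actual code; pool sizes and the stand-in segment tasks are invented for illustration). The "search" thread fans work out to a shared index-searcher pool and then just blocks, so capping the search pool would not cap the CPU actually being burned:

```java
import java.util.List;
import java.util.concurrent.*;

// Sketch of concurrent search: the search thread waits, the other pool works.
public class ConcurrentSearchSketch {
    static int search() {
        ExecutorService searchPool = Executors.newFixedThreadPool(2);
        ExecutorService indexSearcherPool = Executors.newFixedThreadPool(4);
        try {
            Future<Integer> result = searchPool.submit(() -> {
                // Stand-ins for per-segment traversals done by the shared pool.
                List<Callable<Integer>> segments = List.of(() -> 1, () -> 2, () -> 3);
                int total = 0;
                for (Future<Integer> f : indexSearcherPool.invokeAll(segments)) {
                    total += f.get(); // the search thread just waits here
                }
                return total;
            });
            return result.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            searchPool.shutdown();
            indexSearcherPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(search()); // 6
    }
}
```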

We don't have a nice way as of today for users to isolate a search workload(for example) to not take up more than 70%(lets say) CPU, thereby providing resiliency and avoid node drops.

We cannot limit CPU/heap for a thread pool efficiently (at least I think so, and would appreciate correction here). We could certainly try to estimate the pool's CPU/heap usage, or limit the number of active threads to some percentage of cores, but none of those would be very effective, I think: we would end up with either underutilization (e.g. many blocked threads would make it look like the pool is doing nothing) or overutilization (once those blocked threads become runnable at the same time).

@sgup432
Contributor Author

sgup432 commented Feb 28, 2024

There some interesting options here that Graalvm explores, see please https://medium.com/graalvm/isolates-and-compressed-references-more-flexible-and-efficient-memory-management-for-graalvm-a044cc50b67e

Thanks for the reference, I will take a look.

We cannot limit CPUs/heap on a thread pool efficiently (at least I do think so and would appreciate correction here)

According to my limited knowledge, the library I mentioned above provides a way to create a thread factory whose threads are pinned to certain CPU cores. I am guessing we could use that (maybe with some tweaks) to achieve this, since the search thread pool is ultimately created by passing a thread factory to an ExecutorService (correct me if I am wrong here). I still need to do a deeper dive on the library, but I first wanted to see whether the idea makes any sense and initiate some discussion around it.

we could certainly try to estimate the pool's CPU/heap usage or limit the number of active threads to some % of cores, but all of those would not be very effective I think, we would end up with underutilization (fe many blocking threads would tell that the pool is doing nothing) or overutilization..

Yeah, I agree that we might end up either over-utilizing or under-utilizing CPU/heap. But this RFC accepts that caveat and is focused on providing resiliency by isolating the underlying resources for search (or indexing) workloads.

@reta
Collaborator

reta commented Feb 28, 2024

According to my limited knowledge, the library I mentioned above provides a way to create a thread factory whose threads are pinned to certain CPU cores.

Pinning is not the same as limiting CPU usage, right? Pinning helps with data locality / context switches; you could pin a thread to a core and the thread could still just block (btw, the library actually lives under OpenHFT, https://github.com/OpenHFT/Java-Thread-Affinity). It may work in some systems, but OpenSearch has many, many layers (Apache Lucene included) and I am 100% sure that pinning will not be helpful (we have ~20 different thread pools).

But this RFC accepts that caveat and is more focussed towards providing resiliency by isolating underlying resources for search(or indexing) workloads.

That caveat may not be practical; it's like asking people to use the serializable isolation level on their databases all the time: very good isolation, horrible performance.

@andrross
Member

But this RFC accepts that caveat and is more focussed towards providing resiliency by isolating underlying resources for search(or indexing) workloads.

That caveat may not be practical; it's like asking people to use the serializable isolation level on their databases all the time: very good isolation, horrible performance.

Agreed this caveat may not be practical. Also, I believe we can achieve added resiliency via under-utilization today by overriding the default search thread pool size to a smaller value.

@sgup432
Contributor Author

sgup432 commented Feb 28, 2024

@reta

Pinning is not the same as limiting CPU usage, right?

Yeah, I used "pinning", but that wasn't quite the intention. Java affinity primarily tries to solve the data locality problem, but I was looking to use it differently: to isolate X% of CPU cores for some group of threads Y. That was just one example, and I am open to other approaches as well.

@reta

That caveat may not be practical; it's like asking people to use the serializable isolation level on their databases all the time: very good isolation, horrible performance.

My intention was not to enable this for everyone, but only when needed. I also don't think we can generalize to "horrible" performance, since the user controls the setting.
For example, if a user has a node with 16 CPU cores and they want to run search-related work on only 12 of them, you can basically treat it as a 12-core node dedicated to search. I don't think that user would call it horrible performance, as that's a tradeoff they have chosen.

@andrross

Agreed this caveat may not be practical. Also, I believe we can achieve added resiliency via under-utilization today by overriding the default search thread pool size to a smaller value

As mentioned above, consider these two cases:
1st case: A user chooses to use 12 CPU cores out of 16 for search, so assume a 12-core search node. The search thread count remains the same as today: ((allocatedProcessors (which is 12) * 3) / 2) + 1 (I am not considering concurrent search at this point).
2nd case (as you mentioned): I size my search threads as 0.7 * 16 cores ≈ 11.
I'd guess we still perform better in the 1st case, as in the 2nd case the CPU is going to be badly underutilized. IMO if we want to solve this, it shouldn't be via the 2nd case.
So we can probably frame it as "providing resiliency without compromising CPU utilization (on whatever CPU is available)".
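The arithmetic behind the two cases (method names are illustrative only):

```java
// Case 1: isolate 12 of 16 cores for search; pool sized by the existing
// formula against the 12 visible cores -> 19 threads keeping 12 cores busy.
// Case 2: keep all 16 cores visible but shrink the pool to ~70% of them
// -> 11 threads spread across 16 cores, leaving CPU underutilized.
public class IsolationVsDownsizing {
    static int case1Threads(int isolatedCores) {
        return ((isolatedCores * 3) / 2) + 1;
    }

    static int case2Threads(int cores) {
        return (int) (0.7 * cores);
    }

    public static void main(String[] args) {
        System.out.println(case1Threads(12) + " vs " + case2Threads(16)); // 19 vs 11
    }
}
```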

@jainankitk
Collaborator

@reta @andrross - Thank you for providing your inputs. While I agree that significant (I know this is vague) under-utilization of resources might not be an acceptable tradeoff, maybe limiting the cores for the search thread pool to 80% or so would prevent CPU from spiking to 100% and bringing the whole cluster to a standstill.

@sgup432 - I doubt this can be supported as a dynamically configurable parameter. Are you thinking of it as a static configuration, with the default being the same as what we have today?

This picture is a bit simplified and does not reflect where we are now. We have had concurrent search in GA since 2.12: there is a shared thread pool that manages segment traversals in parallel. On the happy path, search threads do not do any I/O anymore, but index searcher threads do.

@reta - I am a bit unclear on how concurrent segment search prevents/reduces search threads from getting blocked. Can you please elaborate?

@reta
Collaborator

reta commented Feb 28, 2024

@reta - I am a bit unclear on how concurrent segment search prevents/reduces search threads from getting blocked. Can you please elaborate?

@jainankitk The search task (managed by the search thread pool) will be waiting for numerous index searcher tasks (managed by the index searcher pool) to complete. In other words, it won't be doing any useful work, and limiting the search pool will not help; the real work is done by the other pool instead.

@reta
Collaborator

reta commented Feb 28, 2024

For example, if a user has a node with 16 CPU cores and they want to run search-related work on only 12 of them, you can basically treat it as a 12-core node dedicated to search. I don't think that user would call it horrible performance, as that's a tradeoff they have chosen.

@sgup432 Any user could do this right now: every thread pool is configurable, and the limit can be set to whatever the user thinks is reasonable. No additional implementation is required.

@andrross
Member

@sgup432 @jainankitk

As @reta mentioned, CPU-pinning is generally a low level technique to provide data locality / reduce context switching. The problem here (search operations can overwhelm the system) is probably better solved by limiting concurrency at the application level. This is indeed a hard problem but in my opinion improving search backpressure / cancellation is probably the way to go, as it can be adaptive to the workload in real time.

If you want to put an artificial cap on a class of operations, you can do so today by overriding the default thread pool size to something smaller. I'm skeptical that limiting a thread pool to use a subset of CPU cores would be significantly different in practice to just tuning the thread pool size.
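For reference, capping the pool today is a one-line static setting in opensearch.yml (the value below is illustrative, assuming the standard `thread_pool.search.size` setting):

```yaml
# opensearch.yml: statically cap the search pool on a 16-core node.
# Default is ((allocatedProcessors * 3) / 2) + 1; 11 here is just an example.
thread_pool.search.size: 11
```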

@andrross
Member

I'm skeptical that limiting a thread pool to use a subset of CPU cores would be significantly different in practice to just tuning the thread pool size.

@sgup432 Feel free to build a prototype and prove me wrong though!

@sgup432
Contributor Author

sgup432 commented Feb 28, 2024

@reta

Any user could do this right now: every thread pool is configurable, and the limit can be set to whatever the user thinks is reasonable. No additional implementation is required.

I am guessing you meant limiting the number of threads, as @andrross suggested? If so, that isn't what I intended.

@sgup432
Contributor Author

sgup432 commented Feb 29, 2024

@jainankitk

@sgup432 - I doubt that this can be supported as dynamically configurable parameter. Are you thinking of this as static configuration with default being same as what we have today?

Yeah, it will be a static setting, and by default everything would function the way it does today.

@jainankitk
Collaborator

@jainankitk The search task (managed by the search thread pool) will be waiting for numerous index searcher tasks (managed by the index searcher pool) to complete. In other words, it won't be doing any useful work, and limiting the search pool will not help; the real work is done by the other pool instead.

@reta - If most of the work is going to be done by index searcher tasks going forward, maybe putting the limitation on that pool makes sense!?

The problem here (search operations can overwhelm the system) is probably better solved by limiting concurrency at the application level.

I am wondering if limiting concurrency at the application level can lead to under-utilization of resources.

This is indeed a hard problem but in my opinion improving search backpressure / cancellation is probably the way to go, as it can be adaptive to the workload in real time.

We are working on that as well, but it will be some time before we solve the problem for good with that approach, since cancellation is a hard problem in itself.

I'm skeptical that limiting a thread pool to use a subset of CPU cores would be significantly different in practice to just tuning the thread pool size.

@andrross - Yes, it should be different, unless my understanding is incorrect (@sgup432 - please correct me if I am wrong). Limiting cores is like restricting the lanes on a freeway while still allowing traffic (non-search operations) to flow smoothly through the carpool/express lane. Limiting the number of threads is like reducing the number of cars allowed on the freeway, which stays reduced even when there is no congestion.

@jainankitk
Collaborator

@sgup432 Feel free to build a prototype and prove me wrong though!

While I am not sure about the proving-wrong part! ;) It did get me wondering what it would take to build the prototype: will it be quick, or a few weeks of effort?

@andrross
Member

andrross commented Mar 1, 2024

Limiting the cores is like restricting the lanes on freeway allowing the traffic to still flow smoothly (non-search operations) through carpool/express lane.

@jainankitk I'm still skeptical :) If your problem was that search operations had filled the freeway up with cars, and then you limit them to a subset of lanes, then that will reduce the total throughput for search operations. Regardless of whether there is congestion or not the total possible throughput for search operations has been reduced. This sounds the same as if you just limited the number of search "cars" allowed on the freeway with no particular lane restrictions.

@andrross
Member

andrross commented Mar 1, 2024

I am wondering if limiting concurrency at application level can lead to under-utilization of resources.

@jainankitk To be clear, OpenSearch does this today. This is why there are like 20 different thread pools defined of all different sizes and characteristics. If we didn't want to limit concurrency of different functions, then we'd just have one giant thread pool and let everything fight over it. The problem here is that we have a heuristic (i.e. "150% of CPU cores + 1 gives the right balance for search operation concurrency") and this heuristic works well generally, but in some cases those search operations end up being more CPU-heavy than the average workload leading to over-utilization of the CPU. If the workload is predictable and known ahead of time then the search thread pool can be tuned to a different size. If the workload is not known ahead of time or is unpredictable, then I think you need to solve this with something that is adaptive like backpressure.

@sohami
Collaborator

sohami commented Mar 4, 2024

Maybe something along the lines of thread priority can help solve the stability problem: make the ping and health-check threads the highest priority, the indexing and search threads medium, and the background task thread pools low priority.

@jainankitk
Collaborator

@sgup432 - During the discussion at the Lucene meetup, @sohami asked whether we could use priorities for this purpose. Do you think assigning lower priority to the search, indexing, and other intensive thread pools would achieve a similar purpose as isolating CPU? - https://www.baeldung.com/java-thread-priority

@sgup432
Contributor Author

sgup432 commented Mar 4, 2024

@sohami @jainankitk
I did evaluate this option (thread priority) earlier. I think it can help get other high-priority threads scheduled, but it is not guaranteed to provide true isolation; we are leaving those decisions to the OS. So we may still end up in a situation where a node goes down under a search-heavy workload. Correct me if I am wrong.

@jainankitk
Collaborator

I did evaluate this option (thread priority) earlier. I think it can help get other high-priority threads scheduled, but it is not guaranteed to provide true isolation; we are leaving those decisions to the OS.

I guess our objective is to ensure that critical tasks are not starved of CPU by compute-intensive search threads. Isolation is one way to do that; priority achieves the same goal without introducing additional constraints.

So we may still end up in a situation where a node goes down under a search-heavy workload. Correct me if I am wrong.

With prioritization, a search thread would yield the CPU whenever a higher-priority thread is waiting, so I don't think the node would go down due to CPU starvation. Maybe we can try out this approach?
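A sketch of what the prioritization approach could look like: a ThreadFactory that assigns a fixed priority, so e.g. a search pool could run at MIN_PRIORITY while ping/health-check pools run at MAX_PRIORITY. The class name is hypothetical, and note that Java thread priorities are only hints to the OS scheduler, not a preemption guarantee.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical factory: every thread it creates gets the given priority.
public class PriorityThreadFactory implements ThreadFactory {
    private final String prefix;
    private final int priority;
    private final AtomicInteger counter = new AtomicInteger();

    public PriorityThreadFactory(String prefix, int priority) {
        this.prefix = prefix;
        this.priority = priority;
    }

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
        t.setPriority(priority); // a hint to the scheduler, not a guarantee
        return t;
    }

    public static void main(String[] args) {
        ThreadFactory search = new PriorityThreadFactory("search", Thread.MIN_PRIORITY);
        Thread t = search.newThread(() -> {});
        System.out.println(t.getName() + " priority=" + t.getPriority());
    }
}
```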

@sgup432
Contributor Author

sgup432 commented Mar 5, 2024

@jainankitk
Sure, I can try this approach as well when I have some bandwidth (probably next week).

@sohami sohami added the Roadmap:Stability/Availability/Resiliency Project-wide roadmap label label May 14, 2024