[ML] Regression in ML Trained Model Deployment Update Causes Failure #107807
Comments
This commit introduces a new method `ByteSizeValue.bytesToString(long size)` to handle negative byte values and replaces all instances of `ByteSizeValue.ofBytes().toString()` with the new method. This change prevents `IllegalArgumentException` when negative byte values are passed to the `toString()` method. Relates elastic#107807
…dates This commit reverts changes to the memory usage estimation logic introduced by PR elastic#98139, which caused failures when updating the `number_of_allocations` for trained model deployments. The reversion restores the system's stability in high-availability environments. Relates elastic#107807
Pinging @elastic/ml-core (Team:ML)
Dear @davidkyle,
I have submitted two PRs aimed at addressing the problems identified.
The first PR (#107812) introduces a safeguard against negative byte values.
The second PR (#107824) reverts specific changes introduced by PR #98139 that led to incorrect memory usage calculations. This reversion is intended as a temporary measure to restore stability for users experiencing deployment update issues. It is a stop-gap solution while we work on a more comprehensive fix.
I kindly request a review of these PRs from the community. Your feedback will be invaluable in ensuring that we resolve this issue effectively and maintain the robustness of the Elasticsearch ML features.
Thank you for your time and consideration.
Best regards,
Thank you for reporting the issue and creating the 2 PRs. First I want to understand where the negative byte value is coming from, as that should never happen. I will follow your reproduction steps and try to find the cause, thank you.
Regarding reverting #98139, I can give you some extra details on why we implemented that change. Each allocation of a model on a given ml node uses the same copy of the model in memory. When the model is evaluated there is an extra transient memory overhead required to store temporary data structures. Each allocation can evaluate the model concurrently, meaning peak memory usage is a function of the number of allocations. #98139 is designed to protect against out of memory errors by accounting for the extra memory used by each allocation.
It looks like there is a bug which produces a negative value for the memory requirement, but the principles behind #98139 are sound, so I would prefer to find the bug rather than revert the change and expose the model to out of memory errors.
FYI, the per allocation memory requirement is calculated in Eland when you upload the model.
Hello @davidkyle
Thank you for your prompt response and for providing additional context regarding PR #98139. I appreciate the explanation of the underlying principles and the intention to prevent out-of-memory errors by accounting for the extra memory used by each allocation.
To assist in identifying the source of the negative byte values, I have added temporary logging in Rassyan@eee877f. This commit includes logs that capture the state of the memory calculations.
Logs for `number_of_allocations` set to 30:
[2024-04-23T21:46:19,836][INFO ][o.e.x.c.m.a.StartTrainedModelDeploymentAction] [1713754515006759332] !!!Rassyan Estimating memory usage for model [baai__bge-base-zh-v1.5], total definition length [406780568], per deployment memory [406716416], per allocation memory [630929564], number of allocations [30], base size [1065219376], new size [19741383904] [2024-04-23T21:46:19,836][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 thread_name:elasticsearch[1713754515006759332][ml_utility][T#1] java.base/java.lang.Thread.getStackTrace(Thread.java:2450) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.utils.StackUtil.getStack(StackUtil.java:13) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.planning.AssignmentPlan$Deployment.<init>(AssignmentPlan.java:56) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentRebalancer.lambda$computePlanForNormalPriorityModels$6(TrainedModelAssignmentRebalancer.java:175) java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939) java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentRebalancer.computePlanForNormalPriorityModels(TrainedModelAssignmentRebalancer.java:178) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentRebalancer.computeAssignmentPlan(TrainedModelAssignmentRebalancer.java:105) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentRebalancer.rebalance(TrainedModelAssignmentRebalancer.java:83) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.rebalanceAssignments(TrainedModelAssignmentClusterService.java:664) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.increaseNumberOfAllocations(TrainedModelAssignmentClusterService.java:909) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.lambda$adjustNumberOfAllocations$18(TrainedModelAssignmentClusterService.java:887) org.elasticsearch.server@8.11.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) java.base/java.lang.Thread.run(Thread.java:1583)[2024-04-23T21:46:19,837][INFO 
][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 [2024-04-23T21:46:19,837][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 [2024-04-23T21:46:19,837][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:59695904 additionalModelMemory:0 [2024-04-23T21:46:19,839][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,839][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:59695904 additionalModelMemory:0 [2024-04-23T21:46:19,839][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 [2024-04-23T21:46:19,839][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,839][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:59695904 additionalModelMemory:0 [2024-04-23T21:46:19,840][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 [2024-04-23T21:46:19,840][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 [2024-04-23T21:46:19,840][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:59695904 additionalModelMemory:0 [2024-04-23T21:46:19,841][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,841][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:59695904 additionalModelMemory:0 [2024-04-23T21:46:19,842][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 [2024-04-23T21:46:19,842][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan 
flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,842][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:59695904 additionalModelMemory:0 [2024-04-23T21:46:19,843][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 [2024-04-23T21:46:19,843][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:19741383904 [2024-04-23T21:46:19,843][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,843][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:59695904 additionalModelMemory:0 [2024-04-23T21:46:19,843][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,843][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:59695904 additionalModelMemory:0 [2024-04-23T21:46:19,844][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:19801079808 additionalModelMemory:0 [2024-04-23T21:46:19,844][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,844][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:19801079808 additionalModelMemory:0 [2024-04-23T21:46:19,844][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,844][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:19801079808 additionalModelMemory:0 [2024-04-23T21:46:19,844][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,844][INFO 
][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:19801079808 additionalModelMemory:0 [2024-04-23T21:46:19,845][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:falsev:19801079808 m.memoryBytes():19741383904 [2024-04-23T21:46:19,845][INFO ][o.e.x.m.i.a.TrainedModelAssignmentRebalancer] [1713754515006759332] !!!Rassyan deployment.memoryBytes():19741383904
Logs for `number_of_allocations` set to 31:
[2024-04-23T21:46:51,611][INFO ][o.e.x.c.m.a.StartTrainedModelDeploymentAction] [1713754515006759332] !!!Rassyan Estimating memory usage for model [baai__bge-base-zh-v1.5], total definition length [406780568], per deployment memory [406716416], per allocation memory [630929564], number of allocations [31], base size [1065219376], new size [20372313468] [2024-04-23T21:46:51,611][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 thread_name:elasticsearch[1713754515006759332][ml_utility][T#2] java.base/java.lang.Thread.getStackTrace(Thread.java:2450) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.utils.StackUtil.getStack(StackUtil.java:13) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.planning.AssignmentPlan$Deployment.<init>(AssignmentPlan.java:56) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentRebalancer.lambda$computePlanForNormalPriorityModels$6(TrainedModelAssignmentRebalancer.java:175) java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1939) java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentRebalancer.computePlanForNormalPriorityModels(TrainedModelAssignmentRebalancer.java:178) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentRebalancer.computeAssignmentPlan(TrainedModelAssignmentRebalancer.java:105) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentRebalancer.rebalance(TrainedModelAssignmentRebalancer.java:83) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.rebalanceAssignments(TrainedModelAssignmentClusterService.java:664) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.increaseNumberOfAllocations(TrainedModelAssignmentClusterService.java:909) org.elasticsearch.ml@8.11.3-SNAPSHOT/org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentClusterService.lambda$adjustNumberOfAllocations$18(TrainedModelAssignmentClusterService.java:887) org.elasticsearch.server@8.11.3-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) java.base/java.lang.Thread.run(Thread.java:1583)[2024-04-23T21:46:51,612][INFO 
][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 [2024-04-23T21:46:51,612][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 [2024-04-23T21:46:51,612][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:true remMemory:-571233660 additionalModelMemory:0 [2024-04-23T21:46:51,613][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,613][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:true remMemory:-571233660 additionalModelMemory:0 [2024-04-23T21:46:51,613][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 [2024-04-23T21:46:51,614][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,614][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:true remMemory:-571233660 additionalModelMemory:0 [2024-04-23T21:46:51,614][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 [2024-04-23T21:46:51,614][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 [2024-04-23T21:46:51,614][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:true remMemory:-571233660 additionalModelMemory:0 [2024-04-23T21:46:51,615][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,615][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:true remMemory:-571233660 additionalModelMemory:0 [2024-04-23T21:46:51,615][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 [2024-04-23T21:46:51,616][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan 
flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,616][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:true remMemory:-571233660 additionalModelMemory:0 [2024-04-23T21:46:51,616][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 [2024-04-23T21:46:51,616][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan memoryBytes:20372313468 [2024-04-23T21:46:51,616][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,616][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:true remMemory:-571233660 additionalModelMemory:0 [2024-04-23T21:46:51,617][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,617][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:true remMemory:-571233660 additionalModelMemory:0 [2024-04-23T21:46:51,617][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:19801079808 additionalModelMemory:0 [2024-04-23T21:46:51,617][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,617][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:19801079808 additionalModelMemory:0 [2024-04-23T21:46:51,617][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,618][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:19801079808 additionalModelMemory:0 [2024-04-23T21:46:51,618][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,618][INFO 
][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:false remMemory:19801079808 additionalModelMemory:0 [2024-04-23T21:46:51,618][INFO ][o.e.x.m.i.a.p.AssignmentPlan] [1713754515006759332] !!!Rassyan flag:truev:19801079808 m.memoryBytes():20372313468 [2024-04-23T21:46:51,618][INFO ][o.e.x.m.i.a.TrainedModelAssignmentRebalancer] [1713754515006759332] !!!Rassyan deployment.memoryBytes():20372313468
I understand the importance of the changes introduced in #98139 and agree that a long-term solution should be discussed to address the issue without compromising the system's stability. The PR to revert the changes is indeed a temporary measure aimed at quickly resolving the bug to ensure that production environments are not adversely affected.
I am committed to collaborating with the community to develop a more permanent resolution that upholds the principles of #98139 while ensuring the robustness of the deployment updates. In the meantime, I believe that the revert PR can serve as an interim solution to mitigate the immediate impact on users.
I look forward to your feedback on the additional logging and the logs provided. Let's work together to resolve this issue effectively.
Best regards,
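The figures in the logs above can be checked by hand. The sketch below is only an illustration: the two formulas are inferred from the logged numbers rather than copied from the Elasticsearch source, and the value logged as `v` is assumed to be the memory available on one ML node. The class and method names are hypothetical.

```java
/**
 * Reproduces the arithmetic visible in the debug logs.
 * The formulas are inferred from the logged values, not taken from
 * the Elasticsearch source, so treat them as an assumption.
 */
final class MemoryEstimateSketch {

    // Values copied from the logs for model baai__bge-base-zh-v1.5.
    static final long DEFINITION_LENGTH = 406_780_568L;
    static final long PER_DEPLOYMENT_MEMORY = 406_716_416L;
    static final long PER_ALLOCATION_MEMORY = 630_929_564L;
    // Logged as "v"; assumed here to be the memory available on one ML node.
    static final long NODE_MEMORY = 19_801_079_808L;

    /** "new size" in the logs: grows linearly with the number of allocations. */
    static long estimate(int numberOfAllocations) {
        return DEFINITION_LENGTH
            + PER_DEPLOYMENT_MEMORY
            + PER_ALLOCATION_MEMORY * numberOfAllocations;
    }

    /** "remMemory" in the logs: node memory minus the deployment estimate. */
    static long remaining(int numberOfAllocations) {
        return NODE_MEMORY - estimate(numberOfAllocations);
    }

    public static void main(String[] args) {
        for (int n : new int[] {30, 31}) {
            System.out.printf("allocations=%d estimate=%d remaining=%d%n",
                n, estimate(n), remaining(n));
        }
    }
}
```

With 30 allocations the estimate still fits under the logged node value, while the 31st allocation pushes it over, which matches the point where `remMemory` turns negative in the logs.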
Thanks for the logs and the extra details @Rassyan.
The good news is that I can reproduce the error in Elasticsearch 8.11.3 with 2 * 32GB ml nodes. The numbers are slightly different for me, but when updating with 40 allocations I get the error.
I was a little confused that I could not reproduce the error on the latest version (8.13.2), then I realised the problem with the negative byte size has been fixed in 8.12. I tested version 8.12.0 and the error did not occur.
If you can upgrade to 8.12 the problem will go away for you. I don't think any further changes are necessary. Please can you try a new release of Elasticsearch and check if that resolves your problems.
Thank you for the quick response and for confirming the issue, @davidkyle. I appreciate the suggestion to upgrade to version 8.12 where the issue has been resolved. Unfortunately, due to constraints in our production environment, an immediate upgrade is not feasible for us in the short term. However, we will certainly consider upgrading as part of our long-term solution.
In the meantime, I would like to inquire about the policy regarding fixes for issues like this in Elasticsearch. Is it standard practice to only address such issues in subsequent releases without backporting the fix to earlier versions like 8.11? Understanding this policy would help us plan our maintenance and upgrade strategies more effectively.
If the fix will not be backported to version 8.11, I will proceed with a workaround by patching the issue locally in our deployment. It would be helpful to know if there are any known risks or considerations I should be aware of when applying such a patch.
Thank you again for your assistance and for the work you do maintaining Elasticsearch.
Best regards,
Sorry @Rassyan, I missed the notification for your last message. In general, once a new minor version is released there will be no more releases of the previous minor. 8.12 was released in January, so 8.11.4 is the final release of the 8.11 branch. Commits will not be backported to 8.11 now that the next minor (8.12) is available. The exception is that the last minor of version 7 is supported, and fixes will be backported to 7.17 where applicable.
I don't have any experience running a patched version of Elasticsearch, sorry. If possible I encourage you to upgrade to 8.12 or later to avoid the burden of maintaining the patched version. Thank you for your contributions; if you don't mind I will close the issue now. I am pleased you are using Elastic for NLP. Good luck, and I hope you can upgrade soon.
Elasticsearch Version
8.11.3
Installed Plugins
No response
Java Version
21
OS Version
N/A
Problem Description
Background:
In high-availability environments with a sufficient number of ML nodes, I am experiencing failures when attempting to update the `number_of_allocations` for a trained model deployment using the `POST _ml/trained_models/{model_id}/deployment/_update` API. This regression, which surfaced after the introduction of PR #98139, prevents valid updates to the number of allocations, despite the availability of ample resources. This bug has significantly impacted the normal usage of the feature in our production environment.
Description of the problem including expected versus actual behavior:
To simplify this problem, I reproduce it below. Brief description here:
The update operation fails with a `status_exception` (429) when `number_of_allocations` is set to 30, and with an `illegal_argument_exception` (400) for a negative byte value when set to 31. This behavior is a departure from previous versions where such updates were successful.
Steps to Reproduce
Model information for reproduction:
Steps to reproduce:
`number_of_allocations` on an environment with at least two ML nodes (32c64g).
`number_of_allocations` set to 30 and observe the `status_exception` (429).
`number_of_allocations` set to 31 and observe the `illegal_argument_exception` (400) with debug logs indicating a negative byte value.
Logs (if relevant)
when `number_of_allocations` is set to 30 and 31 in version 8.11.3
debug log
Reference to the behavior in versions prior to 8.11, where the issue was not present
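For anyone trying the reproduction steps, the two update calls can be sketched as follows. This is a minimal illustration that only builds the requests and does not send them. The endpoint and the `number_of_allocations` field come from this issue, the model ID is the one that appears in the debug logs, and the base URL and class name are placeholders of my own.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/**
 * Builds (but does not send) the deployment update request from the
 * reproduction steps. The base URL is a placeholder for your cluster.
 */
final class UpdateAllocationsRequestSketch {

    static HttpRequest build(String baseUrl, String modelId, int numberOfAllocations) {
        URI uri = URI.create(
            baseUrl + "/_ml/trained_models/" + modelId + "/deployment/_update");
        String body = "{\"number_of_allocations\": " + numberOfAllocations + "}";
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // 30 and 31 are the two values used in the reproduction steps.
        for (int n : new int[] {30, 31}) {
            HttpRequest request =
                build("http://localhost:9200", "baai__bge-base-zh-v1.5", n);
            System.out.println(request.method() + " " + request.uri());
        }
    }
}
```

Sending the request (for example with `java.net.http.HttpClient`) and any authentication headers are left out, since they depend on the cluster setup.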
Analysis of the problem:
Upon reviewing PR #98139, I discovered that the core issue lies in the changes made to the `org.elasticsearch.xpack.core.ml.action.StartTrainedModelDeploymentAction#estimateMemoryUsageBytes` method. The modification introduced a linear relationship between the estimated memory usage and the `numberOfAllocations`, which, when combined with the subsequent arithmetic operations in `org.elasticsearch.xpack.ml.inference.assignment.planning.AssignmentPlan.Builder#accountMemory`, can result in a negative value. This negative value is then incorrectly passed to `ByteSizeValue.ofBytes().toString()`, leading to an `IllegalArgumentException`.
Proposed Solution:
I plan to submit two PRs: the first to handle the immediate issue of negative byte values, and the second to revert the key changes introduced by PR #98139 as a temporary measure. Concurrently, I am eager to participate in discussions for a more permanent fix that aligns with the project's goals and architecture.
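To illustrate the idea behind the first PR, here is a rough sketch of a formatter that tolerates negative byte counts instead of throwing. It is not the code from the PR and does not use the Elasticsearch `ByteSizeValue` class; the class name, the `formatBytes` helper, and the unit suffixes are hypothetical choices for this example.

```java
import java.util.Locale;

/**
 * Hypothetical helper, not the code from the PR: formats a byte count
 * for log or error messages without throwing on negative input.
 */
final class SafeByteFormatSketch {

    private static final String[] UNITS = {"b", "kb", "mb", "gb", "tb", "pb"};

    static String formatBytes(long bytes) {
        if (bytes == Long.MIN_VALUE) {
            // Math.abs(Long.MIN_VALUE) overflows, so handle it separately.
            return Long.MIN_VALUE + "b";
        }
        String sign = bytes < 0 ? "-" : "";
        double value = Math.abs(bytes);
        int unit = 0;
        // Scale down by 1024 until the value fits the largest suitable unit.
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        return unit == 0
            ? sign + (long) value + UNITS[unit]
            : String.format(Locale.ROOT, "%s%.1f%s", sign, value, UNITS[unit]);
    }

    public static void main(String[] args) {
        // The two remMemory values that appear in the debug logs.
        System.out.println(formatBytes(59_695_904L));
        System.out.println(formatBytes(-571_233_660L));
    }
}
```

A guard like this only keeps the message formatting from failing; it does not address why the remaining-memory value becomes negative in the first place.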
It is crucial to address this regression constructively, acknowledging the complexity of software development and the potential for unintended side effects in contributions. The focus is on rectifying the issue to ensure the stability and reliability of the Elasticsearch ML features.
Additional context:
The issue was identified through detailed analysis and is documented with screenshots and code references to facilitate understanding and reproduction of the problem. I am committed to working on a swift resolution for this issue, as it is affecting our production environment. I am also open to engaging in discussions for a long-term solution and willing to contribute to its development.
The regression affects all subsequent versions and requires prompt attention to mitigate its impact on production environments. I have documented the debugging process with screenshots and code references to facilitate understanding and reproduction of the problem.