[ML][Inference] Adding model memory estimations #48323
Conversation
Pinging @elastic/ml-core (:ml)
I think we will be creating a problem for ourselves by using the term "model memory" for this, especially if a future extension of this work ever uses a field of that name. If there is any other functionality where Java heap size usage can be restricted by a setting then it may be best to reuse that terminology. Or if we can somehow frame this functionality as some sort of "circuit breaker", that may be good, as that's the terminology used by aggregations (although for a slightly different use case). Failing that, we should think of a new term that isn't "model memory".
Force-pushed from 3662d19 to 2e91eed (…ture/ml-inference-model-cache-limits)
I had a look at the cache implementation and hopefully added some useful feedback.
* How long should a model stay in the cache since its last access
*
* Since the model loading service is used related to ingest scenarios it makes sense for unused models to be evicted.
* If a processor references a model, but does not use that model for an extended period of time, we will evict it.
Nit: It's not clear to me how model usage is tracked (promotion of cache entries by usage). Does the ingest processor get the model from the cache for every document, or for every x documents? It would be good to document the proper interaction with this service.
The ingest processor gets the model as many times as it needs. When a model is referenced by any processor, it is loaded into the cache. I can try to clarify the comment.
@@ -77,13 +125,15 @@ public void getModel(String modelId, ActionListener<Model> modelActionListener)
*/
Returns false if the model is not loaded or actively being loaded

Nit: This is somewhat unclear. From the code I get that it's not loaded if not referenced, which means not used, e.g. in a pipeline. But this code is also called from getModel(...), which is confusing: I want to explicitly get a model but might not get it in the end?
Not really: if getModel(...) sees that somebody else kicked off a load (cluster state change?) but it is not loaded yet, this method will add the listener to the listener queue.
If the caller sees that the model is not in the middle of loading AND is not loaded, it will make a direct call to eager load it and NOT attempt to cache it.
I will try to clarify the comment.
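The load-or-queue behavior described in this exchange can be sketched with plain JDK types (class and method names here are illustrative; the real service uses ActionListener and an internal cache, not these stand-ins):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical sketch: a caller that asks for a model while a load is already
// in flight is queued and notified once the load completes, instead of
// starting a second load.
class ModelLoaderSketch {
    private final Map<String, String> cache = new HashMap<>();   // modelId -> loaded model
    private final Map<String, Queue<Consumer<String>>> loadingListeners = new HashMap<>();

    synchronized void getModel(String modelId, Consumer<String> listener) {
        String cached = cache.get(modelId);
        if (cached != null) {
            listener.accept(cached);        // fast path: already loaded
            return;
        }
        Queue<Consumer<String>> queue = loadingListeners.get(modelId);
        if (queue != null) {
            queue.add(listener);            // somebody else kicked off a load: just wait
            return;
        }
        Queue<Consumer<String>> fresh = new ArrayDeque<>();
        fresh.add(listener);
        loadingListeners.put(modelId, fresh);
        // ... kick off the asynchronous load of modelId here ...
    }

    // Invoked when the asynchronous load finishes.
    synchronized void handleLoadSuccess(String modelId, String model) {
        Queue<Consumer<String>> queue = loadingListeners.remove(modelId);
        if (queue == null) {
            return;                         // nobody is waiting any more: do not cache
        }
        cache.put(modelId, model);
        for (Consumer<String> l : queue) {  // notified exactly once, because the
            l.accept(model);                // map entry was removed above
        }
    }
}
```

If a processor referencing the model is removed before the load completes, the listener queue is gone and handleLoadSuccess drops the model instead of caching it, matching the behavior described above.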
}
localModelCache.put(modelId, loadedModel);
I think the implementation has a small race: if more than one thread calls this with the same modelId/Config, the model gets loaded twice; it gets overwritten in the cache, but it will also be twice in memory, as both loadedModel instances are returned (line 180). If line 173 kicks in, you also load the model superfluously.
I suggest removing line 168 and using computeIfAbsent on line 176 instead of put. That ensures that the model is loaded only if necessary and exactly once.
If line 173 kicks in you also load the model superfluously.

That is true, but the loading already kicked off. This check is here for the scenario when a processor references a model_id, the loading starts, but the processor is removed before it can be fully loaded. So, we should not cache it, as nothing cares about it.
If this method gets called more than once for the same modelId and loadedModel, it will not get added to the cache twice. Since loadingListeners is a map by modelId, even if more than one model load event has occurred (which I am not convinced is possible), the listeners are only notified once, since the map entry is removed in line 170. The second thread to enter the synchronized block would see that loadingListeners does not have a key entry for modelId and will exit early without putting the model in cache.
With "twice in memory" I did not mean the cache: that would be impossible. I mean the instances in memory. Say A and B call this, and both load it. Then A enters the synchronized block first, puts it into the cache and "returns" the model. Whoever that is uses instanceA. Then B enters the block, overwrites the cache entry and "returns" the model instanceB.
In the end you have instanceB in the cache and both instanceA and instanceB in memory.
This whole thing is a bit contrived and probably cannot happen due to the call chain before it.
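The computeIfAbsent fix suggested in this thread can be demonstrated in isolation (a minimal sketch with a ConcurrentHashMap standing in for the real cache; the counter is only there to make the single-load guarantee observable):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the suggested fix: with put(), two racing callers can both load
// the model and hand out two distinct instances; computeIfAbsent() runs the
// loader at most once per key, so every caller shares one cached instance.
class CacheRaceSketch {
    static final ConcurrentHashMap<String, Object> cache = new ConcurrentHashMap<>();
    static final AtomicInteger loads = new AtomicInteger();

    static Object expensiveLoad(String modelId) {
        loads.incrementAndGet();   // count how many times we really load
        return new Object();       // stand-in for a parsed model definition
    }

    static Object getOrLoad(String modelId) {
        // put()-based variant would be:
        //   Object m = expensiveLoad(modelId); cache.put(modelId, m); return m;
        // which loads once per caller under a race.
        return cache.computeIfAbsent(modelId, CacheRaceSketch::expensiveLoad);
    }
}
```

ConcurrentHashMap.computeIfAbsent applies the mapping function atomically, which is what closes both the double-load and the two-instances-in-memory windows described above.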
loadedModels.keySet().retainAll(allReferencedModelKeys);
ClusterState state = event.state();
IngestMetadata currentIngestMetadata = state.metaData().custom(IngestMetadata.TYPE);
Set<String> allReferencedModelKeys = getReferencedModelKeys(currentIngestMetadata);
What if the cluster state update does not change anything w.r.t. models, which is probably mostly the case? I'm missing a shortcut optimization.
Nice catch, should exit early if allReferencedModelKeys.equals(referencedModels).
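The early-exit idea amounts to comparing the newly computed set of referenced model ids against the previously seen one before doing any cache work (a sketch with illustrative names, not the real cluster state listener API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the shortcut: most cluster state updates do not touch ingest
// pipelines, so if the set of referenced model ids is unchanged there is
// nothing to load or evict and we can return immediately.
class ReferencedModelsSketch {
    private Set<String> referencedModels = new HashSet<>();
    int rebuilds = 0;   // counts how often we actually did cache work

    void clusterChanged(String... referencedNow) {
        Set<String> allReferencedModelKeys = new HashSet<>(Arrays.asList(referencedNow));
        if (allReferencedModelKeys.equals(referencedModels)) {
            return;     // shortcut: nothing changed w.r.t. models
        }
        rebuilds++;
        // ... evict models no longer referenced, start loading new ones ...
        referencedModels = allReferencedModelKeys;
    }
}
```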
for (ActionListener<Model> listener : modelAndListeners.v2()) {
    listener.onFailure(new ElasticsearchException(msg));
}
}
Nit: it would aid reading to mark the end of a synchronized block with a comment, e.g. } // synchronized(...)
LGTM
* [ML][Inference] adds lazy model loader and inference (#47410)

This adds a couple of things:

- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model

Related Feature PRs:

* [ML][Inference] Adjust inference configuration option API (#47812)
* [ML][Inference] adds logistic_regression output aggregator (#48075)
* [ML][Inference] Adding read/del trained models (#47882)
* [ML][Inference] Adding inference ingest processor (#47859)
* [ML][Inference] fixing classification inference for ensemble (#48463)
* [ML][Inference] Adding model memory estimations (#48323)
* [ML][Inference] adding more options to inference processor (#48545)
* [ML][Inference] handle string values better in feature extraction (#48584)
* [ML][Inference] Adding _stats endpoint for inference (#48492)
* [ML][Inference] add inference processors and trained models to usage (#47869)
* [ML][Inference] add new flag for optionally including model definition (#48718)
* [ML][Inference] adding license checks (#49056)
* [ML][Inference] Adding memory and compute estimates to inference (#48955)
* [ML] ML Model Inference Ingest Processor (#49052): squash of the feature PRs listed above, plus fixing the version of indexed docs for model inference
This adds model memory estimations and a setting to limit the amount of memory used by the cached models.
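The combination of memory estimation and a cache limit can be illustrated as a byte-weighted LRU cache: each entry is weighed by its estimated model memory, and least-recently-used models are evicted once the configured byte budget is exceeded. This is a minimal sketch under assumed names, not the actual Elasticsearch implementation or its setting:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a cache bounded by total estimated model memory rather than by
// entry count. The budget would come from the new setting this PR adds.
class ModelMemoryCacheSketch {
    private final long maxBytes;
    private long usedBytes = 0;
    private final LinkedHashMap<String, Long> models =   // modelId -> estimated bytes
        new LinkedHashMap<>(16, 0.75f, true);            // accessOrder=true gives LRU order

    ModelMemoryCacheSketch(long maxBytes) { this.maxBytes = maxBytes; }

    void put(String modelId, long estimatedBytes) {
        Long old = models.remove(modelId);
        if (old != null) usedBytes -= old;
        models.put(modelId, estimatedBytes);
        usedBytes += estimatedBytes;
        // evict least-recently-used models until we fit the budget again
        var it = models.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, Long> eldest = it.next();
            if (eldest.getKey().equals(modelId)) continue;   // keep the new entry
            usedBytes -= eldest.getValue();
            it.remove();
        }
    }

    boolean contains(String modelId) { return models.containsKey(modelId); }
    long usedBytes() { return usedBytes; }
}
```

Weighing entries by estimated bytes is what makes the memory estimation part of this PR necessary: without a per-model size estimate, a count-based cache cannot bound heap usage.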