
[ML][Inference] Adding model memory estimations #48323

Conversation

benwtrent
Member

This adds model memory estimations and a setting to limit the amount of memory used by the cached models.
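For illustration only, a node-scoped setting capping the cache size could be declared roughly as below; the class name, setting key, and 40% default are placeholders, not the setting this PR actually registers.

import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.ByteSizeValue;

// Hypothetical sketch only: the class name, setting key, and default are
// placeholders for illustration, not the setting added in this PR.
public class InferenceCacheSettings {
    public static final Setting<ByteSizeValue> INFERENCE_MODEL_CACHE_SIZE =
        Setting.memorySizeSetting("xpack.ml.inference_model.cache_size", "40%",
            Setting.Property.NodeScope);
}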

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Contributor

I think we will be creating a problem for ourselves by using the term "model memory" for this, and if a future extension of this work ever uses a field model_memory_limit then that will definitely cause confusion. Previous uses of model_memory_limit have referred to memory used by ML's native processes. Some people have already been confused about whether this in some way relates to the Java heap size. If we end up with some places in the ML plugin having a model_memory_limit that relates to native memory used by external processes and other places having a model_memory_limit that relates to memory used in the Java heap then that will make it really hard to answer future questions about memory limits succinctly.

If there is any other functionality where Java heap size usage can be restricted by a setting then it may be best to reuse that terminology. Or if we can somehow frame this functionality as some sort of "circuit breaker", then that may be good as that's the terminology used by aggregations (although for a slightly different use case). Or failing that we should think of a new term that isn't "model memory".
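To make the circuit-breaker option concrete, a hedged sketch against the existing CircuitBreaker interface might look like the following; the class, method names, and label format are assumptions, and nothing like this is wired up by this PR.

import org.elasticsearch.common.breaker.CircuitBreaker;

// Hedged illustration of the circuit-breaker framing; which breaker instance
// would be used is an open question, and this PR does not implement it.
public class ModelMemoryAccounting {

    // Reserve the estimated bytes for a model about to be cached. This throws
    // CircuitBreakingException if the configured limit would be exceeded, so the
    // model would simply not be cached.
    public static void onModelCached(CircuitBreaker breaker, long estimatedModelBytes, String modelId) {
        breaker.addEstimateBytesAndMaybeBreak(estimatedModelBytes, "inference model [" + modelId + "]");
    }

    // Give the bytes back when the model is evicted from the cache.
    public static void onModelEvicted(CircuitBreaker breaker, long estimatedModelBytes) {
        breaker.addWithoutBreaking(-estimatedModelBytes);
    }
}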

@benwtrent benwtrent force-pushed the feature/ml-inference-model-cache-limits branch from 3662d19 to 2e91eed Compare October 23, 2019 15:56
@benwtrent benwtrent marked this pull request as ready for review October 23, 2019 15:57
Contributor

@hendrikmuhs hendrikmuhs left a comment

I had a look at the cache implementation and hopefully added some useful feedback.

* How long should a model stay in the cache since its last access
*
* Since the model loading service is used in ingest scenarios, it makes sense for unused models to be evicted.
* If a processor references a model, but does not use that model for an extended period of time, we will evict it.
Contributor

Nit: It's not clear to me how model usage is tracked (promotion of cache entries by usage). Does the ingest processor get the model from the cache for every document or for every x documents? It would be good to document the proper interaction with this service.

Member Author

The ingest processor gets the model as many times as it needs. When a model is referenced by any processor, it is loaded into the cache. I can try to clarify the comment.
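As an aside, a minimal sketch of the usage-promotion idea being asked about (illustrative only, not the actual ModelLoadingService): an access-ordered map where every lookup promotes the entry, so per-document gets from the ingest processor naturally count as usage.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal, illustrative sketch; the class and method names are placeholders.
public class SimpleModelCache<V> {
    private final int maxEntries;
    private final Map<String, V> cache;

    public SimpleModelCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder = true: get() moves an entry to the most-recently-used position
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > SimpleModelCache.this.maxEntries;
            }
        };
    }

    public synchronized V get(String modelId) {
        return cache.get(modelId);   // a hit counts as usage and promotes the entry
    }

    public synchronized void put(String modelId, V model) {
        cache.put(modelId, model);
    }
}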

@@ -77,13 +125,15 @@ public void getModel(String modelId, ActionListener<Model> modelActionListener)
*/
Contributor

Returns false if the model is not loaded or actively being loaded

Nit: This is somewhat unclear. From the code I gather that a model is not loaded if it is not referenced, which means it is not used, e.g. in a pipeline. But this method is also called from getModel(...), which is confusing: I want to explicitly get a model but might not get it in the end?

Member Author

@benwtrent benwtrent Oct 24, 2019

Not really. If getModel(...) sees that somebody else kicked off a load (a cluster state change?) but the model is not loaded yet, this method will add the listener to the listener queue.

If the caller sees that the model is not in the middle of loading AND is not loaded, it will make a direct call to eager load it and NOT attempt to cache it.

I will try to clarify the comment.
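A minimal sketch of the pattern described in this reply (illustrative names, not the real service): cached models are returned immediately, callers piggyback on an in-flight load via a listener queue, and a false return tells the caller to load the model itself without caching it.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

// Minimal, illustrative sketch; the names are placeholders for this PR's actual classes.
public class LoadCoordinator<M> {
    private final Map<String, M> cache = new HashMap<>();
    private final Map<String, Queue<Consumer<M>>> loadingListeners = new HashMap<>();

    public synchronized boolean getModelIfLoadingOrLoaded(String modelId, Consumer<M> listener) {
        M cached = cache.get(modelId);
        if (cached != null) {
            listener.accept(cached);      // already loaded: answer from the cache
            return true;
        }
        Queue<Consumer<M>> queue = loadingListeners.get(modelId);
        if (queue != null) {
            queue.add(listener);          // somebody else kicked off a load: wait for it
            return true;
        }
        return false;                     // neither loaded nor loading: eager, uncached load
    }

    // Called when a load kicks off (e.g. because a processor now references the model).
    public synchronized void startLoading(String modelId) {
        loadingListeners.computeIfAbsent(modelId, id -> new ArrayDeque<>());
    }
}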

}
localModelCache.put(modelId, loadedModel);
Contributor

I think the implementation has a small race: if more than one thread calls this with the same modelId/config, the model gets loaded twice. One instance overwrites the other in the cache, but both are also kept in memory, as the loaded model instances are returned (line 180). If line 173 kicks in you also load the model superfluously.

I suggest removing line 168 and using computeIfAbsent on line 176 instead of put. That ensures the model is loaded only when necessary, and exactly once.

Member Author

@benwtrent benwtrent Oct 24, 2019

If line 173 kicks in you also load the model superfluously.

That is true, but the loading has already kicked off. This check is here for the scenario where a processor references a model_id and the loading starts, but the processor is removed before the model can be fully loaded. In that case we should not cache it, as nothing cares about it.

If this method gets called more than once for the same modelId and loadedModel, it will not get added to cache twice.

Since loadingListeners is a map keyed by modelId, even if more than one model load event has occurred (which I am not convinced is possible), the listeners are only notified once, since the map entry is removed on line 170.

The second thread to enter the synchronized block would see that loadingListeners does not have a key entry for modelId and will exit early without putting the model in cache.

Contributor

With "twice in memory" I did not mean the cache: that would be impossible. I mean the instances in memory. Say A and B call this and both load the model. Then A enters the synchronized block first, puts it into the cache, and "returns" the model, so whoever called it uses instanceA. Then B enters the block, overwrites the cache entry, and "returns" instanceB.
In the end you have instanceB in the cache and both instanceA and instanceB in memory.

This whole scenario is somewhat contrived and probably cannot happen due to the call chain that precedes this code.
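For reference, the only-once behaviour the author describes could be sketched as a method on the LoadCoordinator sketch from the earlier thread (again illustrative, not the actual code): removing the loadingListeners entry under the lock means a second completion for the same modelId finds nothing to do.

// Illustrative continuation of the LoadCoordinator sketch above, not the real service.
public synchronized void handleLoadSuccess(String modelId, M loadedModel) {
    Queue<Consumer<M>> listeners = loadingListeners.remove(modelId);
    if (listeners == null) {
        // Nothing is waiting for this model any more (e.g. the referencing processor
        // was removed while the load was in flight), so do not cache it.
        return;
    }
    cache.put(modelId, loadedModel);
    for (Consumer<M> listener : listeners) {
        listener.accept(loadedModel);
    }
}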

loadedModels.keySet().retainAll(allReferencedModelKeys);
ClusterState state = event.state();
IngestMetadata currentIngestMetadata = state.metaData().custom(IngestMetadata.TYPE);
Set<String> allReferencedModelKeys = getReferencedModelKeys(currentIngestMetadata);
Contributor

What if the cluster state update does not change anything with respect to models, which is probably the most common case? I am missing a shortcut optimization here.

Member Author

Nice catch, we should exit early if allReferencedModelKeys.equals(referencedModels).
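A minimal sketch of that shortcut (names are illustrative, not the actual ModelLoadingService): compare the newly computed set of referenced models with the previous one and return early when nothing changed.

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the early-exit optimization discussed above.
public class ReferencedModelTracker {
    private Set<String> referencedModels = new HashSet<>();

    public synchronized void clusterChanged(Set<String> allReferencedModelKeys) {
        if (allReferencedModelKeys.equals(referencedModels)) {
            return; // the cluster state update changed nothing with respect to models
        }
        referencedModels = new HashSet<>(allReferencedModelKeys);
        // ... the real service would now evict cached models that are no longer
        // referenced, as in loadedModels.keySet().retainAll(allReferencedModelKeys).
    }
}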

for (ActionListener<Model> listener : modelAndListeners.v2()) {
listener.onFailure(new ElasticsearchException(msg));
}
}
Contributor

Nit: for readability it would be good to mark the end of a synchronized block with a comment, e.g. } // synchronized(...)

Contributor

@hendrikmuhs hendrikmuhs left a comment

LGTM


@benwtrent benwtrent merged commit c6d977c into elastic:feature/ml-inference Oct 25, 2019
@benwtrent benwtrent deleted the feature/ml-inference-model-cache-limits branch October 25, 2019 15:46
benwtrent added a commit that referenced this pull request Nov 18, 2019
* [ML][Inference] adds lazy model loader and inference (#47410)

This adds a couple of things:

- A model loader service that is accessible via transport calls. This service will load in models and cache them. They will stay loaded until a processor no longer references them
- A Model class and its first sub-class LocalModel. Used to cache model information and run inference.
- Transport action and handler for requests to infer against a local model
Related Feature PRs: 
* [ML][Inference] Adjust inference configuration option API (#47812)

* [ML][Inference] adds logistic_regression output aggregator (#48075)

* [ML][Inference] Adding read/del trained models (#47882)

* [ML][Inference] Adding inference ingest processor (#47859)

* [ML][Inference] fixing classification inference for ensemble (#48463)

* [ML][Inference] Adding model memory estimations (#48323)

* [ML][Inference] adding more options to inference processor (#48545)

* [ML][Inference] handle string values better in feature extraction (#48584)

* [ML][Inference] Adding _stats endpoint for inference (#48492)

* [ML][Inference] add inference processors and trained models to usage (#47869)

* [ML][Inference] add new flag for optionally including model definition (#48718)

* [ML][Inference] adding license checks (#49056)

* [ML][Inference] Adding memory and compute estimates to inference (#48955)
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 18, 2019
benwtrent added a commit that referenced this pull request Nov 18, 2019
* [ML] ML Model Inference Ingest Processor (#49052)


* fixing version of indexed docs for model inference