-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check model output when measuring embedding size #535
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issues from my end, but is there a way we can add a test case for this?
Shap is incompatible with NumPy 1.24 due to a deprecated usage becoming an error. There is no fix in Shap yet so an earlier version of NumPy must be used. Pandas 2.0 was recently released we will continue to use the latest 1.5 release to avoid any incompatibilities.
…g a NLP model. (elastic#522) PyTorch models traced in version 1.13 of PyTorch cannot be evaluated in version 1.9 or earlier. With this upgrade Eland becomes incompatible with pre 8.7 Elasticsearch and will refuse to upload a model to the cluster. In this scenario either upgrade Elasticsearch or use an earlier version of Eland.
@sethmlarson I've the test you asked and I've added a check against the version of the Elasticsearch cluster so the script will not use the new feature unless it is supported by Elasticsearch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Here are some thoughts:
eland/ml/pytorch/transformers.py
Outdated
self, | ||
model_id: str, | ||
task_type: str, | ||
es_version: Tuple[int, int, int], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little apprehensive about forcing users to pass an es_version
directly into this class, is there any other way we can handle this? (Should we wait to do any version-specific logic until the model needs to be serialized to Elasticsearch, at which point we'll have a client instance?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TransformerModel
class is responsible for loading the model and converting it to a format that can be uploaded to Elasticsearch. The PyTorchModel
class does the actual upload and has the client. Yes is would be nice to not require the version here but it would require a significant refactoring of those 2 classes that would probably break existing usages.
The version is required so that new features can be supported and to avoid creating incompatible configurations on older clusters, I expect we will see more of this type of logic in the future.
I've made the version parameter optional so that upgrade won't break any existing scripts using the class. If the version is not set only the most basic settings are used. I've also added docs explaining the usage.
Another option is to pass the client rather than the cluster version to the TransformerModel
class but then you need a working client and cluster to get the version when the use case might be just to save the model locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, if you're happy with the implementation then I am okay with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had one nit, otherwise LGTM!
) | ||
# The embedding_size paramater was added in Elasticsearch 8.8 | ||
# If the version is not known use the basic config | ||
if es_version is None or (es_version[0] <= 8 and es_version[1] < 8): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's simplify this to es_version is None or es_version < (8, 8)
eland/ml/pytorch/transformers.py
Outdated
self, | ||
model_id: str, | ||
task_type: str, | ||
es_version: Tuple[int, int, int], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, if you're happy with the implementation then I am okay with it.
…#535) Only add the embedding_size config option if the target Elasticsearch cluster version supports it
#532 added code for generating a text_embedding and measuring the number of dimensions to set the
embedding_size
field. The code expected the model output to be a tuple but for some models that is not the case.embedding_size
was added to Elasticsearch in v8.8, the new field is only added if the ES cluster is v8.8 or greater.Closes #533