
Check model output when measuring embedding size #535

Merged
merged 15 commits into elastic:main from tuple-check on May 25, 2023

Conversation

davidkyle
Member

@davidkyle davidkyle commented May 2, 2023

#532 added code for generating a text_embedding and measuring the number of dimensions to set the embedding_size field. The code expected the model output to be a tuple, but for some models that is not the case.
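For illustration, the kind of defensive check involved might look like the following sketch (the names measure_embedding_size and sample_inputs are hypothetical, not the PR's actual code):

import torch

def measure_embedding_size(model: torch.nn.Module, sample_inputs: tuple) -> int:
    # Run the traced model once on representative inputs. Some models
    # return a tuple of tensors, others return a single tensor, which is
    # why unconditionally unpacking the output as a tuple can fail.
    with torch.no_grad():
        output = model(*sample_inputs)
    # Unwrap tuple/list outputs instead of assuming them.
    if isinstance(output, (tuple, list)):
        output = output[0]
    # The embedding size is the length of the last dimension.
    return int(output.shape[-1])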

The embedding_size field was added to Elasticsearch in v8.8, so the new field is only added if the ES cluster is v8.8 or greater.

Closes #533

@davidkyle davidkyle added the bug, topic:NLP, and eland_import_hub_model labels on May 2, 2023
@sethmlarson sethmlarson (Contributor) left a comment

No issues from my end, but is there a way we can add a test case for this?

Shap is incompatible with NumPy 1.24 due to a deprecated usage becoming
an error. There is no fix in Shap yet so an earlier version of NumPy must
be used.
Pandas 2.0 was recently released; we will continue to use the latest 1.5 release
to avoid any incompatibilities.
…g a NLP model. (elastic#522)

PyTorch models traced in version 1.13 of PyTorch cannot be evaluated in 
version 1.9 or earlier. With this upgrade Eland becomes incompatible with 
pre-8.7 Elasticsearch and will refuse to upload a model to the cluster. 
In this scenario either upgrade Elasticsearch or use an earlier version of Eland.
@davidkyle
Member Author

@sethmlarson I've added the test you asked for, and I've added a check against the version of the Elasticsearch cluster so the script will not use the new feature unless it is supported by Elasticsearch.
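A rough idea of the shape such a regression test could take, using a toy model that returns a bare tensor (purely illustrative; this does not use eland's real test helpers):

import torch
import torch.nn as nn

class PlainOutputModel(nn.Module):
    # Returns a bare tensor rather than a tuple, mimicking the models
    # that triggered "ValueError: not enough values to unpack".
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.zeros(x.shape[0], 384)

def test_embedding_size_with_non_tuple_output():
    output = PlainOutputModel()(torch.zeros(1, 10))
    # Unwrap only when the model actually returns a tuple.
    if isinstance(output, (tuple, list)):
        output = output[0]
    assert output.shape[-1] == 384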

@sethmlarson sethmlarson (Contributor) left a comment

Thanks! Here are some thoughts:

eland/ml/pytorch/transformers.py
self,
model_id: str,
task_type: str,
es_version: Tuple[int, int, int],
Contributor

I'm a little apprehensive about forcing users to pass an es_version directly into this class; is there any other way we can handle this? (Should we wait to do any version-specific logic until the model needs to be serialized to Elasticsearch, at which point we'll have a client instance?)

Member Author

The TransformerModel class is responsible for loading the model and converting it to a format that can be uploaded to Elasticsearch. The PyTorchModel class does the actual upload and has the client. Yes, it would be nice not to require the version here, but that would require a significant refactoring of those two classes and would probably break existing usages.

The version is required so that new features can be supported and so we avoid creating incompatible configurations on older clusters; I expect we will see more of this type of logic in the future.

I've made the version parameter optional so that the upgrade won't break any existing scripts using the class. If the version is not set, only the most basic settings are used. I've also added docs explaining the usage.

Another option is to pass the client rather than the cluster version to the TransformerModel class, but then you need a working client and cluster to get the version, when the use case might be just to save the model locally.
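As a simplified sketch of the optional parameter described above (not the real class body):

from typing import Optional, Tuple

class TransformerModel:
    def __init__(
        self,
        model_id: str,
        task_type: str,
        es_version: Optional[Tuple[int, int, int]] = None,
    ):
        # When es_version is None, only the most basic, broadly compatible
        # settings are generated; version-gated fields such as embedding_size
        # are emitted only when the cluster version is known to support them.
        self._model_id = model_id
        self._task_type = task_type
        self._es_version = es_version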

Contributor

Sounds good, if you're happy with the implementation then I am okay with it.

@davidkyle davidkyle enabled auto-merge (squash) May 24, 2023 11:04
@sethmlarson sethmlarson (Contributor) left a comment

Had one nit, otherwise LGTM!

)
# The embedding_size parameter was added in Elasticsearch 8.8
# If the version is not known, use the basic config
if es_version is None or (es_version[0] <= 8 and es_version[1] < 8):
Contributor

Let's simplify this to es_version is None or es_version < (8, 8)
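Python compares tuples element by element, so the shorter expression behaves as expected for patch releases and older majors, for example:

(8, 7, 1) < (8, 8)   # True:  8.7.x does not support embedding_size
(8, 8, 0) < (8, 8)   # False: 8.8.0 does support it
(7, 17, 0) < (8, 8)  # True:  older major versions are excluded as well

The element-wise comparison also naturally covers older major versions such as 7.x.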

@davidkyle davidkyle merged commit 32ab988 into elastic:main May 25, 2023
@davidkyle davidkyle deleted the tuple-check branch May 25, 2023 18:11
picandocodigo pushed a commit that referenced this pull request Jul 11, 2023
…#535)

Only add the embedding_size config option if the target Elasticsearch 
cluster version supports it
Development

Successfully merging this pull request may close these issues.

Getting "ValueError: not enough values to unpack" when using text_embedding models
2 participants