
Check model output when measuring embedding size #535

Merged
merged 15 commits into elastic:main from tuple-check on May 25, 2023

Conversation

davidkyle
Member

@davidkyle davidkyle commented May 2, 2023

#532 added code for generating a text_embedding and measuring the number of dimensions to set the embedding_size field. The code expected the model output to be a tuple, but for some models that is not the case.
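For illustration, the kind of defensive check involved might look like the following sketch (the names measure_embedding_size and sample_inputs are hypothetical, not the PR's actual code):

import torch

def measure_embedding_size(model: torch.nn.Module, sample_inputs: tuple) -> int:
    # Run the traced model once on representative inputs. Some models
    # return a tuple of tensors, others return a single tensor, which is
    # why unconditionally unpacking the output as a tuple can fail.
    with torch.no_grad():
        output = model(*sample_inputs)
    # Unwrap tuple/list outputs instead of assuming them.
    if isinstance(output, (tuple, list)):
        output = output[0]
    # The embedding size is the length of the last dimension.
    return int(output.shape[-1])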

The embedding_size field was added to Elasticsearch in v8.8, so the new field is only added if the ES cluster is v8.8 or greater.

Closes #533

@davidkyle davidkyle added the bug, topic:NLP, and eland_import_hub_model labels on May 2, 2023
@sethmlarson sethmlarson (Contributor) left a comment

No issues from my end, but is there a way we can add a test case for this?

Shap is incompatible with NumPy 1.24 due to a deprecated usage becoming
an error. There is no fix in Shap yet so an earlier version of NumPy must
be used.
Pandas 2.0 was recently released; we will continue to use the latest 1.5 release
to avoid any incompatibilities.
…g a NLP model. (elastic#522)

PyTorch models traced in version 1.13 of PyTorch cannot be evaluated in 
version 1.9 or earlier. With this upgrade Eland becomes incompatible with 
pre-8.7 Elasticsearch and will refuse to upload a model to the cluster. 
In this scenario either upgrade Elasticsearch or use an earlier version of Eland.
@davidkyle
Member Author

@sethmlarson I've added the test you asked for, and I've added a check against the version of the Elasticsearch cluster so the script will not use the new feature unless it is supported by Elasticsearch.
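A rough idea of the shape such a regression test could take, using a toy model that returns a bare tensor (purely illustrative; this does not use eland's real test helpers):

import torch
import torch.nn as nn

class PlainOutputModel(nn.Module):
    # Returns a bare tensor rather than a tuple, mimicking the models
    # that triggered "ValueError: not enough values to unpack".
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.zeros(x.shape[0], 384)

def test_embedding_size_with_non_tuple_output():
    output = PlainOutputModel()(torch.zeros(1, 10))
    # Unwrap only when the model actually returns a tuple.
    if isinstance(output, (tuple, list)):
        output = output[0]
    assert output.shape[-1] == 384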

@sethmlarson sethmlarson (Contributor) left a comment

Thanks! Here are some thoughts:

eland/ml/pytorch/transformers.py
self,
model_id: str,
task_type: str,
es_version: Tuple[int, int, int],
Contributor

I'm a little apprehensive about forcing users to pass an es_version directly into this class; is there any other way we can handle this? (Should we wait to do any version-specific logic until the model needs to be serialized to Elasticsearch, at which point we'll have a client instance?)

Member Author

The TransformerModel class is responsible for loading the model and converting it to a format that can be uploaded to Elasticsearch. The PyTorchModel class does the actual upload and has the client. Yes, it would be nice not to require the version here, but that would require a significant refactoring of those two classes and would probably break existing usages.

The version is required so that new features can be supported and so we avoid creating incompatible configurations on older clusters; I expect we will see more of this type of logic in the future.

I've made the version parameter optional so that the upgrade won't break any existing scripts using the class. If the version is not set, only the most basic settings are used. I've also added docs explaining the usage.

Another option is to pass the client rather than the cluster version to the TransformerModel class, but then you need a working client and cluster to get the version, when the use case might be just to save the model locally.
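As a simplified sketch of the optional parameter described above (not the real class body):

from typing import Optional, Tuple

class TransformerModel:
    def __init__(
        self,
        model_id: str,
        task_type: str,
        es_version: Optional[Tuple[int, int, int]] = None,
    ):
        # When es_version is None, only the most basic, broadly compatible
        # settings are generated; version-gated fields such as embedding_size
        # are emitted only when the cluster version is known to support them.
        self._model_id = model_id
        self._task_type = task_type
        self._es_version = es_version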

Contributor

Sounds good, if you're happy with the implementation then I am okay with it.

@davidkyle davidkyle enabled auto-merge (squash) May 24, 2023 11:04
@sethmlarson sethmlarson (Contributor) left a comment

Had one nit, otherwise LGTM!

)
# The embedding_size parameter was added in Elasticsearch 8.8
# If the version is not known, use the basic config
if es_version is None or (es_version[0] <= 8 and es_version[1] < 8):
Contributor

Let's simplify this to es_version is None or es_version < (8, 8)
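Python compares tuples element by element, so the shorter expression behaves as expected for patch releases and older majors, for example:

(8, 7, 1) < (8, 8)   # True:  8.7.x does not support embedding_size
(8, 8, 0) < (8, 8)   # False: 8.8.0 does support it
(7, 17, 0) < (8, 8)  # True:  older major versions are excluded as well

The element-wise comparison also naturally covers older major versions such as 7.x.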

@davidkyle davidkyle merged commit 32ab988 into elastic:main May 25, 2023
@davidkyle davidkyle deleted the tuple-check branch May 25, 2023 18:11
picandocodigo pushed a commit that referenced this pull request Jul 11, 2023
…#535)

Only add the embedding_size config option if the target Elasticsearch 
cluster version supports it
Development

Successfully merging this pull request may close these issues.

Getting "ValueError: not enough values to unpack" when using text_embedding models
2 participants