[Bug]: Getting a Value Error when using the HuggingFace embedding function #2422

Open
tomersagi opened this issue Jun 26, 2024 · 2 comments
Labels
bug Something isn't working


@tomersagi

What happened?

Hi,
I am trying to use a custom embedding model through the Hugging Face API. I am following the instructions from here

However, when I try to use the embedding function I get the following error:

Traceback (most recent call last):
  File "C:\Users\OT48ZK\AppData\Local\Programs\PyCharm Professional\plugins\python\helpers-pro\pydevd_asyncio\pydevd_asyncio_utils.py", line 117, in _exec_async_code
    result = func()
             ^^^^^^
  File "<input>", line 1, in <module>
  File "C:\Users\OT48ZK\PycharmProjects\retrieval-er\venv\Lib\site-packages\chromadb\api\types.py", line 198, in __call__
    return validate_embeddings(maybe_cast_one_to_many_embedding(result))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\OT48ZK\PycharmProjects\retrieval-er\venv\Lib\site-packages\chromadb\api\types.py", line 507, in validate_embeddings
    raise ValueError(
ValueError: Expected each value in the embedding to be a int or float, got an embedding with ['list'] - [[[0.21432682871818542, -0.11559132486581802, ...

Minimal example:

import chromadb
import chromadb.utils.embedding_functions as emb

chroma_client = chromadb.PersistentClient(path='mehdie.db')
huggingface_ef = emb.HuggingFaceEmbeddingFunction(model_name='google-bert/bert-base-multilingual-cased', api_key='hf_...')

val = huggingface_ef(['Washington'])

Versions

Chroma 0.5.3
Python 3.11

@tomersagi tomersagi added the bug Something isn't working label Jun 26, 2024
@tomersagi
Author

OK, I understand the problem now. The embedding model I am using returns a k x F tensor for each input, where k is the number of tokens in the query phrase and F is the number of features. The Chroma Hugging Face embedding function expects a single 1 x F vector per input. To solve it, I had to subclass the embedding function and add a mean pooling step.
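
A minimal sketch of the mean pooling step described above, assuming the raw response is a nested list shaped batch x k x F (which the `[[[` in the error message suggests). The helper names `mean_pool` and `pool_batch` are hypothetical and not part of Chroma; in a subclass they would be applied to the raw response before it is returned. This is an unweighted average over all tokens, which may differ from what a given model expects.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a k x F list of token vectors into one F-length vector."""
    if not token_vectors:
        raise ValueError("cannot pool an empty token sequence")
    k = len(token_vectors)
    # zip(*rows) walks the features column by column across the k tokens.
    return [sum(column) / k for column in zip(*token_vectors)]


def pool_batch(raw: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Pool every document in a batch x k x F response to batch x F."""
    return [mean_pool(document) for document in raw]


# Hypothetical response for one document with two tokens and two features.
pooled = pool_batch([[[1.0, 2.0], [3.0, 4.0]]])
print(pooled)
```

The pooled result has one flat vector per input document, which is the shape the validation step in the traceback expects.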

Perhaps the documentation and error message can be improved here to describe the types of models this embedding function supports.

@tazarov
Contributor

tazarov commented Jun 27, 2024

@tomersagi, you are right that the naming is a bit misleading. Under the hood, we use sentence-transformers. Technically, it also works with plain transformer models, in which case it defaults to mean pooling without normalization.

We can do better by letting the user know that the model they are loading is not a sentence-transformer one, which may produce unsupported output.
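
One possible shape for such a check, as a sketch only: it inspects how deeply the returned lists are nested and raises a more descriptive error when token-level output (batch x k x F) is detected. The names `nesting_depth` and `check_embedding_shape` are hypothetical and are not Chroma APIs.

```python
from typing import Any


def nesting_depth(value: Any) -> int:
    """Count list levels by following the first element at each level."""
    depth = 0
    while isinstance(value, (list, tuple)) and value:
        depth += 1
        value = value[0]
    return depth


def check_embedding_shape(embeddings: Any) -> Any:
    """Return embeddings unchanged if they are batch x F, else raise."""
    depth = nesting_depth(embeddings)
    if depth == 3:
        # Three levels suggests one vector per token rather than per document.
        raise ValueError(
            "Model returned token-level embeddings (batch x tokens x features). "
            "It may not be a sentence-transformers model; apply a pooling step "
            "such as mean pooling to get one vector per document."
        )
    if depth != 2:
        raise ValueError(f"Expected a list of flat vectors, got nesting depth {depth}")
    return embeddings
```

A check like this would run on the raw response before the existing per-value validation, so the user sees the likely cause instead of a type error on a nested list.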
