[Feature Request]: Retrieve A List Of Metadata Or A List Of Collections #2516

Closed
Alexie-Z-Yevich opened this issue Jul 15, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@Alexie-Z-Yevich

Describe the problem

Is there a function in Chroma that can retrieve a list of metadata or a list of collections?
I ran into a problem while building a large language model knowledge base on ChromaDB for multiple users: I could not tell the documents of different users apart. My first design gave each user their own ChromaDB and each document its own collection, but I could not find an interface for querying several collections together. I then switched to a single ChromaDB for all users, with one collection per user and the documents inside each collection distinguished by metadata. The problem there is that metadata must be set as key-value pairs, and no function directly returns a list of the metadata values.
I am new to practical development and have only studied this in school, so I am not sure whether my requirements fit the Chroma team's design. My current solution is to store each document's name in metadata as {'source': file_name} and use set([metadata['source'] for metadata in collections.get()['metadatas']]) to obtain the list of documents.
But this calls get() to fetch all the data every time, which is very costly. So I would like some advice from the Chroma team: does Chroma already have a mechanism that meets my needs? If not, would you consider adding one in the future?
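A minimal sketch of the workaround described above, assuming `get()` accepts an `include` argument so that only metadata is fetched (no documents or embeddings). `unique_sources` is a hypothetical helper name, and the result is assumed to be a dict with a `metadatas` list. This still scans every record, so it shrinks the payload rather than removing the scan.

```python
from typing import Any, Mapping


def unique_sources(get_result: Mapping[str, Any], key: str = "source") -> list[str]:
    """Collect the distinct values of one metadata key from a get()-style result."""
    seen = set()
    for metadata in get_result.get("metadatas") or []:
        # Records stored without metadata may come back as None.
        if metadata and key in metadata:
            seen.add(metadata[key])
    return sorted(seen)


# With a real collection this would be (assumption, not run here):
#   result = collection.get(include=["metadatas"])
# A hand-written result of the same shape is used so the sketch runs on its own.
result = {
    "ids": ["a-0", "a-1", "b-0"],
    "metadatas": [
        {"source": "medical.pdf"},
        {"source": "medical.pdf"},
        {"source": "game.pdf"},
    ],
}
print(unique_sources(result))
```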

Describe the proposed solution

Add an interface that retrieves lists of collections or metadata in a single call, or provide examples of how different users can select specific documents in Chroma.

Alternatives considered

No response

Importance

would make my life easier

Additional Information

No response

@Alexie-Z-Yevich Alexie-Z-Yevich added the enhancement New feature or request label Jul 15, 2024
@tazarov
Contributor

tazarov commented Jul 15, 2024

@Alexie-Z-Yevich, thanks for your explanation. Let me ask a clarifying question. If your app is multi-tenant, e.g., each user owns their collection or DB, why do you need to query across all users' knowledgebases? Is that some admin functionality?

@Alexie-Z-Yevich
Author

I do not need to query across everything users have uploaded. That said, your question reminds me of another possibility: company managers could upload some public information for everyone's reference.
What I actually mean is this: if a user uploads multiple documents, I want a list of everything they have uploaded, so that the front-end can show the user their documents. For example, if they upload a medical document first and a game document second, they can choose which document to recall from, so that a search for game content does not return medical content by mistake.
It would also make it easy to delete documents by their uploaded name or another identifier, and to add restrictions (such as a maximum of ten 100 MB documents per user), which keeps a multi-user setup easy to maintain.

@tazarov
Contributor

tazarov commented Jul 15, 2024

@Alexie-Z-Yevich, the simplest solution to your problem can be document categories which can be implemented in the following ways:

Metadata, single collection per user:

Each time a user uploads a document, they specify what category it falls into. You chunk the document (if necessary) and attach metadata with the corresponding category to each chunk.

You query data using:

collection.query(query_texts=[user_query_input], where={"category": user_query_category})

You can keep the list of those categories either outside of Chroma, in a separate relational DB, or within Chroma: 1) have a metadata document that gets updated each time a category is added or removed (this is hacky, but can work), or 2) query for all categories each time, which I believe is what you do right now.

Doing collection.get() each time you list the user's categories, or even the individual documents, can be expensive, but perhaps you can use a cache to store that info.
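The caching idea could look roughly like this. It is a sketch, not part of Chroma: `CategoryCache`, `load_categories`, and the TTL value are hypothetical, and the loader stands in for whatever expensive lookup (for example a full `collection.get()`) produces the list.

```python
import time
from typing import Callable, Dict, List, Tuple


class CategoryCache:
    """Per-user cache of category lists with a time-to-live and explicit invalidation."""

    def __init__(self, loader: Callable[[str], List[str]], ttl_seconds: float = 300.0):
        self._loader = loader
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, user_id: str) -> List[str]:
        entry = self._entries.get(user_id)
        now = time.monotonic()
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the expensive lookup
        categories = self._loader(user_id)
        self._entries[user_id] = (now, categories)
        return categories

    def invalidate(self, user_id: str) -> None:
        """Call after an upload or delete so the next get() reloads."""
        self._entries.pop(user_id, None)


calls = []


def load_categories(user_id: str) -> List[str]:
    # Placeholder for the expensive query against Chroma.
    calls.append(user_id)
    return ["game", "medical"]


cache = CategoryCache(load_categories)
cache.get("alice")
cache.get("alice")         # served from the cache
cache.invalidate("alice")  # e.g. after the user uploads a new document
cache.get("alice")         # reloaded after invalidation
print(len(calls))
```

Invalidating on every upload or delete keeps the list correct without waiting for the TTL to expire; the TTL only guards against changes made outside the application.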

Collection per user per category:

Each type of document goes into a separate collection. Each collection has metadata associated with the category it represents.

You can create collections like so:

client.get_or_create_collection(f"{user_id}-{category}", metadata={"category":category}) # make sure to sanitize user_id and category

Then, when you present the list of categories to the user:

user_categories = [c.metadata["category"] for c in client.list_collections() if c.name.startswith(f"{user_id}-")]

The list_collections operation is much cheaper than collection.get().
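The "sanitize user_id and category" step could be sketched as follows. The helper names are hypothetical, and the length limit (3 to 63 characters, alphanumeric at both ends) reflects Chroma's collection naming rules as I understand them, so check it against the version you run. Keeping `-` out of the individual parts makes it a safe separator, so a prefix check for user `al` cannot match collections of user `alice`.

```python
import re


def sanitize_part(value: str) -> str:
    """Lowercase and replace every run of characters other than a-z and 0-9 with '_'."""
    cleaned = re.sub(r"[^a-z0-9]+", "_", value.strip().lower()).strip("_")
    if not cleaned:
        raise ValueError(f"nothing usable left in {value!r}")
    return cleaned


def collection_name(user_id: str, category: str) -> str:
    """Build '<user>-<category>'; '-' never appears inside a part."""
    name = f"{sanitize_part(user_id)}-{sanitize_part(category)}"
    # Assumed Chroma limit: 3 to 63 characters.
    if not 3 <= len(name) <= 63:
        raise ValueError(f"collection name has invalid length: {name!r}")
    return name


def belongs_to(name: str, user_id: str) -> bool:
    """True if a collection name was built for this user."""
    return name.startswith(f"{sanitize_part(user_id)}-")


print(collection_name("Alice Smith", "Medical/Records"))
print(belongs_to("alice-game", "al"))
```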

@Alexie-Z-Yevich
Author

@tazarov It has effectively solved my problem. Thank you very much! ❤❤❤

@tazarov
Contributor

tazarov commented Jul 15, 2024

@Alexie-Z-Yevich, glad to be able to help. If you have any further questions, feel free to jump into Discord. We have a great community there and live interactions will make for speedy answers :)

@Alexie-Z-Yevich
Author

@tazarov Sorry, I still have a question. Can't multiple collections be searched at the same time? Categories separate the major topics, but each user will then have many collections, which makes recall difficult when the two relevant texts sit in different collections.💦💦
Using my earlier game and medical examples: for a medical game, do I need to recall 3 items from the game category and then 3 more from the medical category? (My prompt would become very long, because I cannot pick the best 3 out of the 6 results.)
My original intention was that each uploaded document gets a label, and by default n pieces of data are recalled from all of the user's documents. Users can view their upload history and choose to delete documents or to recall from only some of them (rather than all). The scenario is similar to a multiple-choice box.
Also, your first answer😂 raised a further question for me: is it possible to set up an admin role that adds some enterprise documents underneath all users without showing them on the user panel? I could upload the documents for every user and hide the related tags in the response, but that would waste a lot of space by duplicating the data in every user's database.

@Alexie-Z-Yevich
Author

I'm sorry I have so many questions🙏🙏🙏

@tazarov
Contributor

tazarov commented Jul 15, 2024

@Alexie-Z-Yevich, you can totally do concurrent queries across multiple collections. You can even do them in parallel/simultaneously if you'd like to reduce user latency.
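One way to query several collections in parallel and still keep only the best n hits overall is sketched below. `query_many` and `FakeCollection` are hypothetical names; the sketch assumes each collection exposes `query(query_texts=..., n_results=...)` and returns nested lists under `ids`, `documents`, and `distances`. Distances are only comparable across collections if all of them use the same embedding function and distance metric.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def query_many(collections: List[Any], query_text: str, n_results: int = 3) -> List[Dict[str, Any]]:
    """Query every collection in parallel and return the n_results closest hits overall."""

    def run(collection: Any) -> Dict[str, Any]:
        return collection.query(query_texts=[query_text], n_results=n_results)

    with ThreadPoolExecutor(max_workers=max(1, len(collections))) as pool:
        results = list(pool.map(run, collections))

    hits = []
    for result in results:
        # One query text was sent, so only index 0 of each nested list is used.
        for doc_id, document, distance in zip(
            result["ids"][0], result["documents"][0], result["distances"][0]
        ):
            hits.append({"id": doc_id, "document": document, "distance": distance})
    # Smaller distance means a closer match.
    hits.sort(key=lambda hit: hit["distance"])
    return hits[:n_results]


class FakeCollection:
    """Stand-in that returns a canned result in the nested-list shape assumed above."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, query_texts, n_results):
        rows = sorted(self._rows, key=lambda r: r[2])[:n_results]
        return {
            "ids": [[r[0] for r in rows]],
            "documents": [[r[1] for r in rows]],
            "distances": [[r[2] for r in rows]],
        }


game = FakeCollection([("g1", "boss guide", 0.20), ("g2", "patch notes", 0.70), ("g3", "lore", 0.90)])
medical = FakeCollection([("m1", "triage rules", 0.10), ("m2", "dosage table", 0.40), ("m3", "glossary", 0.95)])
top = query_many([game, medical], "medical game", n_results=3)
print([hit["id"] for hit in top])
```

Each collection is asked for n results, the 2n candidates are merged, and only the n closest are kept, so the prompt does not grow with the number of collections searched.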

I get your point that there can be an overlap between categories, and that does not necessarily need to translate into redundant copies. Let's take my first approach from above - Metadata, single collection per user:

Example:

collection.query(query_texts=[user_input_query], where={"$and": [{"medical": True}, {"game": True}]})

The key to the above query is that for each document, you store all the categories it belongs to as boolean metadata fields (this makes for speedier filters).

The same idea applies to the multi-collection approach: store the document in only one collection, but add metadata for the other categories it also matches. Then, at query time, add those additional categories to the filter as in the example above.
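For a multiple-choice selection of categories, the `where` filter over boolean flags could be built with a small helper like the one below. `category_filter` is a hypothetical name. The sketch assumes Chroma's `$and` and `$or` operators, and that a filter with several conditions should be wrapped in one of them rather than passed as a dict with several keys; verify this against the Chroma version in use.

```python
from typing import Any, Dict, List


def category_filter(categories: List[str], match_all: bool = False) -> Dict[str, Any]:
    """Build a `where` filter over boolean category flags.

    match_all=False matches documents in any selected category ($or);
    match_all=True matches only documents flagged with every category ($and).
    """
    if not categories:
        raise ValueError("at least one category is required")
    conditions = [{category: True} for category in categories]
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no operator
    return {"$and" if match_all else "$or": conditions}


print(category_filter(["medical", "game"]))
print(category_filter(["medical", "game"], match_all=True))
```

The result would then be passed as `where=category_filter(selected)` in `collection.query(...)`, with `selected` coming from the user's checkboxes.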

Let me know if the above makes sense in your context.

Regarding the admin role question - I guess there is more than one way to achieve this. With the single collection approach, you could add extra categories to search that are only related to corporate documents. For the multi-collection one, you can conversely search either a single collection with corporate docs and filter based on category or have multiple collections (e.g. one per major category) of corporate docs that you search in parallel to the user-selected collections.

@Alexie-Z-Yevich
Author

@tazarov I'm not sure I understand you correctly. As you can see, whether I use keys or values as labels, metadata returns not a list but only the key I stored last.
Part of my code is as follows:

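import chromadb

chroma_client = chromadb.PersistentClient(path="../vector_store/admin/")
# user_name is defined elsewhere; each user gets one collection.
collections = chroma_client.get_or_create_collection(user_name)
print(collections.get())  # the records in the collection, with their per-document metadatas
sources = collections.metadata  # collection-level metadata; or collections.metadata['A1']
print(sources)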

[Two screenshots attached: t1, t2]
