[Bug]: Kernel dies when training Vanna AI (with ChromaDB) with more than 90 vectors. #2405

Open
mkhansa opened this issue Jun 23, 2024 · 1 comment
Labels
bug Something isn't working

Comments

mkhansa commented Jun 23, 2024

What happened?

When I add training data (350 vectors) to a Vanna AI model, it consumes 100%+ of the CPU (12th Gen Intel(R) i7-12650H, 32 GB RAM) and the kernel "dies".
I have to reduce the number of vectors to only 70-80 for training to complete.

Python code:

import json

from vanna.chromadb import ChromaDB_VectorStore
from vanna.google import GoogleGeminiChat

class MyVanna(ChromaDB_VectorStore, GoogleGeminiChat):
    def __init__(self, config=None):
        # Persistent local Chroma store
        ChromaDB_VectorStore.__init__(self, config={
            "path": "../path/VannaAI_path"
        })
        GoogleGeminiChat.__init__(self, config={'api_key': 'XXXXXX', "temperature": 0, 'model': "gemini-1.5-pro"})

vn = MyVanna()
.....

with open('../training_data/doc_training_data.json', 'r') as f:
    documentation_list = json.load(f)["documentation"]

for rule in documentation_list:
    print(rule)
    vn.train(documentation=rule)  # the notebook crashes here

Versions

chromadb==0.5.3, Python 3.12.2, Windows

Relevant log output

No response

mkhansa added the bug label Jun 23, 2024
tazarov (Contributor) commented Jun 24, 2024

@mkhansa, I'm not familiar with Vanna AI or the problem it solves. At a glance, it seems to be a RAG application aimed at answering SQL-related questions. Its training workflow seems to use an LLM to create embeddings from docs, schemas, DDLs, etc. Its use of Chroma is also quite straightforward. Without a deeper understanding of what the training workflow does beyond adding embeddings for the documentation to Chroma, I cannot say what could be causing this issue.

To test further, could you run Chroma as a separate instance (e.g. in Docker or via the CLI), then create an HttpClient and pass it as configuration to the Vanna vector store:
https://github.com/vanna-ai/vanna/blob/8cc20fbd22d73dd0321cc7464860c0f15080f3ad/src/vanna/chromadb/chromadb_vector.py#L23

Then run your notebook as above and check your processes to see which one is consuming 100% of the CPU.
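
For reference, the setup described above could look roughly like the sketch below. It rests on several assumptions: that a Chroma server was started separately and is reachable at `localhost:8000` (Chroma's default port), that the `client` config key referenced at the linked line accepts an already-constructed Chroma client, and that the Gemini settings are the same as in the original snippet. `build_vanna` is a hypothetical helper name used only for illustration.

```python
def build_vanna(api_key, host="localhost", port=8000):
    """Create a Vanna instance whose vector store talks to a standalone Chroma server.

    Assumes Chroma is already running in its own process (Docker or CLI).
    """
    # Imported inside the function so that merely defining this helper
    # does not require the packages or a reachable server.
    import chromadb
    from vanna.chromadb import ChromaDB_VectorStore
    from vanna.google import GoogleGeminiChat

    # HttpClient connects to the external Chroma process instead of
    # running Chroma inside the notebook kernel.
    client = chromadb.HttpClient(host=host, port=port)

    class MyVanna(ChromaDB_VectorStore, GoogleGeminiChat):
        def __init__(self):
            # Assumption: the "client" key takes a ready-made client object.
            ChromaDB_VectorStore.__init__(self, config={"client": client})
            GoogleGeminiChat.__init__(
                self,
                config={
                    "api_key": api_key,
                    "temperature": 0,
                    "model": "gemini-1.5-pro",
                },
            )

    return MyVanna()
```

Calling `vn = build_vanna("XXXXXX")` and then running the same training loop keeps Chroma and the notebook kernel in separate processes, so the process list should show whether the load comes from the Chroma server or from the Python kernel.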
