Possible class conflict between faiss-cpu and pymupdf #3689

Open
2 tasks
hairuoguo opened this issue Jul 25, 2024 · 6 comments

Comments

@hairuoguo

hairuoguo commented Jul 25, 2024

Summary

Hello,

I am currently using the ColBERT model for a work project, which uses faiss. We had pymupdf installed in the same conda environment, as we are trying to work with scanned documents as a datasource.

ColBERT calls faiss's kmeans.train(), which led to an AssertionError on line 109 in vector_to_array.py (assert classname.endswith('Vector')). When I took a look at the input to that function it was a pymupdf proxy object instead of belonging to the expected "[dtype]Vector" classes defined in faiss.
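To make the failing check concrete, here is a minimal sketch of a class-name test like the one described above. It is a simplified stand-in, not Faiss's actual code: `FloatVector`, `ForeignProxy`, and `check_vector_classname` are hypothetical names used only for illustration.

```python
class FloatVector:
    """Hypothetical stand-in for a Faiss-style '[dtype]Vector' wrapper."""

    def __init__(self, values):
        self._values = list(values)


class ForeignProxy:
    """Hypothetical stand-in for a proxy class owned by another package."""


def check_vector_classname(obj):
    # Mirrors the assertion quoted in the report: the wrapper's class name
    # must end with 'Vector' so the dtype prefix can be derived from it.
    classname = type(obj).__name__
    assert classname.endswith('Vector'), f"unexpected class {classname!r}"
    return classname[:-len('Vector')]


# A wrapper with the expected name passes and yields its dtype prefix.
prefix = check_vector_classname(FloatVector([1.0, 2.0]))

# An object wrapped in a foreign proxy class trips the assertion instead.
try:
    check_vector_classname(ForeignProxy())
    foreign_passed = True
except AssertionError:
    foreign_passed = False
```

If the object handed back by the native layer is wrapped in another package's proxy class, a name-based check of this kind fails even though the underlying data may be fine.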

This error disappeared after uninstalling pymupdf.

Platform

OS: Ubuntu 20.04.5 LTS (in docker container)

Faiss version: faiss-cpu 1.8.0.post1

Installed from: pip

Faiss compilation options: default flags

Running on:

  • [x] CPU
  • [ ] GPU

Interface:

  • [ ] C++
  • [x] Python

Reproduction instructions

Install faiss-cpu and pymupdf in a conda environment using pip.
Import fitz (pymupdf) and attempt to train a faiss Kmeans object.

OR

Install ColBERT from the ColBERT repo following its instructions.
Install pymupdf.
Import fitz (pymupdf) in code that runs ColBERT's Indexer class.
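The first set of steps could be scripted roughly as follows. This is a sketch under assumptions: the import order (fitz before faiss), the data shape, and the helper name `run_repro` are choices made for illustration and are not taken from the report. The function skips itself when any of the packages is missing.

```python
def run_repro(d=8, k=4, n=256, seed=0):
    """Try to train faiss k-means after importing fitz; return a status string."""
    try:
        import numpy as np
        import fitz  # noqa: F401  (pymupdf; imported only for its side effects)
        import faiss
    except ImportError as exc:
        return f"skipped: {exc}"

    # Random float32 training data, which is what faiss.Kmeans.train expects.
    rng = np.random.default_rng(seed)
    x = rng.random((n, d), dtype=np.float32)

    kmeans = faiss.Kmeans(d, k, niter=5)
    try:
        kmeans.train(x)
    except AssertionError as exc:
        # The failure described in this issue surfaces as an AssertionError.
        return f"conflict: {exc!r}"
    return "ok"


status = run_repro()
print(status)
```

Running it once with pymupdf installed and once without should show whether the presence of fitz changes the outcome in a given environment.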

@hairuoguo
Author

This may also be a potential security vulnerability depending on what is actually happening under the hood. For example, I could modify the pymupdf vector class to include malicious code in the data() function, and the pymupdf proxy class would inadvertently be used, allowing for the code to be run whenever the .data() method is called.

@mdouze
Contributor

mdouze commented Jul 29, 2024

This may be because both Faiss and pymupdf are wrapped with SWIG.
Let me check whether there is a workaround for this case.

@mdouze
Contributor

mdouze commented Jul 29, 2024

I think we could use SWIG_TYPE_TABLE to make a unique type table for Faiss.
https://www.swig.org/Doc4.2/Modules.html#Modules_nn2
It seems that it just makes sure the table holding type names is distinct for Faiss.
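The idea of a per-project type table can be pictured with a toy model. This is not SWIG's runtime and it is not confirmed in this thread that the conflict works this way: `TypeTableRegistry`, the "first registration wins" rule, and the type and proxy names are all assumptions made for illustration. In a real build, SWIG_TYPE_TABLE is a compile-time define, not a Python API.

```python
class TypeTableRegistry:
    """Toy model: type tables keyed by name, each mapping a C++ type to a proxy."""

    def __init__(self):
        self._tables = {}

    def register(self, table_name, cpp_type, proxy_class):
        table = self._tables.setdefault(table_name, {})
        # Assumption for this sketch: the first module to register a type
        # in a given table decides which proxy class is used for it.
        table.setdefault(cpp_type, proxy_class)

    def lookup(self, table_name, cpp_type):
        return self._tables[table_name][cpp_type]


# Both modules share one table: the earlier registration shadows the later one.
shared = TypeTableRegistry()
shared.register("default", "SomeCppType", "pymupdf proxy")
shared.register("default", "SomeCppType", "faiss proxy")
shared_result = shared.lookup("default", "SomeCppType")

# Each module uses its own table name: lookups no longer collide.
separate = TypeTableRegistry()
separate.register("pymupdf", "SomeCppType", "pymupdf proxy")
separate.register("faiss", "SomeCppType", "faiss proxy")
separate_result = separate.lookup("faiss", "SomeCppType")
```

In the shared case the Faiss lookup returns the other package's proxy, which resembles the symptom in the report; with distinct table names each module resolves its own proxy.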

@junjieqi
Contributor

@hairuoguo could you try installing Faiss through conda? See the instructions at https://github.com/facebookresearch/faiss/blob/main/INSTALL.md for details. Thanks

@hairuoguo
Author

Will try this out when I have the time (next week or so), thanks.

@Luffy241

Luffy241 commented Sep 9, 2024

@hairuoguo I faced the same issue while using fitz, but there was no issue when I used PDFplumber.
You can try PDFplumber and it might work, but I need to use fitz. Is there any way to do this without using conda?
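One possible approach that does not depend on conda, untested against this particular conflict, is to keep fitz out of the process that uses Faiss by running the PDF extraction in a child interpreter. The sketch below assumes that process isolation avoids the clash; `run_in_child`, `extract_text_isolated`, and the JSON hand-off are hypothetical choices made for illustration.

```python
import json
import subprocess
import sys

# Script executed in the child interpreter; only the child ever imports fitz.
_EXTRACT_SCRIPT = """
import json, sys
import fitz
doc = fitz.open(sys.argv[1])
print(json.dumps([page.get_text() for page in doc]))
"""


def run_in_child(script, *args):
    """Run a Python script string in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    return result.stdout


def extract_text_isolated(pdf_path):
    """Return the text of each page, extracted without importing fitz here."""
    return json.loads(run_in_child(_EXTRACT_SCRIPT, pdf_path))


# Quick check of the helper with a script that needs no third-party packages.
child_output = run_in_child("print(1 + 1)").strip()
```

The parent process can then import faiss as usual and call `extract_text_isolated("some.pdf")` when it needs page text, at the cost of starting a new interpreter per call.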
