Issues indexing colbert example - pyserini.index.lucene(not solved) and tevatron.faiss_retriever(solved) #61

Open
lboesen opened this issue Nov 15, 2022 · 3 comments


lboesen commented Nov 15, 2022

Hi,

I ran into problems while working with the ColBERT example.
I trained the model as described at: https://github.com/texttron/tevatron/tree/main/examples/colbert

I then encoded the corpus and queries:

corpus:
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --p_max_len 128 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --encoded_save_path /corpus_emb_colbert/ \
  --encode_num_shard 20 \
  --encode_shard_index {s}
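
The `{s}` in the corpus command looks like a placeholder for the shard index. A minimal sketch of generating one argument list per shard, assuming shard indices run from 0 to the `--encode_num_shard` value minus 1. The helper name `build_encode_commands` is hypothetical; the flags and values are copied from the command above, including the save path as written (each shard would presumably need its own output file in practice):

```python
def build_encode_commands(num_shards: int = 20) -> list[list[str]]:
    """Return one argument list per corpus shard. Nothing is executed here."""
    commands = []
    for s in range(num_shards):  # assumption: shard indices are 0-based
        commands.append([
            "python", "-m", "tevatron.driver.encode",
            "--output_dir=temp",
            "--model_name_or_path", "bert-base-uncased",
            "--fp16",
            "--per_device_eval_batch_size", "156",
            "--p_max_len", "128",
            "--dataset_name", "Tevatron/msmarco-passage-corpus",
            "--encoded_save_path", "/corpus_emb_colbert/",  # kept as in the issue
            "--encode_num_shard", str(num_shards),
            "--encode_shard_index", str(s),
        ])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` to run the shards one after another.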

queries:
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --encode_is_qry \
  --q_max_len 32 \
  --dataset_name Tevatron/msmarco-passage/dev \
  --encoded_save_path /queries_emb.tsv

When I tried to build the index with:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input /model_runs/corpus_emb_colbert \
  --index /model_runs/index_colbert \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized --optimize

it failed with the following message:

2022-11-15 09:06:18,438 ERROR [pool-2-thread-1] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-1: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (BufferedReader); line: 1, column: 2]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:635) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1952) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:781) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.databind.ObjectReader.readValues(ObjectReader.java:1874) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonCollection$Segment.<init>(JsonCollection.java:107) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection$Segment.<init>(JsonVectorCollection.java:39) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection.createFileSegment(JsonVectorCollection.java:34) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:151) [anserini-0.15.0-fatjar.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-11-15 09:06:18,438 ERROR [pool-2-thread-11] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-11: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
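
The `0xfffd` in the error is the Unicode replacement character, which a text decoder typically substitutes for bytes that are not valid UTF-8. That suggests the files under `--input` may not be JSON text at all. A minimal sketch for checking a file before indexing, making no assumption about how the file was written (the helper name `first_line_is_json` is hypothetical):

```python
import json


def first_line_is_json(path: str) -> bool:
    """Return True if the first line decodes as UTF-8 and parses as JSON."""
    with open(path, "rb") as f:
        first_line = f.readline()
    try:
        json.loads(first_line.decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Binary data (or malformed text) ends up here.
        return False
```

If this returns `False` for the encoded corpus files, they would need to be converted to the JSON layout that `JsonVectorCollection` expects before Lucene indexing can work.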

I then tried using the tevatron.faiss_retriever as described in your guidelines for dense retrieval:

python -m tevatron.faiss_retriever \
  --query_reps /home/fdt672/model_runs/queries_emb_colbert_{train_split}/queries_emb_train_split_20.tsv \
  --passage_reps /home/fdt672/model_runs/corpus_emb_colbert_{train_split}/'*.jsonl' \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to /home/fdt672/model_runs/rank_colbert_{train_split}.txt

But it also failed with:

Traceback (most recent call last):
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/main.py", line 91, in <module>
main()
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/main.py", line 74, in main
retriever.add(p_reps)
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/retriever.py", line 16, in add
self.index.add(p_reps)
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/__init__.py", line 215, in replacement_add
self.add_c(n, swig_ptr(x))
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/swigfaiss_avx2.py", line 1618, in add
return _swigfaiss_avx2.IndexFlatCodes_add(self, n, x)
TypeError: in method 'IndexFlatCodes_add', argument 3 of type 'float const *'

Solution:
As I understand it, the problem is that Faiss needs the values as float32, not float16.
I made the following changes to faiss_retriever/retriever.py (in the tevatron library):

import faiss
import numpy as np


class BaseFaissIPRetriever:
    def __init__(self, init_reps: np.ndarray):
        index = faiss.IndexFlatIP(init_reps.shape[1])
        self.index = index

    def add(self, p_reps: np.ndarray):
        # Cast to float32 before adding: float16 input raised the TypeError above.
        p_reps_float32 = p_reps.astype(np.float32)
        self.index.add(p_reps_float32)

    def search(self, q_reps: np.ndarray, k: int):
        # Same cast for the query representations.
        q_reps_float32 = q_reps.astype(np.float32)
        return self.index.search(q_reps_float32, k)
       .....

After that, tevatron.faiss_retriever worked.

I am not sure whether this is a good solution, but it resolved my immediate problem with the ColBERT example.
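
An alternative to patching the library would be to cast the embeddings once, right after loading them. The `float const *` in the error above is consistent with Faiss expecting `float32` input. A minimal sketch using only NumPy (the helper name `to_faiss_input` is hypothetical and is not part of Tevatron or Faiss):

```python
import numpy as np


def to_faiss_input(reps: np.ndarray) -> np.ndarray:
    """Return a C-contiguous float32 version of a 2-D embedding matrix."""
    if reps.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {reps.shape}")
    # ascontiguousarray converts the dtype and fixes the memory layout in one step.
    return np.ascontiguousarray(reps, dtype=np.float32)
```

Casting float16 to float32 does not lose information, so the scores should match what the float16 vectors would have produced, at the cost of twice the memory.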

Ideally, I would like to build an index for my ColBERT model with pyserini.index.lucene. Do you have any suggestions for this?

Thanks a lot in advance :)


lboesen commented Nov 15, 2022

Does this have something to do with the --fp16 flag used when training the model?

MXueguang (Contributor) commented

Hi @lboesen,
The ColBERT example here currently covers only model training; it has not been tested for retrieval.
ColBERT is a multi-vector retrieval model, so inference/search is not yet supported by Tevatron.
I'd suggest following the original ColBERT repo for both training and search: https://github.com/stanford-futuredata/ColBERT
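
To illustrate the multi-vector point: ColBERT keeps one vector per token and scores a query-passage pair by late interaction (MaxSim), so a flat inner-product index with one vector per passage cannot reproduce its ranking directly. A minimal NumPy sketch of that scoring, with the function name `maxsim_score` chosen for illustration only:

```python
import numpy as np


def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Late-interaction score between one query and one passage.

    q_tokens: (num_query_tokens, dim) token embeddings of the query
    d_tokens: (num_doc_tokens, dim) token embeddings of the passage
    """
    sim = q_tokens @ d_tokens.T  # pairwise dot products, shape (num_q, num_d)
    # For each query token keep its best-matching passage token, then sum.
    return float(sim.max(axis=1).sum())
```

Because the score depends on every query token's best match, search needs per-token vectors for each passage rather than a single pooled vector.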


lboesen commented Nov 16, 2022

Thank you for your quick reply. Yes, I will have a look at the original ColBERT repo.

Do you by any chance know whether the ColBERT training parameters set in Tevatron give an effectiveness score equal to that of the original ColBERT, which reports MRR@10 = 36.0?
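
For comparing runs, MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10; the 36.0 above is presumably that value expressed as a percentage. A minimal sketch, assuming rankings and relevance judgments are already loaded into dictionaries (the function name and data layout are hypothetical, and this is not the official MS MARCO evaluation script):

```python
def mrr_at_k(rankings: dict[str, list[str]],
             qrels: dict[str, set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k.

    rankings: query id -> passage ids ordered from best to worst
    qrels:    query id -> set of relevant passage ids
    Averages over the queries present in `rankings` (an assumption).
    """
    total = 0.0
    for qid, ranked in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0
```

Multiplying the result by 100 gives a number on the same scale as the 36.0 quoted above.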
