Issues indexing colbert example - pyserini.index.lucene(not solved) and tevatron.faiss_retriever(solved) #61

lboesen · 2022-11-15T11:43:50Z

Hi,

I experienced issues when working with the colbert example.
I trained the model as per: https://github.com/texttron/tevatron/tree/main/examples/colbert

I then encoded the corpus and queries:

corpus:
python -m tevatron.driver.encode
--output_dir=temp
--model_name_or_path bert-base-uncased
--fp16
--per_device_eval_batch_size 156
--p_max_len 128
--dataset_name Tevatron/msmarco-passage-corpus
--encoded_save_path /corpus_emb_colbert/
--encode_num_shard 20
--encode_shard_index {s}

queries:
python -m tevatron.driver.encode
--output_dir=temp
--model_name_or_path bert-base-uncased
--fp16
--per_device_eval_batch_size 156
--encode_is_qry
--q_max_len 32
--dataset_name Tevatron/msmarco-passage/dev
--encoded_save_path /queries_emb.tsv

When trying to index using:

python -m pyserini.index.lucene
--collection JsonVectorCollection
--input /model_runs/corpus_emb_colbert
--index /model_runs/index_colbert
--generator DefaultLuceneDocumentGenerator
--threads 12
--impact --pretokenized --optimize

it failed with the following message:

2022-11-15 09:06:18,438 ERROR [pool-2-thread-1] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-1: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (BufferedReader); line: 1, column: 2]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:635) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1952) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:781) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.databind.ObjectReader.readValues(ObjectReader.java:1874) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonCollection$Segment.<init>(JsonCollection.java:107) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection$Segment.<init>(JsonVectorCollection.java:39) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection.createFileSegment(JsonVectorCollection.java:34) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:151) [anserini-0.15.0-fatjar.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-11-15 09:06:18,438 ERROR [pool-2-thread-11] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-11: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
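
A note on the error above: character 0xfffd is the Unicode replacement character, which typically appears when binary data is read as text. That suggests the files in the --input directory are not JSON at all (possibly a binary format such as pickle, though that is an assumption here). A minimal sketch for checking a file before indexing; the helper name `looks_like_jsonl` is hypothetical and not part of pyserini or tevatron:

```python
import json


def looks_like_jsonl(path: str) -> bool:
    """Return True if the first line of `path` is UTF-8 text that parses as JSON."""
    with open(path, "rb") as f:
        first_line = f.readline()
    try:
        json.loads(first_line.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not decodable as UTF-8, or not valid JSON: likely a binary file.
        return False
    return True
```

Running this on one file from the directory passed to --input should tell whether JsonVectorCollection has any chance of parsing it.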

I then tried using the tevatron.faiss_retriever as described in your guidelines for dense retrieval:

python -m tevatron.faiss_retriever
--query_reps /home/fdt672/model_runs/queries_emb_colbert_{train_split}/queries_emb_train_split_20.tsv
--passage_reps /home/fdt672/model_runs/corpus_emb_colbert_{train_split}/'*.jsonl'
--depth 100
--batch_size -1
--save_text
--save_ranking_to /home/fdt672/model_runs/rank_colbert_{train_split}.txt

But it also faulted with:

Traceback (most recent call last):
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/main.py", line 91, in <module>
main()
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/main.py", line 74, in main
retriever.add(p_reps)
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/retriever.py", line 16, in add
self.index.add(p_reps)
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/__init__.py", line 215, in replacement_add
self.add_c(n, swig_ptr(x))
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/swigfaiss_avx2.py", line 1618, in add
return _swigfaiss_avx2.IndexFlatCodes_add(self, n, x)
TypeError: in method 'IndexFlatCodes_add', argument 3 of type 'float const *'

Solution:
As I understand it, the issue was that the values need to be float32, not float16.
After I made the following changes to faiss_retriever/retriever.py (in the tevatron library)

import faiss
import numpy as np


class BaseFaissIPRetriever:
    def __init__(self, init_reps: np.ndarray):
        index = faiss.IndexFlatIP(init_reps.shape[1])
        self.index = index

    def add(self, p_reps: np.ndarray):
        # Changed: cast to float32, since float16 input caused the TypeError
        p_reps_float32 = p_reps.astype(np.float32)
        self.index.add(p_reps_float32)

    def search(self, q_reps: np.ndarray, k: int):
        # Changed: cast the queries to float32 for the same reason
        q_reps_float32 = q_reps.astype(np.float32)
        return self.index.search(q_reps_float32, k)

    # ..... (rest of the class unchanged)

the tevatron.faiss_retriever worked.
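
A slightly more defensive variant of the same cast, as a sketch only: the FAISS Python wrapper expects C-contiguous float32 arrays, and `np.ascontiguousarray` produces one while avoiding a copy when the input already qualifies. The helper name `as_faiss_input` and the random stand-in data are hypothetical.

```python
import numpy as np


def as_faiss_input(reps: np.ndarray) -> np.ndarray:
    """Return `reps` as a C-contiguous float32 array, copying only if needed."""
    return np.ascontiguousarray(reps, dtype=np.float32)


# Hypothetical stand-in for embeddings produced by an --fp16 encoding run.
p_reps = np.random.rand(4, 8).astype(np.float16)
converted = as_faiss_input(p_reps)
print(p_reps.dtype, "->", converted.dtype)  # float16 -> float32
```

Calling such a helper inside both add() and search() would have the same effect as the two astype calls above.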

I am not sure if this is a good solution, but it solved my current issues with the colbert example (..?)

I would ideally like to build an index with my colbert model using pyserini.index.lucene. Do you have any suggestions for this?

Thanks a lot in advance :)


lboesen · 2022-11-15T11:49:09Z

Does this have something to do with the --fp16 flag when training the model?

MXueguang · 2022-11-16T16:40:50Z

Hi @lboesen,
The colbert example here is only for training the model right now. It hasn't been tested for retrieval.
Colbert is a multi-vector retrieval model, so the inference/search is not supported by tevatron yet.
I'd suggest following the original ColBERT repo to train the model and do search https://github.com/stanford-futuredata/ColBERT
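
For readers wondering why a single-vector index does not fit here: ColBERT keeps one embedding per token and scores a query against a passage by late interaction (MaxSim), i.e. for each query token take the maximum similarity over the passage tokens, then sum over query tokens. A minimal plain-Python sketch of that scoring rule follows; it ignores masking, normalization and candidate generation, and the function names and toy numbers are made up for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take the best-matching
    document token by dot product, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (made-up numbers).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]
print(maxsim_score(query, doc))  # 1.5
```

A flat inner-product index compares one query vector with one vector per passage, so it cannot compute this sum of per-token maxima directly.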

lboesen · 2022-11-16T21:08:56Z

Thank you for your quick reply and yes I will have a look at the original colbert repo.

Do you by any chance know whether the ColBERT training parameters set in tevatron give an effectiveness score equal to that of the original ColBERT, which reports MRR@10 = 36.0?
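
For reference, MRR@10 is the mean over queries of the reciprocal rank of the first relevant passage within the top 10 (0 if none appears); the 36.0 figure is presumably that value multiplied by 100. A minimal sketch for comparing runs, assuming rankings and relevance judgments are already loaded into dictionaries. The function name and data layout are hypothetical, and it averages over the queries present in the run, which may differ from an official evaluation script.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k."""
    total = 0.0
    for qid, ranked_pids in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked_pids[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0


# Toy example with made-up ids: q1 hits at rank 2, q2 has no relevant hit.
toy_rankings = {"q1": ["p3", "p1"], "q2": ["p9"]}
toy_qrels = {"q1": {"p1"}, "q2": {"p2"}}
print(mrr_at_k(toy_rankings, toy_qrels))  # 0.25
```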
