Hi @lboesen,
The ColBERT example here is currently only for training the model. It hasn't been tested for retrieval.
ColBERT is a multi-vector retrieval model, so inference/search is not supported by tevatron yet.
I'd suggest following the original ColBERT repo to train the model and run search: https://github.com/stanford-futuredata/ColBERT
Thank you for your quick reply, and yes, I will have a look at the original ColBERT repo.
Do you by any chance know whether the ColBERT training parameters set in tevatron give an effectiveness score equal to the original ColBERT, which reports MRR@10 = 36.0?
Hi,
I experienced issues when working with the colbert example.
I trained the model as per: https://github.com/texttron/tevatron/tree/main/examples/colbert
I then encoded the corpus and queries:
corpus:
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --p_max_len 128 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --encoded_save_path /corpus_emb_colbert/ \
  --encode_num_shard 20 \
  --encode_shard_index {s}
queries:
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path bert-base-uncased \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --encode_is_qry \
  --q_max_len 32 \
  --dataset_name Tevatron/msmarco-passage/dev \
  --encoded_save_path /queries_emb.tsv
When trying to index using:
python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input /model_runs/corpus_emb_colbert \
  --index /model_runs/index_colbert \
  --generator DefaultLuceneDocumentGenerator \
  --threads 12 \
  --impact --pretokenized --optimize
it failed with the following message:
2022-11-15 09:06:18,438 ERROR [pool-2-thread-1] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-1: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (BufferedReader); line: 1, column: 2]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:635) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1952) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:781) ~[anserini-0.15.0-fatjar.jar:?]
at com.fasterxml.jackson.databind.ObjectReader.readValues(ObjectReader.java:1874) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonCollection$Segment.<init>(JsonCollection.java:107) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection$Segment.<init>(JsonVectorCollection.java:39) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.collection.JsonVectorCollection.createFileSegment(JsonVectorCollection.java:34) ~[anserini-0.15.0-fatjar.jar:?]
at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:151) [anserini-0.15.0-fatjar.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-11-15 09:06:18,438 ERROR [pool-2-thread-11] index.IndexCollection$LocalIndexerThread (IndexCollection.java:216) - pool-2-thread-11: Unexpected Exception:
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('�' (code 65533 / 0xfffd)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
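The parser rejects the very first character of the file, which suggests the input is not UTF-8 JSON text at all. One possibility (an assumption, not something confirmed in this thread) is that the encoder wrote a binary format such as a Python pickle rather than JSON. A minimal sketch for checking this before indexing, using only the Python standard library; the function name `first_line_is_json` is hypothetical and not part of pyserini or tevatron:

```python
import json


def first_line_is_json(path):
    """Return True if the first line of `path` decodes as UTF-8 and parses as JSON.

    Hypothetical helper for illustration; it only covers JSON Lines style input,
    where each line is a complete JSON value.
    """
    # Read raw bytes so that binary files do not fail inside open() itself.
    with open(path, "rb") as f:
        first_line = f.readline()
    try:
        json.loads(first_line.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Either the bytes are not valid UTF-8 or the text is not valid JSON.
        return False
    return True
```

If this returns False for the files under the --input directory, the failure comes from the file format rather than from the indexing options.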
I then tried using the tevatron.faiss_retriever as described in your guidelines for dense retrieval:
python -m tevatron.faiss_retriever \
  --query_reps /home/fdt672/model_runs/queries_emb_colbert_{train_split}/queries_emb_train_split_20.tsv \
  --passage_reps /home/fdt672/model_runs/corpus_emb_colbert_{train_split}/'*.jsonl' \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to /home/fdt672/model_runs/rank_colbert_{train_split}.txt
But it also failed with:
Traceback (most recent call last):
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/main.py", line 91, in <module>
main()
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/main.py", line 74, in main
retriever.add(p_reps)
File "/home/fdt672/git/MT_code/Master_Thesis_temp/src/tevatron/src/tevatron/faiss_retriever/retriever.py", line 16, in add
self.index.add(p_reps)
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/__init__.py", line 215, in replacement_add
self.add_c(n, swig_ptr(x))
File "/home/fdt672/anaconda3/envs/myenv/lib/python3.9/site-packages/faiss/swigfaiss_avx2.py", line 1618, in add
return _swigfaiss_avx2.IndexFlatCodes_add(self, n, x)
TypeError: in method 'IndexFlatCodes_add', argument 3 of type 'float const *'
Solution:
As I understand it, the issue was that the values need to be float32 and not float16.
After I made these changes to faiss_retriever/retriever.py (in the tevatron library),
tevatron.faiss_retriever worked.
I am not sure whether this is a good solution, but it solved my current issues with the ColBERT example (..?)
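The exact change is not reproduced in this thread. As a rough sketch of the idea only: the FAISS Python wrapper expects float32 input, so half-precision vectors (such as those that may be saved when encoding with --fp16) can be cast before they are added to the index. The helper name `as_faiss_input` is hypothetical, and where such a cast belongs in retriever.py is an assumption:

```python
import numpy as np


def as_faiss_input(reps):
    """Return `reps` as a C-contiguous float32 array.

    Hypothetical helper for illustration; it is not part of tevatron or FAISS.
    """
    return np.ascontiguousarray(reps, dtype=np.float32)


# Example: half-precision vectors standing in for encoded passage representations.
p_reps = np.random.rand(4, 8).astype(np.float16)
p_reps32 = as_faiss_input(p_reps)
print(p_reps32.dtype, p_reps32.shape)
```

With a cast like this, the call from the traceback would become something like self.index.add(as_faiss_input(p_reps)). Casting up from float16 does not restore precision lost at encoding time; it only changes the storage type so that FAISS accepts the array.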
Ideally, I would like to build an index with my ColBERT model using pyserini.index.lucene. Do you have any suggestions for this?
Thanks a lot in advance :)