Receiving a JSONDecodeError when running tevatron.driver.encode on WQ dataset #53

Open
xhluca opened this issue Jul 11, 2022 · 6 comments


xhluca commented Jul 11, 2022

I first used tevatron to train DPR from bert-base-uncased:

python -m torch.distributed.launch --nproc_per_node=1 -m tevatron.driver.train \
  --output_dir model_wq \
  --dataset_name Tevatron/wikipedia-wq \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --fp16 \
  --per_device_train_batch_size 128 \
  --train_n_passages 2 \
  --learning_rate 1e-5 \
  --q_max_len 32 \
  --p_max_len 156 \
  --num_train_epochs 40 \
  --negatives_x_device \
  --overwrite_output_dir

After the model was saved to model_wq/ (see footnote), I continued to follow the instructions to encode the passages:

export ENCODE_DIR="wq_corpus_encoded"

mkdir $ENCODE_DIR
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_wq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-wq-corpus \
  --encoded_save_path corpus_emb.$s.pkl \
  --encode_num_shard 20 \
  --encode_shard_index $s
done

I saved that in a bash file and ran it, but I got multiple JSONDecodeErrors along the way, which does not seem expected (which is why I stopped the process):

$ bash encode_wq_corpus.sh 
mkdir: cannot create directory ‘wq_corpus_encoded’: File exists
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder -   try loading tied weight
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder -   loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4573.94it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 429.92it/s]
Traceback (most recent call last):                                   
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
    main()
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
    cache_dir=data_args.data_cache_dir or model_args.cache_dir)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
    data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1210, in _prepare_split
    desc=f"Generating {split_info.name} split",
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/tmp/.cache/huggingface/modules/datasets_modules/datasets/Tevatron--wikipedia-wq-corpus/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033/wikipedia-wq-corpus.py", line 82, in _generate_examples
    data = json.loads(line)
  File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 30 (char 29)
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder -   try loading tied weight
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder -   loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5849.80it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 517.24it/s]
Traceback (most recent call last):                                  ^C
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
    main()
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
    cache_dir=data_args.data_cache_dir or model_args.cache_dir)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
    data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1212, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1579, in encode_example
    return encode_nested_example(self, example)
  File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1136, in encode_nested_example
    def encode_nested_example(schema, obj, level=0):
KeyboardInterrupt

Is this normal?
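For reference, this is roughly how the offending line could be located in the extracted corpus file. This is only a sketch: the cache root below comes from the traceback above, but the exact file layout and the *.jsonl extension are assumptions.

import json
import pathlib

# Assumption: the extracted corpus files live somewhere under the cache root
# printed in the traceback; the *.jsonl extension is also an assumption.
cache_root = pathlib.Path("/tmp/.cache/huggingface/datasets/downloads/extracted")

for path in cache_root.rglob("*.jsonl"):
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                # "Unterminated string" usually means the line (or the whole
                # file) was cut off, e.g. by a truncated download or extraction.
                print(f"{path}:{line_no}: {err}")
                print(repr(line[:200]))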

Libraries

This is my requirements file:

git+https://github.com/texttron/tevatron@b8f33900895930f9886012580e85464a5c1f7e9a
torch==1.12.*
faiss-cpu==1.7.2
transformers==4.15.0
datasets==1.17.0
pyserini

Footnote

  • I originally saved it as model_nq but renamed it to model_wq; I don't think this makes a difference, but let me know if it does.
  • I also tested with wikipedia-nq, with both the latest version on master and the 0.1 version on PyPI, and I get the same error.
@MXueguang
Contributor

Hi @xhluca,
Sorry for the late reply.
Is the issue limited to Tevatron/wikipedia-wq-corpus, or does Tevatron/wikipedia-nq-corpus also not work?
It seems like an issue caused by the json environment:

    data = json.loads(line)
  File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)

Let me know if you are still having the issue.

Xueguang

@xhluca
Author

xhluca commented Jul 26, 2022

I'm not sure what "json environment" means here. I'm using the standard Python 3.7 library in a fresh virtualenv.

@xhluca
Author

xhluca commented Jul 26, 2022

I tried different datasets and the problem is still present.

@MXueguang
Contributor

Could you check whether a simple jsonl file can be read in your environment? Or could you try a conda environment? My environment is Python 3.8 with conda.
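A minimal check could look something like this (sample.jsonl is just a placeholder for any small jsonl file):

import json

# Placeholder file: any small jsonl file with one JSON object per line.
with open("sample.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"read {len(records)} records")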

@xhluca
Author

xhluca commented Aug 1, 2022

Yes, I tried the following example: https://stackoverflow.com/questions/50475635/loading-jsonl-file-as-json-objects

And it works fine in my environment.

@xhluca
Author

xhluca commented Aug 1, 2022

@MXueguang My bad, I was indeed using conda. However, do you think it should make a difference whether I'm using conda or virtualenv, since the libraries were installed with pip and there's no conda-specific dependency?
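As an aside, one way to rule out a corrupted or truncated cached download (not confirmed as the cause in this thread) would be to force datasets to fetch the corpus again, roughly:

from datasets import load_dataset

# Assumption: a corrupted cached download could explain the unterminated-string
# error; forcing a fresh download is one way to rule that out.
corpus = load_dataset(
    "Tevatron/wikipedia-wq-corpus",
    download_mode="force_redownload",
)
print(corpus)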
