Receiving a JSONDecodeError when running tevatron.driver.encode on WQ dataset #53

Open
xhluca opened this issue Jul 11, 2022 · 6 comments


xhluca commented Jul 11, 2022

I first used tevatron to train DPR from bert-base-uncased:

python -m torch.distributed.launch --nproc_per_node=1 -m tevatron.driver.train \
  --output_dir model_wq \
  --dataset_name Tevatron/wikipedia-wq \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --save_steps 20000 \
  --fp16 \
  --per_device_train_batch_size 128 \
  --train_n_passages 2 \
  --learning_rate 1e-5 \
  --q_max_len 32 \
  --p_max_len 156 \
  --num_train_epochs 40 \
  --negatives_x_device \
  --overwrite_output_dir

After the model was saved to model_wq/ (see footnote), I continued to follow the instructions to encode the passages:

export ENCODE_DIR="wq_corpus_encoded"

mkdir $ENCODE_DIR
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_wq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-wq-corpus \
  --encoded_save_path corpus_emb.$s.pkl \
  --encode_num_shard 20 \
  --encode_shard_index $s
done

I saved that in a bash file and ran it, but I got multiple JSONDecodeErrors along the way, which does not seem expected (which is why I stopped the process):

$ bash encode_wq_corpus.sh 
mkdir: cannot create directory ‘wq_corpus_encoded’: File exists
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder -   try loading tied weight
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder -   loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4573.94it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 429.92it/s]
Traceback (most recent call last):                                   
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
    main()
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
    cache_dir=data_args.data_cache_dir or model_args.cache_dir)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
    data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1210, in _prepare_split
    desc=f"Generating {split_info.name} split",
  File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/tmp/.cache/huggingface/modules/datasets_modules/datasets/Tevatron--wikipedia-wq-corpus/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033/wikipedia-wq-corpus.py", line 82, in _generate_examples
    data = json.loads(line)
  File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 30 (char 29)
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder -   try loading tied weight
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder -   loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5849.80it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 517.24it/s]
Traceback (most recent call last):                                  ^C
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
    main()
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
    cache_dir=data_args.data_cache_dir or model_args.cache_dir)
  File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
    data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1212, in _prepare_split
    example = self.info.features.encode_example(record)
  File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1579, in encode_example
    return encode_nested_example(self, example)
  File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1136, in encode_nested_example
    def encode_nested_example(schema, obj, level=0):
KeyboardInterrupt

Is this normal?
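For reference, this is roughly how the offending line could be located in the extracted corpus file. This is only a sketch: the cache root below comes from the traceback above, but the exact file layout and the *.jsonl extension are assumptions.

import json
import pathlib

# Assumption: the extracted corpus files live somewhere under the cache root
# printed in the traceback; the *.jsonl extension is also an assumption.
cache_root = pathlib.Path("/tmp/.cache/huggingface/datasets/downloads/extracted")

for path in cache_root.rglob("*.jsonl"):
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                # "Unterminated string" usually means the line (or the whole
                # file) was cut off, e.g. by a truncated download or extraction.
                print(f"{path}:{line_no}: {err}")
                print(repr(line[:200]))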

Libraries

This is my requirements file:

git+https://github.com/texttron/tevatron@b8f33900895930f9886012580e85464a5c1f7e9a
torch==1.12.*
faiss-cpu==1.7.2
transformers==4.15.0
datasets==1.17.0
pyserini

Footnote

  • I originally saved it as model_nq but renamed it to model_wq; I don't think this makes a difference, but let me know if it does.
  • I also tested with wikipedia-nq, with both the latest version on master and the 0.1 version on PyPI, and I get the same error.
@MXueguang
Contributor

Hi @xhluca,
Sorry for the late reply.
Is the issue limited to Tevatron/wikipedia-wq-corpus, or does Tevatron/wikipedia-nq-corpus also not work?
It seems like an issue caused by the json environment:

    data = json.loads(line)
  File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)

Let me know if you are still having the issue.

Xueguang

@xhluca
Author

xhluca commented Jul 26, 2022

I'm not sure what "json environment" means here. I'm using the standard Python 3.7 library in a fresh virtualenv.

@xhluca
Author

xhluca commented Jul 26, 2022

I tried different datasets and the problem is still present.

@MXueguang
Contributor

Could you check whether a simple jsonl file can be read in your environment? Or could you try a conda environment? My environment is Python 3.8 with conda.
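A minimal check could look something like this (sample.jsonl is just a placeholder for any small jsonl file):

import json

# Placeholder file: any small jsonl file with one JSON object per line.
with open("sample.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"read {len(records)} records")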

@xhluca
Author

xhluca commented Aug 1, 2022

Yes, I tried the following example: https://stackoverflow.com/questions/50475635/loading-jsonl-file-as-json-objects

And it works fine in my environment.

@xhluca
Author

xhluca commented Aug 1, 2022

@MXueguang My bad, I was indeed using conda. However, do you think it should make a difference whether I'm using conda or virtualenv, since the libraries were installed with pip and there's no conda-specific dependency?
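As an aside, one way to rule out a corrupted or truncated cached download (not confirmed as the cause in this thread) would be to force datasets to fetch the corpus again, roughly:

from datasets import load_dataset

# Assumption: a corrupted cached download could explain the unterminated-string
# error; forcing a fresh download is one way to rule that out.
corpus = load_dataset(
    "Tevatron/wikipedia-wq-corpus",
    download_mode="force_redownload",
)
print(corpus)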
