
[BEIT-3] Error when evaluating a BEiT-3 fine-tuned model on VQAv2 #1597

Closed · matsutaku44 opened this issue Jul 6, 2024 · 4 comments

matsutaku44 commented Jul 6, 2024

Describe
Model I am using (UniLM, MiniLM, LayoutLM ...): BEiT-3

I want to evaluate a BEiT-3 fine-tuned model on VQAv2:
https://github.com/microsoft/unilm/blob/master/beit3/get_started/get_started_for_vqav2.md#example-evaluate-beit-3-finetuned-model-on-vqav2-visual-question-answering

However, an error occurs, and I cannot understand what the error message means. How do I solve this problem? Please help me.
Thank you for sharing the BEiT-3 code.

(beit3) matsuzaki.takumi@docker:~/workspace/vqa/unilm/beit3$ python -m torch.distributed.launch --nproc_per_node=2 run_beit3_finetuning.py \
>         --model beit3_base_patch16_480 \
>         --input_size 480 \
>         --task vqav2 \
>         --batch_size 4 \
>         --sentencepiece_model /mnt/new_mensa/data/VQAv2/BEIT3/beit3.spm \
>         --finetune /mnt/new_mensa/data/VQAv2/BEIT3/beit3_base_indomain_patch16_224.pth \
>         --data_path /mnt/new_mensa/data/VQAv2 \
>         --output_dir ./prediction_saveHere \
>         --eval \
>         --dist_eval
/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
W0706 05:15:39.667715 140419543589120 torch/distributed/run.py:757]
W0706 05:15:39.667715 140419543589120 torch/distributed/run.py:757] *****************************************
W0706 05:15:39.667715 140419543589120 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0706 05:15:39.667715 140419543589120 torch/distributed/run.py:757] *****************************************
usage: BEiT fine-tuning and evaluation script for image classification
       [--model MODEL] --task
       {nlvr2,vqav2,flickr30k,coco_retrieval,coco_captioning,nocaps,imagenet}
       [--input_size INPUT_SIZE] [--drop_path PCT]
       [--checkpoint_activations] --sentencepiece_model
       SENTENCEPIECE_MODEL [--vocab_size VOCAB_SIZE]
       [--num_max_bpe_tokens NUM_MAX_BPE_TOKENS] [--model_ema]
       [--model_ema_decay MODEL_EMA_DECAY] [--model_ema_force_cpu]
       [--opt OPTIMIZER] [--opt_eps EPSILON] [--opt_betas BETA [BETA ...]]
       [--clip_grad NORM] [--momentum M] [--weight_decay WEIGHT_DECAY]
       [--lr LR] [--layer_decay LAYER_DECAY]
       [--task_head_lr_weight TASK_HEAD_LR_WEIGHT] [--warmup_lr LR]
       [--min_lr LR] [--warmup_epochs N] [--warmup_steps N]
       [--batch_size BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE]
       [--epochs EPOCHS] [--update_freq UPDATE_FREQ]
       [--save_ckpt_freq SAVE_CKPT_FREQ] [--randaug]
       [--train_interpolation TRAIN_INTERPOLATION] [--finetune FINETUNE]
       [--model_key MODEL_KEY] [--model_prefix MODEL_PREFIX]
       [--data_path DATA_PATH] [--output_dir OUTPUT_DIR]
       [--log_dir LOG_DIR] [--device DEVICE] [--seed SEED]
       [--resume RESUME] [--auto_resume] [--no_auto_resume] [--save_ckpt]
       [--no_save_ckpt] [--start_epoch N] [--eval] [--dist_eval]
       [--num_workers NUM_WORKERS] [--pin_mem] [--no_pin_mem]
       [--world_size WORLD_SIZE] [--local_rank LOCAL_RANK] [--dist_on_itp]
       [--dist_url DIST_URL] [--task_cache_path TASK_CACHE_PATH]
       [--nb_classes NB_CLASSES] [--mixup MIXUP] [--cutmix CUTMIX]
       [--cutmix_minmax CUTMIX_MINMAX [CUTMIX_MINMAX ...]]
       [--mixup_prob MIXUP_PROB] [--mixup_switch_prob MIXUP_SWITCH_PROB]
       [--mixup_mode MIXUP_MODE] [--color_jitter PCT] [--aa NAME]
       [--smoothing SMOOTHING] [--crop_pct CROP_PCT] [--reprob PCT]
       [--remode REMODE] [--recount RECOUNT] [--resplit]
       [--captioning_mask_prob CAPTIONING_MASK_PROB]
       [--drop_worst_ratio DROP_WORST_RATIO]
       [--drop_worst_after DROP_WORST_AFTER] [--num_beams NUM_BEAMS]
       [--length_penalty LENGTH_PENALTY]
       [--label_smoothing LABEL_SMOOTHING] [--enable_deepspeed]
       [--initial_scale_power INITIAL_SCALE_POWER]
       [--zero_stage ZERO_STAGE]
BEiT fine-tuning and evaluation script for image classification: error: unrecognized arguments: --local-rank=0
[the same usage message is printed again, verbatim, by the second worker process]
BEiT fine-tuning and evaluation script for image classification: error: unrecognized arguments: --local-rank=1
E0706 05:15:44.682819 140419543589120 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 76) of binary: /home/matsuzaki.takumi/.conda/envs/beit3/bin/python
Traceback (most recent call last):
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/matsuzaki.takumi/.conda/envs/beit3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_beit3_finetuning.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-06_05:15:44
  host      : docker
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 77)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-06_05:15:44
  host      : docker
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 76)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
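
For context on this first failure: recent PyTorch versions (2.x) changed the flag that torch.distributed.launch passes to each worker from --local_rank to the hyphenated --local-rank, while the usage message above shows the script only defines --local_rank, so argparse rejects the new spelling. The deprecation warning suggests reading the rank from the environment instead. A minimal sketch of a parser that tolerates all launcher variants (this parser is illustrative, not the one in run_beit3_finetuning.py):

import argparse
import os

# Sketch: accept both spellings and fall back to the LOCAL_RANK
# environment variable. Old launchers pass --local_rank=N, new ones
# pass --local-rank=N, and torchrun passes nothing on the command
# line but sets LOCAL_RANK in the environment.
parser = argparse.ArgumentParser("local-rank compatibility sketch")
parser.add_argument(
    "--local_rank", "--local-rank",
    dest="local_rank",
    type=int,
    default=int(os.environ.get("LOCAL_RANK", 0)),
)
args = parser.parse_args()
print(f"running with local_rank={args.local_rank}")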

Sv3n01 commented Jul 6, 2024

Changing
"python -m torch.distributed.launch --nproc_per_node=2 run_beit3_finetuning.py" to
"python -m run_beit3_finetuning"
solved it for me in Google Colab.

matsutaku44 (Author) commented

@Sv3n01 Thank you for replying!
I am trying this change now.


matsutaku44 commented Jul 7, 2024

I removed "torch.distributed.launch --nproc_per_node=2" and ran it again.
The evaluation then seemed to start. Thank you very much!
However, a different error happens.

I ran this command:

python run_beit3_finetuning.py \
        --model beit3_base_patch16_480 \
        --input_size 480 \
        --task vqav2 \
        --batch_size 16 \
        --sentencepiece_model ../../../../new_mensa/data/VQAv2/BEIT3/beit3.spm \
        --finetune ../../../../new_mensa/data/VQAv2/BEIT3/beit3_base_indomain_patch16_224.pth \
        --data_path ../../../../new_mensa/data/VQAv2 \
        --output_dir ./prediction_saveHere \
        --eval \
        --dist_eval

The error:

. . .

Test:  [18640/18659]  eta: 0:00:05    time: 0.2775  data: 0.0002  max mem: 3774
Test:  [18650/18659]  eta: 0:00:02    time: 0.2774  data: 0.0002  max mem: 3774
Test:  [18658/18659]  eta: 0:00:00    time: 0.2658  data: 0.0000  max mem: 3774
Test: Total time: 1:26:48 (0.2792 s / it)
Traceback (most recent call last):
  File "run_beit3_finetuning.py", line 448, in <module>
    main(opts, ds_init)
  File "run_beit3_finetuning.py", line 365, in main
    utils.dump_predictions(args, result, "vqav2_test")
  File "/home/matsuzaki.takumi/workspace/vqa/unilm/beit3/utils.py", line 845, in dump_predictions
    torch.distributed.barrier()
  File "/home/matsuzaki.takumi/.conda/envs/beit3-3.8/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/matsuzaki.takumi/.conda/envs/beit3-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3672, in barrier
    opts.device = _get_pg_default_device(group)
  File "/home/matsuzaki.takumi/.conda/envs/beit3-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 649, in _get_pg_default_device
    group = group or _get_default_group()
  File "/home/matsuzaki.takumi/.conda/envs/beit3-3.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1008, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

I am trying to solve this problem now.
Do you have a solution? Please let me know.
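
For context: the traceback shows that utils.dump_predictions calls torch.distributed.barrier(), which requires an initialized process group, and a plain single-process run never calls init_process_group. A minimal guard around such a call, shown as a sketch rather than the repository's actual fix:

import torch.distributed as dist

# barrier() raises ValueError when init_process_group was never
# called; skip the synchronization in single-process runs.
if dist.is_available() and dist.is_initialized():
    dist.barrier()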


matsutaku44 commented Jul 7, 2024

I was able to get submit_vqav2_test.json (the list of question_id/answer pairs).

I added this at line 141 of run_beit3_finetuning.py:
parser.add_argument("--local-rank", type=int)

Then I ran the command below. (It seems you should not omit "-m torch.distributed.launch --nproc_per_node=2" after all.)

python -m torch.distributed.launch --nproc_per_node=2 run_beit3_finetuning.py \
        --model beit3_base_patch16_480 \
        --input_size 480 \
        --task vqav2 \
        --batch_size 16 \
        --sentencepiece_model ../../../../new_mensa/data/VQAv2/BEIT3/beit3.spm \
        --finetune ../../../../new_mensa/data/VQAv2/BEIT3/beit3_base_indomain_patch16_224.pth \
        --data_path ../../../../new_mensa/data/VQAv2 \
        --output_dir ./prediction_saveHere \
        --eval \
        --dist_eval

Then I get submit_vqav2_test.json:

. . . 

Test:  [9310/9330]  eta: 0:00:05    time: 0.2790  data: 0.0002  max mem: 4665
Test:  [9320/9330]  eta: 0:00:02    time: 0.2789  data: 0.0002  max mem: 4665
Test:  [9329/9330]  eta: 0:00:00    time: 0.2674  data: 0.0001  max mem: 4665
Test: Total time: 0:43:23 (0.2790 s / it)
Infer 447793 examples into ./prediction_saveHere/submit_vqav2_test.json

I don't know exactly why this produces the JSON file, but I will close this issue.
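
A likely explanation, for anyone landing here: PyTorch 2.x renamed the flag that torch.distributed.launch passes to workers from --local_rank to --local-rank, and the script's parser (see the usage message in the first post) only defined --local_rank, so argparse rejected the new spelling. Registering the hyphenated form fixes the parsing error, and with the launcher restored the script's distributed setup runs as intended, so the torch.distributed.barrier() call in dump_predictions succeeds as well. A sketch of the added argument with an illustrative default and help text (those extras are not in the original one-liner):

# run_beit3_finetuning.py, next to the existing --local_rank argument:
parser.add_argument("--local-rank", type=int, default=0,
                    help="local rank passed by torch.distributed.launch in PyTorch >= 2.0")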
