Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

problem when finetuning DocOwl1.5-Omni #82

Open
lmydian1014 opened this issue Jun 6, 2024 · 5 comments
Open

problem when finetuning DocOwl1.5-Omni #82

lmydian1014 opened this issue Jun 6, 2024 · 5 comments

Comments

@lmydian1014
Copy link

I have the following error when finetune the DocOwl1.5-Omni. It always raises error when index is 10. Please help!!!

File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 247, in forward
    self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images, patch_positions)
  File "/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 115, in prepare_inputs_labels_for_multimodal
    cur_image_features = image_features[cur_image_idx]
IndexError: index 10 is out of bounds for dimension 0 with size 10
  9%|████████▋                                                                                            | 9/105 [01:00<10:45,  6.72s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1144756 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1144755) of binary: /opt/conda/envs/mplug_owl2/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/mplug_owl2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
mplug_docowl/train/train_docowl.py FAILED

Below is my script.

#!/bin/bash
if [ $MASTER_ADDR ];then
    echo $MASTER_ADDR
    echo $MASTER_PORT
    echo $WORLD_SIZE
    echo $RANK
else
    MASTER_ADDR=127.0.0.1
    MASTER_PORT=2$(($RANDOM % 10))$(($RANDOM % 10))15
    WORLD_SIZE=1
    RANK=0
fi
# Change for multinode config
NNODES=${WORLD_SIZE}
NODE_RANK=${RANK}
#GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
GPUS_PER_NODE=2
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
echo $DISTRIBUTED_ARGS

# change LOAD to your local path of DocOwl1.5-stage1
#LOAD='./mPLUG/DocOwl1.5-stage1'
LOAD='mPLUG/DocOwl1.5-Omni'


# batch size = per_device_train_batch_size x GPUS_PER_NODE x NNODES x gradient_accumulation_steps
#DATA_FILE=./DocDownstream-1.0/train.jsonl
DATA_FILE=/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/training_data.jsonl
IMAGE_FOLDER=/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/

torchrun $DISTRIBUTED_ARGS mplug_docowl/train/train_docowl.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --vision2text_lr 2e-5 \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path $LOAD \
    --version v1 \
    --data_path $DATA_FILE \
    --image_folder $IMAGE_FOLDER \
    --image_size 448 \
    --crop_anchors 'grid_9' \
    --add_global_img True \
    --add_textual_crop_indicator True \
    --bf16 True \
    --output_dir ./checkpoints/docowl1.5-lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 5 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 4 \
    --learning_rate 1e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 3600 \
    --gradient_checkpointing True \
    --tune_vision2text True \
    --freeze_vision_model True \
    --freeze_backbone True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to tensorboard

I have tried to print the batch idx, it's weird that it always equal to 0.
screenshot is provided below
image_123650291

@HAWLYQ
Copy link
Collaborator

HAWLYQ commented Jun 6, 2024

Hi, @lmydian1014 this error is similar with #65. This may be because the number of image placeholders in the text input is not equal to the number of image features. There may be multiple input images and only 1 <|image|> in your query.

@lmydian1014
Copy link
Author

Thanks!
Just to double check, if I have 5 images and each image has 3 corresponding questions. Previously the ./images folder contains 5 images and I prepared the .json file as follows

{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81ouNxgyqBL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the added_sugar_content of this product?"}, {"role": "assistant", "content": "9g"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81ouNxgyqBL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the added_sugar_daily_value this product?"}, {"role": "assistant", "content": "18%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81ouNxgyqBL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the calcium_content of this product?"}, {"role": "assistant", "content": "17mg"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51wyfiAQhvL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the added_sugar_daily_value this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51wyfiAQhvL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the calcium_content of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51wyfiAQhvL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the calcium_daily_value of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/71NXqwfW9cL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the calcium_daily_value of this product?"}, {"role": "assistant", "content": "0%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/71NXqwfW9cL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the carbohydrate_content of this product?"}, {"role": "assistant", "content": "17g"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/71NXqwfW9cL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the carbohydrate_daily_value of this product?"}, {"role": "assistant", "content": "6%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81TZLOCqbjL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_C_daily_value of this product?"}, {"role": "assistant", "content": "0%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81TZLOCqbjL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_D_Content of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81TZLOCqbjL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_D_daily_value of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51J22uvz4jL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_C_daily_value of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51J22uvz4jL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_D_Content of this product?"}, {"role": "assistant", "content": "0mcg"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51J22uvz4jL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_D_daily_value of this product?"}, {"role": "assistant", "content": "0%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}

What you mean is that I need to have 15 images in ./images folder and also need to change image name in ./images folder and in .json file as "81ouNxgyqBL_1.jpg", ''81ouNxgyqBL_2.jpg'', ''81ouNxgyqBL_3.jpg''?

@HAWLYQ
Copy link
Collaborator

HAWLYQ commented Jun 8, 2024

Thanks! Just to double check, if I have 5 images and each image has 3 corresponding questions. Previously the ./images folder contains 5 images and I prepared the .json file as follows

{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81ouNxgyqBL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the added_sugar_content of this product?"}, {"role": "assistant", "content": "9g"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81ouNxgyqBL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the added_sugar_daily_value this product?"}, {"role": "assistant", "content": "18%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81ouNxgyqBL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the calcium_content of this product?"}, {"role": "assistant", "content": "17mg"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51wyfiAQhvL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the added_sugar_daily_value this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51wyfiAQhvL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the calcium_content of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51wyfiAQhvL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the calcium_daily_value of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/71NXqwfW9cL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the calcium_daily_value of this product?"}, {"role": "assistant", "content": "0%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/71NXqwfW9cL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the carbohydrate_content of this product?"}, {"role": "assistant", "content": "17g"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/71NXqwfW9cL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the carbohydrate_daily_value of this product?"}, {"role": "assistant", "content": "6%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81TZLOCqbjL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_C_daily_value of this product?"}, {"role": "assistant", "content": "0%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81TZLOCqbjL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_D_Content of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/81TZLOCqbjL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_D_daily_value of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51J22uvz4jL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_C_daily_value of this product?"}, {"role": "assistant", "content": "nan"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51J22uvz4jL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_D_Content of this product?"}, {"role": "assistant", "content": "0mcg"}], "task_name": "qa_sft", "dataset_name": "nutrition"}
{"image": ["/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/image/51J22uvz4jL.jpg"], "messages": [{"role": "user", "content": "<|image|>What is the vitamin_D_daily_value of this product?"}, {"role": "assistant", "content": "0%"}], "task_name": "qa_sft", "dataset_name": "nutrition"}

What you mean is that I need to have 15 images in ./images folder and also need to change image name in ./images folder and in .json file as "81ouNxgyqBL_1.jpg", ''81ouNxgyqBL_2.jpg'', ''81ouNxgyqBL_3.jpg''?

Hi, @lmydian1014 , this JSON file is ok. Please check whether all samples in your JSON file raise the IndexError or just partial samples. If only partial samples, could you provide some examples? (ps: the bath_idx=0 is because the per_device_train_batch_size=1 )

@lmydian1014
Copy link
Author

lmydian1014 commented Jul 30, 2024

Hi, I encountered this problem again after around 30 training steps, I tried to print some variables

print(f"cur_image_idx: {cur_image_idx}")
print(f"image_features.shape: {[feat.shape for feat in image_features]}")
print(f"image_token_indices: {image_token_indices}")

The output is as follows,

cur_image_idx: 10
image_features.shape: [torch.Size([257, 4096]), torch.Size([257, 4096]), torch.Size([257, 4096]), torch.Size([257, 4096]), torch.Size([257, 4096]), torch.Size([257, 4096]), torch.Size([257, 4096]), torch.Size([257, 4096]), torch.Size([257, 4096]), torch.Size([257, 4096])]
image_token_indices: tensor([  16,   29,   42,   55,   68,   81,   94,  107,  120,  133,  150,  163,
         176,  189,  202,  215,  228,  241,  254,  267,  284,  297,  310,  323,
         336,  349,  362,  375,  388,  401,  418,  431,  444,  457,  470,  483,
         496,  509,  522,  535,  552,  565,  578,  591,  604,  617,  630,  643,
         656,  669,  686,  699,  712,  725,  738,  751,  764,  777,  790,  803,
         820,  833,  846,  859,  872,  885,  898,  911,  924,  937,  954,  967,
         980,  993, 1006, 1019, 1032, 1045, 1058, 1071, 1088, 1101, 1114, 1127,
        1140, 1153, 1166, 1179, 1192, 1205], device='cuda:2')
Traceback (most recent call last):
  File "/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/train/train_docowl.py", line 722, in <module>
    train()
  File "/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/train/train_docowl.py", line 701, in train
    trainer.train()
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/opt/conda/envs/mplug_owl2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 254, in forward
    self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images, patch_positions)
  File "/home/ubuntu/moyanli/mPLUG-DocOwl/DocOwl1.5/mplug_docowl/model/modeling_mplug_docowl.py", line 122, in prepare_inputs_labels_for_multimodal
    cur_image_features = image_features[cur_image_idx]
IndexError: index 10 is out of bounds for dimension 0 with size 10

I have double checked that the number of images is equal to the number of <|image|> in the .json file. I am quite confused what is the problem here. Could you please help me regarding how to identify which samples raise this index error? it seems printing those three variables doesn't help much in debugging this issue.

Many thanks!

@zyn99xjtu
Copy link

hi! I encountered similar errors during finetuning. Do you know why?
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants