issue while finetuning DocOwl1.5-Omni on dataset #78

Open
AkshataABhat opened this issue May 31, 2024 · 4 comments

AkshataABhat commented May 31, 2024

The training does not start: GPU memory is completely occupied, but GPU utilization stays at 0%. Screenshot attached below. Please help.

[screenshot attached]
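
For reference, one generic way to confirm this symptom from a shell is to watch memory versus utilization side by side (standard nvidia-smi usage, not specific to this repo; the 1-second interval is arbitrary):

# Refresh nvidia-smi every second to watch memory vs. utilization
watch -n 1 nvidia-smi
# Or stream per-GPU utilization and memory as a compact table
nvidia-smi dmon -s um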

HAWLYQ (Collaborator) commented May 31, 2024

Hi, @AkshataABhat, could you provide more details, such as the GPU device and training script?

AkshataABhat (Author) commented May 31, 2024

@HAWLYQ
The GPU is an NVIDIA A100-SXM4-40GB.

The training script is:

#!/bin/bash
if [ $MASTER_ADDR ]; then
    echo $MASTER_ADDR
    echo $MASTER_PORT
    echo $WORLD_SIZE
    echo $RANK
else
    MASTER_ADDR=127.0.0.1
    MASTER_PORT=2$(($RANDOM % 10))$(($RANDOM % 10))15
    WORLD_SIZE=1
    RANK=0
fi
# Change for multinode config
NNODES=${WORLD_SIZE}
NODE_RANK=${RANK}
GPUS_PER_NODE=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
# GPUS_PER_NODE=1
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
echo $DISTRIBUTED_ARGS

# change LOAD to your local path of DocOwl1.5-stage1
LOAD='mPLUG/DocOwl1.5-Omni'

# batch size = per_device_train_batch_size x GPUS_PER_NODE x NNODES x gradient_accumulation_steps
DATA_FILE=train.jsonl
torchrun $DISTRIBUTED_ARGS mplug_docowl/train/train_docowl.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --vision2text_lr 2e-5 \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path $LOAD \
    --version v1 \
    --data_path $DATA_FILE \
    --image_folder 'DocOwl1.5/answers/images' \
    --image_size 448 \
    --crop_anchors 'grid_9' \
    --add_global_img True \
    --add_textual_crop_indicator True \
    --bf16 True \
    --output_dir ./checkpoints/docowl1.5-lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 4 \
    --learning_rate 1e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 3600 \
    --gradient_checkpointing True \
    --tune_vision2text True \
    --freeze_vision_model True \
    --freeze_backbone True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to tensorboard
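
As a side note, the effective batch size stated in the script's own comment can be checked with plain shell arithmetic before launching (a sketch reusing the values already set above):

# batch size = per_device_train_batch_size x GPUS_PER_NODE x NNODES x gradient_accumulation_steps
PER_DEVICE_BS=1
GRAD_ACCUM=8
echo "GPUs per node: $GPUS_PER_NODE, nodes: $NNODES"
echo "Effective batch size: $((PER_DEVICE_BS * GPUS_PER_NODE * NNODES * GRAD_ACCUM))"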

HAWLYQ (Collaborator) commented May 31, 2024

Hi @AkshataABhat, the training script seems OK~ I have tested it on an A100-80G but am not sure whether it works well on an A100-40G~ We will check whether it works on a V100-32G, but due to the work schedule and limited machine resources, this won't happen soon; sorry for that~
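
For readers hitting memory limits on a 40 GB card, one generic DeepSpeed option (untested with DocOwl1.5 and not a recommendation from this thread) is to enable CPU optimizer offload in the ZeRO-2 config passed via --deepspeed. A minimal sketch, assuming the standard DeepSpeed/HF-Trainer config schema; the file name is illustrative:

# Write a ZeRO-2 config with CPU optimizer offload (generic DeepSpeed option)
cat > ./scripts/zero2_offload.json <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF
# Then launch with: --deepspeed ./scripts/zero2_offload.json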

AkshataABhat (Author) commented May 31, 2024

@HAWLYQ
Here, I am loading the model from Hugging Face instead of a local checkpoint:

LOAD='mPLUG/DocOwl1.5-Omni'

Also, in train_docowl.py, the code executes up to the line below:

data_module = make_supervised_data_module(tokenizer=tokenizer,
                                              data_args=data_args)

About 35 GB of GPU memory is occupied by this step. After this, the trainer is never called:

trainer.train()

Please advise: is this a GPU issue, or would the script work if the checkpoints were available locally?
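
One generic way to see where the process is actually stuck (data-module construction vs. trainer.train()) is to dump its Python stack with py-spy; this is a standard debugging tool, not something from this repo, and <PID> is a placeholder:

pip install py-spy
# Dump the current Python stack of the apparently hung training process;
# replace <PID> with the rank-0 process id shown by nvidia-smi or ps.
py-spy dump --pid <PID>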
