
Zero3 error for pretrain #98

Open

zhww opened this issue Jun 28, 2024 · 4 comments


zhww commented Jun 28, 2024

Pretraining with zero3 fails with the errors below, while LoRA finetuning with zero3 works fine.
The error info is:
python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3375, in reduce_scatter_tensor
work = group._reduce_scatter_base(output, input, opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1

The pretrain settings:
MODEL_TYPE=llama3-8b
--version llama

transformers==4.40.0
deepspeed==0.14.3
pytorch==2.1.2+cu121

Could you check whether the pretrain code works with zero3?
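
The trace itself suggests rerunning with NCCL_DEBUG=INFO. A minimal sketch for capturing that log (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables; scripts/pretrain.sh is a placeholder for the actual launch command):

```bash
# Rerun the failing launch with NCCL debug logging enabled and save the
# output; the log typically shows which transport/CUDA call failed.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL \
    bash scripts/pretrain.sh 2>&1 | tee nccl_debug.log
```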

@Isaachhh (Collaborator)

--version plain
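
That is, the pretraining stage should use the plain conversation template rather than the llama one. A minimal sketch of the changed setting (only MODEL_TYPE and --version come from this thread; the entry point and the --model_type flag are placeholders, not the repo's actual script):

```bash
# Pretrain with the "plain" template; "llama" is the chat template used
# for finetuning. train.py stands in for the repo's real entry point.
MODEL_TYPE=llama3-8b

deepspeed train.py \
    --model_type "$MODEL_TYPE" \
    --version plain
```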


zhww commented Jun 28, 2024

With MODEL_TYPE=llama3-8b and --version plain for pretraining, I still get the same errors!


Isaachhh commented Jun 28, 2024

On our devices, zero3 pre-training works fine. Please share more information, or try our configured Docker image.


zhww commented Jun 30, 2024

Thank you. I was testing on 4090s before; after switching to A100s, zero3 pretraining works fine.
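
For anyone else seeing this on 4090s: one possible cause (my assumption, not verified) is that these consumer cards lack the peer-to-peer access NCCL tries to use, so disabling P2P may avoid the unhandled cuda error. NCCL_P2P_DISABLE is a standard NCCL environment variable; the script name is a placeholder:

```bash
# Workaround sketch, assuming the 4090 failure comes from NCCL's P2P path:
# fall back to shared-memory/PCIe copies instead of direct GPU peer access.
NCCL_P2P_DISABLE=1 bash scripts/pretrain.sh
```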

By the way, will you support arbitrary resolutions by slicing the image, since S2 only provides multi-scale features?
