Train FAILED. Crashed while training with SIGTERM #1670
Comments
@RodriMora I believe this is fixed by #1676. Was the timeout happening at the end of an epoch or at the end of training?
Seeing this same behavior. The timeout happens at the end of training; it seems to just hang at the last step, and sometimes the error OP posted appears. This error doesn't happen every time, though, and I don't know what's different about fine-tunes that work vs. ones that don't. I've tried docker images for:
They all seem to do the same thing: freeze at the last step of training. Would be happy to try anything to see if we can fix this.
False alarm!
I'm still getting what I think is the same issue using the Docker image. The error occurs at the end of training. Use the Modal llm-finetuning repo with this updated Docker image on a single H100.
To be honest, I don't know what I'm doing wrong. I just tried a bunch of versions of the docker image, including winglian/axolotl:main-20240610-py3.11-cu121-2.3.0 (I had to change ${PWD} and ${HOME} in the README command to $PWD and $HOME to get it to work):
And then, once the image downloads and loads me in as root to the workspace inside the docker container, running it I get these errors, as if axolotl was not installed:
In my case the issue disappeared when I removed the
@psimm Is your docker container not configured to have access to the external internet?
@winglian The Docker container has access to the external internet. I experimented more and noticed three things:
I think the issue is that the large upload took longer than the previous timeout setting, which was just 60 (see https://github.com/modal-labs/llm-finetuning/blob/main/src/common.py).
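As a minimal sketch (not the repo's actual code), raising the timeout on the Modal function that runs training would look roughly like this; the app name, GPU, and timeout value are illustrative assumptions:
import modal

app = modal.App("llm-finetuning-sketch")  # hypothetical app name

# Modal's timeout is given in seconds; raise it so a long final upload
# is not killed partway through.
@app.function(gpu="H100", timeout=4 * 60 * 60)
def train():
    ...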
Please check that this issue hasn't been reported before.
Expected Behavior
The fine-tuning process completes without errors or crashes
Current behaviour
The process stops with SIGTERM errors
Steps to reproduce
I run the advanced docker command provided in the docs:
docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest
I get into the container just fine. Then:
CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml
Output:
preprocess.txt
Then:
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml --deepspeed deepspeed_configs/zero1.json
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=604, OpType=ALLREDUCE, NumelIn=12712960, NumelOut=12712960, Timeout(ms)=1800000) ran for 1800616 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=604, OpType=ALLREDUCE, NumelIn=12712960, NumelOut=12712960, Timeout(ms)=1800000) ran for 1800616 milliseconds before timing out.
[2024-05-29 08:24:13,026] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 812 closing signal SIGTERM
[2024-05-29 08:24:13,291] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 813) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in
sys.exit(main())
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
axolotl.cli.train FAILED
Failures:
[1]:
time : 2024-05-29_08:24:13
host : dc53c9f6e164
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 814)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 814
[2]:
time : 2024-05-29_08:24:13
host : dc53c9f6e164
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 815)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 815
Root Cause (first observed failure):
[0]:
time : 2024-05-29_08:24:13
host : dc53c9f6e164
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 813)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 813
Full output here:
train.txt
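For reference, the 1800000 ms in the log above is torch's default 30-minute NCCL collective timeout. A minimal sketch of extending it through accelerate's InitProcessGroupKwargs (an assumed workaround, not part of this report, and it only buys time rather than fixing the underlying hang):
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Let collectives wait up to 2 hours before the NCCL watchdog aborts the run.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)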
My system:
Ubuntu 22.04
AMD Epyc 7402
512GB RAM
4x 3090s
Config yaml
The default examples/openllama-3b/lora.yml provided in the repo
Possible solution
No response
Which Operating Systems are you using?
Python Version
Python 3.10.14 (the one inside the docker image)
axolotl branch-commit
main/49b967b
Acknowledgements