Train FAILED. Crashed while training with SIGTERM #1670

RodriMora opened this issue May 29, 2024 · 6 comments
Labels: bug (Something isn't working), possibly_solved


@RodriMora

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

The fine-tuning process completes without errors or crashes

Current behaviour

The process stops with SIGTERM errors

Steps to reproduce

I run the advanced Docker command provided in the docs:

docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-latest

I get into the container just fine. Then:

CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess examples/openllama-3b/lora.yml

Output:
preprocess.txt

Then:
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml --deepspeed deepspeed_configs/zero1.json

[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=604, OpType=ALLREDUCE, NumelIn=12712960, NumelOut=12712960, Timeout(ms)=1800000) ran for 1800616 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=604, OpType=ALLREDUCE, NumelIn=12712960, NumelOut=12712960, Timeout(ms)=1800000) ran for 1800616 milliseconds before timing out.
[2024-05-29 08:24:13,026] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 812 closing signal SIGTERM
[2024-05-29 08:24:13,291] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 813) of binary: /root/miniconda3/envs/py3.10/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

===================================================
axolotl.cli.train FAILED
---------------------------------------------------
Failures:
[1]:
  time      : 2024-05-29_08:24:13
  host      : dc53c9f6e164
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 814)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 814
[2]:
  time      : 2024-05-29_08:24:13
  host      : dc53c9f6e164
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 815)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 815
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-29_08:24:13
  host      : dc53c9f6e164
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 813)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 813

Full output here:
train.txt
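
One note for anyone trying to reproduce: with 4x RTX 3090s and no NVLink, a commonly suggested knob is disabling NCCL peer-to-peer transfers. I have not verified that it affects this particular hang, but for reference this is how the same launch would look with that and with NCCL debug logging enabled (both are standard NCCL environment variables and only change transport/logging behavior):

# Untested workaround sketch: disable P2P and enable NCCL logging for the launch above.
NCCL_P2P_DISABLE=1 NCCL_DEBUG=INFO \
  accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml \
  --deepspeed deepspeed_configs/zero1.json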

My system:

  • Ubuntu 22.04
  • AMD EPYC 7402
  • 512 GB RAM
  • 4x RTX 3090

[image attachment]

Config yaml

The default examples/openllama-3b/lora.yml provided in the repo

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

Python 3.10.14 (the one inside the Docker image)

axolotl branch-commit

main/49b967b

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
RodriMora added the bug (Something isn't working) label on May 29, 2024
@winglian
Collaborator

winglian commented Jun 3, 2024

@RodriMora I believe this is fixed by #1676. Was the timeout happening at the end of an epoch or at the end of training?

@shopigarner

Seeing this same behavior. The timeout happens at the end of training: it seems to just hang at the last step, and sometimes the error the OP posted appears. The error doesn't happen every time, though, and I don't know what's different about the fine-tunes that work versus the ones that don't.

I've tried docker images for:

  • winglian/axolotl:main-20240530-py3.10-cu118-2.1.2
  • winglian/axolotl:main-20240531-py3.10-cu118-2.1.2
  • winglian/axolotl:main-20240610-py3.10-cu118-2.1.2

They all seem to do the same thing: freeze at the last step of training. I'd be happy to try anything to see if we can fix this.
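
One thing I plan to try next (untested so far, and purely diagnostic) is launching with the standard PyTorch/NCCL debug switches turned on, so the freeze at the last step at least points at a specific collective in the logs; the config path below is just a placeholder for whatever YAML you are training with:

# Diagnostic only: both variables exist in stock PyTorch/NCCL and only add logging.
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL \
  accelerate launch -m axolotl.cli.train your_config.yml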

@shopigarner

False alarm!
The newer image winglian/axolotl:main-20240610-py3.10-cu118-2.1.2 indeed fixes the issue 🥳

@psimm

psimm commented Jun 16, 2024

I'm still getting what I think is the same issue using the Docker image winglian/axolotl:main-20240616-py3.11-cu121-2.2.2

https://hub.docker.com/layers/winglian/axolotl/main-20240616-py3.11-cu121-2.2.2/images/sha256-81e9b559535e35e580cc0dbb43b92c2ea89a434ba3880a735360714b8182f7fd?context=explore

The error occurs at the end of training.

I'm using the Modal llm-finetuning repo with this updated Docker image on a single H100.

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2263, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=300000) ran for 300189 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f26ddd81d87 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f269309c6e6 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f269309fc3d in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f26930a0839 in /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f26de2b1bf4 in /root/miniconda3/envs/py3.11/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f26df694ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f26df726a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[2024-06-16 13:11:36,589] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 28 closing signal SIGTERM
[2024-06-16 13:11:36,808] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 29) of binary: /root/miniconda3/envs/py3.11/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
===================================================
axolotl.cli.train FAILED
---------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-16_13:11:36
  host      : localhost
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 29)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 29
===================================================
Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 487, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 239, in run_input_sync
    res = finalized_function.callable(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/src/train.py", line 36, in train
    run_cmd(cmd, run_folder)
  File "/root/src/train.py", line 183, in run_cmd
    exit(exit_code)
  File "<frozen _sitebuiltins>", line 26, in __call__
SystemExit: 1
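
Since the watchdog here fires after only 300 seconds (Timeout(ms)=300000) and always at the end of training, one thing I'm considering trying (untested, and I haven't double-checked the exact key name in the axolotl config schema, so treat this as a sketch) is raising the distributed timeout in the training YAML so that a slow final save or upload doesn't trip it:

# Hypothetical: append a larger DDP timeout (in seconds) to the config used for the run.
# "config.yml" is a placeholder path; verify the key name against the axolotl config docs first.
echo "ddp_timeout: 7200" >> config.yml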

@RodriMora
Author

To be honest, I don't know what I'm doing wrong. I've tried a bunch of versions of the Docker image; here is what happens with winglian/axolotl:main-20240610-py3.11-cu121-2.3.0:

(I had to change ${PWD} and ${HOME} in the README command to $PWD and $HOME to get it to work)

docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src=$PWD,target=/workspace/axolotl -v $HOME/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-20240610-py3.11-cu121-2.3.0

And then, once the image downloads and I'm dropped into the workspace as root inside the container, running:
accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml

I get these errors, as if axolotl were not installed:

[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost.lan]:29500 (errno: 97 - Address family not supported by protocol).
/root/miniconda3/envs/py3.11/bin/python: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')
/root/miniconda3/envs/py3.11/bin/python: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')
/root/miniconda3/envs/py3.11/bin/python: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')
/root/miniconda3/envs/py3.11/bin/python: Error while finding module specification for 'axolotl.cli.train' (ModuleNotFoundError: No module named 'axolotl')
E0616 17:46:39.157000 127746857011008 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 83) of binary: /root/miniconda3/envs/py3.11/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-16_17:46:39
  host      : b99519cb976b
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 84)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-16_17:46:39
  host      : b99519cb976b
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 85)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-06-16_17:46:39
  host      : b99519cb976b
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 86)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-16_17:46:39
  host      : b99519cb976b
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 83)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
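
Could it be that the --mount of $PWD onto /workspace/axolotl replaces the axolotl checkout that ships in the image, so the package is no longer importable whenever $PWD is not an axolotl clone? That is only a guess on my part, but a quick way to check from inside the container, and a possible fix if that turns out to be the cause, would be something like:

# Is the package still importable in the container's environment?
python -c "import axolotl; print(axolotl.__file__)"

# If not, and /workspace/axolotl does contain the axolotl sources, reinstall it in editable mode:
cd /workspace/axolotl && pip install -e .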

@psimm

psimm commented Jun 23, 2024

In my case, the issue disappeared when I removed the hub_model_id setting.
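
In case anyone wants to test the same change: the setting lives in the training YAML, and something like the following should locate and disable it. The config path is a placeholder, and the sed line assumes the key sits at the top level of the file, so double-check before running it; my understanding is that removing it skips the end-of-training push to the Hub.

# Locate the setting ("config.yml" is a placeholder path):
grep -n "hub_model_id" config.yml
# Comment it out so it is no longer applied:
sed -i 's/^hub_model_id:/# hub_model_id:/' config.yml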
