Multi-gpu problem #1036

Open
4 tasks done
Cho-Hong-Seok opened this issue Apr 15, 2024 · 1 comment
Labels
question Further information is requested

Comments

Cho-Hong-Seok commented Apr 15, 2024

Before Asking

  • I have read the README carefully.

  • I want to train my custom dataset, and I have read the tutorial for training custom data carefully and organized my dataset correctly. (FYI: We recommend using the xx_finetune.py config files for custom datasets.)

  • I have pulled the latest code of the main branch and rerun it, and the problem still exists.

Search before asking

  • I have searched the YOLOv6 issues and found no similar questions.

Question

I'm trying to train the yolov6_seg model on multiple GPUs and I'm getting an error; I don't know which part I need to fix.

Training command

!python -m torch.distributed.launch --nproc_per_node 2 tools/train.py --batch 64 --conf configs/yolov6s_seg.py --epoch 150 --data ../FST1/data.yaml --device 0,1
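
For reference, the deprecation warning in the log below recommends torchrun instead of torch.distributed.launch. A sketch of the equivalent launch, assuming tools/train.py reads LOCAL_RANK from the environment rather than expecting a --local-rank argument:

# sketch only: assumes tools/train.py reads os.environ['LOCAL_RANK'] (see the warning in the log below)
!torchrun --nproc_per_node 2 tools/train.py --batch 64 --conf configs/yolov6s_seg.py --epoch 150 --data ../FST1/data.yaml --device 0,1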

Error log

/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] 
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] *****************************************
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 143, in <module>
    main(args)
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 116, in main
    cfg, device, args = check_and_init(args)
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 102, in check_and_init
    device = select_device(args.device)
  File "/media/HDD/조홍석/YOLOv6/yolov6/utils/envs.py", line 32, in select_device
    assert torch.cuda.is_available()
AssertionError
(the same traceback is printed by both worker processes)
[2024-04-15 19:19:48,319] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 29482) of binary: /home/dilab03/anaconda3/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-15_19:19:48
  host      : dilab03-Super-Server
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 29483)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-15_19:19:48
  host      : dilab03-Super-Server
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 29482)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Additional

No response

Cho-Hong-Seok added the question (Further information is requested) label on Apr 15, 2024
@Chilicyy (Collaborator) commented:

Hi @Cho-Hong-Seok, according to the AssertionError in the log, it seems that CUDA is not available. This might be caused by a mismatch between your torch version and your CUDA version. You could try reinstalling torch by referring to the official page.
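
A quick check before reinstalling (a minimal sketch, not part of the original reply): print the installed torch build and its CUDA status from the same environment used to launch training.

# prints: torch version, CUDA build version (None means a CPU-only wheel), CUDA availability, visible GPU count
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"

If the CUDA build version is None, the installed wheel is CPU-only; if it shows a version but is_available() is still False, the torch/CUDA driver mismatch mentioned above is the likely cause.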
