Multi-gpu problem #1036

Open
4 tasks done
Cho-Hong-Seok opened this issue Apr 15, 2024 · 1 comment
Labels
question Further information is requested

Comments

Cho-Hong-Seok commented Apr 15, 2024

Before Asking

  • I have read the README carefully.

  • I want to train my custom dataset, and I have read the tutorial for training custom data carefully and organized my dataset correctly. (FYI: We recommend using the xx_finetune.py config files for custom datasets.)

  • I have pulled the latest code of the main branch and rerun it, and the problem still exists.

Search before asking

  • I have searched the YOLOv6 issues and found no similar questions.

Question

I'm trying to train the yolov6_seg model on multiple GPUs and I'm getting an error; I don't know which part I need to fix.

Training command

!python -m torch.distributed.launch --nproc_per_node 2 tools/train.py --batch 64 --conf configs/yolov6s_seg.py --epoch 150 --data ../FST1/data.yaml --device 0,1
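
For reference, the deprecation warning in the log below recommends torchrun instead of torch.distributed.launch. A sketch of the equivalent launch, assuming tools/train.py reads LOCAL_RANK from the environment rather than expecting a --local-rank argument:

# sketch only: assumes tools/train.py reads os.environ['LOCAL_RANK'] (see the warning in the log below)
!torchrun --nproc_per_node 2 tools/train.py --batch 64 --conf configs/yolov6s_seg.py --epoch 150 --data ../FST1/data.yaml --device 0,1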

Error log

/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] 
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] *****************************************
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 143, in <module>
    main(args)
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 116, in main
    cfg, device, args = check_and_init(args)
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 102, in check_and_init
    device = select_device(args.device)
  File "/media/HDD/조홍석/YOLOv6/yolov6/utils/envs.py", line 32, in select_device
    assert torch.cuda.is_available()
AssertionError
(the same traceback is printed by both worker processes)
[2024-04-15 19:19:48,319] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 29482) of binary: /home/dilab03/anaconda3/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
    main()
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
    launch(args)
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
    run(args)
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-04-15_19:19:48
  host      : dilab03-Super-Server
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 29483)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-15_19:19:48
  host      : dilab03-Super-Server
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 29482)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Additional

No response

Cho-Hong-Seok added the question (Further information is requested) label on Apr 15, 2024
@Chilicyy (Collaborator) commented:

Hi @Cho-Hong-Seok, according to the AssertionError in the log, it seems that CUDA is not available. This might be caused by a mismatch between your torch version and your CUDA version. You could try reinstalling torch by referring to the official page.
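
A quick check before reinstalling (a minimal sketch, not part of the original reply): print the installed torch build and its CUDA status from the same environment used to launch training.

# prints: torch version, CUDA build version (None means a CPU-only wheel), CUDA availability, visible GPU count
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"

If the CUDA build version is None, the installed wheel is CPU-only; if it shows a version but is_available() is still False, the torch/CUDA driver mismatch mentioned above is the likely cause.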
