Before Asking

I have read the README carefully.
I want to train my custom dataset, and I have read the tutorials for training on custom data carefully and organized my dataset correctly. (FYI: We recommend applying the config files of xx_finetune.py.)
I have pulled the latest code of the main branch and run again, and the problem still exists.
Search before asking
I have searched the YOLOv6 issues and found no similar questions.
Question
I'm trying to train the yolov6_seg model with multiple GPUs and I'm getting the error below; I don't know which part I need to fix.
/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING]
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] *****************************************
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-15 19:19:43,314] torch.distributed.run: [WARNING] *****************************************
Both worker processes raise the same error:

Traceback (most recent call last):
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 143, in <module>
    main(args)
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 116, in main
    cfg, device, args = check_and_init(args)
  File "/media/HDD/조홍석/YOLOv6/tools/train.py", line 102, in check_and_init
    device = select_device(args.device)
  File "/media/HDD/조홍석/YOLOv6/yolov6/utils/envs.py", line 32, in select_device
    assert torch.cuda.is_available()
AssertionError
[2024-04-15 19:19:48,319] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 29482) of binary: /home/dilab03/anaconda3/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 198, in <module>
main()
File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dilab03/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-04-15_19:19:48
host : dilab03-Super-Server
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 29483)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-15_19:19:48
host : dilab03-Super-Server
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 29482)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Additional
No response
Hi @Cho-Hong-Seok, according to the AssertionError in the log, it seems that CUDA is not available. This might be caused by a mismatch between your torch version and your CUDA version. You could try reinstalling torch following the official installation page.
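Before reinstalling, a quick diagnostic may help narrow it down. The sketch below only uses standard torch APIs to report what your current install can see; run it in the same environment you launch training from:

```python
import torch

# Check whether PyTorch can see CUDA at all.
print("torch version:       ", torch.__version__)
print("CUDA available:      ", torch.cuda.is_available())
print("CUDA version (torch):", torch.version.cuda)  # None means a CPU-only build

if torch.cuda.is_available():
    print("GPU count:           ", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:", torch.cuda.get_device_name(i))
```

If torch.version.cuda prints None, the installed torch is a CPU-only build, which would explain the assertion failure; if it prints a CUDA version your driver does not support, reinstall torch with wheels built for your CUDA setup. Separately, the launcher warning in your log suggests migrating from torch.distributed.launch to torchrun, though that is unrelated to this assertion failure.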