The train.py script prints tensor then quits #3284
👋 Hello @cubrink, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you. If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including:

```
$ pip install -r requirements.txt
```

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
I did some further testing by checking out different commits. The first commit that produced the bug is this one.
@cubrink thanks for the bug report. Conda environments can sometimes cause problems; I would recommend you try one of our verified environments above, including the Docker image. A few comments about your example:
YOLOv5 Tutorials
@glenn-jocher, thank you for the advice. I'm aware of the auto-downloading features, but the firewall on the machine in question blocks the download. I've since run the script in the Docker container using DDP and have run into similar issues.

Edit: I can again confirm that the first commit to break the training script is this one. It looks like the line that breaks is:

```python
if tb_writer:
    tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph
```

on line 333 of train.py.

The bugs:

Within Docker, using DDP: the training hangs indefinitely shortly after starting.

Environment
To Reproduce

Basic setup

(At the time of writing the commit from ultralytics/yolov5:latest is …)

Test with DDP:

Result: training hangs after 2 batches. I've let this sit for about 10 minutes without progress.

Test without DDP:

Result: the tensor is printed by train.py.

Expected Behavior

Regular training on the coco128 dataset.
I have also encountered this error when training with my own dataset. To add some information about the problem, the stack trace before the full tensor is printed is:

```
Traceback (most recent call last):
  File "train.py", line 591, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 374, in train
    tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/jit/_trace.py", line 733, in trace
    return trace_module(
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/jit/_trace.py", line 934, in trace_module
    module._c._create_method_from_trace(
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/modules/module.py", line 725, in _call_impl
    result = self._slow_forward(*input, **kwargs)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/modules/module.py", line 709, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient
Tensor:
[...]
```

Additionally, as @cubrink mentioned, commenting the line out makes the script run correctly.
Thanks guys. This must be related to a recent PR #3236 that re-added the TensorBoard graph for interactive model architecture viewing in TensorBoard. It seems to be failing in some environments as in your examples above, though. I'll take a look at this today. Worst case scenario, we can drop it into a try/except statement. @adrigrillo, your error in particular seems to imply that a model with gradients might be causing the issue, or perhaps the nn/parallel lines imply that only DP or DDP models in particular are causing the error?
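The worst-case fallback mentioned above could look something like the following minimal sketch. `safe_add_graph` is an illustrative name, not an actual YOLOv5 function; in train.py the trace function would be something like `lambda: torch.jit.trace(model, imgs, strict=False)`:

```python
# Hedged sketch of the try/except fallback: wrap TensorBoard graph logging so
# a tracing failure prints a warning instead of aborting training.
# safe_add_graph is an illustrative stand-in, not a YOLOv5 or PyTorch API.

def safe_add_graph(tb_writer, trace_fn, inputs=()):
    """Try to log the model graph; return False (with a warning) on failure."""
    try:
        tb_writer.add_graph(trace_fn(), list(inputs))
        return True
    except Exception as e:  # e.g. the RuntimeError raised when tracing DP models
        print(f"WARNING: TensorBoard graph visualization failure: {e}")
        return False
```

With this pattern, a failed trace is reported but training proceeds normally.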
Now that you mention it, the problem could be caused by the use of DP mode, which I was not actually intending to use. I forgot to specify the GPU, as I wanted single-GPU training, and I guess that, by default, if two GPUs are available and no GPU flag is specified, training uses both in DP mode. In any case, commenting out the line makes it work in DP mode, and with one GPU it also works. Therefore, the problem may be related to multi-GPU training and not to the gradient message.
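The implicit fallback described above can be sketched abstractly. `select_parallel_mode` is a hypothetical illustration of the decision, not YOLOv5's actual device-selection code:

```python
# Hedged sketch of the implicit DataParallel fallback described above: when no
# device is specified and several GPUs are visible, training silently runs in
# DP mode. select_parallel_mode is a hypothetical helper, not YOLOv5 code.

def select_parallel_mode(device_flag, visible_gpus):
    """Return 'dp', 'single', or 'cpu' for a requested device and GPU count."""
    if device_flag:            # e.g. --device 0 pins training to a single GPU
        return "single"
    if visible_gpus > 1:       # no flag + multiple GPUs -> DataParallel
        return "dp"
    return "single" if visible_gpus == 1 else "cpu"
```

So explicitly passing a single device (e.g. `--device 0`, or masking GPUs with `CUDA_VISIBLE_DEVICES=0`) avoids the DP path entirely.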
@adrigrillo ok thanks, that makes sense then. It's likely that the graph/script functions were just never intended for use with DP/DDP. I think we can use the same check as on Line 393 in 407dc50.
@adrigrillo @cubrink good news 😃! Your original issue may now be fixed ✅ in PR #3325. This PR de-parallelizes the model before passing it to the TensorBoard add_graph() function. This should resolve the original issue if it was only observed in multi-GPU trainings. There is a UserWarning currently when the graph is saved, but this is expected and should not cause any problems.

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
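The de-parallelize step described in PR #3325 can be sketched with a duck-typed helper. Checking for a `.module` attribute is an illustration of the idea, not the exact YOLOv5 implementation (which checks the DP/DDP wrapper types):

```python
# Hedged sketch of the fix in PR #3325: unwrap a (Distributed)DataParallel
# container before tracing, so torch.jit.trace sees the plain single-device
# module. DP/DDP wrappers expose the wrapped model as `.module`; this
# duck-typed check is illustrative, not the actual YOLOv5 helper.

def de_parallel(model):
    """Return the underlying module if model is a DP/DDP wrapper, else model."""
    return model.module if hasattr(model, "module") else model
```

The tracing call then becomes `tb_writer.add_graph(torch.jit.trace(de_parallel(model), imgs, strict=False), [])`, so only the unwrapped single-device module is ever traced.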
@glenn-jocher Thanks for your great work, but I found the problem mentioned in #3284 (comment) still exists.
#3325 solved for me |
In addition, |
@kanybekasanbekov @SkalskiP yes, as you noticed, we have a new … Though bug fixes will take a different route:
🐛 Bug
When using train.py, a tensor is printed to screen and then the script ends. No training occurs.

To Reproduce (REQUIRED)
Input:
Output:
![start](https://user-images.githubusercontent.com/25831062/119194788-3933b100-ba49-11eb-9eac-090bcc7f3706.png)
![output](https://user-images.githubusercontent.com/25831062/119194852-523c6200-ba49-11eb-8c8d-945b723a7469.png)
The training script starts normally and sits idle briefly. Then a tensor is printed to screen and the script ends.
Initially,

Then,

After the script has stopped, runs/train/yolov5-bug is only partially filled. I'm unsure if this is relevant.

runs/train/yolov5-bug contents:

(.tfevents filename partially redacted for privacy reasons)

Expected behavior
Regular training on the coco128 test dataset.
Environment
Additional context
I've used older versions of YOLOv5 before without issue. I recently decided to try updating, and that's when I ran into this issue.
I will not be able to access this machine again until Monday.