The train.py script prints tensor then quits #3284

Closed
cubrink opened this issue May 21, 2021 · 14 comments · Fixed by #3325
Labels
bug Something isn't working

Comments

@cubrink

cubrink commented May 21, 2021

🐛 Bug


When using train.py a tensor is printed to screen and then the script ends. No training occurs.

To Reproduce (REQUIRED)

Input:

cd <path_to_save_to>/
git clone https://github.com/ultralytics/yolov5.git    # At time of writing: commit 683cefead4b9f2a8d062f953a912e46e456ed6ad
cd yolov5

conda create --name yolov5-bug          # Create new environment
conda activate yolov5-bug
conda install python==3.8.*             # Base python install
conda install -c pytorch torchvision    # Install pytorch from official conda channel. 
                                        # This auto configures GPU support and has worked for me in the past
                                        # Currently this installs torchvision==0.9.1, pytorch==1.8.1, cudatoolkit==10.2.89
conda install cudnn                     # More GPU setup

pip install -r requirements.txt         # Get remaining dependencies


# Download yolov5m.pt from https://github.com/ultralytics/yolov5/releases
# Place yolov5m.pt in yolov5/weights

# Download coco128
# Place at ../coco128 

python train.py --weights weights/yolov5m.pt \
                --data data/coco128.yaml \
                --cfg models/yolov5m.yaml \
                --name yolov5-bug

Output:
The training script starts normally and sits idle briefly. Then a tensor is printed to screen and the script ends.
Initially:
(screenshot: normal training startup output)
Then:
(screenshot: a tensor printed to screen before the script exits)

After the script has stopped, runs/train/yolov5-bug is only partially populated. I'm unsure if this is relevant.
runs/train/yolov5-bug contents:

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----         5/21/2021   3:11 PM                weights
------         5/21/2021   3:11 PM           5533 events.out.tfevents...
------         5/21/2021   3:11 PM            356 hyp.yaml
------         5/21/2021   3:11 PM         457560 labels.jpg
------         5/21/2021   3:11 PM         349727 labels_correlogram.jpg
------         5/21/2021   3:11 PM            672 opt.yaml
------         5/21/2021   3:11 PM         365691 train_batch0.jpg

(.tfevents filename partially redacted for privacy reasons)

Expected behavior


Regular training on the coco128 test dataset.

Environment


  • OS: Ubuntu 18.04.5 LTS
  • GPU: 4x RTX 2080 Ti

Additional context


I've used older versions of YOLOv5 before without issue. I recently decided to update, which is when I ran into this issue.
I will not be able to access this machine again until Monday.

@cubrink cubrink added the bug Something isn't working label May 21, 2021
@github-actions
Contributor

github-actions bot commented May 21, 2021

👋 Hello @cubrink, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@cubrink
Author

cubrink commented May 21, 2021

I did some further testing by checking out different commits. The first commit that produced the bug is this one.

@glenn-jocher
Member

glenn-jocher commented May 23, 2021

@cubrink thanks for the bug report. Conda environments can sometimes cause problems; I would recommend trying one of our verified environments above, including the Docker image.

A few comments about your example:

  • YOLOv5 models are autodownloaded on first request, so you can eliminate this step in your workflow.
  • COCO128 is autodownloaded on first request, so you can eliminate this step in your workflow.
  • Multi-GPU training should be done using DDP commands for best results. See Multi-GPU training tutorial below.

YOLOv5 Tutorials

@cubrink
Author

cubrink commented May 24, 2021

@glenn-jocher, thank you for the advice. I'm aware of the auto-downloading features, but the firewall on the machine in question blocks the downloads.

I've since run the script in the Docker container using DDP and have run into similar issues.

Edit: I can again confirm that the first commit to break the training script is this one. This was the same commit that broke train.py in my initial bug report.

It looks like the line that breaks train.py is:

if tb_writer:
    tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph

This is on line 333 of train.py. I was able to fix train.py by commenting out these lines.

The bugs:

  • Within Docker, using DDP: training hangs indefinitely shortly after starting.
  • Within Docker, without DDP: the original error occurs (a tensor is printed, then the script exits).

Environment

  • OS: Ubuntu 18.04.5 LTS
  • GPU: 4x RTX 2080 Ti
  • NVIDIA Drivers: 465.19.01
  • Docker version: 19.03.13

To Reproduce

Basic setup

# Volume added because autodownload is blocked by firewall settings.
# On a different machine the volume would not be needed.
# The data is from official sources, so it should work without issue.

sudo docker run --rm --ipc=host --gpus all -it -v <my_local_path>/yolov5_resources ultralytics/yolov5:latest

# (Due to personal firewall settings, copy weights into ./weights and dataset into ../coco128)

(At the time of writing the commit from ultralytics/yolov5:latest is 61ea23c)

Test with DDP:

python -m torch.distributed.launch \
    --nproc_per_node 4 \
    train.py \
    --weights weights/yolov5m.pt \
    --cfg models/yolov5m.yaml \
    --data data/coco128.yaml \
    --device 0,1,2,3

Result: Training hangs after 2 batches. I've let this sit for about 10 minutes without progress.
(screenshot: DDP training output, hanging after 2 batches)
(Note the warning that is raised before hanging: UserWarning: The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is.)
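For context, this warning comes from torch.jit.trace() itself: if it is handed something that is already a ScriptModule, it warns and returns the object unchanged. A minimal standalone sketch of that behaviour (my own illustration, not yolov5 code, based on PyTorch 1.8-era behaviour):

import torch
import torch.nn as nn

scripted = torch.jit.script(nn.Linear(4, 2))            # already a ScriptModule
traced = torch.jit.trace(scripted, torch.randn(1, 4))   # emits the UserWarning quoted above
assert traced is scripted                               # no-op: the same object is returned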

Test without DDP:

python train.py \
    --weights weights/yolov5m.pt \
    --cfg models/yolov5m.yaml \
    --data data/coco128.yaml \
    --device 0,1,2,3

(screenshot: training output, tensor printed before the script exits)

Result: A tensor is printed by train.py, then the script exits. (This is the same bug that was originally reported.)

Expected Behavior

Regular training on the coco128 dataset.

@adrigrillo

I have also encountered this error when training with my own dataset. Just to add some information about the problem, the stack trace printed before the full tensor is:

Traceback (most recent call last):
  File "train.py", line 591, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 374, in train
    tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/jit/_trace.py", line 733, in trace
    return trace_module(
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/jit/_trace.py", line 934, in trace_module
    module._c._create_method_from_trace(
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/modules/module.py", line 725, in _call_impl
    result = self._slow_forward(*input, **kwargs)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/modules/module.py", line 709, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/uas/anaconda3/envs/yolo-boats/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient
Tensor:
[...]

Additionally, as @cubrink mentioned, with the line commented out the script runs correctly.

@glenn-jocher
Member

Thanks guys. This must be related to a recent PR #3236 that re-added the TensorBoard graph for interactive model architecture viewing in TensorBoard. It seems to be failing in some environments, though, as in your examples above. I'll take a look at this today. Worst case, we can wrap it in a try/except statement.
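For illustration, a minimal sketch of that worst-case try/except guard around the existing call (my own sketch rather than a committed fix), so a graph failure would be logged instead of ending training:

if tb_writer:
    try:
        tb_writer.add_graph(torch.jit.trace(model, imgs, strict=False), [])  # add model graph
    except Exception as e:
        print(f'TensorBoard graph logging failed: {e}')  # skip the graph, continue training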

@adrigrillo your error in particular seems to imply that a model with gradients might be causing the issue, or perhaps the nn/parallel lines imply that only DP or DDP models in particular are causing the error?


@adrigrillo

Now that you mention it, the problem could be caused by DP mode, which I was not actually intending to use. I forgot to specify which GPU to use, as I wanted single-GPU training, and I guess that, by default, if two GPUs are available and no GPU flag is specified, training uses both in DP mode.

Anyway, commenting out the line makes it work in DP mode, and training with a single GPU also works. So the problem seems to be related to multi-GPU training rather than to the gradient message itself.
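For reference, a small standalone sketch of the behaviour described above (my own illustration, not the exact yolov5 device-selection code): with more than one visible GPU and no --device flag, the model ends up wrapped in nn.DataParallel, which is the wrapper appearing in the traceback above.

import torch
import torch.nn as nn

model = nn.Linear(10, 2).cuda()        # stand-in for the YOLOv5 model
if torch.cuda.device_count() > 1:      # no --device flag: all GPUs are visible
    model = nn.DataParallel(model)     # DP wrapping, the case that breaks add_graph()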

@glenn-jocher
Member

@adrigrillo ok thanks, that makes sense then. It's likely that the graph/script functions were just never intended for use with DP/DDP. I think we can use the same is_parallel() check used elsewhere for these cases:

yolov5/train.py, line 393 (commit 407dc50):

'model': deepcopy(model.module if is_parallel(model) else model).half(),
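For context, a minimal sketch of what that de-parallelization could look like around the add_graph() call (an illustration based on the is_parallel() helper in utils/torch_utils.py, not the exact PR #3325 diff):

import torch
import torch.nn as nn

def is_parallel(model):
    # True if the model is wrapped in DataParallel or DistributedDataParallel
    return type(model) in (nn.parallel.DataParallel, nn.parallel.DistributedDataParallel)

if tb_writer:
    m = model.module if is_parallel(model) else model                 # unwrap DP/DDP first
    tb_writer.add_graph(torch.jit.trace(m, imgs, strict=False), [])   # add model graph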

@glenn-jocher
Member

@adrigrillo @cubrink good news 😃! Your original issue may now be fixed ✅ in PR #3325. This PR de-parallelizes the model before passing it to the TensorBoard add_graph() function, which should resolve the original issue if it was only observed in multi-GPU trainings. There is currently a UserWarning when the graph is saved, but this is expected and should not cause any problems (UserWarning: The input to trace is already a ScriptModule, tracing it is a no-op. Returning the object as is.). To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View the updated notebooks on Colab or Kaggle
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@shufanwu

@glenn-jocher Thanks for your great work, but I found the problem mentioned in #3284 (comment) still exists.
When I use Docker to run the code with DDP, the training hangs indefinitely.

@glenn-jocher
Member

@shufanwu hi thanks for the feedback. Can you confirm you are seeing this error in the latest Docker image? You can pull the latest image using the command below.

  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

@kanybekasanbekov

@glenn-jocher Thanks for your great work, but I found the problem mentioned in #3284 (comment) still exists.
When I use Docker to run the code with DDP, the training hangs indefinitely.

#3325 solved it for me.

In addition, the develop (default) branch still has this issue, whereas the master branch has been fixed.

@glenn-jocher
Member

@kanybekasanbekov @SkalskiP yes, as you noticed, we have a new develop branch. The idea is to adopt a workflow closer to best practices, where we mostly update the develop branch and periodically merge it to master on a new patch release.
master < develop < feature

Though bug fixes will take a different route:
master < fix
