runs not logging separately in wandb.ai #1937

Closed
thesauravs opened this issue Jan 14, 2021 · 7 comments · Fixed by #1943
Labels
bug Something isn't working

Comments

thesauravs commented Jan 14, 2021

  • Common dataset: coco128.yaml
  • Common environment: Colab

🐛 Bug

Every time a new training run is started, wandb.ai logs it into the existing run instead of creating a separate one.

To Reproduce (REQUIRED)

Input:

!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt  # install dependencies

import torch
from IPython.display import Image, clear_output  # to display images

clear_output()
print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

# Download COCO128
torch.hub.download_url_to_file('https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip', 'tmp.zip')
!unzip -q tmp.zip -d ../ && rm tmp.zip

# Weights & Biases (optional)
%pip install -q wandb  
!wandb login  # use 'wandb disabled' or 'wandb enabled' to disable or enable

# Train YOLOv5s on COCO128 for 2 epochs batch size 16
!python train.py --img 640 --batch 16 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_16 

# Train YOLOv5s on COCO128 for 2 epochs batch size 8
!python train.py --img 640 --batch 8 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_8

Output:

Link for wandb.ai - https://wandb.ai/sauravmail/YOLOv5?workspace=user-

Expected behaviour

Each run should be logged as its own separate run on wandb.ai.

Environment

thesauravs added the bug label on Jan 14, 2021

github-actions bot commented Jan 14, 2021

👋 Hello @thesauravs, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented Jan 14, 2021

@thesauravs thanks for the bug report. I think I'm able to reproduce this. I've slightly updated the reproduction code below (COCO128 autodownloads on first use, so you don't need to download it manually before training).

@AyushExel this appears to be a similar --resume issue to the one I thought we fixed in PR #1852. I've verified it is reproducible in the current master. I will look into it.

!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt  # install dependencies

import torch
from IPython.display import Image, clear_output  # to display images

clear_output()
print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

# Weights & Biases (optional)
%pip install -q wandb  
!wandb login  # use 'wandb disabled' or 'wandb enabled' to disable or enable

# Train YOLOv5s on COCO128 for 2 epochs batch size 16
!python train.py --img 640 --batch 16 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_16 

# Train YOLOv5s on COCO128 for 2 epochs batch size 8
!python train.py --img 640 --batch 8 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_8

@glenn-jocher

I also see a separate bug here. The training mosaics are plotted in daemon threads, but it appears that they may fail to render and save before the wandb.log() command is later run. I'll think of a fix for this second issue as well.

  File "train.py", line 518, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 322, in train
    wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in save_dir.glob('train*.jpg')]})
  File "train.py", line 322, in <listcomp>
    wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in save_dir.glob('train*.jpg')]})
  File "/usr/local/lib/python3.6/dist-packages/wandb/data_types.py", line 1555, in __init__
    self._initialize_from_path(data_or_path)
  File "/usr/local/lib/python3.6/dist-packages/wandb/data_types.py", line 1625, in _initialize_from_path
    self._image = pil_image.open(path)
  File "/usr/local/lib/python3.6/dist-packages/PIL/Image.py", line 2862, in open
    "cannot identify image file %r" % (filename if filename else fp)
PIL.UnidentifiedImageError: cannot identify image file 'runs/train/bat_16/train_batch2.jpg'
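
For reference, a minimal sketch of one way to guard against this (illustrative only, not the fix that was actually merged): skip any mosaic file that PIL cannot yet open, so a half-written JPEG from a plotting daemon thread does not crash the wandb.log() call. The readable_mosaics helper is an assumed name, not part of train.py.

from pathlib import Path

import wandb
from PIL import Image, UnidentifiedImageError

def readable_mosaics(save_dir):
    # Yield only the train*.jpg mosaics that PIL can already open,
    # i.e. files the plotting daemon threads have finished writing.
    for x in Path(save_dir).glob('train*.jpg'):
        try:
            with Image.open(x) as im:
                im.verify()  # raises if the file is truncated or not yet a valid image
            yield x
        except (UnidentifiedImageError, OSError):
            continue  # still being written; skip it for now

# wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in readable_mosaics(save_dir)]})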

glenn-jocher commented Jan 14, 2021

I've merged a fix for the mosaic daemon bug, which was the secondary error I observed above.

I've retested after this fix and verified that the secondary issue is resolved, but the primary issue of wandb runs being resumed remains. The runs are not resumed locally, only on wandb. @AyushExel do you have any ideas what might be happening?

AyushExel commented Jan 15, 2021

@glenn-jocher I noticed something which might be useful for debugging. Somehow the checkpoint from the first training run's folder is being loaded on every subsequent run, so each new training resumes the same wandb run. I worked around it by taking the wandb id from the checkpoint only when opt.resume is set:

id=ckpt.get('wandb_id') if 'ckpt' in locals() and opt.resume else None

This solution works perfectly, but I'm not sure why this occurs in the first place! Any ideas?
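
For clarity, here is a minimal sketch of that gating as a standalone function (the name init_wandb_run and the exact wandb.init arguments are illustrative, not the code train.py actually uses):

import wandb

def init_wandb_run(ckpt, resume_requested, run_name):
    # Reuse a wandb run id stored in the checkpoint only when --resume was requested;
    # otherwise start a fresh run so every training gets its own wandb page.
    run_id = ckpt.get('wandb_id') if ckpt and resume_requested else None
    return wandb.init(project='YOLOv5', name=run_name, id=run_id,
                      resume='allow' if run_id else None)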

@glenn-jocher

@AyushExel @thesauravs ok, I found the problem. The official models in https://github.com/ultralytics/yolov5/releases still contained wandb_id keys from their training runs. I thought I had stripped these and updated the models, but apparently I didn't complete the process correctly. I've repeated it and reuploaded all 4 models, now stripped of their wandb_ids, so this should be solved.

@thesauravs if you delete your local models and rerun your commands (allowing the updated models to autodownload), I believe everything will work correctly. Can you test this out and verify that the problem is solved on your side?
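
For anyone who prefers to clean an already-downloaded checkpoint locally instead of re-downloading, something along these lines should work (a sketch only; strip_wandb_id is an assumed helper name, and it should be run from inside the yolov5 repo so the pickled model classes can be imported):

import torch

def strip_wandb_id(path):
    # Load the checkpoint, drop any stale wandb_id key, and save it back in place.
    ckpt = torch.load(path, map_location='cpu')
    if ckpt.pop('wandb_id', None) is not None:
        torch.save(ckpt, path)

strip_wandb_id('yolov5s.pt')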

@glenn-jocher

I've verified that the reproduction code above now behaves correctly for me in Colab.
