runs not logging separately in wandb.ai #1937

Closed
thesauravs opened this issue Jan 14, 2021 · 7 comments · Fixed by #1943
Labels
bug Something isn't working

Comments

thesauravs commented Jan 14, 2021

  • Common dataset: coco128.yaml
  • Common environment: Colab

🐛 Bug

Every time a new training run is started, wandb.ai logs it into the existing run instead of creating a separate one.

To Reproduce (REQUIRED)

Input:

!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt  # install dependencies

import torch
from IPython.display import Image, clear_output  # to display images

clear_output()
print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

# Download COCO128
torch.hub.download_url_to_file('https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip', 'tmp.zip')
!unzip -q tmp.zip -d ../ && rm tmp.zip

# Weights & Biases (optional)
%pip install -q wandb  
!wandb login  # use 'wandb disabled' or 'wandb enabled' to disable or enable

# Train YOLOv5s on COCO128 for 2 epochs batch size 16
!python train.py --img 640 --batch 16 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_16 

# Train YOLOv5s on COCO128 for 2 epochs batch size 8
!python train.py --img 640 --batch 8 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_8

Output:

Link for wandb.ai - https://wandb.ai/sauravmail/YOLOv5?workspace=user-

Expected behaviour

Each run should be logged as its own separate run on wandb.ai.

Environment

thesauravs added the bug label on Jan 14, 2021

github-actions bot commented Jan 14, 2021

👋 Hello @thesauravs, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented Jan 14, 2021

@thesauravs thanks for the bug report. I think I'm able to reproduce this. I've slightly updated the reproduction code below (COCO128 autodownloads on first use, so you don't need to download it manually before training).

@AyushExel this appears to be a similar --resume issue to the one I thought we fixed in PR #1852. I've verified it is reproducible in the current master. I will look into it.

!git clone https://github.com/ultralytics/yolov5  # clone repo
%cd yolov5
%pip install -qr requirements.txt  # install dependencies

import torch
from IPython.display import Image, clear_output  # to display images

clear_output()
print('Setup complete. Using torch %s %s' % (torch.__version__, torch.cuda.get_device_properties(0) if torch.cuda.is_available() else 'CPU'))

# Weights & Biases (optional)
%pip install -q wandb  
!wandb login  # use 'wandb disabled' or 'wandb enabled' to disable or enable

# Train YOLOv5s on COCO128 for 2 epochs batch size 16
!python train.py --img 640 --batch 16 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_16 

# Train YOLOv5s on COCO128 for 2 epochs batch size 8
!python train.py --img 640 --batch 8 --epochs 2 --data coco128.yaml --weights yolov5s.pt --nosave --cache --name bat_8

@glenn-jocher

I also see a separate bug here. The training mosaics are plotted in daemon threads, but it appears that they may fail to render and save before the wandb.log() command is later run. I'll think of a fix for this second issue as well.

  File "train.py", line 518, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 322, in train
    wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in save_dir.glob('train*.jpg')]})
  File "train.py", line 322, in <listcomp>
    wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in save_dir.glob('train*.jpg')]})
  File "/usr/local/lib/python3.6/dist-packages/wandb/data_types.py", line 1555, in __init__
    self._initialize_from_path(data_or_path)
  File "/usr/local/lib/python3.6/dist-packages/wandb/data_types.py", line 1625, in _initialize_from_path
    self._image = pil_image.open(path)
  File "/usr/local/lib/python3.6/dist-packages/PIL/Image.py", line 2862, in open
    "cannot identify image file %r" % (filename if filename else fp)
PIL.UnidentifiedImageError: cannot identify image file 'runs/train/bat_16/train_batch2.jpg'
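
For reference, a minimal sketch of one way to guard against this (illustrative only, not the fix that was actually merged): skip any mosaic file that PIL cannot yet open, so a half-written JPEG from a plotting daemon thread does not crash the wandb.log() call. The readable_mosaics helper is an assumed name, not part of train.py.

from pathlib import Path

import wandb
from PIL import Image, UnidentifiedImageError

def readable_mosaics(save_dir):
    # Yield only the train*.jpg mosaics that PIL can already open,
    # i.e. files the plotting daemon threads have finished writing.
    for x in Path(save_dir).glob('train*.jpg'):
        try:
            with Image.open(x) as im:
                im.verify()  # raises if the file is truncated or not yet a valid image
            yield x
        except (UnidentifiedImageError, OSError):
            continue  # still being written; skip it for now

# wandb.log({"Mosaics": [wandb.Image(str(x), caption=x.name) for x in readable_mosaics(save_dir)]})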

glenn-jocher commented Jan 14, 2021

I've merged a fix for the mosaic daemon bug, which was the secondary error I observed above.

I've retested after this fix and verified that the secondary issue is resolved, but the primary issue of wandb runs being resumed remains. The runs are not resumed locally, only on wandb. @AyushExel do you have any ideas what might be happening?

AyushExel commented Jan 15, 2021

@glenn-jocher I noticed something which might be useful for debugging. Somehow the checkpoint from the first training run's folder is being loaded on every subsequent run, so each new training resumes the same wandb run. I worked around it by taking the wandb id from the checkpoint only when opt.resume is set:

id=ckpt.get('wandb_id') if 'ckpt' in locals() and opt.resume else None

This solution works perfectly, but I'm not sure why this occurs in the first place! Any ideas?
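
For clarity, here is a minimal sketch of that gating as a standalone function (the name init_wandb_run and the exact wandb.init arguments are illustrative, not the code train.py actually uses):

import wandb

def init_wandb_run(ckpt, resume_requested, run_name):
    # Reuse a wandb run id stored in the checkpoint only when --resume was requested;
    # otherwise start a fresh run so every training gets its own wandb page.
    run_id = ckpt.get('wandb_id') if ckpt and resume_requested else None
    return wandb.init(project='YOLOv5', name=run_name, id=run_id,
                      resume='allow' if run_id else None)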

@glenn-jocher

@AyushExel @thesauravs ok, I found the problem. The official models in https://github.com/ultralytics/yolov5/releases still contained wandb_id keys from their training runs. I thought I had stripped these and updated the models, but apparently I didn't complete the process correctly. I've repeated it and reuploaded all 4 models, now stripped of their wandb_ids, so this should be solved.

@thesauravs if you delete your local models and rerun your commands (allowing the updated models to autodownload), I believe everything will work correctly. Can you test this out and verify that the problem is solved on your side?
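
For anyone who prefers to clean an already-downloaded checkpoint locally instead of re-downloading, something along these lines should work (a sketch only; strip_wandb_id is an assumed helper name, and it should be run from inside the yolov5 repo so the pickled model classes can be imported):

import torch

def strip_wandb_id(path):
    # Load the checkpoint, drop any stale wandb_id key, and save it back in place.
    ckpt = torch.load(path, map_location='cpu')
    if ckpt.pop('wandb_id', None) is not None:
        torch.save(ckpt, path)

strip_wandb_id('yolov5s.pt')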

@glenn-jocher

I've verified that the reproduction code above now behaves correctly for me in Colab.
