Improves docs and handling of entities and resuming by WandbLogger #3264

charlesfrye · 2021-05-20T21:44:57Z

What does this fix?

The --entity argument for configuring the WandbLogger is currently unused. This PR fixes that, including for resumed runs, fixes a bug for run resuming, and adds a docstring to the WandbLogger class.

How has this been tested?

The core fix, to entity setting for non-resumed runs, is tested in this repro colab, and the new tagging is used in this artifact produced by that colab.

The other fixes, including resumption based on the new tags, have not been directly tested, since I don't have a good setup for testing resumption and DDP. @AyushExel could you check these?

Deep dive

Projects in Weights & Biases are organized under entities, which may be organizations, teams, or individual users. A user may sometimes wish to log to their personal entity (e.g. glenn-jocher) or to their team entity (e.g. ultralytics).

The --entity CLI argument to train.py is intended to allow users to set the entity to which they are logging. It does not do so currently, and this PR fixes that.

In addition, the handling of entities in resumed runs (and in DDP sub-processes during resumed runs) was implicit -- it was assumed that the entity of the provided run was the same as the user. This has been corrected in the PR, and resumed runs are now logged to the original entity, not the user's entity.

Separately, this surfaced an issue with the resumption of runs -- in train.py, models are given the alias last, while in the code for resumption it is assumed that they have alias latest. The former is a YOLOv5 convention and the latter is a wandb convention. For backwards compatibility and completeness, this PR applies both aliases, but could be amended to use just one or the other.

Finally, this commit adds extra docstrings to wandb_utils: at the module level and to WandbLogger.

🛠️ PR Summary

_{Made with ❤️ by Ultralytics Actions}

🌟 Summary

Enhancements to Weights & Biases integration in YOLOv5 training script.

📊 Key Changes

Updated alias from 'last' to 'latest' for logged models in train.py.
Expanded the wandb_utils.py with a docstring introduction and improved code comments.
Included entity in Weights & Biases artifact paths, allowing logging across different W&B entities.
Updated model artifact logging with new alias 'last' ensuring consistency in naming.
Added assertion in download_model_artifact to only allow resuming incomplete runs.

🎯 Purpose & Impact

🎨 Improved Weights & Biases documentation and clarity enhances user understanding of the integration.
🛠️ Refined usage of entity allows training across various Weights & Biases teams/entities, providing better organization and access control.
🚀 The updated aliasing ensures a more intuitive retrieval of the latest model state, aiding users in model management.
🛑 Enhanced resuming logic prevents unnecessary attempts to resume already completed training, avoiding user confusion and wasted resources.

github-actions

👋 Hello @charlesfrye, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master an automatic GitHub actions rebase may be attempted by including the /rebase command in a comment body, or by running the following code, replacing 'feature' with the name of your local branch:

git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f

✅ Verify all Continuous Integration (CI) checks are passing.
✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

charlesfrye · 2021-05-20T22:55:52Z

Update

Courtesy of @AyushExel, who tested more of the features implemented here, 4d79628 improves the error message on resumption failure.

Non-parallel resumption has been tested. DDP resumption has not been tested due to issues getting an environment set up (which are the same in master and in this branch).

Deep Dive on Update

The previous check looked for whether metadata epochs_trained was less than total_epochs to determine whether training finished. This returns True when training was interrupted, as expected.

But for finished runs, these variables are actually NoneType, as that metadata is not logged for stripped models, which only have the info needed for inference, not training. The assert "fails" and the program crashes, giving correct behavior, but the error is due to applying < to a None. 4d79628 corrects the inference of whether training has finished and prints a more helpful error message.

glenn-jocher · 2021-05-21T08:17:47Z

@charlesfrye @AyushExel great, thanks for your contributions guys!

Is there a way to consolidate the last/latest checkpoint names into a single one? Maybe renaming last.pt to latest.pt on upload (so users see only latest.pt in W&B and only last.pt locally)? I think we probably want to try to avoid the redundancy to reduce user confusion if possible.

charlesfrye · 2021-05-21T19:47:08Z

I included both aliases to preserve backwards compatibility, but given that this feature is relatively fresh, I think it's right to prefer simplicity and we can just use latest. The filename (last.pt/best.pt) and the alias (latest, best, stripped) are separate, so we don't need to change those.

glenn-jocher · 2021-05-21T21:43:18Z

@charlesfrye great, PR is merged! :)

…ltralytics#3264) * adds latest tag to match wandb defaults * adds entity handling, 'last' tag * fixes bug causing finished runs to resume * removes redundant "last" tag for wandb artifact (cherry picked from commit 19100ba)

…ltralytics#3264) * adds latest tag to match wandb defaults * adds entity handling, 'last' tag * fixes bug causing finished runs to resume * removes redundant "last" tag for wandb artifact

charlesfrye added 2 commits May 20, 2021 13:23

adds latest tag to match wandb defaults

24c3d30

adds entity handling, 'last' tag

ede690f

github-actions bot reviewed May 20, 2021

View reviewed changes

fixes bug causing finished runs to resume

4d79628

glenn-jocher assigned charlesfrye May 21, 2021

removes redundant "last" tag for wandb artifact

b298a69

glenn-jocher merged commit 19100ba into ultralytics:master May 21, 2021

glenn-jocher mentioned this pull request Oct 12, 2021

YOLOv5 release v6.0 #5141

Merged

glenn-jocher mentioned this pull request Nov 7, 2021

YOLOv5 v6.0 compatibility update (draft) ultralytics/yolov3#1855

Closed

glenn-jocher mentioned this pull request Nov 14, 2021

YOLOv5 v6.0 compatibility update ultralytics/yolov3#1857

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improves docs and handling of entities and resuming by WandbLogger #3264

Improves docs and handling of entities and resuming by WandbLogger #3264

charlesfrye commented May 20, 2021 •

edited by UltralyticsAssistant

Loading

github-actions bot left a comment

charlesfrye commented May 20, 2021 •

edited

Loading

glenn-jocher commented May 21, 2021 •

edited

Loading

charlesfrye commented May 21, 2021

glenn-jocher commented May 21, 2021

Improves docs and handling of entities and resuming by WandbLogger #3264

Improves docs and handling of entities and resuming by WandbLogger #3264

Conversation

charlesfrye commented May 20, 2021 • edited by UltralyticsAssistant Loading

What does this fix?

How has this been tested?

Deep dive

🛠️ PR Summary

🌟 Summary

📊 Key Changes

🎯 Purpose & Impact

github-actions bot left a comment

Choose a reason for hiding this comment

charlesfrye commented May 20, 2021 • edited Loading

Update

Deep Dive on Update

glenn-jocher commented May 21, 2021 • edited Loading

charlesfrye commented May 21, 2021

glenn-jocher commented May 21, 2021

charlesfrye commented May 20, 2021 •

edited by UltralyticsAssistant

Loading

charlesfrye commented May 20, 2021 •

edited

Loading

glenn-jocher commented May 21, 2021 •

edited

Loading