Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improves docs and handling of entities and resuming by WandbLogger #3264

Merged
merged 4 commits into from
May 21, 2021

Conversation

charlesfrye
Copy link
Contributor

@charlesfrye charlesfrye commented May 20, 2021

What does this fix?

The --entity argument for configuring the WandbLogger is currently unused. This PR fixes that, including for resumed runs, fixes a bug for run resuming, and adds a docstring to the WandbLogger class.

How has this been tested?

The core fix, to entity setting for non-resumed runs, is tested in this repro colab, and the new tagging is used in this artifact produced by that colab.

The other fixes, including resumption based on the new tags, have not been directly tested, since I don't have a good setup for testing resumption and DDP. @AyushExel could you check these?

Deep dive

Projects in Weights & Biases are organized under entities, which may be organizations, teams, or individual users. A user may sometimes wish to log to their personal entity (e.g. glenn-jocher) or to their team entity (e.g. ultralytics).

The --entity CLI argument to train.py is intended to allow users to set the entity to which they are logging. It does not do so currently, and this PR fixes that.

In addition, the handling of entities in resumed runs (and in DDP sub-processes during resumed runs) was implicit -- it was assumed that the entity of the provided run was the same as the user. This has been corrected in the PR, and resumed runs are now logged to the original entity, not the user's entity.

Separately, this surfaced an issue with the resumption of runs -- in train.py, models are given the alias last, while in the code for resumption it is assumed that they have alias latest. The former is a YOLOv5 convention and the latter is a wandb convention. For backwards compatibility and completeness, this PR applies both aliases, but could be amended to use just one or the other.

Finally, this commit adds extra docstrings to wandb_utils: at the module level and to WandbLogger.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhancements to Weights & Biases integration in YOLOv5 training script.

📊 Key Changes

  • Updated alias from 'last' to 'latest' for logged models in train.py.
  • Expanded the wandb_utils.py with a docstring introduction and improved code comments.
  • Included entity in Weights & Biases artifact paths, allowing logging across different W&B entities.
  • Updated model artifact logging with new alias 'last' ensuring consistency in naming.
  • Added assertion in download_model_artifact to only allow resuming incomplete runs.

🎯 Purpose & Impact

  • 🎨 Improved Weights & Biases documentation and clarity enhances user understanding of the integration.
  • 🛠️ Refined usage of entity allows training across various Weights & Biases teams/entities, providing better organization and access control.
  • 🚀 The updated aliasing ensures a more intuitive retrieval of the latest model state, aiding users in model management.
  • 🛑 Enhanced resuming logic prevents unnecessary attempts to resume already completed training, avoiding user confusion and wasted resources.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 Hello @charlesfrye, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master an automatic GitHub actions rebase may be attempted by including the /rebase command in a comment body, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f
  • ✅ Verify all Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@charlesfrye
Copy link
Contributor Author

charlesfrye commented May 20, 2021

Update

Courtesy of @AyushExel, who tested more of the features implemented here, 4d79628 improves the error message on resumption failure.

Non-parallel resumption has been tested. DDP resumption has not been tested due to issues getting an environment set up (which are the same in master and in this branch).

Deep Dive on Update

The previous check looked for whether metadata epochs_trained was less than total_epochs to determine whether training finished. This returns True when training was interrupted, as expected.

But for finished runs, these variables are actually NoneType, as that metadata is not logged for stripped models, which only have the info needed for inference, not training. The assert "fails" and the program crashes, giving correct behavior, but the error is due to applying < to a None. 4d79628 corrects the inference of whether training has finished and prints a more helpful error message.

@glenn-jocher
Copy link
Member

glenn-jocher commented May 21, 2021

@charlesfrye @AyushExel great, thanks for your contributions guys!

Is there a way to consolidate the last/latest checkpoint names into a single one? Maybe renaming last.pt to latest.pt on upload (so users see only latest.pt in W&B and only last.pt locally)? I think we probably want to try to avoid the redundancy to reduce user confusion if possible.

Screenshot 2021-05-21 at 10 14 36

@charlesfrye
Copy link
Contributor Author

I included both aliases to preserve backwards compatibility, but given that this feature is relatively fresh, I think it's right to prefer simplicity and we can just use latest. The filename (last.pt/best.pt) and the alias (latest, best, stripped) are separate, so we don't need to change those.

@glenn-jocher glenn-jocher merged commit 19100ba into ultralytics:master May 21, 2021
@glenn-jocher
Copy link
Member

@charlesfrye great, PR is merged! :)

Lechtr pushed a commit to Lechtr/yolov5 that referenced this pull request Jul 20, 2021
…ltralytics#3264)

* adds latest tag to match wandb defaults

* adds entity handling, 'last' tag

* fixes bug causing finished runs to resume

* removes redundant "last" tag for wandb artifact

(cherry picked from commit 19100ba)
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
…ltralytics#3264)

* adds latest tag to match wandb defaults

* adds entity handling, 'last' tag

* fixes bug causing finished runs to resume

* removes redundant "last" tag for wandb artifact
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants