Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat(wandb): Refactor to be more flexible #767

Merged
merged 7 commits into from
Dec 4, 2023

Conversation

NanoCode012
Copy link
Collaborator

Closes #236

Breaking change:

New feature:

  • Now allows any wandb_ config to be passed to env for more flexibility
  • More tests!

Copy link
Collaborator

@winglian winglian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this PR make a major difference since it still points to the same keyword arg here?

@NanoCode012
Copy link
Collaborator Author

NanoCode012 commented Oct 22, 2023

Yes!

  • it should allow other wandb env without us having to manually add.
  • we use run_id and instead of letting wandb automatically generate that. The Engineer said this wasn't good to do. They recommended using name.
  • fixed an unexpected issue with conflict of WANDB_DISABLED

Edit: I noticed one edge case from reading their docs on run_id. During resume, it might not continue in the same training and show as new/separate one.

@IgnacioFDM
Copy link

Is resume fixed with this? Currently it starts overriding old data (because step begins from 0 after resuming I think).

@NanoCode012
Copy link
Collaborator Author

@IgnacioFDM , yes! That should fix this. Previously, we set run_id which is their internal tracking method. Now, we use name. However, I'm not sure if this will create a new separate run or actually resume. If anyone has the bandwidth, feel free to do a quick test.

@IgnacioFDM
Copy link

Just tested it and each time you resume, it creates a new run (with the same name) on wandb.

@NanoCode012
Copy link
Collaborator Author

@tcapelle , thanks for your code snippet a while back and sorry for this late PR. Following your feedback to use wandb_name instead of wandb_run_id, may I ask if you have any idea how to perform a resume in such a case? According to the comment above, it creates a new run instead.

@tcapelle
Copy link
Contributor

tcapelle commented Oct 24, 2023

Ohh great! Yeah, for resuming, you need the id, so you would need to keep both attributed. Sorry for missing that.

@IgnacioFDM
Copy link

IgnacioFDM commented Oct 24, 2023

I tried setting the WANDB_RUN_ID and it works like it did before this PR: it continues the run, but still leads to wrong data (either new steps don't show up, or they show up but it starts deleting earlier steps).

I think the issue arises from the fact that step (not global_step) resets to 0 when you resume.

@NanoCode012
Copy link
Collaborator Author

Rebased! Updated it to support both wandb_name and wandb_run_id instead of forcing all to name following idea from #824

@NanoCode012 NanoCode012 merged commit a1da39c into axolotl-ai-cloud:main Dec 4, 2023
4 checks passed
@NanoCode012 NanoCode012 deleted the feat/wandb_refactor branch December 4, 2023 13:17
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
* Feat: Update to handle wandb env better

* chore: rename wandb_run_id to wandb_name

* feat: add new recommendation and update config

* fix: indent and pop disabled env if project passed

* feat: test env set for wandb and recommendation

* feat: update to use wandb_name and allow id

* chore: add info to readme
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve documentation and implementation of W&B options
4 participants