Bugfix: WandbCallback uploads initial model checkpoint #30897
Conversation
Hi @mgerstgrasser, thanks for opening a PR to address this!
Integrations such as WANDB aren't actively maintained by the transformers team, but rather by their original contributors / organization. With that in mind, this seems like a reasonable change, but let's double-check with @parambharat why this was originally added.
Regarding FSDP, deferring to @muellerzr @pacman100 on expected behaviour.
```diff
@@ -812,7 +812,6 @@ def setup(self, args, state, model, **kwargs):
                     "initial_model": True,
                 },
             )
-            model.save_pretrained(temp_dir)
```
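For context, a hedged sketch (not the verbatim transformers source) of what this `setup` block does around the removed line: it creates a W&B artifact tagged `initial_model: True`, and, before this PR, also serialized the full weights into the temp directory so they were uploaded along with it.

```python
# Hedged sketch of the artifact logging around the removed line; `run` and
# `model` stand in for the callback's actual attributes.
import tempfile
from pathlib import Path

import wandb

def log_initial_model_artifact(run, model):
    artifact = wandb.Artifact(
        name=f"model-{run.id}",
        type="model",
        metadata={"initial_model": True},
    )
    with tempfile.TemporaryDirectory() as temp_dir:
        # Before this PR, the full checkpoint was written here:
        # model.save_pretrained(temp_dir)
        for f in Path(temp_dir).glob("*"):
            artifact.add_file(str(f))  # with save_pretrained removed, no weights get attached
        run.log_artifact(artifact)
```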
Could you update the comment above?
@amyeroberts Just to check, is this directed at me or at @parambharat?
You :)
Oh. You mean this one?

```python
# log the initial model and architecture to an artifact
```

Now that I read this again, I'm not completely sure anymore - maybe uploading the model checkpoint is intentional after all? If so, it should still respect the `WANDB_LOG_MODEL` setting, but maybe an extra `if` clause would be a better solution than removing this altogether.
I'm really not sure what the intent is, though - to me, uploading later (trained) LoRA adapters makes much more sense than uploading the initial model, since the model you start from is usually already stored somewhere permanent anyway...
For now I've updated the comment.
@parambharat - could you weigh in on what the intent is here?
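For concreteness, the kind of `if` guard being floated might look roughly like this (a hypothetical sketch, not the final patch; `"end"` and `"checkpoint"` are the values `WANDB_LOG_MODEL` ordinarily takes):

```python
import os

def initial_model_logging_enabled() -> bool:
    # Hypothetical guard: only upload the initial checkpoint when the user has
    # opted into model logging via WANDB_LOG_MODEL ("end" or "checkpoint").
    return os.getenv("WANDB_LOG_MODEL", "false").lower() in ("end", "checkpoint")
```

The `model.save_pretrained(temp_dir)` call would then run only when this returns `True`.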
I think the logging is intentional, but I agree that it might not be a good idea
@amyeroberts As per the discussion in the linked issue, I've added a check to keep this but make it optional. I'm still unsure what the intended (or best) behavior here is; for now I've made it so that the initial model is logged only if `WANDB_LOG_MODEL` is set.
Calling in @muellerzr for his opinion, as he has far more experience with `Trainer` and what might be the canonical way to control this
Okay, if we are looking for a proper long-term solution, then two more thoughts: should we be uploading full model checkpoints under FSDP at all, given their size? And for PEFT training, should we upload just the adapter, or the base model too?
Also, I see that in later …
Thank you @mgerstgrasser
Yes. IMO we should not be uploading models by default, especially with FSDP: generally if you're doing FSDP you're doing your own checkpointing locally, because otherwise we can be talking about transferring and uploading hundreds of gigs of weights, which will slow things down a ton. With FSDP you end up in one of two scenarios depending on whether the user has a sharded state dict or not; PyTorch is generally leading to the former nowadays, so each shard has its own saved file. IMO: we don't upload FSDP models, and we can either add that later, or only do so if a user specifically requests it.
Just the adapter. And again, if we have a large model we're doing PEFT on, we're in a similar situation uploading the base weights. Personally, my take is that I'm not sure this should be a thing in general unless the user has to explicitly enable it.
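For reference, FSDP runs are configured through `TrainingArguments`, which gives a callback a signal it could check before deciding to upload weights (a hedged sketch; `"full_shard auto_wrap"` is a standard `fsdp` option string, but the skip helper itself is hypothetical):

```python
from transformers import TrainingArguments

# `args.fsdp` is non-empty whenever FSDP is enabled, so a callback could key
# off it to avoid uploading sharded, potentially enormous checkpoints.
args = TrainingArguments(output_dir="out", fsdp="full_shard auto_wrap")

def skip_upload_for_fsdp(args: TrainingArguments) -> bool:
    # Hypothetical helper: never upload the initial model under FSDP.
    return bool(args.fsdp)

print(skip_upload_for_fsdp(args))  # True for this FSDP configuration
```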
@muellerzr @mgerstgrasser OK, shall we just completely remove uploading the model then? And then we can add this back in if it gets requested as a feature?
Ah, no, I meant that apparently calling `model.save_pretrained()` messes things up for later `model.forward()` calls under FSDP.
But if we don't do this for FSDP anyway, then the above doesn't matter. :)
Fine with me, of course. Alternatively I could add an additional configuration option specifically for uploading the initial model, but I agree that it's unlikely to be something that many people would use, given the tradeoffs.
Yes, I agree - let's remove it, then bring it back if folks ask :)
OK, I've reverted to the version that just removes the initial model upload altogether. @amyeroberts
Thanks for fixing and iterating with us to find a solution @mgerstgrasser!
@mgerstgrasser For the quality checks - could you rebase to include the recent pushes from main? This should resolve these.
Thanks so much!!
* fix wandb always uploading initial model
* Update comment.
* Optionally log initial model
* Revert "Optionally log initial model" (reverts commit 9602cc1)
What does this PR do?
It seems that #30135 inadvertently makes WandbCallback always upload the initial model checkpoint to wandb, without any way to disable this, and irrespective of model size. I think this is clearly not intended behaviour.
This also breaks training with FSDP, as apparently calling `model.save_pretrained()` messes things up for later `model.forward()` calls.

Fixes #30896, #30895
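A hypothetical repro sketch of the reported behaviour (the model and arguments are arbitrary; the point is that nothing here opts into model uploads):

```python
# Hypothetical repro: with report_to=["wandb"] and WANDB_LOG_MODEL unset, the
# WandbCallback setup hook still uploaded the full initial checkpoint as a
# W&B artifact before this fix.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
args = TrainingArguments(output_dir="out", report_to=["wandb"])
trainer = Trainer(model=model, args=args)
# Calling trainer.train() would trigger the WandbCallback setup, and with it
# the unconditional model.save_pretrained() / artifact upload described above.
```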
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@pacman100 @muellerzr @amyeroberts