Bugfix: WandbCallback uploads initial model checkpoint #30897

mgerstgrasser · 2024-05-19T03:49:37Z

What does this PR do?

It seems that #30135 inadvertently makes WandbCallback always upload the initial model checkpoint to wandb, without any way to disabling this, and irrespective of model size. I think this is obviously not intended behaviour.

This also breaks training with FSDP, as apparently calling model.save_pretrained() messes things up for later model.forward() calls.

Fixes #30896 #30895

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@pacman100 @muellerzr @amyeroberts

mgerstgrasser · 2024-05-19T04:09:33Z

@pacman100 @muellerzr @amyeroberts
I think this might be high priority, actually. It seems that right now, just starting a training script with wandb logging will immediately upload a full model checkpoint to wandb. After just a few quick tests I'm at over 100GB in my wandb account, and getting a warning message.

amyeroberts

Hi @mgerstgrasser, thanks for opening a PR to address this!

Integrations such as WANDB aren't actively maintained by the transformers team, and instead by their original contributors / organization. With that in mind, this seems a reasonable change, but let's double check with @parambharat why this was originally added.

Regarding FDSP, deferring to @muellerzr @pacman100 on expected behaviour

amyeroberts · 2024-05-20T11:50:42Z

src/transformers/integrations/integration_utils.py

@@ -812,7 +812,6 @@ def setup(self, args, state, model, **kwargs):
                        "initial_model": True,
                    },
                )
-                model.save_pretrained(temp_dir)


Could you update the comment above?

@amyeroberts Just to check, is this directed at me or at @parambharat ?

Oh. You mean this one?
# log the initial model and architecture to an artifact
Now that I read this again, I'm not completely sure anymore - maybe uploading the model checkpoint is intentional after all? If so, it should still respect the WANDB_LOG_MODEL setting, but maybe an extra if clause would be the better solution than removing this altogether in that case.
I'm really not sure though what the intent is - to me, uploading later (trained) LORA adapters makes much more sense than uploading the initial model, since the model you start from is usually stored in a permanent way elsewhere already anyway...
For now I've updated the comment.
@parambharat - could you weigh in on what the intent is here?

I think the loading is intentional, but I agree that it might not be a good idea

mgerstgrasser · 2024-05-22T16:29:14Z

@amyeroberts As per discussion in the linked issue, I've added a check to keep this but make it optional. I still am unsure what the intended (or best) behavior here is. For now I've made it so that the initial model is logged if WANDB_LOG_MODEL is set to either checkpoint or end. I'm unsure about the end case though - uploading the initial model will double the bandwidth and storage impact compared to not doing so in this case, and I can see how that may be undesirable. That said, enabling the upload in the end case is closer to the original PR, so that seems like the "safer" option for now. (In the checkpoint case, uploading seems pretty safe - you'd already be doing so many times, so one more is a small impact, relatively speaking.)

amyeroberts · 2024-05-22T16:36:20Z

@amyeroberts As per discussion in the linked issue, I've added a check to keep this but make it optional. I still am unsure what the intended (or best) behavior here is. For now I've made it so that the initial model is logged if WANDB_LOG_MODEL is set to either checkpoint or end. I'm unsure about the end case though - uploading the initial model will double the bandwidth and storage impact compared to not doing so in this case, and I can see how that may be undesirable. That said, enabling the upload in the end case is closer to the original PR, so that seems like the "safer" option for now. (In the checkpoint case, uploading seems pretty safe - you'd already be doing so many times, so one more is a small impact, relatively speaking.)

Caling in @muellerzr for his opinion, as he has far more experience with trainer and what might be the canonical way to control this

mgerstgrasser · 2024-05-22T17:12:46Z

Caling in @muellerzr for his opinion, as he has far more experience with trainer and what might be the canonical way to control this

Okay, if we are looking for a proper long-term solution, then two more thoughts:

The current fix will still break FSDP if model upload is enabled. Maybe this won't affect too many people, since perhaps models large enough to require FSDP are also too large to be uploaded anyway, but still. Maybe someone more familiar with the internals like @muellerz can see what the issue is. I suspect it's either that (a) there's a different way to save an FSDP model than just calling model.save_pretrained(), or (b) it's something like "you need to call model.forward() before calling model.save_pretrained() so that FSDP can do all its setup correctly". If it's (b), perhaps that fix should happen elsewhere in the trainer, rather than here, or even in accelerate?
I'm not sure how all this works with PEFT/LORA models. Does save_pretrained() save the entire model, or just the adapter? If the latter, I could see how uploading the base model at the start of training once would make sense potentially, if only the adapters are uploaded later on.

Also, I see that in later on_save() and on_train_end() calls, the callback creates a fake Trainer object and then uses save_model() from there, instead of calling model.save_pretrained(). Does that make a difference for either of the above?

SilSever · 2024-05-23T15:55:08Z

Thank You @mgerstgrasser
I had your same issue and your fix worked for me!!

muellerzr · 2024-05-23T16:11:23Z

The current fix will still break FSDP if model upload is enabled. Maybe this won't affect too many people, since perhaps models large enough to require FSDP are also too large to be uploaded anyway, but still. Maybe someone more familiar with the internals like @muellerzr can see what the issue is. I suspect it's either that (a) there's a different way to save an FSDP model than just calling model.save_pretrained(), or (b) it's something like "you need to call model.forward() before calling model.save_pretrained() so that FSDP can do all its setup correctly". If it's (b), perhaps that fix should happen elsewhere in the trainer, rather than here, or even in accelerate?

Yes. IMO we should not be uploading models by default, and esp with FSDP as generally if you're doing that we're doing our own checkpointing locally because otherwise we can be talking about transferring and uploading hundreds of gigs of weights, which will slow down things a ton.

With FSDP you end up in two scenarios depending on if the user has a sharded state dict or not, generally PyTorch is leading to the former nowadays so each shard has its own saved file.

IMO: we don't upload FSDP models, and we can either add that later, or only do so if a user specifically requests it.

I'm not sure how all this works with PEFT/LORA models. Does save_pretrained() save the entire model, or just the adapter? If the latter, I could see how uploading the base model at the start of training once would make sense potentially, if only the adapters are uploaded later on.

Just the adapter. And again, if we have a large model we're doing PEFT on, we're at a similar situation uploading the base weights.

Personally I'm not sure this should be a thing in general is my take unless we should have to explicitly enable it.

amyeroberts · 2024-05-23T16:26:30Z

@muellerzr @mgerstgrasser OK, shall we just completely remove uploading the model then? And then we can add this back in if it gets requested as a feature?

mgerstgrasser · 2024-05-23T16:30:23Z

With FSDP you end up in two scenarios depending on if the user has a sharded state dict or not, generally PyTorch is leading to the former nowadays so each shard has its own saved file.

Ah, no, I meant that apparently calling model.save() here breaks FSDP training altogether, i.e. leads to a crash. See #38095

IMO: we don't upload FSDP models, and we can either add that later, or only do so if a user specifically requests it.

But if we don't do this for FSDP anyway, then the above doesn't matter. :)

OK, shall we just completely remove uploading the model then? And then we can add this back in if it gets requested as a feature?

Fine with me, of course. Alternatively I could add an additional configuration option specifically for uploading the initial model, but I agree that it's unlikely to be something that many people would use, given the tradeoffs.

muellerzr · 2024-05-23T17:57:02Z

Yes, I agree let's remove it then bring it back if folks ask :)

mgerstgrasser · 2024-05-23T18:20:33Z

OK, I've reverted back to the version that just removes initial model upload altogether. @amyeroberts

amyeroberts

Thanks for fixing and iterating with us to find a solution @mgerstgrasser!

amyeroberts · 2024-05-23T18:41:11Z

@mgerstgrasser For the quality checks - could you rebase to include the recent pushes from main? This should resolve these

This reverts commit 9602cc1.

muellerzr

Thanks so much!!

HuggingFaceDocBuilderDev · 2024-05-23T19:15:41Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

* fix wandb always uploading initial model * Update comment. * Optionally log initial model * Revert "Optionally log initial model" This reverts commit 9602cc1.

…0897) * fix wandb always uploading initial model * Update comment. * Optionally log initial model * Revert "Optionally log initial model" This reverts commit 9602cc1.

mgerstgrasser mentioned this pull request May 19, 2024

feat: Upgrade Weights & Biases callback #30135

Merged

1 task

amyeroberts reviewed May 20, 2024

View reviewed changes

amyeroberts mentioned this pull request May 22, 2024

WandbCallback always (!) uploads entire model checkpoint to wandb #30896

Closed

4 tasks

amyeroberts approved these changes May 23, 2024

View reviewed changes

mgerstgrasser added 4 commits May 23, 2024 11:53

fix wandb always uploading initial model

d425921

Update comment.

f258a1a

Optionally log initial model

9f1b55b

Revert "Optionally log initial model"

66818c5

This reverts commit 9602cc1.

mgerstgrasser force-pushed the fix_wandb_always_uploading_initial_model branch from c41f120 to 66818c5 Compare May 23, 2024 18:53

muellerzr approved these changes May 23, 2024

View reviewed changes

amyeroberts merged commit 6657fb5 into huggingface:main May 23, 2024
21 checks passed

mgerstgrasser mentioned this pull request Jun 18, 2024

Logging to wandb breaks FSDP in 4.41.0 #30895

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bugfix: WandbCallback uploads initial model checkpoint #30897

Bugfix: WandbCallback uploads initial model checkpoint #30897

mgerstgrasser commented May 19, 2024

mgerstgrasser commented May 19, 2024

amyeroberts left a comment

amyeroberts May 20, 2024

mgerstgrasser May 20, 2024

amyeroberts May 21, 2024

mgerstgrasser May 21, 2024 •

edited

Loading

amyeroberts May 21, 2024

mgerstgrasser commented May 22, 2024

amyeroberts commented May 22, 2024

mgerstgrasser commented May 22, 2024 •

edited

Loading

SilSever commented May 23, 2024 •

edited

Loading

muellerzr commented May 23, 2024 •

edited

Loading

amyeroberts commented May 23, 2024

mgerstgrasser commented May 23, 2024

muellerzr commented May 23, 2024

mgerstgrasser commented May 23, 2024

amyeroberts left a comment

amyeroberts commented May 23, 2024

muellerzr left a comment

HuggingFaceDocBuilderDev commented May 23, 2024

Bugfix: WandbCallback uploads initial model checkpoint #30897

Bugfix: WandbCallback uploads initial model checkpoint #30897

Conversation

mgerstgrasser commented May 19, 2024

What does this PR do?

Before submitting

Who can review?

mgerstgrasser commented May 19, 2024

amyeroberts left a comment

Choose a reason for hiding this comment

amyeroberts May 20, 2024

Choose a reason for hiding this comment

mgerstgrasser May 20, 2024

Choose a reason for hiding this comment

amyeroberts May 21, 2024

Choose a reason for hiding this comment

mgerstgrasser May 21, 2024 • edited Loading

Choose a reason for hiding this comment

amyeroberts May 21, 2024

Choose a reason for hiding this comment

mgerstgrasser commented May 22, 2024

amyeroberts commented May 22, 2024

mgerstgrasser commented May 22, 2024 • edited Loading

SilSever commented May 23, 2024 • edited Loading

muellerzr commented May 23, 2024 • edited Loading

amyeroberts commented May 23, 2024

mgerstgrasser commented May 23, 2024

muellerzr commented May 23, 2024

mgerstgrasser commented May 23, 2024

amyeroberts left a comment

Choose a reason for hiding this comment

amyeroberts commented May 23, 2024

muellerzr left a comment

Choose a reason for hiding this comment

HuggingFaceDocBuilderDev commented May 23, 2024

mgerstgrasser May 21, 2024 •

edited

Loading

mgerstgrasser commented May 22, 2024 •

edited

Loading

SilSever commented May 23, 2024 •

edited

Loading

muellerzr commented May 23, 2024 •

edited

Loading