
fix: fix qadam NAN problem #654

Merged 4 commits into master on Feb 4, 2023

Conversation

@wangraying (Member) commented on Dec 5, 2022:

fix #485

@wangraying changed the title from "fix: fix qadam nan problem" to "fix: fix qadam NAN problem" on Dec 5, 2022
@wangraying (Member Author) commented:

This PR is ready; please take a look.

@ganshaoduo (Contributor) commented:

I think maybe it's better to keep the 'weight_decay' and set it to 0 by default, for consistency with the original Adam optimizer.
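
For illustration, a minimal sketch of what keeping the argument could look like; the class name and defaults below mirror `torch.optim.Adam` and are hypothetical, not QAdam's actual constructor:

```python
import torch

class QAdamLikeOptimizer(torch.optim.Optimizer):
    # Illustrative only: weight_decay defaults to 0, matching
    # torch.optim.Adam, so existing users see no behavior change.
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, warmup_steps=100):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, warmup_steps=warmup_steps)
        super().__init__(params, defaults)
```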

@wangraying (Member Author) commented on Jan 3, 2023:

> I think maybe it's better to keep the 'weight_decay' and set it to 0 by default, for consistency with the original Adam optimizer.

Thanks @ganshaoduo. I'm wondering how we should use weight_decay when calculating the momentum; the relevant code is here. The original paper does not discuss the case with weight_decay, so I am not sure whether the method still converges in theory. In fact, I ran some simple tests, and the performance seems worse (though I did not tune the parameters much).
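
To make the question concrete, here is a hedged sketch of how classic (coupled) Adam folds weight_decay into the gradient before the momentum and variance updates; all names here are illustrative, not the linked Bagua code:

```python
import torch

def adam_step_with_coupled_decay(param, grad, m, v, step,
                                 lr=1e-3, beta1=0.9, beta2=0.999,
                                 eps=1e-8, weight_decay=0.0):
    # Coupled (L2-style) decay: the decay term enters the gradient first,
    # so it is also accumulated into the momentum m and variance v.
    if weight_decay != 0:
        grad = grad + weight_decay * param
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = m / (1 - beta1 ** step)  # bias correction, step >= 1
    v_hat = v / (1 - beta2 ** step)
    param.sub_(lr * m_hat / (v_hat.sqrt() + eps))
    return param, m, v
```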

@ganshaoduo (Contributor) commented:

Well spotted! Actually, in the current implementation we apply weight_decay only in the warm-up phase, not in the compression phase: the change to the gradient only takes effect when step_id < self.warmup_steps. However, when we calculate the momentum here, weight_decay is ignored.

As for the necessity of weight_decay, I would say it has no direct impact on convergence. All it does is add a small term to the gradient, which should not hurt the theoretical analysis of QAdam in my view.

So the question is whether we add the weight_decay operation here, so that the same weight decay is applied in both the warm-up and compression phases, OR we delete the weight_decay operation entirely, so that weight decay is not considered at all. I would prefer the former; see the sketch below.
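
A minimal sketch of the former option, assuming placeholder names (step_id, warmup_steps, m, v) rather than the actual Bagua implementation:

```python
import torch

def qadam_like_update(param, grad, m, v, step_id, warmup_steps,
                      beta1=0.9, beta2=0.999, weight_decay=0.0):
    # Hypothetical helper: apply weight_decay unconditionally so the
    # warm-up and compression phases see the same effective gradient.
    if weight_decay != 0:
        grad = grad + weight_decay * param
    if step_id < warmup_steps:
        # Warm-up phase: full Adam statistics are maintained.
        m.mul_(beta1).add_(grad, alpha=1 - beta1)
        v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    else:
        # Compression phase: only the momentum is updated (and would be
        # quantized before communication); it now carries the decay too.
        m.mul_(beta1).add_(grad, alpha=1 - beta1)
    return m, v
```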

@wangraying (Member Author) commented:

Thanks for your detailed explanation, Shaoduo. Yes, I added weight_decay in the first way you mentioned, but performance drops on the synthetic benchmark. Since this PR is about fixing the NaN problem in QAdam, how about we remove weight_decay here for now and open an issue to track it? We could add it back once it is fully validated.

@wangraying (Member Author) commented on Jan 18, 2023:

The current weight_decay implementation was added by me during the last refactoring; our version #0 did not have it, and we have not used it in our benchmarks or published results. Alternatively, keeping the current implementation unchanged might also be a good option.

Any opinions from other reviewers?

@wangraying (Member Author) commented:

After discussing offline, we decided to roll back the weight_decay modifications.

@woqidaideshi merged commit 12562b0 into master on Feb 4, 2023
Development

Successfully merging this pull request may close these issues:

check qadam NAN problem (#485)