Codecov Report

@@            Coverage Diff             @@
##           master     #733      +/-   ##
==========================================
+ Coverage   90.56%   90.58%   +0.02%
==========================================
  Files          65       66       +1
  Lines        6071     6118      +47
==========================================
+ Hits         5498     5542      +44
- Misses        573      576       +3
src/gluonnlp/optimizer/lamb.py (outdated)

var_hat = var / (1. - power(self.beta2, t))

r1 = weight.norm()
g = mean_hat / sqrt(var_hat + self.epsilon) + wd * weight
Put epsilon outside the sqrt.
I compared this against the formula in the original paper, and there epsilon is inside the sqrt.
They changed the algorithm in their newest arXiv version.
Thanks for the reminder. The paper version I referenced was not in sync with the latest one; I will update the relevant code.
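To make the difference under discussion concrete, here is a small sketch in plain NumPy (not the gluonnlp code; variable names mirror the snippet above, and the random values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
weight = rng.normal(size=4)    # current parameters
mean_hat = rng.normal(size=4)  # bias-corrected first moment
var_hat = rng.uniform(size=4)  # bias-corrected second moment
wd, epsilon = 0.01, 1e-6

# Variant referenced initially: epsilon inside the sqrt
g_inside = mean_hat / np.sqrt(var_hat + epsilon) + wd * weight

# Newest arXiv variant: epsilon outside the sqrt
g_outside = mean_hat / (np.sqrt(var_hat) + epsilon) + wd * weight
```

For well-behaved var_hat the two are numerically close; they differ mainly when var_hat is near zero, where the placement of epsilon controls how the division is stabilized.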
src/gluonnlp/optimizer/lamb.py (outdated)

# calculate lamb_trust_ratio
r = 1. if r1 == 0. or r2 == 0. else minimum(
    maximum(r1 / r2, self.lower_bound), self.upper_bound)
The clip function should be applied only to r1, and g should be normalized by r2.
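A sketch of the two readings, using hypothetical helper names (trust_ratio_old, trust_ratio_new are not in the PR; this is only to illustrate the reviewer's point):

```python
def trust_ratio_old(r1, r2, lower, upper):
    # Outdated snippet above: clip the whole ratio r1 / r2.
    if r1 == 0.0 or r2 == 0.0:
        return 1.0
    return min(max(r1 / r2, lower), upper)

def trust_ratio_new(r1, r2, lower, upper):
    # Suggested reading: clip r1 (the weight norm) alone,
    # then normalize by r2 (the norm of g).
    if r1 == 0.0 or r2 == 0.0:
        return 1.0
    return min(max(r1, lower), upper) / r2
```

The two agree in many regimes but diverge when the bounds bind, e.g. trust_ratio_old(5.0, 0.5, 0.0, 3.0) clips the ratio 10 down to 3, while trust_ratio_new clips r1 to 3 and returns 6.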
src/gluonnlp/optimizer/lamb.py (outdated)

# execution bias correction
mean_hat = mean / (1. - power(self.beta1, t))
var_hat = var / (1. - power(self.beta2, t))
It seems that bias correction is not performed in the algorithm. This needs a double check.
What if we keep both versions of LAMB and test which one is better ourselves? There are significant differences between the two, especially the presence of bias correction. Just add a flag so that users can choose which one to use.
The two versions are now controlled by the use_latest parameter. I have tested both versions on a small model and there is no obvious difference. I think they should be tested on larger models and data.
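A minimal sketch of what such a use_latest switch could look like for the moment estimates (a hypothetical helper, not the actual PR code; parameter names follow the snippets in this thread):

```python
def lamb_moments(mean, var, beta1, beta2, t, use_latest):
    """Return the moment estimates used in the LAMB update.

    use_latest=True follows the newest arXiv version (no bias
    correction); use_latest=False applies Adam-style bias correction.
    """
    if use_latest:
        return mean, var
    mean_hat = mean / (1.0 - beta1 ** t)
    var_hat = var / (1.0 - beta2 ** t)
    return mean_hat, var_hat
```

At t=1 the corrected estimates are much larger than the raw moments (e.g. mean / 0.1 for beta1=0.9), which is one reason the two variants can behave differently early in training.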
Minor issue in the comments.
LGTM
* add lamb optimizer to gluonnlp
* add lamb optimizer to gluonnlp
* Add a simple test for LAMB to verify if it will converge
* add the latest version of the calculation for LAMB
* update doc of lamb
* add optimizer to the docs
* rename and remove arguments
* Correction of typos
* fix lint
* fix doc lint
* update doc
Description
This PR adds the LAMB optimizer, proposed in "Reducing BERT Pre-Training Time from 3 Days to 76 Minutes".
A simple neural network was used for verification, and the results show that the model converges normally with a large batch size. Further verification on BERT is still needed.
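For reference, a single LAMB update step along the lines discussed in the review threads can be sketched as follows (an illustrative NumPy-only version with assumed default hyperparameters, not the gluonnlp implementation):

```python
import numpy as np

def lamb_step(weight, grad, mean, var, lr=0.01, beta1=0.9, beta2=0.999,
              epsilon=1e-6, wd=0.01, lower_bound=1e-3, upper_bound=10.0):
    # Update exponential moving averages of the gradient moments.
    mean = beta1 * mean + (1.0 - beta1) * grad
    var = beta2 * var + (1.0 - beta2) * grad * grad
    # Adam-style direction plus decoupled weight decay
    # (epsilon outside the sqrt, per the newest arXiv version).
    g = mean / (np.sqrt(var) + epsilon) + wd * weight
    # Layer-wise trust ratio, clipped to [lower_bound, upper_bound].
    r1 = np.linalg.norm(weight)
    r2 = np.linalg.norm(g)
    r = 1.0 if r1 == 0.0 or r2 == 0.0 else min(max(r1 / r2, lower_bound),
                                               upper_bound)
    weight = weight - lr * r * g
    return weight, mean, var
```

On a toy quadratic objective (grad = weight), repeatedly applying this step drives the weight norm down, which matches the convergence behavior reported above for the small-model test.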
@eric-haibin-lin Verification may require your help because of some computing resource limitations.
Checklist
Essentials
Changes
Comments