This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[FEATURE] Add LAMB optimizer #733

Merged
11 commits merged into dmlc:master from the LAMB branch on Jun 4, 2019

Conversation

@vanewu (Contributor) commented May 27, 2019

Description

The LAMB optimizer was proposed in Reducing BERT Pre-Training Time from 3 Days to 76 Minutes.

A simple neural network has been used for verification, and the results show that the model converges normally with a large batch size. It still needs to be verified further on BERT.

@eric-haibin-lin Verification may require your help because of some computing resource limitations.
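
As context for the review comments below, here is a minimal NumPy sketch of one LAMB step following the formulas quoted in this thread. The function name lamb_step and the default hyperparameter values are illustrative only, not the PR's actual API.

import numpy as np

def lamb_step(weight, grad, mean, var, t, lr=1e-3, beta1=0.9, beta2=0.999,
              epsilon=1e-6, wd=0.01, lower_bound=1e-3, upper_bound=10.0):
    """One illustrative LAMB update; returns the new weight and updated moments."""
    # Adam-style first and second moment estimates.
    mean = beta1 * mean + (1.0 - beta1) * grad
    var = beta2 * var + (1.0 - beta2) * grad * grad
    # Bias correction (whether the latest paper version keeps this step
    # is discussed in the review below).
    mean_hat = mean / (1.0 - beta1 ** t)
    var_hat = var / (1.0 - beta2 ** t)
    # Update direction with weight decay; the placement of epsilon
    # (inside vs. outside the sqrt) is also discussed below.
    g = mean_hat / np.sqrt(var_hat + epsilon) + wd * weight
    # Layer-wise trust ratio from the weight norm and the update norm.
    r1 = np.linalg.norm(weight)
    r2 = np.linalg.norm(g)
    r = 1.0 if r1 == 0.0 or r2 == 0.0 else np.clip(r1 / r2, lower_bound, upper_bound)
    return weight - lr * r * g, mean, var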

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@vanewu vanewu requested a review from szha as a code owner May 27, 2019 11:23

codecov bot commented May 27, 2019

Codecov Report

❗ No coverage uploaded for pull request head (LAMB@104486d).
The diff coverage is n/a.

codecov bot commented May 27, 2019

Codecov Report

Merging #733 into master will increase coverage by 0.02%.
The diff coverage is 90.9%.


@@            Coverage Diff             @@
##           master     #733      +/-   ##
==========================================
+ Coverage   90.56%   90.58%   +0.02%     
==========================================
  Files          65       66       +1     
  Lines        6071     6118      +47     
==========================================
+ Hits         5498     5542      +44     
- Misses        573      576       +3
Impacted Files Coverage Δ
src/gluonnlp/optimizer/__init__.py 100% <100%> (ø) ⬆️
src/gluonnlp/optimizer/lamb.py 90.47% <90.47%> (ø)
src/gluonnlp/data/corpora/google_billion_word.py 66.66% <0%> (-8.34%) ⬇️
src/gluonnlp/data/utils.py 76.25% <0%> (-2.1%) ⬇️
src/gluonnlp/model/__init__.py 96% <0%> (-0.16%) ⬇️
src/gluonnlp/vocab/vocab.py 97.94% <0%> (ø) ⬆️
src/gluonnlp/model/utils.py 76.72% <0%> (ø) ⬆️
src/gluonnlp/data/dataloader.py 88.79% <0%> (+5.17%) ⬆️
...p/data/corpora/large_text_compression_benchmark.py 89.28% <0%> (+8.92%) ⬆️

@mli (Member) commented May 27, 2019

Job PR-733/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-733/1/index.html

@mli (Member) commented May 27, 2019

Job PR-733/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-733/2/index.html

@mli (Member) commented May 27, 2019

Job PR-733/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-733/3/index.html

@szha szha requested a review from szhengac May 27, 2019 19:07

@mli (Member) commented May 27, 2019

Job PR-733/4 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-733/4/index.html

var_hat = var / (1. - power(self.beta2, t))

r1 = weight.norm()
g = mean_hat / sqrt(var_hat + self.epsilon) + wd * weight

Member
put epsilon outside the sqrt

Contributor Author
I compared against the formula in the original paper, and epsilon is inside the sqrt there.

Member
They changed the algorithm in their newest arXiv version.

Contributor Author
Thanks for the reminder. The version of the paper I referenced was not in sync with the latest one; I will update the relevant code.
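
For clarity, the two epsilon placements under discussion, as a hedged sketch (plain NumPy; the helper name update_direction and its signature are illustrative, not the merged code):

import numpy as np

def update_direction(mean_hat, var_hat, weight, epsilon, wd, use_newest_paper):
    # Newest arXiv version, as suggested in the review: epsilon outside the sqrt.
    if use_newest_paper:
        return mean_hat / (np.sqrt(var_hat) + epsilon) + wd * weight
    # Earlier version (as first implemented in this PR): epsilon inside the sqrt.
    return mean_hat / np.sqrt(var_hat + epsilon) + wd * weight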


# calculate lamb_trust_ratio
r = 1. if r1 == 0. or r2 == 0. else minimum(
maximum(r1 / r2, self.lower_bound), self.upper_bound)
Member
The clip function is applied only to r1, and g is normalized by r2.
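
One reading of this suggestion, as a rough sketch (NumPy; the helper name trust_ratio and its bound arguments are illustrative, not the merged code):

import numpy as np

def trust_ratio(weight, g, lower_bound, upper_bound):
    # Only the weight norm r1 is clipped; the update direction g is
    # normalized by its own norm r2.
    r1 = np.linalg.norm(weight)
    r2 = np.linalg.norm(g)
    if r1 == 0.0 or r2 == 0.0:
        return 1.0
    return np.clip(r1, lower_bound, upper_bound) / r2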


# perform bias correction
mean_hat = mean / (1. - power(self.beta1, t))
var_hat = var / (1. - power(self.beta2, t))

Member
It seems that bias correction is not performed in the algorithm; this needs to be double-checked.

What about keeping both versions of LAMB and testing which one is better ourselves? There are significant differences between the two, especially the presence of bias correction. Just add a flag so that users can choose which one to use.

Contributor Author
The two versions are now controlled by a parameter, use_latest. I have tested both versions on a small model and there is no obvious difference; I think this should be tested on larger models and data.
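
A minimal sketch of how such a flag could toggle bias correction (the use_latest name comes from the comment above; the helper corrected_moments and its signature are illustrative, not the merged implementation):

def corrected_moments(mean, var, t, beta1, beta2, use_latest):
    # use_latest=True follows the newest paper version, which the review
    # suggests drops the Adam-style bias correction; otherwise apply it
    # as in the snippet above.
    if use_latest:
        return mean, var
    return mean / (1.0 - beta1 ** t), var / (1.0 - beta2 ** t)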

@mli (Member) commented May 29, 2019

Job PR-733/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-733/5/index.html

@xcgoner left a comment

minor issue on the comments

Review threads on src/gluonnlp/optimizer/lamb.py (outdated, resolved)

@mli (Member) commented May 30, 2019

Job PR-733/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-733/6/index.html

@xcgoner left a comment

LGTM

Review threads on src/gluonnlp/optimizer/lamb.py (resolved)
@szha szha added the release focus label (Progress focus for release) May 31, 2019

@mli (Member) commented Jun 3, 2019

Job PR-733/10 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-733/10/index.html

@mli (Member) commented Jun 3, 2019

Job PR-733/11 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-733/11/index.html

@szhengac szhengac merged commit 65dbde9 into dmlc:master Jun 4, 2019
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
* add lamb optimizer to gluonnlp

* add lamb optimizer to gluonnlp

* Add a simple test for LAMB to verify if it will converge

* add the latest version of the calculation for LAMB

* update doc of lamb

* add optimizer to the docs

* rename and remove arguments

* Correction of typos

* fix lint

* fix doc lint

* update doc
@vanewu vanewu deleted the LAMB branch June 12, 2019 09:55
Labels: release focus (Progress focus for release)
Projects: None yet
6 participants