[REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals #750

leezu · 2019-06-05T13:14:01Z

Prior to this commit, the TokenEmbedding constructor could only construct an
empty TokenEmbedding. However, an empty TokenEmbedding is of little use, so
a variety of places modify and overwrite TokenEmbedding internals after
construction to "fill" idx_to_token and idx_to_vec.
Examples are _load_embedding_text and _load_embedding_serialized.

This commit

1) makes these methods static and changes them to return the
   idx_to_token and idx_to_vec,
2) extends the TokenEmbedding constructor to allow constructing a "non-empty"
   TokenEmbedding given the newly added idx_to_token and idx_to_vec arguments.

This change is backwards compatible in that it does not change any public API
besides introducing idx_to_token and idx_to_vec arguments to TokenEmbedding. For
the future, the "empty" TokenEmbedding initialization may be removed as it
provides little benefit. For now it is kept for backwards compatibility.
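
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the gluonnlp implementation: the class name `SketchEmbedding`, the `from_lines` helper, the `elem_delim` parameter, and the use of plain Python lists instead of arrays are all assumptions made for the example. Only the names `_load_embedding_text`, `idx_to_token`, and `idx_to_vec` come from the description.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


class SketchEmbedding:
    """Hypothetical stand-in for TokenEmbedding (not the gluonnlp class)."""

    def __init__(self,
                 idx_to_token: Optional[Sequence[str]] = None,
                 idx_to_vec: Optional[Sequence[Sequence[float]]] = None):
        # Assumption for this sketch: both arguments are given together or not at all.
        if (idx_to_token is None) != (idx_to_vec is None):
            raise ValueError("idx_to_token and idx_to_vec must be given together")

        if idx_to_token is None:
            # "Empty" construction, kept only for backwards compatibility.
            self.idx_to_token = []
            self.idx_to_vec = None
        else:
            if len(idx_to_token) != len(idx_to_vec):
                raise ValueError("idx_to_token and idx_to_vec differ in length")
            # "Non-empty" construction: internals are set in exactly one place.
            self.idx_to_token = list(idx_to_token)
            self.idx_to_vec = [list(vec) for vec in idx_to_vec]

        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}

    @staticmethod
    def _load_embedding_text(lines: Iterable[str],
                             elem_delim: str = " ") -> Tuple[List[str], List[List[float]]]:
        """Parse 'token v1 v2 ...' lines and return the data instead of mutating self."""
        idx_to_token: List[str] = []
        idx_to_vec: List[List[float]] = []
        for line in lines:
            parts = line.rstrip().split(elem_delim)
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            idx_to_token.append(parts[0])
            idx_to_vec.append([float(x) for x in parts[1:]])
        return idx_to_token, idx_to_vec

    @classmethod
    def from_lines(cls, lines: Iterable[str]) -> "SketchEmbedding":
        # The loader returns the data; the constructor is the only initializer.
        idx_to_token, idx_to_vec = cls._load_embedding_text(lines)
        return cls(idx_to_token=idx_to_token, idx_to_vec=idx_to_vec)


# Usage: build a filled embedding without touching internals after construction.
emb = SketchEmbedding.from_lines(["hello 0.1 0.2", "world 0.3 0.4"])
```

Because the loader is static and returns its results, no code outside `__init__` needs to write to the instance's internals, which is the reduction in initialization sites that this PR aims for.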

Checklist

Essentials

PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented

Changes

Add optional idx_to_token and idx_to_vec arguments to TokenEmbedding

Comments

This extends work of #732. Only the last commit is related to this PR.

codecov · 2019-06-05T13:14:04Z

Codecov Report

❗ No coverage uploaded for pull request head (refactortokenembinternals@90da77d).
The diff coverage is n/a.

codecov · 2019-06-05T13:14:04Z

Codecov Report

Merging #750 into master will decrease coverage by 9.29%.
The diff coverage is 90.97%.

@@            Coverage Diff            @@
##           master     #750     +/-   ##
=========================================
- Coverage   90.61%   81.31%   -9.3%     
=========================================
  Files          64       64             
  Lines        6064     6113     +49     
=========================================
- Hits         5495     4971    -524     
- Misses        569     1142    +573

Impacted Files	Coverage Δ
src/gluonnlp/vocab/vocab.py	`97.29% <100%> (-0.06%)`	⬇️
src/gluonnlp/embedding/token_embedding.py	`88.43% <90.9%> (-2.05%)`	⬇️
src/gluonnlp/model/train/cache.py	`26.19% <0%> (-71.43%)`	⬇️
src/gluonnlp/model/train/language_model.py	`42.04% <0%> (-55.12%)`	⬇️
src/gluonnlp/embedding/evaluation.py	`41.8% <0%> (-54.1%)`	⬇️
src/gluonnlp/data/batchify/language_model.py	`44.03% <0%> (-52.3%)`	⬇️
src/gluonnlp/model/translation.py	`20.63% <0%> (-50.8%)`	⬇️
src/gluonnlp/model/language_model.py	`50.38% <0%> (-49.62%)`	⬇️
src/gluonnlp/model/bert.py	`70.28% <0%> (-28.99%)`	⬇️
src/gluonnlp/data/translation.py	`73.64% <0%> (-26.36%)`	⬇️
... and 13 more

mli · 2019-06-05T14:46:07Z

Job PR-750/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-750/2/index.html

mli · 2019-06-06T20:35:42Z

Job PR-750/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-750/3/index.html

src/gluonnlp/embedding/token_embedding.py

…ternals

Prior to this commit, the TokenEmbedding constructor could only construct an empty TokenEmbedding. However, an empty TokenEmbedding is of little use, so a variety of places modify and overwrite TokenEmbedding internals after construction to "fill" idx_to_token and idx_to_vec. Examples are _load_embedding_text and _load_embedding_serialized within the TokenEmbedding class, but also set_embedding in the Vocab class.

This commit

1) makes these methods static and changes them to return the idx_to_token and idx_to_vec,
2) extends the TokenEmbedding constructor to allow constructing a "non-empty" TokenEmbedding given the newly added idx_to_token and idx_to_vec arguments.

This change is backwards compatible in that it does not change any public API besides introducing idx_to_token and idx_to_vec arguments to TokenEmbedding. For the future, the "empty" TokenEmbedding initialization may be removed as it provides little benefit. For now it is kept for backwards compatibility.

mli · 2019-06-10T14:19:35Z

Job PR-750/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-750/5/index.html

mli · 2019-06-11T00:26:54Z

Job PR-750/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-750/9/index.html

leezu requested a review from szha as a code owner June 5, 2019 13:14

leezu mentioned this pull request Jun 5, 2019

[FEATURE] Flexible vocabulary #732

Merged

6 tasks

leezu force-pushed the refactortokenembinternals branch from 90da77d to 814de50 Compare June 5, 2019 13:31

leezu force-pushed the refactortokenembinternals branch from 814de50 to 1976a60 Compare June 6, 2019 18:51

leezu requested a review from eric-haibin-lin June 6, 2019 18:52

szha approved these changes Jun 8, 2019

eric-haibin-lin reviewed Jun 8, 2019

src/gluonnlp/embedding/token_embedding.py (resolved)

leezu added 2 commits June 9, 2019 11:25

Add idx_to_token and idx_to_vec docstrings

1fcb344

leezu force-pushed the refactortokenembinternals branch from 1976a60 to a5a684c Compare June 9, 2019 13:07

Add more tests

25d0ee8

leezu force-pushed the refactortokenembinternals branch from a5a684c to 25d0ee8 Compare June 10, 2019 13:18

Fix Py2 support

8ea19fa

leezu force-pushed the refactortokenembinternals branch from 1a67b04 to c0243be Compare June 10, 2019 21:44

eric-haibin-lin approved these changes Jun 10, 2019

szha added the "release focus" label Jun 10, 2019

leezu added 2 commits June 10, 2019 23:18

Update deprecated configs in docs/conf.py

3103f4f

Make linkcheck optional

7fdd0bb

leezu force-pushed the refactortokenembinternals branch from c0243be to 7fdd0bb Compare June 10, 2019 23:18

leezu merged commit fba25a3 into dmlc:master Jun 11, 2019

leezu deleted the refactortokenembinternals branch June 11, 2019 00:30

leezu mentioned this pull request Jun 11, 2019

[BUGFIX] Fix TokenEmbedding serialization with emb[emb.unknown_token] != 0 #763

Merged

4 tasks

leezu mentioned this pull request Jun 23, 2019

vocabulary set_embedding(glove) maps all OOV terms to the same vector if no lookup provided #680

Open

This was referenced Aug 28, 2019

KeyError: '<pad>' if run /scripts/word_embeddings/evaluate_pretrained.py with flag analog-max-vocab-size #905

Closed

Fix #905 #906

Merged

leezu added a commit to leezu/gluon-nlp that referenced this pull request Aug 29, 2019

Refactor based on dmlc#750

e30a4da

szha pushed a commit that referenced this pull request Sep 11, 2019

Fix #905 (#906)

b810d10

* Fix evaluate_pretrain.py
* Correctly specify unknown_lookup
* Fix test
* Refactor based on #750
* Fix lint
