Skip to content
This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

[REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals #750

Merged
merged 6 commits into from
Jun 11, 2019

Conversation

leezu
Copy link
Contributor

@leezu leezu commented Jun 5, 2019

Prior to this commit, the TokenEmbedding constructor could only construct an
empty TokenEmbedding. However, an empty TokenEmbedding is of little use, thus
there exist a variety of places that modify and overwrite TokenEmbedding
internals after construction to "fill" the idx_to_token, idx_to_vec.
Examples are _load_embedding_text or _load_embedding_serialized.

This commits

  1. makes these methods static and changes them to return the
    idx_to_token and idx_to_vec,
  2. extends the TokenEmbedding constructor to allow constructing a "non-empty"
    TokenEmbedding given newly added idx_to_token and idx_to_vec arguments

This change is backwards compatible in that it does not change any public API
besides introducing idx_to_token and idx_to_vec arguments to TokenEmbedding. For
the future, the "empty" TokenEmbedding initialization may be removed as it
provides little benefit. For now it is kept for backwards compatibility.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Add optional idx_to_token and idx_to_vec arguments to TokenEmbedding

Comments

This extends work of #732. Only the last commit is related to this PR.

@leezu leezu requested a review from szha as a code owner June 5, 2019 13:14
@codecov
Copy link

codecov bot commented Jun 5, 2019

Codecov Report

❗ No coverage uploaded for pull request head (refactortokenembinternals@90da77d). Click here to learn what that means.
The diff coverage is n/a.

@codecov
Copy link

codecov bot commented Jun 5, 2019

Codecov Report

Merging #750 into master will decrease coverage by 9.29%.
The diff coverage is 90.97%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #750     +/-   ##
=========================================
- Coverage   90.61%   81.31%   -9.3%     
=========================================
  Files          64       64             
  Lines        6064     6113     +49     
=========================================
- Hits         5495     4971    -524     
- Misses        569     1142    +573
Impacted Files Coverage Δ
src/gluonnlp/vocab/vocab.py 97.29% <100%> (-0.06%) ⬇️
src/gluonnlp/embedding/token_embedding.py 88.43% <90.9%> (-2.05%) ⬇️
src/gluonnlp/model/train/cache.py 26.19% <0%> (-71.43%) ⬇️
src/gluonnlp/model/train/language_model.py 42.04% <0%> (-55.12%) ⬇️
src/gluonnlp/embedding/evaluation.py 41.8% <0%> (-54.1%) ⬇️
src/gluonnlp/data/batchify/language_model.py 44.03% <0%> (-52.3%) ⬇️
src/gluonnlp/model/translation.py 20.63% <0%> (-50.8%) ⬇️
src/gluonnlp/model/language_model.py 50.38% <0%> (-49.62%) ⬇️
src/gluonnlp/model/bert.py 70.28% <0%> (-28.99%) ⬇️
src/gluonnlp/data/translation.py 73.64% <0%> (-26.36%) ⬇️
... and 13 more

@leezu leezu mentioned this pull request Jun 5, 2019
6 tasks
@leezu leezu force-pushed the refactortokenembinternals branch from 90da77d to 814de50 Compare June 5, 2019 13:31
@mli
Copy link
Member

mli commented Jun 5, 2019

Job PR-750/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-750/2/index.html

@mli
Copy link
Member

mli commented Jun 6, 2019

Job PR-750/3 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-750/3/index.html

leezu added 2 commits June 9, 2019 11:25
…ternals

Prior to this commit, the TokenEmbedding constructor could only construct an
empty TokenEmbedding. However, an empty TokenEmbedding is of little use, thus
there exist a variety of places that modify and overwrite TokenEmbedding
internals after construction to "fill" the idx_to_token, idx_to_vec. Examples
are _load_embedding_text or _load_embedding_serialized within the TokenEmbedding
class, but also set_embedding in the Vocab class.

This commits
1) makes these methods static and changes them to return the
   idx_to_token and idx_to_vec,
2) extends the TokenEmbedding constructor to allow constructing a "non-empty"
   TokenEmbedding given newly added idx_to_token and idx_to_vec arguments

This change is backwards compatible in that it does not change any public API
besides introducing idx_to_token and idx_to_vec arguments to TokenEmbedding. For
the future, the "empty" TokenEmbedding initialization may be removed as it
provides little benefit. For now it is kept for backwards compatibility.
@leezu leezu force-pushed the refactortokenembinternals branch from 1976a60 to a5a684c Compare June 9, 2019 13:07
@leezu leezu force-pushed the refactortokenembinternals branch from a5a684c to 25d0ee8 Compare June 10, 2019 13:18
@mli
Copy link
Member

mli commented Jun 10, 2019

Job PR-750/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-750/5/index.html

@leezu leezu force-pushed the refactortokenembinternals branch from 1a67b04 to c0243be Compare June 10, 2019 21:44
@szha szha added the release focus Progress focus for release label Jun 10, 2019
@leezu leezu force-pushed the refactortokenembinternals branch from c0243be to 7fdd0bb Compare June 10, 2019 23:18
@mli
Copy link
Member

mli commented Jun 11, 2019

Job PR-750/9 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-750/9/index.html

@leezu leezu merged commit fba25a3 into dmlc:master Jun 11, 2019
@leezu leezu deleted the refactortokenembinternals branch June 11, 2019 00:30
leezu added a commit to leezu/gluon-nlp that referenced this pull request Aug 29, 2019
szha pushed a commit that referenced this pull request Sep 11, 2019
* Fix evaluate_pretrain.py

* Correctly specify unknown_lookup

* Fix test

* Refactor based on #750

* Fix lint
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
release focus Progress focus for release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants