
Commit

Neural vocabulary selection. (#1046)
Co-authored-by: Tobias Domhan <domhant@amazon.com>
tdomhan and Tobias Domhan committed May 4, 2022
1 parent 63286ff commit 94cdad7
Showing 24 changed files with 930 additions and 163 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,12 @@ Note that Sockeye has checks in place to not translate with an old model that wa

Each version section may have subsections for: _Added_, _Changed_, _Removed_, _Deprecated_, and _Fixed_.

## [3.1.14]

### Added
- Added an implementation of Neural Vocabulary Selection (NVS) to Sockeye, as presented in our NAACL 2022 paper "The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation" (Tobias Domhan, Eva Hasler, Ke Tran, Sony Trenous, Bill Byrne and Felix Hieber).
- To use NVS, simply specify `--neural-vocab-selection` to `sockeye-train`. This will train a model with NVS that is automatically used by `sockeye-translate`. If you want to look at translations without vocabulary selection, specify `--skip-nvs` as an argument to `sockeye-translate`.

## [3.1.13]

### Added
5 changes: 3 additions & 2 deletions README.md
@@ -84,17 +84,18 @@ For more information about Sockeye, see our papers ([BibTeX](sockeye.bib)).
## Research with Sockeye

Sockeye has been used for both academic and industrial research. A list of known publications that use Sockeye is shown below.
If you know more, please let us know or submit a pull request (last updated: April 2022).
If you know more, please let us know or submit a pull request (last updated: May 2022).

### 2022
* Weller-Di Marco, Marion, Matthias Huck, Alexander Fraser. "Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies". arXiv preprint arXiv:2203.13550 (2022)
* Tobias Domhan, Eva Hasler, Ke Tran, Sony Trenous, Bill Byrne and Felix Hieber. "The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation". Proceedings of NAACL-HLT (2022)

### 2021

* Bergmanis, Toms, Mārcis Pinnis. "Facilitating Terminology Translation with Target Lemma Annotations". arXiv preprint arXiv:2101.10035 (2021)
* Briakou, Eleftheria, Marine Carpuat. "Beyond Noise: Mitigating the Impact of Fine-grained Semantic Divergences on Neural Machine Translation". arXiv preprint arXiv:2105.15087 (2021)
* Hasler, Eva, Tobias Domhan, Jonay Trenous, Ke Tran, Bill Byrne, Felix Hieber. "Improving the Quality Trade-Off for Neural Machine Translation Multi-Domain Adaptation". Proceedings of EMNLP (2021)
* Hasler, Eva, Tobias Domhan, Sony Trenous, Ke Tran, Bill Byrne, Felix Hieber. "Improving the Quality Trade-Off for Neural Machine Translation Multi-Domain Adaptation". Proceedings of EMNLP (2021)
* Tang, Gongbo, Philipp Rönchen, Rico Sennrich, Joakim Nivre. "Revisiting Negation in Neural Machine Translation". Transactions of the Association for Computational Linguistics 9 (2021)
* Vu, Thuy, Alessandro Moschitti. "Machine Translation Customization via Automatic Training Data Selection from the Web". arXiv preprint arXiv:2102.1024 (2021)
* Xu, Weijia, Marine Carpuat. "EDITOR: An Edit-Based Transformer with Repositioning for Neural Machine Translation with Soft Lexical Constraints." Transactions of the Association for Computational Linguistics 9 (2021)
10 changes: 10 additions & 0 deletions docs/training.md
@@ -175,3 +175,13 @@ that can be enabled by setting `--length-task`, respectively, to `ratio` or to `
Specify `--length-task-layers` to set the number of layers in the prediction MLP.
The weight of the loss in the global training objective is controlled with `--length-task-weight` (standard cross-entropy loss has weight 1.0).
During inference the predictions can be used to reward longer translations by enabling `--brevity-penalty-type`.
## Neural Vocabulary Selection (NVS)
When Neural Vocabulary Selection (NVS) is enabled, a target bag-of-words model is trained alongside the translation model.
During decoding, the output vocabulary is reduced to the set of predicted target words, which speeds up decoding.
This is similar to using `--restrict-lexicon` with `sockeye-translate`, with the advantage that no external alignment model is required and that the contextualized hidden encoder representations are used to predict the set of target words.
To use NVS, simply specify `--neural-vocab-selection` to `sockeye-train`.
This will train a model with NVS that is automatically used by `sockeye-translate`.
If you want to look at translations without vocabulary selection, specify `--skip-nvs` as an argument to `sockeye-translate`.
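Conceptually, NVS is a bag-of-words classifier over the encoder hidden states whose thresholded probabilities define the reduced output vocabulary (the threshold is what `--nvs-thresh` exposes in `sockeye-translate`). The snippet below is a minimal sketch of that idea under assumed shapes and names, not Sockeye's actual implementation.

```python
# Minimal sketch of neural vocabulary selection (illustrative, not Sockeye's code).
import torch
import torch.nn as nn


class BagOfWordsSelector(nn.Module):
    """Predicts which target-vocabulary words may appear in the translation."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.output_layer = nn.Linear(hidden_size, vocab_size)

    def forward(self, encoder_states: torch.Tensor, source_mask: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, source_len, hidden); source_mask: (batch, source_len), True for real tokens.
        logits = self.output_layer(encoder_states)                      # (batch, source_len, vocab)
        logits = logits.masked_fill(~source_mask.unsqueeze(-1), float('-inf'))
        return logits.max(dim=1).values                                 # max-pool over source positions


# Toy usage: pick the output vocabulary for one sentence of 5 source tokens.
torch.manual_seed(0)
selector = BagOfWordsSelector(hidden_size=8, vocab_size=20)
states = torch.randn(1, 5, 8)
mask = torch.ones(1, 5, dtype=torch.bool)
probs = torch.sigmoid(selector(states, mask))
selected = torch.nonzero(probs[0] > 0.5).squeeze(-1)                   # 0.5 plays the role of --nvs-thresh
print(f"kept {selected.numel()} of 20 target words:", selected.tolist())
```

At translation time the decoder's output softmax would then only need to consider the selected words, which is where the speed-up comes from.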
2 changes: 1 addition & 1 deletion sockeye/__init__.py
@@ -11,4 +11,4 @@
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

__version__ = '3.1.13'
__version__ = '3.1.14'
79 changes: 70 additions & 9 deletions sockeye/arguments.py
@@ -326,18 +326,23 @@ def add_rerank_args(params):
help="Returns the reranking scores as scores in output JSON objects.")


def add_lexicon_args(params):
def add_lexicon_args(params, is_for_block_lexicon: bool = False):
lexicon_params = params.add_argument_group("Model & Top-k")
lexicon_params.add_argument("--model", "-m", required=True,
help="Model directory containing source and target vocabularies.")
lexicon_params.add_argument("-k", type=int, default=200,
help="Number of target translations to keep per source. Default: %(default)s.")
if not is_for_block_lexicon:
lexicon_params.add_argument("-k", type=int, default=200,
help="Number of target translations to keep per source. Default: %(default)s.")


def add_lexicon_create_args(params):
def add_lexicon_create_args(params, is_for_block_lexicon: bool = False):
lexicon_params = params.add_argument_group("I/O")
if is_for_block_lexicon:
input_help = "A text file with tokens that shall be blocked. All token must be in the model vocabulary."
else:
input_help = "Probabilistic lexicon (fast_align format) to build top-k lexicon from."
lexicon_params.add_argument("--input", "-i", required=True,
help="Probabilistic lexicon (fast_align format) to build top-k lexicon from.")
help=input_help)
lexicon_params.add_argument("--output", "-o", required=True, help="File name to write top-k lexicon to.")


@@ -743,6 +748,21 @@ def add_model_parameters(params):
'PyTorch AMP with some additional risk and requires installing Apex: '
'https://github.com/NVIDIA/apex')

model_params.add_argument('--neural-vocab-selection',
type=str,
default=None,
choices=C.NVS_TYPES,
help='When enabled the model contains a neural vocabulary selection model that restricts '
'the target output vocabulary to speed up inference.'
' logit_max: predictions are made per source token and combined by max pooling.'
' eos: the prediction is based on the hidden representation of the <eos> token.')

model_params.add_argument('--neural-vocab-selection-block-loss',
action='store_true',
help='When enabled, gradients for NVS are blocked from propagating back to the encoder. '
'This means that NVS learns to work with the main model\'s representations but '
'does not influence its training.')
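
For illustration, here is a hedged sketch (not the code added in this commit) of what the two `--neural-vocab-selection` choices and `--neural-vocab-selection-block-loss` correspond to: `logit_max` scores every vocabulary word from each source position and max-pools over positions, `eos` scores only from the `<eos>` hidden state, and blocking the loss amounts to detaching the encoder states before the NVS output layer. All function and variable names below are assumptions.

```python
# Hedged sketch of the NVS prediction modes and gradient blocking (names are illustrative).
import torch


def nvs_scores(encoder_states: torch.Tensor,     # (batch, source_len, hidden)
               eos_positions: torch.Tensor,      # (batch,) index of <eos> per sentence
               output_layer: torch.nn.Linear,    # hidden -> target vocab size
               mode: str = 'logit_max',
               block_gradients: bool = False) -> torch.Tensor:
    """Return one score per target-vocabulary word for each batch item."""
    if block_gradients:
        # Analogue of --neural-vocab-selection-block-loss: the BOW loss still trains the
        # NVS output layer, but no gradients flow back into the encoder.
        encoder_states = encoder_states.detach()
    if mode == 'logit_max':
        # Per-source-token predictions combined by max pooling.
        return output_layer(encoder_states).max(dim=1).values
    if mode == 'eos':
        # Prediction based on the hidden representation of the <eos> token.
        batch_idx = torch.arange(encoder_states.size(0))
        return output_layer(encoder_states[batch_idx, eos_positions])
    raise ValueError(f"unknown NVS mode: {mode}")


layer = torch.nn.Linear(8, 20)
states = torch.randn(2, 6, 8)
eos_pos = torch.tensor([5, 3])
print(nvs_scores(states, eos_pos, layer, mode='logit_max').shape)   # torch.Size([2, 20])
print(nvs_scores(states, eos_pos, layer, mode='eos', block_gradients=True).shape)
```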


def add_batch_args(params, default_batch_size=4096, default_batch_type=C.BATCH_TYPE_WORD):
params.add_argument('--batch-size', '-b',
Expand Down Expand Up @@ -773,6 +793,25 @@ def add_batch_args(params, default_batch_size=4096, default_batch_type=C.BATCH_T
'size 10240). Default: %(default)s.')


def add_nvs_train_parameters(params):
params.add_argument(
'--bow-task-weight',
type=float_greater_or_equal(0.0),
default=1.0,
help=
'The weight of the auxiliary bag-of-words (BOW) loss when --neural-vocab-selection is enabled. Default %(default)s.'
)

params.add_argument(
'--bow-task-pos-weight',
type=float_greater_or_equal(0.0),
default=10,
help='The weight of the positive class (the set of words present on the target side) for the BOW loss '
'when --neural-vocab-selection is set, computed as x * num_negative_class / num_positive_class where x is '
'--bow-task-pos-weight. Higher values bias towards recall, resulting in larger vocabularies '
'at test time, trading off vocabulary size for higher translation quality. Default %(default)s.')
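
As a worked illustration of the formula in the help text above, the positive-class weight can be plugged into a weighted binary cross-entropy over the target vocabulary. This is a sketch under assumed shapes and names, not the loss implementation added by this commit.

```python
# Illustrative BOW loss with the --bow-task-pos-weight scaling (not Sockeye's implementation).
import torch


def bow_loss(bow_logits: torch.Tensor,          # (batch, vocab_size)
             bow_labels: torch.Tensor,          # (batch, vocab_size), 1.0 for words on the target side
             bow_task_pos_weight: float = 10.0,
             bow_task_weight: float = 1.0) -> torch.Tensor:
    num_positive = bow_labels.sum()
    num_negative = bow_labels.numel() - num_positive
    # pos_weight = x * num_negative_class / num_positive_class, with x = --bow-task-pos-weight.
    # Larger values bias towards recall, i.e. larger selected vocabularies at test time.
    pos_weight = bow_task_pos_weight * num_negative / num_positive.clamp(min=1.0)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        bow_logits, bow_labels, pos_weight=pos_weight)
    # --bow-task-weight scales this auxiliary loss inside the global training objective.
    return bow_task_weight * loss


labels = torch.zeros(2, 20)
labels[:, :3] = 1.0                              # 3 target-side words per sentence
print(bow_loss(torch.randn(2, 20), labels).item())
```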


def add_training_args(params):
train_params = params.add_argument_group("Training parameters")

@@ -803,6 +842,8 @@ def add_training_args(params):
default=1,
help='Number of fully-connected layers for predicting the length ratio. Default %(default)s.')

add_nvs_train_parameters(train_params)

train_params.add_argument('--target-factors-weight',
type=float,
nargs='+',
@@ -1203,18 +1244,38 @@ def add_inference_args(params):
nargs='+',
type=multiple_values(num_values=2, data_type=str),
default=None,
help="Specify top-k lexicon to restrict output vocabulary to the k most likely context-"
"free translations of the source words in each sentence (Devlin, 2017). See the "
"lexicon module for creating top-k lexicons. To use multiple lexicons, provide "
help="Specify block or top-k lexicon. A top-k lexicon will pose a positive constraint, "
"by providing the set of allowed target words. While a blocking lexicon poses a "
"negative constraint on providing a set of target words to be avoided. "
"Specifically, a top-k lexicon will restrict the output vocabulary to the k most "
"likely context-free translations of the source words in each sentence "
"(Devlin, 2017). See the lexicon module for creating lexicons, i.e. by running "
"sockeye-lexicon. To use multiple lexicons, provide "
"'--restrict-lexicon key1:path1 key2:path2 ...' and use JSON input to specify the "
"lexicon for each sentence: "
"{\"text\": \"some input string\", \"restrict_lexicon\": \"key\"}. "
"If a single lexicon is specified it will be applied to all inputs. "
"If multiple lexica are specified they can be selected via the JSON input or it "
"can be skipped by not providing a lexicon in the JSON input. "
"Default: %(default)s.")
decode_params.add_argument('--restrict-lexicon-topk',
type=int,
default=None,
help="Specify the number of translations to load for each source word from the lexicon "
"given with --restrict-lexicon. Default: Load all entries from the lexicon.")
"given with --restrict-lexicon top-k lexicon. "
"Default: Load all entries from the lexicon.")

decode_params.add_argument('--skip-nvs',
action='store_true',
help='Manually turn off Neural Vocabulary Selection (NVS) to do a softmax over the full target vocabulary.',
default=False)

decode_params.add_argument('--nvs-thresh',
type=float,
help='The probability threshold for a word to be added to the set of target words. '
'Default: 0.5.',
default=0.5)

decode_params.add_argument('--strip-unknown-words',
action='store_true',
default=False,
