
Commit

Neural vocabulary selection. (#1046)
Co-authored-by: Tobias Domhan <domhant@amazon.com>
tdomhan and Tobias Domhan committed May 4, 2022
1 parent 63286ff commit 94cdad7
Showing 24 changed files with 930 additions and 163 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -11,6 +11,12 @@ Note that Sockeye has checks in place to not translate with an old model that wa

Each version section may have subsections for: _Added_, _Changed_, _Removed_, _Deprecated_, and _Fixed_.

## [3.1.14]

### Added
- Added an implementation of Neural Vocabulary Selection (NVS) to Sockeye, as presented in our NAACL 2022 paper "The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation" (Tobias Domhan, Eva Hasler, Ke Tran, Sony Trenous, Bill Byrne and Felix Hieber).
- To use NVS, simply specify `--neural-vocab-selection` to `sockeye-train`. This will train a model with NVS that is automatically used by `sockeye-translate`. If you want to look at translations without vocabulary selection, specify `--skip-nvs` as an argument to `sockeye-translate`.

## [3.1.13]

### Added
5 changes: 3 additions & 2 deletions README.md
@@ -84,17 +84,18 @@ For more information about Sockeye, see our papers ([BibTeX](sockeye.bib)).
## Research with Sockeye

Sockeye has been used for both academic and industrial research. A list of known publications that use Sockeye is shown below.
If you know more, please let us know or submit a pull request (last updated: April 2022).
If you know more, please let us know or submit a pull request (last updated: May 2022).

### 2022
* Weller-Di Marco, Marion, Matthias Huck, Alexander Fraser. "Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies". arXiv preprint arXiv:2203.13550 (2022)
* Tobias Domhan, Eva Hasler, Ke Tran, Sony Trenous, Bill Byrne and Felix Hieber. "The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation". Proceedings of NAACL-HLT (2022)

### 2021

* Bergmanis, Toms, Mārcis Pinnis. "Facilitating Terminology Translation with Target Lemma Annotations". arXiv preprint arXiv:2101.10035 (2021)
* Briakou, Eleftheria, Marine Carpuat. "Beyond Noise: Mitigating the Impact of Fine-grained Semantic Divergences on Neural Machine Translation". arXiv preprint arXiv:2105.15087 (2021)
* Hasler, Eva, Tobias Domhan, Jonay Trenous, Ke Tran, Bill Byrne, Felix Hieber. "Improving the Quality Trade-Off for Neural Machine Translation Multi-Domain Adaptation". Proceedings of EMNLP (2021)
* Hasler, Eva, Tobias Domhan, Sony Trenous, Ke Tran, Bill Byrne, Felix Hieber. "Improving the Quality Trade-Off for Neural Machine Translation Multi-Domain Adaptation". Proceedings of EMNLP (2021)
* Tang, Gongbo, Philipp Rönchen, Rico Sennrich, Joakim Nivre. "Revisiting Negation in Neural Machine Translation". Transactions of the Association for Computational Linguistics 9 (2021)
* Vu, Thuy, Alessandro Moschitti. "Machine Translation Customization via Automatic Training Data Selection from the Web". arXiv preprint arXiv:2102.1024 (2021)
* Xu, Weijia, Marine Carpuat. "EDITOR: An Edit-Based Transformer with Repositioning for Neural Machine Translation with Soft Lexical Constraints." Transactions of the Association for Computational Linguistics 9 (2021)
10 changes: 10 additions & 0 deletions docs/training.md
@@ -175,3 +175,13 @@ that can be enabled by setting `--length-task`, respectively, to `ratio` or to `
Specify `--length-task-layers` to set the number of layers in the prediction MLP.
The weight of the loss in the global training objective is controlled with `--length-task-weight` (standard cross-entropy loss has weight 1.0).
During inference the predictions can be used to reward longer translations by enabling `--brevity-penalty-type`.
## Neural Vocabulary Selection (NVS)
When Neural Vocabulary Selection (NVS) is enabled, a target bag-of-words model is trained alongside the translation model.
During decoding, the output vocabulary is reduced to the set of predicted target words, which speeds up decoding.
This is similar to using `--restrict-lexicon` with `sockeye-translate`, with the advantage that no external alignment model is required and that the contextualized hidden encoder representations are used to predict the set of target words.
To use NVS, simply specify `--neural-vocab-selection` to `sockeye-train`.
This will train a model with NVS that is automatically used by `sockeye-translate`.
If you want to look at translations without vocabulary selection, specify `--skip-nvs` as an argument to `sockeye-translate`.
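Conceptually, NVS is a bag-of-words classifier over the encoder hidden states whose thresholded probabilities define the reduced output vocabulary (the threshold is what `--nvs-thresh` exposes in `sockeye-translate`). The snippet below is a minimal sketch of that idea under assumed shapes and names, not Sockeye's actual implementation.

```python
# Minimal sketch of neural vocabulary selection (illustrative, not Sockeye's code).
import torch
import torch.nn as nn


class BagOfWordsSelector(nn.Module):
    """Predicts which target-vocabulary words may appear in the translation."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.output_layer = nn.Linear(hidden_size, vocab_size)

    def forward(self, encoder_states: torch.Tensor, source_mask: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, source_len, hidden); source_mask: (batch, source_len), True for real tokens.
        logits = self.output_layer(encoder_states)                      # (batch, source_len, vocab)
        logits = logits.masked_fill(~source_mask.unsqueeze(-1), float('-inf'))
        return logits.max(dim=1).values                                 # max-pool over source positions


# Toy usage: pick the output vocabulary for one sentence of 5 source tokens.
torch.manual_seed(0)
selector = BagOfWordsSelector(hidden_size=8, vocab_size=20)
states = torch.randn(1, 5, 8)
mask = torch.ones(1, 5, dtype=torch.bool)
probs = torch.sigmoid(selector(states, mask))
selected = torch.nonzero(probs[0] > 0.5).squeeze(-1)                   # 0.5 plays the role of --nvs-thresh
print(f"kept {selected.numel()} of 20 target words:", selected.tolist())
```

At translation time the decoder's output softmax would then only need to consider the selected words, which is where the speed-up comes from.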
2 changes: 1 addition & 1 deletion sockeye/__init__.py
@@ -11,4 +11,4 @@
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.

__version__ = '3.1.13'
__version__ = '3.1.14'
79 changes: 70 additions & 9 deletions sockeye/arguments.py
@@ -326,18 +326,23 @@ def add_rerank_args(params):
help="Returns the reranking scores as scores in output JSON objects.")


def add_lexicon_args(params):
def add_lexicon_args(params, is_for_block_lexicon: bool = False):
lexicon_params = params.add_argument_group("Model & Top-k")
lexicon_params.add_argument("--model", "-m", required=True,
help="Model directory containing source and target vocabularies.")
lexicon_params.add_argument("-k", type=int, default=200,
help="Number of target translations to keep per source. Default: %(default)s.")
if not is_for_block_lexicon:
lexicon_params.add_argument("-k", type=int, default=200,
help="Number of target translations to keep per source. Default: %(default)s.")


def add_lexicon_create_args(params):
def add_lexicon_create_args(params, is_for_block_lexicon: bool = False):
lexicon_params = params.add_argument_group("I/O")
if is_for_block_lexicon:
input_help = "A text file with tokens that shall be blocked. All token must be in the model vocabulary."
else:
input_help = "Probabilistic lexicon (fast_align format) to build top-k lexicon from."
lexicon_params.add_argument("--input", "-i", required=True,
help="Probabilistic lexicon (fast_align format) to build top-k lexicon from.")
help=input_help)
lexicon_params.add_argument("--output", "-o", required=True, help="File name to write top-k lexicon to.")


@@ -743,6 +748,21 @@ def add_model_parameters(params):
'PyTorch AMP with some additional risk and requires installing Apex: '
'https://github.com/NVIDIA/apex')

model_params.add_argument('--neural-vocab-selection',
type=str,
default=None,
choices=C.NVS_TYPES,
help='When enabled the model contains a neural vocabulary selection model that restricts '
'the target output vocabulary to speed up inference.'
' logit_max: predictions are made per source token and combined by max pooling.'
' eos: the prediction is based on the hidden representation of the <eos> token.')

model_params.add_argument('--neural-vocab-selection-block-loss',
action='store_true',
help='When enabled, gradients for NVS are blocked from propagating back to the encoder. '
'This means that NVS learns to work with the main model\'s representations but '
'does not influence its training.')
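
For illustration, here is a hedged sketch (not the code added in this commit) of what the two `--neural-vocab-selection` choices and `--neural-vocab-selection-block-loss` correspond to: `logit_max` scores every vocabulary word from each source position and max-pools over positions, `eos` scores only from the `<eos>` hidden state, and blocking the loss amounts to detaching the encoder states before the NVS output layer. All function and variable names below are assumptions.

```python
# Hedged sketch of the NVS prediction modes and gradient blocking (names are illustrative).
import torch


def nvs_scores(encoder_states: torch.Tensor,     # (batch, source_len, hidden)
               eos_positions: torch.Tensor,      # (batch,) index of <eos> per sentence
               output_layer: torch.nn.Linear,    # hidden -> target vocab size
               mode: str = 'logit_max',
               block_gradients: bool = False) -> torch.Tensor:
    """Return one score per target-vocabulary word for each batch item."""
    if block_gradients:
        # Analogue of --neural-vocab-selection-block-loss: the BOW loss still trains the
        # NVS output layer, but no gradients flow back into the encoder.
        encoder_states = encoder_states.detach()
    if mode == 'logit_max':
        # Per-source-token predictions combined by max pooling.
        return output_layer(encoder_states).max(dim=1).values
    if mode == 'eos':
        # Prediction based on the hidden representation of the <eos> token.
        batch_idx = torch.arange(encoder_states.size(0))
        return output_layer(encoder_states[batch_idx, eos_positions])
    raise ValueError(f"unknown NVS mode: {mode}")


layer = torch.nn.Linear(8, 20)
states = torch.randn(2, 6, 8)
eos_pos = torch.tensor([5, 3])
print(nvs_scores(states, eos_pos, layer, mode='logit_max').shape)   # torch.Size([2, 20])
print(nvs_scores(states, eos_pos, layer, mode='eos', block_gradients=True).shape)
```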


def add_batch_args(params, default_batch_size=4096, default_batch_type=C.BATCH_TYPE_WORD):
params.add_argument('--batch-size', '-b',
Expand Down Expand Up @@ -773,6 +793,25 @@ def add_batch_args(params, default_batch_size=4096, default_batch_type=C.BATCH_T
'size 10240). Default: %(default)s.')


def add_nvs_train_parameters(params):
params.add_argument(
'--bow-task-weight',
type=float_greater_or_equal(0.0),
default=1.0,
help=
'The weight of the auxiliary bag-of-words (BOW) loss when --neural-vocab-selection is enabled. Default %(default)s.'
)

params.add_argument(
'--bow-task-pos-weight',
type=float_greater_or_equal(0.0),
default=10,
help='The weight of the positive class (the set of words present on the target side) for the BOW loss '
'when --neural-vocab-selection is set, computed as x * num_negative_class / num_positive_class where x is '
'--bow-task-pos-weight. Higher values bias towards recall, resulting in larger vocabularies '
'at test time, trading off vocabulary size for higher translation quality. Default %(default)s.')
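
As a worked illustration of the formula in the help text above, the positive-class weight can be plugged into a weighted binary cross-entropy over the target vocabulary. This is a sketch under assumed shapes and names, not the loss implementation added by this commit.

```python
# Illustrative BOW loss with the --bow-task-pos-weight scaling (not Sockeye's implementation).
import torch


def bow_loss(bow_logits: torch.Tensor,          # (batch, vocab_size)
             bow_labels: torch.Tensor,          # (batch, vocab_size), 1.0 for words on the target side
             bow_task_pos_weight: float = 10.0,
             bow_task_weight: float = 1.0) -> torch.Tensor:
    num_positive = bow_labels.sum()
    num_negative = bow_labels.numel() - num_positive
    # pos_weight = x * num_negative_class / num_positive_class, with x = --bow-task-pos-weight.
    # Larger values bias towards recall, i.e. larger selected vocabularies at test time.
    pos_weight = bow_task_pos_weight * num_negative / num_positive.clamp(min=1.0)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        bow_logits, bow_labels, pos_weight=pos_weight)
    # --bow-task-weight scales this auxiliary loss inside the global training objective.
    return bow_task_weight * loss


labels = torch.zeros(2, 20)
labels[:, :3] = 1.0                              # 3 target-side words per sentence
print(bow_loss(torch.randn(2, 20), labels).item())
```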


def add_training_args(params):
train_params = params.add_argument_group("Training parameters")

@@ -803,6 +842,8 @@ def add_training_args(params):
default=1,
help='Number of fully-connected layers for predicting the length ratio. Default %(default)s.')

add_nvs_train_parameters(train_params)

train_params.add_argument('--target-factors-weight',
type=float,
nargs='+',
@@ -1203,18 +1244,38 @@ def add_inference_args(params):
nargs='+',
type=multiple_values(num_values=2, data_type=str),
default=None,
help="Specify top-k lexicon to restrict output vocabulary to the k most likely context-"
"free translations of the source words in each sentence (Devlin, 2017). See the "
"lexicon module for creating top-k lexicons. To use multiple lexicons, provide "
help="Specify block or top-k lexicon. A top-k lexicon will pose a positive constraint, "
"by providing the set of allowed target words. While a blocking lexicon poses a "
"negative constraint on providing a set of target words to be avoided. "
"Specifically, a top-k lexicon will restrict the output vocabulary to the k most "
"likely context-free translations of the source words in each sentence "
"(Devlin, 2017). See the lexicon module for creating lexicons, i.e. by running "
"sockeye-lexicon. To use multiple lexicons, provide "
"'--restrict-lexicon key1:path1 key2:path2 ...' and use JSON input to specify the "
"lexicon for each sentence: "
"{\"text\": \"some input string\", \"restrict_lexicon\": \"key\"}. "
"If a single lexicon is specified it will be applied to all inputs. "
"If multiple lexica are specified they can be selected via the JSON input or it "
"can be skipped by not providing a lexicon in the JSON input. "
"Default: %(default)s.")
decode_params.add_argument('--restrict-lexicon-topk',
type=int,
default=None,
help="Specify the number of translations to load for each source word from the lexicon "
"given with --restrict-lexicon. Default: Load all entries from the lexicon.")
"given with --restrict-lexicon top-k lexicon. "
"Default: Load all entries from the lexicon.")

decode_params.add_argument('--skip-nvs',
action='store_true',
help='Manually turn off Neural Vocabulary Selection (NVS) to do a softmax over the full target vocabulary.',
default=False)

decode_params.add_argument('--nvs-thresh',
type=float,
help='The probability threshold for a word to be added to the set of target words. '
'Default: 0.5.',
default=0.5)

decode_params.add_argument('--strip-unknown-words',
action='store_true',
default=False,
