
Implemented sampled softmax for NextItemPredictionTask #671

Merged
merged 8 commits into from
Apr 17, 2023

Conversation

gabrielspmoreira
Member

@gabrielspmoreira gabrielspmoreira commented Apr 7, 2023

Goals ⚽

Implements sampled softmax for NextItemPredictionTask. It allows for faster training and evaluation.

Implementation Details 🚧

  • Refactored NextItemPredictionTask to have a standard output layer op (a dot product) whether weight_tying is enabled or not.
  • Added a sampled_softmax option to NextItemPredictionTask (a fuller usage sketch follows this list):
tr.NextItemPredictionTask(weight_tying=True, sampled_softmax=True, max_n_samples=1000)
  • Implemented a LogUniformSampler that is able to return sampling probabilities for both unique_sampling=True and False
  • Implemented a generic logQ correction for NextItemPredictionTask
  • Changed LabelSmoothCrossEntropyLoss to be just an alias of torch.nn.CrossEntropyLoss(label_smoothing=...), as PyTorch added label_smoothing support in a recent version. Added a DeprecationWarning to LabelSmoothCrossEntropyLoss
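A hedged usage sketch of the new option (not the test code from this PR; schema is assumed to be a Merlin schema loaded elsewhere, and the surrounding model-building calls follow the usual transformers4rec.torch API):

import transformers4rec.torch as tr

# Input block built from the dataset schema (schema loading omitted here)
inputs = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=20,
    d_output=320,
    masking="mlm",
)

# Next-item prediction head with weight tying and sampled softmax
prediction_task = tr.NextItemPredictionTask(
    weight_tying=True,     # reuse the item id embedding table as the output layer
    sampled_softmax=True,  # score the positive against a sampled set of negatives
    max_n_samples=1000,    # number of sampled negatives
)

# Any supported transformer config works; XLNet is just an example here
transformer_config = tr.XLNetConfig.build(
    d_model=320, n_head=8, n_layer=2, total_seq_length=20
)
model = transformer_config.to_torch_model(inputs, prediction_task)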

Testing Details 🔍

  • Created a test to check and demonstrate the usage of sampled softmax: test_with_next_item_pred_sampled_softmax

Benchmark 🔍

I have performed a benchmark of sampled softmax in different configurations (weight tying enabled and disabled, and different numbers of samples) to understand the impact of sampled softmax on training throughput and accuracy.

Setup

The experiments were performed using the T4Rec paper reproducibility script (changed to accept the new CLI args --sampled_softmax and --sampled_softmax_max_n_samples) and the preprocessed REES46 dataset.

The benchmark was done using the Merlin PyTorch 23.02 container, with a manual update of the core, dataloader and models folders to pull and install their latest versions from GitHub.

Command line
The script performs incremental training and evaluation: the first five days are used for training and evaluation is computed on each following day. Here is the base command line with the hparams used.
The hparams that are changed across experiments are --mf_constrained_embeddings (enables weight tying if provided, i.e., reusing the item id embedding table as the output layer), --sampled_softmax (enables sampled softmax if provided) and --sampled_softmax_max_n_samples (number of negative samples).

cd /transformers4rec/examples/t4rec_paper_experiments/t4r_paper_repro
CUDA_VISIBLE_DEVICES=0 python3 transf_exp_main.py --output_dir ./tmp/ --overwrite_output_dir --do_train --do_eval --validate_every 10 --logging_steps 20 --save_steps 0 --data_path $DATA_PATH --features_schema_path "../datasets_configs/ecom_rees46/rees46_schema.pbtxt" --fp16 --data_loader_engine merlin --start_time_window_index 1 --final_time_window_index 6 --time_window_folder_pad_digits 4 --model_type albert --loss_type cross_entropy --per_device_eval_batch_size 128 --similarity_type concat_mlp --tf_out_activation tanh --inp_merge mlp --learning_rate_warmup_steps 0 --learning_rate_schedule linear_with_warmup --hidden_act gelu --num_train_epochs 5 --dataloader_drop_last --compute_metrics_each_n_steps 1 --session_seq_length_max 20 --eval_on_last_item_seq_only  --layer_norm_featurewise --mlm --num_hidden_groups -1 --inner_group_num 1 --per_device_train_batch_size 512 --learning_rate 0.0004904752786458524 --dropout 0.0 --input_dropout 0.1 --weight_decay 9.565968888623912e-05 --d_model 320 --item_embedding_dim 320 --n_layer 2 --n_head 8  --stochastic_shared_embeddings_replacement_prob 0.06 --item_id_embeddings_init_std 0.11 --other_embeddings_init_std 0.025 --mlm_probability 0.6000000000000001 --eval_on_test_set --seed 100 --report_to none --label_smoothing 0.2 --mf_constrained_embeddings --sampled_softmax --sampled_softmax_max_n_samples 1000

Results

The results can be seen in the following table. Steps/sec represents the throughput, and Recall and NDCG are top-k accuracy metrics.

[Results table image: bechmark_sampled_softmax]

The gist is that sampled softmax can provide both better training throughput and a gain in accuracy.

Some notes from these results:

  • Throughput (steps/sec) always increases with sampled softmax, and increases further with a smaller number of samples, as expected.
  • The best accuracies were obtained with sampled softmax for both weight_tying=False and True, but the best overall accuracy was obtained with weight_tying=True, sampled softmax and logQ correction.
  • Specific rows in the results table report results without the logQ correction proposed for sampled softmax. Without it, sampled softmax noticeably underperforms in terms of accuracy, as it over-penalizes popular items, which are sampled more often as negatives (a sketch of the correction follows this list).
  • A side note is that weight tying typically provides better accuracy when enabled, as we previously reported in our RecSys competition papers and in the T4Rec paper.
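For context, the logQ correction subtracts the log of each candidate's sampling probability from its logit before the softmax, which removes the bias against popular items that the sampler draws more often as negatives. A minimal sketch of the idea (illustrative names and values, not the PR's implementation):

import torch

def logq_correction(logits, sampling_probs, eps=1e-9):
    # Subtract log(Q) so that frequently sampled (popular) candidates
    # are not over-penalized by the sampled softmax loss.
    return logits - torch.log(sampling_probs + eps)

# Batch of 2 examples: column 0 is the positive, columns 1..3 are sampled negatives
logits = torch.randn(2, 4)
sampling_probs = torch.tensor([0.001, 0.2, 0.05, 0.01]).expand(2, 4)
targets = torch.zeros(2, dtype=torch.long)  # positive is always at index 0
loss = torch.nn.functional.cross_entropy(logq_correction(logits, sampling_probs), targets)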

Disclaimer: These experiments were not hypertuned for every configuration. Furthermore, accuracy results might vary considerably across runs, in particular when a smaller number of samples (e.g. 1k) is used.

@gabrielspmoreira gabrielspmoreira self-assigned this Apr 7, 2023
@gabrielspmoreira gabrielspmoreira added the enhancement New feature or request label Apr 7, 2023
@gabrielspmoreira gabrielspmoreira added this to the Merlin 23.04 milestone Apr 7, 2023

Contributor

@sararb sararb left a comment

The PR looks good to me. I just left some remarks/questions to understand the code base.

else:
    logits = self.output_layer(inputs)
    logits = inputs @ output_weights
Contributor

We might need to keep the bias parameter self.output_layer_bias: logits = inputs @ output_weights + self.output_layer_bias?

Member Author

@gabrielspmoreira gabrielspmoreira Apr 14, 2023

I removed the bias because I think it would not be available if ANN is used later for serving. Does that make sense?
I can run some benchmarks later to see whether the bias helps to improve accuracy.

Contributor

You're right, that makes sense! Otherwise, we'll need to save the output_bias vector in addition to the pre-trained candidate embeddings.
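In other words, the output scores reduce to a plain dot product between the sequence representation and the (tied) item embeddings, which is exactly what an ANN index over those embeddings can reproduce at serving time. A rough sketch (illustrative shapes, not the PR's code):

import torch

hidden = torch.randn(8, 320)               # sequence representations (batch, d_model)
item_embeddings = torch.randn(50000, 320)  # tied item id embedding table (n_items, d_model)

# Bias-free output layer: scores are pure dot products, so a pre-built ANN index
# over item_embeddings yields the same ranking at serving time.
logits = hidden @ item_embeddings.T        # (batch, n_items)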


return predictions
logits = torch.cat([positive_scores, negative_scores], axis=1)
new_targets = torch.zeros(logits.shape[0], dtype=torch.int64)
Contributor

The first element of each row should be 1 instead of 0 to account for the positive target, right?

Member Author

The targets are the sparse id representation (class indices), not the one-hot representation. Does that make sense?

Contributor

Oh I see, so new_targets is a 1-D vector that contains the index of the positive item in the logits tensor (which always corresponds to index 0).
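This matches how torch.nn.CrossEntropyLoss consumes class indices rather than one-hot vectors; a tiny illustration with made-up scores:

import torch

positive_scores = torch.tensor([[2.0], [1.5]])             # (batch, 1)
negative_scores = torch.tensor([[0.3, -0.1], [0.2, 0.9]])  # (batch, n_samples)

logits = torch.cat([positive_scores, negative_scores], dim=1)
# Class-index targets: the positive item always sits in column 0
new_targets = torch.zeros(logits.shape[0], dtype=torch.int64)
loss = torch.nn.functional.cross_entropy(logits, new_targets)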

def forward(self, inputs):
    return self.module(inputs)
def forward(self, inputs, **kwargs):
    return self.module(inputs, **kwargs)
Contributor

Can you explain why we need the extra **kwargs here?

" [`sum`, `none`, `mean`]"
)
return loss
return torch.nn.CrossEntropyLoss(label_smoothing=smoothing, reduction=reduction, **kwargs)
Contributor

It's great to see that label_smoothing was added in the latest version of CrossEntropyLoss!
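For reference, label_smoothing has been a built-in argument of torch.nn.CrossEntropyLoss since PyTorch 1.10, so the alias can simply forward to it. A minimal example:

import torch

# Built-in label smoothing in recent PyTorch releases
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.2, reduction="mean")

logits = torch.randn(4, 10)           # (batch, n_classes)
targets = torch.randint(0, 10, (4,))  # class indices
loss = criterion(logits, targets)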

y = labels_all
x, y = self.pre(x, targets=y, training=training, testing=testing)  # type: ignore

loss = self.loss(x, y)
return {
    "loss": loss,
    "labels": labels_all,
Contributor

I understand that self.pre invokes the next-item task using the sampled softmax option, which returns logits x related to the list of [positive_item, sampled negatives]. So I wonder how these logits are connected to labels_all (which is a tensor of positive item ids) for metrics calculation.

    dist = self.unique_sampling_dist
else:
    dist = self.dist
dist = dist.to(device)
Contributor

Would it be possible to move the definition of dist to the class constructor, to avoid copying the tensor to the GPU/CPU device multiple times?

Member Author

The challenge is how to get the device in the constructor. Any ideas?

Contributor

@sararb sararb Apr 14, 2023

You can use register_buffer to register the variable dist. Then, the method model.to(device) will ensure that the buffer is copied to the right device. It is something like:

  • in the constructor, you set: self.register_buffer('dist', dist)
  • in the sampling method, you can just call the registered buffer self.dist (see the sketch below)
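A rough sketch of that pattern (a hypothetical module, not the PR's LogUniformSampler):

import torch

class FrequencySampler(torch.nn.Module):
    def __init__(self, dist: torch.Tensor):
        super().__init__()
        # Buffers follow model.to(device) and are saved in the state_dict,
        # unlike plain tensor attributes.
        self.register_buffer("dist", dist)

    def sample(self, n_samples: int) -> torch.Tensor:
        # self.dist is already on the right device here
        return torch.multinomial(self.dist, n_samples, replacement=True)

sampler = FrequencySampler(torch.ones(1000) / 1000)
sampler = sampler.to("cuda" if torch.cuda.is_available() else "cpu")
samples = sampler.sample(10)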

so we use `torch.multinomial(..., replacement=True).unique()` which doesn't guarantee
the same number of unique sampled items. You can try to increase
n_samples_multiplier_before_unique to increase the chances to have more
unique samples in that case.
Contributor

+1 !! Thank you for creating this class. It was very helpful for learning how to approximate item frequency distributions for both sampling with and without repetition!
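The pattern that docstring describes can be sketched roughly as follows (n_samples_multiplier_before_unique is the parameter quoted above; the helper itself is illustrative):

import torch

def sample_unique(dist: torch.Tensor, n_samples: int,
                  n_samples_multiplier_before_unique: int = 2) -> torch.Tensor:
    # Over-sample with replacement, then deduplicate. The number of unique
    # samples is not guaranteed; a larger multiplier increases the chances
    # of ending up with at least n_samples unique items.
    candidates = torch.multinomial(
        dist, n_samples * n_samples_multiplier_before_unique, replacement=True
    )
    return candidates.unique()[:n_samples]

dist = torch.tensor([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])
print(sample_unique(dist, n_samples=3))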

… adjusted min const value used to fix sampling accidental hits to work properly with fp16. Ensures targets are torch.long, otherwise losses raise an error. Turning metrics top_ks as lists rather than tensors
…stributions as a buffer, so that they are automatically assigned to the right device and also serialized correctly
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task] Add a softmax sampling
2 participants