Improvements in Quick-start for Ranking #1014

Merged 4 commits on Jun 15, 2023
20 changes: 11 additions & 9 deletions examples/quick_start/ranking.md
@@ -82,13 +82,14 @@ In this example, we set some options for preprocessing. Here is the explanation

For larger datasets (like the full TenRec dataset), in particular when using filtering options that require dask_cudf filtering (e.g. `--filter_query`, `--min_item_freq`), we recommend using the following options to avoid out-of-memory errors:
- `--enable_dask_cuda_cluster` - Initializes a dask-cudf `LocalCUDACluster` for managed single- or multi-GPU preprocessing (see the sketch below this list)
- `--persist_intermediate_files` - Persists/caches intermediate files to disk during preprocessing (in particular after filtering).
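For reference, this is roughly what a managed dask-cuda setup looks like; it is a sketch of what `--enable_dask_cuda_cluster` arranges, and the exact configuration used by `preprocessing.py` may differ:

```python
# sketch: a single/multi-GPU LocalCUDACluster for dask_cudf work; the memory limit
# below is an assumed example value, not a setting taken from preprocessing.py
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(device_memory_limit="24GB")  # spill to host memory past this
client = Client(cluster)
# ... run dask_cudf / NVTabular preprocessing here ...
client.close()
cluster.close()
```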
*Note*: If you want to preprocess the full TenRec dataset, set `--data_path /data/QK-video.csv` in the following command.


```bash
cd /Merlin/examples/quick_start/scripts/preproc/
cd /Merlin/examples/
OUT_DATASET_PATH=/outputs/dataset
python preprocessing.py --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
python -m quick_start.scripts.preproc.preprocessing --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video-10M.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
```

After you execute this script, a folder `dataset` will be created in `--output_path` with the preprocessed data, split into `train` and `eval` folders. You will find a number of partitioned Parquet files inside those folders, as well as a `schema.pbtxt` file produced by `NVTabular`, which is very important for automated model building in the next step.
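To sanity-check the output, you can load it back with the Merlin dataset API; a small sketch, assuming the `--output_path` used above:

```python
# sketch: inspecting the preprocessed dataset and the schema captured by NVTabular
from merlin.io import Dataset

train = Dataset("/outputs/dataset/train", engine="parquet")
print(train.schema)           # tags and dtypes stored in schema.pbtxt
print(train.to_ddf().head())  # a few preprocessed rows
```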
@@ -98,14 +99,15 @@ Merlin Models is a Merlin library that makes it easy to build and train RecSys m

A number of popular ranking models are available in Merlin Models API like **DLRM**, **DCN-v2**, **Wide&Deep**, **DeepFM**. This Quick-start provides a generic ranking script [ranking.py](scripts/ranking/ranking.py) for building and training those models using Models API.
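For orientation, this is roughly what a DLRM definition looks like with the Models API; it is a sketch, not the exact code inside `ranking.py`:

```python
# sketch: building and fitting a DLRM ranking model with Merlin Models (TensorFlow);
# layer sizes mirror the example flags below, other details are simplified
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("/outputs/dataset/train", engine="parquet")

model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,                 # DLRM needs one shared embedding dim
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([64, 32]),  # analogous to --mlp_layers 64,32
    prediction_tasks=mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam")
model.fit(train, batch_size=16384, epochs=1)
```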

In the following command example, you can easily train the popular **DLRM** model, which performs second-order feature interactions. It sets `--model dlrm` and `--embeddings_dim 64` because DLRM models require all categorical columns to be embedded with the same dimension for the feature interaction. Notice that we can set many of the common model hyperparameters (e.g. top `--mlp_layers`) and training hyperparameters like the learning rate (`--lr`) and its decay (`--lr_decay_rate`, `--lr_decay_steps`), L2 regularization (`--l2_reg`, `--embeddings_l2_reg`) and `--dropout`, among others. We set `--epochs 1` and `--train_steps_per_epoch 10` to train on just 10 batches and make the runtime faster. If you have a GPU with more memory (e.g. a V100 with 32 GB), you might increase `--train_batch_size` and `--eval_batch_size` to a much larger batch size, for example `65536`.
In the following command example, you can easily train the popular **DLRM** model, which performs second-order feature interactions. It sets `--model dlrm` and `--embeddings_dim 64` because DLRM models require all categorical columns to be embedded with the same dimension for the feature interaction. Notice that we can set many of the common model hyperparameters (e.g. top `--mlp_layers`) and training hyperparameters like the learning rate (`--lr`) and its decay (`--lr_decay_rate`, `--lr_decay_steps`), L2 regularization (`--l2_reg`, `--embeddings_l2_reg`) and `--dropout`, among others. We set `--train_batch_size` and `--eval_batch_size` to `65536` for faster training and set `--epochs 2`. If you have preprocessed the full TenRec dataset, you can set `--train_steps_per_epoch 100` to limit the number of training steps.
There are many target columns available in the dataset, and you can select one of them for training by setting `--tasks=click`. In this dataset, there are about 3.7 negative examples (`click=0`) for each positive example (`click=1`), which leads to some class imbalance. We can deal with that by setting `--stl_positive_class_weight` to give more weight to the loss of positive examples, which are rarer.
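Conceptually, the positive class weight scales the loss contribution of positive examples, as in this minimal sketch (illustrative, not the script's exact implementation):

```python
# sketch: the idea behind --stl_positive_class_weight, i.e. up-weight the loss
# of the rarer positive class in a binary cross-entropy
import tensorflow as tf

y_true   = tf.constant([[1.0], [0.0], [0.0], [0.0]])  # roughly 1 positive per ~3.7 negatives
y_logits = tf.constant([[0.2], [-1.0], [0.3], [-0.5]])

loss = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(labels=y_true, logits=y_logits,
                                             pos_weight=3.0))  # --stl_positive_class_weight 3
print(loss)
```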

Note:

```bash
cd /Merlin/examples/quick_start/scripts/ranking/
OUT_DATASET_PATH=/outputs/dataset
CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python ranking.py --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 3 --model dlrm --embeddings_dim 64 --l2_reg 1e-4 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32 --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 1 --train_steps_per_epoch 10
cd /Merlin/examples/
CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 3 --model dlrm --embeddings_dim 64 --l2_reg 1e-4 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32 --lr 1e-3 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 2
```
You can explore the [full documentation and best practices for ranking models](scripts/ranking/README.md), which contains details about the command line arguments.

@@ -121,9 +123,9 @@ In the following example, we use the popular **MMOE** (`--model mmoe`) architect
You can also balance the loss weights by setting the `--mtl_loss_weight_*` arguments and the tasks' positive class weights by setting `--mtl_pos_class_weight_*`.
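Conceptually, these arguments combine into a single training loss as a weighted sum of per-task losses, roughly as in the sketch below; task names are from this dataset, and the actual implementation in `ranking.py` may differ:

```python
# sketch: how per-task loss weights (--mtl_loss_weight_*) and per-task positive class
# weights (--mtl_pos_class_weight_*) can combine into one multi-task training loss
import tensorflow as tf

loss_weights = {"click": 3.0, "like": 3.0, "follow": 1.0, "share": 1.0}
pos_class_weights = {"click": 1.0, "like": 2.0, "follow": 4.0, "share": 3.0}

def combined_loss(labels, logits):
    """Weighted sum of per-task binary cross-entropy losses."""
    per_task = []
    for task, weight in loss_weights.items():
        task_loss = tf.nn.weighted_cross_entropy_with_logits(
            labels=labels[task], logits=logits[task],
            pos_weight=pos_class_weights[task])
        per_task.append(weight * tf.reduce_mean(task_loss))
    return tf.add_n(per_task)

# toy batch of 2 examples for illustration
labels = {t: tf.constant([[1.0], [0.0]]) for t in loss_weights}
logits = {t: tf.constant([[0.3], [-0.8]]) for t in loss_weights}
print(combined_loss(labels, logits))
```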

```bash
cd /Merlin/examples/quick_start/scripts/ranking/

CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python ranking.py --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,follow,share --model mmoe --mmoe_num_mlp_experts 3 --expert_mlp_layers 128 --gate_dim 32 --use_task_towers=True --tower_layers 64 --embedding_sizes_multiplier 4 --l2_reg 1e-5 --embeddings_l2_reg 1e-6 --dropout 0.05 --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 1 --mtl_pos_class_weight_click=1 --mtl_pos_class_weight_like=2 --mtl_pos_class_weight_share=3 --mtl_pos_class_weight_follow=4 --mtl_loss_weight_click=3 --mtl_loss_weight_like=3 --mtl_loss_weight_follow=1 --mtl_loss_weight_share=1 --train_steps_per_epoch 10
OUT_DATASET_PATH=/outputs/dataset
cd /Merlin/examples/
CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,follow,share --model mmoe --mmoe_num_mlp_experts 3 --expert_mlp_layers 128 --gate_dim 32 --use_task_towers=True --tower_layers 64 --embedding_sizes_multiplier 4 --l2_reg 1e-5 --embeddings_l2_reg 1e-8 --dropout 0.05 --lr 1e-3 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 2 --mtl_pos_class_weight_click=1 --mtl_pos_class_weight_like=2 --mtl_pos_class_weight_share=3 --mtl_pos_class_weight_follow=4 --mtl_loss_weight_click=3 --mtl_loss_weight_like=3 --mtl_loss_weight_follow=1 --mtl_loss_weight_share=1
```

You can find more quick-start information on multi-task learning and MMOE architecture [here](scripts/ranking/README.md).
45 changes: 41 additions & 4 deletions examples/quick_start/scripts/preproc/README.md
@@ -48,7 +48,7 @@ Feature engineering allows designing new features from raw data that can pro

In this section we list common feature engineering techniques. Most of them are implemented as [ops](https://nvidia-merlin.github.io/NVTabular/v23.02.00/api.html#categorical-operators) in [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular). User-defined functions (UDFs) can be implemented with the [Lambda](https://nvidia-merlin.github.io/NVTabular/v23.02.00/generated/nvtabular.ops.LambdaOp.html#nvtabular.ops.LambdaOp) op, which is very useful, for example, for temporal and geographic feature engineering.
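For example, a minimal `LambdaOp` sketch; the `event_timestamp` column is hypothetical and not part of the TenRec dataset:

```python
# sketch: a user-defined transformation with NVTabular's LambdaOp
import nvtabular as nvt

hour_of_day = ["event_timestamp"] >> nvt.ops.LambdaOp(lambda col: col.dt.hour)
workflow = nvt.Workflow(hour_of_day)
```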

This preprocessing script provides just basic feature engineering. To use more advanced techniques, you can copy the `preprocessing.py` script and add them to the NVTabular workflow within `generate_nvt_workflow_features()`.
TIP: This preprocessing script provides just basic feature engineering. To use more advanced techniques, you can either copy `preprocessing.py` and change it, or create a class inheriting from the `PreprocessingRunner` class (in `preprocessing.py`) and override the `generate_nvt_features()` method to customize the preprocessing workflow with different NVTabular ops.
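A minimal sketch of that subclassing approach; the exact signature and return value of `generate_nvt_features()` are assumptions here, so check `preprocessing.py` for the real interface:

```python
# sketch: extending the quick-start preprocessing with an extra NVTabular op;
# class/method names come from the tip above, composition details are assumed
import nvtabular as nvt
from quick_start.scripts.preproc.preprocessing import PreprocessingRunner

class MyPreprocessingRunner(PreprocessingRunner):
    def generate_nvt_features(self):
        outputs = super().generate_nvt_features()
        # e.g. add a log transform of a count-like column on top of the defaults
        outputs = outputs + (["watching_times"] >> nvt.ops.LogOp())
        return outputs
```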

**Continuous features**
- Smoothing long-tailed distributions of continuous features with [Log](https://nvidia-merlin.github.io/NVTabular/v23.02.00/generated/nvtabular.ops.LogOp.html#nvtabular.ops.LogOp), so that the range of large numbers is compressed and the range of small numbers is expanded.
@@ -57,7 +57,7 @@ This preprocessing script provides just basic feature engineering. For more usin
**Categorical features**
- Besides contiguous ids, categorical features can also be represented by global statistics of their values, or by statistics conditioned on other columns. Some popular techniques are:
- **Count encoding** - represents the count of a given categorical value across the whole dataset (e.g. count of user past interactions)
- **Target encoding** - represents a statistic of a target column conditioned on the values of a categorical column. One example would be computing the average of the binary click target grouped by item id, which represents the item's Click-Through Rate (CTR), i.e. its likelihood to be clicked by a random user. [*Target encoding*](https://nvidia-merlin.github.io/NVTabular/v23.02.00/generated/nvtabular.ops.TargetEncoding.html#nvtabular.ops.TargetEncoding) is a very powerful feature engineering technique and has been key to many of our [winning solutions](https://medium.com/rapids-ai/winning-solution-of-recsys2020-challenge-gpu-accelerated-feature-engineering-and-training-for-cd67c5a87b1f) for RecSys competitions.
- **Target encoding** - represents a statistic of a target column conditioned on the values of a categorical column. One example would be computing the average of the binary click target grouped by item id, which represents the item's Click-Through Rate (CTR), i.e. its likelihood to be clicked by a random user. [*Target encoding*](https://nvidia-merlin.github.io/NVTabular/v23.02.00/generated/nvtabular.ops.TargetEncoding.html#nvtabular.ops.TargetEncoding) is a very powerful feature engineering technique and has been key to many of our [winning solutions](https://medium.com/rapids-ai/winning-solution-of-recsys2020-challenge-gpu-accelerated-feature-engineering-and-training-for-cd67c5a87b1f) for RecSys competitions. You can create target-encoded features with this script by setting the `--target_encoding_features` and `--target_encoding_targets` arguments, which define the categorical columns and targets used to generate them.
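As a sketch, target-encoding features in NVTabular look roughly like the following, with one `TargetEncoding` op per target column; the parameter names are taken from the NVTabular docs and should be double-checked against your version:

```python
# sketch: target-encoding features for item_id, one op per target column
import nvtabular as nvt

te_click = ["item_id"] >> nvt.ops.TargetEncoding(["click"], kfold=5, p_smooth=10)
te_like = ["item_id"] >> nvt.ops.TargetEncoding(["like"], kfold=5, p_smooth=10)
workflow = nvt.Workflow(te_click + te_like)
```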


**Temporal features**
@@ -87,9 +87,9 @@ Here is an example command line for running preprocessing for the TenRec dataset
The parameters and their values can be separated either by a space or by `=`.

```bash
cd /Merlin/examples/quick_start/scripts/preproc/
cd /Merlin/examples/
OUT_DATASET_PATH=/outputs/dataset
python preprocessing.py --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
python -m quick_start.scripts.preproc.preprocessing --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
```


@@ -186,6 +186,43 @@ python preprocessing.py --input_data_format=csv --csv_na_values=\\N --data_path
regression head for each of these targets.
```

### Target encoding features

```
--target_encoding_features
Columns (comma-sep) with categorical/discrete
features for which target encoding features will be
generated, with the average of the target columns
for each categorical value. The target columns are
defined in --target_encoding_targets. If
--target_encoding_features is not provided but
--target_encoding_targets is, all categorical
features will be used.
--target_encoding_targets
Columns (comma-sep) with target columns that will be
used to compute target encoding features with the
average of the target columns for each categorical
feature value. The categorical features are defined
in --target_encoding_features. If
--target_encoding_targets is not provided but
--target_encoding_features is, all target columns
will be used.
--target_encoding_kfold
Number of folds for target encoding, so that the
current example is not included in the computation
of its own target encoding feature, which could
cause overfitting for infrequent categorical
values. Default is 5
--target_encoding_smoothing
Smoothing factor used in the target encoding
computation, as statistics for infrequent
categorical values might be noisy. It makes the
target encoding formula =
`(sum_target_per_categ_value + global_target_avg * smooth)
/ (categ_value_count + smooth)`. Default is 10

```

> **Review comment (Contributor):** Won't giving multiple targets create an issue? You were facing issues with that, was it fixed? Also, what about the issue where the test set needs the target column?
>
> **Reply (Member, PR author):** No, I split the targets and create one `TargetEncoding` op for each to avoid the issue.
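A quick worked example of the smoothing formula in the help text above, with assumed numbers:

```python
# sketch: smoothed target encoding for a rare categorical value (assumed counts)
sum_target_per_categ_value = 2   # e.g. 2 clicks observed for this item
categ_value_count = 3            # item seen only 3 times
global_target_avg = 0.2          # assumed global click rate
smooth = 10                      # --target_encoding_smoothing default

te = (sum_target_per_categ_value + global_target_avg * smooth) / (categ_value_count + smooth)
print(te)  # (2 + 2.0) / 13 ≈ 0.31, pulled toward the global average
```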

### Data casting and filtering
```
--to_int32 Cast these columns (comma-sep) to int32.
47 changes: 45 additions & 2 deletions examples/quick_start/scripts/preproc/args_parsing.py
@@ -121,6 +121,46 @@ def build_arg_parser():
help="Columns (comma-sep) that should be tagged in the schema as binary target. "
"Merlin Models will create a regression head for each of these targets.",
)
parser.add_argument(
"--target_encoding_features",
default="",
help="Columns (comma-sep) with categorical/discrete features "
"for which target encoding features will be generated, with "
"the average of the target columns for each categorical value. "
"The target columns are defined in --target_encoding_targets. "
"If --target_encoding_features is not provided but --target_encoding_targets "
"is, all categorical features will be used.",
)
parser.add_argument(
"--target_encoding_targets",
default="",
help="Columns (comma-sep) with target columns "
"that will be used to compute target encoding features "
"with the average of the target columns for categorical features value. "
"The categorical features are defined in --target_encoding_features. "
"If --target_encoding_targets is not provided but --target_encoding_features is, "
"all target columns will be used.",
)

parser.add_argument(
"--target_encoding_kfold",
default=5,
type=int,
help="Number of folds for target encoding, in order to avoid that the current example "
"is considered in the target encoding feature computation, which could cause "
"overfitting for infrequent categorical values. Default is 5",
)

parser.add_argument(
"--target_encoding_smoothing",
default=10,
type=int,
help="Smoothing factor that is used in the target encoding computation, as statistics for "
"infrequent categorical values might be noisy. "
"It makes target encoding formula = "
"`sum_target_per_categ_value + (global_target_avg * smooth) / categ_value_count + smooth`. "
"Default is 10",
)

parser.add_argument(
"--user_id_feature",
@@ -299,9 +339,9 @@ def parse_list_arg(v):
return v.split(",")


def parse_arguments():
def parse_arguments(args=None):
parser = build_arg_parser()
args = parser.parse_args()
args = parser.parse_args(args)

# Parsing list args
args.control_features = parse_list_arg(args.control_features)
@@ -311,6 +351,9 @@ def parse_arguments():
args.binary_classif_targets = parse_list_arg(args.binary_classif_targets)
args.regression_targets = parse_list_arg(args.regression_targets)

args.target_encoding_features = parse_list_arg(args.target_encoding_features)
args.target_encoding_targets = parse_list_arg(args.target_encoding_targets)

args.user_features = parse_list_arg(args.user_features)
args.item_features = parse_list_arg(args.item_features)
args.to_int32 = parse_list_arg(args.to_int32)
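Since `parse_arguments` now accepts an optional `args` list (see the change above), the parser can also be driven programmatically, for example from tests or a notebook. A small sketch, with illustrative flag values; required arguments may differ:

```python
# sketch: calling the preprocessing argument parser programmatically; run from
# /Merlin/examples so that the quick_start package is importable
from quick_start.scripts.preproc.args_parsing import parse_arguments

args = parse_arguments([
    "--data_path", "/data/QK-video-10M.csv",
    "--input_data_format", "csv",
    "--categorical_features", "user_id,item_id",
    "--target_encoding_features", "item_id",
    "--target_encoding_targets", "click",
])
print(args.target_encoding_features)  # ['item_id'] after list parsing
```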