Improvements in Quick-start for Ranking #1014

Merged 4 commits on Jun 15, 2023
20 changes: 11 additions & 9 deletions examples/quick_start/ranking.md
@@ -82,13 +82,14 @@ In this example, we set some options for preprocessing. Here is the explanation

For larger datasets (like the full TenRec dataset), in particular when using filtering options that require dask_cudf filtering (e.g. `--filter_query`, `--min_item_freq`), we recommend using the following options to avoid out-of-memory errors:
- `--enable_dask_cuda_cluster` - Initializes a dask-cudf `LocalCUDACluster` for managed single- or multi-GPU preprocessing (see the sketch below this list)
- `--persist_intermediate_files` - Persists/caches intermediate files to disk during preprocessing (in particular after filtering).
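For reference, this is roughly what a managed dask-cuda setup looks like; it is a sketch of what `--enable_dask_cuda_cluster` arranges, and the exact configuration used by `preprocessing.py` may differ:

```python
# sketch: a single/multi-GPU LocalCUDACluster for dask_cudf work; the memory limit
# below is an assumed example value, not a setting taken from preprocessing.py
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(device_memory_limit="24GB")  # spill to host memory past this
client = Client(cluster)
# ... run dask_cudf / NVTabular preprocessing here ...
client.close()
cluster.close()
```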
*Note*: If you want to preprocess the full TenRec dataset, set `--data_path /data/QK-video.csv` in the following command.


```bash
cd /Merlin/examples/quick_start/scripts/preproc/
cd /Merlin/examples/
OUT_DATASET_PATH=/outputs/dataset
python preprocessing.py --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
python -m quick_start.scripts.preproc.preprocessing --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video-10M.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
```

After you execute this script, a folder `dataset` will be created in `--output_path` with the preprocessed data, split into `train` and `eval` folders. You will find a number of partitioned Parquet files inside those folders, as well as a `schema.pbtxt` file produced by `NVTabular`, which is very important for automated model building in the next step.
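To sanity-check the output, you can load it back with the Merlin dataset API; a small sketch, assuming the `--output_path` used above:

```python
# sketch: inspecting the preprocessed dataset and the schema captured by NVTabular
from merlin.io import Dataset

train = Dataset("/outputs/dataset/train", engine="parquet")
print(train.schema)           # tags and dtypes stored in schema.pbtxt
print(train.to_ddf().head())  # a few preprocessed rows
```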
@@ -98,14 +99,15 @@ Merlin Models is a Merlin library that makes it easy to build and train RecSys m

A number of popular ranking models are available in Merlin Models API like **DLRM**, **DCN-v2**, **Wide&Deep**, **DeepFM**. This Quick-start provides a generic ranking script [ranking.py](scripts/ranking/ranking.py) for building and training those models using Models API.
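For orientation, this is roughly what a DLRM definition looks like with the Models API; it is a sketch, not the exact code inside `ranking.py`:

```python
# sketch: building and fitting a DLRM ranking model with Merlin Models (TensorFlow);
# layer sizes mirror the example flags below, other details are simplified
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("/outputs/dataset/train", engine="parquet")

model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,                 # DLRM needs one shared embedding dim
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([64, 32]),  # analogous to --mlp_layers 64,32
    prediction_tasks=mm.BinaryClassificationTask("click"),
)
model.compile(optimizer="adam")
model.fit(train, batch_size=16384, epochs=1)
```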

In the following command example, you can easily train the popular **DLRM** model, which performs second-order feature interactions. It sets `--model dlrm` and `--embeddings_dim 64` because DLRM models require all categorical columns to be embedded with the same dimension for the feature interaction. Notice that we can set many of the common model hyperparameters (e.g. top `--mlp_layers`) and training hyperparameters like the learning rate (`--lr`) and its decay (`--lr_decay_rate`, `--lr_decay_steps`), L2 regularization (`--l2_reg`, `--embeddings_l2_reg`) and `--dropout`, among others. We set `--epochs 1` and `--train_steps_per_epoch 10` to train on just 10 batches and make the runtime faster. If you have a GPU with more memory (e.g. a V100 with 32 GB), you might increase `--train_batch_size` and `--eval_batch_size` to a much larger batch size, for example `65536`.
In the following command example, you can easily train the popular **DLRM** model, which performs second-order feature interactions. It sets `--model dlrm` and `--embeddings_dim 64` because DLRM models require all categorical columns to be embedded with the same dimension for the feature interaction. Notice that we can set many of the common model hyperparameters (e.g. top `--mlp_layers`) and training hyperparameters like the learning rate (`--lr`) and its decay (`--lr_decay_rate`, `--lr_decay_steps`), L2 regularization (`--l2_reg`, `--embeddings_l2_reg`) and `--dropout`, among others. We set `--train_batch_size` and `--eval_batch_size` to `65536` for faster training and set `--epochs 2`. If you have preprocessed the full TenRec dataset, you can set `--train_steps_per_epoch 100` to limit the number of training steps.
There are many target columns available in the dataset, and you can select one of them for training by setting `--tasks=click`. In this dataset, there are about 3.7 negative examples (`click=0`) for each positive example (`click=1`), which leads to some class imbalance. We can deal with that by setting `--stl_positive_class_weight` to give more weight to the loss of positive examples, which are rarer.
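Conceptually, the positive class weight scales the loss contribution of positive examples, as in this minimal sketch (illustrative, not the script's exact implementation):

```python
# sketch: the idea behind --stl_positive_class_weight, i.e. up-weight the loss
# of the rarer positive class in a binary cross-entropy
import tensorflow as tf

y_true   = tf.constant([[1.0], [0.0], [0.0], [0.0]])  # roughly 1 positive per ~3.7 negatives
y_logits = tf.constant([[0.2], [-1.0], [0.3], [-0.5]])

loss = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(labels=y_true, logits=y_logits,
                                             pos_weight=3.0))  # --stl_positive_class_weight 3
print(loss)
```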

Note:

```bash
cd /Merlin/examples/quick_start/scripts/ranking/
OUT_DATASET_PATH=/outputs/dataset
CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python ranking.py --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 3 --model dlrm --embeddings_dim 64 --l2_reg 1e-4 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32 --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 1 --train_steps_per_epoch 10
cd /Merlin/examples/
CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 3 --model dlrm --embeddings_dim 64 --l2_reg 1e-4 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32 --lr 1e-3 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 2
```
You can explore the [full documentation and best practices for ranking models](scripts/ranking/README.md), which contains details about the command line arguments.

@@ -121,9 +123,9 @@ In the following example, we use the popular **MMOE** (`--model mmoe`) architect
You can also balance the loss weights by setting the `--mtl_loss_weight_*` arguments and the tasks' positive class weights by setting `--mtl_pos_class_weight_*`.
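Conceptually, these arguments combine into a single training loss as a weighted sum of per-task losses, roughly as in the sketch below; task names are from this dataset, and the actual implementation in `ranking.py` may differ:

```python
# sketch: how per-task loss weights (--mtl_loss_weight_*) and per-task positive class
# weights (--mtl_pos_class_weight_*) can combine into one multi-task training loss
import tensorflow as tf

loss_weights = {"click": 3.0, "like": 3.0, "follow": 1.0, "share": 1.0}
pos_class_weights = {"click": 1.0, "like": 2.0, "follow": 4.0, "share": 3.0}

def combined_loss(labels, logits):
    """Weighted sum of per-task binary cross-entropy losses."""
    per_task = []
    for task, weight in loss_weights.items():
        task_loss = tf.nn.weighted_cross_entropy_with_logits(
            labels=labels[task], logits=logits[task],
            pos_weight=pos_class_weights[task])
        per_task.append(weight * tf.reduce_mean(task_loss))
    return tf.add_n(per_task)

# toy batch of 2 examples for illustration
labels = {t: tf.constant([[1.0], [0.0]]) for t in loss_weights}
logits = {t: tf.constant([[0.3], [-0.8]]) for t in loss_weights}
print(combined_loss(labels, logits))
```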

```bash
cd /Merlin/examples/quick_start/scripts/ranking/

CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python ranking.py --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,follow,share --model mmoe --mmoe_num_mlp_experts 3 --expert_mlp_layers 128 --gate_dim 32 --use_task_towers=True --tower_layers 64 --embedding_sizes_multiplier 4 --l2_reg 1e-5 --embeddings_l2_reg 1e-6 --dropout 0.05 --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 1 --mtl_pos_class_weight_click=1 --mtl_pos_class_weight_like=2 --mtl_pos_class_weight_share=3 --mtl_pos_class_weight_follow=4 --mtl_loss_weight_click=3 --mtl_loss_weight_like=3 --mtl_loss_weight_follow=1 --mtl_loss_weight_share=1 --train_steps_per_epoch 10
OUT_DATASET_PATH=/outputs/dataset
cd /Merlin/examples/
CUDA_VISIBLE_DEVICES=0 TF_GPU_ALLOCATOR=cuda_malloc_async python -m quick_start.scripts.ranking.ranking --train_data_path $OUT_DATASET_PATH/train --eval_data_path $OUT_DATASET_PATH/eval --output_path ./outputs/ --tasks=click,like,follow,share --model mmoe --mmoe_num_mlp_experts 3 --expert_mlp_layers 128 --gate_dim 32 --use_task_towers=True --tower_layers 64 --embedding_sizes_multiplier 4 --l2_reg 1e-5 --embeddings_l2_reg 1e-8 --dropout 0.05 --lr 1e-3 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 65536 --eval_batch_size 65536 --epochs 2 --mtl_pos_class_weight_click=1 --mtl_pos_class_weight_like=2 --mtl_pos_class_weight_share=3 --mtl_pos_class_weight_follow=4 --mtl_loss_weight_click=3 --mtl_loss_weight_like=3 --mtl_loss_weight_follow=1 --mtl_loss_weight_share=1
```

You can find more quick-start information on multi-task learning and MMOE architecture [here](scripts/ranking/README.md).
45 changes: 41 additions & 4 deletions examples/quick_start/scripts/preproc/README.md
@@ -48,7 +48,7 @@ Feature engineering allows designing new features from raw data that can pro

In this section we list common feature engineering techniques. Most of them are implemented as [ops](https://nvidia-merlin.github.io/NVTabular/v23.02.00/api.html#categorical-operators) in [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular). User-defined functions (UDFs) can be implemented with the [Lambda](https://nvidia-merlin.github.io/NVTabular/v23.02.00/generated/nvtabular.ops.LambdaOp.html#nvtabular.ops.LambdaOp) op, which is very useful, for example, for temporal and geographic feature engineering.
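For example, a minimal `LambdaOp` sketch; the `event_timestamp` column is hypothetical and not part of the TenRec dataset:

```python
# sketch: a user-defined transformation with NVTabular's LambdaOp
import nvtabular as nvt

hour_of_day = ["event_timestamp"] >> nvt.ops.LambdaOp(lambda col: col.dt.hour)
workflow = nvt.Workflow(hour_of_day)
```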

This preprocessing script provides just basic feature engineering. To use more advanced techniques, you can copy the `preprocessing.py` script and add them to the NVTabular workflow within `generate_nvt_workflow_features()`.
TIP: This preprocessing script provides just basic feature engineering. To use more advanced techniques, you can either copy `preprocessing.py` and change it, or create a class inheriting from the `PreprocessingRunner` class (in `preprocessing.py`) and override the `generate_nvt_features()` method to customize the preprocessing workflow with different NVTabular ops.
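A minimal sketch of that subclassing approach; the exact signature and return value of `generate_nvt_features()` are assumptions here, so check `preprocessing.py` for the real interface:

```python
# sketch: extending the quick-start preprocessing with an extra NVTabular op;
# class/method names come from the tip above, composition details are assumed
import nvtabular as nvt
from quick_start.scripts.preproc.preprocessing import PreprocessingRunner

class MyPreprocessingRunner(PreprocessingRunner):
    def generate_nvt_features(self):
        outputs = super().generate_nvt_features()
        # e.g. add a log transform of a count-like column on top of the defaults
        outputs = outputs + (["watching_times"] >> nvt.ops.LogOp())
        return outputs
```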

**Continuous features**
- Smoothing long-tailed distributions of continuous features with [Log](https://nvidia-merlin.github.io/NVTabular/v23.02.00/generated/nvtabular.ops.LogOp.html#nvtabular.ops.LogOp), so that the range of large numbers is compressed and the range of small numbers is expanded.
@@ -57,7 +57,7 @@ This preprocessing script provides just basic feature engineering. For more usin
**Categorical features**
- Besides contiguous ids, categorical features can also be represented by global statistics of their values, or by statistics conditioned on other columns. Some popular techniques are:
- **Count encoding** - represents the count of a given categorical value across the whole dataset (e.g. count of user past interactions)
- **Target encoding** - represents a statistic of a target column conditioned on the values of a categorical column. One example would be computing the average of the binary click target grouped by item id, which represents the item's Click-Through Rate (CTR), i.e. its likelihood to be clicked by a random user. [*Target encoding*](https://nvidia-merlin.github.io/NVTabular/v23.02.00/generated/nvtabular.ops.TargetEncoding.html#nvtabular.ops.TargetEncoding) is a very powerful feature engineering technique and has been key to many of our [winning solutions](https://medium.com/rapids-ai/winning-solution-of-recsys2020-challenge-gpu-accelerated-feature-engineering-and-training-for-cd67c5a87b1f) for RecSys competitions.
- **Target encoding** - represents a statistic of a target column conditioned on the values of a categorical column. One example would be computing the average of the binary click target grouped by item id, which represents the item's Click-Through Rate (CTR), i.e. its likelihood to be clicked by a random user. [*Target encoding*](https://nvidia-merlin.github.io/NVTabular/v23.02.00/generated/nvtabular.ops.TargetEncoding.html#nvtabular.ops.TargetEncoding) is a very powerful feature engineering technique and has been key to many of our [winning solutions](https://medium.com/rapids-ai/winning-solution-of-recsys2020-challenge-gpu-accelerated-feature-engineering-and-training-for-cd67c5a87b1f) for RecSys competitions. You can create target-encoded features with this script by setting the `--target_encoding_features` and `--target_encoding_targets` arguments, which define the categorical columns and targets used to generate them.
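As a sketch, target-encoding features in NVTabular look roughly like the following, with one `TargetEncoding` op per target column; the parameter names are taken from the NVTabular docs and should be double-checked against your version:

```python
# sketch: target-encoding features for item_id, one op per target column
import nvtabular as nvt

te_click = ["item_id"] >> nvt.ops.TargetEncoding(["click"], kfold=5, p_smooth=10)
te_like = ["item_id"] >> nvt.ops.TargetEncoding(["like"], kfold=5, p_smooth=10)
workflow = nvt.Workflow(te_click + te_like)
```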


**Temporal features**
@@ -87,9 +87,9 @@ Here is an example command line for running preprocessing for the TenRec dataset
The parameters and their values can be separated either by a space or by `=`.

```bash
cd /Merlin/examples/quick_start/scripts/preproc/
cd /Merlin/examples/
OUT_DATASET_PATH=/outputs/dataset
python preprocessing.py --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
python -m quick_start.scripts.preproc.preprocessing --input_data_format=csv --csv_na_values=\\N --data_path /data/QK-video.csv --filter_query="click==1 or (click==0 and follow==0 and like==0 and share==0)" --min_item_freq=30 --min_user_freq=30 --max_user_freq=150 --num_max_rounds_filtering=5 --enable_dask_cuda_cluster --persist_intermediate_files --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --dataset_split_strategy=random_by_user --random_split_eval_perc=0.2
```


@@ -186,6 +186,43 @@ python preprocessing.py --input_data_format=csv --csv_na_values=\\N --data_path
regression head for each of these targets.
```

### Target encoding features

```
--target_encoding_features
Columns (comma-sep) with categorical/discrete
features for which target encoding features will be
generated, with the average of the target columns
for each categorical value. The target columns are
defined in --target_encoding_targets. If
--target_encoding_features is not provided but
--target_encoding_targets is, all categorical
features will be used.
--target_encoding_targets
Columns (comma-sep) with target columns that will be
used to compute target encoding features with the
average of the target columns for each categorical
feature value. The categorical features are defined
in --target_encoding_features. If
--target_encoding_targets is not provided but
--target_encoding_features is, all target columns
will be used.
--target_encoding_kfold
Number of folds for target encoding, so that the
current example is not included in the computation
of its own target encoding feature, which could
cause overfitting for infrequent categorical
values. Default is 5
--target_encoding_smoothing
Smoothing factor used in the target encoding
computation, as statistics for infrequent
categorical values might be noisy. It makes the
target encoding formula =
`(sum_target_per_categ_value + global_target_avg * smooth)
/ (categ_value_count + smooth)`. Default is 10

```

> **Review comment (Contributor):** Won't giving multiple targets create an issue? You were facing issues with that, was it fixed? Also, what about the issue where the test set needs the target column?
>
> **Reply (Member, PR author):** No, I split the targets and create one `TargetEncoding` op for each to avoid the issue.
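A quick worked example of the smoothing formula in the help text above, with assumed numbers:

```python
# sketch: smoothed target encoding for a rare categorical value (assumed counts)
sum_target_per_categ_value = 2   # e.g. 2 clicks observed for this item
categ_value_count = 3            # item seen only 3 times
global_target_avg = 0.2          # assumed global click rate
smooth = 10                      # --target_encoding_smoothing default

te = (sum_target_per_categ_value + global_target_avg * smooth) / (categ_value_count + smooth)
print(te)  # (2 + 2.0) / 13 ≈ 0.31, pulled toward the global average
```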

### Data casting and filtering
```
--to_int32 Cast these columns (comma-sep) to int32.
47 changes: 45 additions & 2 deletions examples/quick_start/scripts/preproc/args_parsing.py
@@ -121,6 +121,46 @@ def build_arg_parser():
help="Columns (comma-sep) that should be tagged in the schema as binary target. "
"Merlin Models will create a regression head for each of these targets.",
)
parser.add_argument(
"--target_encoding_features",
default="",
help="Columns (comma-sep) with categorical/discrete features "
"for which target encoding features will be generated, with "
"the average of the target columns for each categorical value. "
"The target columns are defined in --target_encoding_targets. "
"If --target_encoding_features is not provided but --target_encoding_targets "
"is, all categorical features will be used.",
)
parser.add_argument(
"--target_encoding_targets",
default="",
help="Columns (comma-sep) with target columns "
"that will be used to compute target encoding features "
"with the average of the target columns for categorical features value. "
"The categorical features are defined in --target_encoding_features. "
"If --target_encoding_targets is not provided but --target_encoding_features is, "
"all target columns will be used.",
)

parser.add_argument(
"--target_encoding_kfold",
default=5,
type=int,
help="Number of folds for target encoding, in order to avoid that the current example "
"is considered in the target encoding feature computation, which could cause "
"overfitting for infrequent categorical values. Default is 5",
)

parser.add_argument(
"--target_encoding_smoothing",
default=10,
type=int,
help="Smoothing factor that is used in the target encoding computation, as statistics for "
"infrequent categorical values might be noisy. "
"It makes target encoding formula = "
"`sum_target_per_categ_value + (global_target_avg * smooth) / categ_value_count + smooth`. "
"Default is 10",
)

parser.add_argument(
"--user_id_feature",
@@ -299,9 +339,9 @@ def parse_list_arg(v):
return v.split(",")


def parse_arguments():
def parse_arguments(args=None):
parser = build_arg_parser()
args = parser.parse_args()
args = parser.parse_args(args)

# Parsing list args
args.control_features = parse_list_arg(args.control_features)
@@ -311,6 +351,9 @@ def parse_arguments():
args.binary_classif_targets = parse_list_arg(args.binary_classif_targets)
args.regression_targets = parse_list_arg(args.regression_targets)

args.target_encoding_features = parse_list_arg(args.target_encoding_features)
args.target_encoding_targets = parse_list_arg(args.target_encoding_targets)

args.user_features = parse_list_arg(args.user_features)
args.item_features = parse_list_arg(args.item_features)
args.to_int32 = parse_list_arg(args.to_int32)
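Since `parse_arguments` now accepts an optional `args` list (see the change above), the parser can also be driven programmatically, for example from tests or a notebook. A small sketch, with illustrative flag values; required arguments may differ:

```python
# sketch: calling the preprocessing argument parser programmatically; run from
# /Merlin/examples so that the quick_start package is importable
from quick_start.scripts.preproc.args_parsing import parse_arguments

args = parse_arguments([
    "--data_path", "/data/QK-video-10M.csv",
    "--input_data_format", "csv",
    "--categorical_features", "user_id,item_id",
    "--target_encoding_features", "item_id",
    "--target_encoding_targets", "click",
])
print(args.target_encoding_features)  # ['item_id'] after list parsing
```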