
Quick-start for ranking with Merlin Models #915

Merged (16 commits merged into main on Apr 24, 2023)

Conversation

@gabrielspmoreira (Member) commented on Apr 19, 2023

This PR ports PR #988, which was originally in the models repo, to this repo, since the quick-start involves several Merlin libraries: NVTabular, models, and systems.

Fixes #916, fixes #986, fixes #918, fixes #680, fixes #681, fixes #666

Goals ⚽

This PR introduces a quick-start example for preprocessing, training, evaluating, and deploying ranking models.
It is composed of a set of scripts and markdown documents. The example uses the TenRec dataset, but the scripts are generic and can be used with customers' own data, provided that it has the right shape: positive and (optionally) negative user-item events with tabular features.

Implementation Details 🚧

  • preprocessing.py - Generic script for preprocessing a raw dataset (CSV or Parquet) with NVTabular. Its CLI arguments configure the input path and format, the categorical and continuous features, feature tagging (user_id, item_id, ...), filtering of interactions by min/max user or item frequency, and the dataset split.
    Example command line for TenRec dataset:
python preprocessing.py --input_data_format=csv --csv_na_values=\\N --input_data_path /data/QK-video.csv --output_path=$OUT_DATASET_PATH --categorical_features=user_id,item_id,video_category,gender,age --binary_classif_targets=click,follow,like,share --regression_targets=watching_times --to_int32=user_id,item_id --to_int16=watching_times --to_int8=gender,age,video_category,click,follow,like,share --user_id_feature=user_id --item_id_feature=item_id --min_user_freq 5 --persist_intermediate_files --dataset_split_strategy=random --random_split_eval_perc=0.2 	
  • ranking_train_eval.py - Generic script for training and evaluating ranking models. It takes the preprocessed dataset and schema produced by preprocessing.py as input. You can set many training and model hyperparameters to train both single-task learning models (MLP, DCN, DLRM, Wide&Deep, DeepFM) and multi-task learning models (e.g., MMOE, CGC, PLE).
    Example command line:
python  ranking_train_eval.py --train_path $OUT_DATASET_PATH/final_dataset/train --eval_path $OUT_DATASET_PATH/final_dataset/eval --output_path ./outputs/ --tasks=click --stl_positive_class_weight 4 --model dlrm --embeddings_dim 64 --l2_reg 1e-5 --embeddings_l2_reg 1e-6 --dropout 0.05 --mlp_layers 64,32  --lr 1e-4 --lr_decay_rate 0.99 --lr_decay_steps 100 --train_batch_size 4096 --eval_batch_size 4096 --epochs 1 --train_steps_per_epoch 10 

Testing Details 🔍

  • The preprocessing and ranking training scripts will be added as integration tests.

Tasks

Implementation

Experimentation

Documentation

You can check the Quick-start for ranking documentation starting from its main page.

@gabrielspmoreira self-assigned this on Apr 19, 2023
@gabrielspmoreira added the examples label on Apr 19, 2023
@gabrielspmoreira added this to the Merlin 23.05 milestone on Apr 19, 2023
@github-actions

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-915

Review thread on the docs, at the "### CUDA cluster options" section (following the `--timestamp_feature >= value` option):
Contributor

As we discussed, I think this should be optional, and users should be able to use CPU as well, which they might prefer with a small dataset. This is the case for the recsys23 competition dataset; it is not that big.

Member Author

I changed the preprocessing.py script to detect whether GPUs are available; if not, it configures Dataset(..., cpu=True). But when testing this setting in the Merlin TF container 23.02 without GPUs available, it raised some errors when importing NVTabular, due to a known issue with cuda-python 11.7.0 and earlier (used by cudf). According to @oliverholworthy, we shouldn't have this issue with the 23.04 release because it uses a more recent version of cudf and cuda-python.
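
As an illustration, here is a minimal sketch of GPU detection with a CPU fallback; the helper names and detection approach below are assumptions for illustration, not necessarily what preprocessing.py actually does:

```python
# Illustrative sketch only: detect GPUs and fall back to a CPU-backed dataset.
# Helper names and the detection approach are assumptions, not the script's code.
from merlin.io import Dataset


def gpu_available() -> bool:
    """Return True if at least one CUDA device is visible."""
    try:
        from numba import cuda

        return cuda.is_available()
    except Exception:
        return False


def read_dataset(path: str) -> Dataset:
    # cpu=True makes merlin.io.Dataset use pandas (CPU) dataframes instead of cuDF.
    return Dataset(path, cpu=not gpu_available())
```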

Member Author

In addition to that, I created an --enable_dask_cuda_cluster option in the preprocessing.py script to enable/disable the use of a Dask cluster, as not using LocalCUDACluster might be faster for smaller datasets.
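
For context, this is roughly what that flag can amount to; a minimal sketch assuming the script starts a LocalCUDACluster only when requested (the helper below is hypothetical, not the script's actual code):

```python
# Hypothetical helper: start a local Dask-CUDA cluster only when requested.
from dask.distributed import Client


def maybe_start_cluster(enable_dask_cuda_cluster: bool):
    if not enable_dask_cuda_cluster:
        # For smaller datasets, skipping the cluster avoids scheduler overhead.
        return None
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster()  # one worker per visible GPU by default
    # The returned client becomes the active Dask client, which Dask-based
    # preprocessing (e.g. NVTabular) will use for execution.
    return Client(cluster)
```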


Review thread on the docs, at the "### Inputs" section (on the `--data_path` argument):
Contributor

If one has already split the train and val sets on their end up front, then --data_path would be the train set path, right? And to avoid any further split, should users also set --dataset_split_strategy to None, or is None the default? If it defaults to None, that's fine.

Member Author

Yes, the default --dataset_split_strategy is None. If the eval and test sets were already split, you just need to provide them in --eval_data_path and --test_data_path.
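
For example, with pre-split data the invocation could look like this (paths are placeholders, remaining arguments are omitted, and the script/flag names are as discussed in this thread, so double-check them against the script's --help):

python preprocessing.py --data_path /data/presplit/train --eval_data_path /data/presplit/eval --test_data_path /data/presplit/test ...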

Review thread on the new document "Quick-start for ranking models with Merlin":
Contributor

Do you think this should go one level up, like an opening README page right under the Merlin/examples/quick_start/ folder?

Member Author

It could be there, but maybe renamed to something like ranking.md?
The idea is that we are going to add other quick-start documents next, like Quick-start for session-based recommendation, for retrieval, and for two/multiple stages.
In that case, I think there should be a README.md that works as an index for our quick-starts. What do you think?

Contributor

sounds good to me.

Member Author

I have created a README.md with an intro to the quick-starts and a link to the ranking one.

@rnyak (Contributor) commented on Apr 21, 2023

@gabrielspmoreira One thing I think we can improve is the prediction step. I tested the script you shared with me for prediction, but it retrains the model. Is there a prediction script where the user can provide the saved model path and run batch prediction automatically without training again? It'd be better if we can provide an example code snippet showing how to do the prediction.

@gabrielspmoreira (Member, Author)

Indeed. Following your suggestion, I made it possible to save the trained model with --save_model_path and then run the script again providing --load_model_path, in that case not providing --train_data_path but just --predict_data_path, so that the script loads the trained model and just performs batch prediction, saving the predictions to --predict_output_path.
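
For instance, something along these lines (paths are placeholders, other training flags are omitted, and the exact argument names should be checked against the script's --help):

python ranking_train_eval.py --train_data_path $DATA/train --eval_data_path $DATA/eval --save_model_path ./saved_model ...

and then, to load the saved model and run only batch prediction:

python ranking_train_eval.py --load_model_path ./saved_model --predict_data_path $DATA/predict --predict_output_path ./predictions ...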

@gabrielspmoreira gabrielspmoreira merged commit ae260b9 into main Apr 24, 2023
gabrielspmoreira added a commit that referenced this pull request Apr 25, 2023
* Moving quick-start for ranking from models repo to Merlin repo

* Updating quick-start doc and gitignore

* Remove outputs from ranking script

* Created tutorial of hpo with Quick-Start and W&B sweeps. Refined docs

* Added option to run preprocessing using CPU. But NVTabular import is failing in that case

* Discovering automatically if GPUs are available in preprocessing script

* Refined docs on hypertuning

* Refactored CLI args for preproc and ranking to better support loading trained models and generating preds without retraining

* Adjustments in the markdown documentation

* Having quick-start dynamic args to support space separated command line keys and values

* Fix on the arg parsing of quick-start ranking training

* Raising an exception when a target not found in the schema is provided

* Additional fixes in the documentation

* Fixed an issue when no --train_data_path or --eval_data_path is provided, but just --predict_data_path

* Printing the folder where prediction file will be saved

* Printing the folder where prediction file will be saved