Review getting started examples #779

Merged
merged 32 commits into from
Jan 22, 2023
Conversation

radekosmulski
Contributor

No description provided.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@radekosmulski radekosmulski changed the title Review getting started examples WIP: Review getting started examples Dec 29, 2022
@radekosmulski radekosmulski marked this pull request as draft December 29, 2022 02:54
@github-actions

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-779

@radekosmulski radekosmulski added documentation Improvements or additions to documentation examples Adding new examples labels Dec 29, 2022
@radekosmulski radekosmulski marked this pull request as ready for review January 10, 2023 20:31
@radekosmulski radekosmulski marked this pull request as draft January 10, 2023 20:31
@radekosmulski radekosmulski changed the title WIP: Review getting started examples Review getting started examples Jan 10, 2023
@radekosmulski radekosmulski marked this pull request as ready for review January 10, 2023 20:56
@radekosmulski
Contributor Author

rerun tests

@@ -57,7 +57,7 @@
"HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
Contributor

We need to move them there for inference. Can we add the reason why we move them?

Contributor Author

rephrased this explaining the situation to the reader

@@ -2,7 +2,7 @@
"cells": [
Contributor

I think the paragraph doesn't make much sense

Applying deep learning models to recommendation systems faces unique challenges in comparison to other domains, such as computer vision and natural language processing. The datasets and common model architectures have unique characteristics, which require custom solutions. Recommendation system datasets have terabytes in size with billion examples but each example is represented by only a few bytes. For example, the Criteo CTR dataset, the largest publicly available dataset, is 1.3TB with 4 billion examples. The model architectures have normally large embedding tables for the users and items, which do not fit on a single GPU.

We do not address large embeddings in the example. I don't know why we mention it.

We do not use the Criteo dataset, so I am not sure why we explain something related to it.

Can we rename "NVTabular dataloader" to "Merlin dataloader"?

This notebook explains, how to use the NVTabular dataloader to accelerate TensorFlow training.

Contributor Author

completely changed this paragraph now

@@ -2,7 +2,7 @@
"cells": [
Contributor

Are we using all the imports? E.g. I don't think we will use workflow_fit_transform in the notebook?

Contributor Author

removed unused imports

@@ -2,7 +2,7 @@
"cells": [
Contributor

Can we add the tags to the previous notebook (02 NVTabular)?

I see that we add another NVTabular workflow in this notebook, which just adds tags to the workflow. I think it will be cleaner if we add the tags in the previous notebook and do not require two NVTabular workflows.

We can remove the genre columns from the schema file to not load/create an architecture with the feature.

Contributor Author

done

@@ -2,7 +2,7 @@
"cells": [
Contributor

We can move the export code to the 4th notebook?

We can save the TensorFlow model.

In the 4th notebook, we can load the NVTabular workflow and TensorFlow model to create the graph with Merlin Systems



@@ -57,7 +57,7 @@
"HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
Contributor

"We walked through the documentation, but it is useful to understand the API."

This sentence is not correct/up to date, as we do not walk through the documentation anymore. We can just delete it and write something like:

"We define our model in a model.py file" (+ add the reason why we do it in a model.py file)

Contributor Author

done

@bschifferer
Contributor

@radekosmulski is it ready for a 2nd review? I am not sure if you added the changes to the TF notebook.

@bschifferer
Contributor

We need to move the tests to Merlin/tests/unit/examples/

@@ -2,7 +2,7 @@
"cells": [
Contributor

@rnyak rnyak Jan 18, 2023

Line #2.        "INPUT_DATA_DIR", os.path.expanduser("~/nvt-examples/movielens/data/")

Is it possible to not have this path at root? Is there a specific reason we set this at root? I know this is from the existing examples, but I really don't like that this is in root :) Can we set it as /workspace/nvt-examples/... if it won't create too much work? What do you think?

Contributor Author

changed the path as advised

Contributor

@radekosmulski did changing the path create any issues with the unit tests? Let's be sure it won't :)

@@ -2,7 +2,7 @@
"cells": [
Contributor

@rnyak rnyak Jan 18, 2023

Line #8.        write_hugectr_keyset=True  # only needed if using this ETL Notebook for training with HugeCTR

If we are not using HugeCTR, can we still keep this line, or should it be removed? If it should be removed, we can add a sentence saying so.

Contributor Author

added explanation

@@ -57,7 +57,7 @@
"HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
Contributor

A 10x speed-up in training? With what dataset and model? DLRM? Do you mind giving just one more line of explanation, or a link to support that statement? Thanks.

Contributor Author

This information was already in the notebook; someone else is stating that they saw this improvement, but I have not come across any supporting evidence.

Should we keep it as is? Otherwise we can remove the 10x, or the whole passage altogether?

Contributor

@bschifferer do you know of any supporting evidence/link for the 10x speed-up? Thanks.

@@ -57,7 +57,7 @@
"HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
Contributor

@rnyak rnyak Jan 18, 2023

Line #1.    %%writefile train_hugeCTR.py

Looks like the file name is not model.py here; it is train_hugeCTR.py, so better to modify the file name in the sentence above: "We will write the model to ./model.py and execute it afterwards".

Contributor Author

corrected

@@ -37,52 +37,44 @@
"\n",
Contributor

I am not sure the sentence "The second tensor is a supporting Tensor, describing how many genres are associated with each datapoint in the batch." is really accurate, because according to my understanding the second tensor holds values like 88846, and a batch in this example cannot have that many data points in a single tensor.

I'd say the second tensor represents the starting point of each tensor of genres, and the difference between two successive numbers shows the number of data points in each tensor.

Basically, can we clarify it better? What do you think?
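The values/offsets representation the reviewer describes can be sketched in plain Python (the genre values and batch below are illustrative, not taken from the actual MovieLens data):

```python
# Ragged "genres" feature for a batch of four movies, stored as two flat
# tensors: one with all the values, one with the offset where each row starts.
values = ["Action", "Sci-Fi", "Comedy", "Drama", "Romance", "Thriller"]
offsets = [0, 2, 3, 5, 6]  # row i spans values[offsets[i]:offsets[i + 1]]

# The difference between two successive offsets is the number of genres
# associated with that data point, as the review comment suggests.
rows = [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
lengths = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
print(rows)     # [['Action', 'Sci-Fi'], ['Comedy'], ['Drama', 'Romance'], ['Thriller']]
print(lengths)  # [2, 1, 2, 1]
```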

Contributor Author

Absolutely! Made the change as requested

@@ -37,52 +37,44 @@
"\n",
Contributor

Is it really the length, or more like the starting point?

Contributor Author

made the change

@@ -37,52 +37,44 @@
"\n",
Contributor

@rnyak rnyak Jan 18, 2023

Line #6.        train_loss, y_pred, y = process_epoch(train_loader,

I recommend you set the loss to BCE, because process_epoch's default loss function is MSE loss, which does not make sense for a binary classification problem: https://github.com/NVIDIA-Merlin/NVTabular/blob/main/nvtabular/framework_utils/torch/utils.py#L64

Training a model with the wrong loss won't be accurate. What do you think?
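To see why the loss choice matters, here is a small pure-Python sketch (not the NVTabular process_epoch API) comparing MSE and binary cross-entropy on a confidently wrong prediction:

```python
import math

def mse(preds, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def bce(preds, targets):
    """Binary cross-entropy; preds are probabilities in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(preds, targets)) / len(targets)

# For a confidently wrong prediction on a positive label, MSE saturates
# near 1.0 while BCE grows without bound, so BCE gives the optimizer a
# much stronger corrective signal for binary targets.
print(mse([0.01], [1]))            # 0.9801
print(round(bce([0.01], [1]), 3))  # 4.605
```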

Contributor Author

that's a good point, change made

@@ -2,7 +2,7 @@
"cells": [
Contributor

It should be the merlin-tensorflow container.

Contributor Author

corrected the text and the url

@@ -2,7 +2,7 @@
"cells": [
Contributor

Line #6. configure_tensorflow()

Are you sure you want to use configure_tensorflow()? It automatically takes 50% of the GPU memory. Instead, you could use os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async".
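The suggested replacement is just an environment variable that has to be set before TensorFlow is imported; a minimal sketch:

```python
import os

# Ask TensorFlow to use CUDA's asynchronous allocator instead of the
# default allocator, which grabs a large chunk of GPU memory up front.
# This must run BEFORE `import tensorflow`, or it has no effect.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

# import tensorflow as tf  # imported afterwards in the notebook
```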

Contributor Author

replaced as suggested

@@ -2,7 +2,7 @@
"cells": [
Contributor

Maybe briefly explain what DCN is and add the image here? You can steal information from this notebook :) https://github.com/NVIDIA-Merlin/models/blob/main/examples/03-Exploring-different-models.ipynb

Contributor Author

explanation + image added! 🙂

@@ -2,7 +2,7 @@
"cells": [
Contributor

Line #5. prediction_tasks=mm.BinaryClassificationTask(target_column),

We no longer use mm.BinaryClassificationTask; instead we use prediction_tasks=mm.BinaryOutput(target_column).

Contributor Author

change made

Contributor Author

Unfortunately, the serving operator doesn't work if I use mm.BinaryOutput:

---------------------------------------------------------------------------
InferenceServerException                  Traceback (most recent call last)
Cell In [11], line 9
      3 outputs = [
      4     grpcclient.InferRequestedOutput(col)
      5     for col in ["rating/binary_classification_task"]
      6 ]
      8 with grpcclient.InferenceServerClient("localhost:8001") as client:
----> 9     response = client.infer("predicttensorflow", inputs, request_id="1", outputs=outputs)

File /usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py:1431, in InferenceServerClient.infer(self, model_name, inputs, model_version, outputs, request_id, sequence_id, sequence_start, sequence_end, priority, timeout, client_timeout, headers, compression_algorithm)
   1429     return result
   1430 except grpc.RpcError as rpc_error:
-> 1431     raise_error_grpc(rpc_error)

File /usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py:62, in raise_error_grpc(rpc_error)
     61 def raise_error_grpc(rpc_error):
---> 62     raise get_error_grpc(rpc_error) from None

InferenceServerException: [StatusCode.INVALID_ARGUMENT] unexpected inference output 'rating/binary_classification_task' for model 'predicttensorflow'

Contributor

@radekosmulski if we can figure out the correct output name in your config file (please check it out), you can replace it in the for col in ["rating/binary_classification_task"] line here. If the config files do not show the output name, we can check it from the model.
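One way to check is to read the generated config.pbtxt in the exported model repository. A hedged sketch: the config excerpt and the rating/binary_output name below are made up for illustration; the real file is the one Merlin Systems exported, e.g. under the model repository's predicttensorflow directory.

```python
import re

# Illustrative excerpt of a Triton config.pbtxt; in the notebook this would
# be read from the exported model repository instead of a string literal.
config_pbtxt = '''
name: "predicttensorflow"
input [
  { name: "movieId", data_type: TYPE_INT64, dims: [ -1, 1 ] }
]
output [
  { name: "rating/binary_output", data_type: TYPE_FP32, dims: [ -1, 1 ] }
]
'''

# Grab the first name declared inside the output [...] section.
output_section = config_pbtxt.split("output", 1)[1]
output_name = re.search(r'name:\s*"([^"]+)"', output_section).group(1)
print(output_name)  # rating/binary_output
```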

@@ -2,7 +2,7 @@
"cells": [
Contributor

Actually it is more than that: triton_op.export is exporting config files as well, right?

If yes, I'd say it is better to write down below what happens after the export: we generate model config files so that we can load our models on TIS and send requests successfully.

Contributor Author

clarification added

@@ -35,26 +35,7 @@
"\n",
Contributor

Please remove this docker run -it --gpus device=0 .. line. This is no longer the case: we are in the merlin-tensorflow container, which does serving as well, so there is no need to launch a new container. Please also remove the sentence "Before we get started, you should start the container for Triton Inference Server with the following command. This command includes the -v argument that mounts your local model-repository folder with your saved models from the previous notebook (03-Training-with-TF.ipynb)."

Contributor Author

done

@@ -35,26 +35,7 @@
"\n",
Contributor

You can write it as: "We need to start the Triton Inference Server first. We can do that with the following command. You need to provide the correct path for the models directory. Note that since we add the --model-control-mode=explicit flag, the models won't be loaded yet; we will load our model below."

Contributor Author

change made

@@ -35,26 +35,7 @@
"\n",
Contributor

@rnyak rnyak Jan 18, 2023

Is this the transformed dataset? If you are only serving the TF model (not NVT + TF), then we should send the transformed dataset as a request, not the raw one. Can you please add a clarification saying that we send the transformed dataset (if this is the case) and the reason for it? If this is not the transformed dataset, then this is not accurate if you are only serving the model.

Contributor Author

Yes, this is the transformed dataset, the one our model was trained on; will add clarification.

@@ -55,29 +55,42 @@
"id": "71304a10",
Contributor

Remove this. We don't need to launch the container again; we are already in the container.

Contributor Author

done

@@ -55,29 +55,42 @@
"id": "71304a10",
Contributor

Remove this line.

Contributor Author

done

@@ -55,29 +55,42 @@
"id": "71304a10",
Contributor

You can remove "After you started the container". Just say: "We can launch the Triton Inference Server with the following command:"

Contributor Author

done

@@ -55,29 +55,42 @@
"id": "71304a10",
Contributor

You can say that since we add the --model-control-mode=explicit flag, the model won't be loaded at this step; we will load the model below.

Contributor Author

done

@radekosmulski
Contributor Author

rerun tests

@@ -33,56 +33,50 @@
"\n",
Contributor

There is a typo here in "the give".

Contributor Author

thank you, corrected!

@@ -35,26 +35,7 @@
"\n",
Contributor

wont --> will not



Contributor Author

fixed!

@@ -35,26 +35,7 @@
"\n",
Contributor

Serve --> Server



Contributor Author

fixed

@radekosmulski
Contributor Author

rerun tests

@rnyak rnyak merged commit 5d9113e into main Jan 22, 2023
@rnyak rnyak deleted the review_getting_started_examples branch January 24, 2023 17:59
Labels
documentation Improvements or additions to documentation examples Adding new examples