Review getting started examples #779

Merged
merged 32 commits into from
Jan 22, 2023
Conversation

radekosmulski
Contributor

No description provided.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



@radekosmulski radekosmulski changed the title Review getting started examples WIP: Review getting started examples Dec 29, 2022
@radekosmulski radekosmulski marked this pull request as draft December 29, 2022 02:54
@github-actions

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-779

@radekosmulski radekosmulski added documentation Improvements or additions to documentation examples Adding new examples labels Dec 29, 2022
@radekosmulski radekosmulski marked this pull request as ready for review January 10, 2023 20:31
@radekosmulski radekosmulski marked this pull request as draft January 10, 2023 20:31
@radekosmulski radekosmulski changed the title WIP: Review getting started examples Review getting started examples Jan 10, 2023
@radekosmulski radekosmulski marked this pull request as ready for review January 10, 2023 20:56
@radekosmulski
Contributor Author

rerun tests

@@ -57,7 +57,7 @@
"HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
Contributor

We need to move them there for inference. Can we add the reason why we move them?

Contributor Author

rephrased this explaining the situation to the reader

@@ -2,7 +2,7 @@
"cells": [
Contributor

I think the paragraph doesn't make much sense

Applying deep learning models to recommendation systems faces unique challenges in comparison to other domains, such as computer vision and natural language processing. The datasets and common model architectures have unique characteristics, which require custom solutions. Recommendation system datasets have terabytes in size with billion examples but each example is represented by only a few bytes. For example, the Criteo CTR dataset, the largest publicly available dataset, is 1.3TB with 4 billion examples. The model architectures have normally large embedding tables for the users and items, which do not fit on a single GPU.

We do not address large embeddings in the example. I don't know why we mention it.

We do not use the Criteo dataset, so I am not sure why we explain something related to it.

Can we rename "NVTabular dataloader" to "Merlin dataloader"?

This notebook explains, how to use the NVTabular dataloader to accelerate TensorFlow training.

Contributor Author

completely changed this paragraph now

@@ -2,7 +2,7 @@
"cells": [
Contributor

Are we using all the imports? E.g. I don't think we will use workflow_fit_transform in the notebook?

Contributor Author

removed unused imports

@@ -2,7 +2,7 @@
"cells": [
Contributor

Can we add the tags to the previous notebook (02 NVTabular)?

I see that we add another NVTabular workflow in this notebook, which just adds tags to the workflow. I think it will be cleaner if we add the tags in the previous notebook and do not require two NVTabular workflows.

We can remove the genre columns from the schema file to not load/create an architecture with the feature.

Contributor Author

done

@@ -2,7 +2,7 @@
"cells": [
Contributor

We can move the export code to the 4th notebook?

We can save the TensorFlow model.

In the 4th notebook, we can load the NVTabular workflow and TensorFlow model to create the graph with Merlin Systems



@@ -57,7 +57,7 @@
"HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
Contributor

"We walked through the documentation, but it is useful to understand the API."

This sentence is not correct/up to date, as we do not walk through the documentation anymore. We can just delete it and write something like:

"We define our model in a model.py file" (+ add the reason why we do it in a model.py file)

Contributor Author

done

@bschifferer
Contributor

@radekosmulski is it ready for a 2nd review? I am not sure if you added the changes to the TF notebook.

@bschifferer
Contributor

We need to move the tests to Merlin/tests/unit/examples/

@@ -2,7 +2,7 @@
"cells": [
Contributor

@rnyak rnyak Jan 18, 2023

Line #2.        "INPUT_DATA_DIR", os.path.expanduser("~/nvt-examples/movielens/data/")

Is it possible to not have this path at root? Is there a specific reason we set this at root? I know this is from the existing examples, but I really don't like that this is in root :) Can we set it as /workspace/nvt-examples/... if it won't create too much work? What do you think?

Contributor Author

changed the path as advised

Contributor

@radekosmulski did changing the path create any issues with the unit tests? Let's be sure it won't :)

@@ -2,7 +2,7 @@
"cells": [
Contributor

@rnyak rnyak Jan 18, 2023

Line #8.        write_hugectr_keyset=True  # only needed if using this ETL Notebook for training with HugeCTR

If we are not using HugeCTR, can we still keep this line, or should it be removed? If it should be removed, we can add a sentence saying so.

Contributor Author

added explanation

@@ -57,7 +57,7 @@
"HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
Contributor

A 10x speed-up in training? With what dataset and model? DLRM? Do you mind giving just one more line of explanation, or a link to support that statement? Thanks.

Contributor Author

This information was already in the notebook; someone else is stating that they saw this improvement, but I have not come across any supporting evidence.

Should we keep it as is? Otherwise we can remove the 10x, or the whole passage altogether?

Contributor

@bschifferer do you know of any supporting evidence/link for the 10x speed-up? Thanks.

@@ -57,7 +57,7 @@
"HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs).<br>\n",
Contributor

@rnyak rnyak Jan 18, 2023

Line #1.    %%writefile train_hugeCTR.py

Looks like the file name is not model.py here; it is train_hugeCTR.py, so better to modify the file name in the sentence above: "We will write the model to ./model.py and execute it afterwards".

Contributor Author

corrected

@@ -37,52 +37,44 @@
"\n",
Contributor

I am not sure the sentence "The second tensor is a supporting Tensor, describing how many genres are associated with each datapoint in the batch." is really accurate, because according to my understanding the second tensor holds values like 88846, and a batch in this example cannot have that many data points in a single tensor.

I'd say the second tensor represents the starting point of each tensor of genres, and the difference between two successive numbers shows the number of data points in each tensor.

Basically, can we clarify it better? What do you think?
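The values/offsets representation the reviewer describes can be sketched in plain Python (the genre values and batch below are illustrative, not taken from the actual MovieLens data):

```python
# Ragged "genres" feature for a batch of four movies, stored as two flat
# tensors: one with all the values, one with the offset where each row starts.
values = ["Action", "Sci-Fi", "Comedy", "Drama", "Romance", "Thriller"]
offsets = [0, 2, 3, 5, 6]  # row i spans values[offsets[i]:offsets[i + 1]]

# The difference between two successive offsets is the number of genres
# associated with that data point, as the review comment suggests.
rows = [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
lengths = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
print(rows)     # [['Action', 'Sci-Fi'], ['Comedy'], ['Drama', 'Romance'], ['Thriller']]
print(lengths)  # [2, 1, 2, 1]
```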

Contributor Author

Absolutely! Made the change as requested

@@ -37,52 +37,44 @@
"\n",
Contributor

Is it really the length, or more like the starting point?

Contributor Author

made the change

@@ -37,52 +37,44 @@
"\n",
Contributor

@rnyak rnyak Jan 18, 2023

Line #6.        train_loss, y_pred, y = process_epoch(train_loader,

I recommend you set the loss to BCE, because process_epoch's default loss function is MSE loss, which does not make sense for a binary classification problem: https://github.com/NVIDIA-Merlin/NVTabular/blob/main/nvtabular/framework_utils/torch/utils.py#L64

Training a model with the wrong loss won't be accurate. What do you think?
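To see why the loss choice matters, here is a small pure-Python sketch (not the NVTabular process_epoch API) comparing MSE and binary cross-entropy on a confidently wrong prediction:

```python
import math

def mse(preds, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def bce(preds, targets):
    """Binary cross-entropy; preds are probabilities in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(preds, targets)) / len(targets)

# For a confidently wrong prediction on a positive label, MSE saturates
# near 1.0 while BCE grows without bound, so BCE gives the optimizer a
# much stronger corrective signal for binary targets.
print(mse([0.01], [1]))            # 0.9801
print(round(bce([0.01], [1]), 3))  # 4.605
```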

Contributor Author

that's a good point, change made

@@ -2,7 +2,7 @@
"cells": [
Contributor

It should be the merlin-tensorflow container.

Contributor Author

corrected the text and the url

@@ -2,7 +2,7 @@
"cells": [
Contributor

Line #6. configure_tensorflow()

Are you sure you want to use configure_tensorflow()? It automatically takes 50% of the GPU memory. Instead, you could use os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async".
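The suggested replacement is just an environment variable that has to be set before TensorFlow is imported; a minimal sketch:

```python
import os

# Ask TensorFlow to use CUDA's asynchronous allocator instead of the
# default allocator, which grabs a large chunk of GPU memory up front.
# This must run BEFORE `import tensorflow`, or it has no effect.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

# import tensorflow as tf  # imported afterwards in the notebook
```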

Contributor Author

replaced as suggested

@@ -2,7 +2,7 @@
"cells": [
Contributor

Maybe briefly explain what DCN is and add the image here? You can steal information from this notebook :) https://github.com/NVIDIA-Merlin/models/blob/main/examples/03-Exploring-different-models.ipynb

Contributor Author

explanation + image added! 🙂

@@ -2,7 +2,7 @@
"cells": [
Contributor

Line #5. prediction_tasks=mm.BinaryClassificationTask(target_column),

We no longer use mm.BinaryClassificationTask; instead we use prediction_tasks=mm.BinaryOutput(target_column).

Contributor Author

change made

Contributor Author

Unfortunately, the serving operator doesn't work if I use mm.BinaryOutput:

---------------------------------------------------------------------------
InferenceServerException                  Traceback (most recent call last)
Cell In [11], line 9
      3 outputs = [
      4     grpcclient.InferRequestedOutput(col)
      5     for col in ["rating/binary_classification_task"]
      6 ]
      8 with grpcclient.InferenceServerClient("localhost:8001") as client:
----> 9     response = client.infer("predicttensorflow", inputs, request_id="1", outputs=outputs)

File /usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py:1431, in InferenceServerClient.infer(self, model_name, inputs, model_version, outputs, request_id, sequence_id, sequence_start, sequence_end, priority, timeout, client_timeout, headers, compression_algorithm)
   1429     return result
   1430 except grpc.RpcError as rpc_error:
-> 1431     raise_error_grpc(rpc_error)

File /usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py:62, in raise_error_grpc(rpc_error)
     61 def raise_error_grpc(rpc_error):
---> 62     raise get_error_grpc(rpc_error) from None

InferenceServerException: [StatusCode.INVALID_ARGUMENT] unexpected inference output 'rating/binary_classification_task' for model 'predicttensorflow'

Contributor

@radekosmulski if we can figure out the correct output name in your config file (please check it out), you can replace it in the for col in ["rating/binary_classification_task"] line here. If the config files do not show the output name, we can check it from the model.
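One way to check is to read the generated config.pbtxt in the exported model repository. A hedged sketch: the config excerpt and the rating/binary_output name below are made up for illustration; the real file is the one Merlin Systems exported, e.g. under the model repository's predicttensorflow directory.

```python
import re

# Illustrative excerpt of a Triton config.pbtxt; in the notebook this would
# be read from the exported model repository instead of a string literal.
config_pbtxt = '''
name: "predicttensorflow"
input [
  { name: "movieId", data_type: TYPE_INT64, dims: [ -1, 1 ] }
]
output [
  { name: "rating/binary_output", data_type: TYPE_FP32, dims: [ -1, 1 ] }
]
'''

# Grab the first name declared inside the output [...] section.
output_section = config_pbtxt.split("output", 1)[1]
output_name = re.search(r'name:\s*"([^"]+)"', output_section).group(1)
print(output_name)  # rating/binary_output
```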

@@ -2,7 +2,7 @@
"cells": [
Contributor

Actually it is more than that: triton_op.export is exporting config files as well, right?

If yes, I'd say it is better to write down below what happens after the export: we generate model config files so that we can load our models on TIS and send requests successfully.

Contributor Author

clarification added

@@ -35,26 +35,7 @@
"\n",
Contributor

Please remove this docker run -it --gpus device=0 .. line. This is no longer the case: we are in the merlin-tensorflow container, which does serving as well, so there is no need to launch a new container. Please also remove the sentence "Before we get started, you should start the container for Triton Inference Server with the following command. This command includes the -v argument that mounts your local model-repository folder with your saved models from the previous notebook (03-Training-with-TF.ipynb)."

Contributor Author

done

@@ -35,26 +35,7 @@
"\n",
Contributor

You can write it as: "We need to start the Triton Inference Server first. We can do that with the following command. You need to provide the correct path for the models directory. Note that since we add the --model-control-mode=explicit flag, the models won't be loaded yet; we will load our model below."

Contributor Author

change made

@@ -35,26 +35,7 @@
"\n",
Contributor

@rnyak rnyak Jan 18, 2023

Is this the transformed dataset? If you are only serving the TF model (not NVT + TF), then we should send the transformed dataset as a request, not the raw one. Can you please add a clarification saying that we send the transformed dataset (if this is the case) and the reason for it? If this is not the transformed dataset, then this is not accurate if you are only serving the model.

Contributor Author

Yes, this is the transformed dataset, the one our model was trained on; will add clarification.

@@ -55,29 +55,42 @@
"id": "71304a10",
Contributor

Remove this. We don't need to launch the container again; we are already in the container.

Contributor Author

done

@@ -55,29 +55,42 @@
"id": "71304a10",
Contributor

Remove this line.

Contributor Author

done

@@ -55,29 +55,42 @@
"id": "71304a10",
Contributor

You can remove "After you started the container". Just say: "We can launch the Triton Inference Server with the following command:"

Contributor Author

done

@@ -55,29 +55,42 @@
"id": "71304a10",
Contributor

You can say that since we add the --model-control-mode=explicit flag, the model won't be loaded at this step; we will load the model below.

Contributor Author

done

@radekosmulski
Contributor Author

rerun tests

@@ -33,56 +33,50 @@
"\n",
Contributor

There is a typo here in "the give".

Contributor Author

thank you, corrected!

@@ -35,26 +35,7 @@
"\n",
Contributor

wont --> will not



Contributor Author

fixed!

@@ -35,26 +35,7 @@
"\n",
Contributor

Serve --> Server



Contributor Author

fixed

@radekosmulski
Contributor Author

rerun tests

@rnyak rnyak merged commit 5d9113e into main Jan 22, 2023
@rnyak rnyak deleted the review_getting_started_examples branch January 24, 2023 17:59
Labels
documentation Improvements or additions to documentation examples Adding new examples