Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update NCF Dataset code #1612

Merged
merged 37 commits into from
Jan 31, 2022
Merged

Update NCF Dataset code #1612

merged 37 commits into from
Jan 31, 2022

Conversation

angusrtaylor
Copy link
Collaborator

@angusrtaylor angusrtaylor commented Jan 17, 2022

Description

  • Fixed out-of-memory error: Users of the original code encountered out-of-memory errors when creating the Dataset object for large datasets. This is because the original code attempts to create a matrix of users and "negative items" (items the user has not interacted with) for negative sampling purposes. This matrix can have very large memory footprint. The updated code removes the need for creating this matrix by only identifying a user's negative item set when processing a user's data during training.
  • Improved memory efficiency: The original code expands the original training dataset (containing only positive examples) by performing negative sampling for all users at the start of each training epoch and holding this full dataset in memory. This creates a memory bottleneck and causes problems when training on very large datasets. The updated code iterates over a csv file sorted by user and performs negative sampling for each user sequentially, never storing the full dataset in memory. A shuffle buffer is introduced to the train_loader function such that a subset of the full dataset is loaded into memory for shuffling purposes before returning batches. The size of the shuffle buffer is configurable.
  • Fixed sampling problem: The original code samples user negative items every time it processes a positive example in the original training set. This is inefficient but also amounts to sampling with replacement which does not follow the approach of the original paper. The updated code samples negative items once per user without replacement. A new parameter is added to the Dataset class to enable sampling with replacement if desired.
  • Fixed training/test set information leakage: in the original code, embeddings are created in the model for all users and items in both the training and test sets. This creates some information leakage for users/items that appear in the training set but not the test set. The updated code filters out unseen users/items from the test set in the notebook and additional filters are added to the Dataset object to prevent these users/items from being added to the test set.
  • Fixed leave-one-out evaluation example: the original code runs the leave-one-out evaluation on the last 25% of the dataset using a chronological split which does not correspond to the approach taken in the original paper. The updated code aligns with the approach of the paper by only taking the last item for each user as the test set.

Note: updated unit tests are to follow.

Checklist:

  • [ x] I have followed the contribution guidelines and code style for this project.
  • [ x] I have added tests covering my contributions.
  • [ x] I have updated the documentation accordingly.
  • [ x] This PR is being made to staging branch and not to main branch.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Collaborator

@miguelgfierro miguelgfierro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

recommenders/models/ncf/dataset.py Outdated Show resolved Hide resolved
recommenders/models/ncf/dataset.py Outdated Show resolved Hide resolved
recommenders/models/ncf/dataset.py Outdated Show resolved Hide resolved
@miguelgfierro
Copy link
Collaborator

@angusrtaylor there are some errors with flake, are you able to access the logs?

@angusrtaylor
Copy link
Collaborator Author

@angusrtaylor there are some errors with flake, are you able to access the logs?

Thanks I'll address this

tests/conftest.py Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@anargyri anargyri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @angusrtaylor ! the code is really neat now.

@anargyri
Copy link
Collaborator

There was some error in the notebook, see here https://dev.azure.com/best-practices/recommenders/_build/results?buildId=56649&view=ms.vss-test-web.build-test-results-tab&runId=53844&resultId=100002&paneView=debug
It looks like it's in the construction of the Dataset.

@angusrtaylor
Copy link
Collaborator Author

There was some error in the notebook, see here https://dev.azure.com/best-practices/recommenders/_build/results?buildId=56649&view=ms.vss-test-web.build-test-results-tab&runId=53844&resultId=100002&paneView=debug It looks like it's in the construction of the Dataset.

Thanks. I saw that this morning but couldn't make sense of it. In my PR, the constructor is different (data = NCFDataset(train_file=train_file, test_file=leave_one_out_test_file, seed=SEED) in my branch vs data = NCFDataset(train=train, test=test, seed=SEED) in staging). So it seems the build is running the version of the notebook that's in staging rather than my branch.

@anargyri
Copy link
Collaborator

There was some error in the notebook, see here https://dev.azure.com/best-practices/recommenders/_build/results?buildId=56649&view=ms.vss-test-web.build-test-results-tab&runId=53844&resultId=100002&paneView=debug It looks like it's in the construction of the Dataset.

Thanks. I saw that this morning but couldn't make sense of it. In my PR, the constructor is different (data = NCFDataset(train_file=train_file, test_file=leave_one_out_test_file, seed=SEED) in my branch vs data = NCFDataset(train=train, test=test, seed=SEED) in staging). So it seems the build is running the version of the notebook that's in staging rather than my branch.

Strange. I am rerunning the job.

@miguelgfierro
Copy link
Collaborator

there is a weird OOM error:

E           /home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
E               561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
E               562         if self._pin_memory:
E           --> 563             data = _utils.pin_memory.pin_memory(data)
E               564         return data
E               565 
E           
E           /home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
E                56         return type(data)(*(pin_memory(sample) for sample in data))
E                57     elif isinstance(data, collections.abc.Sequence):
E           ---> 58         return [pin_memory(sample) for sample in data]
E                59     elif hasattr(data, "pin_memory"):
E                60         return data.pin_memory()
E           
E           /home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py in <listcomp>(.0)
E                56         return type(data)(*(pin_memory(sample) for sample in data))
E                57     elif isinstance(data, collections.abc.Sequence):
E           ---> 58         return [pin_memory(sample) for sample in data]
E                59     elif hasattr(data, "pin_memory"):
E                60         return data.pin_memory()
E           
E           /home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
E                56         return type(data)(*(pin_memory(sample) for sample in data))
E                57     elif isinstance(data, collections.abc.Sequence):
E           ---> 58         return [pin_memory(sample) for sample in data]
E                59     elif hasattr(data, "pin_memory"):
E                60         return data.pin_memory()
E           
E           /home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py in <listcomp>(.0)
E                56         return type(data)(*(pin_memory(sample) for sample in data))
E                57     elif isinstance(data, collections.abc.Sequence):
E           ---> 58         return [pin_memory(sample) for sample in data]
E                59     elif hasattr(data, "pin_memory"):
E                60         return data.pin_memory()
E           
E           /home/runner/work/recommenders/recommenders/.tox/gpu/lib/python3.7/site-packages/torch/utils/data/_utils/pin_memory.py in pin_memory(data)
E                48 def pin_memory(data):
E                49     if isinstance(data, torch.Tensor):
E           ---> 50         return data.pin_memory()
E                51     elif isinstance(data, string_classes):
E                52         return data
E           
E           RuntimeError: cuda runtime error (2) : out of memory at ../aten/src/THC/THCCachingHostAllocator.cpp:280

@angusrtaylor
Copy link
Collaborator Author

cuda runtime error (2)

There are a bunch of tests that are failing in the build, even after merging in staging. I'm still trying to make sense of it. @anargyri's latest PR build was successful on staging so I'm trying to work out what is different compared to that...
image

@anargyri
Copy link
Collaborator

cuda runtime error (2)

There are a bunch of tests that are failing in the build, even after merging in staging. I'm still trying to make sense of it. @anargyri's latest PR build was successful on staging so I'm trying to work out what is different compared to that... image

It could be something with the host. I rebooted and restarted the pipeline.

@angusrtaylor
Copy link
Collaborator Author

I figured out the problem with the NCF test: the failure was related to the ncf quickstart notebook which I had forgotten to update. I've updated it and pushed the changes. I think all the other failures are OOM errors.

@anargyri anargyri merged commit 77fddff into staging Jan 31, 2022
@miguelgfierro miguelgfierro deleted the anta/ncf branch January 31, 2022 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants