Make GPU dependencies optional #27

Merged: 11 commits, Apr 23, 2024

Conversation

ayushdg
Collaborator

@ayushdg ayushdg commented Apr 9, 2024

  • Moves some of the GPU imports into optional import blocks.
  • Splits the install so that GPU components come from a separate cu12-style extra.
  • Updates the READMEs.
  • Updates the installation steps in CI testing.
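
For readers unfamiliar with the pattern, here is a minimal sketch of an optional import that returns a placeholder and raises only when the placeholder is actually used. It is an illustration under assumptions, not the code from this PR: the names `safe_import`, `UnavailableError`, `_Placeholder`, and the module name `hypothetical_gpu_library_xyz` are all made up for the example.

```python
import importlib
import logging

# Module-level logger; the root logger is left untouched.
logger = logging.getLogger(__name__)


class UnavailableError(Exception):
    """Raised when a placeholder for a missing optional module is used."""


class _Placeholder:
    """Stands in for an optional module that is not installed."""

    def __init__(self, name, msg):
        self._name = name
        self._msg = msg

    def __getattr__(self, attr):
        # Only called for attributes that normal lookup does not find,
        # i.e. anything the real module would have provided.
        raise UnavailableError(self._msg)


def safe_import(module_name, msg=None):
    """Import module_name, or return a placeholder if it is not installed.

    Only ModuleNotFoundError is swallowed. Any other error raised while
    importing (for example a broken installation) propagates to the caller.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        msg = msg or f"{module_name} is not installed; install the GPU extras to use it."
        logger.debug("Optional import failed: %s", module_name)
        return _Placeholder(module_name, msg)


# A module that exists imports normally.
json_mod = safe_import("json")

# A missing module (hypothetical name) yields a placeholder; importing
# succeeds, and the error surfaces only on first use.
gpu_mod = safe_import("hypothetical_gpu_library_xyz")
```

With this shape, CPU-only code paths can import the package without the GPU stack installed, and a user who reaches a GPU-only code path gets an error that names the missing dependency.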

@ayushdg
Collaborator Author

ayushdg commented Apr 22, 2024

I'm seeing two new errors pop up, probably because of newer pandas/dask versions (cudf was probably setting the upper pin to something more restrictive). I need to look into this a bit more, but pinging @ryantwolf in case something immediately comes to mind.

@ryantwolf
Collaborator

Nothing immediately comes to mind, unfortunately. I haven't seen those errors before. Happy to help debug more if you want, though.

@ayushdg ayushdg marked this pull request as ready for review April 23, 2024 03:04
@ayushdg ayushdg requested a review from ryantwolf April 23, 2024 18:33
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Collaborator

@ryantwolf ryantwolf left a comment

LGTM. I only had one question, for my own curiosity. Just one final check before I approve: have you run the RAPIDS dedup CI tests with this change?

Resolved review thread on nemo_curator/datasets/doc_dataset.py
@ayushdg
Collaborator Author

ayushdg commented Apr 23, 2024

> LGTM. I only had one question, for my own curiosity. Just one final check before I approve: have you run the RAPIDS dedup CI tests with this change?

Yup, I've manually run the pipeline and verified that it works as expected.

@ryantwolf ryantwolf self-requested a review April 23, 2024 20:39
@ryantwolf ryantwolf merged commit 17e0d5f into NVIDIA:main Apr 23, 2024
3 checks passed
nicoleeeluo pushed a commit to nicoleeeluo/NeMo-Curator that referenced this pull request May 20, 2024
* Move GPU imports and make them optional

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move gpu dependencies to a separate install

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove unused import

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Switch to placeholder import that raises on usage

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove deprecated utils usage

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add cuML attribution

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Safe import tests, improve install instruction, update gha workflow

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Fix pytests due to loc bug

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* update install instructions

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Raise on non module-not-found errors, update logging

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update logging to not change root logger

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>
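
The "Update logging to not change root logger" step above can be illustrated with the standard-library convention for libraries: log through a named logger and attach only a `NullHandler`, leaving handler and level configuration to the application. This is a sketch of that general convention, not the change made in this PR; the function name `get_library_logger` and the logger name `my_library` are hypothetical.

```python
import logging


def get_library_logger(name: str = "my_library") -> logging.Logger:
    """Return a named logger without touching the root logger.

    A NullHandler is attached once so that the library emits nothing by
    default; the application decides where log records go.
    """
    logger = logging.getLogger(name)
    if not any(isinstance(h, logging.NullHandler) for h in logger.handlers):
        logger.addHandler(logging.NullHandler())
    return logger


log = get_library_logger()
log.debug("GPU dependencies not found; falling back to CPU-only code paths")
```

The design point is that calls such as `logging.basicConfig()` inside a library change the root logger for every other package in the process, whereas a named logger keeps the library's output under the application's control.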
ryantwolf added a commit that referenced this pull request May 24, 2024
* Init commit for tutorial notebook

Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix metadata inference with pandas and dask (#35)

* Fix metadata inference with pandas and dask

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix datatypes for task decontamination

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Use targeted import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Disable PyTorch Compile Multiprocessing (#34)

* Move tokenizer import

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Reduce inductor threads

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change env int to string

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change location of env var

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add comment linking issue

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Improve speed of AddId module (#36)

* Add fast id method

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add type conversion

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix off by one errors in tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Make GPU dependencies optional (#27)

* Move GPU imports and make them optional

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move gpu dependencies to a separate install

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove unused import

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Switch to placeholder import that raises on usage

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove deprecated utils usage

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add cuML attribution

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Safe import tests, improve install instruction, update gha workflow

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Fix pytests due to loc bug

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* update install instructions

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Raise on non module-not-found errors, update logging

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update logging to not change root logger

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix failing GPU tests with latest pandas bump (#41)

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Adds Nemo Curator K8s example (#40)

* [K8s]: Adds a helper script to create a dask cluster on k8s and includes
instructions for how to run a Curator workload on k8s

Signed-off-by: Terry Kong <terryk@nvidia.com>

* black formatting

Signed-off-by: Terry Kong <terryk@nvidia.com>

* big_english -> my_dataset

Signed-off-by: Terry Kong <terryk@nvidia.com>

* 24.01 -> 24.03 default container

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Add help kwarg to all flags

Signed-off-by: Terry Kong <terryk@nvidia.com>

* Clarify why venv is needed

Signed-off-by: Terry Kong <terryk@nvidia.com>

* fix precommit failures

Signed-off-by: Terry Kong <terryk@nvidia.com>

---------

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Move common dedup utils and remove unused code (#42)

* Refactor common utils and remove unused code

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* More cleanup

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* More updates/shuffling

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move gpu_dedup scripts into subfolder

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove gpu_deduplication subfolder

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add readme to fuzzy dedup scripts section

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Fix typo and relative links

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove legacy script entrypoints

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove legacy scripts and add init file

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update GpuDeduplication.rst

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix lang id example (#37)

* Fix lang id example

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add classifier unit tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add test for failure

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Remove failure test

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Add dataset blending tool (#32)

* Add initial dataset blending function

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add blend unit tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add self parameter

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix return type of blend dataset

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix blending tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change assert statement for very uneven blend

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix key error

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add proper proportion blending test

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add four dataset blend and clarify docs

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add shuffle module

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add blend example and tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix random method name

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Wrap return type in DocumentDataset

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Save result of column drop

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change equality check for shuffle tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix expected order after shuffle

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add more documents to shuffle test

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add assert statement

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add within partition shuffle

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Refactor add rand column for shuffle

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix filename tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add determinism handling for shuffle

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Change numpy random function

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix tests with new random method

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Remove length call from blending

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Improve scaling of blending function

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix blend tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add blending script

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add additional file paths call

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add documentation

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Reformat docs

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Remove backticks

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add context manager for shuffle tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add better deterministic shuffle path

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Update documentation and reset index

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* High level fuzzy duplicates module (#46)

* Initial pass at fuzzy dedup api

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update deprecated shuffle arg

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* dask_cuda gpu only import

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move fuzzy_dedup imports to optional

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* more tests

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Move FuzzyDeDupConfig to its own class

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add example script and config file, fix typo

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Remove slurm examples for gpu dedup

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add config module

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Rename FuzzyDeDupConfig and minhash_length to FuzzyDuplicatesConfig and num_hashes

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add comments and update example

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Write to same format as input in fuzzy dedup example

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix indexing in PII Modifier (#55)

* Fix pii index issue

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Add sequential wrapper

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

* Fix pii tests

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>

---------

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Disable string conversion globally (#56)

Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fix issue #43 (empty files creation) and improve reading/writing speed (#57)

This commit fixes issue #43 (empty files created when invoking the reshard_jsonl method in nemo_curator.utils.file_utils.py) by checking the size of each generated file and deleting any file of size zero.

In addition, I noticed that there is no need to parse each line into a JSON object, since the lines should already be in JSON format. Removing that extra parsing significantly speeds up this method.

Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* [Tutorials] Add a tutorial for PEFT data curation (#45)

This PR adds a new tutorial to demonstrate data curation for PEFT
use-cases.

Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Only import PII constants during Curator import (#61)

* Move PII constants to a separate file that does not import presidio/spacy and other GPU dependencies

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Add comment around import, move constant import to global scope

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Deleting links

Signed-off-by: Nicoel Luo <nluo@nvidia.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fixed typo. Update content to latest NeMo Curator version. Added fuzzy deduplication wrapper example

Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fixing Style

Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Updating container version

Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Fixing style

Signed-off-by: Nicole Luo <nluo@nvidia.com>

* Update get_client() according to latest version; Update log path for map_bucket section

Signed-off-by: Nicole Luo <nluo@nvidia.com>

---------

Signed-off-by: Nicole Luo <nluo@nvidia.com>
Signed-off-by: Ryan Wolf <rywolf@nvidia.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Signed-off-by: Nicoel Luo <nluo@nvidia.com>
Signed-off-by: nicoleeeluo <157772168+nicoleeeluo@users.noreply.github.com>
Co-authored-by: Ryan Wolf <rywolf@nvidia.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Co-authored-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com>
Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
@ayushdg ayushdg deleted the optional-gpu-imports branch June 3, 2024 17:18
This pull request was closed.
2 participants