Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 69 - Refactor how arguments are added to scripts #102

Merged
merged 25 commits into from
Jul 1, 2024

Conversation

miguelusque
Copy link
Contributor

Description

I have noticed that the code used to parse the scripts is repeated in multiple places.

I have refactored the different parser.add_argument calls, grouping them in a single file that is consumed by the different scripts.

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. Most of my comments are around the same topic:

  • There are some parts of the codebase that repeat the same arguments across multiple files. It makes sense to group some of those into the common file for reuse. For other arguments, that are very specific to a use case or notebook, do you have a strong opinion on why they should reside in a common class rather than added directly in the example itself?

examples/domain_classifier_example.py Outdated Show resolved Hide resolved
examples/k8s/create_dask_cluster.py Outdated Show resolved Hide resolved
examples/quality_classifier_example.py Outdated Show resolved Hide resolved
nemo_curator/scripts/add_id.py Outdated Show resolved Hide resolved
nemo_curator/scripts/filter_documents.py Outdated Show resolved Hide resolved
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
@miguelusque miguelusque requested a review from ayushdg June 10, 2024 23:28
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I like this idea going forward, but I think I want to see a couple of changes. Relating to Ayush's review summary, I think that common arguments (like those found in add_distributed_args) should be grouped together. I think this class also makes more sense as a home for add_distributed_args and similar functions rather than them being entirely independent.

However, I think if there isn't any reuse of a list of arguments in a script, we shouldn't make a method in ArgumentHelper for it. Arguments that are hyper-specific to each script should remain only in that script and not be abstracted away. I'm fine abstracting away collections of arguments that are used in 2+ scripts though (like input-data-dir, input-type, etc.). So I think I'd like to see ArgumentHelper shrink a bit for now and return a lot of the niche arguments to their original scripts.

nemo_curator/utils/script_utils.py Show resolved Hide resolved
nemo_curator/utils/script_utils.py Outdated Show resolved Hide resolved
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
@miguelusque
Copy link
Contributor Author

In general I like this idea going forward, but I think I want to see a couple of changes. Relating to Ayush's review summary, I think that common arguments (like those found in add_distributed_args) should be grouped together. I think this class also makes more sense as a home for add_distributed_args and similar functions rather than them being entirely independent.

However, I think if there isn't any reuse of a list of arguments in a script, we shouldn't make a method in ArgumentHelper for it. Arguments that are hyper-specific to each script should remain only in that script and not be abstracted away. I'm fine abstracting away collections of arguments that are used in 2+ scripts though (like input-data-dir, input-type, etc.). So I think I'd like to see ArgumentHelper shrink a bit for now and return a lot of the niche arguments to their original scripts.

Hi Ryan,

Thank you! Just done!

Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version is great and much more in line with what I think is the most useful. I found a couple of outdated arguments (that were a lot easier to spot thanks to you) that would be good to fix before merging this in. Once all my changes are addressed and I verify via the CI that everything is ok, we should be good.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Leave this file unchanged. ArgumentHelper isn't being used and the rest of the changes are inconsequential.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be addressed.

nemo_curator/scripts/add_id.py Outdated Show resolved Hide resolved
nemo_curator/scripts/blend_datasets.py Outdated Show resolved Hide resolved
nemo_curator/scripts/blend_datasets.py Outdated Show resolved Hide resolved
nemo_curator/scripts/blend_datasets.py Outdated Show resolved Hide resolved
nemo_curator/scripts/text_cleaning.py Outdated Show resolved Hide resolved
nemo_curator/scripts/train_fasttext.py Outdated Show resolved Hide resolved
nemo_curator/utils/fuzzy_dedup_utils/io_utils.py Outdated Show resolved Hide resolved
nemo_curator/scripts/make_data_shards.py Outdated Show resolved Hide resolved
miguelusque and others added 12 commits June 25, 2024 14:29
The former message was out of date.

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Improve help wording for output_data_dir argument

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Update help wording for ouptut-data-dir argument

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
There is a typo in help for argument output-model

Co-authored-by: Ryan Wolf <ryantwolf1@gmail.com>
Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Signed-off-by: miguelusque <miguelusque@users.noreply.github.com>
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment of mine is left to be addressed. Other than that, should be good! I'll run the CI now just to double check

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be addressed.

Signed-off-by: Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
@ryantwolf ryantwolf self-requested a review June 27, 2024 16:44
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks again for this!

Copy link
Collaborator

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed it works with the fuzzy dedup cli pipeline

@ayushdg ayushdg merged commit 274d3a9 into NVIDIA:main Jul 1, 2024
3 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants