Applying SEO Best Pratices #104

Merged
merged 13 commits into from
Jun 12, 2024
Conversation

aschilling-nv
Contributor

Description

This PR flattens the URL case to all lowercase letters. Additional information on the best practice can be found here.

Usage

# Add snippet demonstrating usage
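
As a rough illustration of the change described above (not the script used for this PR), the sketch below lowercases every `.rst` file name in a docs directory. The function names and the `docs_dir` argument are hypothetical; in a git checkout the renames would normally be done with `git mv` so history is preserved.

```python
from pathlib import Path


def plan_lowercase_renames(filenames):
    """Return (old, new) pairs for every name that is not already lowercase."""
    plan = []
    seen = set()
    for name in filenames:
        lowered = name.lower()
        # Two names that differ only by case would end up as the same file.
        if lowered in seen:
            raise ValueError(f"lowercasing {name!r} would collide with another file")
        seen.add(lowered)
        if lowered != name:
            plan.append((name, lowered))
    return plan


def lowercase_rst_files(docs_dir):
    """Rename every .rst file directly under docs_dir to its lowercase name."""
    docs_dir = Path(docs_dir)
    names = sorted(p.name for p in docs_dir.glob("*.rst"))
    for old, new in plan_lowercase_renames(names):
        (docs_dir / old).rename(docs_dir / new)
```

Computing the plan separately from applying it makes it possible to review the list of renames, and to catch case-only collisions, before touching any files.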

Checklist

  • [x] I am familiar with the Contributing Guide.
  • [x] New or Existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
…tion.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
…icationunicodeformatting.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
… personalidentifiableinformationidentificationandremoval.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Setting all RST files to lowercase names.

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
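
Renaming the files also means that any references to the old names, such as entries in `index.rst`, have to be lowercased to match. A minimal sketch of that step follows; the function name and the list of renamed stems are assumptions for illustration, and the actual contents of `index.rst` are not shown in this PR.

```python
import re


def lowercase_references(text, renamed_stems):
    """Replace each mixed-case document name in text with its lowercase form."""
    for stem in renamed_stems:
        # Match the stem only when it is not part of a longer alphanumeric word.
        pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(stem)}(?![A-Za-z0-9])")
        text = pattern.sub(stem.lower(), text)
    return text


index_text = ".. toctree::\n\n   Download.rst\n   QualityFiltering.rst\n"
print(lowercase_references(index_text, ["Download", "QualityFiltering"]))
```

The lookarounds keep unrelated words that merely contain a renamed stem (for example "Downloads") from being rewritten.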
Collaborator

@ayushdg ayushdg left a comment

With the new all lowercase format, it might be slightly difficult to read filenames like: personalidentifiableinformationidentificationandremoval.

For multi-word filenames, do you prefer separating words via - or _ ?

docs/user-guide/index.rst
@aschilling-nv
Contributor Author

> With the new all lowercase format, it might be slightly difficult to read filenames like: personalidentifiableinformationidentificationandremoval.
>
> For multi-word filenames, do you prefer separating words via - or _ ?

That is for another time, but we should work to shorten those file names if possible.

In the future, when breaking up file names, please use -. If you use _, it will break SEO results.
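
To make the hyphen convention concrete, here is a small sketch of how a mixed-case or underscore-separated name could be turned into a lowercase, hyphen-separated one. The `hyphenate` helper is hypothetical and was not part of this PR, which only lowercased the existing names.

```python
import re


def hyphenate(stem):
    """Convert a CamelCase or snake_case stem to lowercase words joined by '-'."""
    # Insert a hyphen wherever a lowercase letter or digit is followed by a capital.
    split = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", stem)
    # Turn underscores and whitespace into hyphens, then lowercase everything.
    return re.sub(r"[_\s]+", "-", split).lower()


print(hyphenate("QualityFiltering"))  # quality-filtering
print(hyphenate("task_decontamination"))  # task-decontamination
```

Names that contain acronyms, such as CPUvsGPU, are not split cleanly by this rule (it yields cpuvs-gpu), so those would still need to be chosen by hand.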

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
@ayushdg ayushdg merged commit 38b0ac1 into NVIDIA:main Jun 12, 2024
3 checks passed
VibhuJawa pushed a commit to VibhuJawa/NeMo-Curator that referenced this pull request Jun 27, 2024
* Rename CPUvsGPU.rst to cpuvsgpu.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DataCuration.rsts to datacuration.rsts

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DistributedDataClassification.rst to distributeddataclassification.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DocumentDataset.rst to documentdataset.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename Download.rst to download.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename GpuDeduplication.rst to gpudeduplication.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename KubernetesCurator.rst to kubernetescurator.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename QualityFiltering.rst to qualityfiltering.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename TaskDecontamination.rst to taskdecontamination.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update index.rst

Setting all RST files to lowercase names.

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Ignore docs for EOF fixer hook

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
VibhuJawa pushed a commit to VibhuJawa/NeMo-Curator that referenced this pull request Jun 27, 2024
VibhuJawa pushed a commit to VibhuJawa/NeMo-Curator that referenced this pull request Jul 1, 2024
VibhuJawa pushed a commit to VibhuJawa/NeMo-Curator that referenced this pull request Jul 1, 2024
VibhuJawa added a commit that referenced this pull request Jul 5, 2024
* Applying SEO Best Pratices (#104)

* Rename CPUvsGPU.rst to cpuvsgpu.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DataCuration.rsts to datacuration.rsts

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DistributedDataClassification.rst to distributeddataclassification.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DocumentDataset.rst to documentdataset.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename Download.rst to download.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename GpuDeduplication.rst to gpudeduplication.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename KubernetesCurator.rst to kubernetescurator.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename QualityFiltering.rst to qualityfiltering.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename TaskDecontamination.rst to taskdecontamination.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update index.rst

Setting all RST files to lowercase names.

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Ignore docs for EOF fixer hook

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Shuffle CC result on group before writing out (#110)

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst (#113)

Added links to tutorials

Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* first commit

Signed-off-by: avinashvem <avem@nvidia.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* mv under modules dir

Signed-off-by: avinashvem <avem@nvidia.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* first commit

Signed-off-by: avinashvem <avem@nvidia.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* mv under modules dir

Signed-off-by: avinashvem <avem@nvidia.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* first commit

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* mv under modules dir

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* embed by cluster saved

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* id map script

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* test commit

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* add id map script

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Cleanup compute_embeddings_crossfit.py

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Cleanup compute_embeddings_crossfit.py

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Pre-commit style fixes

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* clustering_dask_crossfit.py

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Minor clean up to sort_clusters_crossfit.py

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* cleanup semdedup_crossfit

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Remove undo changes

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Remove rename changes

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix rename

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Readme formatting

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* add dask to semdedup_crossfit.py

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* README.md updates

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* README.md updates

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* README.md updates

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* README.md updates

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* README.md updates

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* configure max memory using a cli

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Dumb id results to parquet

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Embedding fixes

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* README.md updates

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Working end to end

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Minor  yaml fixes

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Undo changes to index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update .pre-commit-config.yaml 

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update fuzzy_dedup.py

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add end to end script in readme.md

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add type hints

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Use dask for sort_clusters

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Make sort_clusters work on MNMG scales

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Cleaned up dask shutdown

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Decrease noise in E2E scripts

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Clean up scripts

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix scripts/end_to_end_script.sh

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Some more cleanup

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add copyright

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix README.md

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Address reviews

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Make work with a SemDedupConfig

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Make work with SemDedupConfig

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move to nemo-curator's logger

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Semdedup-extract_dedup_data.py

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Applying SEO Best Pratices (#104)

* Rename CPUvsGPU.rst to cpuvsgpu.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DataCuration.rsts to datacuration.rsts

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DistributedDataClassification.rst to distributeddataclassification.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename DocumentDataset.rst to documentdataset.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename Download.rst to download.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename GpuDeduplication.rst to gpudeduplication.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename KubernetesCurator.rst to kubernetescurator.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename QualityFiltering.rst to qualityfiltering.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Rename TaskDecontamination.rst to taskdecontamination.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update index.rst

Setting all RST files to lowercase names.

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Ignore docs for EOF fixer hook

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix bad merge

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add Module for embedding+clustering

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add sorting to clustering

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Refactor Semdup modules

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Refactor Semdup modules

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Refactor Semdup modules

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix Readme.md

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add a environment variable to silence HF warnings

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* dask-cudf fix

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* dask-cudf fix

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* dask-cudf fix

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Make config a flat file based on reviews

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add docstrings

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix argparse and seed function

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Use argparse to read config

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move around config files

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move around config files

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move around config files

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Remove end_to_end_script.sh

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Append Readme

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Address Reviews

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Change config

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Make embedding creation optionally lazy

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* fix docstring

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Address Reviews and docstrings

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Address Reviews and make eps_thresholds a list of values

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Minor import fix

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Empty Commit

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add modules to __init__ and README.md

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix init

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move comment

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Empty commit to restart CI (which failed due to a download issue)

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Empty commit to restart CI (which failed due to a download issue)

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: avinashvem <avem@nvidia.com>
Co-authored-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Co-authored-by: avinashvem <avem@nvidia.com>
sarahyurick pushed a commit to sarahyurick/NeMo-Curator that referenced this pull request Jul 23, 2024

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Update index.rst

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add Module for embedding+clustering

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add sorting to clustering

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Refactor Semdup modules

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Refactor Semdup modules

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Refactor Semdup modules

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix Readme.md

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add an environment variable to silence HF warnings

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* dask-cudf fix

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* dask-cudf fix

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* dask-cudf fix

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Make config a flat file based on reviews

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add docstrings

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix argparse and seed function

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Use argparse to read config

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move around config files

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move around config files

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move around config files

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Remove end_to_end_script.sh

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Append Readme

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Address Reviews

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Change config

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Make embedding creation optionally lazy

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* fix docstring

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Address Reviews and docstrings

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Address Reviews and make eps_thresholds a list of values

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Minor import fix

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Empty Commit

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Add modules to __init__ and README.md

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Fix init

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Move comment

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Empty commit to restart CI (which failed due to a download issue)

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

* Empty commit to restart CI (which failed due to a download issue)

Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Signed-off-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: avinashvem <avem@nvidia.com>
Co-authored-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Co-authored-by: avinashvem <avem@nvidia.com>
This pull request was closed.