Applying SEO Best Practices #104
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
…tion.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
…icationunicodeformatting.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
… personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
With the new all-lowercase format, filenames such as personalidentifiableinformationidentificationandremoval might be slightly difficult to read.
For multi-word filenames, do you prefer separating words with - or _?
That is for another time, but we should work to shorten those file names if possible. In the future, when breaking up file names, please use |
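The two separator options raised in this thread can be compared with a small helper. This is only a sketch for discussion: the function name `slugify` and the word-splitting rule are assumptions, not part of this PR, and acronym runs such as `CPUvsGPU` are not handled and would need a manually chosen name.

```python
import os
import re


def slugify(filename, sep="-"):
    """Lowercase a CamelCase file name and join its words with `sep`.

    Hypothetical helper: a "word" is an uppercase letter followed by
    lowercase letters or digits, or a run of lowercase letters or digits.
    """
    stem, ext = os.path.splitext(filename)
    words = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", stem)
    return sep.join(word.lower() for word in words) + ext.lower()


if __name__ == "__main__":
    name = "PersonalIdentifiableInformationIdentificationAndRemoval.rst"
    print(slugify(name, sep="-"))
    print(slugify(name, sep="_"))
```

Either separator keeps the word boundaries visible; the choice between them is a convention for the maintainers to settle.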
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
* Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Update index.rst Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
* Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
---------
Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
Description
This PR flattens the URL case to all lowercase letters by renaming the documentation .rst files to lowercase and updating index.rst to match. Additional information on the best practice can be found here.
Usage
# Add snippet demonstrating usage
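A rename of this kind could be scripted roughly as follows. This is a minimal sketch, not the procedure used in this PR (the commits above show one rename per file): the function names, the flat docs directory layout, and the assumption that index.rst lists each page on its own line are all hypothetical. On case-insensitive file systems a case-only rename may need extra care.

```python
from pathlib import Path


def lowercase_rst_names(docs_dir):
    """Rename each .rst file directly under docs_dir to its lowercase name.

    Returns a mapping of old stem -> new stem for the files that changed.
    """
    renamed = {}
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(docs_dir).glob("*.rst")):
        new_name = path.name.lower()
        if new_name != path.name:
            path.rename(path.with_name(new_name))
            renamed[path.stem] = Path(new_name).stem
    return renamed


def update_references(index_path, renamed):
    """Rewrite lines of index_path that consist solely of a renamed page."""
    index = Path(index_path)
    lines = index.read_text(encoding="utf-8").splitlines(keepends=True)
    updated = []
    for line in lines:
        entry = line.strip()
        stem = entry[:-4] if entry.endswith(".rst") else entry
        if stem in renamed:
            line = line.replace(stem, renamed[stem])
        updated.append(line)
    index.write_text("".join(updated), encoding="utf-8")
```

Calling `lowercase_rst_names("docs")` followed by `update_references("docs/index.rst", renamed)` would apply both steps; the `docs` path here is an assumed location.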
Checklist