Add convert_finetuning_dataset to CLI #1354

Merged
KuuCi merged 19 commits into main from dataprep-convert_finentuning_dataset-cli
Jul 20, 2024

Conversation

@KuuCi KuuCi commented Jul 12, 2024

This PR lets users call llmfoundry convert_finetuning_dataset {ARGS} while keeping behavior consistent with the existing convert_finetuning_dataset script. The motivation comes from DLE, where we want the CLI in the Docker images to be much more intuitive.

@KuuCi KuuCi marked this pull request as ready for review July 12, 2024 21:45
@KuuCi KuuCi requested a review from a team as a code owner July 12, 2024 21:45
v-chen_data added 2 commits July 12, 2024 15:28

KuuCi commented Jul 13, 2024

test-data-txt-cEry3n runs:
    llmfoundry convert_finetuning_dataset \
      --dataset "Muennighoff/P3" \
      --splits train,validation \
      --preprocessor llmfoundry.data.finetuning.tasks:p3_preprocessing_function \
      --out_root data_folder

test-data-txt-ccNcwV runs:
    python convert_finetuning_dataset.py \
      --dataset "Muennighoff/P3" \
      --splits train validation \
      --preprocessor llmfoundry.data.finetuning.tasks:p3_preprocessing_function \
      --out_root data_folder
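
The two runs above pass --splits differently: comma-separated for the CLI and space-separated for the script. A minimal sketch of how both forms can normalize to the same list follows. It uses only the standard-library argparse module; the function names are hypothetical and this is not the llm-foundry implementation.

```python
import argparse


def parse_script_style(argv):
    """Script-style parsing: --splits takes space-separated values."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--splits", nargs="+", required=True)
    return parser.parse_args(argv).splits


def parse_cli_style(argv):
    """CLI-style parsing: --splits takes one comma-separated string."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--splits", type=str, required=True)
    raw = parser.parse_args(argv).splits
    # Split on commas and drop empty or whitespace-only entries.
    return [part.strip() for part in raw.split(",") if part.strip()]


# Both invocation styles should yield the same list of split names.
script_splits = parse_script_style(["--splits", "train", "validation"])
cli_splits = parse_cli_style(["--splits", "train,validation"])
print(script_splits == cli_splits)
```

If the two parsers ever disagree, the downstream conversion would see different splits, so a check like this is a cheap way to compare the two entry points before comparing their outputs.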


KuuCi commented Jul 14, 2024

ft-cli-JmJNkX runs:
    cd llm-foundry/scripts/data_prep
    mkdir shakespeare && cd shakespeare
    curl -O https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
    cd ..
    llmfoundry convert_finetuning_dataset \
      --dataset "Muennighoff/P3" \
      --splits train,validation \
      --preprocessor "llmfoundry.data.finetuning.tasks:p3_preprocessing_function" \
      --out-root "data_folder"

ft-orig-5VBmib runs:
    cd llm-foundry/scripts/data_prep
    mkdir shakespeare && cd shakespeare
    curl -O https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
    cd ..
    python convert_finetuning_dataset.py \
      --dataset "Muennighoff/P3" \
      --splits train validation \
      --preprocessor "llmfoundry.data.finetuning.tasks:p3_preprocessing_function" \
      --out_root "data_folder"

@KuuCi KuuCi requested a review from dakinggg July 15, 2024 17:57
v-chen_data added 2 commits July 15, 2024 11:53
@KuuCi KuuCi marked this pull request as draft July 16, 2024 00:55
v-chen_data added 2 commits July 17, 2024 17:50

KuuCi commented Jul 18, 2024

mcli logs ft-orig-8sBz3c runs:
    python convert_finetuning_dataset.py \
      --dataset "Muennighoff/P3" \
      --splits train validation \
      --preprocessor "llmfoundry.data.finetuning.tasks:p3_preprocessing_function" \
      --out_root "data_folder"

mcli logs ft-cli-KIX0U6 runs:
    llmfoundry data_prep convert_finetuning_dataset \
      --dataset "Muennighoff/P3" \
      --splits train,validation \
      --preprocessor "llmfoundry.data.finetuning.tasks:p3_preprocessing_function" \
      --out-root "data_folder"
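
In this run the command sits under a data_prep group. As a rough illustration of that nesting, the sketch below builds a two-level command with standard-library argparse subparsers and forwards the parsed arguments to a single function. The function body, the returned dictionary, and the parser layout are assumptions for illustration only; the actual llm-foundry CLI may be built differently.

```python
import argparse


def convert_finetuning_dataset(dataset, splits, preprocessor, out_root):
    """Hypothetical stand-in for the shared conversion function."""
    return {
        "dataset": dataset,
        "splits": splits,
        "preprocessor": preprocessor,
        "out_root": out_root,
    }


def build_parser():
    """Build a nested parser: llmfoundry -> data_prep -> convert_finetuning_dataset."""
    parser = argparse.ArgumentParser(prog="llmfoundry")
    groups = parser.add_subparsers(dest="group", required=True)
    data_prep = groups.add_parser("data_prep")
    commands = data_prep.add_subparsers(dest="command", required=True)
    convert = commands.add_parser("convert_finetuning_dataset")
    convert.add_argument("--dataset", required=True)
    convert.add_argument("--splits", required=True)
    convert.add_argument("--preprocessor", default=None)
    convert.add_argument("--out-root", dest="out_root", required=True)
    return parser


def main(argv):
    """Parse argv and forward the values to the shared function."""
    args = build_parser().parse_args(argv)
    return convert_finetuning_dataset(
        dataset=args.dataset,
        splits=args.splits.split(","),  # comma-separated on the CLI
        preprocessor=args.preprocessor,
        out_root=args.out_root,
    )


result = main([
    "data_prep", "convert_finetuning_dataset",
    "--dataset", "Muennighoff/P3",
    "--splits", "train,validation",
    "--out-root", "data_folder",
])
print(result["splits"])
```

Keeping the conversion logic in one function and letting each entry point only translate arguments is one way to keep a CLI and a standalone script in agreement.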

@KuuCi KuuCi marked this pull request as ready for review July 18, 2024 01:04
Review thread on tests/data/test_dataloader.py (outdated, resolved)
@KuuCi KuuCi merged commit 59f1a0a into main Jul 20, 2024
9 checks passed
@dakinggg dakinggg deleted the dataprep-convert_finentuning_dataset-cli branch August 6, 2024 18:40