Make dataset_processes configurable #651

Merged (1 commit) on Sep 29, 2023

corbt (Contributor) commented Sep 28, 2023

I'm using the Axolotl script to train models on https://modal.com serverless GPUs. Unfortunately, their environment seems to have a bug: if I run `datasets.filter` with too high a `num_proc`, it throws an error and dies.

This PR adds a new configuration option, `dataset_processes`, which lets you explicitly set the number of processes used to map/filter the dataset. If omitted, it defaults to the current behavior of using `os.cpu_count()`.
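
For illustration, the fallback described above could look roughly like the sketch below. This is not the code from the PR: the helper name `resolve_dataset_processes`, the plain-dict config, the positive-integer check, and the fallback to 1 when `os.cpu_count()` returns `None` are all assumptions made for the example.

```python
import os
from typing import Any, Mapping


def resolve_dataset_processes(cfg: Mapping[str, Any]) -> int:
    """Return the worker count to use for dataset map/filter calls.

    Hypothetical helper (not the PR's actual code): uses
    cfg["dataset_processes"] when it is set, otherwise falls back to
    os.cpu_count(), or to 1 if the CPU count cannot be determined.
    """
    value = cfg.get("dataset_processes")
    if value is None:
        # Option omitted: keep the previous behavior.
        return os.cpu_count() or 1
    value = int(value)
    if value < 1:
        # Assumed validation; the PR does not describe any.
        raise ValueError("dataset_processes must be a positive integer")
    return value


# Explicit setting, e.g. to stay below whatever limit a constrained host has.
num_proc = resolve_dataset_processes({"dataset_processes": 4})
print(num_proc)

# The result would then be handed to the dataset call, for example:
#   dataset = dataset.filter(keep_fn, num_proc=num_proc)
# where `dataset` and `keep_fn` are placeholders for this sketch.
```

With a value like this in hand, a user hitting the multiprocessing error could lower `dataset_processes` in their config without touching the training code.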
corbt (Contributor, Author) commented Sep 28, 2023

Just pushed a new commit that fixes the `black` formatting check.

winglian merged commit 9ec2077 into axolotl-ai-cloud:main on Sep 29, 2023
4 checks passed
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023