Make dataset_processes configurable #651

corbt · 2023-09-28T18:14:02Z

I'm using the Axolotl script to train models on https://modal.com serverless GPUs. Unfortunately, their environment seems to have some kind of bug: if I run `datasets.filter` with too high a `num_proc`, it throws an error and the process dies.

This PR adds a new configuration option, `dataset_processes`, which lets you explicitly set the number of processes used to map/filter the dataset. If it is not included, the current behavior is kept and the value defaults to `os.cpu_count()`.
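For readers skimming the thread, the fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR diff: the helper name `resolve_dataset_processes`, the plain-dict config, the positive-integer check, and the final fallback to `1` when `os.cpu_count()` returns `None` are all assumptions made for the example.

```python
import os
from typing import Any, Mapping


def resolve_dataset_processes(cfg: Mapping[str, Any]) -> int:
    """Pick the worker count for dataset map/filter calls.

    Hypothetical helper: uses cfg["dataset_processes"] when it is set,
    otherwise falls back to os.cpu_count() (the pre-existing behavior).
    """
    configured = cfg.get("dataset_processes")
    if configured is not None:
        # Assumption for this sketch: reject zero or negative values early.
        if int(configured) < 1:
            raise ValueError("dataset_processes must be a positive integer")
        return int(configured)
    # os.cpu_count() may return None on some platforms; use 1 in that case
    # (this extra guard is an assumption, not part of the PR description).
    return os.cpu_count() or 1


# Explicitly capped, e.g. for an environment that fails with many workers.
capped = resolve_dataset_processes({"dataset_processes": 4})

# Option omitted: same value as before the change.
default = resolve_dataset_processes({})
```

The resulting integer would then be passed as `num_proc` to the Hugging Face `Dataset.map` / `Dataset.filter` calls, so setting a small `dataset_processes` in the config caps the number of worker processes.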

corbt · 2023-09-28T23:28:45Z

Just pushed a new commit that fixes the black formatting check.

winglian approved these changes Sep 28, 2023

corbt force-pushed the main branch from a005c1b to 66b3d5f Compare September 28, 2023 23:27

winglian merged commit 9ec2077 into axolotl-ai-cloud:main Sep 29, 2023
4 checks passed
