
sample packing with resume_from_checkpoint #406

Closed
winglian opened this issue Aug 15, 2023 · 12 comments · Fixed by #795
Labels
bug Something isn't working

Comments

@winglian
Collaborator

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Resuming should step over the number of batches already trained and continue from the same point in the data.

Current behaviour

Resuming fails to load because the MultipackDataloader does not have a batch sampler. Setting the batch sampler to be the same as the sampler results in an issue in accelerate.data_loader, which reloads the dataset into a new dataloader in an attempt to skip over the already-trained batches:

    dataloader = DataLoader(dataset, batch_sampler=new_batch_sampler, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 245, in __init__
    raise ValueError('prefetch_factor option could only be specified in multiprocessing.'
ValueError: prefetch_factor option could only be specified in multiprocessing.let num_workers > 0 to enable multiprocessing, otherwise set prefetch_factor to None.

Steps to reproduce

See above: train with sample packing enabled, then attempt to resume from a checkpoint.

Possible solution

We need to figure out how to refactor the current dataloader, either as a better sampler or by extending the real torch DataLoader.
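
One possible direction (a rough sketch only, not the actual axolotl code; the class name, packing heuristic, and collate function below are hypothetical) is to express the packing as a regular torch BatchSampler, so that a stock DataLoader exposes a real batch_sampler attribute for accelerate's skip_first_batches to wrap:

    from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

    class PackedBatchSampler(BatchSampler):
        """Greedily packs sample indices into batches under a token budget.

        Illustration only: a real implementation would also need __len__,
        distributed sharding, and the packing-efficiency estimate the current
        dataloader reports.
        """

        def __init__(self, sampler, lengths, max_tokens):
            self.sampler = sampler        # yields dataset indices
            self.lengths = lengths        # token count per sample
            self.max_tokens = max_tokens  # budget per packed batch

        def __iter__(self):
            batch, used = [], 0
            for idx in self.sampler:
                n = self.lengths[idx]
                if batch and used + n > self.max_tokens:
                    yield batch
                    batch, used = [], 0
                batch.append(idx)
                used += n
            if batch:
                yield batch

    # Because this is an ordinary batch_sampler, resume could reuse the stock machinery:
    # loader = DataLoader(dataset,
    #                     batch_sampler=PackedBatchSampler(SequentialSampler(dataset), lengths, 4096),
    #                     collate_fn=pack_collate_fn)  # pack_collate_fn is hypothetical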

Which Operating Systems are you using?

  • Android
  • iPhone/iPad
  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@winglian
Collaborator Author

I think we can convert the MultipackDataloader into an IterableDataset wrapper around another dataset. I'm not sure how best to handle the sampler, though.
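
A minimal sketch of that idea (hypothetical names, not the actual axolotl code), assuming the packing happens inside the wrapper's __iter__ and leaving the sampler question open:

    from torch.utils.data import IterableDataset

    class PackedIterableDataset(IterableDataset):
        """Wraps a map-style tokenized dataset and yields packed examples."""

        def __init__(self, dataset, sampler, max_tokens):
            self.dataset = dataset
            self.sampler = sampler        # decides visitation order; the open question above
            self.max_tokens = max_tokens

        def __iter__(self):
            buffer, used = [], 0
            for idx in self.sampler:
                sample = self.dataset[idx]
                n = len(sample["input_ids"])
                if buffer and used + n > self.max_tokens:
                    yield self._pack(buffer)
                    buffer, used = [], 0
                buffer.append(sample)
                used += n
            if buffer:
                yield self._pack(buffer)

        def _pack(self, samples):
            # Concatenate list-valued fields (input_ids, labels, attention_mask).
            return {k: sum((s[k] for s in samples), []) for k in samples[0]}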

@eggqq007

Hit this issue as well. Seeking a patch... SOS.

@eggqq007

My backtrace:

[2023-08-16 13:21:31,565] [INFO] [axolotl.utils.dataloader._len_est:250] [PID:2497] packing_efficiency_estimate: 0.96 total_num_tokens per device: 220152720
Traceback (most recent call last):
  File "/workspace/axolotl/scripts/finetune.py", line 315, in <module>
    fire.Fire(train)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/scripts/finetune.py", line 300, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1793, in _inner_training_loop
    epoch_iterator = skip_first_batches(epoch_iterator, steps_trained_in_current_epoch)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/data_loader.py", line 871, in skip_first_batches
    batch_sampler = dataloader.sampler if sampler_is_batch_sampler else dataloader.batch_sampler
AttributeError: 'MultipackDistributedDataloader' object has no attribute 'batch_sampler'

@eggqq007

This issue has been pending for weeks.

@dongxiaolong
Contributor

I met the same problem (same backtrace as quoted above). Did you solve it?

@sadaisystems
Contributor

sadaisystems commented Aug 27, 2023

Any updates on this? I'm having this problem on a single-GPU run.

@IgnacioFDM

This one bit me earlier. As a stopgap, maybe warn the user that they won't be able to resume when using sample packing?
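
A check along these lines during config validation could serve as that stopgap (a sketch; the exact validation hook and config access in axolotl may differ):

    import logging

    LOG = logging.getLogger("axolotl")

    def warn_on_packing_resume(cfg):
        # Resume currently breaks with sample packing (this issue), so warn up front.
        if cfg.get("sample_packing") and cfg.get("resume_from_checkpoint"):
            LOG.warning(
                "sample_packing is enabled: resume_from_checkpoint will not be able "
                "to skip already-trained batches (see #406)."
            )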

@sikri2408

sikri2408 commented Oct 17, 2023

Facing a similar issue when trying to resume_from_checkpoint.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gandharv.sikri/venvs/env_py38_llm/src/axolotl/src/axolotl/cli/train.py", line 36, in <module>
    fire.Fire(do_cli)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/gandharv.sikri/venvs/env_py38_llm/src/axolotl/src/axolotl/cli/train.py", line 32, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/gandharv.sikri/venvs/env_py38_llm/src/axolotl/src/axolotl/train.py", line 116, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/transformers/trainer.py", line 1807, in _inner_training_loop
    epoch_iterator = skip_first_batches(epoch_iterator, steps_trained_in_current_epoch)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/accelerate/data_loader.py", line 871, in skip_first_batches
    batch_sampler = dataloader.sampler if sampler_is_batch_sampler else dataloader.batch_sampler
AttributeError: 'MultipackDistributedDataloader' object has no attribute 'batch_sampler'

  0%| | 0/160 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/gandharv.sikri/venvs/env_py38_llm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/gandharv.sikri/venvs/env_py38_llm/bin/python3.8', '-m', 'axolotl.cli.train', 'CodeLlama-13b.yaml']' returned non-zero exit status 1.

Any updates on this?

@ehartford
Collaborator

This also impacted me; I can't use sample packing because I need to be able to resume from checkpoint.

@casper-hansen
Collaborator

casper-hansen commented Oct 26, 2023

One quick and dirty way to solve this may be to save the current epoch and index of the internal_batch_generator in the MultipackDataloader every X steps. You can then quickly resume with some custom logic by looping over the batches until you reach the saved epoch and index.
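
A rough sketch of that state-saving loop (a hypothetical wrapper; _internal_batch_generator is the generator named above, and num_epochs is assumed to exist on the dataloader):

    import json

    def iter_with_state(dataloader, state_path="dataloader_state.json", save_every=10):
        """Yield batches while periodically persisting the position in the stream."""
        for curr_epoch in range(dataloader.num_epochs):
            for curr_sample_index, batch in enumerate(dataloader._internal_batch_generator()):
                if curr_sample_index % save_every == 0:
                    with open(state_path, "w") as f:
                        json.dump({"curr_epoch": curr_epoch,
                                   "curr_sample_index": curr_sample_index}, f)
                yield batch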

@casper-hansen
Collaborator

@ehartford @winglian Unless I am missing something, here is a more thorough idea of how to implement this.

This is quite a large task.

  1. In the MultipackDistributedDataloader, save the dataloader state every save_steps=10 steps as dataloader_state.json.

    • Modify the worker loop: for curr_sample_index, sample in enumerate(self._internal_batch_generator())
    • Save a state like {"curr_epoch": 1, "curr_sample_index": 92}
  2. Modify/extend AxolotlTrainer to initialize self.train_dataloader and self.eval_dataloader.

  3. Implement a fast_forward_to_index in MultipackDistributedDataloader with functionality similar to _worker, except that it simply discards every sample up to the index (see the sketch after this list).

  4. Reference self.train_dataloader and self.eval_dataloader in the AxolotlTrainer and call fast_forward_to_index.

Upon a crash/OOM/whatever failure, you can then resume training from dataloader_state.json or by specifying the epoch/index to continue from.
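
Step 3 might look roughly like the following (same assumptions as the earlier sketch; it only works if _internal_batch_generator reproduces the same batch order as the original run, and epoch handling plus the exact off-by-one are omitted):

    import json

    def fast_forward_to_index(dataloader, state_path="dataloader_state.json"):
        """Discard already-trained batches, then hand back the live generator."""
        with open(state_path) as f:
            state = json.load(f)
        gen = dataloader._internal_batch_generator()
        for _ in range(state["curr_sample_index"]):
            next(gen)  # these batches were already trained on before the failure
        return gen     # next(gen) now yields the first un-trained batch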

@ehartford
Collaborator

I have offered a $1,000 bounty to the person who fixes this bug.
