
sample packing with resume_from_checkpoint #406

Closed
winglian opened this issue Aug 15, 2023 · 12 comments · Fixed by #795
Labels
bug Something isn't working

Comments

@winglian
Collaborator

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Resuming should step over the number of batches already trained and continue from the same point in the data.

Current behaviour

Resuming fails to load because the MultipackDataloader does not have a batch sampler. Setting the batch sampler to be the same as the sampler results in an issue in accelerate.data_loader, which reloads the dataset into a new dataloader in an attempt to skip over the already-trained batches:

    dataloader = DataLoader(dataset, batch_sampler=new_batch_sampler, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 245, in __init__
    raise ValueError('prefetch_factor option could only be specified in multiprocessing.'
ValueError: prefetch_factor option could only be specified in multiprocessing.let num_workers > 0 to enable multiprocessing, otherwise set prefetch_factor to None.

Steps to reproduce

See above: train with sample packing enabled, then attempt to resume from a checkpoint.

Possible solution

We need to figure out how to refactor the current dataloader, either as a better sampler or by extending the real torch DataLoader.
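
One possible direction (a rough sketch only, not the actual axolotl code; the class name, packing heuristic, and collate function below are hypothetical) is to express the packing as a regular torch BatchSampler, so that a stock DataLoader exposes a real batch_sampler attribute for accelerate's skip_first_batches to wrap:

    from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

    class PackedBatchSampler(BatchSampler):
        """Greedily packs sample indices into batches under a token budget.

        Illustration only: a real implementation would also need __len__,
        distributed sharding, and the packing-efficiency estimate the current
        dataloader reports.
        """

        def __init__(self, sampler, lengths, max_tokens):
            self.sampler = sampler        # yields dataset indices
            self.lengths = lengths        # token count per sample
            self.max_tokens = max_tokens  # budget per packed batch

        def __iter__(self):
            batch, used = [], 0
            for idx in self.sampler:
                n = self.lengths[idx]
                if batch and used + n > self.max_tokens:
                    yield batch
                    batch, used = [], 0
                batch.append(idx)
                used += n
            if batch:
                yield batch

    # Because this is an ordinary batch_sampler, resume could reuse the stock machinery:
    # loader = DataLoader(dataset,
    #                     batch_sampler=PackedBatchSampler(SequentialSampler(dataset), lengths, 4096),
    #                     collate_fn=pack_collate_fn)  # pack_collate_fn is hypothetical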

Which Operating Systems are you using?

  • Android
  • iPhone/iPad
  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@winglian
Collaborator Author

I think we can convert the MultipackDataloader into an IterableDataset wrapper around another dataset. I'm not sure how best to handle the sampler, though.
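
A minimal sketch of that idea (hypothetical names, not the actual axolotl code), assuming the packing happens inside the wrapper's __iter__ and leaving the sampler question open:

    from torch.utils.data import IterableDataset

    class PackedIterableDataset(IterableDataset):
        """Wraps a map-style tokenized dataset and yields packed examples."""

        def __init__(self, dataset, sampler, max_tokens):
            self.dataset = dataset
            self.sampler = sampler        # decides visitation order; the open question above
            self.max_tokens = max_tokens

        def __iter__(self):
            buffer, used = [], 0
            for idx in self.sampler:
                sample = self.dataset[idx]
                n = len(sample["input_ids"])
                if buffer and used + n > self.max_tokens:
                    yield self._pack(buffer)
                    buffer, used = [], 0
                buffer.append(sample)
                used += n
            if buffer:
                yield self._pack(buffer)

        def _pack(self, samples):
            # Concatenate list-valued fields (input_ids, labels, attention_mask).
            return {k: sum((s[k] for s in samples), []) for k in samples[0]}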

@eggqq007

Hit this issue as well. Seeking a patch... SOS.

@eggqq007

My backtrace:

[2023-08-16 13:21:31,565] [INFO] [axolotl.utils.dataloader._len_est:250] [PID:2497] packing_efficiency_estimate: 0.96 total_num_tokens per device: 220152720
Traceback (most recent call last):
  File "/workspace/axolotl/scripts/finetune.py", line 315, in <module>
    fire.Fire(train)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/scripts/finetune.py", line 300, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1793, in _inner_training_loop
    epoch_iterator = skip_first_batches(epoch_iterator, steps_trained_in_current_epoch)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/data_loader.py", line 871, in skip_first_batches
    batch_sampler = dataloader.sampler if sampler_is_batch_sampler else dataloader.batch_sampler
AttributeError: 'MultipackDistributedDataloader' object has no attribute 'batch_sampler'

@eggqq007

This issue has been pending for weeks.

@dongxiaolong
Contributor

I met the same problem (same backtrace as quoted above). Did you solve it?

@sadaisystems
Contributor

sadaisystems commented Aug 27, 2023

Any updates on this? I'm having this problem on a single-GPU run.

@IgnacioFDM

This one bit me earlier. As a stopgap, maybe warn the user that they won't be able to resume when using sample packing?
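
A check along these lines during config validation could serve as that stopgap (a sketch; the exact validation hook and config access in axolotl may differ):

    import logging

    LOG = logging.getLogger("axolotl")

    def warn_on_packing_resume(cfg):
        # Resume currently breaks with sample packing (this issue), so warn up front.
        if cfg.get("sample_packing") and cfg.get("resume_from_checkpoint"):
            LOG.warning(
                "sample_packing is enabled: resume_from_checkpoint will not be able "
                "to skip already-trained batches (see #406)."
            )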

@sikri2408

sikri2408 commented Oct 17, 2023

Facing a similar issue when trying to resume_from_checkpoint.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gandharv.sikri/venvs/env_py38_llm/src/axolotl/src/axolotl/cli/train.py", line 36, in <module>
    fire.Fire(do_cli)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/gandharv.sikri/venvs/env_py38_llm/src/axolotl/src/axolotl/cli/train.py", line 32, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/home/gandharv.sikri/venvs/env_py38_llm/src/axolotl/src/axolotl/train.py", line 116, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/transformers/trainer.py", line 1807, in _inner_training_loop
    epoch_iterator = skip_first_batches(epoch_iterator, steps_trained_in_current_epoch)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/accelerate/data_loader.py", line 871, in skip_first_batches
    batch_sampler = dataloader.sampler if sampler_is_batch_sampler else dataloader.batch_sampler
AttributeError: 'MultipackDistributedDataloader' object has no attribute 'batch_sampler'

  0%| | 0/160 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/gandharv.sikri/venvs/env_py38_llm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/gandharv.sikri/venvs/env_py38_llm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/gandharv.sikri/venvs/env_py38_llm/bin/python3.8', '-m', 'axolotl.cli.train', 'CodeLlama-13b.yaml']' returned non-zero exit status 1.

Any updates on this?

@ehartford
Collaborator

This also impacted me; I can't use sample packing because I need to be able to resume from checkpoint.

@casper-hansen
Collaborator

casper-hansen commented Oct 26, 2023

One quick and dirty way to solve this may be to save the current epoch and index of the internal_batch_generator in the MultipackDataloader every X steps. You can then quickly resume with some custom logic by looping over the batches until you reach the saved epoch and index.
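
A rough sketch of that state-saving loop (a hypothetical wrapper; _internal_batch_generator is the generator named above, and num_epochs is assumed to exist on the dataloader):

    import json

    def iter_with_state(dataloader, state_path="dataloader_state.json", save_every=10):
        """Yield batches while periodically persisting the position in the stream."""
        for curr_epoch in range(dataloader.num_epochs):
            for curr_sample_index, batch in enumerate(dataloader._internal_batch_generator()):
                if curr_sample_index % save_every == 0:
                    with open(state_path, "w") as f:
                        json.dump({"curr_epoch": curr_epoch,
                                   "curr_sample_index": curr_sample_index}, f)
                yield batch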

@casper-hansen
Collaborator

@ehartford @winglian Unless I am missing something, here is a more thorough idea of how to implement this.

This is quite a large task.

  1. In the MultipackDistributedDataloader, save the dataloader state every save_steps=10 steps as dataloader_state.json.

    • Modify the worker loop: for curr_sample_index, sample in enumerate(self._internal_batch_generator())
    • Save a state like {"curr_epoch": 1, "curr_sample_index": 92}
  2. Modify/extend AxolotlTrainer to initialize self.train_dataloader and self.eval_dataloader.

  3. Implement a fast_forward_to_index in MultipackDistributedDataloader with functionality similar to _worker, except that it simply discards every sample up to the index (see the sketch after this list).

  4. Reference self.train_dataloader and self.eval_dataloader in the AxolotlTrainer and call fast_forward_to_index.

Upon a crash/OOM/whatever failure, you can then resume training from dataloader_state.json or by specifying the epoch/index to continue from.
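
Step 3 might look roughly like the following (same assumptions as the earlier sketch; it only works if _internal_batch_generator reproduces the same batch order as the original run, and epoch handling plus the exact off-by-one are omitted):

    import json

    def fast_forward_to_index(dataloader, state_path="dataloader_state.json"):
        """Discard already-trained batches, then hand back the live generator."""
        with open(state_path) as f:
            state = json.load(f)
        gen = dataloader._internal_batch_generator()
        for _ in range(state["curr_sample_index"]):
            next(gen)  # these batches were already trained on before the failure
        return gen     # next(gen) now yields the first un-trained batch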

@ehartford
Collaborator

I have offered a $1,000 bounty to the person who fixes this bug.
