
ERROR: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE #247

Closed
NarenZen opened this issue May 29, 2023 · 4 comments

@NarenZen
Got this error while finetuning the instruct model on an 8xA100 machine:

ERROR: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE
  File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 526, in _close
    callback.close(state, logger)
  File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 362, in close
    self._save_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint
    saved_path = checkpoint.save_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint
    'state': state.state_dict(),
  File "/usr/local/lib/python3.10/dist-packages/composer/core/state.py", line 789, in state_dict
    model_state = attribute_value.state_dict()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2402, in state_dict
    with summon_ctx:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2974, in _summon_full_params
    self._assert_state([TrainingState_.IDLE])
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3582, in _assert_state
    traceback.print_stack()
Error running CheckpointSaver.close(). Skipping CheckpointSaver.post_close().
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 526, in _close
    callback.close(state, logger)
  File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 362, in close
    self._save_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint
    saved_path = checkpoint.save_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint
    'state': state.state_dict(),
  File "/usr/local/lib/python3.10/dist-packages/composer/core/state.py", line 789, in state_dict
    model_state = attribute_value.state_dict()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2402, in state_dict
    with summon_ctx:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2974, in _summon_full_params
    self._assert_state([TrainingState_.IDLE])
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3583, in _assert_state
    raise ValueError(msg)
ValueError: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE
Stack (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 528, in _close
    log.error(
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
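For context: the ValueError above comes from FSDP's internal training-state machine. `state_dict()` summons full parameters via `_summon_full_params`, which is only legal when the module is in the IDLE state; here the checkpoint save ran while the module was still in BACKWARD_PRE, e.g. because training was interrupted mid-backward and `CheckpointSaver.close()` then tried to save. A minimal, non-distributed sketch of that guard (the class and method names below are illustrative, not the actual torch internals):

```python
from enum import Enum


class TrainingState(Enum):
    """Simplified stand-in for FSDP's internal TrainingState_ enum."""
    IDLE = 1
    BACKWARD_PRE = 2


class ShardedModule:
    """Illustrative sketch of the state-machine guard, not the real FSDP API."""

    def __init__(self):
        self.training_state = TrainingState.IDLE

    def _assert_state(self, expected):
        # FSDP raises ValueError when full parameters are summoned
        # while a backward pass is still in flight.
        if self.training_state not in expected:
            raise ValueError(
                f"expected to be in states {expected} "
                f"but current state is {self.training_state}"
            )

    def state_dict(self):
        # state_dict() may only run when the module is idle.
        self._assert_state([TrainingState.IDLE])
        return {}


m = ShardedModule()
m.training_state = TrainingState.BACKWARD_PRE  # backward was interrupted
try:
    m.state_dict()
except ValueError as e:
    print(f"ERROR: {e}")
```

This is why the error surfaces at close time rather than where the original failure happened: the checkpoint callback runs during shutdown, finds the module in a non-IDLE state, and the save aborts.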
@jacobfulano
Contributor

Hi @NarenZen, can you provide more details on your hardware and docker image? Have you successfully saved checkpoints? Are you using multiple nodes?

@NarenZen
Author

  1. Hardware: 8xA100
  2. I got this error, but was able to see the saved checkpoints
  3. Running this git commit: 89f56d2

@vchiley
Contributor

vchiley commented Jun 6, 2023

That commit looks like it's part of #193,
where we fixed a few things before actually merging it.
Why not use the merged hash, or use main?

Does this happen at the beginning of training, at evaluation, or right after checkpointing?
Have you made modifications to the code, or are you running the repo as is?
What model are you using?
Can you add the configuration file?
Are you using our Composer PyTorch images, installing requirements from scratch in a new env, or installing requirements into your base env?

This does look like an FSDP issue.

bmosaicml pushed a commit that referenced this issue Jun 6, 2023
@hanlint
Collaborator

hanlint commented Jul 24, 2023

Closing, as this error occurred on a non-tagged commit. Please re-open if you see the issue in one of our released versions!

@hanlint hanlint closed this as completed Jul 24, 2023