
ERROR: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE #247

Closed
NarenZen opened this issue May 29, 2023 · 4 comments

@NarenZen
Got this error while finetuning the instruct model on an 8xA100 machine:

ERROR: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE
  File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 526, in _close
    callback.close(state, logger)
  File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 362, in close
    self._save_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint
    saved_path = checkpoint.save_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint
    'state': state.state_dict(),
  File "/usr/local/lib/python3.10/dist-packages/composer/core/state.py", line 789, in state_dict
    model_state = attribute_value.state_dict()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2402, in state_dict
    with summon_ctx:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2974, in _summon_full_params
    self._assert_state([TrainingState_.IDLE])
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3582, in _assert_state
    traceback.print_stack()
Error running CheckpointSaver.close(). Skipping CheckpointSaver.post_close().
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 526, in _close
    callback.close(state, logger)
  File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 362, in close
    self._save_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint
    saved_path = checkpoint.save_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint
    'state': state.state_dict(),
  File "/usr/local/lib/python3.10/dist-packages/composer/core/state.py", line 789, in state_dict
    model_state = attribute_value.state_dict()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1448, in state_dict
    module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2402, in state_dict
    with summon_ctx:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2974, in _summon_full_params
    self._assert_state([TrainingState_.IDLE])
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3583, in _assert_state
    raise ValueError(msg)
ValueError: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE
Stack (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 528, in _close
    log.error(
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
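For context: the ValueError above comes from FSDP's internal training-state machine. `state_dict()` summons full parameters via `_summon_full_params`, which is only legal when the module is in the IDLE state; here the checkpoint save ran while the module was still in BACKWARD_PRE, e.g. because training was interrupted mid-backward and `CheckpointSaver.close()` then tried to save. A minimal, non-distributed sketch of that guard (the class and method names below are illustrative, not the actual torch internals):

```python
from enum import Enum


class TrainingState(Enum):
    """Simplified stand-in for FSDP's internal TrainingState_ enum."""
    IDLE = 1
    BACKWARD_PRE = 2


class ShardedModule:
    """Illustrative sketch of the state-machine guard, not the real FSDP API."""

    def __init__(self):
        self.training_state = TrainingState.IDLE

    def _assert_state(self, expected):
        # FSDP raises ValueError when full parameters are summoned
        # while a backward pass is still in flight.
        if self.training_state not in expected:
            raise ValueError(
                f"expected to be in states {expected} "
                f"but current state is {self.training_state}"
            )

    def state_dict(self):
        # state_dict() may only run when the module is idle.
        self._assert_state([TrainingState.IDLE])
        return {}


m = ShardedModule()
m.training_state = TrainingState.BACKWARD_PRE  # backward was interrupted
try:
    m.state_dict()
except ValueError as e:
    print(f"ERROR: {e}")
```

This is why the error surfaces at close time rather than where the original failure happened: the checkpoint callback runs during shutdown, finds the module in a non-IDLE state, and the save aborts.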
@jacobfulano
Contributor

Hi @NarenZen, can you provide more details on your hardware and docker image? Have you successfully saved checkpoints? Are you using multiple nodes?

@NarenZen
Author

  1. Hardware: 8xA100
  2. I got this error, but was able to see the saved checkpoints
  3. Running this git commit: 89f56d2

@vchiley
Contributor

vchiley commented Jun 6, 2023

That commit looks like it's part of #193,
where we fixed a few things before actually merging it.
Why not use the merged hash, or use main?

Does this happen at the beginning of training, at evaluation, or right after checkpointing?
Have you made modifications to the code, or are you running the repo as is?
What model are you using?
Can you add the configuration file?
Are you using our Composer PyTorch images, installing requirements from scratch in a new env, or installing requirements into your base env?

This does look like an FSDP issue.

bmosaicml pushed a commit that referenced this issue Jun 6, 2023
@hanlint
Collaborator

hanlint commented Jul 24, 2023

Closing, as this error occurred on a non-tagged commit. Please re-open if you see the issue in one of our released versions!

@hanlint hanlint closed this as completed Jul 24, 2023