Got this error while fine-tuning an instruct model on an 8xA100 machine:
ERROR: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE
File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 526, in _close
callback.close(state, logger)
File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 362, in close
self._save_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint
saved_path = checkpoint.save_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint
'state': state.state_dict(),
File "/usr/local/lib/python3.10/dist-packages/composer/core/state.py", line 789, in state_dict
model_state = attribute_value.state_dict()
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2402, in state_dict
with summon_ctx:
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2974, in _summon_full_params
self._assert_state([TrainingState_.IDLE])
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3582, in _assert_state
traceback.print_stack()
Error running CheckpointSaver.close(). Skipping CheckpointSaver.post_close().
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 526, in _close
callback.close(state, logger)
File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 362, in close
self._save_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint
saved_path = checkpoint.save_checkpoint(
File "/usr/local/lib/python3.10/dist-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint
'state': state.state_dict(),
File "/usr/local/lib/python3.10/dist-packages/composer/core/state.py", line 789, in state_dict
model_state = attribute_value.state_dict()
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1448, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2402, in state_dict
with summon_ctx:
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 2974, in _summon_full_params
self._assert_state([TrainingState_.IDLE])
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 3583, in _assert_state
raise ValueError(msg)
ValueError: expected to be in states [<TrainingState_.IDLE: 1>] but current state is TrainingState_.BACKWARD_PRE
Stack (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/composer/core/engine.py", line 528, in _close
log.error(
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
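For context on what the traceback is checking: FSDP's `state_dict()` summons full parameters, which its internal `_assert_state` only allows when the module is `IDLE`; here the run died mid-backward, so the module was still in `BACKWARD_PRE` when `CheckpointSaver.close()` tried to save. A minimal toy sketch of that assertion pattern (hypothetical `ToyFSDPModule` and enum, not the actual torch internals):

```python
from enum import Enum

class TrainingState(Enum):
    IDLE = 1
    BACKWARD_PRE = 2

class ToyFSDPModule:
    """Toy stand-in for FSDP's state bookkeeping (not the real API)."""
    def __init__(self):
        self.training_state = TrainingState.IDLE

    def _assert_state(self, allowed):
        # Mirrors the pattern in torch's _assert_state: refuse to
        # proceed unless the module is in one of the allowed states.
        if self.training_state not in allowed:
            raise ValueError(
                f"expected to be in states {allowed} "
                f"but current state is {self.training_state}"
            )

    def state_dict(self):
        # Building a state dict requires summoning full params, which
        # is only legal when no forward/backward pass is in flight.
        self._assert_state([TrainingState.IDLE])
        return {}

m = ToyFSDPModule()
m.training_state = TrainingState.BACKWARD_PRE  # run died mid-backward
try:
    m.state_dict()
except ValueError as e:
    print(e)  # same shape of message as the error above
```

In other words, the checkpoint save itself is likely fine; the real question is what crashed earlier and left the module stuck in `BACKWARD_PRE` before `close()` ran.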
That commit looks like it's part of #193, where we fixed a few things before actually merging it.
Why not use the merged hash, or use main?
Does this happen at the beginning of training, at evaluation, or right after checkpointing?
Have you made modifications to the code, or are you running the repo as-is?
What model are you using?
Can you add the configuration file?
Are you using our Composer PyTorch images, installing requirements from scratch in a new env, or installing requirements using your base env?