RuntimeError: still have inflight params [BUG] #5648

Open
iszengxin opened this issue Jun 12, 2024 · 1 comment
Labels: bug (Something isn't working), training

@iszengxin commented:

Describe the bug
Hello, can someone help? I am using DeepSpeed v0.14.3, installed from the source tar.gz: https://github.com/melMass/DeepSpeed/releases
I am using DeepSpeed ZeRO-3 to train a LLaMA-Factory KTO task, and during the evaluation stage of training I hit this problem.

Launcher context
deepspeed --num_gpus 1 --master_port=9901 src/train.py .....
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
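
For reference, a minimal sketch of how a ZeRO-3 config like the one above is typically consumed (this is not the LLaMA-Factory training code; the model and values here are placeholders). LLaMA-Factory / the HF Trainer integration resolves the "auto" fields; a direct deepspeed.initialize() call needs concrete values instead:

    # Hypothetical minimal sketch, not the LLaMA-Factory source.
    # "auto" fields from the JSON above are replaced with concrete values.
    import torch
    import deepspeed

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }

    model = torch.nn.Linear(4096, 16)  # placeholder for the real model
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    # launched e.g. with: deepspeed --num_gpus 1 --master_port=9901 this_script.py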

Docker context
Are you using a specific docker image that you can share?

Additional context
RuntimeError: still have inflight params [{'id': 35, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 37, 'status': 'AVAILABLE', 'numel': 16384, 'ds_numel': 16384, 'shape': (1024, 16), 'ds_shape': (1024, 16), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([16384])}, {'id': 39, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 41, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}, {'id': 45, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}, {'id': 43, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 47, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}, {'id': 51, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}]
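
A workaround that is often suggested for this error (an assumption here, not verified for this KTO setup) is to disable ZeRO-3 parameter prefetching, so the coordinator never speculatively gathers parameters for modules that the evaluation forward pass may skip. Expressed as edits to the config dict above:

    # Hedged workaround sketch: assumes the inflight params come from
    # speculative prefetching when the evaluation forward order differs
    # from the traced training order. Disabling prefetch trades throughput
    # for correctness of the execution trace.
    ds_config["zero_optimization"]["stage3_prefetch_bucket_size"] = 0
    ds_config["zero_optimization"]["stage3_max_reuse_distance"] = 0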

iszengxin added the bug (Something isn't working) and training labels on Jun 12, 2024
@sxhysj commented on Jun 26, 2024:
Same issue here. I used ZeRO-3, ZeRO Init, and deepspeed 0.14.2. Here is the list of my pip versions:
Package                    Version
-------------------------  ------------
accelerate 0.31.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.4.0
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.6.2
charset-normalizer 3.3.2
dataclasses-json 0.6.7
datasets 2.14.7
deepspeed 0.14.2
dill 0.3.7
distro 1.9.0
docstring_parser 0.16
evaluate 0.4.1
filelock 3.13.1
frozenlist 1.4.1
fsspec 2023.10.0
greenlet 3.0.3
h11 0.14.0
hjson 3.1.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.23.4
idna 3.7
Jinja2 3.1.3
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
langchain 0.2.6
langchain-community 0.2.6
langchain-core 0.2.10
langchain-text-splitters 0.2.2
langsmith 0.1.82
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.21.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
mypy-extensions 1.0.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
openai 1.12.0
orjson 3.10.5
packaging 24.1
pandas 2.2.0
peft 0.11.1
pillow 10.2.0
pip 24.0
protobuf 4.25.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
Pygments 2.18.0
pynvml 11.5.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
responses 0.18.0
rich 13.7.1
safetensors 0.4.3
scikit-learn 1.4.0
scipy 1.14.0
sentencepiece 0.1.99
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
sniffio 1.3.1
SQLAlchemy 2.0.31
sympy 1.12
tenacity 8.2.3
threadpoolctl 3.5.0
tiktoken 0.6.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.41.2
triton 2.3.1
trl 0.9.4
typing_extensions 4.9.0
typing-inspect 0.9.0
tyro 0.8.5
tzdata 2024.1
urllib3 2.2.2
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4

I encountered the following error:

[2024-06-26 12:43:15,076] [INFO] [config.py:986:print_user_config]   json = {
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-05,
            "warmup_num_steps": 1000
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp",
            "buffer_count": 40
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp2",
            "buffer_count": 40
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+09,
        "reduce_bucket_size": 1.000000e+06,
        "stage3_prefetch_bucket_size": 1.509949e+07,
        "stage3_param_persistence_threshold": 4.096000e+04,
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09,
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": false,
    "bf16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true
}
reward_model_name:  ./saved_models/reward_model_vicuna-7b-adapter-merged
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
  warnings.warn(warning_msg)
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at ./saved_models/reward_model_vicuna-7b-adapter-merged and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0it [00:00, ?it/s]/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:426: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
  warnings.warn(
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/pipelines/text_classification.py:104: UserWarning: `return_all_scores` is now deprecated,  if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
  warnings.warn(
Invalidate trace cache @ step 10: expected module 1048, but got module 1055
0it [00:10, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xxx/git/sep/main.py", line 94, in <module>
[rank0]:     exp_model.train()
[rank0]:   File "/home/xxx/git/sep/exp/exp_model.py", line 87, in train
[rank0]:     tuning_lm_with_rl(self.args)
[rank0]:   File "/home/xxx/git/sep/predict_module/tuning_lm_with_rl.py", line 264, in tuning_lm_with_rl
[rank0]:     stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/anaconda3/envs/ml/lib/python3.11/contextlib.py", line 81, in inner
[rank0]:     return func(*args, **kwds)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 749, in step
[rank0]:     ref_logprobs, ref_logits_or_none, _, _ = self.batched_forward_pass(
[rank0]:                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/contextlib.py", line 81, in inner
[rank0]:     return func(*args, **kwds)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 1013, in batched_forward_pass
[rank0]:     logits, _, values = model(**input_kwargs)
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1595, in _call_impl
[rank0]:     hook_result = hook(self, args, result)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 232, in _end_of_forward_hook
[rank0]:     self.get_param_coordinator(training=False).reset_step()
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 207, in reset_step
[rank0]:     raise RuntimeError(f"still have inflight params "
[rank0]: RuntimeError: still have inflight params [{'id': 3, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 7, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 16, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 20, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}]
E0626 12:43:31.197000 125351468656448 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3290) of binary: /home/xxx/anaconda3/envs/ml/bin/python
Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/ml/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    deepspeed_launcher(args)
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-26_12:43:31
  host      : hcserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3290)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
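
Based on the traceback, here is a simplified sketch of the check that fires (an assumption inferred from the call chain, not the actual deepspeed/runtime/zero/partitioned_param_coordinator.py source): the parameter coordinator tracks parameters it has started gathering, and the end-of-forward hook calls reset_step(), which refuses to continue while anything is still "inflight", i.e. when the traced order predicted modules that never actually ran:

    # Simplified, hypothetical sketch of the failing check; NOT the real
    # DeepSpeed implementation.
    class ParamCoordinatorSketch:
        def __init__(self):
            self.inflight = []                 # params fetched but not yet consumed

        def fetch(self, param_info):
            self.inflight.append(param_info)   # speculative all-gather for an expected module

        def mark_used(self, param_info):
            self.inflight.remove(param_info)   # the module actually ran and used the param

        def reset_step(self):
            # called from the end-of-forward hook; anything still inflight means
            # the prefetcher fetched params for modules that never executed
            if self.inflight:
                raise RuntimeError(f"still have inflight params {self.inflight}")

    coord = ParamCoordinatorSketch()
    coord.fetch({"id": 3, "shape": (4096, 8)})  # prefetched, but its module is skipped
    coord.reset_step()                          # -> RuntimeError: still have inflight params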
