RuntimeError: still have inflight params [BUG] #5648

Open
iszengxin opened this issue Jun 12, 2024 · 1 comment
Labels: bug (Something isn't working), training

@iszengxin commented:

Describe the bug
Hello, can someone help? I am using DeepSpeed v0.14.3, installed from the source tar.gz: https://github.com/melMass/DeepSpeed/releases
I am using DeepSpeed ZeRO-3 to train a LLaMA-Factory KTO task, and during the evaluation stage of training I hit this problem.

Launcher context
deepspeed --num_gpus 1 --master_port=9901 src/train.py .....
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
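
For reference, a minimal sketch of how a ZeRO-3 config like the one above is typically consumed (this is not the LLaMA-Factory training code; the model and values here are placeholders). LLaMA-Factory / the HF Trainer integration resolves the "auto" fields; a direct deepspeed.initialize() call needs concrete values instead:

    # Hypothetical minimal sketch, not the LLaMA-Factory source.
    # "auto" fields from the JSON above are replaced with concrete values.
    import torch
    import deepspeed

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }

    model = torch.nn.Linear(4096, 16)  # placeholder for the real model
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    # launched e.g. with: deepspeed --num_gpus 1 --master_port=9901 this_script.py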

Docker context
Are you using a specific docker image that you can share?

Additional context
RuntimeError: still have inflight params [{'id': 35, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 37, 'status': 'AVAILABLE', 'numel': 16384, 'ds_numel': 16384, 'shape': (1024, 16), 'ds_shape': (1024, 16), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([16384])}, {'id': 39, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 41, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}, {'id': 45, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}, {'id': 43, 'status': 'AVAILABLE', 'numel': 65536, 'ds_numel': 65536, 'shape': (4096, 16), 'ds_shape': (4096, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536])}, {'id': 47, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}, {'id': 51, 'status': 'AVAILABLE', 'numel': 229376, 'ds_numel': 229376, 'shape': (14336, 16), 'ds_shape': (14336, 16), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([229376])}]
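
A workaround that is often suggested for this error (an assumption here, not verified for this KTO setup) is to disable ZeRO-3 parameter prefetching, so the coordinator never speculatively gathers parameters for modules that the evaluation forward pass may skip. Expressed as edits to the config dict above:

    # Hedged workaround sketch: assumes the inflight params come from
    # speculative prefetching when the evaluation forward order differs
    # from the traced training order. Disabling prefetch trades throughput
    # for correctness of the execution trace.
    ds_config["zero_optimization"]["stage3_prefetch_bucket_size"] = 0
    ds_config["zero_optimization"]["stage3_max_reuse_distance"] = 0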

iszengxin added the bug (Something isn't working) and training labels on Jun 12, 2024
@sxhysj commented on Jun 26, 2024:
Same issue here. I used ZeRO-3, ZeRO Init, and deepspeed 0.14.2. Here is the list of my pip versions:
Package                    Version
-------------------------  ------------
accelerate 0.31.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.4.0
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.6.2
charset-normalizer 3.3.2
dataclasses-json 0.6.7
datasets 2.14.7
deepspeed 0.14.2
dill 0.3.7
distro 1.9.0
docstring_parser 0.16
evaluate 0.4.1
filelock 3.13.1
frozenlist 1.4.1
fsspec 2023.10.0
greenlet 3.0.3
h11 0.14.0
hjson 3.1.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.23.4
idna 3.7
Jinja2 3.1.3
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
langchain 0.2.6
langchain-community 0.2.6
langchain-core 0.2.10
langchain-text-splitters 0.2.2
langsmith 0.1.82
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.21.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
mypy-extensions 1.0.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
openai 1.12.0
orjson 3.10.5
packaging 24.1
pandas 2.2.0
peft 0.11.1
pillow 10.2.0
pip 24.0
protobuf 4.25.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
Pygments 2.18.0
pynvml 11.5.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
responses 0.18.0
rich 13.7.1
safetensors 0.4.3
scikit-learn 1.4.0
scipy 1.14.0
sentencepiece 0.1.99
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
sniffio 1.3.1
SQLAlchemy 2.0.31
sympy 1.12
tenacity 8.2.3
threadpoolctl 3.5.0
tiktoken 0.6.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.41.2
triton 2.3.1
trl 0.9.4
typing_extensions 4.9.0
typing-inspect 0.9.0
tyro 0.8.5
tzdata 2024.1
urllib3 2.2.2
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4

I encountered the following error:

[2024-06-26 12:43:15,076] [INFO] [config.py:986:print_user_config]   json = {
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-05,
            "warmup_num_steps": 1000
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp",
            "buffer_count": 40
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp2",
            "buffer_count": 40
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+09,
        "reduce_bucket_size": 1.000000e+06,
        "stage3_prefetch_bucket_size": 1.509949e+07,
        "stage3_param_persistence_threshold": 4.096000e+04,
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09,
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": false,
    "bf16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true
}
reward_model_name:  ./saved_models/reward_model_vicuna-7b-adapter-merged
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
  warnings.warn(warning_msg)
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at ./saved_models/reward_model_vicuna-7b-adapter-merged and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0it [00:00, ?it/s]/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:426: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
  warnings.warn(
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/pipelines/text_classification.py:104: UserWarning: `return_all_scores` is now deprecated,  if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
  warnings.warn(
Invalidate trace cache @ step 10: expected module 1048, but got module 1055
0it [00:10, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xxx/git/sep/main.py", line 94, in <module>
[rank0]:     exp_model.train()
[rank0]:   File "/home/xxx/git/sep/exp/exp_model.py", line 87, in train
[rank0]:     tuning_lm_with_rl(self.args)
[rank0]:   File "/home/xxx/git/sep/predict_module/tuning_lm_with_rl.py", line 264, in tuning_lm_with_rl
[rank0]:     stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/anaconda3/envs/ml/lib/python3.11/contextlib.py", line 81, in inner
[rank0]:     return func(*args, **kwds)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 749, in step
[rank0]:     ref_logprobs, ref_logits_or_none, _, _ = self.batched_forward_pass(
[rank0]:                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/contextlib.py", line 81, in inner
[rank0]:     return func(*args, **kwds)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 1013, in batched_forward_pass
[rank0]:     logits, _, values = model(**input_kwargs)
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1595, in _call_impl
[rank0]:     hook_result = hook(self, args, result)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 232, in _end_of_forward_hook
[rank0]:     self.get_param_coordinator(training=False).reset_step()
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 207, in reset_step
[rank0]:     raise RuntimeError(f"still have inflight params "
[rank0]: RuntimeError: still have inflight params [{'id': 3, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 7, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 16, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 20, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}]
E0626 12:43:31.197000 125351468656448 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3290) of binary: /home/xxx/anaconda3/envs/ml/bin/python
Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/ml/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    deepspeed_launcher(args)
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-26_12:43:31
  host      : hcserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3290)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
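
Based on the traceback, here is a simplified sketch of the check that fires (an assumption inferred from the call chain, not the actual deepspeed/runtime/zero/partitioned_param_coordinator.py source): the parameter coordinator tracks parameters it has started gathering, and the end-of-forward hook calls reset_step(), which refuses to continue while anything is still "inflight", i.e. when the traced order predicted modules that never actually ran:

    # Simplified, hypothetical sketch of the failing check; NOT the real
    # DeepSpeed implementation.
    class ParamCoordinatorSketch:
        def __init__(self):
            self.inflight = []                 # params fetched but not yet consumed

        def fetch(self, param_info):
            self.inflight.append(param_info)   # speculative all-gather for an expected module

        def mark_used(self, param_info):
            self.inflight.remove(param_info)   # the module actually ran and used the param

        def reset_step(self):
            # called from the end-of-forward hook; anything still inflight means
            # the prefetcher fetched params for modules that never executed
            if self.inflight:
                raise RuntimeError(f"still have inflight params {self.inflight}")

    coord = ParamCoordinatorSketch()
    coord.fetch({"id": 3, "shape": (4096, 8)})  # prefetched, but its module is skipped
    coord.reset_step()                          # -> RuntimeError: still have inflight params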
