
Fix multi-GPU loading and inference #190

Merged · 4 commits · Nov 14, 2023
Conversation

@casper-hansen (Owner) commented Nov 13, 2023

Resolves #162, Resolves #131, Resolves #143

  • Update the use of accelerate methods for multi-GPU (they broke at some point).
  • Fix memory issues related to multi-GPU.
    • CUDA error: an illegal memory access was encountered. This was caused by tensors not being on the right devices. The fix is to move tensors to the right device at the model level (see the sketch after this list); doing it only at the linear-module level was not a complete fix.
  • Note: hidden_states.to(attn_output.device) + attn_output may not be needed; more testing is required to confirm whether it is.
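A minimal sketch of the model-level device handling the second bullet refers to. The class and attribute names below are illustrative, not AutoAWQ's actual fused classes; it only assumes the decoder blocks have been dispatched to different GPUs (e.g. by accelerate).

```python
# Illustrative sketch, not AutoAWQ's actual code: move the hidden state to each
# block's device at the model level instead of inside every linear module.
import torch
import torch.nn as nn

class FusedDecoderSketch(nn.Module):
    """Stand-in for a fused decoder whose blocks may live on different GPUs."""

    def __init__(self, blocks, norm, lm_head):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)  # each block may sit on a different GPU
        self.norm = norm
        self.lm_head = lm_head

    @torch.no_grad()
    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Device of this block's weights; the first parameter is representative.
            block_device = next(block.parameters()).device
            if h.device != block_device:
                # Moving the hidden state once per block keeps every op inside the
                # block on a single device and avoids illegal memory accesses.
                h = h.to(block_device)
            h = block(h)
        h = self.norm(h.to(next(self.norm.parameters()).device))
        return self.lm_head(h.to(next(self.lm_head.parameters()).device))
```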

@pseudotensor

cool!

@pseudotensor commented Jan 21, 2024

Hi @casper-hansen,

I'm running this model: TheBloke/openchat_3.5-16k-AWQ. While the 'balanced' device map spreads the model across all GPUs at load time, any use of the model with large-context input ends up using only the first GPU and going OOM.

i.e.

    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1053, in forward
    outputs = self.model(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/awq/modules/fused/model.py", line 101, in forward
    h, _, past_key_value = layer(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/awq/modules/fused/block.py", line 65, in forward
    attn_output, _, past_key_value = self.attn.forward(
  File "/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/awq/modules/fused/attn.py", line 200, in forward
    scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.70 GiB. GPU 0 has a total capacty of 47.54 GiB of which 12.35 GiB is free. Including non-PyTorch memory, this process has 35.18 GiB memory in use. Of the allocated memory 33.16 GiB is allocated by PyTorch, and 795.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
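For scale, the failing matmul materializes the full attention-score tensor, so the allocation grows quadratically with sequence length. A rough estimate, assuming openchat_3.5's Mistral-style config (32 attention heads), fp16 scores, and batch size 1; the numbers are illustrative only:

```python
# Rough estimate only (assumptions: 32 heads, fp16, batch size 1).
# The failing line allocates scores of shape (batch, n_heads, seq_len, seq_len).
n_heads = 32
bytes_per_elem = 2  # fp16

def score_bytes(seq_len: int, batch: int = 1) -> int:
    return batch * n_heads * seq_len * seq_len * bytes_per_elem

print(f"{score_bytes(16_384) / 2**30:.1f} GiB")  # 16.0 GiB at 16k tokens
print(f"{score_bytes(20_000) / 2**30:.1f} GiB")  # 23.8 GiB, close to the 23.70 GiB above
```

The whole score tensor lands on whichever GPU runs that block's attention, so a long prompt can exhaust a single card even though the weights themselves are spread out.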

About 4 GB is on each GPU after loading, but usage blows up on the first GPU and leads to this. Is the forward not distributed?

i.e. post load:

[GPU memory screenshot: roughly 4 GB used per GPU]

Post failure:

[GPU memory screenshot: GPU 0 at capacity]
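For reference, a minimal sketch of the kind of setup being described above. It assumes AutoAWQ's AutoAWQForCausalLM.from_quantized accepts fuse_layers and device_map arguments (names may differ between versions), and the prompt length and generation settings are illustrative, not taken from the report.

```python
# Illustrative sketch only: multi-GPU AWQ load plus a long-context generate call.
# Assumption: from_quantized accepts fuse_layers and device_map in this version.
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/openchat_3.5-16k-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,       # routes inference through awq/modules/fused/* as in the traceback
    device_map="balanced",  # spread weights across the available GPUs
)

# A long prompt (roughly 16k tokens here) drives the quadratic attention-score
# allocation shown in the traceback above.
long_prompt = "word " * 16_000
inputs = tokenizer(long_prompt, return_tensors="pt").to("cuda:0")

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```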
