[Chatllama] fix embedding out of bounds #253

Merged · 1 commit into nebuly-ai:main on Mar 14, 2023

Conversation

@HuangLK (Contributor) commented Mar 11, 2023

When token_id is -1, the embedding lookup goes out of bounds.
https://github.com/nebuly-ai/nebullvm/blob/ca085a979b5b596bf0ecd477e4c4deff3725661c/apps/accelerate/chatllama/chatllama/llama_model.py#L482

Partial error message:

/opt/conda/conda-bld/pytorch_1659484808560/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [120,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484808560/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [120,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484808560/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [120,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
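
For reference, a minimal sketch of the kind of guard this PR introduces (the helper name and the choice of 0 as the placeholder id are illustrative, not the exact patch):

```python
import torch

def embed_with_padding_guard(
    tokens: torch.Tensor, tok_embeddings: torch.nn.Embedding
) -> torch.Tensor:
    """Pad positions carry token_id == -1, which is an invalid row index
    into the embedding table and triggers the CUDA assert above. Replace
    them with a valid id before the lookup; the attention mask already
    keeps those positions from affecting the output."""
    pad_mask = tokens == -1
    safe_tokens = tokens.masked_fill(pad_mask, 0)  # any in-vocab id works
    return tok_embeddings(safe_tokens)
```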

@AAnirudh07 (Contributor) commented Mar 12, 2023

Hi @HuangLK!
I modified llama_model.py with your changes, but I still get the assertion error :(

  • Actor model: llama-7b
  • Tokenizer: llama's tokenizer (tokenizer.model)

Error:
Current device used :cuda
../chatllama_test/llama_weights/7B
Loading
Start Actor Model Pretraining
Traceback (most recent call last):
  File "/home/anirudh/rlhf/artifacts/main.py", line 51, in <module>
    actor_trainer.train()
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/rlhf/actor.py", line 373, in train
    est_output = self.model(training_input, attention_mask)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "<@beartype(chatllama.rlhf.actor.ActorModel.forward) at 0x7fa05053e290>", line 51, in forward
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/rlhf/actor.py", line 114, in forward
    model_output = self.model.forward(
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 480, in forward
    logits = self._forward(tokens, attention_mask)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 513, in _forward
    h, _, _ = layer(h, kv_mask, freqs_cis)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 407, in forward
    attn, cache_k, cache_v = self.attention.forward(
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 293, in forward
    xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 200, in apply_rotary_emb
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 186, in reshape_for_broadcast
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
AssertionError

@HuangLK (Contributor, Author) commented Mar 12, 2023

Seems like another case. You could check the max length of the training data, or just use one simple, short example for debugging.
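
For anyone hitting the same assertion, a quick way to run that check (a sketch; the tokenizer path, dataset file, and field names are assumptions):

```python
import json
from sentencepiece import SentencePieceProcessor

# The RoPE freqs_cis table is precomputed up to the model's max_seq_len
# (2048 for LLaMA by default); a longer input breaks the shape assertion
# in reshape_for_broadcast. Find the longest tokenized training example.
tokenizer = SentencePieceProcessor(model_file="llama_weights/tokenizer.model")
with open("actor_training_data.json") as f:  # illustrative path
    examples = json.load(f)

max_len = max(
    len(tokenizer.encode(ex["user_input"] + ex["completion"]))
    for ex in examples
)
print(f"Longest example: {max_len} tokens; it must fit within max_seq_len.")
```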

@AAnirudh07 (Contributor) commented

Will do, thanks!

@bnuzhanyu commented

> Will do, thanks!

Did you fix that? I met this at iteration 205.

@cokuehuang commented

@HuangLK Thanks for solving the 'srcIndex < srcSelectDimSize' problem. I modified llama_model.py with your changes and ran into another error:

Current device used :cuda
Loading
Start RL Training
Episode: 1 of 100, Timestep: 1 of 32
Traceback (most recent call last):
  File "artifacts/main.py", line 51, in <module>
    rlhf_trainer.train()
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 655, in train
    ) = self.actorcritic.generate(states, states_mask)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "<@beartype(chatllama.rlhf.trainer.ActorCritic.generate) at 0x7f5fcb170f70>", line 51, in generate
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 144, in generate
    actions, sequence = self.actor.generate(states, state_mask)
  File "<@beartype(chatllama.rlhf.actor.ActorModel.generate) at 0x7f5fcd9f9160>", line 51, in generate
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 163, in generate
    sequences = self.model.generate(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 533, in generate
    logits = self._forward(input_ids, attention_mask)[:, -1, :]
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 507, in _forward
    h, cache_k, cache_v = layer(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 405, in forward
    attn, cache_k, cache_v = self.attention.forward(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 304, in forward
    cache_k[:bsz, start_pos : start_pos + seqlen] = xk  # noqa E203
RuntimeError: The expanded size of the tensor (1) must match the existing size (32) at non-singleton dimension 0.  Target sizes: [1, 35, 32, 128].  Tensor sizes: [32, 35, 32, 128]

Do you have any ideas?
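
The mismatch reads like a batch-size problem: the KV cache slice has batch dimension 1 while xk arrives with batch 32, which suggests the cache was preallocated for a smaller max batch than the one used during generation. A minimal sketch reproducing it, with shapes taken from the traceback (the allocation values are assumptions):

```python
import torch

# Cache preallocated for max_batch_size=1 (assumed), matching the
# target shape [1, 35, 32, 128] reported in the error.
max_batch_size, max_seq_len, n_heads, head_dim = 1, 1024, 32, 128
cache_k = torch.zeros(max_batch_size, max_seq_len, n_heads, head_dim)

# Generation then runs with a batch of 32 sequences of length 35.
bsz, seqlen, start_pos = 32, 35, 0
xk = torch.randn(bsz, seqlen, n_heads, head_dim)

# Mirrors llama_model.py line 304 and fails exactly as above: the
# cache slice can hold only 1 sequence, not 32. Allocating the cache
# with max_batch_size >= bsz (or rebuilding it per generate call)
# would avoid the error.
cache_k[:bsz, start_pos : start_pos + seqlen] = xk
```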

@AAnirudh07 (Contributor) commented

> Did you fix that? I met this at iteration 205.

@bnuzhanyu
Yep, this did solve the problem! I ran out of memory a couple of epochs into the actor training, but I believe that has nothing to do with this PR.

@PierpaoloSorbellini PierpaoloSorbellini changed the title fix embedding out of bounds [chatllama] fix embedding out of bounds Mar 14, 2023
@PierpaoloSorbellini PierpaoloSorbellini changed the title [chatllama] fix embedding out of bounds [Chatllama] fix embedding out of bounds Mar 14, 2023
@PierpaoloSorbellini PierpaoloSorbellini merged commit b49ad1c into nebuly-ai:main Mar 14, 2023
@HuangLK HuangLK deleted the feat/fix-embedding-oob branch March 15, 2023 06:35
@PierpaoloSorbellini (Collaborator) commented

Hi @HuangLK, thanks for the PR! We are very excited to have people be part of this project.
We have merged your PR. Great work!
