[Chatllama] fix embedding out of bounds #253

Merged · 1 commit into nebuly-ai:main on Mar 14, 2023

Conversation

@HuangLK (Contributor) commented Mar 11, 2023

When token_id is -1, the embedding lookup goes out of bounds.
https://github.com/nebuly-ai/nebullvm/blob/ca085a979b5b596bf0ecd477e4c4deff3725661c/apps/accelerate/chatllama/chatllama/llama_model.py#L482

Partial error message:

/opt/conda/conda-bld/pytorch_1659484808560/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [120,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484808560/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [120,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1659484808560/work/aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [120,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
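
For reference, a minimal sketch of the kind of guard this PR introduces (the helper name and the choice of 0 as the placeholder id are illustrative, not the exact patch):

```python
import torch

def embed_with_padding_guard(
    tokens: torch.Tensor, tok_embeddings: torch.nn.Embedding
) -> torch.Tensor:
    """Pad positions carry token_id == -1, which is an invalid row index
    into the embedding table and triggers the CUDA assert above. Replace
    them with a valid id before the lookup; the attention mask already
    keeps those positions from affecting the output."""
    pad_mask = tokens == -1
    safe_tokens = tokens.masked_fill(pad_mask, 0)  # any in-vocab id works
    return tok_embeddings(safe_tokens)
```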

@AAnirudh07 (Contributor) commented Mar 12, 2023

Hi @HuangLK!
I modified llama_model.py with your changes, but I still get the assertion error :(

  • Actor model: llama-7b
  • Tokenizer: llama's tokenizer (tokenizer.model)

Error:
Current device used :cuda
../chatllama_test/llama_weights/7B
Loading
Start Actor Model Pretraining
Traceback (most recent call last):
  File "/home/anirudh/rlhf/artifacts/main.py", line 51, in <module>
    actor_trainer.train()
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/rlhf/actor.py", line 373, in train
    est_output = self.model(training_input, attention_mask)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "<@beartype(chatllama.rlhf.actor.ActorModel.forward) at 0x7fa05053e290>", line 51, in forward
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/rlhf/actor.py", line 114, in forward
    model_output = self.model.forward(
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 480, in forward
    logits = self._forward(tokens, attention_mask)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 513, in _forward
    h, _, _ = layer(h, kv_mask, freqs_cis)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 407, in forward
    attn, cache_k, cache_v = self.attention.forward(
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 293, in forward
    xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 200, in apply_rotary_emb
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
  File "/home/anirudh/chatllama_test/venv/lib/python3.10/site-packages/chatllama/llama_model.py", line 186, in reshape_for_broadcast
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
AssertionError

@HuangLK (Contributor, Author) commented Mar 12, 2023

Seems like another case. You could check the max length of the training data, or just use one simple, short example for debugging.
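
For anyone hitting the same assertion, a quick way to run that check (a sketch; the tokenizer path, dataset file, and field names are assumptions):

```python
import json
from sentencepiece import SentencePieceProcessor

# The RoPE freqs_cis table is precomputed up to the model's max_seq_len
# (2048 for LLaMA by default); a longer input breaks the shape assertion
# in reshape_for_broadcast. Find the longest tokenized training example.
tokenizer = SentencePieceProcessor(model_file="llama_weights/tokenizer.model")
with open("actor_training_data.json") as f:  # illustrative path
    examples = json.load(f)

max_len = max(
    len(tokenizer.encode(ex["user_input"] + ex["completion"]))
    for ex in examples
)
print(f"Longest example: {max_len} tokens; it must fit within max_seq_len.")
```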

@AAnirudh07 (Contributor) commented

Will do, thanks!

@bnuzhanyu commented

> Will do, thanks!

Did you fix that? I met this at iteration 205.

@cokuehuang commented

@HuangLK Thanks for solving the 'srcIndex < srcSelectDimSize' problem. I modified llama_model.py with your changes and ran into another error:

Current device used :cuda
Loading
Start RL Training
Episode: 1 of 100, Timestep: 1 of 32
Traceback (most recent call last):
  File "artifacts/main.py", line 51, in <module>
    rlhf_trainer.train()
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 655, in train
    ) = self.actorcritic.generate(states, states_mask)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "<@beartype(chatllama.rlhf.trainer.ActorCritic.generate) at 0x7f5fcb170f70>", line 51, in generate
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 144, in generate
    actions, sequence = self.actor.generate(states, state_mask)
  File "<@beartype(chatllama.rlhf.actor.ActorModel.generate) at 0x7f5fcd9f9160>", line 51, in generate
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 163, in generate
    sequences = self.model.generate(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 533, in generate
    logits = self._forward(input_ids, attention_mask)[:, -1, :]
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 507, in _forward
    h, cache_k, cache_v = layer(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 405, in forward
    attn, cache_k, cache_v = self.attention.forward(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 304, in forward
    cache_k[:bsz, start_pos : start_pos + seqlen] = xk  # noqa E203
RuntimeError: The expanded size of the tensor (1) must match the existing size (32) at non-singleton dimension 0.  Target sizes: [1, 35, 32, 128].  Tensor sizes: [32, 35, 32, 128]

Do you have any ideas?
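
The mismatch reads like a batch-size problem: the KV cache slice has batch dimension 1 while xk arrives with batch 32, which suggests the cache was preallocated for a smaller max batch than the one used during generation. A minimal sketch reproducing it, with shapes taken from the traceback (the allocation values are assumptions):

```python
import torch

# Cache preallocated for max_batch_size=1 (assumed), matching the
# target shape [1, 35, 32, 128] reported in the error.
max_batch_size, max_seq_len, n_heads, head_dim = 1, 1024, 32, 128
cache_k = torch.zeros(max_batch_size, max_seq_len, n_heads, head_dim)

# Generation then runs with a batch of 32 sequences of length 35.
bsz, seqlen, start_pos = 32, 35, 0
xk = torch.randn(bsz, seqlen, n_heads, head_dim)

# Mirrors llama_model.py line 304 and fails exactly as above: the
# cache slice can hold only 1 sequence, not 32. Allocating the cache
# with max_batch_size >= bsz (or rebuilding it per generate call)
# would avoid the error.
cache_k[:bsz, start_pos : start_pos + seqlen] = xk
```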

@AAnirudh07 (Contributor) commented

> Did you fix that? I met this at iteration 205.

@bnuzhanyu
Yep, this did solve the problem! I ran out of memory a couple of epochs into the actor training, but I believe that has nothing to do with this PR.

@PierpaoloSorbellini PierpaoloSorbellini changed the title fix embedding out of bounds [chatllama] fix embedding out of bounds Mar 14, 2023
@PierpaoloSorbellini PierpaoloSorbellini changed the title [chatllama] fix embedding out of bounds [Chatllama] fix embedding out of bounds Mar 14, 2023
@PierpaoloSorbellini PierpaoloSorbellini merged commit b49ad1c into nebuly-ai:main Mar 14, 2023
@HuangLK HuangLK deleted the feat/fix-embedding-oob branch March 15, 2023 06:35
@PierpaoloSorbellini (Collaborator) commented

Hi @HuangLK, thanks for the PR! We are very excited to have people be part of this project.
We have merged your PR. Great work!
