Match Transformers RoPE implementation #214

Merged: 4 commits merged into huggingface:main on Sep 5, 2024

Conversation

zzhhjjj (Collaborator) commented on Aug 14, 2024

Match RoPE with Transformers

What does this PR do?

  1. Adds the Transformers library's RoPE implementation to Nanotron.
  2. Sets the default value of rope_interleaved to False.

Why?

In Nanotron, we currently use the interleaved version of RoPE, which differs from the implementation in Transformers. This discrepancy seems to cause a performance gap between Nanotron and Transformers after converting the weights.
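To make the difference concrete, here is a minimal sketch of the two layouts (function names are mine, not the Nanotron or Transformers source): the interleaved variant rotates adjacent channel pairs (2i, 2i+1), while the Transformers "rotate_half" variant pairs channel i with channel i + dim/2. Both use the same frequencies, so a checkpoint trained with one layout does not behave identically under the other without converting the weights.

```python
import torch

def rope_frequencies(dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # One angle per channel pair and position: [seq_len, dim // 2]
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(torch.arange(seq_len).float(), inv_freq)

def apply_rope_transformers(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # Non-interleaved ("rotate_half") layout: channel i is paired with channel i + dim // 2.
    cos = torch.cat((freqs.cos(), freqs.cos()), dim=-1)  # [seq_len, dim]
    sin = torch.cat((freqs.sin(), freqs.sin()), dim=-1)
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

def apply_rope_interleaved(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # Interleaved layout (flash-attn style): channel 2i is paired with channel 2i + 1.
    cos = freqs.cos().repeat_interleave(2, dim=-1)  # [seq_len, dim]
    sin = freqs.sin().repeat_interleave(2, dim=-1)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((-x2, x1), dim=-1).flatten(-2)
    return x * cos + rotated * sin
```

With x of shape [..., seq_len, dim], the two functions return different tensors for the same input, which is exactly the mismatch converted checkpoints run into.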

Evaluation with lighteval

At least for LLaMA 3/3.1, the evaluation results are very close.
[lighteval results screenshot]

Note

I used this converter to maintain the order of the columns

--------------------------------- Previously ---------------------------------
It's no longer relevant to the PR, but I'm keeping the information here for reference

What has been done?

Changed the modeling code of LLaMA in Nanotron so that it produces the same output as the Transformers library.

Why?

  1. Models trained with Nanotron show a drop in performance when converted to Transformers format.
  2. This change makes it possible to fine-tune the LLaMA 3.1 model without a loss in quality.

What has been changed?

  1. Merged QKV -> separate QKV
  2. Merged Gate/Up -> separate Gate/Up
  3. Triton RMSNorm -> LLaMA RMSNorm
  4. Flash RoPE (training) -> RoPE
  5. Interleaved RoPE -> non-interleaved RoPE
  6. Core attention -> flash_attn_func
  7. Same computation device as Transformers (CPU then CUDA)
  8. Fixed a RoPE precision bug (the precision of the rotary buffer needs to be set explicitly)

-> Exact logits match during generation.
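For reference, a minimal sketch of this kind of check (illustrative only; the real test is tests/test_llama_generation.py and loads the Nanotron model through its own utilities): run both models on the same input ids and require a bitwise-identical result, not just allclose.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def check_exact_logits_match(nanotron_model, hf_path: str, prompt: str = "Hello world") -> None:
    tokenizer = AutoTokenizer.from_pretrained(hf_path)
    hf_model = AutoModelForCausalLM.from_pretrained(hf_path, torch_dtype=torch.bfloat16).cuda()

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        hf_logits = hf_model(input_ids).logits
        nanotron_logits = nanotron_model(input_ids)  # assumed to return logits of the same shape

    # Exact match, not torch.allclose: the point of the changes above is bitwise equality.
    assert torch.equal(nanotron_logits, hf_logits), "logits differ between Nanotron and Transformers"
```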

How to test it?

CUDA_LAUNCH_BLOCKING=1 CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=1 --nnodes=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:29600 --max_restarts=0 --tee=3 tests/test_llama_generation.py --ckpt-path /fsx/haojun/lighteval_evaluation_model/Llama-3-8B-split

This script compares the output logits and asserts that Nanotron's output is exactly the same as Transformers'.

Here the LLaMA 3 weights are obtained with the converter script, but the merged q/k/v and gate/up projections need to be separated, as in the sketch below.
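Splitting the merged projections is mechanical. A minimal sketch, assuming a fused layout of [q; k; v] and [gate; up] along the output dimension (the actual converter and Nanotron's tensor-parallel sharding may lay the weights out differently):

```python
import torch

def split_fused_qkv(qkv_weight: torch.Tensor, n_heads: int, n_kv_heads: int, head_dim: int):
    # Fused weight assumed to be [(n_heads + 2 * n_kv_heads) * head_dim, hidden_size]
    q_size, kv_size = n_heads * head_dim, n_kv_heads * head_dim
    q, k, v = torch.split(qkv_weight, [q_size, kv_size, kv_size], dim=0)
    return q, k, v

def split_fused_gate_up(gate_up_weight: torch.Tensor, intermediate_size: int):
    # Fused weight assumed to be [2 * intermediate_size, hidden_size]
    gate, up = torch.split(gate_up_weight, [intermediate_size, intermediate_size], dim=0)
    return gate, up
```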

Note

However, I found this a bit overkill. The performance drop is most likely due to the different RoPE implementations, so to match Transformers it is enough to set the default value of rope_interleaved to False.

Lauler commented on Aug 27, 2024

Hi @zzhhjjj. I'm interested in doing continued pretraining of existing Llama 3.1 checkpoints with Nanotron. Would you recommend using your PR now, or waiting until it is merged before starting?

If I read your comment correctly, using the current Nanotron code to fine-tune or do continued pretraining of Llama 3.1 will cause drops in performance?

Does this affect Llama 3.0 and Llama 2 as well?

zzhhjjj (Collaborator, Author) commented on Sep 2, 2024

@Lauler Hello Lauler, based on my experiments, it's better to set rope_interleaved=False if you need to convert the weights to Transformers; a sketch of flipping the flag in an existing config follows.
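If you already have a Nanotron training config, here is a minimal sketch for forcing the flag (it assumes the usual model.model_config layout of Nanotron YAML configs; the path is hypothetical):

```python
import yaml  # pyyaml

def force_non_interleaved_rope(config_path: str) -> None:
    # Set rope_interleaved to False so the trained checkpoint converts cleanly to Transformers.
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    cfg["model"]["model_config"]["rope_interleaved"] = False
    with open(config_path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)

force_non_interleaved_rope("examples/config_llama3.yaml")  # hypothetical path
```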

@zzhhjjj changed the title from "exact output logits match for LLaMA-3" to "Match Transformers RoPE implementation" on Sep 2, 2024
@3outeille self-assigned this on Sep 5, 2024
Review comments on src/nanotron/models/llama.py (resolved)
3outeille (Member) commented: lgtm

@3outeille merged commit 1456446 into huggingface:main on Sep 5, 2024
3 checks passed