Match Transformers RoPE implementation #214

Merged: 4 commits merged into huggingface:main on Sep 5, 2024

Conversation

zzhhjjj (Collaborator) commented on Aug 14, 2024

Match RoPE with Transformers

What does this PR do?

  1. Adds the Transformers library's RoPE implementation to Nanotron.
  2. Sets the default value of rope_interleaved to False.

Why?

In Nanotron, we currently use the interleaved version of RoPE, which differs from the implementation in Transformers. This discrepancy seems to cause a performance gap between Nanotron and Transformers after converting the weights.
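To make the difference concrete, here is a minimal sketch of the two layouts (function names are mine, not the Nanotron or Transformers source): the interleaved variant rotates adjacent channel pairs (2i, 2i+1), while the Transformers "rotate_half" variant pairs channel i with channel i + dim/2. Both use the same frequencies, so a checkpoint trained with one layout does not behave identically under the other without converting the weights.

```python
import torch

def rope_frequencies(dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # One angle per channel pair and position: [seq_len, dim // 2]
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(torch.arange(seq_len).float(), inv_freq)

def apply_rope_transformers(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # Non-interleaved ("rotate_half") layout: channel i is paired with channel i + dim // 2.
    cos = torch.cat((freqs.cos(), freqs.cos()), dim=-1)  # [seq_len, dim]
    sin = torch.cat((freqs.sin(), freqs.sin()), dim=-1)
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

def apply_rope_interleaved(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # Interleaved layout (flash-attn style): channel 2i is paired with channel 2i + 1.
    cos = freqs.cos().repeat_interleave(2, dim=-1)  # [seq_len, dim]
    sin = freqs.sin().repeat_interleave(2, dim=-1)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((-x2, x1), dim=-1).flatten(-2)
    return x * cos + rotated * sin
```

With x of shape [..., seq_len, dim], the two functions return different tensors for the same input, which is exactly the mismatch converted checkpoints run into.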

Evaluation with lighteval

At least for LLaMA 3/3.1, the evaluation results are very close.
[lighteval results screenshot]

Note

I used this converter to maintain the order of the columns

--------------------------------- Previously ---------------------------------
It's no longer relevant to the PR, but I'm keeping the information here for reference

What has been done?

Changed the modeling code of LLaMA in Nanotron so that it produces the same output as the Transformers library.

Why?

  1. Models trained with Nanotron show a drop in performance when converted to Transformers format.
  2. This change makes it possible to fine-tune the LLaMA 3.1 model without a loss in quality.

What has been changed?

  1. Merged QKV -> separate QKV
  2. Merged Gate/Up -> separate Gate/Up
  3. Triton RMSNorm -> LLaMA RMSNorm
  4. Flash RoPE (training) -> RoPE
  5. Interleaved RoPE -> non-interleaved RoPE
  6. Core attention -> flash_attn_func
  7. Same computation device as Transformers (CPU then CUDA)
  8. Fixed a RoPE precision bug (the precision of the rotary buffer needs to be set explicitly)

-> Exact logits match during generation.
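For reference, a minimal sketch of this kind of check (illustrative only; the real test is tests/test_llama_generation.py and loads the Nanotron model through its own utilities): run both models on the same input ids and require a bitwise-identical result, not just allclose.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def check_exact_logits_match(nanotron_model, hf_path: str, prompt: str = "Hello world") -> None:
    tokenizer = AutoTokenizer.from_pretrained(hf_path)
    hf_model = AutoModelForCausalLM.from_pretrained(hf_path, torch_dtype=torch.bfloat16).cuda()

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        hf_logits = hf_model(input_ids).logits
        nanotron_logits = nanotron_model(input_ids)  # assumed to return logits of the same shape

    # Exact match, not torch.allclose: the point of the changes above is bitwise equality.
    assert torch.equal(nanotron_logits, hf_logits), "logits differ between Nanotron and Transformers"
```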

How to test it?

CUDA_LAUNCH_BLOCKING=1 CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=1 --nnodes=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:29600 --max_restarts=0 --tee=3 tests/test_llama_generation.py --ckpt-path /fsx/haojun/lighteval_evaluation_model/Llama-3-8B-split

This script compares the output logits and asserts that Nanotron's output is exactly the same as Transformers'.

Here the LLaMA 3 weights are obtained with the converter script, but the merged q/k/v and gate/up projections need to be separated, as in the sketch below.
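Splitting the merged projections is mechanical. A minimal sketch, assuming a fused layout of [q; k; v] and [gate; up] along the output dimension (the actual converter and Nanotron's tensor-parallel sharding may lay the weights out differently):

```python
import torch

def split_fused_qkv(qkv_weight: torch.Tensor, n_heads: int, n_kv_heads: int, head_dim: int):
    # Fused weight assumed to be [(n_heads + 2 * n_kv_heads) * head_dim, hidden_size]
    q_size, kv_size = n_heads * head_dim, n_kv_heads * head_dim
    q, k, v = torch.split(qkv_weight, [q_size, kv_size, kv_size], dim=0)
    return q, k, v

def split_fused_gate_up(gate_up_weight: torch.Tensor, intermediate_size: int):
    # Fused weight assumed to be [2 * intermediate_size, hidden_size]
    gate, up = torch.split(gate_up_weight, [intermediate_size, intermediate_size], dim=0)
    return gate, up
```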

Note

However, I found this a bit overkill. The performance drop is most likely due to the different RoPE implementations, so to match Transformers it is enough to set the default value of rope_interleaved to False.

Lauler commented on Aug 27, 2024

Hi @zzhhjjj. I'm interested in doing continued pretraining of existing Llama 3.1 checkpoints with Nanotron. Would you recommend using your PR now, or waiting until it is merged before starting?

If I read your comment correctly, using the current Nanotron code to fine-tune or do continued pretraining of Llama 3.1 will cause drops in performance?

Does this affect Llama 3.0 and Llama 2 as well?

zzhhjjj (Collaborator, Author) commented on Sep 2, 2024

@Lauler Hello Lauler, based on my experiments, it's better to set rope_interleaved=False if you need to convert the weights to Transformers; a sketch of flipping the flag in an existing config follows.
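If you already have a Nanotron training config, here is a minimal sketch for forcing the flag (it assumes the usual model.model_config layout of Nanotron YAML configs; the path is hypothetical):

```python
import yaml  # pyyaml

def force_non_interleaved_rope(config_path: str) -> None:
    # Set rope_interleaved to False so the trained checkpoint converts cleanly to Transformers.
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    cfg["model"]["model_config"]["rope_interleaved"] = False
    with open(config_path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)

force_non_interleaved_rope("examples/config_llama3.yaml")  # hypothetical path
```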

@zzhhjjj changed the title from "exact output logits match for LLaMA-3" to "Match Transformers RoPE implementation" on Sep 2, 2024
@3outeille self-assigned this on Sep 5, 2024
Review comments on src/nanotron/models/llama.py (resolved)
3outeille (Member) commented: lgtm

@3outeille merged commit 1456446 into huggingface:main on Sep 5, 2024
3 checks passed