
llama : add option to render special/control tokens #6807

Merged: ggerganov merged 4 commits into master from gg/render-control-tokens on Apr 21, 2024

Conversation

ggerganov (Owner) commented Apr 21, 2024

fix #6770

Setting special == true in llama_token_to_piece() will cause special/control tokens' text to be rendered in the output:

llama.cpp/llama.h

Lines 827 to 837 in 1f45c2a

// Token Id -> Piece.
// Uses the vocabulary in the provided context.
// Does not write null terminator to the buffer.
// User code is responsible to remove the leading whitespace of the first non-BOS token when decoding multiple tokens.
// @param special If true, special tokens are rendered in the output.
LLAMA_API int32_t llama_token_to_piece(
    const struct llama_model * model,
                 llama_token   token,
                        char * buf,
                     int32_t   length,
                        bool   special);
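
As a usage illustration (a minimal sketch, not code from this PR), a caller might wrap the call like this; the grow-and-retry handling assumes the llama.cpp convention that a negative return value is the negated buffer size required:

#include <string>
#include <vector>

#include "llama.h"

// Render one token to text; pass special = true to include special/control
// tokens (e.g. <|eot_id|>) in the output instead of dropping them.
static std::string token_to_piece(const struct llama_model * model, llama_token token, bool special) {
    std::vector<char> buf(8);
    int32_t n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), special);
    if (n < 0) {
        buf.resize(-n); // buffer was too small: -n is the size actually needed
        n = llama_token_to_piece(model, token, buf.data(), (int32_t) buf.size(), special);
    }
    return std::string(buf.data(), n);
}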

Contributor commented:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 215 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=22612.46ms p(95)=38873.62ms fails=, finish reason: stop=101 truncated=114
  • Prompt processing (pp): avg=269.66tk/s p(95)=800.94tk/s
  • Token generation (tg): avg=23.51tk/s p(95)=26.01tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/render-control-tokens commit=ed5d273c4dcc075a86b94a831bb825fb98519ce0

[chart: llamacpp:prompt_tokens_seconds (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 215 iterations)]
[chart: llamacpp:predicted_tokens_seconds (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 215 iterations)]

[chart: llamacpp:kv_cache_usage_ratio (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 215 iterations)]
[chart: llamacpp:requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 215 iterations)]

ggerganov (Owner, Author) commented:

> phi-2-q4_0: 215 iterations

Performance dropped - maybe generation does not stop properly after the #6745 EOG changes?
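
For context on this hypothesis (an illustrative sketch, not the server's actual code): the EOG changes generalize the stop check from the single hard-coded EOS id to any end-of-generation token via llama_token_is_eog() in llama.h, so a generation loop stops on something like:

// Stop when the sampled token is any end-of-generation token (EOS, EOT, ...),
// rather than comparing against the classic EOS id only.
if (llama_token_is_eog(model, new_token_id)) {
    break; // end generation for this sequence
}

If the model's chat markers never map to such a token, the loop never breaks and requests run to the length limit, which would match the drop observed above.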

ngxson (Collaborator) commented Apr 21, 2024:

> Performance dropped - maybe generation does not stop properly after the #6745 EOG changes?

Very likely, because we're using the phi-2 model, which has no native support for chatml (so <|im_end|> is not a single token; it is split into multiple tokens).

Edit: The simple fix is to bring back the line llama_params["stop"].push_back("<|im_end|>"); in server/utils.hpp. Only chatml's <|im_end|> needs this special treatment; other templates such as gemma or llama3 don't.
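
To make the failure mode concrete: when the vocabulary has no dedicated <|im_end|> token, the marker tokenizes into several pieces, so a token-level EOG check never fires and the server has to fall back to string matching. A hedged sketch of such a check (the names of the two boolean flags are assumed from the llama.h of roughly this period):

#include <cstring>

#include "llama.h"

// Returns true if `text` maps to exactly one token in the model's vocab.
// For phi-2 this is false for "<|im_end|>" (hence the string stop word);
// on a chatml-native vocab it is a single control token.
static bool is_single_token(const struct llama_model * model, const char * text) {
    llama_token toks[16];
    const int32_t n = llama_tokenize(model, text, (int32_t) strlen(text),
                                     toks, 16,
                                     /*add_special=*/ false,
                                     /*parse_special=*/ true);
    return n == 1;
}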

ggerganov (Owner, Author) commented:

I think we are incorrectly using a base model instead of an instruction-tuned one for this test:

https://huggingface.co/microsoft/phi-2

[image: screenshot from the model card]

The phi-2 model does not support any chat template because it is a base model. We have to change the model used in the benchmark to an instruction-tuned one.

ngxson (Collaborator) commented Apr 21, 2024:

> The phi-2 model does not support any chat template because it is a base model. We have to change the model used in the benchmark to an instruction-tuned one.

Ah yeah that's right. We can use dolphin-phi2 then. Here is the link: https://huggingface.co/TheBloke/dolphin-2_6-phi-2-GGUF

The <|im_start|>, <|im_end|> and chat template of the HF model are all correct: https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2/blob/main/tokenizer_config.json#L325
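
For reference, the chatml framing that these tokens implement looks like this (standard ChatML layout, shown for illustration):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant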

ggerganov merged commit 40f74e4 into master on Apr 21, 2024 (61 of 64 checks passed)
ggerganov deleted the gg/render-control-tokens branch on Apr 21, 2024 at 15:36
okuvshynov pushed a commit to okuvshynov/llama.cpp that referenced this pull request on Apr 22, 2024:
* make : fix common dep on llama.h

* llama : add option to render special tokens

* readme : add API change notice

ggml-ci

* swift : fix build
Successfully merging this pull request may close this issue:

Special tokens are not rendered correctly (as empty) -- llama3 specific? (#6770)