
llama : refactor graph build code #3837

Merged: 21 commits into master on Nov 1, 2023
Conversation

ggerganov (Owner) commented on Oct 28, 2023:

ref #3382

  • Graph build functions no longer depend on the allocator
  • Move GPU offloading out of the build functions via a callback (a minimal sketch follows this list)
  • GPU offload for the Bloom arch (a positive side effect of the refactoring)
  • Offload result_norm when embeddings are not requested
  • Offload the token-positions tensor inp_pos as non-repeating (NR)
  • Change the offload type of KQ_mask, KQ_shift and K_shifted from KQ to NR. Is there any reason for these to be KQ?
  • Helper functions to build graphs: llama : add llm_build helper functions #3848
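
To make the callback idea concrete, here is a minimal sketch. All names in it (`tensor`, `build_cb`, `build_layer`) are assumptions for illustration; the actual callback type and signatures in llama.cpp may differ.

```cpp
#include <cstdio>
#include <functional>

// Minimal stand-in for ggml_tensor, for illustration only.
struct tensor {
    const char * name = "";
};

// The build functions no longer touch the allocator or the backend directly.
// They report every node through a caller-supplied callback; il is the layer index.
using build_cb = std::function<void(tensor * t, const char * name, int il)>;

// A graph build function (sketch): create nodes, name them, hand them to the callback.
static tensor * build_layer(tensor * cur, const build_cb & cb, int il) {
    // ... ggml ops would go here ...
    cb(cur, "ffn_out", il); // the caller decides whether this node gets offloaded
    return cur;
}

int main() {
    // The offload policy lives with the caller, outside the graph-build code:
    const build_cb cb = [](tensor * t, const char * name, int il) {
        t->name = name;
        std::printf("node %-8s layer %d -> apply offload policy here\n", name, il);
    };

    tensor inp;
    build_layer(&inp, cb, 0);
    return 0;
}
```

Because the policy is applied in one place, decisions like "offload result_norm only when embeddings are not requested" or "treat inp_pos as non-repeating" become single branches in the callback instead of per-architecture edits.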

TODO:

  • Test the arches with CUDA; something might have been broken during the refactoring

* llama : add llm_build_norm helper function

ggml-ci

* llama : add llm_build_ffn helper function (#3849)

ggml-ci

* llama : add llm_build_k_shift helper

ggml-ci

* llama : fix offloading after recent changes

* llama : add llm_build_kv_store helper

ggml-ci

* llama : remove obsolete offload names

* llama : fix llm_build_k_shift to use n_head_kv instead of n_head

* llama : simplify falcon Q, K, V computation

* llama : remove obsolete comments in build graphs

* llama : add llm_build_kqv helper

ggml-ci

* llama : minor

* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading

* llama : fix input allocation logic

* llama : update offload functions for KQ tensors

* llama : normalize tensor names

ggml-ci

* llama : enable warning about not offloaded tensors

* llama : remove extra ; + deduplicate gate_b logic

* llama : add llm_build_inp_embd helper
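
The commits above land the llm_build_* helpers from #3848. As a rough sketch of what they factor out (the helper name and signature here are assumed, not copied from llama.cpp), a llm_build_norm-style helper collapses the norm, scale, and optional bias steps that every architecture previously repeated inline:

```cpp
#include "ggml.h"

// Sketch of an llm_build_norm-style helper (assumed signature, for illustration).
// Folds the common pattern: normalize, then elementwise scale, then optional bias.
static struct ggml_tensor * build_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * w,     // norm weight
        struct ggml_tensor  * b,     // norm bias, may be NULL
        bool                  rms,   // RMS norm (e.g. LLaMA) vs standard norm
        float                 eps) {
    cur = rms ? ggml_rms_norm(ctx, cur, eps)
              : ggml_norm    (ctx, cur, eps);
    cur = ggml_mul(ctx, cur, w);     // scale
    if (b) {
        cur = ggml_add(ctx, cur, b); // shift
    }
    return cur;
}
```

Each architecture's build function then reduces to a sequence of such calls, which is what makes adding new model arches easier.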
ggerganov marked this pull request as ready for review on October 31, 2023 at 18:24
ggerganov (Owner, Author) commented:

Planning to merge this soon. It will be a miracle if I didn't break something, but I think the refactoring should make it a bit easier to add new model arches in the future.

ggerganov merged commit 71e3718 into master on Nov 1, 2023. 33 checks passed.
jxy (Contributor) commented on Nov 1, 2023:

This change broke Falcon.

cebtenzzre (Collaborator) commented:

> This change broke Falcon.

Could you be more specific? Falcon 7B appears to still work for me on both CPU and CUDA.

ggerganov (Owner, Author) commented:

The attention norm for Falcon-40B was using the wrong input tensor. It should be fixed with 523e49b.

Galunid (Collaborator) commented on Nov 6, 2023:

I can confirm Persimmon is broken:

```
GGML_ASSERT: ggml.c:3113: ggml_can_repeat_rows(b, a)

Program received signal SIGABRT, Aborted.
0x00007ffff7aac83c in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7aac83c in ?? () from /usr/lib/libc.so.6
#1  0x00007ffff7a5c668 in raise () from /usr/lib/libc.so.6
#2  0x00007ffff7a444b8 in abort () from /usr/lib/libc.so.6
#3  0x00005555555792a1 in ggml_add_impl (ctx=0x5555557d2888 <g_state+200>, a=0x7ffdf9209a90, b=0x7ffdf91fe690, inplace=false) at ggml.c:3113
#4  0x00005555555793dd in ggml_add (ctx=0x5555557d2888 <g_state+200>, a=0x7ffdf9209a90, b=0x7ffdf91fe690) at ggml.c:3137
#5  0x00005555555cd88c in llm_build_kqv (ctx=0x5555557d2888 <g_state+200>, hparams=..., kv=..., wo=0x5555578081c0, wo_b=0x555557808340, q_cur=0x7ffdf9208710, kq_scale=0x7ffdf91fe510, kq_mask=0x7ffdf91fe690, n_ctx=4096, n_tokens=1024,
    n_kv=4096, max_alibi_bias=-1, cb=..., il=0) at llama.cpp:3454
#6  0x00005555555f1aaf in llm_build_context::build_persimmon (this=0x7fffffffa6d0) at llama.cpp:4198
#7  0x00005555555ceb1f in llama_build_graph (lctx=..., batch=...) at llama.cpp:4983
#8  0x00005555555d9aed in llama_new_context_with_model (model=0x555555870bb0, params=...) at llama.cpp:8182
#9  0x0000555555649af6 in llama_init_from_gpt_params (params=...) at common/common.cpp:955
#10 0x0000555555560452 in main (argc=23, argv=0x7fffffffdb68) at examples/main/main.cpp:180
```

I'm uploading a Q4_K model to https://huggingface.co/Galunid/persimmon-gguf/tree/main

It was working when I tested with 238657d; it does not work with 71e3718.

Not sure if I should create a separate issue for this.
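
For context on the assert (a paraphrase from reading ggml around this revision, not the verbatim source): ggml_add(ctx, a, b) only permits b to be broadcast row-wise into a, roughly under the rule sketched below, so somewhere in llm_build_kqv the refactored Persimmon build now produces a pair of tensors whose shapes no longer satisfy it.

```cpp
#include <cstdint>

// Minimal stand-in for ggml_tensor's shape array, for illustration only.
struct shape {
    int64_t ne[4];
};

// Paraphrase of the broadcast rule behind ggml_can_repeat_rows(b, a):
// b may be added to a only if the row length matches exactly and every
// higher dimension of a is a whole multiple of the corresponding one of b.
static bool can_repeat_rows(const shape & b, const shape & a) {
    if (b.ne[0] != a.ne[0]) {
        return false;               // rows must match exactly
    }
    for (int i = 1; i < 4; i++) {
        if (a.ne[i] % b.ne[i] != 0) {
            return false;           // b must tile a in dims 1..3
        }
    }
    return true;
}
```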

Galunid mentioned this pull request on Nov 7, 2023
Galunid added a commit to Galunid/llama.cpp that referenced this pull request Nov 9, 2023
Galunid added a commit that referenced this pull request Nov 10, 2023
ggerganov mentioned this pull request on Nov 16, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 17, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 18, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
* llama : factor out ggml-alloc from graph build functions

ggml-ci

* metal : disable kernel load log

* llama : factor out tensor offloading outside the build call (wip)

ggml-ci

* llama : offload rest of the models

ggml-ci

* llama : update offload log messages to print node index

* llama : comments

* llama : support offloading result_norm + comments

* llama : factor graph input into a function

* llama : do tensor offload only with CUDA

* llama : fix res_norm offloading

* llama : try to optimize offloading code

* llama : fix non-CUDA build

* llama : try to fix build

* llama : move refact in correct place + optimize graph input

* llama : refactor tensor offloading as callback

* llama : add layer index to all tensor names

* llama : add functional header

* llama : comment

ggml-ci

* llama : remove obsolete map for layer counting

* llama : add llm_build helper functions (ggerganov#3848)

olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
brittlewis12 added a commit to brittlewis12/llmfarm_core.swift that referenced this pull request Nov 30, 2023
Labels: refactoring

5 participants