Support for Phi-3 models #6849

Open
criminact opened this issue Apr 23, 2024 · 84 comments
Labels: good first issue, model

Comments

@criminact (Contributor) commented Apr 23, 2024

Microsoft recently released Phi-3 models in 3 variants (mini, small & medium). Can we add support for this new family of models?

@criminact (Contributor, Author)


Model directly works 👍

GGUF link - https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf
Command - main -m Phi-3-mini-4k-instruct-q4.gguf -p "<|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nHow to explain Internet for a medieval knight?<|end|>\n<|assistant|>"

@K-Mistele (Contributor)

Have you tested compatibility with the server? There probably needs to be a new prompt template since it's not compatible with the current ones AFAIK. Happy to dig into this in the next couple of days.
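For reference, here is a minimal sketch (plain Python, with a hypothetical build_phi3_prompt helper, not anything from llama.cpp) of the prompt layout used in the command above, which is what a new server template would need to produce; exact whitespace may differ from the template embedded in the GGUF metadata:

# Minimal sketch of the Phi-3 instruct prompt layout shown in the command above.
# build_phi3_prompt is a hypothetical helper, not part of llama.cpp.
def build_phi3_prompt(messages):
    prompt = ""
    for role, content in messages:  # role is "system", "user" or "assistant"
        prompt += f"<|{role}|>\n{content}<|end|>\n"
    return prompt + "<|assistant|>"  # leave the assistant turn open for generation

print(build_phi3_prompt([
    ("system", "You are a helpful AI assistant."),
    ("user", "How to explain Internet for a medieval knight?"),
]))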

@sorasoras

I believe llama.cpp does not support longrope, which is used by the 128k variant.

@LiuChaoXD

I believe llama.cpp does not support longrope, which is used by the 128k variant.

Yeah, I tried to convert the 128K version (python convert.py ....) and it raised:
NotImplementedError: Unknown rope scaling type: longrope

@MoonRide303

Also NotImplementedError: Architecture 'Phi3ForCausalLM' not supported! from convert-hf-to-gguf.py.

@apepkuss

@MoonRide303 Same error with convert-hf-to-gguf.py.

@candre23

Model directly works 👍

Only partially. MS is using some new rope technique they're calling "longrope". As-is, llama.cpp will work OK for the first few gens but will then abruptly go insane. This new longrope thing is likely the culprit.

@K-Mistele (Contributor)

Ah yes - it looks like they published the paper in April. Details here, PDF here

@Dampfinchen

This model is insane for its size.

@mirek190 commented Apr 23, 2024

Template for llama.cpp:

main.exe --model models/new3/Phi-3-mini-4k-instruct-fp16.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

@MoonRide303

I have to admit it seems to be pretty smart, even the smallest 3.8B one - it looks worthy of full support. Many 7B and bigger models struggled with those simple tests, yet Phi-3 handled them pretty nicely.

@mirek190 commented Apr 23, 2024

Tested with llama.cpp, fp16 and Q8 versions.

Do you also have a problem with it generating tokens until you manually stop it?

I had to add:
-r "----" -r "---" -r "<|end|>>" -r "### Answer:"

@criminact (Contributor, Author)

@mirek190 You mean the model doesn't stop generating? Yeah, I faced that too. PR #6851 handles it.

@mirek190 commented Apr 23, 2024


Not too bad ... not at the level of Llama 3 8B, but still.

phi-3

A father and son are in a car accident where the father is killed. The ambulance brings the son to the hospital. He needs immediate surgery. In the operating room, the surgeon looks at the boy and says "I can't operate on him, he's my son!"
<|end|>\n<|assistant|>This scenario appears to be a hypothetical situation rather than an actual event. However, it is intended to evoke thought about ethical dilemmas in the medical profession.

In reality, such situations are highly unlikely due to professional and personal boundaries that doctors maintain, even if they happen to know the patient personally through other connections (e.g., being friends or acquaintances).

To address this hypothetical situation responsibly:

1. The surgeon would not be able to perform surgery on their biological son based solely on familial relations due to ethical standards that prevent personal relationships from interfering with professional duties and patient care.

2. If the boy's father, who is no longer alive in this scenario, had appointed a close family member or friend as his medical proxy for such circumstances, then it would be possible for them (a relative) to make decisions on behalf of their son.

3. In such cases where an immediate decision must be made about the child's treatment but no designated proxy is available, hospital ethics committees or legal authorities may need to step in to determine the best course of action for the patient's well-being.

It is important to remember that real-life situations involving surgeons and their family members are handled with utmost care for the welfare of both parties involved, always prioritizing medical ethics and professional obligations.

llama 3

A father and son are in a car accident where the father is killed. The ambulance brings the son to the hospital. He needs immediate surgery. In the operating room, the surgeon looks at the boy and says "I can't operate on him, he's my son!"
<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThis is a classic lateral thinking puzzle!

The answer lies in the fact that the surgeon is not the biological father of the boy. The correct interpretation is that the surgeon is an adoptive parent or a stepfather.

If we consider other possible scenarios, it's also possible that the surgeon is a woman who has adopted the son or is his stepmother. In any case, the key point is that the surgeon is not biologically related to the boy as his father.

Llama 3 is on a totally different level compared to Phi-3 ...

@tristandruyen (Contributor)

Doing my part by adding the chat template :) #6857

@dspasyuk (Contributor)

The model seems to be working fine on my end; it just endlessly generates text. ../llama.cpp/main --model /home/denis/Downloads/phi-3-mini-4k-instruct.Q8_0.gguf --n-gpu-layers 35 -ins --interactive --keep -1 --n-predict -1 --simple-io -b 2048 --ctx_size 0 --temp 0.1 --top_k 10 -mg 0 --multiline-input --repeat_penalty 1.12 -t 4 -r "/n>" -p <|system|>Your name is Alice. You are kind, honest, logical, precise, good at writing and mathematics assistant. <|end|>

@criminact (Contributor, Author)

Closing this since PR #6857 was merged into master with support for the Phi-3 4K context length.

@s-kostyaev

What about the 128k context length variant?

@lukestanley

Support for 128K context length seems pretty important to me for "Phi-3" support to be considered "done", right? @criminact

@criminact (Contributor, Author)

Status: Phi-3 4K models are supported in master after the #6857 merge.

Phi-3 128K models aren't supported yet (as of 24th Apr 2024).

@phalexo commented Apr 25, 2024

Template for llama.cpp:

main.exe --model models/new3/Phi-3-mini-4k-instruct-fp16.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

Are templates different for 4K vs. 128K?

@jtomek commented Apr 25, 2024

Hi guys, what should I do about this error?
unknown model architecture: 'phi3'

I fine-tuned my own phi-3 and converted it to gguf with this command:
python llama.cpp/convert-hf-to-gguf.py midesk-private --outfile midesk-private-gguf-4k-v0.0.gguf

I get the error when I run

from llama_cpp import Llama
llm = Llama(
      model_path="./midesk-private-gguf-4k-v0.0.gguf"
)

I would be very thankful for any help or push in the right direction.

@phalexo commented Apr 25, 2024

With a reduced context size of 60000 I can load a 128K model. The prompting is still messed up, though.

./main --model /opt/data/pjh64/Phi-3-mini-128K-Instruct.gguf/phi-3-mini-128K-Instruct_q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 60000 --interactive -ins -ngl 33 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

main: interactive mode on.
Reverse prompt: '### Instruction:

'
Input prefix: '<|user|>\n'
Input suffix: '<|end|>\n<|assistant|>'
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 60000, n_batch = 2048, n_predict = -1, n_keep = 12

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

<|system|>You are a helpful assistant.<|end|>\n

<|user|>\nHello. Tell me a story.
<|end|>\n<|assistant|>Once upon a time, in the serene land of Veridia, there was an enchanted forest known as Luminae Woods. The woodland sparkled under moonlight with each tree shimmering like living stars.

In this mystical place lived Elara, a beautiful young maiden blessed with iridescent hair and eyes that mirrored the depth of the cosmos.

Elara had one unique trait - she could converse with nature itself. She conversed with trees whispering secrets in rustling leaves, birds humming songs only they could understand.

One fateful day, a dark cloud loomed over Veridia. A malicious sorcerer named Malachar desired to steal the magical essence of Veridia for his own nefarious purposes.

Upon hearing this news, Elara decided she wouldn't let her homeland fall into despair. With bravery coursing through her veins and courage in her heart, she embarked on a perilous quest to stop the wicked sorcerer.

With each passing day, Elara encountered numerous trials that tested her courage, wisdom, and resilience. She journeyed across treacherous terrains, braved wild beasts and outsmarted magical illusions crafted by Malachar himself.

As Elara ventured deeper into the darkness of Maleficent's lair, she came face-to-face with the sorcerer. A battle of magic unfolded - a clash between good and evil, light against dark.

Despite feeling overwhelmed by Malachar's mightier spells, Elara held on to her heart's purity, believing in herself and her mission for Veridia's peace.

In the end, it was Elara who prevailed. With a final surge of magic she wielded from within, she vanquished Malachar, breaking his dark curse over Veridia.

Afterwards, with peace restored to Veridia and its inhabitants living in harmony once more, Elara became the beloved guardian of Luminae Woods, continuing her duty as the voice of nature itself.

Thus ends a tale about courage, goodness, and the power that resides within us all. It's a timeless story of how one person can make an immense difference in preserving peace and harmony.

And so, dear listener, let this legend inspire you to face your own battles with bravery and integrity, for it is these virtues which truly define the worthiness of any individual or character.<|end|>

<|user|>\n

@MoonRide303

@mirek190 Sadly passkey doesn't support most of the main and server options, including -fa.

@halbtuerke

I tested this PR on CPU (not enough VRAM to handle 128k context)

@MoonRide303 do you know how much VRAM is required for handling 128K tokens?

@MoonRide303 commented May 13, 2024

I tested this PR on CPU (not enough VRAM to handle 128k context)

@MoonRide303 do you know how much VRAM is required for handling 128K tokens?

It might be different for main & server with some options, but it seems to be around 50 GB for passkey - I was getting this error when I tried to launch the CUDA version of passkey for Phi-3-mini-128k-instruct-Q6_K.gguf:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 49248.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
main: error: failed to create the llama_context

I am not sure which option should be used to decrease the memory requirements for the KV cache - I tried adding -ctk q4_0 -ctv q4_0 when launching the server with bigger ctx sizes (64k and 128k), but while it decreased the KV size, llama.cpp stopped working right after that, with this error:

GGML_ASSERT: D:\repos-git\llama.cpp\ggml-cuda\cpy.cu:402: ggml_nbytes(src1) <= INT_MAX

With the default f16 type for KV cache I am able to launch server with up to -c 61440 (on a 16 GB VRAM GPU).
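As a rough sanity check on those numbers, here is a back-of-the-envelope KV-cache estimate. The layer count and embedding width below are Phi-3-mini values taken from the model card, so treat them as assumptions rather than llama.cpp output:

# Rough KV-cache estimate: 2 (K and V) * n_layer * n_ctx * n_embd * bytes per element.
# n_layer and n_embd are Phi-3-mini values from the model card (assumptions here);
# llama.cpp may also pad n_ctx slightly, so expect small differences.
n_layer, n_embd = 32, 3072

def kv_cache_mib(n_ctx, bytes_per_elem=2.0):  # 2.0 bytes per element = f16
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem / 1024**2

print(kv_cache_mib(131072))                        # ~49152 MiB at f16, close to the 49248 MiB cudaMalloc above
print(kv_cache_mib(131072, bytes_per_elem=0.5625)) # ~13824 MiB if the whole cache were q4_0 (18 bytes per 32 values)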

@hchasens

I know 128k support is already on the TODO list, but I thought I'd add how nice it would be, since there are almost no other models with a context length that size. Llama 3 is only 8k, so it'll be a very big deal when this is released.

@flatsiedatsie

Absolutely. Wllama already has verified support for the 4K version, and added an additional fix for it.

I believe it will be the most important model for browser-based AI for a while. I know Transformers.js has already added support for it, with a great demo too, and WebLLM support seems to be on the way as well. But both of those require WebGPU support.

Wllama works with CPU only, so there will be a (slower) fallback option for Safari and Firefox users, finally making Phi 3 128K universally available in the browser. And, importantly, with some headroom to really use that large context, even with a 4GB total memory limit.

@tomasmcm

New models are out. Not sure which ones are supported and which ones need changing. But probably all the 128K versions have the same issue.

         Short Context                 Long Context
Mini     4K [HF] ; [ONNX] ; [GGUF]     128K [HF] ; [ONNX]
Small    8K [HF] ; [ONNX]              128K [HF] ; [ONNX]
Medium   4K [HF] ; [ONNX]              128K [HF] ; [ONNX]
Vision   -                             128K [HF]

@s-smits commented May 21, 2024

I was able to convert the Medium models without issues but have not tested them yet. The Small 128k models apparently use a new rope scaling method called 'su'.
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 7392272384 (7B)
Traceback (most recent call last):
  File "/content/llama.cpp/convert.py", line 1714, in <module>
    main()
  File "/content/llama.cpp/convert.py", line 1647, in main
    params = Params.load(model_plus)
  File "/content/llama.cpp/convert.py", line 334, in load
    params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
  File "/content/llama.cpp/convert.py", line 240, in loadHFTransformerJson
    raise NotImplementedError(f'Unknown rope scaling type: {typ}')
NotImplementedError: Unknown rope scaling type: su
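If anyone wants to check which scaling type a checkpoint declares before running the converter, the value convert.py is choking on comes straight from the model's config.json. A minimal sketch (the local path is just an example):

import json

# Inspect the rope_scaling entry that convert.py reads (local path is just an example).
with open("Phi-3-small-128k-instruct/config.json") as f:
    cfg = json.load(f)

rope_scaling = cfg.get("rope_scaling") or {}
# Prints "su" on the 128k variants as originally released (later renamed "longrope");
# the 4k/8k variants have no rope_scaling block at all.
print(rope_scaling.get("type"))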

@dmsweetser

I think "su" is the same for mini, so hopefully any current effort will carry these through as well.

@MoonRide303

When trying to convert Phi-3-small-8k-instruct:

NotImplementedError: Architecture 'Phi3SmallForCausalLM' not supported!

Also different tokenizer - based on cl100k_base tiktoken, adapted to support ChatML:
https://huggingface.co/microsoft/Phi-3-small-8k-instruct/blob/main/tokenization_phi3_small.py#L15
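For anyone poking at that tokenizer: the linked tokenization_phi3_small.py basically wraps the stock cl100k_base encoding and adds extra special tokens. A rough sketch of the same idea using tiktoken's documented extension pattern (the special-token names and IDs below are placeholders, not the actual Phi-3-small assignments):

import tiktoken

# Start from the stock cl100k_base encoding and add ChatML-style special tokens,
# following the extension pattern from tiktoken's README. The token IDs below are
# placeholders, not the actual Phi-3-small vocabulary assignments.
base = tiktoken.get_encoding("cl100k_base")
phi3_small_like = tiktoken.Encoding(
    name="phi3_small_sketch",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<|im_start|>": 100264,  # placeholder ID
        "<|im_end|>": 100265,    # placeholder ID
    },
)
print(phi3_small_like.encode("<|im_start|>user", allowed_special="all"))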

@coder543

It looks like #7225 was merged. Is there any other outstanding work on the Phi-3 models?

Does the new Phi-3 vision model work?

@candre23

It looks like #7225 was merged. Is there any other outstanding work on the Phi-3 models?

Does the new Phi-3 vision model work?

Yes, small is still not working, though it may just be a naming issue? There are also some conflicting reports of functionality and/or quality. More discussion here: #7439

@RLXIWC commented May 31, 2024

Hi guys, what should I do about this error? unknown model architecture: 'phi3'

I fine-tuned my own phi-3 and converted it to gguf with this command: python llama.cpp/convert-hf-to-gguf.py midesk-private --outfile midesk-private-gguf-4k-v0.0.gguf

I get the error when I run

from llama_cpp import Llama
llm = Llama(
      model_path="./midesk-private-gguf-4k-v0.0.gguf"
)

I would be very thankful for any help or push in the right direction.

Have you found a solution for that? I'm facing the same issue.

@Galunid (Collaborator) commented May 31, 2024

@RLXIWC Update your llama-cpp-python version.

@RLXIWC commented Jun 2, 2024

I am already using version 0.2.76, but I still get the error. The model is Phi-3-mini-4k-instruct-q4.gguf from Hugging Face.

@Galunid (Collaborator) commented Jun 2, 2024

unknown model architecture: 'phi3'

That error comes from the llama.cpp backend; it happens only when there's no phi3 entry in this mapping.


That means there's some issue with your llama-cpp-python. Maybe you reinstalled the library and the backend didn't get recompiled, or something was cached? Try recreating your venv and make sure everything is correctly installed. I tried the latest version (0.2.76) with a phi3 model and it worked as expected, with the same code as yours.
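One way to see which side is at fault is to read the architecture key straight out of the GGUF header with the gguf Python package (pip install gguf). A small sketch, noting that the exact way string values are pulled out of the reader may differ between gguf-py versions:

from gguf import GGUFReader  # pip install gguf

# Read general.architecture from the GGUF header. If this prints "phi3", the file
# itself is fine and the problem is the installed llama-cpp-python build.
reader = GGUFReader("midesk-private-gguf-4k-v0.0.gguf")
arch = reader.fields["general.architecture"]
print(bytes(arch.parts[-1]).decode("utf-8"))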

@d-kleine commented Jun 8, 2024

I am having a similar issue, not with 'Phi3SmallForCausalLM' but with 'Phi3ForSequenceClassification', on the newest llama.cpp release (d4d915d):

NotImplementedError: Architecture 'Phi3ForSequenceClassification' not supported!

(edit: removed link)

@mzwing commented Jun 9, 2024

I am having a similar issue, not with 'Phi3SmallForCausalLM' but with 'Phi3ForSequenceClassification', on the newest llama.cpp release (d4d915d)

NotImplementedError: Architecture 'Phi3ForSequenceClassification' not supported!

Found someone else having the same problem on HF: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/discussions/3

Maybe you should use the convert-hf-to-gguf.py script.

@d-kleine commented Jun 9, 2024

Maybe you should use the convert-hf-to-gguf.py script.

I have used convert-hf-to-gguf.py

I was trying to convert a Phi-3 mini (3.8B) based LLM that uses the Phi3ForSequenceClassification architecture, a variant of the Phi-3 language model with a sequence classification head (a linear layer) on top, to f16 GGUF with llama.cpp. It seems Phi3ForSequenceClassification has not yet been implemented in llama.cpp's convert-hf-to-gguf.py: there is a decorator for Phi3ForCausalLM but not yet for Phi3ForSequenceClassification.

Linking #7439 in any case too
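For context, convert-hf-to-gguf.py maps Hugging Face architecture names to converter classes with a registration decorator. A rough, untested sketch of what hooking up the classification variant could look like, meant to live inside convert-hf-to-gguf.py (the extra architecture string, the "score" tensor name, and the decision to drop the head are all assumptions, not an actual patch):

# Rough, untested sketch inside convert-hf-to-gguf.py, not an actual patch:
# register the classification architecture on the existing Phi-3 class and drop
# the classification head, whose tensors have no slot in the PHI3 graph.
@Model.register("Phi3ForCausalLM", "Phi3ForSequenceClassification")
class Phi3MiniModel(Model):
    model_arch = gguf.MODEL_ARCH.PHI3

    def modify_tensors(self, data_torch, name, bid):
        if name.startswith("score"):  # assumed name of the classification head
            return []                 # skip it; only the base language model is converted
        return super().modify_tensors(data_torch, name, bid)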

@bartowski1182 (Contributor)

Looks like the new Phi-3 from today uses Microsoft's new longrope, which is still unsupported.

@coder543 commented Jul 2, 2024

@mann1x commented Jul 3, 2024

Sent a request to the MS team on HF to support the longrope implementation if they can:

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/88

@maziyarpanahi

The newly released model openbmb/MiniCPM3-4B also has a similar issue:

https://huggingface.co/openbmb/MiniCPM3-4B

NotImplementedError: Unknown rope scaling type: longrope
