Support for Phi-3 models #6849

Open
criminact opened this issue Apr 23, 2024 · 84 comments
Labels: good first issue, model

Comments

@criminact (Contributor) commented Apr 23, 2024

Microsoft recently released Phi-3 models in 3 variants (mini, small & medium). Can we add support for this new family of models?

@criminact (Contributor, Author)


Model directly works 👍

GGUF link - https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf
Command - main -m Phi-3-mini-4k-instruct-q4.gguf -p "<|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nHow to explain Internet for a medieval knight?<|end|>\n<|assistant|>"

@K-Mistele (Contributor)

Have you tested compatibility with the server? There probably needs to be a new prompt template since it's not compatible with the current ones AFAIK. Happy to dig into this in the next couple of days.
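For reference, here is a minimal sketch (plain Python, with a hypothetical build_phi3_prompt helper, not anything from llama.cpp) of the prompt layout used in the command above, which is what a new server template would need to produce; exact whitespace may differ from the template embedded in the GGUF metadata:

# Minimal sketch of the Phi-3 instruct prompt layout shown in the command above.
# build_phi3_prompt is a hypothetical helper, not part of llama.cpp.
def build_phi3_prompt(messages):
    prompt = ""
    for role, content in messages:  # role is "system", "user" or "assistant"
        prompt += f"<|{role}|>\n{content}<|end|>\n"
    return prompt + "<|assistant|>"  # leave the assistant turn open for generation

print(build_phi3_prompt([
    ("system", "You are a helpful AI assistant."),
    ("user", "How to explain Internet for a medieval knight?"),
]))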

@sorasoras

I believe llama.cpp does not support longrope, which is used by the 128k variant.

@LiuChaoXD

I believe llama.cpp does not support longrope, which is used by the 128k variant.

Yeah, I tried to convert the 128K version (python convert.py ....) and it raised:
NotImplementedError: Unknown rope scaling type: longrope

@MoonRide303

Also NotImplementedError: Architecture 'Phi3ForCausalLM' not supported! from convert-hf-to-gguf.py.

@apepkuss

@MoonRide303 Same error with convert-hf-to-gguf.py.

@candre23

Model directly works 👍

Only partially. MS is using some new rope technique they're calling "longrope". As-is, llama.cpp will work OK for the first few gens but will then abruptly go insane. This new longrope thing is likely the culprit.

@K-Mistele (Contributor)

Ah yes - it looks like they published the paper in April. Details here, PDF here

@Dampfinchen

This model is insane for its size.

@mirek190 commented Apr 23, 2024

Template for llama.cpp:

main.exe --model models/new3/Phi-3-mini-4k-instruct-fp16.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

@MoonRide303

I have to admit it seems to be pretty smart, even the smallest 3.8B one - it looks worthy of full support. Many 7B and bigger models struggled with those simple tests, yet Phi-3 handled them pretty nicely.

@mirek190 commented Apr 23, 2024

Tested with llama.cpp, fp16 and Q8 versions.

Do you also have a problem with it generating tokens until you manually stop it?

I had to add:
-r "----" -r "---" -r "<|end|>>" -r "### Answer:"

@criminact (Contributor, Author)

@mirek190 You mean the model doesn't stop generating? Yeah, I faced that too. PR #6851 handles it.

@mirek190 commented Apr 23, 2024


Not too bad ... not at the level of Llama 3 8B, but still.

phi-3

A father and son are in a car accident where the father is killed. The ambulance brings the son to the hospital. He needs immediate surgery. In the operating room, the surgeon looks at the boy and says "I can't operate on him, he's my son!"
<|end|>\n<|assistant|>This scenario appears to be a hypothetical situation rather than an actual event. However, it is intended to evoke thought about ethical dilemmas in the medical profession.

In reality, such situations are highly unlikely due to professional and personal boundaries that doctors maintain, even if they happen to know the patient personally through other connections (e.g., being friends or acquaintances).

To address this hypothetical situation responsibly:

1. The surgeon would not be able to perform surgery on their biological son based solely on familial relations due to ethical standards that prevent personal relationships from interfering with professional duties and patient care.

2. If the boy's father, who is no longer alive in this scenario, had appointed a close family member or friend as his medical proxy for such circumstances, then it would be possible for them (a relative) to make decisions on behalf of their son.

3. In such cases where an immediate decision must be made about the child's treatment but no designated proxy is available, hospital ethics committees or legal authorities may need to step in to determine the best course of action for the patient's well-being.

It is important to remember that real-life situations involving surgeons and their family members are handled with utmost care for the welfare of both parties involved, always prioritizing medical ethics and professional obligations.

llama 3

A father and son are in a car accident where the father is killed. The ambulance brings the son to the hospital. He needs immediate surgery. In the operating room, the surgeon looks at the boy and says "I can't operate on him, he's my son!"
<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThis is a classic lateral thinking puzzle!

The answer lies in the fact that the surgeon is not the biological father of the boy. The correct interpretation is that the surgeon is an adoptive parent or a stepfather.

If we consider other possible scenarios, it's also possible that the surgeon is a woman who has adopted the son or is his stepmother. In any case, the key point is that the surgeon is not biologically related to the boy as his father.

Llama 3 is on a totally different level compared to Phi-3 ...

@tristandruyen (Contributor)

Doing my part by adding the chat template :) #6857

@dspasyuk (Contributor)

The model seems to be working fine on my end; it just endlessly generates text. ../llama.cpp/main --model /home/denis/Downloads/phi-3-mini-4k-instruct.Q8_0.gguf --n-gpu-layers 35 -ins --interactive --keep -1 --n-predict -1 --simple-io -b 2048 --ctx_size 0 --temp 0.1 --top_k 10 -mg 0 --multiline-input --repeat_penalty 1.12 -t 4 -r "/n>" -p <|system|>Your name is Alice. You are kind, honest, logical, precise, good at writing and mathematics assistant. <|end|>

@criminact (Contributor, Author)

Closing this since PR #6857 was merged into master with support for the Phi-3 4K context length.

@s-kostyaev

What about the 128k context length variant?

@lukestanley

Support for 128K context length seems pretty important to me for "Phi-3" support to be considered "done", right? @criminact

@criminact (Contributor, Author)

Status: Phi-3 4K models are supported in master after the #6857 merge.

Phi-3 128K models aren't supported yet (as of 24th Apr 2024).

@phalexo commented Apr 25, 2024

Template for llama.cpp:

main.exe --model models/new3/Phi-3-mini-4k-instruct-fp16.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

Are templates different for 4K vs. 128K?

@jtomek commented Apr 25, 2024

Hi guys, what should I do about this error?
unknown model architecture: 'phi3'

I fine-tuned my own phi-3 and converted it to gguf with this command:
python llama.cpp/convert-hf-to-gguf.py midesk-private --outfile midesk-private-gguf-4k-v0.0.gguf

I get the error when I run

from llama_cpp import Llama
llm = Llama(
      model_path="./midesk-private-gguf-4k-v0.0.gguf"
)

I would be very thankful for any help or push in the right direction.

@phalexo commented Apr 25, 2024

With a reduced context size of 60000 I can load a 128K model. The prompting is still messed up, though.

./main --model /opt/data/pjh64/Phi-3-mini-128K-Instruct.gguf/phi-3-mini-128K-Instruct_q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 60000 --interactive -ins -ngl 33 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

main: interactive mode on.
Reverse prompt: '### Instruction:

'
Input prefix: '<|user|>\n'
Input suffix: '<|end|>\n<|assistant|>'
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 60000, n_batch = 2048, n_predict = -1, n_keep = 12

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.

<|system|>You are a helpful assistant.<|end|>\n

<|user|>\nHello. Tell me a story.
<|end|>\n<|assistant|>Once upon a time, in the serene land of Veridia, there was an enchanted forest known as Luminae Woods. The woodland sparkled under moonlight with each tree shimmering like living stars.

In this mystical place lived Elara, a beautiful young maiden blessed with iridescent hair and eyes that mirrored the depth of the cosmos.

Elara had one unique trait - she could converse with nature itself. She conversed with trees whispering secrets in rustling leaves, birds humming songs only they could understand.

One fateful day, a dark cloud loomed over Veridia. A malicious sorcerer named Malachar desired to steal the magical essence of Veridia for his own nefarious purposes.

Upon hearing this news, Elara decided she wouldn't let her homeland fall into despair. With bravery coursing through her veins and courage in her heart, she embarked on a perilous quest to stop the wicked sorcerer.

With each passing day, Elara encountered numerous trials that tested her courage, wisdom, and resilience. She journeyed across treacherous terrains, braved wild beasts and outsmarted magical illusions crafted by Malachar himself.

As Elara ventured deeper into the darkness of Maleficent's lair, she came face-to-face with the sorcerer. A battle of magic unfolded - a clash between good and evil, light against dark.

Despite feeling overwhelmed by Malachar's mightier spells, Elara held on to her heart's purity, believing in herself and her mission for Veridia's peace.

In the end, it was Elara who prevailed. With a final surge of magic she wielded from within, she vanquished Malachar, breaking his dark curse over Veridia.

Afterwards, with peace restored to Veridia and its inhabitants living in harmony once more, Elara became the beloved guardian of Luminae Woods, continuing her duty as the voice of nature itself.

Thus ends a tale about courage, goodness, and the power that resides within us all. It's a timeless story of how one person can make an immense difference in preserving peace and harmony.

And so, dear listener, let this legend inspire you to face your own battles with bravery and integrity, for it is these virtues which truly define the worthiness of any individual or character.<|end|>

<|user|>\n

@MoonRide303

@mirek190 Sadly passkey doesn't support most of the main and server options, including -fa.

@halbtuerke

I tested this PR on CPU (not enough VRAM to handle 128k context)

@MoonRide303 do you know how much VRAM is required for handling 128K tokens?

@MoonRide303 commented May 13, 2024

I tested this PR on CPU (not enough VRAM to handle 128k context)

@MoonRide303 do you know how much VRAM is required for handling 128K tokens?

It might be different for main & server with some options, but it seems to be around 50 GB for passkey - I was getting this error when I tried to launch the CUDA version of passkey for Phi-3-mini-128k-instruct-Q6_K.gguf:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 49248.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
main: error: failed to create the llama_context

I am not sure which option should be used to decrease the memory requirements for the KV cache - I tried adding -ctk q4_0 -ctv q4_0 when launching the server with bigger ctx sizes (64k and 128k), but while it decreased the KV size, llama.cpp stopped working right after that, with this error:

GGML_ASSERT: D:\repos-git\llama.cpp\ggml-cuda\cpy.cu:402: ggml_nbytes(src1) <= INT_MAX

With the default f16 type for KV cache I am able to launch server with up to -c 61440 (on a 16 GB VRAM GPU).
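As a rough sanity check on those numbers, here is a back-of-the-envelope KV-cache estimate. The layer count and embedding width below are Phi-3-mini values taken from the model card, so treat them as assumptions rather than llama.cpp output:

# Rough KV-cache estimate: 2 (K and V) * n_layer * n_ctx * n_embd * bytes per element.
# n_layer and n_embd are Phi-3-mini values from the model card (assumptions here);
# llama.cpp may also pad n_ctx slightly, so expect small differences.
n_layer, n_embd = 32, 3072

def kv_cache_mib(n_ctx, bytes_per_elem=2.0):  # 2.0 bytes per element = f16
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem / 1024**2

print(kv_cache_mib(131072))                        # ~49152 MiB at f16, close to the 49248 MiB cudaMalloc above
print(kv_cache_mib(131072, bytes_per_elem=0.5625)) # ~13824 MiB if the whole cache were q4_0 (18 bytes per 32 values)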

@hchasens

I know 128k support is already on the TODO list, but I thought I'd add how nice it would be, since there are almost no other models with a context length that size. Llama 3 is only 8k, so it'll be a very big deal when this is released.

@flatsiedatsie

Absolutely. Wllama already has verified support for the 4K version, and added an additional fix for it.

I believe it will be the most important model for browser-based AI for a while. I know Transformers.js has already added support for it, with a great demo too, and WebLLM support seems to be on the way as well. But both of those require WebGPU support.

Wllama works with CPU only, so there will be a (slower) fallback option for Safari and Firefox users, finally making Phi 3 128K universally available in the browser. And, importantly, with some headroom to really use that large context, even with a 4GB total memory limit.

@tomasmcm

New models are out. Not sure which ones are supported and which ones need changing. But probably all the 128K versions have the same issue.

         Short Context                 Long Context
Mini     4K [HF] ; [ONNX] ; [GGUF]     128K [HF] ; [ONNX]
Small    8K [HF] ; [ONNX]              128K [HF] ; [ONNX]
Medium   4K [HF] ; [ONNX]              128K [HF] ; [ONNX]
Vision   -                             128K [HF]

@s-smits commented May 21, 2024

I was able to convert the Medium models without issues but have not tested them yet. The Small 128k models apparently use a new rope scaling method called 'su'.
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file /content/Phi-3-small-128k-instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 7392272384 (7B)
Traceback (most recent call last):
  File "/content/llama.cpp/convert.py", line 1714, in <module>
    main()
  File "/content/llama.cpp/convert.py", line 1647, in main
    params = Params.load(model_plus)
  File "/content/llama.cpp/convert.py", line 334, in load
    params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
  File "/content/llama.cpp/convert.py", line 240, in loadHFTransformerJson
    raise NotImplementedError(f'Unknown rope scaling type: {typ}')
NotImplementedError: Unknown rope scaling type: su
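If anyone wants to check which scaling type a checkpoint declares before running the converter, the value convert.py is choking on comes straight from the model's config.json. A minimal sketch (the local path is just an example):

import json

# Inspect the rope_scaling entry that convert.py reads (local path is just an example).
with open("Phi-3-small-128k-instruct/config.json") as f:
    cfg = json.load(f)

rope_scaling = cfg.get("rope_scaling") or {}
# Prints "su" on the 128k variants as originally released (later renamed "longrope");
# the 4k/8k variants have no rope_scaling block at all.
print(rope_scaling.get("type"))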

@dmsweetser

I think "su" is the same for mini, so hopefully any current effort will carry these through as well.

@MoonRide303

When trying to convert Phi-3-small-8k-instruct:

NotImplementedError: Architecture 'Phi3SmallForCausalLM' not supported!

Also different tokenizer - based on cl100k_base tiktoken, adapted to support ChatML:
https://huggingface.co/microsoft/Phi-3-small-8k-instruct/blob/main/tokenization_phi3_small.py#L15
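For anyone poking at that tokenizer: the linked tokenization_phi3_small.py basically wraps the stock cl100k_base encoding and adds extra special tokens. A rough sketch of the same idea using tiktoken's documented extension pattern (the special-token names and IDs below are placeholders, not the actual Phi-3-small assignments):

import tiktoken

# Start from the stock cl100k_base encoding and add ChatML-style special tokens,
# following the extension pattern from tiktoken's README. The token IDs below are
# placeholders, not the actual Phi-3-small vocabulary assignments.
base = tiktoken.get_encoding("cl100k_base")
phi3_small_like = tiktoken.Encoding(
    name="phi3_small_sketch",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<|im_start|>": 100264,  # placeholder ID
        "<|im_end|>": 100265,    # placeholder ID
    },
)
print(phi3_small_like.encode("<|im_start|>user", allowed_special="all"))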

@coder543

It looks like #7225 was merged. Is there any other outstanding work on the Phi-3 models?

Does the new Phi-3 vision model work?

@candre23

It looks like #7225 was merged. Is there any other outstanding work on the Phi-3 models?

Does the new Phi-3 vision model work?

Yes, small is still not working, though it may just be a naming issue? There are also some conflicting reports of functionality and/or quality. More discussion here: #7439

@RLXIWC commented May 31, 2024

Hi guys, what should I do about this error? unknown model architecture: 'phi3'

I fine-tuned my own phi-3 and converted it to gguf with this command: python llama.cpp/convert-hf-to-gguf.py midesk-private --outfile midesk-private-gguf-4k-v0.0.gguf

I get the error when I run

from llama_cpp import Llama
llm = Llama(
      model_path="./midesk-private-gguf-4k-v0.0.gguf"
)

I would be very thankful for any help or push in the right direction.

Have you found a solution for that? I'm facing the same issue.

@Galunid (Collaborator) commented May 31, 2024

@RLXIWC Update your llama-cpp-python version.

@RLXIWC commented Jun 2, 2024

I am already using version 0.2.76, but I still get the error. The model is Phi-3-mini-4k-instruct-q4.gguf from Hugging Face.

@Galunid (Collaborator) commented Jun 2, 2024

unknown model architecture: 'phi3'

That error comes from the llama.cpp backend; it happens only when there's no phi3 entry in this mapping.


That means there's some issue with your llama-cpp-python. Maybe you reinstalled the library and the backend didn't get recompiled, or something was cached? Try recreating your venv and make sure everything is correctly installed. I tried the latest version (0.2.76) with a phi3 model and it worked as expected, with the same code as yours.
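One way to see which side is at fault is to read the architecture key straight out of the GGUF header with the gguf Python package (pip install gguf). A small sketch, noting that the exact way string values are pulled out of the reader may differ between gguf-py versions:

from gguf import GGUFReader  # pip install gguf

# Read general.architecture from the GGUF header. If this prints "phi3", the file
# itself is fine and the problem is the installed llama-cpp-python build.
reader = GGUFReader("midesk-private-gguf-4k-v0.0.gguf")
arch = reader.fields["general.architecture"]
print(bytes(arch.parts[-1]).decode("utf-8"))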

@d-kleine commented Jun 8, 2024

I am having a similar issue, not with 'Phi3SmallForCausalLM' but with 'Phi3ForSequenceClassification', on the newest llama.cpp release (d4d915d):

NotImplementedError: Architecture 'Phi3ForSequenceClassification' not supported!

(edit: removed link)

@mzwing commented Jun 9, 2024

I am having a similar issue, not with 'Phi3SmallForCausalLM' but with 'Phi3ForSequenceClassification', on the newest llama.cpp release (d4d915d)

NotImplementedError: Architecture 'Phi3ForSequenceClassification' not supported!

Found someone else having the same problem on HF: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/discussions/3

Maybe you should use the convert-hf-to-gguf.py script.

@d-kleine commented Jun 9, 2024

Maybe you should use the convert-hf-to-gguf.py script.

I have used convert-hf-to-gguf.py

I was trying to convert a Phi-3 mini (3.8B) based LLM that uses the Phi3ForSequenceClassification architecture, a variant of the Phi-3 language model with a sequence classification head (a linear layer) on top, to f16 GGUF with llama.cpp. It seems Phi3ForSequenceClassification has not yet been implemented in llama.cpp's convert-hf-to-gguf.py: there is a decorator for Phi3ForCausalLM but not yet for Phi3ForSequenceClassification.

Linking #7439 in any case too
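For context, convert-hf-to-gguf.py maps Hugging Face architecture names to converter classes with a registration decorator. A rough, untested sketch of what hooking up the classification variant could look like, meant to live inside convert-hf-to-gguf.py (the extra architecture string, the "score" tensor name, and the decision to drop the head are all assumptions, not an actual patch):

# Rough, untested sketch inside convert-hf-to-gguf.py, not an actual patch:
# register the classification architecture on the existing Phi-3 class and drop
# the classification head, whose tensors have no slot in the PHI3 graph.
@Model.register("Phi3ForCausalLM", "Phi3ForSequenceClassification")
class Phi3MiniModel(Model):
    model_arch = gguf.MODEL_ARCH.PHI3

    def modify_tensors(self, data_torch, name, bid):
        if name.startswith("score"):  # assumed name of the classification head
            return []                 # skip it; only the base language model is converted
        return super().modify_tensors(data_torch, name, bid)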

@bartowski1182 (Contributor)

Looks like the new Phi-3 from today uses Microsoft's new longrope, which is still unsupported.

@coder543 commented Jul 2, 2024

@mann1x commented Jul 3, 2024

Sent a request to the MS team on HF to support the longrope implementation if they can:

https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/88

@maziyarpanahi

The newly released model openbmb/MiniCPM3-4B also has a similar issue:

https://huggingface.co/openbmb/MiniCPM3-4B

NotImplementedError: Unknown rope scaling type: longrope
