
[Model] Adds Phi-3 support #4298

Merged: 11 commits merged into vllm-project:main on Apr 25, 2024
Conversation

@caiom (Contributor) commented Apr 23, 2024

This PR adds support for microsoft/Phi-3-mini-4k-instruct and microsoft/Phi-3-mini-128k-instruct.

Changes:

  1. Adds a PhiLongScaledRotaryEmbedding layer. This is a pure PyTorch implementation; optimizations may come in the future.
  2. Makes the layer name mapping in the llama model more robust to avoid layer name collisions.

Models tested before this PR:

  • microsoft/Phi-3-mini-4k-instruct
  • microsoft/Phi-3-mini-128k-instruct
  • meta-llama/Llama-2-7b-chat-hf
  • meta-llama/Meta-Llama-3-8B
  • mistralai/Mistral-7B-Instruct-v0.2

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@simon-mo (Collaborator)

> This is a pure PyTorch implementation. Optimizations may come in the future.

Interested to see whether torch.compile can help here.

@ArthurZucker

A torch.compile-compatible version is here, without scaling. @caiom, would you mind synchronizing with huggingface/transformers#30423 and @gugarosa to make sure we have a similar format?

@agt mentioned this pull request on Apr 23, 2024
@caiom (Contributor, Author) commented Apr 24, 2024

I've updated the naming to match the one used in HF, along with the config changes. Let's wait until the HF PR is merged.

Comment on lines +430 to +431
long_prompt_offset = (torch.any(positions > k).float() *
                      torch.full_like(positions, k)).long()
@mindest:

I have a question about this line: in a case where some prompts are longer than k and others are shorter, a uniform offset is applied (choosing the long cos_sin_cache), so the shorter prompts will also use the long cos_sin_cache, and their outputs will differ from when the short prompts are processed alone. Is this expected? I think the short cos_sin_cache should be used in the shorter-prompt cases.

@caiom (Contributor, Author):

Hi @mindest, that's correct: when a batch contains both long and short sequences, it will always use the long factor, even for the short samples. Currently we don't support such mixed batches.
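
For readers puzzling over the quoted snippet, here is a minimal runnable sketch of that offset selection; k and the positions values below are illustrative assumptions, not numbers from the PR.

    import torch

    # Sketch of the offset logic quoted above. Because torch.any() reduces over
    # the whole batch, a single long prompt pushes every sequence -- the short
    # ones included -- into the "long" half of the cos_sin_cache.
    k = 4096                                       # assumed original max position
    positions = torch.tensor([10, 20, 5000])       # mixed batch: two short, one long

    long_prompt_offset = (torch.any(positions > k).float() *
                          torch.full_like(positions, k)).long()
    print(long_prompt_offset)                      # tensor([4096, 4096, 4096])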

@@ -349,16 +469,17 @@ def get_rope(
rope_scaling: Optional[Dict[str, Any]] = None,
) -> RotaryEmbedding:
key = (head_size, rotary_dim, max_position, base, is_neox_style,
tuple(rope_scaling.items()) if rope_scaling is not None else None)
(v for v in rope_scaling.items()
if not isinstance(v, list)) if rope_scaling is not None else None)
Contributor:

Maybe instead of replacing the list with None, consider turning it into a tuple instead? That would be more accurate.

@caiom (Contributor, Author):

Hi! I fixed the code and disentangled it to make it clearer. I also verified that the layer caching system is working.

@pseudotensor

HF PR is merged: huggingface/transformers#30423

@caiom (Contributor, Author) commented Apr 24, 2024

About torch.compile: The Phi3SuScaledRotaryEmbedding is fully compatible with it and there are no graph breaks. However, I noticed a longer startup time and some recompilation warnings. Maybe someone with more compile experience can add this feature later?

The speedup is between 2-5x depending on the sequence length.
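
For anyone curious what that looks like in practice, here is a minimal, hedged sketch of compiling a pure-PyTorch rotary-embedding forward with torch.compile; ToyRotaryEmbedding and its shapes are illustrative stand-ins, not the PR's Phi3SuScaledRotaryEmbedding.

    import torch

    class ToyRotaryEmbedding(torch.nn.Module):
        """Illustrative stand-in: precomputes a cos/sin cache; forward is a gather."""

        def __init__(self, rotary_dim: int, max_position: int, base: float = 10000.0):
            super().__init__()
            inv_freq = 1.0 / (base**(torch.arange(0, rotary_dim, 2).float() / rotary_dim))
            freqs = torch.outer(torch.arange(max_position).float(), inv_freq)
            self.register_buffer("cos_sin_cache",
                                 torch.cat([freqs.cos(), freqs.sin()], dim=-1))

        def forward(self, positions: torch.Tensor) -> torch.Tensor:
            return self.cos_sin_cache[positions]

    rope = ToyRotaryEmbedding(rotary_dim=96, max_position=4096)
    # dynamic=True asks the compiler to generalize over sequence length, which can
    # reduce the recompilation warnings mentioned above (at some optimization cost).
    compiled_rope = torch.compile(rope, dynamic=True)
    cos_sin = compiled_rope(torch.arange(128))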

("qkv_proj", "v_proj", "v"),
("gate_up_proj", "gate_proj", 0),
("gate_up_proj", "up_proj", 1),
(".qkv_proj", ".q_proj", "q"),
Collaborator:

A quick question: why add the dot here?


Does Phi have a different set of mappings?

  • does this not break the existing llama executor?

@caiom (Contributor, Author) commented Apr 25, 2024:

This ensures that we replace the full layer name rather than a partial layer name.

Phi-3 uses the vLLM layer naming convention (no replacement necessary), so it already has a layer named gate_up_proj. Without the dots, this layer would be incorrectly renamed to gate_gate_up_proj due to line 389.

So the dot in .up_proj makes sure that we only rename a layer whose name contains .up_proj as a full component, rather than any layer merely containing up_proj somewhere in its name.

This change is compatible with Llama and Mistral models.
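
A minimal sketch of the point being made here (not the actual vLLM loader code; the checkpoint name below is a hypothetical example):

    # Stacked-parameter mapping in the dotted form discussed above:
    # (fused param name, per-weight name in the checkpoint, shard id)
    stacked_params_mapping = [
        (".qkv_proj", ".q_proj", "q"),
        (".gate_up_proj", ".gate_proj", 0),
        (".gate_up_proj", ".up_proj", 1),
    ]

    name = "model.layers.0.mlp.gate_up_proj.weight"   # Phi-3 already uses the fused name

    # Without the leading dot, a plain substring replace corrupts the name:
    print(name.replace("up_proj", "gate_up_proj"))
    # -> model.layers.0.mlp.gate_gate_up_proj.weight

    # With the dot, ".up_proj" never matches inside ".gate_up_proj", so the
    # already-fused layer name is left untouched:
    for param_name, weight_name, _ in stacked_params_mapping:
        if weight_name in name:
            name = name.replace(weight_name, param_name)
    print(name)
    # -> model.layers.0.mlp.gate_up_proj.weight (unchanged)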

Collaborator:

Thanks for the clarification! Makes sense.

@esmeetu (Collaborator) commented Apr 25, 2024

LGTM! Could you help add this model to the documentation and README file?

@sumukshashidhar approved these changes Apr 25, 2024
@caiom (Contributor, Author) commented Apr 25, 2024

> LGTM! Could you help add this model to the documentation and README file?

Thanks for the review. I've added Phi-3 to the README and to the Supported Models doc.

@esmeetu enabled auto-merge (squash) on April 25, 2024 02:36
@esmeetu merged commit 96e90fd into vllm-project:main on Apr 25, 2024
47 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Apr 25, 2024
@pseudotensor commented Apr 26, 2024

Hi @caiom, Phi-3 with this is relatively slow for a 3B model. Is there a way to speed it up?

I get about 73 tokens/sec on an H100 with nothing in context, and about 35 tokens/sec with the context filled. But that's with streaming.

I get better performance, about 125 tokens/sec with nothing in context, when not streaming.

@@ -46,6 +46,7 @@
"OPTForCausalLM": ("opt", "OPTForCausalLM"),
"OrionForCausalLM": ("orion", "OrionForCausalLM"),
"PhiForCausalLM": ("phi", "PhiForCausalLM"),
"Phi3ForCausalLM": ("llama", "LlamaForCausalLM"),
@davidgxue commented Apr 27, 2024:

A quick question in retrospect (more for my own understanding): why do these map to llama and LlamaForCausalLM? Isn't it defined as Phi3ForCausalLM in the Hugging Face transformers library? https://huggingface.co/docs/transformers/main/en/model_doc/phi3

It seems the module and class name are later used to import from vLLM's /models directory, and we haven't added a phi3.py file there or created a dedicated file for Phi-3 yet.

I think Phi-3, while on paper very similar to Llama 2, does have a few modifications that make it different, such as the fused QKV and MLP projections. Should we give Phi-3 its own file, or do you think it is OK to keep inheriting from llama this way? (Or, based on the rest of the PR, I guess you may have already handled this by just changing llama.py, so maybe that just works.)

@caiom (Contributor, Author):

The model is very similar to llama, hence it is better to reuse the existing code base. The same is true for Mistral.

@davidgxue:

Ah, I saw your changes to the other files and they make sense now. Thank you!
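
For context on the thread above, here is a hedged sketch of how an architecture registry like the one in this diff can resolve Phi3ForCausalLM to the existing Llama implementation; the import path is an assumption based on vLLM's layout at the time, not code from this PR.

    import importlib

    _MODELS = {
        "PhiForCausalLM": ("phi", "PhiForCausalLM"),
        "Phi3ForCausalLM": ("llama", "LlamaForCausalLM"),  # Phi-3 reuses the Llama code path
    }

    def load_model_cls(architecture: str):
        module_name, cls_name = _MODELS[architecture]
        module = importlib.import_module(f"vllm.model_executor.models.{module_name}")
        return getattr(module, cls_name)

    # load_model_cls("Phi3ForCausalLM") -> vllm.model_executor.models.llama.LlamaForCausalLM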

@caiom (Contributor, Author) commented Apr 27, 2024

> Hi @caiom, Phi-3 with this is relatively slow for a 3B model. Is there a way to speed it up?
>
> I get about 73 tokens/sec on an H100 with nothing in context, and about 35 tokens/sec with the context filled. But that's with streaming.
>
> I get better performance, about 125 tokens/sec with nothing in context, when not streaming.

Will be working on that.

@davidgxue commented Apr 27, 2024

Hi, sorry, it's me again. I am actually getting an error with Phi-3 after installing from source:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/david_xue/vllm/vllm/entrypoints/openai/api_server.py", line 159, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/david_xue/vllm/vllm/engine/async_llm_engine.py", line 361, in from_engine_args
    engine = cls(
  File "/home/david_xue/vllm/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/david_xue/vllm/vllm/engine/async_llm_engine.py", line 437, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/david_xue/vllm/vllm/engine/llm_engine.py", line 214, in __init__
    self.scheduler = Scheduler(scheduler_config, cache_config, lora_config)
  File "/home/david_xue/vllm/vllm/core/scheduler.py", line 265, in __init__
    self.block_manager = BlockSpaceManagerImpl(
  File "/home/david_xue/vllm/vllm/core/block_manager_v1.py", line 223, in __init__
    assert sliding_window % block_size == 0, (sliding_window,
AssertionError: (2047, 16)

The sliding window is 2047, which seems correct based on the Phi-3 code. The block size of 16 is also correct...? Am I missing something?

I am currently working on a custom quant for Phi-3, which may be related. Should I open a separate issue?

Oh, also, I am not sure how this was tested, but Phi-3-mini-4k has a sliding window of 2047, while I think the 128k context-length version has the sliding window disabled or configured differently?
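
For reference, the assertion in the traceback reduces to a simple divisibility check; the values below are the ones from the error above.

    sliding_window = 2047   # Phi-3-mini-4k sliding window size
    block_size = 16         # vLLM KV-cache block size

    # 2047 % 16 == 15, so the check fails with AssertionError: (2047, 16)
    assert sliding_window % block_size == 0, (sliding_window, block_size)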

@caiom (Contributor, Author) commented Apr 27, 2024

Hi @davidgxue, sorry about that! Have a look here: #4380
