
[Model] Adds Phi-3 support #4298

Merged: 11 commits merged into vllm-project:main on Apr 25, 2024
Conversation

@caiom (Contributor) commented Apr 23, 2024

This PR adds support for microsoft/Phi-3-mini-4k-instruct and microsoft/Phi-3-mini-128k-instruct.

Changes:

  1. Adds a PhiLongScaledRotaryEmbedding layer. This is a pure PyTorch implementation; optimizations may come in the future.
  2. Makes the layer name mapping in the llama model more robust to avoid layer name collisions.

Models tested before this PR:

  • microsoft/Phi-3-mini-4k-instruct
  • microsoft/Phi-3-mini-128k-instruct
  • meta-llama/Llama-2-7b-chat-hf
  • meta-llama/Meta-Llama-3-8B
  • mistralai/Mistral-7B-Instruct-v0.2

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

@simon-mo (Collaborator)

> This is a pure PyTorch implementation. Optimizations may come in the future.

Interested to see whether torch.compile can help here.

@ArthurZucker

A torch.compile-compatible version is here, without scaling. @caiom, would you mind synchronizing with huggingface/transformers#30423 and @gugarosa to make sure we have a similar format?

@agt mentioned this pull request on Apr 23, 2024
@caiom (Contributor, Author) commented Apr 24, 2024

I've updated the naming to match the one used in HF, along with the config changes. Let's wait until the HF PR is merged.

Comment on lines +430 to +431
long_prompt_offset = (torch.any(positions > k).float() *
                      torch.full_like(positions, k)).long()
@mindest:

I have a question about this line: in a case where some prompts are longer than k and others are shorter, a uniform offset is applied (choosing the long cos_sin_cache), so the shorter prompts will also use the long cos_sin_cache, and their outputs will differ from when the short prompts are processed alone. Is this expected? I think the short cos_sin_cache should be used in the shorter-prompt cases.

@caiom (Contributor, Author):

Hi @mindest, that's correct: when a batch contains both long and short sequences, it will always use the long factor, even for the short samples. Currently we don't support such mixed batches.
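
For readers puzzling over the quoted snippet, here is a minimal runnable sketch of that offset selection; k and the positions values below are illustrative assumptions, not numbers from the PR.

    import torch

    # Sketch of the offset logic quoted above. Because torch.any() reduces over
    # the whole batch, a single long prompt pushes every sequence -- the short
    # ones included -- into the "long" half of the cos_sin_cache.
    k = 4096                                       # assumed original max position
    positions = torch.tensor([10, 20, 5000])       # mixed batch: two short, one long

    long_prompt_offset = (torch.any(positions > k).float() *
                          torch.full_like(positions, k)).long()
    print(long_prompt_offset)                      # tensor([4096, 4096, 4096])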

@@ -349,16 +469,17 @@ def get_rope(
rope_scaling: Optional[Dict[str, Any]] = None,
) -> RotaryEmbedding:
key = (head_size, rotary_dim, max_position, base, is_neox_style,
tuple(rope_scaling.items()) if rope_scaling is not None else None)
(v for v in rope_scaling.items()
if not isinstance(v, list)) if rope_scaling is not None else None)
Contributor:

Maybe instead of replacing the list with None, consider turning it into a tuple instead? That would be more accurate.

@caiom (Contributor, Author):

Hi! I fixed the code and disentangled it to make it clearer. I also verified that the layer caching system is working.

@pseudotensor

HF PR is merged: huggingface/transformers#30423

@caiom (Contributor, Author) commented Apr 24, 2024

About torch.compile: The Phi3SuScaledRotaryEmbedding is fully compatible with it and there are no graph breaks. However, I noticed a longer startup time and some recompilation warnings. Maybe someone with more compile experience can add this feature later?

The speedup is between 2-5x depending on the sequence length.
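
For anyone curious what that looks like in practice, here is a minimal, hedged sketch of compiling a pure-PyTorch rotary-embedding forward with torch.compile; ToyRotaryEmbedding and its shapes are illustrative stand-ins, not the PR's Phi3SuScaledRotaryEmbedding.

    import torch

    class ToyRotaryEmbedding(torch.nn.Module):
        """Illustrative stand-in: precomputes a cos/sin cache; forward is a gather."""

        def __init__(self, rotary_dim: int, max_position: int, base: float = 10000.0):
            super().__init__()
            inv_freq = 1.0 / (base**(torch.arange(0, rotary_dim, 2).float() / rotary_dim))
            freqs = torch.outer(torch.arange(max_position).float(), inv_freq)
            self.register_buffer("cos_sin_cache",
                                 torch.cat([freqs.cos(), freqs.sin()], dim=-1))

        def forward(self, positions: torch.Tensor) -> torch.Tensor:
            return self.cos_sin_cache[positions]

    rope = ToyRotaryEmbedding(rotary_dim=96, max_position=4096)
    # dynamic=True asks the compiler to generalize over sequence length, which can
    # reduce the recompilation warnings mentioned above (at some optimization cost).
    compiled_rope = torch.compile(rope, dynamic=True)
    cos_sin = compiled_rope(torch.arange(128))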

("qkv_proj", "v_proj", "v"),
("gate_up_proj", "gate_proj", 0),
("gate_up_proj", "up_proj", 1),
(".qkv_proj", ".q_proj", "q"),
Collaborator:

A quick question: why add the dot here?


Does Phi have a different set of mappings?

  • does this not break the existing llama executor?

@caiom (Contributor, Author) commented Apr 25, 2024:

This ensures that we replace the full layer name rather than a partial layer name.

Phi-3 uses the vLLM layer naming convention (no replacement necessary), so it already has a layer named gate_up_proj. Without the dots, this layer would be incorrectly renamed to gate_gate_up_proj due to line 389.

So the dot in .up_proj makes sure that we only rename a layer whose name contains .up_proj as a full component, rather than any layer merely containing up_proj somewhere in its name.

This change is compatible with Llama and Mistral models.
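
A minimal sketch of the point being made here (not the actual vLLM loader code; the checkpoint name below is a hypothetical example):

    # Stacked-parameter mapping in the dotted form discussed above:
    # (fused param name, per-weight name in the checkpoint, shard id)
    stacked_params_mapping = [
        (".qkv_proj", ".q_proj", "q"),
        (".gate_up_proj", ".gate_proj", 0),
        (".gate_up_proj", ".up_proj", 1),
    ]

    name = "model.layers.0.mlp.gate_up_proj.weight"   # Phi-3 already uses the fused name

    # Without the leading dot, a plain substring replace corrupts the name:
    print(name.replace("up_proj", "gate_up_proj"))
    # -> model.layers.0.mlp.gate_gate_up_proj.weight

    # With the dot, ".up_proj" never matches inside ".gate_up_proj", so the
    # already-fused layer name is left untouched:
    for param_name, weight_name, _ in stacked_params_mapping:
        if weight_name in name:
            name = name.replace(weight_name, param_name)
    print(name)
    # -> model.layers.0.mlp.gate_up_proj.weight (unchanged)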

Collaborator:

Thanks for the clarification! Makes sense.

@esmeetu (Collaborator) commented Apr 25, 2024

LGTM! Could you help add this model to the documentation and README file?

@sumukshashidhar approved these changes Apr 25, 2024
@caiom (Contributor, Author) commented Apr 25, 2024

> LGTM! Could you help add this model to the documentation and README file?

Thanks for the review. I've added Phi-3 to the README and to the Supported Models doc.

@esmeetu enabled auto-merge (squash) on April 25, 2024 02:36
@esmeetu merged commit 96e90fd into vllm-project:main on Apr 25, 2024
47 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Apr 25, 2024
@pseudotensor commented Apr 26, 2024

Hi @caiom, Phi-3 with this is relatively slow for a 3B model. Is there a way to speed it up?

I get about 73 tokens/sec on an H100 with nothing in context, and about 35 tokens/sec with the context filled. But that's with streaming.

I get better performance, about 125 tokens/sec with nothing in context, when not streaming.

@@ -46,6 +46,7 @@
"OPTForCausalLM": ("opt", "OPTForCausalLM"),
"OrionForCausalLM": ("orion", "OrionForCausalLM"),
"PhiForCausalLM": ("phi", "PhiForCausalLM"),
"Phi3ForCausalLM": ("llama", "LlamaForCausalLM"),
@davidgxue commented Apr 27, 2024:

A quick question in retrospect (more for my own understanding): why do these map to llama and LlamaForCausalLM? Isn't it defined as Phi3ForCausalLM in the Hugging Face transformers library? https://huggingface.co/docs/transformers/main/en/model_doc/phi3

It seems the module and class name are later used to import from vLLM's /models directory, and we haven't added a phi3.py file there or created a dedicated file for Phi-3 yet.

I think Phi-3, while on paper very similar to Llama 2, does have a few modifications that make it different, such as the fused QKV and MLP projections. Should we give Phi-3 its own file, or do you think it is OK to keep inheriting from llama this way? (Or, based on the rest of the PR, I guess you may have already handled this by just changing llama.py, so maybe that just works.)

@caiom (Contributor, Author):

The model is very similar to llama, hence it is better to reuse the existing code base. The same is true for Mistral.

@davidgxue:

Ah, I saw your changes to the other files and they make sense now. Thank you!
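
For context on the thread above, here is a hedged sketch of how an architecture registry like the one in this diff can resolve Phi3ForCausalLM to the existing Llama implementation; the import path is an assumption based on vLLM's layout at the time, not code from this PR.

    import importlib

    _MODELS = {
        "PhiForCausalLM": ("phi", "PhiForCausalLM"),
        "Phi3ForCausalLM": ("llama", "LlamaForCausalLM"),  # Phi-3 reuses the Llama code path
    }

    def load_model_cls(architecture: str):
        module_name, cls_name = _MODELS[architecture]
        module = importlib.import_module(f"vllm.model_executor.models.{module_name}")
        return getattr(module, cls_name)

    # load_model_cls("Phi3ForCausalLM") -> vllm.model_executor.models.llama.LlamaForCausalLM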

@caiom (Contributor, Author) commented Apr 27, 2024

> Hi @caiom, Phi-3 with this is relatively slow for a 3B model. Is there a way to speed it up?
>
> I get about 73 tokens/sec on an H100 with nothing in context, and about 35 tokens/sec with the context filled. But that's with streaming.
>
> I get better performance, about 125 tokens/sec with nothing in context, when not streaming.

Will be working on that.

@davidgxue commented Apr 27, 2024

Hi, sorry, it's me again. I am actually getting an error with Phi-3 after installing from source:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/david_xue/vllm/vllm/entrypoints/openai/api_server.py", line 159, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/david_xue/vllm/vllm/engine/async_llm_engine.py", line 361, in from_engine_args
    engine = cls(
  File "/home/david_xue/vllm/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/david_xue/vllm/vllm/engine/async_llm_engine.py", line 437, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/david_xue/vllm/vllm/engine/llm_engine.py", line 214, in __init__
    self.scheduler = Scheduler(scheduler_config, cache_config, lora_config)
  File "/home/david_xue/vllm/vllm/core/scheduler.py", line 265, in __init__
    self.block_manager = BlockSpaceManagerImpl(
  File "/home/david_xue/vllm/vllm/core/block_manager_v1.py", line 223, in __init__
    assert sliding_window % block_size == 0, (sliding_window,
AssertionError: (2047, 16)

The sliding window is 2047, which seems correct based on the Phi-3 code. The block size of 16 is also correct...? Am I missing something?

I am currently working on a custom quant for Phi-3, which may be related. Should I open a separate issue?

Oh, also, I am not sure how this was tested, but Phi-3-mini-4k has a sliding window of 2047, while I think the 128k context-length version has the sliding window disabled or configured differently?
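
For reference, the assertion in the traceback reduces to a simple divisibility check; the values below are the ones from the error above.

    sliding_window = 2047   # Phi-3-mini-4k sliding window size
    block_size = 16         # vLLM KV-cache block size

    # 2047 % 16 == 15, so the check fails with AssertionError: (2047, 16)
    assert sliding_window % block_size == 0, (sliding_window, block_size)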

@caiom (Contributor, Author) commented Apr 27, 2024

Hi @davidgxue, sorry about that! Have a look here: #4380
