Support Deepseek MoE #2429

Closed
wants to merge 2 commits into from
Conversation

@esmeetu (Collaborator) commented Jan 12, 2024

Model info:

https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat

The current implementation generates garbled text, and I need some help.

Test code:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "def greet"
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

# Create an LLM.
llm = LLM(model="deepseek-ai/deepseek-moe-16b-chat", dtype="half", enforce_eager=True, tensor_parallel_size=4, gpu_memory_utilization=0.95, trust_remote_code=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.

outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Old output:

Prompt: 'def greet', Generated text: " 家4《0“d人 同一-开li...网是 下 ( 不是\n同:手望=你在同是是哪个字�d/ 3,1\n02的宇\n市、Google 3一时中国下在那里\n Me、、 -分 f\n\n小的高维AF'A 的 是(入\n 是很 名\n人 >>折料\n2的留 当\n2下\nof再\n的 狗了顾团队\n1手、重 了\n一些有\n一个在\n12"�

Updated output:

Prompt: 'def greet', Generated text: '(name):\n    print("Hello, " + name + "!")\n\ngreet("Alice")'

Additionally, the model's chat template:

{% for message in messages %}
{% if message['role'] == 'user' %}
User: {{ message['content']|trim -}}
{% if not loop.last %}

{% endif %}
{% elif message['role'] == 'assistant' %}
Assistant: {{ message['content']|trim -}}{{ eos_token }}
{% if not loop.last %}

{% endif %}
{% endif %}
{% endfor %}
{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}

Assistant: {% endif %}
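
For reference, a minimal sketch of rendering this template with transformers' apply_chat_template; the compacted template string, the example message, and the variable names are illustrative, not part of this PR:

# Minimal sketch, assuming the Jinja template above compacted into one string.
from transformers import AutoTokenizer

chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}User: {{ message['content']|trim -}}"
    "{% if not loop.last %}\n\n{% endif %}"
    "{% elif message['role'] == 'assistant' %}"
    "Assistant: {{ message['content']|trim -}}{{ eos_token }}"
    "{% if not loop.last %}\n\n{% endif %}"
    "{% endif %}{% endfor %}"
    "{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}"
    "\n\nAssistant: {% endif %}"
)

tok = AutoTokenizer.from_pretrained(
    "deepseek-ai/deepseek-moe-16b-chat", trust_remote_code=True
)
tok.chat_template = chat_template

messages = [{"role": "user", "content": "Write a greet function in Python."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # roughly: "User: Write a greet function in Python.\n\nAssistant: "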

@zhuohan123 (Member) commented:

Can you compare with HF implementation by printing the tensors layer by layer to see where the results become off? This is typically how we debug this kind of issue.
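
One plausible way to do this layer-by-layer comparison is to register forward hooks on the HF reference model and dump the per-layer outputs for diffing against prints added inside the vLLM model; the module-name filter, dump file name, and prompt below are assumptions, not vLLM tooling:

# Minimal sketch: capture per-layer outputs from the HF reference implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-moe-16b-chat"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
)

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        captured[name] = out.detach().float().cpu()
    return hook

# Hook the attention and MLP submodules so the first diverging layer is easy to spot.
for name, module in model.named_modules():
    if name.endswith("mlp") or name.endswith("self_attn"):
        module.register_forward_hook(make_hook(name))

ids = tok("def greet", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**ids)

torch.save(captured, "hf_layer_outputs.pt")  # compare against tensors printed from vLLM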

@esmeetu (Collaborator, Author) commented Jan 12, 2024

> Can you compare with HF implementation by printing the tensors layer by layer to see where the results become off? This is typically how we debug this kind of issue.

OK, I will try this.

@esmeetu esmeetu marked this pull request as draft January 13, 2024 01:14
@esmeetu esmeetu changed the title from "[WIP] Support Deepseek MoE (Need Help)" to "Support Deepseek MoE" on Jan 13, 2024
@esmeetu esmeetu marked this pull request as ready for review January 13, 2024 15:16
@esmeetu (Collaborator, Author) commented Jan 13, 2024

Hi @zhuohan123, I went back to the official MoE implementation. There should still be room for improvement by adopting expert parallelism. This PR is ready for review, and I will explore that optimization in a future PR. cc @WoosukKwon
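
To make the expert-parallelism idea concrete, here is a toy single-process sketch: each simulated rank owns a slice of experts, computes partial outputs only for tokens routed to its local experts, and the partial sums are combined (which a real implementation would do with an all-reduce across devices). The stand-in router, shapes, and per-expert Linear layers are illustrative, not DeepSeek's actual architecture:

# Toy sketch of expert parallelism; not vLLM's or DeepSeek's implementation.
import torch

num_experts, num_ranks, hidden = 8, 4, 16
experts_per_rank = num_experts // num_ranks

# One tiny Linear per expert; real DeepSeek MoE experts are gated MLPs.
experts = [torch.nn.Linear(hidden, hidden, bias=False) for _ in range(num_experts)]

def moe_forward(x: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    # Stand-in router: softmax over experts, keep the top-k per token.
    logits = torch.randn(x.shape[0], num_experts)
    weights, idx = torch.topk(torch.softmax(logits, dim=-1), top_k, dim=-1)

    out = torch.zeros_like(x)
    for rank in range(num_ranks):                    # each iteration plays one device
        lo, hi = rank * experts_per_rank, (rank + 1) * experts_per_rank
        partial = torch.zeros_like(x)
        for e in range(lo, hi):                      # only experts owned by this rank
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel():
                w = weights[rows, slots].unsqueeze(-1)
                partial[rows] += w * experts[e](x[rows])
        out += partial                               # an all-reduce in a real EP setup
    return out

print(moe_forward(torch.randn(5, hidden)).shape)     # torch.Size([5, 16])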

@zwd003 (Contributor) commented Jan 16, 2024

Thank you for your support of DeepSeek MoE; we will subsequently release an optimized inference version.

@esmeetu (Collaborator, Author) commented Jan 16, 2024

> Thank you for your support of DeepSeek MoE; we will subsequently release an optimized inference version.

Hi @zwd003, I am happy to hear from the DeepSeek team about this. Looking forward to seeing your commit as soon as possible, and I hope it brings a great performance boost. 💪
