Is your feature request related to a problem? Please describe.
From my benchmarks, OpenVINO performance on an iGPU is roughly 5 to 8 times faster than the llama.cpp SYCL implementation for Mistral-based 7B models.
With SYCL on an iGPU (UHD 770) I can serve Starling and OpenChat at 2 to 4 tokens/s, while with OpenVINO and INT8 quantization I can easily run inference at 15-16 tokens/s.
I don't know what the performance is on Arc or NPU, since I don't have the hardware to test.
This could be an effective solution for computers with an iGPU.
I've uploaded an OpenVINO version of openchat-3.5-0106 to HF for testing: https://huggingface.co/fakezeta/openchat-3.5-0106-openvino-int8/
It should be compatible with the torch, ONNX, and OpenVINO model formats.
Describe the solution you'd like
This could be implemented with the Optimum-Intel library or with the gRPC OpenVINO Model Server; a sketch of the Optimum-Intel approach follows below.
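For reference, here is a minimal sketch of the Optimum-Intel path, loading the INT8 model linked above onto the iGPU. It assumes `optimum[openvino]` and `transformers` are installed; the prompt text and generation parameters are just placeholders, not part of any proposed API.

```python
# Minimal sketch: run an OpenVINO INT8 LLM on an Intel iGPU via Optimum-Intel.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "fakezeta/openchat-3.5-0106-openvino-int8"

# The repo already contains OpenVINO IR, so no export step is needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

# Target the iGPU through the OpenVINO GPU plugin (defaults to CPU otherwise).
model.to("GPU")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```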