
feat: add OpenVINO Model Server as a Backend #1722

Closed
fakezeta opened this issue Feb 18, 2024 · 0 comments
Labels: enhancement (New feature or request), roadmap

@fakezeta (Collaborator)
Is your feature request related to a problem? Please describe.
From my benchmarks, OpenVINO performance on iGPU is roughly 5 to 8 times faster than the llama.cpp SYCL implementation for Mistral-based 7B models.

With SYCL on an iGPU (UHD 770) I can serve Starling and OpenChat at 2 to 4 tokens/s, while with OpenVINO and INT8 I can easily reach 15-16 tokens/s.
I don't know what the performance is on Arc or NPU since I don't have the hardware to test.

This could be an effective solution for computers with an iGPU.

I've uploaded an OpenVINO version of openchat-3.5-0106 to HF for testing: https://huggingface.co/fakezeta/openchat-3.5-0106-openvino-int8/

It would be compatible with the torch, ONNX, and OpenVINO model formats.

Describe the solution you'd like

This could be implemented with the Optimum-Intel library or with the gRPC OpenVINO Model Server; a sketch of the Optimum-Intel route is below.
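
As a minimal sketch of the Optimum-Intel route (not the Model Server one), assuming `optimum[openvino]` is installed; the model id is the repo linked above, the device string and prompt are illustrative:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "fakezeta/openchat-3.5-0106-openvino-int8"

# The repo already contains OpenVINO IR files, so no export/conversion step is needed here.
model = OVModelForCausalLM.from_pretrained(model_id)
model.to("GPU")  # run on the Intel iGPU; "CPU" would also work

tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same OVModelForCausalLM class can also export a regular Hugging Face checkpoint to OpenVINO on the fly (`export=True`), which is what would make the torch/ONNX/OpenVINO compatibility mentioned above possible.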
