Is your feature request related to a problem? Please describe.
From my benchmarks, OpenVINO performance on an iGPU is roughly 5 to 8 times faster than the llama.cpp SYCL implementation for Mistral-based 7B models.
With SYCL on an iGPU (UHD 770) I can serve Starling and OpenChat at 2 to 4 tokens/s, while with OpenVINO and INT8 quantization I can easily run inference at 15-16 tokens/s.
I don't know what the performance is on Arc or NPU, since I don't have the hardware to test.
This could be an effective solution for computers with an iGPU.
I've uploaded an OpenVINO version of openchat-3.5-0106 to HF for testing: https://huggingface.co/fakezeta/openchat-3.5-0106-openvino-int8/
It should be compatible with the torch, ONNX, and OpenVINO model formats.
Describe the solution you'd like
This could be implemented with the Optimum-Intel library or with the gRPC OpenVINO Model Server; a sketch of the Optimum-Intel approach follows below.
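For reference, here is a minimal sketch of the Optimum-Intel path, loading the INT8 model linked above onto the iGPU. It assumes `optimum[openvino]` and `transformers` are installed; the prompt text and generation parameters are just placeholders, not part of any proposed API.

```python
# Minimal sketch: run an OpenVINO INT8 LLM on an Intel iGPU via Optimum-Intel.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "fakezeta/openchat-3.5-0106-openvino-int8"

# The repo already contains OpenVINO IR, so no export step is needed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id)

# Target the iGPU through the OpenVINO GPU plugin (defaults to CPU otherwise).
model.to("GPU")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```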