
Distributed inference with llama.cpp #2322

Closed
mudler opened this issue May 14, 2024 · 2 comments · Fixed by #2324
Labels: enhancement (New feature or request), roadmap

Comments

mudler commented May 14, 2024

Now that ggerganov/llama.cpp#6829 is in (great job, llama.cpp!), it should be possible to extend our gRPC server to distribute the workload to workers.

From a quick look, the upstream implementation looks quite lean, as we just need to pass the params through to llama.cpp directly.

The main point is that we want to propagate this setting from the CLI/env rather than through a config portion in the model.

mudler added the enhancement (New feature or request) and roadmap labels on May 14, 2024
mudler commented May 14, 2024

This might need to wait for #2306, as it likely requires a new llama.cpp backend specifically built with gRPC enabled, since llama.cpp treats it internally as an offloading backend (like Metal, CUDA, etc.). I haven't tried yet whether gRPC builds fall back to local builds.

mudler commented May 14, 2024

Seems to work as expected with #2324.
