
Distributed inference with llama.cpp #2322

Closed
mudler opened this issue May 14, 2024 · 2 comments · Fixed by #2324
Labels: enhancement (New feature or request), roadmap

Comments

mudler commented May 14, 2024

Now that ggerganov/llama.cpp#6829 is in (great job, llama.cpp!), it should be possible to extend our gRPC server to distribute the workload to workers.

From a quick look, the upstream implementation looks quite lean, as we just need to pass the params through to llama.cpp directly.

The main point is that we want to propagate this setting from the CLI/env rather than through a config portion in the model.

mudler added the enhancement (New feature or request) and roadmap labels on May 14, 2024
mudler commented May 14, 2024

This might need to wait for #2306, as it likely requires a new llama.cpp backend specifically built with gRPC enabled, since llama.cpp treats it internally as an offloading backend (like Metal, CUDA, etc.). I haven't tried yet whether gRPC builds fall back to local builds.

mudler commented May 14, 2024

Seems to work as expected with #2324.
