
Use lists for broadcast #2

Open · wants to merge 1 commit into base: single-broadcast
Conversation

@njhill (Owner) commented May 20, 2024

Avoid sending all the dict keys; the calling code also ends up simpler.

broadcast_tensor_dict is changed to use the list implementation under the covers, so there is still a dict-based option for calling code that wants it.
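To make the idea concrete, here is a minimal sketch using plain pickle; the field names and values are invented for illustration, not vLLM's actual metadata:

```python
import pickle

# Hypothetical per-step metadata (stand-in values for illustration).
metadata = {
    "input_tokens": [1, 2, 3],
    "input_positions": [0, 1, 2],
    "seq_lens": [3],
}

# Dict form: the string keys are serialized on every broadcast.
dict_payload = pickle.dumps(metadata)

# List form: both ranks fix the field order in code, so only the
# values go over the wire.
list_payload = pickle.dumps([metadata["input_tokens"],
                             metadata["input_positions"],
                             metadata["seq_lens"]])

assert len(list_payload) < len(dict_payload)

# The receiver reconstructs by position instead of by key lookup.
input_tokens, input_positions, seq_lens = pickle.loads(list_payload)
```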

The measured end-to-end performance gain from the saved data transfer is negligible, however (at least for TP=2 llama-2-7b).

This PR is based on top of vllm-project#4844.

@pcmoritz commented May 20, 2024

Maybe instead of implementing our own serialization here, we should use something like msgspec, which is very high performance and also allows good buffer handling. I drafted something here, let me know what you think :)

It supports not sending the keys via array_like=True, and if we make the data structures we want to send msgspec.Struct subclasses, we can avoid all the Python data munging.

vllm-project/vllm@main...pcmoritz:vllm-public:speed-up-comms
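For readers unfamiliar with msgspec, here is a minimal sketch of the pattern being described; the struct name and fields are illustrative, not the draft's actual types:

```python
import msgspec

# With array_like=True, the struct is encoded as a MessagePack array,
# so the field names never go on the wire; the field order is fixed by
# the schema on both sender and receiver.
class SeqMetadata(msgspec.Struct, array_like=True):
    input_tokens: list[int]
    input_positions: list[int]
    seq_lens: list[int]

encoder = msgspec.msgpack.Encoder()
decoder = msgspec.msgpack.Decoder(SeqMetadata)

msg = SeqMetadata(input_tokens=[1, 2, 3], input_positions=[0, 1, 2], seq_lens=[3])
buf = encoder.encode(msg)   # compact bytes, no keys included
out = decoder.decode(buf)   # typed, validated SeqMetadata

# For buffer reuse, Encoder.encode_into(msg, bytearray_buf) serializes
# into a caller-owned bytearray instead of allocating a new one.
```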

At Anyscale we have been using this method with great success in production for quite a while :)

@njhill (Owner, Author) commented May 21, 2024

Thanks @pcmoritz, that also looks like a good option!

I hadn't really been thinking of what's done currently / in this PR as our own serialization, since it still uses pickle, but perhaps msgspec would be better than pickle.

I have a sense that compactness of the serialization isn't very important here; it doesn't seem to make a noticeable latency difference in terms of data transfer. But maybe faster serialization/deserialization would make a difference; it would be good to compare in an end-to-end benchmark. Otherwise, I guess we should prefer simplicity of code/maintenance, minimizing boilerplate etc.
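For reference, a rough microbenchmark sketch of that serialize/deserialize comparison; the payload here is synthetic, not vLLM's real metadata, so only an end-to-end benchmark would settle the question:

```python
import pickle
import timeit

import msgspec

class Meta(msgspec.Struct, array_like=True):
    tokens: list[int]
    positions: list[int]

payload = Meta(tokens=list(range(256)), positions=list(range(256)))
enc = msgspec.msgpack.Encoder()
dec = msgspec.msgpack.Decoder(Meta)

# Round-trip each codec on the same data; lower is faster.
t_pickle = timeit.timeit(
    lambda: pickle.loads(pickle.dumps((payload.tokens, payload.positions))),
    number=10_000)
t_msgspec = timeit.timeit(
    lambda: dec.decode(enc.encode(payload)), number=10_000)
print(f"pickle: {t_pickle:.3f}s  msgspec: {t_msgspec:.3f}s")
```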

I see you just used a fixed buffer size per call site. Assuming the size doesn't meaningfully affect speed, that would be simpler than the auto-sizing done in vllm-project#4844.

@pcmoritz commented
Yes, I think a fixed buffer size will be good if we can ensure all the metadata fits, because it will be very simple :)
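For illustration, a rough sketch of what a fixed-buffer broadcast with a fits-check could look like. The names and the 64 KiB cap are assumptions, it presumes torch.distributed is already initialized, and it is not the actual code from either PR:

```python
import struct

import msgspec
import torch
import torch.distributed as dist

# Assumed per-call-site cap; the assert below is the "ensure all the
# metadata fits" guarantee that makes auto-sizing unnecessary.
MAX_METADATA_BYTES = 1 << 16  # 64 KiB

encoder = msgspec.msgpack.Encoder()

def broadcast_metadata(obj=None, src: int = 0) -> bytes:
    """One fixed-size collective; a 4-byte length prefix delimits the payload."""
    buf = torch.zeros(MAX_METADATA_BYTES, dtype=torch.uint8)
    if dist.get_rank() == src:
        payload = encoder.encode(obj)
        assert 4 + len(payload) <= MAX_METADATA_BYTES, "metadata too large"
        framed = struct.pack("<I", len(payload)) + payload
        buf[: len(framed)] = torch.frombuffer(bytearray(framed), dtype=torch.uint8)
    dist.broadcast(buf, src=src)  # same size every step, no size exchange
    raw = buf.numpy().tobytes()
    (n,) = struct.unpack_from("<I", raw)
    return raw[4 : 4 + n]  # caller decodes, e.g. with msgspec.msgpack.decode
```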
