
Use lists for broadcast #2

Open · wants to merge 1 commit into base: single-broadcast
Conversation

@njhill (Owner) commented May 20, 2024

Avoid sending all the dict keys; the calling code also ends up simpler.

broadcast_tensor_dict is changed to use the list implementation under the covers, so there is still a dict-based option for calling code that wants it.
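To make the idea concrete, here is a minimal sketch using plain pickle; the field names and values are invented for illustration, not vLLM's actual metadata:

```python
import pickle

# Hypothetical per-step metadata (stand-in values for illustration).
metadata = {
    "input_tokens": [1, 2, 3],
    "input_positions": [0, 1, 2],
    "seq_lens": [3],
}

# Dict form: the string keys are serialized on every broadcast.
dict_payload = pickle.dumps(metadata)

# List form: both ranks fix the field order in code, so only the
# values go over the wire.
list_payload = pickle.dumps([metadata["input_tokens"],
                             metadata["input_positions"],
                             metadata["seq_lens"]])

assert len(list_payload) < len(dict_payload)

# The receiver reconstructs by position instead of by key lookup.
input_tokens, input_positions, seq_lens = pickle.loads(list_payload)
```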

The measured end-to-end performance gain from the saved data transfer is negligible, however (at least for TP=2 llama-2-7b).

This PR is based on top of vllm-project#4844.

@pcmoritz commented May 20, 2024

Maybe instead of implementing our own serialization here, we should use something like msgspec, which is very high performance and also allows good buffer handling. I drafted something here, let me know what you think :)

It supports not sending the keys via array_like=True, and if we make the data structures we want to send msgspec.Struct subclasses, we can avoid all the Python data munging.

vllm-project/vllm@main...pcmoritz:vllm-public:speed-up-comms
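For readers unfamiliar with msgspec, here is a minimal sketch of the pattern being described; the struct name and fields are illustrative, not the draft's actual types:

```python
import msgspec

# With array_like=True, the struct is encoded as a MessagePack array,
# so the field names never go on the wire; the field order is fixed by
# the schema on both sender and receiver.
class SeqMetadata(msgspec.Struct, array_like=True):
    input_tokens: list[int]
    input_positions: list[int]
    seq_lens: list[int]

encoder = msgspec.msgpack.Encoder()
decoder = msgspec.msgpack.Decoder(SeqMetadata)

msg = SeqMetadata(input_tokens=[1, 2, 3], input_positions=[0, 1, 2], seq_lens=[3])
buf = encoder.encode(msg)   # compact bytes, no keys included
out = decoder.decode(buf)   # typed, validated SeqMetadata

# For buffer reuse, Encoder.encode_into(msg, bytearray_buf) serializes
# into a caller-owned bytearray instead of allocating a new one.
```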

At Anyscale we have been using this method with great success in production for quite a while :)

@njhill (Owner, Author) commented May 21, 2024

Thanks @pcmoritz, that also looks like a good option!

I hadn't really been thinking of what's done currently / in this PR as our own serialization, since it still uses pickle, but perhaps msgspec would be better than pickle.

I have a sense that compactness of the serialization isn't very important here; it doesn't seem to make a noticeable latency difference in terms of data transfer. But maybe faster serialization/deserialization would make a difference; it would be good to compare in an end-to-end benchmark. Otherwise, I guess we should prefer simplicity of code/maintenance, minimizing boilerplate etc.
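For reference, a rough microbenchmark sketch of that serialize/deserialize comparison; the payload here is synthetic, not vLLM's real metadata, so only an end-to-end benchmark would settle the question:

```python
import pickle
import timeit

import msgspec

class Meta(msgspec.Struct, array_like=True):
    tokens: list[int]
    positions: list[int]

payload = Meta(tokens=list(range(256)), positions=list(range(256)))
enc = msgspec.msgpack.Encoder()
dec = msgspec.msgpack.Decoder(Meta)

# Round-trip each codec on the same data; lower is faster.
t_pickle = timeit.timeit(
    lambda: pickle.loads(pickle.dumps((payload.tokens, payload.positions))),
    number=10_000)
t_msgspec = timeit.timeit(
    lambda: dec.decode(enc.encode(payload)), number=10_000)
print(f"pickle: {t_pickle:.3f}s  msgspec: {t_msgspec:.3f}s")
```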

I see you just used a fixed buffer size per call site. Assuming the size doesn't meaningfully affect speed, that would be simpler than the auto-sizing done in vllm-project#4844.

@pcmoritz commented
Yes, I think a fixed buffer size will be good if we can ensure all the metadata fits, because it will be very simple :)
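For illustration, a rough sketch of what a fixed-buffer broadcast with a fits-check could look like. The names and the 64 KiB cap are assumptions, it presumes torch.distributed is already initialized, and it is not the actual code from either PR:

```python
import struct

import msgspec
import torch
import torch.distributed as dist

# Assumed per-call-site cap; the assert below is the "ensure all the
# metadata fits" guarantee that makes auto-sizing unnecessary.
MAX_METADATA_BYTES = 1 << 16  # 64 KiB

encoder = msgspec.msgpack.Encoder()

def broadcast_metadata(obj=None, src: int = 0) -> bytes:
    """One fixed-size collective; a 4-byte length prefix delimits the payload."""
    buf = torch.zeros(MAX_METADATA_BYTES, dtype=torch.uint8)
    if dist.get_rank() == src:
        payload = encoder.encode(obj)
        assert 4 + len(payload) <= MAX_METADATA_BYTES, "metadata too large"
        framed = struct.pack("<I", len(payload)) + payload
        buf[: len(framed)] = torch.frombuffer(bytearray(framed), dtype=torch.uint8)
    dist.broadcast(buf, src=src)  # same size every step, no size exchange
    raw = buf.numpy().tobytes()
    (n,) = struct.unpack_from("<I", raw)
    return raw[4 : 4 + n]  # caller decodes, e.g. with msgspec.msgpack.decode
```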
