[Text Generation] Multitoken prefill enablement #1130
Conversation
…om/neuralmagic/deepsparse into feature/damian/causal_mask_support
Took a deeper look following our offline conversation and I understand why you had to go this route. LGTM, but let's update an existing diagram or add a new one to explain the relationship between the decoder engine, cache, state, state transfer, and capacity.
```python
# self.prompt_processing_sequence_length)
num_non_blank_cache_entries = min(
    num_non_blank_cache_entries,
    self.sequence_length - self.prompt_processing_sequence_length,
)
```
Shouldn't this be the total remaining tokens, i.e. something like `self.sequence_length - idx * self.prompt_processing_sequence_length`? Or am I missing something?
We are essentially talking about the same thing, but my logic was way too overcomplicated. I refactored the function, so now hopefully anyone reading it will grasp what's going on.
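To illustrate the capping logic under discussion, here is a hedged, standalone sketch (the function name and parameters are hypothetical; this is not the actual refactored code from the PR). The cap ensures the reported number of non-blank cache entries never exceeds the cache capacity left over after reserving room for the next chunk of prompt tokens:

```python
def compute_num_non_blank_cache_entries(
    num_processed_tokens: int,
    sequence_length: int,
    prompt_processing_sequence_length: int,
) -> int:
    """Return how many cache entries hold real (non-blank) tokens.

    Capped so the engine always has room for the next chunk of
    `prompt_processing_sequence_length` input tokens within the
    fixed `sequence_length` capacity.
    """
    return min(
        num_processed_tokens,
        sequence_length - prompt_processing_sequence_length,
    )
```

For example, with a sequence length of 128 and a prefill chunk of 16 tokens, the value saturates at 112 no matter how many tokens have been processed.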
…nto feature/damian/multitoken_prefill
…tention_mask for the multitoken prefill
* Update helpers.py
* correct implementation of the mapping from inputs to causal mask
* [Text Generation] Causal Mask Support (#1127)
* initial commit
* clean up the PR
* working implementation
* Ben's review comments
* [Text Generation] Multitoken prefill enablement (#1130)
* initial commit
* clean up the PR
* working implementation
* initial implementation, hacky lets clean it up
* ready for review
* few tiny quality improvements
* simplify the logic for computing num of unmasked bits for creating attention_mask for the multitoken prefill
* replace boolean causal mask for int64 causal mask
* fix breaking tests
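The commits above mention switching to an int64 causal mask. A minimal NumPy sketch of how such a mask can be constructed (illustrative only; the function name and shape conventions are assumptions, not the PR's actual implementation): each new input token may attend to every populated cache entry plus the new tokens up to and including itself.

```python
import numpy as np


def make_causal_mask(num_input_tokens: int, num_cache_entries: int) -> np.ndarray:
    """Build an int64 causal mask of shape (num_input_tokens, total_positions).

    Row i is 1 for all cache entries and for input positions 0..i,
    and 0 for future input positions.
    """
    total_positions = num_cache_entries + num_input_tokens
    mask = np.zeros((num_input_tokens, total_positions), dtype=np.int64)
    for i in range(num_input_tokens):
        mask[i, : num_cache_entries + i + 1] = 1
    return mask
```

Using int64 instead of bool keeps the mask directly consumable by engines that expect integer attention-mask inputs.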
Enable running the pipeline in a mode where the prompt is processed (the prefill scenario) through multiple consecutive passes of the multitoken engine. The goal is to achieve optimal inference speed with the DeepSparse engine.
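A hedged sketch of the chunking step this description implies (the helper name is hypothetical and not from the PR): the prompt is split into full chunks for the multitoken engine, with any leftover tokens handled one at a time by the single-token engine.

```python
def split_prompt_for_prefill(prompt_tokens: list, chunk_size: int):
    """Split a prompt into full chunks for multitoken prefill.

    Returns (chunks, remainder): `chunks` is a list of full-size
    token chunks for the multitoken engine; `remainder` holds the
    trailing tokens left for the single-token engine.
    """
    num_full_chunks = len(prompt_tokens) // chunk_size
    chunks = [
        prompt_tokens[i * chunk_size : (i + 1) * chunk_size]
        for i in range(num_full_chunks)
    ]
    remainder = prompt_tokens[num_full_chunks * chunk_size :]
    return chunks, remainder
```

For a 10-token prompt and a chunk size of 4, this yields two full chunks of 4 tokens and a 2-token remainder.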
Manual Testing
Results: