-
Notifications
You must be signed in to change notification settings - Fork 452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bloom KV cache usage in ORTForCausalLM #1152
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solution looks good enough to me 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for now. If this conditioning on the model type grows maybe we can find a nicer way of doing it, but right now it seems acceptable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @fxmarty !
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
The documentation is not available anymore as the PR was closed or merged. |
Fixes #1150
@michaelbenayoun @echarlaix Basically not all models share the exact same
_reorder_cache
andprepare_inputs_for_generation
, in particular here bloom. What do you think of this solution? @echarlaix I would guess there is the same bug in optimum-intel.I think my solution is very ugly (now
ORTBloomForCausalLM
andORTModelForCausalLM
need to be in the same file forever). An other approach is to move all shared methods toORTModelDecoder
(effectively making it a mixin class) and havingORTBloomForCausalLM
not inherit fromORTModelForCausalLM
, but it does not solve the issue of "all classes in one file", and more importantly I believe that changing the inheritance is too breaking of a change (i.e.isinstance(ort_bloom_model, ORTModelForCausalLM)
not working anymore).Other solution: have a single
prepare_inputs_for_generation
,_reorder_cache
, and dispatch to the relevant function from a dictionary. This adds dynamism, which I think is better to avoid.Note: should add tests for
num_beams>1
in this PR as well