
Add torch.compile Support For Mamba #31247

Conversation

@zhenglongjiepheonix (Contributor) commented Jun 4, 2024:

torch.compile support for mamba! Closes #31246

@zhenglongjiepheonix (Contributor, Author):

It seems the mamba cache is not compatible with the current cache design used in generate, yet we have to initialize the cache before stepping into model.forward in order to make dynamo happy. A mamba-specific conditional check in get_cache might not be what we want because it's too narrow a patch. We could let users create and pass a mamba cache themselves when using torch.compile, or is there a way to set the cache up for users inside generate? @ArthurZucker
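
A minimal sketch of the "let the user create the cache" option mentioned above, reusing the MambaCache constructor and the cache_params kwarg as they appear elsewhere in this PR; illustrative only, not the final API:

import torch
from transformers import AutoTokenizer, MambaForCausalLM
# at the time of this PR, `MambaCache` lives in modeling_mamba
from transformers.models.mamba.modeling_mamba import MambaCache

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-1.4b-hf")
model = MambaForCausalLM.from_pretrained("state-spaces/mamba-1.4b-hf", torch_dtype=torch.float16).to("cuda")
input_ids = tokenizer("Hey how are you doing today?", return_tensors="pt").input_ids.to("cuda")

# user-allocated cache; a compiled forward (or generate) would then just reuse it
# instead of having to create it inside the traced graph
cache = MambaCache(config=model.config, batch_size=input_ids.shape[0], dtype=model.dtype, device=model.device)
logits = model(input_ids, cache_params=cache).logits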

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment:

Looking good!
I think for this we want a general solution that would also work for hybrid caches (like Jamba / Mamba2 / Zamba, etc.).
Here it's possible to init the cache before going into the forward if you extend NEED_SETUP_CACHE_CLASSES_MAPPING = {"static": StaticCache, "sliding_window": SlidingWindowCache} with "mamba"? It's not too bad 😅
Otherwise we could redefine the StaticCache for mamba to be the MambaCache class.
cc @zucchini-nlp and @gante 😉
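
A minimal sketch of the mapping extension suggested above, assuming the StaticCache / SlidingWindowCache imports are available where NEED_SETUP_CACHE_CLASSES_MAPPING is defined in src/transformers/generation/utils.py; illustrative only:

from transformers.cache_utils import SlidingWindowCache, StaticCache
from transformers.models.mamba.modeling_mamba import MambaCache

NEED_SETUP_CACHE_CLASSES_MAPPING = {
    "static": StaticCache,
    "sliding_window": SlidingWindowCache,
    "mamba": MambaCache,  # proposed addition so generate can set the cache up before the first forward
}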

@zhenglongjiepheonix changed the title from "[WIP] Add torch.compile Support For Mamba" to "Add torch.compile Support For Mamba" on Jun 7, 2024
@ArthurZucker (Collaborator) left a comment:

nice 🔥

src/transformers/models/mamba/modeling_mamba.py (three review threads, outdated, resolved)
Comment on lines 128 to 129
    def is_initialized(self, layer_idx):
        return self.is_cache_initialized[layer_idx]
Collaborator:

This can be checked with cache_positions instead, no?

@zhenglongjiepheonix (Contributor, Author):

I think this is fine, just like we need a flag in Whisper. Here cache_positions is not that meaningful, because we always know how to update and read the cache even when cache_positions is not passed.

Collaborator:

Theoretically yes, but this adds some complexity, which is not needed in the cache API. Checking the cache positions is more reliable, and is what we want to go with.

  • You don't have to reset and set another tensor, which is also a win.

Let's just use the cache positions.
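
A rough sketch of what "check with the cache positions" could look like, using the shape-based variant that comes up later in this thread; the helper name and signature are assumptions for illustration:

import torch

def in_prefill_stage(cache_position: torch.LongTensor, conv_kernel_size: int) -> bool:
    # At prefill, `cache_position` spans the full convolution window; during decoding,
    # generate advances a single position per step. Checking the shape keeps the test
    # data-independent, which matters for fullgraph compilation (see the discussion below).
    return cache_position.shape[0] == conv_kernel_size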

@ArthurZucker (Collaborator):

Could you share benchmark results?

@zhenglongjiepheonix (Contributor, Author) commented Jun 8, 2024:

Could you share benchmark results?

Sure. With mamba-1.4b on a single A100-SXM4-80GB in float16, batch_size=1, inference mode; times are per generated token:

  • slow_forward: [benchmark screenshot]
  • cuda_kernel_forward: [benchmark screenshot]
  • compile: [benchmark screenshot]

And here is the function I used for benchmarking

import torch
from transformers import AutoTokenizer, MambaForCausalLM
# at the time of this PR, `MambaCache` lives in modeling_mamba
from transformers.models.mamba.modeling_mamba import MambaCache


@torch.no_grad()
def perf():
    tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-1.4b-hf")
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer("Hey how are you doing today ? " * 100, return_tensors="pt", padding=True).to("cuda")

    model = MambaForCausalLM.from_pretrained("state-spaces/mamba-1.4b-hf", torch_dtype=torch.float16)
    model.config.use_cache = True
    model.to("cuda")

    input_ids = inputs.input_ids
    cache = MambaCache(model.config, 1, device=input_ids.device)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # prefill is run eagerly: compiling it takes far too long (see the note below)
    logits = model(input_ids, cache_params=cache).logits
    next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]

    # only the single-token decoding forward is compiled
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
    torch.cuda.synchronize()
    for i in range(10):
        start.record()
        logits = model(next_token.clone(), cache_params=cache).logits
        next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
        end.record()
        torch.cuda.synchronize()
        print(f"Step {i}, Total time: {start.elapsed_time(end)} ms, next_token = {next_token.int()}")

As the results above show, the first and second decoding steps take a long time. In my script I also skipped compilation of the prefill stage, because it takes forever (nearly one hour) to compile. So, focusing only on the decoding phase, we get a steady 8x speedup, even compared with the CUDA kernel implementation.

@ArthurZucker (Collaborator) left a comment:

Let's use cache positions for the check, and be careful of BC!
Otherwise great work!

src/transformers/models/mamba/modeling_mamba.py (outdated, resolved)

src/transformers/models/mamba/modeling_mamba.py (outdated, resolved)
tests/models/mamba/test_modeling_mamba.py (outdated, resolved)
@ArthurZucker (Collaborator) left a comment:

Almost done! Again, great work! Let's just define a good API for future models; RecurrentGemma, for example, will also benefit from this!

Comment on lines 1746 to 1766:

elif generation_config.cache_implementation == "mamba":
    from ..models.mamba.modeling_mamba import MambaCache, MambaConfig

    if not isinstance(self.config, MambaConfig):
        raise ValueError(
            "You can only specify `cache_implementation` to `mamba` if you are using mamba model"
        )

    if hasattr(self, "_cache"):
        assert isinstance(self._cache, MambaCache), "Only `MambaCache` can be used on mamba model"
        need_new_cache = self._cache.conv_states.shape[1] != batch_size
    else:
        need_new_cache = True

    if need_new_cache:
        self._cache = MambaCache(
            config=self.config, batch_size=batch_size, dtype=self.dtype, device=self.device
        )
    else:
        self._cache.reset()
    model_kwargs["cache_params"] = self._cache
Collaborator:

The problem with this is that it does not scale to new models. It's not something we want to do at all, TBH.
The simplest is to import MambaCache and add it to the mapping: "mamba": MambaCache.

needs_new_cache should be specific to the cache class.
Maybe this is the best approach, as for each new cache class it gives a correct way to say whether or not we reset!
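
A sketch of the "specific to the cache class" idea: the method name needs_new_cache is taken from the comment above and is illustrative, not the merged API; the batch-size check mirrors the conv_states.shape[1] test in the quoted diff.

class MambaCache:
    # ... existing __init__ allocates self.conv_states with the batch dimension at index 1 ...

    def needs_new_cache(self, batch_size: int) -> bool:
        # The conv buffer is allocated per batch, so a different batch size means the
        # existing cache cannot be reused and a fresh MambaCache must be built.
        return self.conv_states.shape[1] != batch_size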

    self, layer_idx: int, new_conv_state: torch.Tensor, cache_position: torch.LongTensor
) -> torch.Tensor:
    conv_state = self.conv_states[layer_idx]
    cache_position = cache_position.clamp(0, self.conv_kernel_size - 1)
Collaborator:

A bool flag should be dynamo-compatible, but I trust you on this one, and it's fairly small, so LGTM.

            hidden_states = self.act(hidden_states).to(dtype).unsqueeze(-1)  # [batch, intermediate_size, 1] : decoding
        else:

            if cache_position.shape[0] == self.conv_kernel_size:
Collaborator:

Might be worth adding a comment in the code to explain the trick.
I'm more in favor of using cache_position[0] to detect decoding if it works; if not, then a small comment!

        cache_params = MambaCache(
            self.config, inputs_embeds.size(0), device=inputs_embeds.device, dtype=inputs_embeds.dtype
        )
        cache_position = torch.arange(0, self.config.conv_kernel, device=inputs_embeds.device)
Collaborator:

Okay. cache_position[0] > 0 breaks fullgraph, I guess?

        input_ids = input_ids[:, -1].unsqueeze(-1)
    if use_cache:
        # `cache_position` should have been initialized in `generate`
        assert cache_position is not None
Collaborator:

Let's raise an error rather than using asserts.
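
Sketched, the requested change is roughly this (the exact error message is illustrative):

if use_cache:
    # `cache_position` should have been initialized in `generate`
    if cache_position is None:
        raise ValueError(
            "`cache_position` should not be None, it should have been initialized in `generate`"
        )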

@ArthurZucker (Collaborator) left a comment:

Looks good to me! cc @gante if you can have a look for the generate changes!

@@ -1751,7 +1758,8 @@ def generate(
)

use_dynamic_cache_by_default = False
if generation_config.cache_implementation is not None and model_kwargs.get("past_key_values") is not None:
cache_name = getattr(self, "cache_name", "past_key_values")
Collaborator:

Might be better to set this as a class attribute: everything that inherits from Cache gets "past_key_values" and mamba gets "cache_params". WDYT?

@zhenglongjiepheonix (Contributor, Author) commented Jul 13, 2024:

Currently MambaCache does not inherit from Cache, because the Cache APIs only suit transformer models with KV states. So you mean making cache_name a class attribute of Cache and MambaCache, with values past_key_values and cache_params respectively?

Collaborator:

cache_name = "cache_params" for the mamba cache class, and cache_name = "past_key_values" for Cache classes!
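
A sketch of that suggestion (the class attributes shown here are illustrative; the PR ends up resolving the name differently, see below):

class Cache:
    # kwarg name a model expects for this kind of cache
    cache_name = "past_key_values"

class MambaCache:
    cache_name = "cache_params"

# generate could then look the name up from the cache class instead of the model,
# e.g. cache_name = cache_cls.cache_name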

Member:

I also prefer a more verbose version for now. I spent some time looking for the cache_name variable in this review, which is not a good indicator of readability :D

e.g.

if "mamba" in self.__class__.__name__.lower():
    cache_var_name = "cache_params"
else:
    cache_var_name = "past_key_values"

@zhenglongjiepheonix (Contributor, Author):

OK, I guess it's a way to check whether we are using a mamba-related model. There is an issue with attaching cache_name to Cache: we need to know which cache we are creating in order to know the cache name, which becomes circular when we try to check whether users passed both cache_implementation and a cache instance. Let's go with this for now.

src/transformers/generation/utils.py (resolved)
Comment on lines +664 to +668
# we initialize the `cache_position` to full size of `conv_states` at prefill stage
# considering padding will be applied when input length is shorter, and truncation
# will be applied when it is longer, so it will be equivalent to always have it match
# the length of `cache_params.conv_states`, which is `config.conv_kernel`
cache_position = torch.arange(0, self.config.conv_kernel, device=input_ids.device)
Collaborator:

I think there might be a more compile-friendly way to do this, but that will be a TODO. https://docs.google.com/document/d/1y5CRfMLdwEoF1nTk9q8qEu1mgMUuUtvhklPKJ2emLU8/edit#heading=h.ivdr7fmrbeab might have answers; since I don't, LGTM for now!

@zhenglongjiepheonix (Contributor, Author) commented Jul 13, 2024:

Yes, it's just a matter of using data-independent ops. It could be either a flag or a shape-dependent check to tell which stage we are in; a bool flag in forward would also do the trick, but we introduced cache_position to address exactly this. Another way to think about it: we effectively alter the length of the hidden states by applying (positive or negative) padding before updating the cache, so cache_position needs to stay aligned with the hidden states after padding rather than before.
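
A small illustration of the padding/truncation point (the helper name is made up for illustration; shapes are simplified, and conv_kernel stands for config.conv_kernel):

import torch
import torch.nn.functional as F

def pad_or_truncate_to_conv_window(hidden_states: torch.Tensor, conv_kernel: int) -> torch.Tensor:
    # hidden_states: (batch, channels, seq_len) at prefill time.
    # A positive left pad fills with zeros when the prompt is shorter than the window;
    # a negative left pad truncates when it is longer. Either way the result has length
    # `conv_kernel`, so cache_position = torch.arange(conv_kernel) stays aligned with it.
    return F.pad(hidden_states, (conv_kernel - hidden_states.shape[-1], 0))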

Collaborator:

Got it, thanks!

@gante (Member) left a comment:

generate changes look (mostly) good to me 🤗


@ArthurZucker (Collaborator):

Looks good!

@zhenglongjiepheonix merged commit c75969e into huggingface:main on Jul 18, 2024
23 of 24 checks passed
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jul 19, 2024
* modify mamba cache

* set up cache

* add test

* [run-slow] mamba

* [run-slow] mamba

* address comments

* [run-slow] mamba

* use_cache_position

* [run-slow] mamba

* [run-slow] mamba

* [run-slow] mamba

* [run-slow] mamba

* fix

* cache in generate

* [run-slow] mamba

* address comments

* [run-slow] mamba

* [run-slow] mamba

* address comments

* [run-slow] mamba

* fix

* [run-slow] mamba

* fix

* [run-slow] mamba

* fix cache name

* [run-slow] mamba
@ArthurZucker (Collaborator):

Congrats on the merge! 🔥

MHRDYN7 pushed a commit to MHRDYN7/transformers that referenced this pull request Jul 23, 2024
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Jul 24, 2024
itazap pushed a commit that referenced this pull request Jul 25, 2024
dataKim1201 pushed a commit to dataKim1201/transformers that referenced this pull request Oct 7, 2024