Question on Comparison between Mamba and S4 #486

Open
MstarLioning opened this issue Jul 23, 2024 · 2 comments
Comments

@MstarLioning

Hello, I am currently reading Mamba-1 and there is one point I don't quite understand. In the comparison with S4, the paper mentions that, in order to make Mamba dependent on the input, matrix $B$ is changed from shape (hidden state size × input size) to (batch size × sequence length × hidden state size). However, aren't $W_q$, $W_k$, and $W_v$ in Transformers also of shape (hidden state size × input size)? So why does incorporating the batch size and sequence length resolve the content-awareness issue? I hope to receive your reply and am deeply grateful!

@albertfgu
Contributor

$B$ in Mamba is analogous to $K$ in attention (not $W_k$)
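
Put differently, here is a minimal shape sketch (hypothetical names and sizes, not the actual Mamba code): the fixed learned weight plays the role of $W_k$, while the input-dependent tensor it produces plays the role of $K$, and in Mamba that produced tensor is $B$.

```python
import torch

batch, seq_len, d_model, d_state = 2, 16, 64, 8      # illustrative sizes

X = torch.randn(batch, seq_len, d_model)

# Attention: W_k is a fixed learned weight; K = X @ W_k changes with the input.
W_k = torch.randn(d_model, d_state)
K = X @ W_k                                           # (batch, seq_len, d_state)

# Selective SSM: the projection weight is likewise fixed; the B it produces
# changes with the input, which is what makes the layer content-aware.
W_B = torch.randn(d_model, d_state)                   # analogous to W_k
B = X @ W_B                                           # (batch, seq_len, d_state), analogous to K
```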

@n1o

n1o commented Aug 12, 2024

@MstarLioning

Well, my take on S4 is that you enforce structure on your parameters in the SSM, and then perform a convolution that is implicitly parametrized by that SSM. The downside of this approach is that the parameters are not functions of your input. In Mamba you actually compute projections of the input $x$ to get your parameters $\Delta$, $B$, and $C$ (while $A$ itself stays input-independent), so they change from token to token. And with Mamba-2 you do the same, but the model is a bit more constrained.
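
For what it's worth, here is a rough toy sketch of that distinction (simplified, hypothetical code; the real S4/Mamba layers also learn a step size $\Delta$, discretize $A$ with it, and use structured/diagonal $A$ rather than a dense matrix):

```python
import torch
import torch.nn as nn

class ToySSM(nn.Module):
    """Toy single-channel SSM: h_t = A h_{t-1} + B_t * x_t, y_t = <C_t, h_t>."""
    def __init__(self, d_state=8, selective=True):
        super().__init__()
        self.selective = selective
        self.A = nn.Parameter(0.01 * torch.randn(d_state, d_state))  # input-independent in both cases
        if selective:
            # Mamba-style: B_t and C_t are computed from the current input value
            self.to_B = nn.Linear(1, d_state)
            self.to_C = nn.Linear(1, d_state)
        else:
            # S4-style: B and C are fixed learned vectors, identical for every step,
            # so the whole sequence map collapses into one implicit convolution
            self.B = nn.Parameter(torch.randn(d_state))
            self.C = nn.Parameter(torch.randn(d_state))

    def forward(self, x):                      # x: (batch, seq_len)
        h = x.new_zeros(x.shape[0], self.A.shape[0])
        ys = []
        for t in range(x.shape[1]):
            xt = x[:, t:t+1]                   # (batch, 1)
            B = self.to_B(xt) if self.selective else self.B   # per-token vs. shared
            C = self.to_C(xt) if self.selective else self.C
            h = h @ self.A.T + B * xt          # state update; B * xt broadcasts either way
            ys.append((h * C).sum(-1))         # y_t = <C, h_t>
        return torch.stack(ys, dim=1)          # (batch, seq_len)

y = ToySSM(selective=True)(torch.randn(2, 16))
```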
