Question on Comparison between Mamba and S4 #486

Open
MstarLioning opened this issue Jul 23, 2024 · 2 comments
Comments

@MstarLioning

Hello, I am currently reading Mamba-1 and there is one point I don't quite understand. In the comparison with S4, the paper mentions that, in order to make Mamba dependent on the input, matrix $B$ is changed from shape (hidden state size × input size) to (batch size × sequence length × hidden state size). However, aren't $W_q$, $W_k$, and $W_v$ in Transformers also of shape (hidden state size × input size)? So why does incorporating the batch size and sequence length resolve the content-awareness issue? I hope to receive your reply and am deeply grateful!

@albertfgu
Contributor

$B$ in Mamba is analogous to $K$ in attention (not $W_k$)
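
Put differently, here is a minimal shape sketch (hypothetical names and sizes, not the actual Mamba code): the fixed learned weight plays the role of $W_k$, while the input-dependent tensor it produces plays the role of $K$, and in Mamba that produced tensor is $B$.

```python
import torch

batch, seq_len, d_model, d_state = 2, 16, 64, 8      # illustrative sizes

X = torch.randn(batch, seq_len, d_model)

# Attention: W_k is a fixed learned weight; K = X @ W_k changes with the input.
W_k = torch.randn(d_model, d_state)
K = X @ W_k                                           # (batch, seq_len, d_state)

# Selective SSM: the projection weight is likewise fixed; the B it produces
# changes with the input, which is what makes the layer content-aware.
W_B = torch.randn(d_model, d_state)                   # analogous to W_k
B = X @ W_B                                           # (batch, seq_len, d_state), analogous to K
```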

@n1o

n1o commented Aug 12, 2024

@MstarLioning

Well, my take on S4 is that you enforce structure on your parameters in the SSM, and then perform a convolution that is implicitly parametrized by that SSM. The downside of this approach is that the parameters are not functions of your input. In Mamba you actually compute projections of the input $x$ to get your parameters $\Delta$, $B$, and $C$ (while $A$ itself stays input-independent), so they change from token to token. And with Mamba-2 you do the same, but the model is a bit more constrained.
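
For what it's worth, here is a rough toy sketch of that distinction (simplified, hypothetical code; the real S4/Mamba layers also learn a step size $\Delta$, discretize $A$ with it, and use structured/diagonal $A$ rather than a dense matrix):

```python
import torch
import torch.nn as nn

class ToySSM(nn.Module):
    """Toy single-channel SSM: h_t = A h_{t-1} + B_t * x_t, y_t = <C_t, h_t>."""
    def __init__(self, d_state=8, selective=True):
        super().__init__()
        self.selective = selective
        self.A = nn.Parameter(0.01 * torch.randn(d_state, d_state))  # input-independent in both cases
        if selective:
            # Mamba-style: B_t and C_t are computed from the current input value
            self.to_B = nn.Linear(1, d_state)
            self.to_C = nn.Linear(1, d_state)
        else:
            # S4-style: B and C are fixed learned vectors, identical for every step,
            # so the whole sequence map collapses into one implicit convolution
            self.B = nn.Parameter(torch.randn(d_state))
            self.C = nn.Parameter(torch.randn(d_state))

    def forward(self, x):                      # x: (batch, seq_len)
        h = x.new_zeros(x.shape[0], self.A.shape[0])
        ys = []
        for t in range(x.shape[1]):
            xt = x[:, t:t+1]                   # (batch, 1)
            B = self.to_B(xt) if self.selective else self.B   # per-token vs. shared
            C = self.to_C(xt) if self.selective else self.C
            h = h @ self.A.T + B * xt          # state update; B * xt broadcasts either way
            ys.append((h * C).sum(-1))         # y_t = <C, h_t>
        return torch.stack(ys, dim=1)          # (batch, seq_len)

y = ToySSM(selective=True)(torch.randn(2, 16))
```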
