
High GPU memory consumption #7

Open
saareliad opened this issue Aug 8, 2019 · 4 comments


@saareliad

saareliad commented Aug 8, 2019

Hi,
I tried to integrate the TTLinear layer into TransformerXL, but I found that it consumes much more memory than usual. I couldn't even train it.

The model had 151M parameters before compression and 124M after. It consumed much more memory even at inference: 3021 MB for the compressed model versus 2132 MB for the normal model.

I also tried to write the "forward" method more efficiently (e.g. with bmm), but that didn't help either.

Did you experience such problems? Do you know any way around this?
Thanks,

@khakhulin
Owner

Hi!

Could you please share information about your GPU and batch size? Did you try to compress only part of the layers and turn off compression for the attention matrices (see the arguments)? How many GPUs did you use?

Also, I carried out some experiments with Transformer-XL in January. At the end of August I'll try to find my code.

Hint: if you want to increase the compression ratio, you can also try to compress the embedding (or projection matrices) in the same way.

@saareliad
Author

  • I compressed all FF layers (only those).

  • The compression mode was Tensor Train (is that what you meant?).

  • Tested on 4 GPUs (Titan Xp).

  • For training: batch size = 64 (4 on GPU0, 20 on the other GPUs), seq_len = 150. I also tried to reduce the batch size, although that's not desirable.

BTW I tried to profile the memory; most of it is allocated when doing the matmuls.
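To show what I mean, here is a simplified sketch of the TT contraction pattern (shapes and names are illustrative, not the exact TTLinear code). Each step produces an intermediate of size batch × (remaining input modes) × (accumulated output modes) × rank, and every one of them gets saved for backward, which seems to be where the memory goes:

```python
import torch

def tt_linear_forward(x, cores):
    """Sketch of a TT-matrix linear layer: y = x @ W, with W factored into
    TT cores of shape (r_{k-1}, n_k, m_k, r_k), where r_0 = r_d = 1.
    x has shape (batch, n_1 * ... * n_d)."""
    batch = x.shape[0]
    in_modes = [c.shape[1] for c in cores]
    # split the input dimension into its TT modes and add a rank dim of size 1
    t = x.reshape(batch, *in_modes).unsqueeze(-1)          # (b, n_1, ..., n_d, 1)
    for core in cores:
        # t at this point: (b, n_k, ..., n_d, m_1, ..., m_{k-1}, r_{k-1})
        # contract the leading input mode (dim 1) and the current rank (last dim)
        t = torch.tensordot(t, core, dims=([1, t.dim() - 1], [1, 0]))
        # t now: (b, n_{k+1}, ..., n_d, m_1, ..., m_k, r_k)
        # this intermediate is kept for backward at every step
    return t.reshape(batch, -1)                             # (b, m_1 * ... * m_d)

# example: a 512 -> 2048 layer split as (8, 8, 8) -> (8, 16, 16) with TT-rank 8
cores = [torch.randn(1, 8, 8, 8), torch.randn(8, 8, 16, 8), torch.randn(8, 8, 16, 1)]
y = tt_linear_forward(torch.randn(64 * 150, 8 * 8 * 8), cores)   # (9600, 2048)
```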

@khakhulin
Owner

That's strange, I'll look at the implementation of the TT layer one more time.

Unfortunately, I didn't find the code for LM, but maybe you will be interested in a NeurIPS 2019 paper that has the same idea for LM: A Tensorized Transformer for Language Modeling.

You could ask the authors to share their code.

@saareliad
Author

I partially solved it with the solution suggested in the t3nsor repo (reconstruct the full matrix, then do the operation). I also implemented it in my custom code and it worked there too.
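Roughly, the fix is to rebuild the dense weight from the TT cores and then do a single matmul, so the per-core contraction intermediates are never stored. A sketch of the idea (shapes and names are illustrative, not the actual t3nsor code):

```python
import torch
from math import prod

def tt_to_full(cores):
    """Rebuild the dense (in_features, out_features) weight from TT cores
    of shape (r_{k-1}, n_k, m_k, r_k), where r_0 = r_d = 1."""
    d = len(cores)
    in_modes = [c.shape[1] for c in cores]
    out_modes = [c.shape[2] for c in cores]
    # start with the first core flattened to (n_1 * m_1, r_1)
    full = cores[0].reshape(-1, cores[0].shape[-1])
    for core in cores[1:]:
        r_prev, n_k, m_k, r_k = core.shape
        full = full @ core.reshape(r_prev, -1)    # absorb the next core
        full = full.reshape(-1, r_k)              # fold everything but the new rank
    # entries are now indexed by (n_1, m_1, n_2, m_2, ..., n_d, m_d)
    full = full.reshape(*[s for pair in zip(in_modes, out_modes) for s in pair])
    # group all input modes first, then all output modes, and flatten
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    return full.permute(*perm).reshape(prod(in_modes), prod(out_modes))

def tt_linear_forward_via_full(x, cores):
    # one big matmul against the reconstructed weight; only x and the rebuilt
    # weight are kept for backward, not a chain of per-core intermediates
    return x @ tt_to_full(cores)
```

The full weight gets materialized on every forward pass, so you don't save weight memory at run time, but the large activation intermediates (which scale with batch × seq_len × rank) go away, and those were what blew up the memory.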

Btw, I read the paper you mentioned (and also emailed the author, who didn't answer, and I think he had a good reason). I found a lot of problems with it. The code they published is very bad, to say the least, and one of the proofs is completely incorrect.

Check the Reddit thread on it, where I publicly shared some of my criticism (some of which also comes from my team):
https://www.reddit.com/r/MachineLearning/comments/c4zxc6/r_a_tensorized_transformer_for_language_modeling/

Without working code to back up their claims, I don't believe anything they say.

I even implemented it myself based on the paper and found more nonsense...
