
multilingual-bert issue? #6

Open
Smu-Tan opened this issue Jan 25, 2022 · 4 comments

Smu-Tan commented Jan 25, 2022

Hi,

I found something weird: when using multilingual BERT, e.g. bert-base-multilingual-uncased, it seems like grad_cache doesn't work. I know it sounds strange, since switching between BERT models shouldn't affect it, but I tried regular BERT, German BERT, and mBERT, and only the latter needs a very small batch_size (like 4) to run successfully. Other models like German BERT run with batch_size=128 without problems. Do you perhaps know the reason for this? Btw, great paper and code, extremely helpful! Thanks in advance!
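
For context, my setup looks roughly like the sketch below; the encoder names, chunk size, pooling, and loss are placeholders rather than my exact script, and the GradCache call follows the class interface shown in this repo's README:

```python
# Minimal sketch of the failing setup (placeholders, not the actual training script).
# Assumes the GradCache class interface from the README: models, chunk_sizes,
# loss_fn, get_rep_fn, and a callable gc(x, y) that runs the cached forward/backward.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from grad_cache import GradCache

def contrastive_loss(q_reps, p_reps):
    # in-batch negatives: query i is paired with passage i
    scores = q_reps @ p_reps.transpose(0, 1)
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

model_name = "bert-base-multilingual-uncased"   # German BERT at batch_size=128 is fine
tokenizer = AutoTokenizer.from_pretrained(model_name)
q_encoder = AutoModel.from_pretrained(model_name).cuda()
p_encoder = AutoModel.from_pretrained(model_name).cuda()

gc = GradCache(
    models=[q_encoder, p_encoder],
    chunk_sizes=8,                                         # sub-batch per forward pass
    loss_fn=contrastive_loss,
    get_rep_fn=lambda out: out.last_hidden_state[:, 0],    # CLS pooling as an example
)

queries = tokenizer(["a query"] * 128, padding=True, return_tensors="pt").to("cuda")
passages = tokenizer(["a passage"] * 128, padding=True, return_tensors="pt").to("cuda")
loss = gc(queries, passages)                               # cached forward/backward over chunks
```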

luyug (Owner) commented Jan 26, 2022

Thanks for your interest in my work!

As a sanity check, can you try training bert-base-multilingual-uncased with grad cache disabled and compare its memory usage against some of the other models'? I think we first need to make sure the regression does not come from mBERT itself.
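
Something along these lines would do; this is only a sketch using transformers and PyTorch's peak-memory counters, and the German checkpoint name, batch size, and dummy objective are placeholders for whatever you actually use:

```python
# Sketch of the suggested sanity check: peak GPU memory of a plain forward/backward
# (no grad cache) at the same batch size for each model.
import torch
from transformers import AutoModel, AutoTokenizer

def peak_memory_mib(model_name, batch_size=32, seq_len=128):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).cuda().train()
    batch = tokenizer(["some text"] * batch_size, padding="max_length",
                      truncation=True, max_length=seq_len,
                      return_tensors="pt").to("cuda")
    torch.cuda.reset_peak_memory_stats()
    out = model(**batch)
    out.last_hidden_state[:, 0].sum().backward()     # dummy objective, just to build grads
    return torch.cuda.max_memory_allocated() / 2**20

for name in ["bert-base-uncased",
             "bert-base-german-cased",
             "bert-base-multilingual-uncased"]:
    print(f"{name}: {peak_memory_mib(name):.0f} MiB")
    torch.cuda.empty_cache()                         # release cached blocks between runs
```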

Smu-Tan (Author) commented Feb 14, 2022

Hi, indeed, I tried it without grad_cache and it requires an even smaller batch_size, so grad_cache does work for mBERT as well.

Smu-Tan (Author) commented Feb 14, 2022

Another question: does train_dense_retriever support multi-GPU training as well? Since mBERT requires more memory, I think using multiple GPUs might help. I tried python -m torch.distributed.launch --nproc_per_node=4 train_dense_retriever.py both with and without grad_cache, and I got RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.
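
For reference, even a bare-bones DDP run independent of train_dense_retriever.py, launched with the same command, would tell us whether plain multi-GPU training works at all; the sketch below is just such a placeholder (the file name ddp_check.py and the stand-in loss are not from the project):

```python
# ddp_check.py - bare-bones DDP smoke test, independent of train_dense_retriever.py,
# to verify that a plain multi-GPU forward/backward works before involving grad_cache.
# Launch: python -m torch.distributed.launch --nproc_per_node=4 ddp_check.py
import argparse
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModel, AutoTokenizer

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")

    name = "bert-base-multilingual-uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = DDP(AutoModel.from_pretrained(name).cuda(), device_ids=[args.local_rank])

    batch = tokenizer(["a sentence"] * 8, padding=True, return_tensors="pt").to("cuda")
    reps = model(**batch).pooler_output   # pooler output so every parameter gets a grad
    loss = (reps ** 2).mean()             # stand-in objective; real training uses a contrastive loss
    loss.backward()                       # should not raise "element 0 of tensors does not require grad"
    if dist.get_rank() == 0:
        print("backward OK, loss =", loss.item())

if __name__ == "__main__":
    main()
```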

luyug (Owner) commented Feb 15, 2022

It was working the last time we checked. Maybe take a look at #2, where we had a discussion on parallelism.

If it does not help, please paste the entire error log here. I will take a look. Also, make sure this is not an issue unique to mBERT.

Alternatively, we have been developing another dense retrieval toolkit called Tevatron, where I have built in native multi-GPU/TPU grad cache support. Maybe take a look at it to see if it better aligns with your needs.
