Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ModelCheckpoint with custom filepath don't support training on multiple nodes #2916

Closed
angshine opened this issue Aug 11, 2020 · 6 comments · Fixed by #2925
Closed

ModelCheckpoint with custom filepath don't support training on multiple nodes #2916

angshine opened this issue Aug 11, 2020 · 6 comments · Fixed by #2925
Labels
bug Something isn't working help wanted Open to be worked on priority: 0 High priority task

Comments

@angshine
Copy link

🐛 Bug

When training on multiple nodes using ModelCheckpoint with custom filepath, it will raise FileExistsError caused by the following line of code: model_checkpoint.py#L127.

Maybe a try-except block is needed?

@angshine angshine added bug Something isn't working help wanted Open to be worked on labels Aug 11, 2020
@awaelchli
Copy link
Member

awaelchli commented Aug 11, 2020

No I think we just need to pass in exist_ok=True into makedirs :)

@awaelchli
Copy link
Member

@f4hy your PR added this line. Do you know a good way to fix it?

@f4hy
Copy link
Contributor

f4hy commented Aug 11, 2020

Ah sorry. I think I know what's up. I'll get a patch out this evening. Sorry!

@Borda Borda added the priority: 0 High priority task label Aug 11, 2020
@f4hy
Copy link
Contributor

f4hy commented Aug 12, 2020

@angshine I found a few issues with the model checkpoint path stuff. Not 100% sure I found the particular bug you were seeing but I think this should fix it. Can you give my branch in the above PR a test? Sorry to have introduced this bug for you.

@angshine
Copy link
Author

angshine commented Sep 3, 2020

Sorry for the late reply, but it seems that this bug has not been fully fixed. This line still raises an exception: tensorboard.compat.tensorflow_stub.errors.AlreadyExistsError: Directory already exists when training with DDP. I still need to manually add a try-except to ignore the exception.

@f4hy
Copy link
Contributor

f4hy commented Sep 4, 2020

@angshine That line has been completely replaced now on master. Can you give it another try. I hope #3320 has finally resolved this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Open to be worked on priority: 0 High priority task
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants