Checkpoint saving isn't atomic #688

fgerzer · 2020-01-15T15:42:14Z

🐛 Bug

Checkpoints are not saved atomically. If the process is interrupted mid-write (for example, by a SIGKILL), the checkpoint file is left incomplete, and any subsequent attempt to load it fails with:
RuntimeError: unexpected EOF, expected 8 more bytes. The file might be corrupted.

To Reproduce

This is difficult to reproduce because it depends on timing outside the code's control. For me, it happens with fast-running models that take ~1-4 seconds per epoch.
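Since the real failure depends on timing, a deterministic stand-in can help when testing a fix. The sketch below simulates an interrupted save by writing only part of the serialized bytes and then trying to load the file. It uses the standard-library pickle module as a stand-in for the actual checkpoint format; write_truncated and try_load are hypothetical helper names, not part of any library, and the exception raised by a real checkpoint loader will differ from the one shown here.

```python
import os
import pickle
import tempfile


def write_truncated(path, obj, fraction=0.5):
    """Simulate a save that is killed partway: write only the first part of the bytes."""
    data = pickle.dumps(obj)
    cut = max(1, int(len(data) * fraction))
    with open(path, "wb") as f:
        f.write(data[:cut])


def try_load(path):
    """Return (True, obj) on success, or (False, exception) if the file is incomplete."""
    try:
        with open(path, "rb") as f:
            return True, pickle.load(f)
    except (EOFError, pickle.UnpicklingError) as exc:
        return False, exc


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        ckpt = os.path.join(tmp_dir, "ckpt.pkl")
        write_truncated(ckpt, {"epoch": 3, "weights": list(range(1000))})
        ok, result = try_load(ckpt)
        print("loaded" if ok else f"load failed: {result!r}")
```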

Expected behavior

Checkpointing should be robust to such interruptions: an interrupted save should not leave a corrupted checkpoint behind, so that training can simply continue as-is.
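One common way to get this behavior is to write to a temporary file in the same directory and then rename it over the final path. A minimal sketch follows, assuming a POSIX-style filesystem where a rename within one directory is atomic. atomic_save is a hypothetical helper, pickle.dump stands in for the real serializer, and this is not necessarily how #689 implements the fix.

```python
import os
import pickle
import tempfile


def atomic_save(obj, path, dump=pickle.dump):
    """Write obj to path so that readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f:
            dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # replaces any existing file at path
    except BaseException:
        # Clean up the partial temp file; the existing checkpoint is untouched.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern, a SIGKILL during the write can still leave a stale .part file behind, because no cleanup code runs, but the file at path is either the previous complete checkpoint or the new complete one.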


fgerzer added the "bug" label (Something isn't working) on Jan 15, 2020

fgerzer mentioned this issue on Jan 15, 2020 in: Added atomic checkpoint creation #689 (merged, 4 tasks)

williamFalcon closed this as completed in #689 Jan 20, 2020
