🐛 Bug
Checkpoints are not saved atomically. If the process is interrupted while a checkpoint is being written (for example by a SIGKILL), the file is left incomplete, and every subsequent attempt to load it fails with
RuntimeError: unexpected EOF, expected 8 more bytes. The file might be corrupted.
To Reproduce
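This is difficult to reproduce reliably, since it depends on the process being killed at the exact moment a checkpoint is being written, which the code itself does not control. For me, it happens with fast-running models that take ~1-4 seconds per epoch.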
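Since the timing is hard to hit, the effect of an interrupted write can be imitated by truncating a serialized file on purpose. The sketch below is only an illustration: it uses the standard-library `pickle` module as a stand-in for the real checkpoint format, and `write_truncated_checkpoint` and `try_load` are hypothetical helper names, not part of any library.

```python
import pickle


def write_truncated_checkpoint(path, state, keep_fraction=0.5):
    """Write only the first part of the serialized state, as if the
    process had been killed in the middle of the write."""
    payload = pickle.dumps(state)
    cut = max(1, int(len(payload) * keep_fraction))
    with open(path, "wb") as f:
        f.write(payload[:cut])
    return cut, len(payload)


def try_load(path):
    """Try to load a pickled checkpoint.

    Returns (True, state) on success, or (False, exception) when the
    file turns out to be incomplete.
    """
    try:
        with open(path, "rb") as f:
            return True, pickle.load(f)
    except (EOFError, pickle.UnpicklingError) as exc:
        return False, exc
```

Calling `write_truncated_checkpoint("demo.ckpt", state)` followed by `try_load("demo.ckpt")` should report a failed load, which mirrors the situation described above: the file exists, but its contents end early.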
Expected behavior
Checkpointing should be resistant to such issues: an interrupted save should not leave a corrupted checkpoint behind, so that training can simply continue as-is.
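One common way to get this behavior is to write to a temporary file in the same directory and then rename it over the final path, so the final path only ever holds a complete file. The following is a minimal sketch of that idea, not the project's actual implementation. `atomic_save` and its `write_fn` parameter are hypothetical names, and `pickle.dump` is used as the default writer only to keep the example self-contained. The error message above appears to come from PyTorch's loader, so in practice `write_fn` would presumably be `torch.save`, which also takes the object and a file-like object.

```python
import os
import pickle
import tempfile


def atomic_save(state, path, write_fn=pickle.dump):
    """Write `state` to `path` via a temporary file plus rename.

    `write_fn(state, file_obj)` does the actual serialization
    (hypothetical parameter; defaults to pickle.dump for this sketch).
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as `path`,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # Replaces any existing file at `path` in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Best-effort cleanup of the partial temp file, then re-raise.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern, a kill during the write leaves the previous checkpoint at `path` untouched. A SIGKILL cannot be caught, so the cleanup branch will not run in that case and a stray `.tmp-ckpt-*` file may remain in the directory; it is never mistaken for a valid checkpoint because it does not have the final name.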