Avoid unnecessary copy saving checkpoint to fsspec files #17590

Closed
byronyi wants to merge 4 commits

@byronyi byronyi commented May 8, 2023

In particular, saving checkpoints larger than 2 GB hits a "Requested array size exceeds VM limit" error when writing to HDFS through pyarrow:

```
  File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/strategies/strategy.py", line 466, in save_checkpoint
    self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "/usr/local/lib/python3.9/dist-packages/lightning_fabric/plugins/io/torch_io.py", line 54, in save_checkpoint
    _atomic_save(checkpoint, path)
  File "/usr/local/lib/python3.9/dist-packages/lightning_fabric/utilities/cloud_io.py", line 72, in _atomic_save
    f.write(bytesbuffer.getvalue())
  File "pyarrow/io.pxi", line 359, in pyarrow.lib.NativeFile.write
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
OSError: [Errno 12] HDFS Write failed. Detail: [errno 12] Cannot allocate memory
```
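
The failing frame is `f.write(bytesbuffer.getvalue())`. `io.BytesIO.getvalue()` returns a full `bytes` copy of the buffer, so the whole checkpoint is duplicated in memory and then handed to the file in a single write. The following is a minimal sketch of one way to avoid both, not necessarily the change made in this PR: it uses `BytesIO.getbuffer()` to get a zero-copy `memoryview` and writes it in bounded slices. The helper name `write_in_chunks` and the 64 MiB default chunk size are assumptions chosen for illustration.

```python
import io


def write_in_chunks(buffer: io.BytesIO, sink, chunk_size: int = 64 * 1024 * 1024) -> int:
    """Write the contents of `buffer` to `sink` without copying the whole payload.

    `sink` is any binary file-like object with a `write` method, for example
    the file object obtained from an fsspec filesystem. `write_in_chunks` and
    the default `chunk_size` are hypothetical, illustrative choices.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    written = 0
    # getbuffer() exposes the BytesIO storage as a memoryview (no copy),
    # whereas getvalue() would allocate a second full-size bytes object.
    with buffer.getbuffer() as view:
        for start in range(0, len(view), chunk_size):
            # Slicing a memoryview yields another view, still without copying.
            with view[start:start + chunk_size] as chunk:
                sink.write(chunk)
                written += len(chunk)
    return written
```

In `_atomic_save`, a helper like this would stand in for the single `f.write(bytesbuffer.getvalue())` call. Whether a given backend accepts `memoryview` objects in `write` is also an assumption here; if it does not, `bytes(chunk)` copies only one chunk at a time.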
@github-actions github-actions bot added the fabric lightning.fabric.Fabric label May 8, 2023
@awaelchli awaelchli added checkpointing Related to checkpointing community This PR is from the community labels May 8, 2023
@stale stale bot added the won't fix This will not be worked on label Jun 18, 2023
@Lightning-AI Lightning-AI deleted a comment from stale bot Jun 18, 2023
@stale stale bot removed won't fix This will not be worked on labels Jun 18, 2023
@carmocca carmocca removed their assignment Jul 4, 2023
stale bot commented Aug 12, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Aug 12, 2023
stale bot commented Sep 17, 2023

This pull request is going to be closed. Please feel free to reopen it or create a new one based on top of the 'master' branch.

@stale stale bot closed this Sep 17, 2023
Labels
checkpointing (Related to checkpointing), community (This PR is from the community), fabric (lightning.fabric.Fabric), has conflicts, won't fix (This will not be worked on)
4 participants