
Add AWS S3 i/o #2175

Closed · wants to merge 6 commits into from

Conversation

@Laksh1997 commented Jun 13, 2020:

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Allows users to save and load checkpoints to S3 paths (e.g. "s3://my-bucket/my-folder/my-checkpoint.ckpt").
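
A hypothetical usage sketch (assuming the 0.8-era `ModelCheckpoint`/`Trainer` API; the bucket and folder names are illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Checkpoints would be written to, and restored from, the S3 URI directly.
checkpoint_callback = ModelCheckpoint(filepath="s3://my-bucket/my-folder/my-checkpoint.ckpt")
trainer = Trainer(checkpoint_callback=checkpoint_callback)
```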

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team June 13, 2020 15:21


```python
def load_s3_checkpoint(checkpoint_path, map_location, **pickle_load_args):
    from torch.serialization import _legacy_load
```

Member:

Why is `_legacy_load` required here? I think we should aim for `torch.load` here.

Author:

We could use boto3 to download an S3 file, then use `torch.load`. Thoughts?
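
Concretely, a minimal sketch of that approach, assuming boto3 credentials are configured (the URI parsing is illustrative, not the PR's final code):

```python
import io
from urllib.parse import urlparse

import boto3
import torch


def load_s3_checkpoint(checkpoint_path, map_location=None, **pickle_load_args):
    # Split "s3://bucket/key" into bucket and key.
    parsed = urlparse(checkpoint_path)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")

    # Download the object into memory, then defer to plain torch.load.
    buffer = io.BytesIO()
    boto3.client("s3").download_fileobj(bucket, key, buffer)
    buffer.seek(0)
    return torch.load(buffer, map_location=map_location, **pickle_load_args)
```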



```python
def save_s3_checkpoint(checkpoint, checkpoint_path):
    from torch.serialization import _legacy_save
```

Member:

Same as above. IMO we should aim for `torch.save`.

Author:

`torch.save` can only save locally. We could use boto3 to upload locally saved files. Thoughts?
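
A sketch of that route; note that `torch.save` also accepts a file-like object, so an in-memory buffer can stand in for the local file (names are illustrative):

```python
import io
from urllib.parse import urlparse

import boto3
import torch


def save_s3_checkpoint(checkpoint, checkpoint_path):
    # Serialize into an in-memory buffer instead of a temporary local file.
    buffer = io.BytesIO()
    torch.save(checkpoint, buffer)
    buffer.seek(0)

    # Upload the serialized bytes to the parsed bucket/key.
    parsed = urlparse(checkpoint_path)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    boto3.client("s3").upload_fileobj(buffer, bucket, key)
```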

@mergify mergify bot requested a review from a team June 13, 2020 15:55
@Borda Borda added the feature Is an improvement or enhancement label Jun 13, 2020
@williamFalcon (Contributor):
@Borda this has failing tests

@Laksh1997 (Author):

Thoughts on `_legacy_save` and direct write to S3 vs normal `torch.save` to local, then boto3 upload to S3?

@Borda Borda changed the title (WIP) Add AWS S3 i/o [blocked by #2176] Add AWS S3 i/o Jun 13, 2020
@Borda (Member) commented Jun 13, 2020:

> @Borda this has failing tests

I am fixing master here: #2176

> Thoughts on `_legacy_save` and direct write to S3 vs normal `torch.save` to local, then boto3 upload to S3?

Depends on how long the legacy func will be available in PyTorch.
@Laksh1997 mind adding a test for these new functions?

@Laksh1997 (Author):

I'm thinking about how to add tests.

One option is to actually save and load a file to an open-access S3 bucket.

Another way is to use the moto framework; this would, however, require using boto3 for the upload.

I'm still thinking about what the best solution is; any help is appreciated.
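
For reference, a moto-based test could look roughly like this (assuming the `save_s3_checkpoint`/`load_s3_checkpoint` helpers sketched in the review threads above):

```python
import boto3
import torch
from moto import mock_s3


@mock_s3
def test_s3_checkpoint_roundtrip():
    # moto intercepts boto3's S3 calls, so no real AWS account is touched.
    boto3.client("s3", region_name="us-east-1").create_bucket(Bucket="my-bucket")

    checkpoint = {"epoch": 3, "state_dict": {"weight": torch.ones(2, 2)}}
    path = "s3://my-bucket/checkpoints/test.ckpt"

    save_s3_checkpoint(checkpoint, path)
    loaded = load_s3_checkpoint(path, map_location="cpu")

    assert loaded["epoch"] == 3
    assert torch.equal(loaded["state_dict"]["weight"], checkpoint["state_dict"]["weight"])
```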

@awaelchli (Member):

> Thoughts on `_legacy_save` and direct write to S3 vs normal `torch.save` to local, then boto3 upload to S3?

Anything that has "legacy" in its name is a big alarm sign for me hehe :)
Aren't we supposed to use TorchServe to upload stuff to the Amazon servers?

@Laksh1997 (Author):

@awaelchli I agree that `_legacy_save` raises alarms.

Shall we do this then?

  • The user passes in a checkpoint filepath, e.g. s3://my-bucket/my-key/checkpoint.ckpt

  • Initially, we save it locally with torch.save under the filepath my-key/checkpoint.ckpt

  • We then upload the file to the specified S3 path using boto3

The only complication with the above: if model checkpointing is configured to keep only the best 5 models, we'll end up saving many more to S3, because every new top-K entry triggers an upload. To get around this, we would also have to delete the corresponding files on S3.

Thoughts?

@Laksh1997 (Author):

Thoughts on deleting the cloud checkpoint if it's no longer in the top K models?

We would need to edit `_del_model` in `model_checkpoint.py` on line 158.
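
A hedged sketch of what such a deletion hook could call, mirroring the URI parsing above (the helper name is illustrative, not the PR's final API):

```python
from urllib.parse import urlparse

import boto3


def delete_s3_checkpoint(checkpoint_path):
    # Remove a checkpoint that has dropped out of the top K.
    parsed = urlparse(checkpoint_path)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    boto3.client("s3").delete_object(Bucket=bucket, Key=key)
```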

@Laksh1997 (Author):

Alright, moving everything over to boto3. That will make tests easier with moto.

@f4hy (Contributor) commented Jun 13, 2020:

This seems related to #2164, which would support S3/GCS/HDFS.

@Laksh1997 (Author):

@f4hy Yeah, it seems so. What shall we do?

@f4hy (Contributor) commented Jun 13, 2020:

> @f4hy Yeah, it seems so. What shall we do?

I need support for writing all logs and such to a remote path, not just the checkpoints, and I need support for paths other than just S3.

That's why I set up #2164 to use gfile. This problem has been solved elsewhere, so we should just use another lib. Gfile is great, but it requires pip install tensorflow to get full support. Tensorboard has a pruned-down implementation that uses the full TF gfile if present, and otherwise provides just local and S3 support, so hopefully that is acceptable and requires no extra dependency.

I think it's a mistake to do this in a way that only implements S3; we would then need to implement the same thing again for GCS, HDFS, or others. If the gfile solution is not acceptable, I hope we can find some other lib that would do this for us.
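
Roughly, the gfile route looks like this (a sketch; the import path is an assumption about the TensorBoard version, which falls back to its own stub when full TensorFlow is absent):

```python
import io

import torch
from tensorboard.compat import tf  # resolves to tensorboard's stub without TF

gfile = tf.io.gfile


def save_checkpoint(checkpoint, path):
    # Works for local paths and "s3://..." URIs alike.
    buffer = io.BytesIO()
    torch.save(checkpoint, buffer)
    with gfile.GFile(path, "wb") as f:
        f.write(buffer.getvalue())


def load_checkpoint(path, map_location=None):
    with gfile.GFile(path, "rb") as f:
        buffer = io.BytesIO(f.read())
    return torch.load(buffer, map_location=map_location)
```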

@Laksh1997 (Author):

Hmm, I agree @f4hy. Let's think of a unified way to do this long term.

I feel, however, that because S3 is AWS, the cleanest way to get S3 support would be to use boto3. One also gets a nice testing framework with it (moto).

@Borda @williamFalcon I've just redone the PR to work using boto3. Will be adding moto tests soon. Any comments, please let me know.

@Borda Borda changed the title [blocked by #2176] Add AWS S3 i/o Add AWS S3 i/o Jun 14, 2020
@yukw777 (Contributor) commented Jun 15, 2020:

I mentioned this in #1532, but https://github.com/RaRe-Technologies/smart_open is really nice. It provides a file-like interface for all kinds of transports. I highly recommend using it for this PR!
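
For illustration, the smart_open pattern would be roughly this (a sketch; it serializes to memory first because torch.save's zipfile format wants a seekable file, which a streaming S3 upload is not):

```python
import io

import torch
from smart_open import open as smart_open_file


def save_checkpoint(checkpoint, path):
    # smart_open dispatches on the URI scheme: local paths, s3://, gs://, hdfs://, ...
    buffer = io.BytesIO()
    torch.save(checkpoint, buffer)
    with smart_open_file(path, "wb") as f:
        f.write(buffer.getvalue())


def load_checkpoint(path, map_location=None):
    with smart_open_file(path, "rb") as f:
        return torch.load(io.BytesIO(f.read()), map_location=map_location)
```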

mergify bot commented Jun 16, 2020:

This pull request is now in conflict... :(

@f4hy (Contributor) commented Jun 16, 2020:

> I mentioned this in #1532, but https://github.com/RaRe-Technologies/smart_open is really nice. It provides a file-like interface for all kinds of transports. I highly recommend using it for this PR!

How is this better than the same gfile interface provided by tensorboard? Again, see #2164; we don't need to add a new dependency.

@yukw777 (Contributor) commented Jun 16, 2020:

@f4hy smart_open supports many more protocols than the gfile interface, with a clean, unified API and some nifty extra features (not that we necessarily need them), so we wouldn't need to keep writing new code to support protocols beyond S3, GCS, and HDFS. I personally think the benefit of introducing this new dependency is worth the effort (I'm currently using it at work, and it's been working quite nicely for us), but ultimately, if the core team decides to go another route, I'm fine with that too.

```diff
@@ -95,7 +96,7 @@ class ModelCheckpoint(Callback):

     def __init__(self, filepath: Optional[str] = None, monitor: str = 'val_loss', verbose: bool = False,
                  save_last: bool = False, save_top_k: int = 1, save_weights_only: bool = False,
-                 mode: str = 'auto', period: int = 1, prefix: str = ''):
+                 mode: str = 'auto', period: int = 1, prefix: str = '', remove_non_top_k_s3_files: bool = True):
```

Member:

Here I would keep the signature synced, so I would drop `remove_non_top_k_s3_files`.

@mergify mergify bot requested a review from a team June 26, 2020 22:09
@Borda Borda added this to the 0.8.x milestone Jun 26, 2020
mergify bot commented Jun 27, 2020:

This pull request is now in conflict... :(

@Borda Borda changed the title Add AWS S3 i/o [blocked by #2164] Add AWS S3 i/o Jun 27, 2020
@Borda Borda modified the milestones: 0.8.x, 0.9.0 Aug 6, 2020
@awaelchli awaelchli modified the milestones: 0.9.0, 0.9.x Aug 8, 2020
@Borda Borda changed the title [blocked by #2164] Add AWS S3 i/o Add AWS S3 i/o Aug 9, 2020
@Borda (Member) commented Aug 9, 2020:

@Laksh1997 can we finish this one? :]

@f4hy (Contributor) commented Aug 9, 2020:

I believe #2164 resolved this as well. After #2894, I am able to train and have all outputs go to AWS S3.

@Borda (Member) commented Aug 9, 2020:

Cool, so let's close this one for now and reopen if needed :]

@Borda Borda closed this Aug 9, 2020
@Borda Borda modified the milestones: 0.9.x, 0.9.0 Aug 20, 2020