Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL passed to fs.listdir instead of a path #15379

Closed
leoleoasd opened this issue Oct 28, 2022 · 4 comments · Fixed by #15413
Closed

URL passed to fs.listdir instead of a path #15379

leoleoasd opened this issue Oct 28, 2022 · 4 comments · Fixed by #15413
Labels
bug Something isn't working checkpointing Related to checkpointing
Milestone

Comments

@leoleoasd
Copy link
Contributor

leoleoasd commented Oct 28, 2022

Bug description

https://github.com/Lightning-AI/lightning/blob/master/src/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L582
This line is called fs.listdir(dir_path) where dir_path is an URL instead of a path.

pyarrow will complain about it:

pyarrow.lib.ArrowInvalid: FileSelector.base_dir must not be a URI, got: hdfs:///somewhere/MNIST

How to reproduce the bug

use a HDFS path in trainer's default_root_dir

Error messages and logs

  File "main.py", line 63, in main
    trainer.fit(mnist_model, train_loader, valid_loader)
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1110, in _run
    self._restore_modules_and_callbacks(ckpt_path)
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1063, in _restore_modules_and_callbacks
    self._checkpoint_connector.resume_start(checkpoint_path)
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 78, in resume_start
    self.resume_checkpoint_path = self._hpc_resume_path or checkpoint_path
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 66, in _hpc_resume_path
    max_version = self.__max_ckpt_version_in_folder(dir_path_hpc, "hpc_ckpt_")
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 506, in __max_ckpt_version_in_folder
    files = [os.path.basename(f["name"]) for f in fs.listdir(dir_path)]
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/fsspec/spec.py", line 1301, in listdir
    return self.ls(path, detail=detail, **kwargs)
  File "/home/jobuser/build/yuxlu-test/environments/satellites/python/lib/python3.10/site-packages/fsspec/implementations/arrow.py", line 66, in ls
    for entry in self.fs.get_file_info(FileSelector(path))
  File "pyarrow/_fs.pyx", line 433, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: FileSelector.base_dir must not be a URI, got: hdfs:///somewhere/MNIST

Environment

fsspec: 2022.10.0
pyarrow: 8.0.0
pytorch_lightning: 1.7.7

More info

No response

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj

@leoleoasd leoleoasd added the needs triage Waiting to be triaged by maintainers label Oct 28, 2022
@awaelchli awaelchli added bug Something isn't working checkpointing Related to checkpointing and removed needs triage Waiting to be triaged by maintainers labels Oct 28, 2022
@leoleoasd
Copy link
Contributor Author

changing to this fixes it:

        # check directory existence
        # fs = get_filesystem(dir_path)
        import fsspec.core
        fs, uri = fsspec.core.url_to_fs(dir_path)
        if not fs.exists(uri):
            return None

        # check corresponding file existence
        files = [os.path.basename(f["name"]) for f in fs.listdir(uri)]
        files = [x for x in files if name_key in x]
        if len(files) == 0:
            return None

I think using url_to_fs and passing paths instead of URL to FS operations is the standard behavior, isn't it? Lightning's get_filesystem is just a wrapper around url_to_fs but it discards the returned path. Maybe we need to change all usage of get_filesystem, to make sure a path is passed to FS instead of an URL?

@awaelchli
Copy link
Contributor

@leoleoasd Thanks for reporting this.
I don't fully understand what the difference is between the uri and dir_path. I feel like fs.exists(dir_path) should work as expected like it does for S3 and other filesystems.

@leoleoasd
Copy link
Contributor Author

Yes, fs.exists(path) works, but I don't think it is the standard usage. (Maybe it won't work in some else implementations)
https://filesystem-spec.readthedocs.io/en/latest/usage.html#use-a-file-system
All examples in fsspec document call fs.xxxx with a path instead of a URI.

@leoleoasd
Copy link
Contributor Author

I think, to follow the examples, we should update all usages of fs.xxx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working checkpointing Related to checkpointing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants