-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid copying files with same name prefix when copying folder #576
Conversation
|
||
ind = min(indstar, indques, indbrace) | ||
prefix = path[:ind].split("/")[-1] | ||
return await super()._glob(path, prefix=prefix, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we apply the fix in the _find
method below then this glob prefix construction no longer work for the edge case highlighted here.
This is because the the glob subdir*
will create a subdir
prefix and will be handled like a folder in the _find
method (and add a trailing slash and miss the subdir.txt
file).
Also, not sure to understand the initial benefit of this _glob
method override. Was this just for caching?
Anyway, if another fix exists maybe this could be left as-is.
assert all(d.startswith(TEST_BUCKET + "/nested") for d in gcs.dircache) | ||
# gcs.dircache.clear() | ||
# gcs.glob(TEST_BUCKET + "/nested**1") | ||
# assert all(d.startswith(TEST_BUCKET + "/nested") for d in gcs.dircache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the _glob
method override, _find
is now called for the parent folder (TEST_BUCKET
) so other folder like test
are now in the dircache.
@ianthomas23 , if you have time |
gcsfs/core.py
Outdated
@@ -1410,6 +1398,9 @@ async def _find( | |||
else: | |||
_prefix = key | |||
|
|||
if _prefix != "" and await self._isdir(f"{bucket}/{_prefix}"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no way to avoid isdir here? This doubles the number of calls for a simple list - and isdir surely is doing a list op anyway!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it's not useful to double the calls there. That's why I initially indicated that a better solution should probably exist ;)
I updated the code with this commit to try out another solution inspired by the _find
method in s3fs which first fetches the path as if it is a directory then calls the _info
method on the path to check if it's a file and returns it accordingly. In gcsfs, we can't call _info
because some tests are expecting versioned files (_info
does not support versioned files) or a prefixed search. So I simply moved the initial implementation into a block executed if the first search (when we treat path as a directory) does not return anything. It still doubles the calls when we are searching a single file though.
Let me know if that works for you. This commit is tested against this PR to fix this test.
ebd17cb
to
d1ff25f
Compare
Will be releasing tomorrow :) |
As described here, when copying a folder, the files with the same name prefix are also copied.
This PR fixes this behavior. I guess there's probably a better way to fix this though.
TODO:
main
once this PR is merged so the derived tests are executed in this repo.