Inconsistent pathlib recursive globbing when using Windows 10 case-sensitive filepaths #94537

Closed
calebthomas259 opened this issue Jul 3, 2022 · 7 comments
Labels
OS-windows topic-pathlib type-bug An unexpected behavior, bug, or error

@calebthomas259

Bug report

Windows now supports marking specific directories on an NTFS filesystem as case sensitive (link). Unfortunately, there seems to be inconsistent behaviour when detecting case-sensitive filenames using pathlib's recursive globbing.

Minimal example

Run these commands in PowerShell:

# create empty directory
mkdir temp1

# mark directory as case-sensitive (this command requires admin privileges)
fsutil.exe file setCaseSensitiveInfo temp1 enable

# make some test files
New-Item temp1/a.txt
New-Item temp1/A.txt

# start interactive python
python

Run these commands in interactive Python:

from pathlib import Path

# define directory
temp1 = Path("temp1")

# count number of files (method 1, standard glob)
len(list(temp1.glob("*.txt")))  # prints 2

# count number of files (method 2, recursive glob)
len(list(temp1.glob("**/*.txt")))  # prints 1

To clean up, return to PowerShell then type:

rm -r temp1

Actual behaviour
Method 1 detects both files, whereas method 2 only detects one file.

Expected behaviour
Both methods should detect the same number of files.

Your environment

  • CPython version tested on: 3.9.5 (installed via miniconda). Full version string is Python 3.9.5 (default, May 18 2021, 14:42:02) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32

  • Operating system and architecture: Windows 10, Version 21H1 (OS Build 19043.1766), x86_64 architecture

  • Filesystem: NTFS (set up on solid state drive). Some relevant details (taken from fsutil fsinfo volumeInfo) are:

Max Component Length : 255
File System Name : NTFS
Is ReadWrite
Not Thinly-Provisioned
Supports Case-sensitive filenames
Preserves Case of filenames
Supports Unicode in filenames
Preserves & Enforces ACL's
Supports file-based Compression
Supports Disk Quotas
Supports Sparse files
Supports Reparse Points
Returns Handle Close Result Information
Supports POSIX-style Unlink and Rename
Supports Object Identifiers
Supports Encrypted File System
Supports Named Streams
Supports Transactions
Supports Hard Links
Supports Extended Attributes
Supports Open By FileID
Supports USN Journal

Some more relevant details (taken from fsutil fsinfo ntfsInfo) are:

NTFS Version: 3.1
LFS Version: 2.0

calebthomas259 added the type-bug label on Jul 3, 2022

@eryksun
Contributor

eryksun commented Jul 11, 2022

Path objects compare and hash using the case-folded parts, such as when they're added to a set. I suppose a constructor parameter could override the platform's default case sensitivity.
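
The case-folded comparison described above can be observed on any platform with the pure path classes. This is only an illustrative sketch: the variable names are made up for the example, and it does not touch the filesystem. It suggests why any code path that de-duplicates results through a set or equality check could drop one of two names that differ only in case.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Windows-flavoured paths compare and hash on case-folded parts.
lower = PureWindowsPath("temp1/a.txt")
upper = PureWindowsPath("temp1/A.txt")

windows_equal = lower == upper        # True
windows_unique = len({lower, upper})  # 1: the set keeps only one entry

# POSIX-flavoured paths keep the two names distinct.
posix_unique = len({PurePosixPath("temp1/a.txt"), PurePosixPath("temp1/A.txt")})  # 2

print(windows_equal, windows_unique, posix_unique)
```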

@eryksun
Contributor

eryksun commented Jul 12, 2022

@barneygale, what do you think about an inheritable override for case sensitivity, e.g. Path("temp1", case_sensitive=True)? The Path instances created by temp1.glob("**/*.txt") would inherit this.

@barneygale
Contributor

Does ntpath.normcase('temp1/A.txt') do the right thing in this instance? I'd love to rely on that if possible.

@eryksun
Contributor

eryksun commented Jul 12, 2022

> Does ntpath.normcase('temp1/A.txt') do the right thing in this instance? I'd love to rely on that if possible.

ntpath.normcase() calls LCMapStringEx() to get a locale-invariant lowercase string. It doesn't query the system and filesystem to get the canonical names of the device and file path. That's possible for an open file or directory via GetFinalPathNameByHandleW(). In Windows 10 (NTDDI_WIN10_19H1 and above), it's possible to just get the normalized path in the filesystem, without the canonical device name, via GetFileInformationByHandleEx(): FileNormalizedNameInfo (24).
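
As a small illustration of the point that normcase() is a pure string transformation, the following sketch normalizes both test file names. It never consults the filesystem, so the two names collapse to the same string regardless of the directory's case-sensitivity flag. The variable names are chosen only for this example.

```python
import ntpath

# normcase() lowercases and converts forward slashes to backslashes;
# it has no knowledge of whether temp1 is marked case-sensitive.
normalized_lower = ntpath.normcase("temp1/a.txt")
normalized_upper = ntpath.normcase("temp1/A.txt")

print(normalized_lower)                      # temp1\a.txt
print(normalized_lower == normalized_upper)  # True
```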

For pattern matching, one needs to know whether or not the directory is case insensitive, e.g. should "a.txt" match "A.TXT"? In Windows 10 (NTDDI_WIN10_19H1 and above), this can be queried for an open directory via GetFileInformationByHandleEx(): FileCaseSensitiveInfo (23). Also, the filesystem, by way of WinAPI FindFirstFile[Ex]W(), honors the case sensitivity of a directory for DOS-style (weird) pattern matching, but Python only uses the find-file API to list all files in a directory (i.e. "*.*" or just "*").
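
To make the matching question concrete, here is a rough sketch of a matcher whose case sensitivity is chosen by the caller rather than by the platform. The helper name `name_matches` and its `case_sensitive` parameter are hypothetical and not part of pathlib; the sketch relies only on `fnmatch.translate()` and the `re` module.

```python
import fnmatch
import re


def name_matches(name, pattern, case_sensitive):
    """Match a file name against a shell-style pattern.

    Hypothetical helper: the caller decides case sensitivity, for example
    after querying the directory, instead of relying on the platform default.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile(fnmatch.translate(pattern), flags)
    return regex.match(name) is not None


print(name_matches("A.TXT", "a.txt", case_sensitive=False))  # True
print(name_matches("A.TXT", "a.txt", case_sensitive=True))   # False
```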

For now, I'm suggesting to only implement a manual parameter that overrides the platform default case sensitivity, which will be inherited for child paths returned as Path objects. More could be implemented later to query the filesystem.

@barneygale
Contributor

👍 gotcha.

At the moment it's finicky to pass a case_sensitive flag around to new Path objects generated by glob(), parents, etc., because pathlib uses a couple of private constructors - _from_parts() and _from_parsed_parts() - in addition to __new__(). As soon as #31691 lands I'm planning to merge those constructors and route generation of new Path objects via a new makepath() method. That method can be used to pass state/flags/etc. around to new objects, and can obviously be customized in subclasses too.

@barneygale
Contributor

FYI, glob() and rglob() now take a case_sensitive argument. See #81079.

@calebthomas259 would that satisfy your use case (once Python 3.12 is released)?
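
For readers on mixed Python versions, a minimal sketch of how the new argument might be used follows. The function name `recursive_glob_case_sensitive` is hypothetical. On Python 3.12 and later it passes `case_sensitive=True` to `glob()`; on older versions it falls back to `os.walk()` with `fnmatch.fnmatchcase()`, which is an assumption of this sketch and not something proposed in the thread. Whether this returns both `a.txt` and `A.txt` on a given Windows build has not been verified here.

```python
import fnmatch
import os
import sys
from pathlib import Path


def recursive_glob_case_sensitive(directory, pattern):
    """Recursively collect entries whose names match pattern, case-sensitively."""
    directory = Path(directory)
    if sys.version_info >= (3, 12):
        # glob() and rglob() accept case_sensitive from Python 3.12 onwards.
        return sorted(directory.glob("**/" + pattern, case_sensitive=True))
    # Fallback for older versions (files only, unlike glob(), which can
    # also return matching directories).
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if fnmatch.fnmatchcase(name, pattern):
                found.append(Path(root) / name)
    return sorted(found)
```

With the directory from the report, `len(recursive_glob_case_sensitive("temp1", "*.txt"))` would be the call to compare against the non-recursive count.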

@calebthomas259
Author

Yes! Thank you!
