Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-116380: Speed up glob.glob() by removing some system calls #116392

Open
wants to merge 73 commits into
base: main
Choose a base branch
from

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Mar 5, 2024

Speed up glob.glob() and glob.iglob() by reducing the number of system calls made.

This unifies the implementations of globbing in the glob and pathlib modules.

Depends on

Filtered recursive walk

Expanding a recursive ** segment entails walking the entire directory tree, and so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, glob.glob("foo/**/*.py", recursive=True) recursively walks foo/ with os.scandir(), and then filters paths through a regex based on "**/*.py, with no further filesystem access needed.

This solves #104269 as a side-effect.

Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

  • Certain special pattern segments ("", "." and "..") leave the flag unchanged
  • Literal pattern segments (e.g. foo/bar) set the flag to false
  • Wildcard pattern segments (e.g. */*.py) set the flag to true (because children are found via os.scandir())
  • Recursive pattern segments (e.g. **) leave the flag unchanged for the root path, and set it to true for descendants discovered via os.scandir().

If the flag is false at the end, we call lstat() on each path to filter out missing paths.

Minor speed-ups

We:

  • Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling is_dir().
  • Use a stack rather than recursion to implement recursive wildcards.
  • Pre-compile regular expressions and pre-join literal pattern segments.
  • Convert to/from bytes (a minor use-case) in iglob() rather than supporting bytes throughout. This particularly simplifies the code needed to handle relative bytes paths with dir_fd.
  • Avoid calling os.path.join(); instead we keep paths in a normalized form and append trailing slashes when needed.
  • Avoid calling os.path.normcase(); instead we use case-insensitive regex matching.

Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

  1. Support for dir_fd
  2. Support for include_hidden
  3. Support for generating paths relative to root_dir

Results

Speedups via python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)" from CPython source directory on Linux:

pattern speedup
Lib/* 1.48x
Lib/*/ 1.82x
Lib/*.py 1.15x
Lib/** 4.98x
Lib/**/ 1.31x
Lib/**/* 1.82x
Lib/**/** 14.9x
Lib/**/*/ 2.25x
Lib/**/*.py 1.81x
Lib/**/__init__.py 1.08x
Lib/**/*/*.py 2.38x
Lib/**/*/__init__.py 1.19x

@barneygale
Copy link
Contributor Author

barneygale commented Mar 5, 2024

Needs a fix for #116377 to land.

Copy link
Member

@gpshead gpshead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
@serhiy-storchaka serhiy-storchaka self-requested a review March 6, 2024 17:12
Lib/glob.py Show resolved Hide resolved
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not hurry to merge. This is an old code. The main advantage of the initial code was its simplicity, but since then it was complicated by adding new features and optimizations. In particularly the use of os.scandir() instead of os.listdir() significantly improved performance. The new implementation should be benchmarked with different test cases: deep and wide threes, files and directories domination.

Lib/glob.py Outdated Show resolved Hide resolved
@barneygale
Copy link
Contributor Author

barneygale commented Mar 7, 2024

Thanks Serhiy! We use os.scandir() if either:

  1. We're expanding a recursive wildcard (we need to distinguish directories in order to recurse)
  2. We're expanding a non-final non-recursive wildcard (we need to select only directories)

If neither of these are true, then we don't need to stat() the children, and so os.listdir() is actually a little faster I think. But I will test this on a few machines to be sure!

edit: to further illustrate what I mean, here's where os.listdir() is used:

non-recursive part recursive part
non-terminal part os.scandir() os.scandir()
terminal part os.listdir() <-- os.scandir()

@barneygale
Copy link
Contributor Author

The new implementation should be benchmarked with different test cases: deep and wide threes, files and directories domination.

I've been looking into this! The randomfiletree project is helpful - it can repeatedly walk a tree and create child files/folders according to a gaussian distribution, which seems to me like a good approximation for an average "shallow and wide" filesystem structure, including tweaking for file or folder distribution.

It's difficult to produce "deep and narrow" trees this way, as the file/folder probability would need to change with the depth (I think?). I've been considering writing a tree generator that works this way, e.g.:

  • At depth==0, generate 100 subdirectories
  • At 0 < depth < 50, generate 1 subdirectory
  • At depth==50, generate 100 files

... but is that overly arbitrary? Is there a better way? Or do I just need to come up with a bunch of test cases along those lines?

@barneygale
Copy link
Contributor Author

A test of 100 nested directories named "deep" from my Linux machine:

pattern speedup
deep/** 3.86x
deep/**/ 4.03x
deep/**/* 4.92x
deep/**/*/ 4.93x

@barneygale barneygale changed the title GH-116380: Make glob.glob() twice as fast GH-116380: Speed up glob.glob() by removing some system calls May 10, 2024
@barneygale
Copy link
Contributor Author

Hey @gpshead and @serhiy-storchaka - would you like to review this? If not, I might make an appeal on discuss.python.org in a couple weeks' time. Thanks for the feedback so far, and no worries at all if you'd rather focus your efforts elsewhere!

@barneygale
Copy link
Contributor Author

Just spotted an issue: previously iglob() returned a generator with a close() method that would os.close() any open file descriptors. I'll get this re-instated.

Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
Lib/glob.py Outdated Show resolved Hide resolved
@picnixz picnixz self-requested a review August 28, 2024 12:29
Doc/whatsnew/3.14.rst Outdated Show resolved Hide resolved
Comment on lines +73 to +76
for path in paths:
if path:
yield path
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for path in paths:
if path:
yield path
break
path = next(paths, None)
if path:
yield path

Not sure whether it's faster to use next or if instantiating the loop would be faster.

Lib/glob.py Show resolved Hide resolved
Lib/glob.py Show resolved Hide resolved
Comment on lines +100 to +101
for path in select(dirname, dir_fd, dirname):
yield slicer(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for path in select(dirname, dir_fd, dirname):
yield slicer(path)
yield from map(slicer, select(dirname, dir_fd, dirname))

You just want a generator I think (to be as lazy as possible)

return dirname or basename
return os.path.join(dirname, basename)
def _relative_glob(select, dirname, dir_fd=None):
"""Globs using a *select* function from the given dirname. The dirname
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe explain how dir_fd would be used if specified.

prefix is removed from results.
"""
dirname = _StringGlobber.add_slash(dirname)
slicer = operator.itemgetter(slice(len(dirname), None))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it more efficient to use this or a plain yield path[dirname_length:]? (with this approach, we can't use a map anymore).

Comment on lines +50 to +51
for path in _iglob(pathname, root_dir, dir_fd, recursive, include_hidden):
yield os.fsencode(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for path in _iglob(pathname, root_dir, dir_fd, recursive, include_hidden):
yield os.fsencode(path)
paths = _iglob(pathname, root_dir, dir_fd, recursive, include_hidden)
yield from map(os.fsencode, paths)

Lib/glob.py Outdated Show resolved Hide resolved
@barneygale
Copy link
Contributor Author

barneygale commented Aug 28, 2024

@picnixz to address some of your comments on using map() rather than looping and yield: I did this so that calling close() on the iglob(dir_fd=blah) generator causes os.close() to be called on all open file descriptors, which seems to work with a stack of for loops but not map(). But I didn't add a test case - I'll do that now :)

@picnixz
Copy link
Contributor

picnixz commented Aug 28, 2024

That's... an interesting functionality I wasn't aware of :) If someone could explain to me the reason I'd be happy. Anyway, let's keep your loops.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting core review performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants