GH-116380: Speed up `glob.glob()` by removing some system calls #116392

barneygale · 2024-03-05T23:13:48Z

Speed up glob.glob() and glob.iglob() by reducing the number of system calls made.

This unifies the implementations of globbing in the glob and pathlib modules.

Depends on

Filtered recursive walk

Expanding a recursive ** segment entails walking the entire directory tree, so any subsequent pattern segments (except special segments) can be evaluated by filtering the expanded paths through a regex. For example, glob.glob("foo/**/*.py", recursive=True) recursively walks foo/ with os.scandir(), and then filters paths through a regex based on "**/*.py", with no further filesystem access needed.

This solves #104269 as a side-effect.
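A minimal sketch of the filtered recursive walk described above. The helper names `walk` and `glob_recursive_suffix` and the constant `PY_SUFFIX` are illustrative and not taken from `Lib/glob.py`; the regex is a simplified hand-written stand-in for the one the real code compiles, and hidden-file handling is ignored:

```python
import os
import re
import tempfile


def walk(root):
    """Yield '/'-joined paths of everything below root, using a stack and os.scandir()."""
    stack = [root]
    while stack:
        parent = stack.pop()
        with os.scandir(parent) as entries:
            for entry in entries:
                path = parent + "/" + entry.name
                yield path
                if entry.is_dir(follow_symlinks=False):
                    stack.append(path)


def glob_recursive_suffix(root, suffix_regex):
    """Expand root/**/<suffix> with a single walk plus regex filtering."""
    match = re.compile(re.escape(root) + "/" + suffix_regex).fullmatch
    return sorted(path for path in walk(root) if match(path))


# Hypothetical regex standing in for "**/*.py":
# zero or more directories, then a name ending in ".py".
PY_SUFFIX = r"(?:[^/]+/)*[^/]+\.py"

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "foo", "pkg", "sub"))
    for rel in ("foo/a.py", "foo/pkg/b.py", "foo/pkg/sub/c.txt"):
        open(os.path.join(tmp, *rel.split("/")), "w").close()
    root = os.path.join(tmp, "foo")
    # Strip the temporary directory prefix so the results are easy to read.
    matches = [p[len(tmp) + 1:] for p in glob_recursive_suffix(root, PY_SUFFIX)]
```

The only filesystem access is the `os.scandir()` walk; deciding whether a path matches `*.py` is pure string work.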

Tracking path existence

We store a flag alongside each path indicating whether the path is guaranteed to exist. As we process the pattern:

- Certain special pattern segments ("", "." and "..") leave the flag unchanged
- Literal pattern segments (e.g. foo/bar) set the flag to false
- Wildcard pattern segments (e.g. */*.py) set the flag to true (because children are found via os.scandir())
- Recursive pattern segments (e.g. **) leave the flag unchanged for the root path, and set it to true for descendants discovered via os.scandir().

If the flag is false at the end, we call lstat() on each path to filter out missing paths.
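The rules above can be sketched as a small state machine. This is an illustration only: `possible_flags` and `needs_lstat` are hypothetical names, the initial flag value of false is an assumption, and the real code stores one flag per path rather than a set of possible values per pattern:

```python
def possible_flags(pattern, initial=False):
    """Return the set of 'guaranteed to exist' flag values that paths
    produced by *pattern* may carry. The initial value is an assumption."""
    flags = {initial}
    for segment in pattern.split("/"):
        if segment in ("", ".", ".."):
            continue                      # special segment: flag unchanged
        elif segment == "**":
            flags = flags | {True}        # root keeps its flag, descendants exist
        elif any(ch in segment for ch in "*?["):
            flags = {True}                # wildcard: children come from scandir()
        else:
            flags = {False}               # literal: nothing has been checked yet
    return flags


def needs_lstat(pattern):
    """True if some resulting path would still need an lstat() check."""
    return False in possible_flags(pattern)
```

For example, `needs_lstat("*/*.py")` is false because every result was seen by a directory scan, while `needs_lstat("*/setup.py")` is true because the final literal segment was only appended to each path.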

Minor speed-ups

We:

- Exclude paths that don't match a non-terminal non-recursive wildcard pattern prior to calling is_dir().
- Use a stack rather than recursion to implement recursive wildcards.
  - Addresses "Change shutil.rmtree and os.walk to support very deep hierarchies" (#89727) for the glob module.
- Pre-compile regular expressions and pre-join literal pattern segments.
- Convert to/from bytes (a minor use-case) in iglob() rather than supporting bytes throughout. This particularly simplifies the code needed to handle relative bytes paths with dir_fd.
- Avoid calling os.path.join(); instead we keep paths in a normalized form and append trailing slashes when needed.
- Avoid calling os.path.normcase(); instead we use case-insensitive regex matching.
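As a rough illustration of the last point, a pattern segment can be compiled once with `re.IGNORECASE` instead of normalizing the case of every candidate name. The helper `compile_segment` is hypothetical, and using `fnmatch.translate()` to build the regex is an assumption made for brevity, not necessarily how this PR builds it:

```python
import fnmatch
import re


def compile_segment(pattern, case_sensitive):
    """Compile one glob segment to a matcher; case folding is done by the regex."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(fnmatch.translate(pattern), flags).match


match_ci = compile_segment("*.py", case_sensitive=False)
match_cs = compile_segment("*.py", case_sensitive=True)
```

With these matchers, `match_ci("SETUP.PY")` succeeds while `match_cs("SETUP.PY")` returns `None`, and neither call touches `os.path.normcase()`.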

Implementation notes

Much of this functionality is already present in pathlib's implementation of globbing. The specific additions we make are:

- Support for dir_fd
- Support for include_hidden
- Support for generating paths relative to root_dir

Results

Speedups via python -m timeit -s "from glob import glob" "glob(pattern, recursive=True, include_hidden=True)" from CPython source directory on Linux:

pattern	speedup
`Lib/*`	1.48x
`Lib/*/`	1.82x
`Lib/*.py`	1.15x
`Lib/**`	4.98x
`Lib/**/`	1.31x
`Lib/**/*`	1.82x
`Lib/**/**`	14.9x
`Lib/**/*/`	2.25x
`Lib/**/*.py`	1.81x
`Lib/**/__init__.py`	1.08x
`Lib/**/*/*.py`	2.38x
`Lib/**/*/__init__.py`	1.19x

Issue: Speed up glob.glob() by reducing number of system calls made #116380

barneygale · 2024-03-05T23:53:36Z

Needs a fix for #116377 to land.

gpshead

Nice work!

serhiy-storchaka

Please do not hurry to merge. This is old code. The main advantage of the initial code was its simplicity, but since then it has been complicated by new features and optimizations. In particular, the use of os.scandir() instead of os.listdir() significantly improved performance. The new implementation should be benchmarked with different test cases: deep and wide trees, and trees dominated by files or by directories.

barneygale · 2024-03-07T12:01:44Z

Thanks Serhiy! We use os.scandir() if either:

- We're expanding a recursive wildcard (we need to distinguish directories in order to recurse)
- We're expanding a non-final non-recursive wildcard (we need to select only directories)

If neither of these is true, then we don't need to stat() the children, and so os.listdir() is actually a little faster, I think. But I will test this on a few machines to be sure!

edit: to further illustrate what I mean, here's where os.listdir() is used:

	non-recursive part	recursive part
non-terminal part	`os.scandir()`	`os.scandir()`
terminal part	`os.listdir()` <--	`os.scandir()`
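A minimal sketch of the choice in the table above, assuming a hypothetical helper `child_names` that is not part of `Lib/glob.py`: entry types are only inspected when the caller needs directories, otherwise plain names are enough.

```python
import os
import tempfile


def child_names(path, dirs_only):
    """List children of path, checking entry types only when required."""
    if dirs_only:
        # Non-terminal wildcard: only directories can be descended into.
        with os.scandir(path) as entries:
            return sorted(entry.name for entry in entries if entry.is_dir())
    # Terminal non-recursive wildcard: names suffice, no type check needed.
    return sorted(os.listdir(path))


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "pkg"))
    open(os.path.join(tmp, "setup.py"), "w").close()
    non_terminal = child_names(tmp, dirs_only=True)
    terminal = child_names(tmp, dirs_only=False)
```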

barneygale · 2024-03-10T18:49:54Z

> The new implementation should be benchmarked with different test cases: deep and wide trees, and trees dominated by files or by directories.

I've been looking into this! The randomfiletree project is helpful: it can repeatedly walk a tree and create child files/folders according to a Gaussian distribution, which seems to me like a good approximation of an average "shallow and wide" filesystem structure, and it can be tweaked towards more files or more folders.

It's difficult to produce "deep and narrow" trees this way, as the file/folder probability would need to change with the depth (I think?). I've been considering writing a tree generator that works this way, e.g.:

- At depth==0, generate 100 subdirectories
- At 0 < depth < 50, generate 1 subdirectory
- At depth==50, generate 100 files

... but is that overly arbitrary? Is there a better way? Or do I just need to come up with a bunch of test cases along those lines?
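For concreteness, the "deep and narrow" generator outlined above could look roughly like the following. The function name `make_deep_tree`, its parameters, and the directory and file names are all hypothetical; the defaults mirror the numbers in the outline, and the demo uses much smaller values:

```python
import os
import tempfile


def make_deep_tree(root, width=100, max_depth=50, n_files=100):
    """Create `width` chains of single-child directories, each `max_depth`
    levels deep, with `n_files` empty files in the deepest directory."""
    for i in range(width):
        # One directory at depth 1, then one child per level down to max_depth.
        parts = [f"dir{i}"] + ["sub"] * (max_depth - 1)
        leaf = os.path.join(root, *parts)
        os.makedirs(leaf)
        for j in range(n_files):
            open(os.path.join(leaf, f"file{j}.txt"), "w").close()


with tempfile.TemporaryDirectory() as tmp:
    make_deep_tree(tmp, width=3, max_depth=4, n_files=2)
    total_dirs = sum(len(dirs) for _, dirs, _ in os.walk(tmp))
    total_files = sum(len(files) for _, _, files in os.walk(tmp))
```

The small demo builds 3 chains of 4 directories with 2 files at the bottom of each, so it creates 12 directories and 6 files.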

barneygale · 2024-03-15T01:18:37Z

A test of 100 nested directories named "deep" from my Linux machine:

pattern	speedup
`deep/**`	3.86x
`deep/**/`	4.03x
`deep/**/*`	4.92x
`deep/**/*/`	4.93x

barneygale · 2024-05-14T18:18:05Z

Hey @gpshead and @serhiy-storchaka - would you like to review this? If not, I might make an appeal on discuss.python.org in a couple weeks' time. Thanks for the feedback so far, and no worries at all if you'd rather focus your efforts elsewhere!

barneygale · 2024-05-31T20:12:33Z

Just spotted an issue: previously iglob() returned a generator with a close() method that would os.close() any open file descriptors. I'll get this re-instated.

picnixz · 2024-08-28T12:36:23Z

Lib/glob.py

+        for path in paths:
+            if path:
+                yield path
+            break


Suggested change

-        for path in paths:
-            if path:
-                yield path
-            break
+        path = next(paths, None)
+        if path:
+            yield path

Not sure whether it's faster to use next or if instantiating the loop would be faster.

picnixz · 2024-08-28T12:39:26Z

Lib/glob.py

+    for path in select(dirname, dir_fd, dirname):
+        yield slicer(path)


Suggested change

-    for path in select(dirname, dir_fd, dirname):
-        yield slicer(path)
+    yield from map(slicer, select(dirname, dir_fd, dirname))

You just want a generator I think (to be as lazy as possible)

picnixz · 2024-08-28T12:49:59Z

Lib/glob.py

-        return dirname or basename
-    return os.path.join(dirname, basename)
+def _relative_glob(select, dirname, dir_fd=None):
+    """Globs using a *select* function from the given dirname. The dirname


Maybe explain how dir_fd would be used if specified.

picnixz · 2024-08-28T12:51:18Z

Lib/glob.py

+    prefix is removed from results.
+    """
+    dirname = _StringGlobber.add_slash(dirname)
+    slicer = operator.itemgetter(slice(len(dirname), None))


Is it more efficient to use this or a plain yield path[dirname_length:]? (with this approach, we can't use a map anymore).
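For reference, the two spellings being compared produce the same results; which one is faster would need measuring. A small sketch with made-up paths (the variable names and sample values are illustrative only):

```python
import operator

prefix = "root/"
paths = ["root/a.py", "root/pkg/b.py"]

# Callable that slices off the prefix, usable with map().
slicer = operator.itemgetter(slice(len(prefix), None))
via_itemgetter = list(map(slicer, paths))

# Plain slicing inside a loop or comprehension.
via_slicing = [path[len(prefix):] for path in paths]
```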

picnixz · 2024-08-28T12:55:18Z

Lib/glob.py

+        for path in _iglob(pathname, root_dir, dir_fd, recursive, include_hidden):
+            yield os.fsencode(path)


Suggested change

-        for path in _iglob(pathname, root_dir, dir_fd, recursive, include_hidden):
-            yield os.fsencode(path)
+        paths = _iglob(pathname, root_dir, dir_fd, recursive, include_hidden)
+        yield from map(os.fsencode, paths)

barneygale · 2024-08-28T16:36:43Z

@picnixz to address some of your comments on using map() rather than looping and yield: I did this so that calling close() on the iglob(dir_fd=blah) generator causes os.close() to be called on all open file descriptors, which seems to work with a stack of for loops but not map(). But I didn't add a test case - I'll do that now :)
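A small sketch of the for-loop behaviour mentioned above, using a fake resource rather than real file descriptors. The names `scan`, `outer` and `events` are hypothetical, and the prompt cleanup shown here relies on CPython's reference counting; other implementations may finalize the inner generator later:

```python
events = []


def scan(name):
    """Stand-in for a selector that holds a resource while it is iterated."""
    events.append(f"open {name}")
    try:
        yield f"{name}/a"
        yield f"{name}/b"
    finally:
        # Stand-in for os.close() on a file descriptor.
        events.append(f"close {name}")


def outer(name):
    """Wrap scan() with a plain for loop, as the loops in this PR do."""
    for path in scan(name):
        yield path


gen = outer("root")
first = next(gen)
# Closing the outer generator unwinds its loop and drops the only reference
# to the inner generator, whose finally block then runs (on CPython).
gen.close()
```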

picnixz · 2024-08-28T16:44:23Z

That's... an interesting functionality I wasn't aware of :) If someone could explain to me the reason I'd be happy. Anyway, let's keep your loops.

pythonGH-116380: Make glob.glob() twice as fast

db3c620

barneygale added the performance Performance or resource usage label Mar 5, 2024

bedevere-app bot added the awaiting core review label Mar 5, 2024

bedevere-app bot mentioned this pull request Mar 5, 2024

Speed up glob.glob() by reducing number of system calls made #116380

Open

Use os.listdir() if we don't need to check entry type.

9e1f059

barneygale added 8 commits March 6, 2024 01:21

A few small speedups.

10432df

Simplify prefix removal

7e389e2

Re-implement glob0(), glob1(), and has_magic().

8680a0a

Fix errant StopIteration.

3bf3124

Skip compiling pattern for consecutive ** segments.

f8fb992

Clarify regex/path building in literal and recursive selectors.

50ef080

Simplify code to ignore root_dir.

ccefacd

Fix possible Windows separator issue.

fa951f6

gpshead reviewed Mar 6, 2024

Privat33r-dev reviewed Mar 6, 2024

Privat33r-dev reviewed Mar 6, 2024

serhiy-storchaka self-requested a review March 6, 2024 17:12

barneygale added 5 commits March 6, 2024 21:20

Address some review feedback.

0aec12c

Use assignment expressions in a couple of places

72691ba

Replace lambda with operator.not_.

c58dd21

Merge branch 'main' into pythongh-116380

c361ec9

Speed up _add_trailing_slash()

22b30db

barneygale commented Mar 7, 2024

barneygale added 2 commits March 7, 2024 02:10

Speed up select_literal()

83b70bd

Speed up select_recursive()

1d32d14

serhiy-storchaka reviewed Mar 7, 2024

barneygale added 2 commits May 8, 2024 18:00

Merge branch 'main' into pythongh-116380

54efa7c

Update whatsnew

cf11922

barneygale changed the title ~~GH-116380: Make glob.glob() twice as fast~~ GH-116380: Speed up glob.glob() by removing some system calls May 10, 2024

Merge branch 'main' into pythongh-116380

6710924

barneygale mentioned this pull request May 28, 2024

Change shutil.rmtree and os.walk to support very deep hierarchies #89727

Open

Merge branch 'main' into pythongh-116380

a547cd2

barneygale added 8 commits May 31, 2024 22:15

Close file descriptors when recursive_selector is finalized.

14ae438

Make iglob() a generator.

69d7a86

Make _iglob() a generator.

3b84a1d

Make _relative_glob() a generator.

f9f9a8d

Simplify skipping empty string

24a9ee4

Merge branch 'main' into pythongh-116380

d05d58d

Merge branch 'main' into pythongh-116380

27c463e

Make _GlobberBase fully abstract.

a94f2a7

eryksun reviewed Jun 7, 2024

picnixz reviewed Jun 9, 2024

barneygale added 5 commits June 9, 2024 21:12

Address review feedback

d19bb89

Typo fix

1677588

Speed up pattern parsing.

539f044

Add test for globbing above recursion limit.

70a1b42

Merge branch 'main' into pythongh-116380

1560712

picnixz self-requested a review August 28, 2024 12:29

picnixz reviewed Aug 28, 2024

barneygale and others added 3 commits September 1, 2024 15:52

Apply suggestions from code review

099e86e

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

Test that iglob().close() closes file descriptors.

ee76faf

Address some review feedback

4cf8a4d
