send HTTP caching headers for index pages to further reduce bandwidth usage #12257

cosmicexplorer · 2023-09-03T06:45:09Z

This PR is on top of #12256, see the +316/-36 diff against it at https://github.com/cosmicexplorer/pip/compare/link-metadata-cache...cosmicexplorer:pip:link-parsing-cache?expand=1.

Background: Learning More about HTTP Requests

After taking up a suggestion from @dholth in #12208 to consider handling a 304 Not Modified response in HTTP requests, I began to consider whether we could make use of HTTP caching headers to further reduce the time we wait for the network, without reintroducing the delays to see new package uploads described in #5670.

Proposal: Send HTTP Caching Headers and Record `ETag`

This change records the ETag and Date headers from the HTTP response, then sets the If-None-Match and If-Modified-Since headers on future requests against project pages (e.g. https://pypi.org/simple/tensorflow). This allows the server to respond with a zero-length 304 Not Modified instead of a several-hundred KB HTML page:

> curl -L -v -H 'Accept: application/vnd.pypi.simple.v1+json' 'https://pypi.org/simple/tensorflow' | wc -c
# ...
< HTTP/2 200
# ...
< etag: "d/Qc2Yfc8cAEOVWV4ydm0g"
# ...
630017
# now execute the same request again, but providing the ETag
> curl -L -v -H 'Cache-Control: max-age=0' -H 'If-None-Match: "d/Qc2Yfc8cAEOVWV4ydm0g"' -H 'Accept: application/vnd.pypi.simple.v1+json' 'https://pypi.org/simple/tensorflow' | wc -c
# ...
< HTTP/2 304
# ...
0

Result: Slight Performance Improvement

Recording these HTTP headers adds only ~3KB of disk space after a large resolve:

> du -b -s ~/.cache/pip/fetch-resolve/
2842    /home/cosmicexplorer/.cache/pip/fetch-resolve/
> find ~/.cache/pip/fetch-resolve/ -name checksum | parallel -n1 base64
...
WnHabu8DLkGTyGjm3HE+ymvlr9ulcho0ZU+HzyTaM7Q=
> find ~/.cache/pip/fetch-resolve/ -name etag | parallel -n1 "cat {}; echo"
...
Ee725MpdP+o4nSsZzrCWHw
> FILE="$(find ~/.cache/pip/fetch-resolve/ -name modified-since-date | head -n1)" python
Python 3.10.13 (main, Jan 10 2024, 22:22:23) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> with open(os.environ['FILE'], 'rb') as f:
...   i = int.from_bytes(f.read(), byteorder='big', signed=False)
... 
>>> import datetime
>>> datetime.datetime.fromtimestamp(i, tz=datetime.timezone.utc)
datetime.datetime(2024, 1, 11, 12, 10, 17, tzinfo=datetime.timezone.utc)

It has only a very slight (3.4%) performance benefit on top of #12256, converting a 6.1 second resolve to 5.9 seconds:

# this is #12256
> git checkout link-metadata-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m6.142s
user    0m4.414s
sys     0m0.174s
> git checkout link-parsing-cache # apply the change in this PR
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m5.940s
user    0m4.349s
sys     0m0.181s

But more importantly, as described above, it avoids making multiple ~600KB requests against pypi on each resolve.

TODO

Merge cache metadata lookups for sdists and lazy wheels #12256.

cosmicexplorer · 2023-09-03T07:23:12Z

@ewdurbin @uranusjr (not sure who to tag): this change adds If-None-Match and If-Modified-Since headers to requests for pypi project index pages (e.g. curl -L -H 'Accept: application/vnd.pypi.simple.v1+json' 'https://pypi.org/simple/tensorflow') to get them to return empty 304 Not Modified responses when possible. While the changes in this PR produce only a modest improvement in pip runtime, I would like to know whether this is likely to be an appreciable bandwidth reduction in pip requests against pypi.

cosmicexplorer · 2023-09-03T07:24:59Z

I'm actually going to take out the interpreter compatibility caching, since that part adds significant complexity without any effect on network bandwidth usage.

ewdurbin · 2023-09-03T11:34:25Z

@ewdurbin @uranusjr (not sure who to tag): this change adds If-None-Match and If-Modified-Since headers to requests for pypi project index pages (e.g. curl -L -H 'Accept: application/vnd.pypi.simple.v1+json' 'https://pypi.org/simple/tensorflow') to get them to return empty 304 Not Modified responses when possible. While the changes in this PR produce only a modest improvement in pip runtime, I would like to know whether this is likely to be an appreciable bandwidth reduction in pip requests against pypi.

Primary bandwidth concern for PyPI is files.pythonhosted.org, by about 15 times. But everything is marginal so this would have a consequential impact over time. Nice stuff!

uranusjr · 2024-04-03T02:01:17Z

This looks reasonable to me, assuming we accept the dependant PRs. (Still some discussion needed on #12186, it seems.)

When performing `install --dry-run` and PEP 658 .metadata files are available to guide the resolve, do not download the associated wheels. Rather use the distribution information directly from the .metadata files when reporting the results on the CLI and in the --report file. - describe the new --dry-run behavior - finalize linked requirements immediately after resolve - introduce is_concrete - funnel InstalledDistribution through _get_prepared_distribution() too - add test for new install --dry-run functionality (no downloading)

- catch an exception when parsing metadata which only occurs in CI - handle --no-cache-dir - call os.makedirs() before writing to cache too - catch InvalidSchema when attempting git urls with BatchDownloader - fix other test failures - reuse should_cache(req) logic - gzip compress link metadata for a slight reduction in disk space - only cache built sdists - don't check should_cache() when fetching - cache lazy wheel dists - add news - turn debug logs in fetching from cache into exceptions - use scandir over listdir when searching normal wheel cache - handle metadata email parsing errors - correctly handle mutable cached requirement - use bz2 over gzip for an extremely slight improvement in disk usage

- pipe in headers arg - provide full context in Link.comes_from - pull in etag and date and cache the outputs - handle --no-cache-dir - add NEWS - remove quotes from etag and use binary checksum to save a few bytes - parse http modified date to compress the cached representation

psf-chronographer bot added the bot:chronographer:provided label Sep 3, 2023

cosmicexplorer mentioned this pull request Sep 3, 2023

memoizing dependency lookups with a local resolve index #12184

Closed

cosmicexplorer force-pushed the link-parsing-cache branch from 011f02e to b658989 Compare September 3, 2023 07:15

cosmicexplorer force-pushed the link-parsing-cache branch 2 times, most recently from 4e361c4 to 2c0405b Compare September 3, 2023 09:38

cosmicexplorer mentioned this pull request Sep 3, 2023

cache metadata lookups for sdists and lazy wheels #12256

Draft

1 task

cosmicexplorer force-pushed the link-parsing-cache branch 5 times, most recently from 5830ea0 to 28d65ce Compare September 3, 2023 11:30

cosmicexplorer force-pushed the link-parsing-cache branch 2 times, most recently from 2ecc6c4 to 3987be0 Compare September 3, 2023 11:50

cosmicexplorer changed the title ~~make FetchResolveCache to cache Link parsing and interpreter compatibility checking~~ send HTTP caching headers for index pages to reduce bandwidth Sep 3, 2023

cosmicexplorer changed the title ~~send HTTP caching headers for index pages to reduce bandwidth~~ send HTTP caching headers for index pages to further reduce bandwidth usage Sep 3, 2023

cosmicexplorer marked this pull request as ready for review September 3, 2023 11:58

cosmicexplorer mentioned this pull request Sep 3, 2023

persistent cache for link parsing and interpreter compatibility #12258

Draft

cosmicexplorer force-pushed the link-parsing-cache branch 5 times, most recently from d935efa to 5cc8a36 Compare September 3, 2023 23:43

edmorley mentioned this pull request Sep 6, 2023

Populate find_all_candidates cache from threadpool #10480

Open

pradyunsg removed the bot:chronographer:provided label Dec 20, 2023

psf-chronographer bot added the bot:chronographer:provided label Dec 20, 2023

cosmicexplorer mentioned this pull request Jan 11, 2024

perform 1-3 HTTP requests for each wheel using fast-deps #12208

Open

cosmicexplorer force-pushed the link-parsing-cache branch from 5cc8a36 to 03490f9 Compare January 11, 2024 05:34

cosmicexplorer force-pushed the link-parsing-cache branch 6 times, most recently from 6dd7837 to 5c40e19 Compare January 17, 2024 04:25

edmorley mentioned this pull request Jan 22, 2024

pull preparer logic out of the resolver to consume metadata-only dists in commands #12186

Draft

ddelange mentioned this pull request Mar 5, 2024

Expose the METADATA file of wheels in the simple API pypi/warehouse#8254

Closed

cosmicexplorer added 5 commits August 13, 2024 05:34

cache "concrete" dists by Distribution instead of InstallRequirement

ad93d54

refactor much of RequirementPreparer and add lengthy docs

52133ba

cosmicexplorer force-pushed the link-parsing-cache branch from 5c40e19 to f28ecfd Compare August 13, 2024 10:45

cosmicexplorer mentioned this pull request Aug 17, 2024

metadata resolve workstream #12921

Open

10 tasks

cosmicexplorer marked this pull request as draft August 20, 2024 16:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

send HTTP caching headers for index pages to further reduce bandwidth usage #12257

send HTTP caching headers for index pages to further reduce bandwidth usage #12257

cosmicexplorer commented Sep 3, 2023 •

edited

Loading

cosmicexplorer commented Sep 3, 2023

cosmicexplorer commented Sep 3, 2023

ewdurbin commented Sep 3, 2023

uranusjr commented Apr 3, 2024

send HTTP caching headers for index pages to further reduce bandwidth usage #12257

Are you sure you want to change the base?

send HTTP caching headers for index pages to further reduce bandwidth usage #12257

Conversation

cosmicexplorer commented Sep 3, 2023 • edited Loading

Background: Learning More about HTTP Requests

Proposal: Send HTTP Caching Headers and Record ETag

Result: Slight Performance Improvement

TODO

cosmicexplorer commented Sep 3, 2023

cosmicexplorer commented Sep 3, 2023

ewdurbin commented Sep 3, 2023

uranusjr commented Apr 3, 2024

cosmicexplorer commented Sep 3, 2023 •

edited

Loading

Proposal: Send HTTP Caching Headers and Record `ETag`