Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

send HTTP caching headers for index pages to further reduce bandwidth usage #12257

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Sep 3, 2023

This PR is on top of #12256, see the +316/-36 diff against it at https://github.com/cosmicexplorer/pip/compare/link-metadata-cache...cosmicexplorer:pip:link-parsing-cache?expand=1.

Background: Learning More about HTTP Requests

After taking up a suggestion from @dholth in #12208 to consider handling a 304 Not Modified response in HTTP requests, I began to consider whether we could make use of HTTP caching headers to further reduce the time we wait for the network, without reintroducing the delays to see new package uploads described in #5670.

Proposal: Send HTTP Caching Headers and Record ETag

This change records the ETag and Date headers from the HTTP response, then sets the If-None-Match and If-Modified-Since headers on future requests against project pages (e.g. https://pypi.org/simple/tensorflow). This allows the server to respond with a zero-length 304 Not Modified instead of a several-hundred KB HTML page:

> curl -L -v -H 'Accept: application/vnd.pypi.simple.v1+json' 'https://pypi.org/simple/tensorflow' | wc -c
# ...
< HTTP/2 200
# ...
< etag: "d/Qc2Yfc8cAEOVWV4ydm0g"
# ...
630017
# now execute the same request again, but providing the ETag
> curl -L -v -H 'Cache-Control: max-age=0' -H 'If-None-Match: "d/Qc2Yfc8cAEOVWV4ydm0g"' -H 'Accept: application/vnd.pypi.simple.v1+json' 'https://pypi.org/simple/tensorflow' | wc -c
# ...
< HTTP/2 304
# ...
0

Result: Slight Performance Improvement

Recording these HTTP headers adds only ~3KB of disk space after a large resolve:

> du -b -s ~/.cache/pip/fetch-resolve/
2842    /home/cosmicexplorer/.cache/pip/fetch-resolve/
> find ~/.cache/pip/fetch-resolve/ -name checksum | parallel -n1 base64
...
WnHabu8DLkGTyGjm3HE+ymvlr9ulcho0ZU+HzyTaM7Q=
> find ~/.cache/pip/fetch-resolve/ -name etag | parallel -n1 "cat {}; echo"
...
Ee725MpdP+o4nSsZzrCWHw
> FILE="$(find ~/.cache/pip/fetch-resolve/ -name modified-since-date | head -n1)" python
Python 3.10.13 (main, Jan 10 2024, 22:22:23) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> with open(os.environ['FILE'], 'rb') as f:
...   i = int.from_bytes(f.read(), byteorder='big', signed=False)
... 
>>> import datetime
>>> datetime.datetime.fromtimestamp(i, tz=datetime.timezone.utc)
datetime.datetime(2024, 1, 11, 12, 10, 17, tzinfo=datetime.timezone.utc)

It has only a very slight (3.4%) performance benefit on top of #12256, converting a 6.1 second resolve to 5.9 seconds:

# this is #12256
> git checkout link-metadata-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m6.142s
user    0m4.414s
sys     0m0.174s
> git checkout link-parsing-cache # apply the change in this PR
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m5.940s
user    0m4.349s
sys     0m0.181s

But more importantly, as described above, it avoids making multiple ~600KB requests against pypi on each resolve.

TODO

@cosmicexplorer
Copy link
Contributor Author

@ewdurbin @uranusjr (not sure who to tag): this change adds If-None-Match and If-Modified-Since headers to requests for pypi project index pages (e.g. curl -L -H 'Accept: application/vnd.pypi.simple.v1+json' 'https://pypi.org/simple/tensorflow') to get them to return empty 304 Not Modified responses when possible. While the changes in this PR produce only a modest improvement in pip runtime, I would like to know whether this is likely to be an appreciable bandwidth reduction in pip requests against pypi.

@cosmicexplorer
Copy link
Contributor Author

I'm actually going to take out the interpreter compatibility caching, since that part adds significant complexity without any effect on network bandwidth usage.

@ewdurbin
Copy link
Member

ewdurbin commented Sep 3, 2023

@ewdurbin @uranusjr (not sure who to tag): this change adds If-None-Match and If-Modified-Since headers to requests for pypi project index pages (e.g. curl -L -H 'Accept: application/vnd.pypi.simple.v1+json' 'https://pypi.org/simple/tensorflow') to get them to return empty 304 Not Modified responses when possible. While the changes in this PR produce only a modest improvement in pip runtime, I would like to know whether this is likely to be an appreciable bandwidth reduction in pip requests against pypi.

Primary bandwidth concern for PyPI is files.pythonhosted.org, by about 15 times. But everything is marginal so this would have a consequential impact over time. Nice stuff!

@cosmicexplorer cosmicexplorer force-pushed the link-parsing-cache branch 2 times, most recently from 2ecc6c4 to 3987be0 Compare September 3, 2023 11:50
@cosmicexplorer cosmicexplorer changed the title make FetchResolveCache to cache Link parsing and interpreter compatibility checking send HTTP caching headers for index pages to reduce bandwidth Sep 3, 2023
@cosmicexplorer cosmicexplorer changed the title send HTTP caching headers for index pages to reduce bandwidth send HTTP caching headers for index pages to further reduce bandwidth usage Sep 3, 2023
@cosmicexplorer cosmicexplorer marked this pull request as ready for review September 3, 2023 11:58
@cosmicexplorer cosmicexplorer force-pushed the link-parsing-cache branch 5 times, most recently from d935efa to 5cc8a36 Compare September 3, 2023 23:43
@uranusjr
Copy link
Member

uranusjr commented Apr 3, 2024

This looks reasonable to me, assuming we accept the dependant PRs. (Still some discussion needed on #12186, it seems.)

When performing `install --dry-run` and PEP 658 .metadata files are
available to guide the resolve, do not download the associated wheels.

Rather use the distribution information directly from the .metadata
files when reporting the results on the CLI and in the --report file.

- describe the new --dry-run behavior
- finalize linked requirements immediately after resolve
- introduce is_concrete
- funnel InstalledDistribution through _get_prepared_distribution() too
- add test for new install --dry-run functionality (no downloading)
- catch an exception when parsing metadata which only occurs in CI
- handle --no-cache-dir
- call os.makedirs() before writing to cache too
- catch InvalidSchema when attempting git urls with BatchDownloader
- fix other test failures
- reuse should_cache(req) logic
- gzip compress link metadata for a slight reduction in disk space
- only cache built sdists
- don't check should_cache() when fetching
- cache lazy wheel dists
- add news
- turn debug logs in fetching from cache into exceptions
- use scandir over listdir when searching normal wheel cache
- handle metadata email parsing errors
- correctly handle mutable cached requirement
- use bz2 over gzip for an extremely slight improvement in disk usage
- pipe in headers arg
- provide full context in Link.comes_from
- pull in etag and date and cache the outputs
- handle --no-cache-dir
- add NEWS
- remove quotes from etag and use binary checksum to save a few bytes
- parse http modified date to compress the cached representation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants