Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: cache downloaded wheel information #2276

Merged
merged 2 commits into from
Jan 2, 2022

Conversation

mayeut
Copy link
Member

@mayeut mayeut commented Jan 1, 2022

Add downloaded wheel information in the relevant JSON embed file to prevent additional downloads of the same wheel.

This requires to bump the JSON embed file version once again as I was lacking some tests cases in #2273.

I will add some inline comments for some choices that I made.

Fixes #2268

Thanks for contributing, make sure you address all the checklists (for details on how see

development documentation)!

  • ran the linter to address style issues (tox -e fix_lint)
  • wrote descriptive pull request text
  • ensured there are test(s) validating the fix
  • added news fragment in docs/changelog folder
  • updated/extended the documentation

Comment on lines 85 to 97
def add_wheel_to_update_log(wheel, for_py_version, app_data):
embed_update_log = app_data.embed_update_log(wheel.distribution, for_py_version)
logging.debug("adding %s information to %s", wheel.name, embed_update_log.file)
u_log = UpdateLog.from_dict(embed_update_log.read())
if any(version.filename == wheel.name for version in u_log.versions):
logging.warning("%s already present in %s", wheel.name, embed_update_log.file)
return
version = NewVersion(wheel.name, datetime.now(), release_date_for_wheel_path(wheel.path), "download")
u_log.versions.append(version) # always write at the end for proper updates
embed_update_log.write(u_log.to_dict())


Copy link
Member Author

@mayeut mayeut Jan 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been added to periodic_update.py as it is the only place dealing with the embed_update_log (which is not used only for updates anymore).

@@ -250,16 +265,20 @@ def _run_do_update(app_data, distribution, embed_filename, for_py_version, perio
now = datetime.now()
if periodic:
source = "periodic"
# mark everything not updated manually as source "periodic"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the modifications in _run_do_update are what lead me to bump the JSON embed format version.

Comment on lines 321 to 322
# we need to sort versions for the next update to work properly
u_log.versions.sort(reverse=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log needs to be sorted. It was sorted by design before. That's not the case anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why is no longer the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would remove the need to sort the list.

I don't mind sorting the log; why would you consider the other solution less risky?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why is no longer the case?

because other sources logs are always appended rather than inserted.

I don't mind sorting the log; why would you consider the other solution less risky?

I'm feeling a bit hesitant to rely on the tuple ordering given:

    def as_version_tuple(version):
        result = []
        for part in version.split(".")[0:3]:
            try:
                result.append(int(part))
            except ValueError:
                break
        if not result:
            raise ValueError(version)
        return tuple(result)

I feel that the inner try/catch is hiding something I did not think about.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not think of pre-release versions... In this specific sample, it does not matter much but it's not sorted correctly.

{
  "completed": "2022-01-01T21:48:50.557509Z",
  "periodic": false,
  "started": "2022-01-01T21:48:40.230835Z",
  "versions": [
    {
      "filename": "pip-21.3.1-py3-none-any.whl",
      "found_date": "2022-01-01T14:22:37.717326Z",
      "release_date": "2021-10-22T15:57:09.000000Z",
      "source": "manual"
    },
    {
      "filename": "pip-20.3-py2.py3-none-any.whl",
      "found_date": "2022-01-01T21:46:25.286240Z",
      "release_date": "2020-11-30T12:47:49.000000Z",
      "source": "manual"
    },
    {
      "filename": "pip-20.0-py2.py3-none-any.whl",
      "found_date": "2022-01-01T21:48:23.634459Z",
      "release_date": "2020-01-21T11:42:56.000000Z",
      "source": "manual"
    },
    {
      "filename": "pip-20.3b1-py2.py3-none-any.whl",
      "found_date": "2022-01-01T21:44:30.798536Z",
      "release_date": "2020-10-31T19:25:12.000000Z",
      "source": "manual"
    }
  ]
}

Further-more, a manual/periodic update should never lead to defaulting to pre-release version. This promise is broken (at least for manual updates).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given an added test with a pre-release version downloaded as "pinned" version indeed broke manual updates, I went with the alternate implementation of _run_do_update which first splits the UpdateLog in 2 depending if it's coming from a previous update or another source.

@mayeut
Copy link
Member Author

mayeut commented Jan 1, 2022

Edit: this is used by the latest commit

There's another way to do the update which might be a less bit risky for periodic updates:

  • Step A: filter the versions from the UpdateLog to only have manual/periodic sources on one hand (back to the previous problem) and other sources on the other hand.
  • Step B: run the update as was done before
  • Step C: filter-out newly found versions from the "other sources" list
  • Step D: concatenate the two versions list (the first half being from version update & the 2nd half from other sources)

This would remove the need to sort the list.

@mayeut
Copy link
Member Author

mayeut commented Jan 1, 2022

Caveat:
There is a time penalty for first downloads compared to the previous version in order to fill the release_date.
Now that I'm writing this, I now realize it might not be necessary to fill this one which is only ever used by NewVersion.use when the source is periodic.

Edit: this is no longer the case with the latest push.

@mayeut mayeut marked this pull request as draft January 1, 2022 20:29
@mayeut
Copy link
Member Author

mayeut commented Jan 1, 2022

@gaborbernat,

Writing things down makes me realize that there are still some things left to do so converting to draft.
Do you have any remarks to share about the comments I left ?

PS: while thinking about the time penalty I mentioned, I thought about caching PyPI requests on disk as well and use the etag header to know if actually pulling the data (setuptools JSON is about 1MB) was necessary. I deemed this was too heavy to implement for my simple use-case and not worth spending more time on this. I thought of another use-case which might benefit from this: users configuring --download as the default setting (checking if there was an update on PyPI should be much faster than starting a pip download. I guess asking PyPI this way is the first thing done by pip)

Add downloaded wheel information in the relevant JSON embed file to
prevent additional downloads of the same wheel.
This requires to bump the JSON embed file version.
@mayeut mayeut marked this pull request as ready for review January 2, 2022 00:45
@mayeut
Copy link
Member Author

mayeut commented Jan 2, 2022

I rewrote the _run_do_update function as mentioned in #2276 (comment) and removed the time penalty I mentioned in #2276 (comment).

assert read_dict.call_count == 1
assert write.call_count == 1


def test_download_manual_stop_at_first_usable(tmp_path, mocker, freezer):
def test_download_periodic_stop_at_first_usable2(tmp_path, mocker, freezer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the 2 suffix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's already a test_download_periodic_stop_at_first_usable test and I didn't see how to do this as a parametrize test easily so adding a suffix 2 on the second one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of magic numbers, I generally add to the test name what differentiates them 🤔 for future reference.

Signed-off-by: Bernát Gábor <gaborjbernat@gmail.com>
@gaborbernat gaborbernat merged commit 319a540 into pypa:main Jan 2, 2022
@mayeut mayeut deleted the download-cache branch January 2, 2022 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not attempt to download previously downloaded wheels
2 participants