Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft Release support in PyPI #8941

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

Conversation

alanbato
Copy link
Contributor

This PR aims to implement the functionality required to support draft releases in PyPI.

Drafts releases are intended to enable package maintainers to better control their release processes and pipelines, and giving an extra step for testing things out before a new release is available to the public. This is separate from pre-releases in the sense that pre-releases are for users to use and test, whereas draft releases are invisible to them, and are only inteded to be used by maintainers.

Previous discussions as to why a feature like this could be useful, can be found in #726 and in this thread.

Implementation details and Rationale

This implementations key change is the addition of a published attribute in the Release model, all the new functionality operates on this value. The published timestamp is set when a release is published, and its null value signals that the release is created but not published, so therefore it's a draft release. In other words:

  • Using the current workflow, a release is created with both created and published set to the same timestamp. Everythig is as usual.
  • A release is created, but not published, when the build client's POST request that creates it has a Is-Draft header, and it is set to True.
  • An existing draft release can then be published through the PyPI web UI, which simply sets published to the current timestamp.

Aditionally, a Release now has a draft_hash calculated value, which will be explained later.

The differentiation between draft releases and published releases is done in two separate places: pypi.org and the Simple Index.

Draft releases in PyPI.org

Regarding the PyPI website, the main focus was to 1) filter out draft releases from showing in a project's page for regular users and to 2) add additional UI elements for maintainers to manage draft releases.

Part 1) was done by modifying the traversal logic of a project's release when looking a specific version in the following way:

  • If the release version in the path includes a draft hash, compare it against that release's draft hash. If they're the same, show the page for the draft version.
  • If the release version in the path does not include a draft hash, only match against published versions.

For example, if my-package's version 0.23.0 is a draft,

  • Accessing https://pypi.org/project/my-package/release/0.23.0 would not resolve (404 page)
  • But https://pypi.org/project/my-package/release/0.23.0--draft--<some-hash>/ would talke you to that release's draft preview, given that some-hash is that release's computed draft_hash. If the draft_hash is incorrect, it would not resolve.

As for Part 2), please see the Visual Demo with the annotated screenshots.

Draft releases and the Simple Index

In order to support the upload, download, and installation of the draft versions of packages for the project's release workflows, while still making those drafts invisible to regular users, we would've needed to make significant changes to the Simple Index API.
Since this was a non-goal for the project, an alternative solution was used, in the form of a newly created Draft Index.

The Draft Index implements the same API as the Simple Index, with the key difference that it's isolated for every single draft release. This means that when you create a draft release for a project, PyPI: creates a separate index listing only that project, with only that version available for it.

In order to be able to install the draft version of your package, you need to pass down this index URL as an extra index to pip, by means of the --extra-index-url flag. This allows pip to fetch the wheels & sdist for your package's draft version from this special index, while fetching all its dependencies leveraging the Simple Index and its current infrastructure.

Similar to what we did with PyPI, we use the draft_hash value of a release to construct its draft index URL, in the format of:
https://pypi.org/simple/draft/<draft_hash>/
The full pip command using this URL is shown on the release's page on PyPI for easy access

In order to make this draft_hash deterministic for the maintainer's automation needs, the current implementation computes an md5 hash on the concatenation of the project's name and version.

On the other side of the equation, in order to create and upload a draft release, build clients need to pass down the Is-Draft special header, and could let the package maintainer do so by accepting a --draft flag when uploading such a release.
I made an example implementation of how this could work with Twine, and you can try it out yourself.

Other Details

  • When uploading files for a draft release, files with the same name get overwritten, and new files are added to the release. This makes it possible to make changes to the project without bumping the version.
  • Along the same lines, everytime a draft release is updated, it's metadata is refreshed to reflect the new changes.
  • Since you can now create a release without making it available to the users, you are now able to do things like upload platform-specific wheels and publish it once the release is ready for general use.
  • There were minor refactors made to the code in order to reduce the overall complexity of some parts of the logic with the addition of draft-version handling.
  • The database migration needed to add the columns also specify how to backfill the values in order to make the deployment of these changes easier.

Request for Feedback

All feedback is welcomed and appreciated, and trying out how everything works by pulling in these changes and spinning up a local server is encouraged to get a feel of how this all work. I'll try to answer as many things as I can.

Aditionally, I'm specially looking for feedback on these things:

  • The draft hash value and how it's calculated. Some people think it shouldn't be guessable by regular users, while others want it to be determinable by maintainers in order to automate their workflows. How can we achieve both? Should it be calculated, or should it be stored?
  • The draft index. There's a concern that having isolated draft indexes hosted on PyPI might lead to the creation of "private indexes", which we don't want. How can we make it so these indexes are temporal and ephemeral?
  • How draft releases are created. We settled on specifying a special header so the upload clients can signal PyPI that they inted to create a draft release. I'd like to know if this is reasonable from the upload client's side.
  • What changes are needed in order for this PR to be "mergeable", I know the code and database migrations need to be merged in a certain way/order in order to phase out the change, and could use some advise on how to do it.

Finally, here's the:

Visual Demo

This is a simple project I created to showcase how the new features and workflow look like in PyPI.

image
When looking at the project homepage, there's nothing different than what we have now.

image
The release history only shows one version.

image
If we go to the Simple Index for this project, we can see there is only a wheel and a sdist for that one version.

image
But if we are logged in as a project maintainer and go to the Manage Project section, we see there's a draft release listed!
The release table also has a new column, that list the "published at" date alongside the "created at" date, which may now be different.

image
Draft releases have a new option in the Option menu, Publish.

image
If we choose the Manage option on this draft release, we'll see that there's a "Publish release" section. This has the same functionality as the option in the previous menu.

image
If you click any of these two options, a popup appears to confirm the action. And if you do...

image
The release is published! And will now appear in the release history, the project's homepage, and the simple index/api.

If we wanted to create a new draft release, we can do so by passing the --draft option to twine. ( implementation)
twine upload -r localpypi --draft ../simple-proyecto/dist/proyecto-0.0.3*

image
Navigating back to the release detail page, we can click the "view release" link to get to the preview page.

image
Looking at the preview, we can see some notable differences. The first one being the special pip command that will intall this draft version of the package by leveraging --extra-index-url and a special hash that will prevent regular others from installing this version without access to this information. The second difference is the "draft" badge, which you can click to return to the project home page.

image
image
The index url that we pass to pip is an actual simple-index compatible page that list only this project, and only these draft versions.

image
And now we can install said draft release! (this is an old screenshot where the draft release is version 0.0.2)

That's all folks! 🎶

@alanbato
Copy link
Contributor Author

Ah, the tests are failing because of the db migration, I'm not super familiar with alembic so I'd appreciate a pointer as to how to fix it :)

@ewjoachim
Copy link
Contributor

ewjoachim commented Dec 19, 2020

Congratulations for a non-trivial challenge of a PR :)

The draft hash value and how it's calculated. Some people think it shouldn't be guessable by regular users, while others want it to be determinable by maintainers in order to automate their workflows. How can we achieve both? Should it be calculated, or should it be stored?

Can we send it at upload time ? You upload a draft release, receive the draft hash in some way, and then you can automate things based on that ?

Copy link
Contributor

@ewjoachim ewjoachim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review round one: done!

I didn't read all the tests yet, because I imagine they will be changing quite a bit before they're ready. Also, I saw surprisingly little tests to make a 100% coverage, I'm curious if you managed to hit the 100% or if it's still missing some tests to write (CI will tell).

Congratulations on the PR. There are quite a few comments but it's really nice already 👏 . The 2 big things I'm looking forward to discuss are the DB columns and the --draft-- thing (cf my comments).

It's my first time doing such a big review on Warehouse, so I hope it was good (I'm completely open to feedback on the review if you think there are things I should improve :) )

warehouse/legacy/api/draft.py Show resolved Hide resolved
@@ -173,16 +174,21 @@ class Project(SitemapMixin, db.Model):
def __getitem__(self, version):
session = orm.object_session(self)
canonical_version = packaging.utils.canonicalize_version(version)
try:
canonical_version, something, draft_hash = canonical_version.split("--")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with the dummy variable be named something, but this might trigger flake8 saying it's unused. I believe flake8 will not complain if it's named _, which also is a classical way to indicate a dummy variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we supply 1.2.3--draft--somehash--yay, this code will ignore the hash, and treat the whole string as a version number, while if we limit split to 2 splits:

Suggested change
canonical_version, something, draft_hash = canonical_version.split("--")
canonical_version, something, draft_hash = canonical_version.split("--", 2)

then it will give us canonical_version == 1.2.3, draft_hash == somehash--yay which I believe is slightly more correct.

Note that in both case, we're likely to not find the designed release, but I'd imagine the second case to be less surprising.

And what about 1.2.3--foobar--expected_hash ? Is it expected that it would work ?

I'm thinking splitting by --draft-- would be more explicit and resolve a number of those questions at once.

... Thinking about it again, if I understand correctly, the only reason why we need to have the whole --draft-- thing is for the release url in the pypi UI, right ? In this case, why not have the url be /projects/foo/1.2.3/draft/hash/ ? I believe this could be implemented by implementing __getitem__ on release returning itself only if the hash is correct ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took your suggestion about splitting on --draft--, that way we standarize more on the format of the URLS.

Regarding the change on the URL scheme, if I understood pyramid traversal correctly, there'd be no way to stop the resolver from returning the page for /projects/foo/1.2.3 even when it's a draft release. We'd need to return the object in both cases because we'd need the Release in order continue traversing through /draft/hash.

The other way would be to use routes intead of traversal, but then we'd have separate logic for returning a release page (and managing pages) that we'd need to keep consistent.

Unless there's a strong benefit for changing it, I think we can keep it this way,

Comment on lines +133 to +141
<p>
{% trans project_name=project.name, version=release.version %}
After publishing this release, users will be able to use pip to install <code>{{ project_name }}=={{ version }}</code>.
{% endtrans %}
</p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth mentionning that this action cannot be undone ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be worth mentioning, but maybe also mention that once published, releases can be yanked? Which it's not really the same as unpublishing but has the side effects that one might expect for such action.

Copy link
Contributor

@uranusjr uranusjr May 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioning both the irreversable nature and the only recourse will be to yank sounds like a good idea to me.

warehouse/templates/packaging/detail.html Show resolved Hide resolved
warehouse/forklift/legacy.py Show resolved Hide resolved
warehouse/forklift/legacy.py Show resolved Hide resolved
warehouse/manage/views.py Show resolved Hide resolved
Copy link
Contributor Author

@alanbato alanbato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ewjoachim Thank you for taking the time to review this PR, I know it wasn't quick and easy!
I tried to answer each one of your questions as best as I could, I ceirtainly missed some things :)

There's still some decisions that are stil on the table, so I'd like for others to voice their opinions there.

EDIT: And you're right about the tests not being fully comprehensive, there's no 100% coverage yet :)

@@ -759,6 +759,34 @@ def _is_duplicate_file(db_session, filename, hashes):
return None


def _get_release_classifiers(db_session, classifiers_data):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that there's a lot of space for improvement 👍

I just moved the code out of the file_upload method to make its own function in order to reduce the complexity of file_upload, so I didn't want to change anything about the function in order to reduce the change surface of this PR.

Comment on lines +1174 to +1222
for rc in release_classifiers:
release._classifiers = release_classifiers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not really sure what I was trying to do here either.
I'll get rid of the loop.

)
for rc in release_classifiers:
release._classifiers = release_classifiers
new_dependencies = list(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key detail here is that apparently there's a difference between the metadata that wheels and sdists provide.
Since they're sequential file uploads, and I don't think there's a guarantee on precedence, we need to consider that we only want to overwrite the metadata when there's information. In other words, we might get a wheel with ONLY dependency information, and then a sdist with ONLY the description. These updates need to be handled differently because they can be cummulative.
Right now what happens is whatever file gets updloaded first sets the metadata of the release, and all subsequent uploads do not modify any fields (so the metadata of the release might be imcomplete at the end).

cc/ @di in case I'm misremembering the conversation we had about this particular issue.

@@ -1274,8 +1316,15 @@ def file_upload(request):
"The digest supplied does not match a digest calculated "
"from the uploaded file.",
)
# Skip duplicate check for files when it's a draft release
if not release.is_draft:
# Skip duplicate check for files when it's a draft release,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you're right, we'd want to skip the re-upload when the filename and hashes are the same :)

warehouse/forklift/legacy.py Show resolved Hide resolved
@@ -371,6 +383,7 @@ def __table_args__(cls): # noqa
created = Column(
DateTime(timezone=False), nullable=False, server_default=sql.func.now()
)
published = Column(DateTime(timezone=False), nullable=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Well, I was following the pattern of having created and updated timestamps in models, in this case for when it was published. Then it occurred to me that instead of setting a boolean value to know if a release was published or not, we could just look at the published column and check for null. So we get 2 use cases for the price of 1. Otherwise, we'd need another column (or table!) to keep track of what releases are drafts...

  2. This was another option that was considered, having the draft_hash be an attribute itself, or even a whole other model. However I think managing an attribute or another model unnecesary adds complexity to the architecture, and since I don't see most projects using this feature, I think using a column_property is better than having an almost-empty column. An important detail is that the idea is to only compute this value when we know that the release is a draft (by looking at the published column). However, I'm open to hear more in favor of having of this argument :)

  3. Yes, it's supposed to be part of the migration process, and it's automatically populated for all releases that are not explicitly a draft. So no worries there when deploying the new code.

  4. You're correct. However, from what I gathered from the documentation you can't use information from other tables when using column_property. If you know of a way to do this, I'd be happy to avoid the denormalization :)

warehouse/templates/manage/releases.html Show resolved Hide resolved
Comment on lines +133 to +141
<p>
{% trans project_name=project.name, version=release.version %}
After publishing this release, users will be able to use pip to install <code>{{ project_name }}=={{ version }}</code>.
{% endtrans %}
</p>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be worth mentioning, but maybe also mention that once published, releases can be yanked? Which it's not really the same as unpublishing but has the side effects that one might expect for such action.

warehouse/templates/packaging/detail.html Show resolved Hide resolved
Comment on lines +127 to +165
<a class="status-badge status-badge--warn" href="{{ request.route_path('packaging.project', name=release.project.name) }}">
<span>{% trans %}Draft version{% endtrans %}</span>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't release the version, it takes you to the default (latest) version of the project 😅 (similar to when you look at an old release)

But adding such a button would be nice 🤔 the only problem is that this page can be viewed by any user with the right link, and not only maintainers (that's why the publish button is under Manage project)

@brainwane
Copy link
Contributor

@alanbato if you rebase this so it doesn't conflict with main and then make a fresh request for feedback on discuss.python.org I bet you will get some!

@alanbato
Copy link
Contributor Author

Will do!

@brainwane
Copy link
Contributor

Alan, the only merge conflict right now is translations/messages - could you fix?

@brainwane
Copy link
Contributor

I requested further feedback.

@alanbato alanbato force-pushed the draft-releases branch 3 times, most recently from f827a40 to 380accc Compare April 13, 2021 23:13
@alanbato
Copy link
Contributor Author

Fixed the conflicts in translations (I think theres room for improvement in our process there)

Also took @ewjoachim's advice and provided a default value of NOW() for the published attribute. So in the stage where we have migrated the database but not the code yet, packages that are released during this period are not mistaken for unpublished packages after the new code is deployed.

Thanks @brainwane for requesting further feedback, I'd love for more people to take a look at this and see what else can be improved :)

@alanbato
Copy link
Contributor Author

Also, I think tests are failing because of the migration history is a bit wrong. I would appreciate assistance on that 😅

@di
Copy link
Member

di commented May 2, 2022

Sat down with @alanbato at the PyCon sprints and thought through a plan for what we need to do to get this mergeable

  • Programmatic way to turn a draft release into a published release: One thing this feature is missing right now is a way for a user to tell PyPI that a draft release should be published without needing to log into PyPI and do this via the UI itself. Users publishing multiple releases across multiple CI systems probably want some way to determine once all their draft releases have been made, and tell PyPI to publish. This doesn't need to be included in this PR but we should have a plan to work towards it before merging.
  • Ensure that twine is able to reconstitute and surface the URL where the draft release will live on PyPI, so users can use this for testing before publishing

@di
Copy link
Member

di commented May 2, 2022

One other thing: currently the make_draft_hash function is not taking into account the project's normalized name, because Project.normalized_name is a column property and not a proper column. We should eventually move towards making this use the normalized name for consistency.

@@ -344,9 +372,11 @@ def __table_args__(cls): # noqa
ForeignKey("projects.id", onupdate="CASCADE", ondelete="CASCADE"),
nullable=False,
)
project_name = Column(Text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to think about ways to avoid having to store the project name on the Release model, as this somewhat complicates our models.

This is currently necessary because the project name is otherwise unavailable to the column property to compute the draft_hash.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like draft_hash should not be a column property, and it should just be a string that is a derived value. You could have this triggered in the DB or in python.

If we change the mechanism by which the draft hash is computed, we probably don't want to change old hashes, so that their URLs remain stable, so that leads me to think that it should be derived at the point of release creation I think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or stepping back, what is the value of hashing these values at all? If the inputs are well known it feels like the hash is just there to make the URL more obscure in general, but I don't think they provide much value on their own. Is there any reason the draft release url doesn't just have the project name and version in the url?

If I recall, the original proposal for draft releases used random URLs to prevent people from being able to "guess" the draft release url, and forcing it to be something that they were given by PyPI. This was an attempt to limit some abuses of draft releases as some sort of release channel or something. But if we're just using urls that anyone can compute knowing nothing but project name + version, I don't see a ton of value in jumping through hopes to embed that rather than just doing something like:

/draft/$project/$version/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meain reason that we're calculatin a hash over these values is to avoid accidental usage by regular users. By requiring draft releases be installed with the use of this obfuscated "tokens" which are available and computable by maintainers, it sends the message that you shouldn't try to create them as a user. Avoiding abuse of this feature (as a private release channel for example) could be further mitigated by other means, like deleating the draft reelease after a certain time (think, a day). I do concede that the current implemention has more hoops to jump that I would like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't feel like there's going to be a huge risk of accidental usage if using it requires typing --index-url https://pypi.org/draft/foo/1.0/ and just happen to know that there exists a draft release of that url? Is there some other mechanism we expect users to enable using a draft release?

If we really do want to hash it like this, we could probably just do /draft/hash($project)/hash($version)/ yea? Could even do /draft/hash($project)hash($version)/ and just split based on length or use a - or something.

IOW, if we treat the project and version as distinct pieces of data to be hashed rather than trying to combine them and then hash them, we can sidestep all the need to denormalize since it wouldn't be any different than looking up by Project.name + Release.version.

I'd also suggest not using md5, not because this is security sensitive, but because some Linux distros patch hashlib.md5 to require passing a usedforsecurity=False attribute to make md5 work at all, and it's simpler to just side step that completely. It should be reasonable enough to use sha256, or if we're worried about the length of the URL and we can mandate 3.6+ (I don't recall what the tools involved support at the moment) you can use blake2 with a custom sized key.

Something like:

import hashlib

hashlib.blake2b(b"adf", digest_size=16).hexdigest()

Should produce digests of similar to MD5, you could probably go smaller too?

Could also use an alternative encoding if we're worried about the length of URLs:

import base64
import hashlib

hashlib.blake2b(b"adf", digest_size=16).hexdigest()  # 32 characters
base64.urlsafe_b64encode(hashlib.blake2b(b"adf", digest_size=16).digest()).strip(b"=") .decode("ascii") # 22 characters

Anyways, just some food for thought

@dstufft
Copy link
Member

dstufft commented Jun 28, 2022

There is now PEP 694: Upload 2.0 API for Python Package Repositories, which has discussions on discuss.python.org which is relevant to this issue.

@alanbato alanbato requested a review from a team as a code owner February 22, 2024 19:07
warsaw added a commit to warsaw/warehouse that referenced this pull request Jul 13, 2024
Resolves all merge conflicts that have crept in from
pypi#8941 since the last commit to that
branch.

This commit does *not* attempt to resolve any comments on the original PR, nor
have I even run the tests on this branch yet.  No promises that I haven't
botched the merge conflict resolutions either.
@warsaw
Copy link
Contributor

warsaw commented Jul 13, 2024

Over in #16277 I've resolved all the merge conflicts that have crept in as main evolved.

I have not attempted to resolve any comments on this original PR, nor have I even run the tests on this branch yet (let's see what the CI status is). No promises that I haven't botched the merge conflict resolutions either.

I'm planning on seeing if I can help move this initiative forward, so I do plan on doing those things. I also based my branch on this one, so @alanbato gets all due credit for the original work.

@alanbato
Copy link
Contributor Author

Oh, it's been quite a while! A thousand thank yous @warsaw, as I'm sure it was quite a lot of conlicts that came with time that you went through.

This PR was abandoned by me due to what I felt was not reaching a concensus on some of the pending decisions to about how this funcionality was going to be exposed to users, pending work on documenting (and then proposing a better version) of the PyPI API, and the discussion thread for this PR eventually went quiet too...

My free-time is not as abundant as it was 4 years ago, but I'd love to collaborate on this if we still think the community will benefit from such a feature :)

Let me know how I can help!

@warsaw
Copy link
Contributor

warsaw commented Jul 13, 2024

@alanbato I'm happy to help, and it's actually part of my job now to pitch in! I think now that we have PEP 694 to work from, maybe we can make good progress again. I do have quite a few questions about the current PEP draft, so it's probably worth resolving those in conjunction with working on a PR. I greatly appreciate the work you've done so far!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants