Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

we need to disambiguate UUIDs in the API #165

Closed
yarikoptic opened this issue Mar 19, 2021 · 12 comments · Fixed by #169
Closed

we need to disambiguate UUIDs in the API #165

yarikoptic opened this issue Mar 19, 2021 · 12 comments · Fixed by #169
Assignees

Comments

@yarikoptic
Copy link
Member

yarikoptic commented Mar 19, 2021

"Moving" discussion from dandi/dandi-cli#478 (comment)

In #150 we used uploads:upload_id, blobs:key and assets:uuid to spec out new upload API and things were clear. But after #156 (and possibly subsequent changes) all of those 3 became a UUID, and now manifest themselves as a "uuid" throughout API interface on swagger. It becomes a huge source of confusion (at least) to me.

E.g. now looking at PUT ​/dandisets​/{versions__dandiset__pk}​/versions​/{versions__version}​/assets​/{uuid}​/ in swagger which is documented to be for "Update the metadata of an asset.", I expect no need for any UUID since we are updating an existing asset with a known UUID (from endpoint), but parameters include uuid* | string($uuid) title: Uuid and it is not clear not only why we need a UUID there and what UUID is expected:
of a blob? but I am just updating metadata... and POST should be used to just create a new asset since we are no longer "tracking" some notion of an "identity" of an asset after #143 edit: ok -- it might be blobs:uuid if we want to replace an asset (in a path) with a new asset, but blobs:uuid should be optional in such case -- from looks of it seems to be required (separate issue?); alternative - just remove uuid from that PUT and then POST just be used for that and API would handle removing prior asset with that path.

Moreover: I do not see any reason to have upload_id to be a uuid -- unlike for blobs:uuid AKA blobs:key (I would have even kept it called key for disambiguation) which should be something of consistent length and randomly evenly spread out for blobs keystore, and assets should be a globally unique identifier, upload_id as a simple auto-increment index would be just as fine IMHO.

So I would suggest following changes

  • uploads table: uuid -> upload_id (and do not bother making it uuid at all, unless there is any benefit)
  • blobs table: uuid -> key (which ATM indeed should be a uuidv4 in its value)
    and leave uuid to be used only within assets table.
@dchiquito
Copy link
Contributor

To make things worse, there is actually a fourth UUID: S3 generates a UUID called upload_id whenever you start a multipart upload. Clients should never need to use it directly so I tucked it away, but you can still see it in the response from /uploads/initialize/:

{
  uuid: uuid,
  multipart_upload: {
    object_key: string,
    upload_id: uuid,
    parts: [{
      part_number: number,
      size: number,
      upload_url: url,
    }]
  }
}

I decided to label everything uuid because they are actually all the same value at different points in the upload lifecycle.

  • During /upload/initialize/, an Upload is created with a random UUID (uploads:upload_id).
  • That same UUID is also used to generate the object store key (blobs:key). So if the Upload key is 123456789, the blob will be stored at /blobs/123/456/123456789.
  • Finally, when /uploads/{uuid}/validate/ is called with a never-before-uploaded blob, it creates a new AssetBlob using that same UUID.
    Since they are all the same value, I would prefer to keep them all named the same thing.

That's a good point about the PUT ​/dandisets​/{versions__dandiset__pk}​/versions​/{versions__version}​/assets​/{uuid}​/ having two different UUIDs both labeled uuid. You are correct, the URL is the UUID of an asset, and the body is the UUID of a blob. (I think it's actually correct to require metadata and uuid, as PUT should require you to completely redefine the object. PATCH should allow you to only change the metadata.)

@yarikoptic How about renaming the upload flow UUID to blob_uuid and renaming the asset UUID to asset_uuid?

@yarikoptic
Copy link
Member Author

Thank you @dchiquito for the explanation!

Although I do appreciate (and even somewhat like a "smart reuse") the "multiple lives of a uuid" (as both uploads:upload_id and blobs:key) IMHO not worth it: my main concern is then not so obvious and thus easy to fall for /uploads/{uuid}/validate which might return another blobs:key (if already exists), and if client just disregards that, they could proceed with the uuid they have. Decoupling of uploads:upload_id and blobs:key would prevent people trying to reuse the same uuid of uploads:upload_id as the blobs:key, thus making use of API more robust (at a minimal cost of minting one more uuid) since it would not take them to the point where they re-upload an existing blob to discover that they should have cared about returned value.

To make things worse, there is actually a fourth UUID: S3 generates a UUID called upload_id whenever you start a multipart upload. Clients should never need to use it directly so I tucked it away, but you can still see it in the response from /uploads/initialize/:

{
  uuid: uuid,

If not to be anyhow used by the client it IMHO either should not be returned to the client client (IMHO preferable), or made explicitly into smth like s3_upload_uuid, or even reusing that UUID for uploads:upload_id (so we could document that "uploads_id uuid for S3 uploads is the same as provided by S3 for the upload session")

@yarikoptic
Copy link
Member Author

@yarikoptic How about renaming the upload flow UUID to blob_uuid and renaming the asset UUID to asset_uuid?

Re asset_uuid: I do not mind making it more explicit at the API (/DB) level into asset_uuid, might make parameters more self-descriptive, but unlike blob_uuid, a uuid of an asset is also present within asset's metadata schema (as uuid) and something (don't see yet) bothers me that such rename might introduce some oddity... but may be it would be fine. WDYT @satra?

Re upload flow UUID: per my comment above, I would prefer to have dedicated upload_uuid and blob_uuid

@satra
Copy link
Member

satra commented Mar 22, 2021

regarding asset uuid, i thought we changed it so that the uuid is added on the server side when returning metadata, but the metadata is not stored with that info. i suspect posting assets with uuids has not yet been taken care of. we should perhaps discuss that separately from the current migration plans.

@yarikoptic
Copy link
Member Author

Yes, server wouldn't store it on metadata record in DB but IIRC would add it to the record when metadata is requested, and also we have it as uuid in AssetMeta

@waxlamp
Copy link
Member

waxlamp commented Mar 22, 2021

I like the idea of using <resource type>_id for all such UUIDs. That emphasizes the opaque nature of these UUIDs, and that they are "typed" to the REST API section in which they appear. So I'd suggest using the names dandi_upload_id*, blod_id and asset_id in order to disambiguate. In particular, the endpoint that validates an upload would be /uploads/{dandi_upload_id}/validate, and its response would include a field named blob_id, referring to the newly created and validated blob.

Yarik, I get where you're coming from w.r.t "id" vs. "key", but I think it's best to call everything an "id" and treat them opaquely. Then there's a kind of uniformity in engaging the API, and the names of the parameters can guide as in how to make use of them.

*Ideally this would be upload_id, but that term already has meaning within S3 (as Daniel pointed out), so for further explicitness, we'd use dandi_upload_id.

@yarikoptic
Copy link
Member Author

*Ideally this would be upload_id, but that term already has meaning within S3 (as Daniel pointed out), so for further explicitness, we'd use dandi_upload_id.

I would strongly disagree on adding dandi_ prefix anywhere in DANDI API: Everything in API is dandi "by default": upload_id, blob_id, whatever... Moreover, I would prefer to not even expose any backend specific IDs at all, unless of some immediate need or notable benefit. And then if needed to be included, they might better be either placed into a dedicated key within record, e.g. "s3_upload": {"uuid": "..."} or at least prefixed as such (s3_upload_id). (I do understand that with such argument then current upload_urls should already become some dedicated s3_upload_urls but not really since those aren't necessarily s3 specific and some other platform could mint similar upload urls)

Also note

  • that we might not be even talking about Amazon S3 specifically but may be about minio... again not sure what exposing of underlying storage solution UUID gives us
  • there eventually be other types of "upload" mechanisms (see our elderly original API google doc: from ipfs, from torrent, from a url etc), and they might also need to provide their own record keys to instruct client.

@yarikoptic
Copy link
Member Author

yarikoptic commented Mar 22, 2021

note (primarily to myself ;)): in DandiMeta schema we do not have .uuid, we have .identifier (which is UUID4 type) so I guess my fear of making uuid into asset_uuid has no ground whatsoever anyways, so

I like the idea of using <resource type>_id for all such UUIDs

so under the premise of id being a shortcut for identifier at the level of API/DB, and <resource type>_ prefix added to variables in API to disambiguate between different (resource) types, I think it would be good indeed to have asset_id (and upload_id, and blob_id) ;)

/uploads/{dandi_upload_id}/validate, and its response would include a field named blob_id, referring to the newly created and validated blob.

+1 (edit -- -1 on that dandi_ prefix ;-))

Yarik, I get where you're coming from w.r.t "id" vs. "key", but I think it's best to call everything an "id" and treat them opaquely.

+1 on that, i.e. making blobs:key into blobs:blob_id. Again would have an easier translation to "identifier of a blob".

@yarikoptic
Copy link
Member Author

@dchiquito so what is your take on making upload_id blob_id and asset_id and hiding away (not returning) any amazon specific ID (or just reusing it for upload_id)?

@dchiquito
Copy link
Contributor

I think that will work. I will work on that this morning.

@yarikoptic
Copy link
Member Author

Great, hopefully it would also resolve dandi/dandi-cli#479 (comment)

@dchiquito
Copy link
Contributor

attn @jwodder the response from /uploads/initialize/ is now:

{
  upload_id: uuid,
  parts: [...]
}

and the response from /uploads/validate/ is now:

{
  blob_id: uuid,
  ...
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants