Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage collection #177

Open
yarikoptic opened this issue Mar 24, 2021 · 8 comments · Fixed by #560
Open

Garbage collection #177

yarikoptic opened this issue Mar 24, 2021 · 8 comments · Fixed by #560
Assignees
Labels
enhancement New feature or request

Comments

@yarikoptic
Copy link
Member

yarikoptic commented Mar 24, 2021

I am adding to migration target since we better iron it out before switching to dandi-api: bugs in GC can lead to data loss thus IMHO we first need make sure (as with extensive unit-testing and user-testing) that it works reliably before deploying it.

Initial design sketch on garbage collection is present within https://github.com/dandi/dandi-api/pull/150/files#diff-c96d4444d1714a52d5d08dd92d94919393a7db8ded038aa84f02ba1075d2c25eR37 but I think it is worth removing it from that PR and starting a new dedicated one.

I see following targets for GC

  • uploads: not sure how if we could query S3 for some timestamp of the last activity for the upload urls
  • blobs: any blob not linked to an asset is subject to GC
    • GCing of a blob (blob_id) should also GC the actual key file in the store. We have 1-to-1 relationship ATM
  • assets: any asset not linked to any (released or draft) version of a dandiset
  • (edited) zarrs : like a blob, a zarr file which is not present in a dandiset it is linked to

Additional aspects:

  • Might need DB-level locking unless we touch any blob or asset upon "being queried", since otherwise we might GC a blob in the middle of an asset being "minted" for an existing blob; very less likely but I guess could happen for assets in GC of an asset is triggered while we are "modifying" it and creating a new asset
  • For any of those we might wan to allow for some "expiry" duration, so even if not "linked", but newer than e.g. a week (or 24h) -- keep around; If we add touch timestamping, this would allow to avoid locking (but would be costly for DB)
  • Bucket is versioned (yeap -- on my insistence), so there might be need for a 2nd stage GCing picking up elderly/stale keys in S3, unless removing of a key would explicitly take care about removing it (not just adding DeleteMarker)
@waxlamp
Copy link
Member

waxlamp commented Mar 24, 2021

Can you explain why this is a migration blocker?

@satra
Copy link
Member

satra commented Mar 24, 2021

just a check here, i thought there is no more uploads - it goes straight into blobs.

@yarikoptic
Copy link
Member Author

Can you explain why this is a migration blocker?

because we do need GC sooner than later, in particular since large datasets are to come soon. It could be "later", but placing it into production where we cannot afford loosing data would be trickier IMHO than implementing it before we "migrate" and while testing the platform, and not carrying much if we loose any data since we would probably still rebootstrap a few times.

@yarikoptic
Copy link
Member Author

just a check here, i thought there is no more uploads - it goes straight into blobs.

yes. But what happens if an upload is never completed or validated? aren't we ending up with 1. stale uploads: records in DB; 2. incomplete keys in the keystore?

@waxlamp
Copy link
Member

waxlamp commented Dec 30, 2021

Design doc is at #560.

@waxlamp waxlamp added the enhancement New feature or request label Dec 30, 2021
waxlamp pushed a commit that referenced this issue Jan 7, 2022
@dchiquito dchiquito reopened this Apr 28, 2022
@dandibot
Copy link
Member

dandibot commented Apr 29, 2022

🚀 Issue was released in v0.2.18 🚀

edit by @yarikoptic: PR with design doc incorrectly marked this issue being fixed, it was not

@dandibot dandibot added the released This issue/pull request has been released. label Apr 29, 2022
@jjnesbitt jjnesbitt removed the released This issue/pull request has been released. label Apr 30, 2022
@yarikoptic
Copy link
Member Author

Now we do have a well aged (10 month) design doc in https://github.com/dandi/dandi-archive/blob/master/doc/design/garbage-collection-1.md . It would be great to re-assess it and implement. In particular in the light of #1450 which might soon produce thousands of loose assets which would get replaced with ones with freshier metadata records.

@waxlamp
Copy link
Member

waxlamp commented Mar 1, 2023

This probably depends on #524.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants