Script for generating S3 versioned file stats #473
Codecov Report
@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
+ Coverage   82.27%   83.07%   +0.79%
==========================================
  Files          55       55
  Lines        5750     6043     +293
==========================================
+ Hits         4731     5020     +289
- Misses       1019     1023       +4
|
Please
|
FTR, here is what I got using version 0.11.0-14-g3d246b8 for s3://dandiarchive:
$> grep Size /tmp/dandiarchive-*
/tmp/dandiarchive-all.stat:All files: 197558 / Size: 26795152820245
/tmp/dandiarchive-invisible.stat:Invisible files: 14241 / Size: 8104409130145
/tmp/dandiarchive-old.stat:Old files: 538 / Size: 68846538
/tmp/dandiarchive-visible.stat:Visible files: 182779 / Size: 18690674843562
I have yet to analyze whether this is kosher (8 TB of invisible files sounds excessive, but maybe that is life).
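As a quick sanity check on the numbers above, here is a minimal sketch that parses the four summary lines and compares the per-category sums against the "All files" line. The helper name `parse_stats` is made up for illustration, and treating visible, invisible, and old as disjoint parts of "all" is an assumption being tested here, not something the script is known to guarantee.

```python
import re

# Summary lines copied from the grep output above (file-name prefixes dropped).
STAT_LINES = """\
All files: 197558 / Size: 26795152820245
Invisible files: 14241 / Size: 8104409130145
Old files: 538 / Size: 68846538
Visible files: 182779 / Size: 18690674843562
"""

STAT_RE = re.compile(r"^(?P<kind>\w+) files: (?P<count>\d+) / Size: (?P<size>\d+)")


def parse_stats(text):
    """Map each lowercased category name to a (count, size_in_bytes) tuple."""
    stats = {}
    for line in text.splitlines():
        m = STAT_RE.match(line)
        if m:
            stats[m["kind"].lower()] = (int(m["count"]), int(m["size"]))
    return stats


stats = parse_stats(STAT_LINES)
parts = ("visible", "invisible", "old")  # assumed to be disjoint categories
total_count = sum(stats[k][0] for k in parts)
total_size = sum(stats[k][1] for k in parts)

print("counts add up:", total_count == stats["all"][0])
print("sizes add up: ", total_size == stats["all"][1])
print("invisible share of bytes: %.1f%%" % (100 * stats["invisible"][1] / stats["all"][1]))
```

For this particular run the three categories do sum to the "All files" line, so the roughly 8 TB of invisible data is at least internally consistent with the other totals.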
for invisible there is a good number of such as
which are a bit odd to me -- they seem to be "directories" due to having trailing
$> datalad download-url s3://dandiarchive/dandiarchive/dandiarchive/data/?versionId=QHPjluoj8pQm0y7Lau7ldEiQ7rEiQXy5
[INFO ] Downloading 's3://dandiarchive/dandiarchive/dandiarchive/data/?versionId=QHPjluoj8pQm0y7Lau7ldEiQ7rEiQXy5' into '/tmp/'
[INFO ] S3 session: Connecting to the bucket dandiarchive anonymously
download_url(error): /tmp/ (file) [S3 refused to provide the key for dandiarchive/dandiarchive/data/ from url s3://dandiarchive/dandiarchive/dandiarchive/data/?versionId=QHPjluoj8pQm0y7Lau7ldEiQ7rEiQXy5: S3ResponseError: 403 Forbidden [s3.py:get_downloader_session:352]]
Anyway, I see that a fix was added in 3d246b8, so I will rerun now to get a cleaner picture.
we should also skip that directory:
@jwodder could you please add
@yarikoptic Is this for matching against just URLs of the form
The current use case is just the key (prefix) itself, not versionIds, which would be "fun" to match pragmatically via regexes anyways.
@yarikoptic
Great work @jwodder! Works like a charm. Left some notes from the review for posterity.
Current stats:
$> grep -h Size /tmp/dandiarchive-3-*
All files: 195975 / Size: 26793923588824 (26.8 TB)
Invisible files: 13705 / Size: 8104409130145 (8.1 TB)
Old files: 2 / Size: 68846538 (68.8 MB)
Visible files: 182270 / Size: 18689514458679 (18.7 TB)
from the run of
for k in all visible invisible old ; do tools/s3-gc-stats --stat $k --exclude s3://dandiarchive/dandiarchive/dandiarchive/data/.* --list s3://dandiarchive >| /tmp/dandiarchive-3-$k.stat; done
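For readers wondering how an `--exclude` regex like the one in the command above might be applied, here is a minimal sketch. It assumes the pattern is matched from the start of the `s3://bucket/key` URL, without the `?versionId=...` part; the actual matching logic in `tools/s3-gc-stats` may differ, and `make_excluder` is a made-up name.

```python
import re


def make_excluder(patterns):
    """Return a predicate that is true for URLs matching any exclude pattern.

    Assumption: patterns are anchored at the start of the URL (re.match),
    and the URL carries no versionId query string.
    """
    compiled = [re.compile(p) for p in patterns]

    def is_excluded(url):
        return any(rgx.match(url) for rgx in compiled)

    return is_excluded


is_excluded = make_excluder([r"s3://dandiarchive/dandiarchive/dandiarchive/data/.*"])

# The second and third keys are invented for illustration.
for url in (
    "s3://dandiarchive/dandiarchive/dandiarchive/data/",
    "s3://dandiarchive/dandiarchive/dandiarchive/data/some-file",
    "s3://dandiarchive/some-other-prefix/some-file",
):
    print(url, "->", "excluded" if is_excluded(url) else "kept")
```

When passing such a pattern on the command line, quoting it (for example in single quotes) avoids any chance of the shell trying to expand `.*` as a glob.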
# - list_object_versions sorts all versions (including delete markers) by key
#   (in ascending order) and last modified date (in descending order, sometimes
#   with consecutive equal timestamps) and returns them in chunks of 1000 (by
#   default), with each chunk divided into proper versions and delete markers.
note: these assumptions seem to be validated in the code below: https://github.com/dandi/dandi-cli/pull/473/files#diff-c840778a73d440599aee9be998e0b555e84831c627e589f54439e8d3f2819295R103 . Great!
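To make the ordering assumption in that comment concrete, here is a rough sketch of how it could be checked on pages shaped like `list_object_versions` responses. It is not the validation code from the PR: `merged_entries` and `check_order` are hypothetical helpers, and the sample pages are hand-written. Against a real bucket the pages would come from something like `client.get_paginator("list_object_versions").paginate(Bucket=...)` in boto3.

```python
from datetime import datetime, timezone


def merged_entries(page):
    """Merge a page's versions and delete markers back into listing order."""
    entries = [dict(v, IsDeleteMarker=False) for v in page.get("Versions", [])]
    entries += [dict(d, IsDeleteMarker=True) for d in page.get("DeleteMarkers", [])]
    # Two stable sorts: newest first, then by key, giving
    # key ascending and, within a key, last-modified descending.
    entries.sort(key=lambda e: e["LastModified"], reverse=True)
    entries.sort(key=lambda e: e["Key"])
    return entries


def check_order(pages):
    """Return True if the (key asc, date desc) order also holds across pages."""
    prev = None
    for page in pages:
        for e in merged_entries(page):
            cur = (e["Key"], -e["LastModified"].timestamp())
            if prev is not None and cur < prev:
                return False
            prev = cur
    return True


def ts(day):
    return datetime(2021, 3, day, tzinfo=timezone.utc)


# Invented sample data: key "a" was written, deleted, then re-created.
PAGES = [
    {
        "Versions": [
            {"Key": "a", "VersionId": "v3", "LastModified": ts(3)},
            {"Key": "a", "VersionId": "v1", "LastModified": ts(1)},
        ],
        "DeleteMarkers": [
            {"Key": "a", "VersionId": "d2", "LastModified": ts(2)},
        ],
    },
    {"Versions": [{"Key": "b", "VersionId": "v1", "LastModified": ts(1)}]},
]

print([e["VersionId"] for e in merged_entries(PAGES[0])])
print(check_order(PAGES))
```

Equal timestamps compare as equal here, so consecutive entries with the same last-modified date do not trip the check.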
for dm in page.get("DeleteMarkers", []):
    if dm["IsLatest"]:
        self.mark_deleted(dm["Key"])
for v in page["Versions"]:
note: it seems to correctly handle the keys which were added before versioning was turned on -- look at 1version-nonversioned1.txt and 2versions-nonversioned1.txt. GREAT!
(git)lena:~/proj/dandi/dandi-cli-master[gh-470]git
$> tools/s3-gc-stats --stat visible --list s3://datalad-test0-versioned
s3://datalad-test0-versioned/1version-nonversioned1.txt?versionId=null 8
s3://datalad-test0-versioned/1version-removed-recreated.txt?versionId=vurmCZo5d3.iAzTV.YAfTcMr.LDVjQIK 8
s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=V4Dqhu0QTEtxmvoNkCHGrjVZVomR1Ryo 8
s3://datalad-test0-versioned/2versions-nonversioned1.txt_sameprefix?versionId=QUOLbSrBHBmU3UNeEvdfcX2puqRffH9l 8
s3://datalad-test0-versioned/2versions-removed-recreated.txt?versionId=bBVSCB4MdBOeEXDQ2KwrjtrevpwFabaY 8
s3://datalad-test0-versioned/2versions-removed-recreated.txt_sameprefix?versionId=cfTdf3N8exZLFg.KcW5szQKrFNLUyCu1 8
s3://datalad-test0-versioned/3versions-allversioned.txt?versionId=Kvuind11HZh._dCPaDAb0OY9dRrQoTMn 8
s3://datalad-test0-versioned/3versions-allversioned.txt_sameprefix?versionId=Mvsc4FgJWc6gExwSw1d6wsLrnk6wdDVa 8
Visible files: 8 / Size: 64 (64 Bytes)
1 66166.....................................:Wed 17 Mar 2021 10:34:34 AM EDT:.
(git)lena:~/proj/dandi/dandi-cli-master[gh-470]git
$> tools/s3-gc-stats --stat invisible --list s3://datalad-test0-versioned
s3://datalad-test0-versioned/1version-removed-recreated.txt?versionId=B7cP2PKGRNeRZ7EktWN82UnUaGjb9dh_ 8
s3://datalad-test0-versioned/1version-removed.txt?versionId=eZ5Hgwo8azfBv3QT7aW9dmm2sbLUY.QP 8
s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=null 8
s3://datalad-test0-versioned/2versions-removed-recreated.txt?versionId=zwW0b567gYO3puJeLMZsOmETqowJnv6l 8
s3://datalad-test0-versioned/3versions-allversioned.txt?versionId=b.qCuh7Sg58VIYj8TVHzbRS97EvejzEl 8
s3://datalad-test0-versioned/3versions-allversioned.txt?versionId=pNsV5jJrnGATkmNrP8.i_xNH6CY4Mo5s 8
Invisible files: 6 / Size: 48 (48 Bytes)
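The visible/invisible split in the two listings above can be approximated with a short sketch. This is a simplification, not the script's actual logic: it assumes a version is "visible" exactly when S3 reports `IsLatest` for it (when a delete marker is the latest entry for a key, none of that key's versions are latest), and it ignores the "old" category. The sample page is hand-written to mimic a few entries from the test bucket; the delete marker's VersionId is a made-up placeholder.

```python
BUCKET = "datalad-test0-versioned"

# Hand-written page shaped like a list_object_versions response.
PAGES = [
    {
        "Versions": [
            # Uploaded before versioning was enabled, never replaced.
            {"Key": "1version-nonversioned1.txt", "VersionId": "null",
             "IsLatest": True, "Size": 8},
            # Removed: the delete marker below is the latest entry.
            {"Key": "1version-removed.txt",
             "VersionId": "eZ5Hgwo8azfBv3QT7aW9dmm2sbLUY.QP",
             "IsLatest": False, "Size": 8},
            # Pre-versioning upload later replaced by a versioned one.
            {"Key": "2versions-nonversioned1.txt",
             "VersionId": "V4Dqhu0QTEtxmvoNkCHGrjVZVomR1Ryo",
             "IsLatest": True, "Size": 8},
            {"Key": "2versions-nonversioned1.txt", "VersionId": "null",
             "IsLatest": False, "Size": 8},
        ],
        "DeleteMarkers": [
            {"Key": "1version-removed.txt",
             "VersionId": "hypothetical-delete-marker-id", "IsLatest": True},
        ],
    }
]


def classify(pages):
    """Split proper versions into (visible, invisible) lists by IsLatest."""
    visible, invisible = [], []
    for page in pages:
        # .get() guards against pages that lack a "Versions" list.
        for v in page.get("Versions", []):
            (visible if v["IsLatest"] else invisible).append(v)
    return visible, invisible


def report(label, versions):
    """Print one line per version plus a summary, like the listings above."""
    for v in versions:
        print(f"s3://{BUCKET}/{v['Key']}?versionId={v['VersionId']} {v['Size']}")
    print(f"{label} files: {len(versions)} / Size: {sum(v['Size'] for v in versions)}")


visible, invisible = classify(PAGES)
report("Visible", visible)
report("Invisible", invisible)
```

Note how the `null` version ID ends up visible for `1version-nonversioned1.txt` but invisible for `2versions-nonversioned1.txt`, matching the two listings above.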
Closes #470.