Script for generating S3 versioned file stats #473

Merged: 5 commits, Mar 17, 2021

Member

@jwodder jwodder commented Mar 15, 2021

Closes #470.

@jwodder jwodder added the internal Changes only affect the internal API label Mar 15, 2021

codecov bot commented Mar 15, 2021

Codecov Report

Merging #473 (e44fa8c) into master (d7d8b2d) will increase coverage by 0.79%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
+ Coverage   82.27%   83.07%   +0.79%     
==========================================
  Files          55       55              
  Lines        5750     6043     +293     
==========================================
+ Hits         4731     5020     +289     
- Misses       1019     1023       +4     
Flag Coverage Δ
unittests 83.07% <ø> (+0.79%) ⬆️

Impacted Files Coverage Δ
dandi/cli/command.py 42.52% <0.00%> (-2.68%) ⬇️
dandi/consts.py 100.00% <0.00%> (ø)
dandi/tests/test_metadata.py 100.00% <0.00%> (ø)
dandi/dandiapi.py 87.72% <0.00%> (+0.33%) ⬆️
dandi/__init__.py 73.91% <0.00%> (+1.18%) ⬆️
dandi/models.py 90.64% <0.00%> (+4.45%) ⬆️
dandi/metadata.py 82.82% <0.00%> (+5.37%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d7d8b2d...e44fa8c.

@jwodder jwodder marked this pull request as ready for review March 15, 2021 19:14
@yarikoptic
Member

Please format sizes with humanize.naturalsize, e.g.

Invisible files: 14241 / Size: 8104409130145 (8.1 TB)
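
For illustration, the requested format (raw byte count followed by a human-readable size) could be produced roughly as follows. This is only a sketch: it is a stdlib-only approximation of the default decimal output of `humanize.naturalsize`, and the helper names `natural_size` and `format_stat` are hypothetical, not taken from the script.

```python
def natural_size(num_bytes: int) -> str:
    """Approximate humanize.naturalsize() defaults: decimal units, one decimal place."""
    if num_bytes == 1:
        return "1 Byte"
    if num_bytes < 1000:
        return f"{num_bytes} Bytes"
    value = float(num_bytes)
    unit = "kB"
    for unit in ("kB", "MB", "GB", "TB", "PB"):
        value /= 1000
        if value < 1000:
            break
    return f"{value:.1f} {unit}"


def format_stat(label: str, count: int, size: int) -> str:
    """Build a summary line with both the exact and the humanized size (hypothetical helper)."""
    return f"{label} files: {count} / Size: {size} ({natural_size(size)})"


print(format_stat("Invisible", 14241, 8104409130145))
# Invisible files: 14241 / Size: 8104409130145 (8.1 TB)
```

Keeping the exact byte count next to the humanized value means the output stays usable for scripted comparisons while being readable at a glance.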

@yarikoptic
Member

FTR, here is what I got using version 0.11.0-14-g3d246b8 for s3://dandiarchive:

$> grep Size /tmp/dandiarchive-*
/tmp/dandiarchive-all.stat:All files: 197558 / Size: 26795152820245
/tmp/dandiarchive-invisible.stat:Invisible files: 14241 / Size: 8104409130145
/tmp/dandiarchive-old.stat:Old files: 538 / Size: 68846538
/tmp/dandiarchive-visible.stat:Visible files: 182779 / Size: 18690674843562

I have yet to analyze whether this is kosher (8 TB of invisible files sounds excessive, but maybe that is just life)

@yarikoptic
Member

For invisible, there are a good number of entries such as

s3://dandiarchive/dandiarchive/dandiarchive/data/?versionId=xRkqMiJ_F_OwegaT_m8dtI1MQW6B7SF3 0
s3://dandiarchive/dandiarchive/dandiarchive/data/?versionId=TN8czMbM5omaaeuwEtSaarGsH6lOz.io 0

which are a bit odd to me: they seem to be "directories" because of the trailing / in them, and I did not know that a key could exist like that. datalad fails to "download" them with a 403:

$> datalad download-url s3://dandiarchive/dandiarchive/dandiarchive/data/?versionId=QHPjluoj8pQm0y7Lau7ldEiQ7rEiQXy5
[INFO   ] Downloading 's3://dandiarchive/dandiarchive/dandiarchive/data/?versionId=QHPjluoj8pQm0y7Lau7ldEiQ7rEiQXy5' into '/tmp/' 
[INFO   ] S3 session: Connecting to the bucket dandiarchive anonymously 
download_url(error): /tmp/ (file) [S3 refused to provide the key for dandiarchive/dandiarchive/data/ from url s3://dandiarchive/dandiarchive/dandiarchive/data/?versionId=QHPjluoj8pQm0y7Lau7ldEiQ7rEiQXy5: S3ResponseError: 403 Forbidden
 [s3.py:get_downloader_session:352]]

Anyway, I see that a fix was added in 3d246b8, so I will rerun now to get a cleaner picture.

@satra
Member

satra commented Mar 16, 2021

We should also skip that directory, s3://dandiarchive/dandiarchive: it was added as a private component related to the AWS inventory function and is non-public, I believe.

@yarikoptic
Member

@jwodder could you please add --exclude URL_REGEX so we could exclude some URLs?

@jwodder
Member Author

jwodder commented Mar 16, 2021

@yarikoptic Is this for matching against just URLs of the form s3://BUCKET/KEY, or should versionIds be included in the URLs being matched against?

@yarikoptic
Member

The current use case is just the key (prefix) itself, not versionIds, which would be "fun" to match pragmatically via regexes anyway.

@jwodder
Member Author

jwodder commented Mar 16, 2021

@yarikoptic --exclude option added.
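
As a rough sketch of the behavior discussed above, an exclusion regex can be applied to URLs of the form s3://BUCKET/KEY, with no versionId included. The function name `filter_keys`, the choice of a full match rather than a prefix or substring match, and the sample key `blobs/example` are assumptions made for illustration; the thread does not say how the script matches internally.

```python
import re
from typing import Iterable, Iterator, Optional


def filter_keys(
    bucket: str,
    keys: Iterable[str],
    exclude: Optional[str] = None,
) -> Iterator[str]:
    """Yield the keys whose s3://BUCKET/KEY URL does not match the exclude regex."""
    pattern = re.compile(exclude) if exclude else None
    for key in keys:
        # The versionId is deliberately left out of the URL being matched.
        url = f"s3://{bucket}/{key}"
        # Assumption: the regex has to match the entire URL.
        if pattern is not None and pattern.fullmatch(url):
            continue
        yield key


keys = ["dandiarchive/dandiarchive/data/", "blobs/example"]  # hypothetical keys
kept = list(filter_keys("dandiarchive", keys, r"s3://dandiarchive/dandiarchive/.*"))
print(kept)  # ['blobs/example']
```

Matching on the key alone keeps the regex simple, since version IDs are opaque strings that are awkward to target with a pattern.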

Member

@yarikoptic yarikoptic left a comment

Great work @jwodder! Works like a charm. I left some notes from the review for posterity.

Current stats:

$> grep -h Size /tmp/dandiarchive-3-*
All files: 195975 / Size: 26793923588824 (26.8 TB)
Invisible files: 13705 / Size: 8104409130145 (8.1 TB)
Old files: 2 / Size: 68846538 (68.8 MB)
Visible files: 182270 / Size: 18689514458679 (18.7 TB)

from the run of

for k in all visible invisible old ; do tools/s3-gc-stats --stat $k --exclude s3://dandiarchive/dandiarchive/dandiarchive/data/.* --list s3://dandiarchive >| /tmp/dandiarchive-3-$k.stat; done

# - list_object_versions sorts all versions (including delete markers) by key
# (in ascending order) and last modified date (in descending order, sometimes
# with consecutive equal timestamps) and returns them in chunks of 1000 (by
# default), with each chunk divided into proper versions and delete markers.

for dm in page.get("DeleteMarkers", []):
    if dm["IsLatest"]:
        self.mark_deleted(dm["Key"])
for v in page["Versions"]:
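
To make the page structure in the excerpt above concrete, here is a minimal sketch that splits versions into visible and invisible sets from `list_object_versions`-style pages. It assumes that "visible" means the version flagged `IsLatest` and "invisible" means any other version, which is consistent with the sample output in this thread but is not confirmed as the script's actual logic; "old" is not modeled. The function names and the sample page are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, str, int]  # (key, version id, size in bytes)


def classify_versions(pages: Iterable[dict]) -> Dict[str, List[Entry]]:
    """Group object versions by whether they are the latest version of their key."""
    stats: Dict[str, List[Entry]] = {"visible": [], "invisible": []}
    for page in pages:
        # Delete markers carry no size. When the latest entry for a key is a
        # delete marker, none of that key's versions has IsLatest set, so all
        # of them land in "invisible" without special handling here.
        for v in page.get("Versions", []):
            entry = (v["Key"], v["VersionId"], v["Size"])
            stats["visible" if v["IsLatest"] else "invisible"].append(entry)
    return stats


def summarize(entries: List[Entry]) -> Tuple[int, int]:
    """Return (number of versions, total size in bytes)."""
    return len(entries), sum(size for _, _, size in entries)


# Hypothetical page shaped like one chunk of a list_object_versions response.
pages = [
    {
        "Versions": [
            {"Key": "a.txt", "VersionId": "v2", "IsLatest": True, "Size": 8},
            {"Key": "a.txt", "VersionId": "v1", "IsLatest": False, "Size": 8},
            {"Key": "gone.txt", "VersionId": "v1", "IsLatest": False, "Size": 8},
        ],
        "DeleteMarkers": [{"Key": "gone.txt", "VersionId": "d1", "IsLatest": True}],
    }
]
stats = classify_versions(pages)
print(summarize(stats["visible"]))    # (1, 8)
print(summarize(stats["invisible"]))  # (2, 16)
```

With real data, the pages would come from a paginated listing of the bucket rather than a literal list.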

Note: it seems to correctly handle the keys that were added before versioning was turned on; look at 1version-nonversioned1.txt and 2versions-nonversioned1.txt. GREAT!
(git)lena:~/proj/dandi/dandi-cli-master[gh-470]git
$> tools/s3-gc-stats --stat visible --list s3://datalad-test0-versioned
s3://datalad-test0-versioned/1version-nonversioned1.txt?versionId=null 8
s3://datalad-test0-versioned/1version-removed-recreated.txt?versionId=vurmCZo5d3.iAzTV.YAfTcMr.LDVjQIK 8
s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=V4Dqhu0QTEtxmvoNkCHGrjVZVomR1Ryo 8
s3://datalad-test0-versioned/2versions-nonversioned1.txt_sameprefix?versionId=QUOLbSrBHBmU3UNeEvdfcX2puqRffH9l 8
s3://datalad-test0-versioned/2versions-removed-recreated.txt?versionId=bBVSCB4MdBOeEXDQ2KwrjtrevpwFabaY 8
s3://datalad-test0-versioned/2versions-removed-recreated.txt_sameprefix?versionId=cfTdf3N8exZLFg.KcW5szQKrFNLUyCu1 8
s3://datalad-test0-versioned/3versions-allversioned.txt?versionId=Kvuind11HZh._dCPaDAb0OY9dRrQoTMn 8
s3://datalad-test0-versioned/3versions-allversioned.txt_sameprefix?versionId=Mvsc4FgJWc6gExwSw1d6wsLrnk6wdDVa 8
Visible files: 8 / Size: 64 (64 Bytes)
1 66166.....................................:Wed 17 Mar 2021 10:34:34 AM EDT:.
(git)lena:~/proj/dandi/dandi-cli-master[gh-470]git
$> tools/s3-gc-stats --stat invisible --list s3://datalad-test0-versioned
s3://datalad-test0-versioned/1version-removed-recreated.txt?versionId=B7cP2PKGRNeRZ7EktWN82UnUaGjb9dh_ 8
s3://datalad-test0-versioned/1version-removed.txt?versionId=eZ5Hgwo8azfBv3QT7aW9dmm2sbLUY.QP 8
s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=null 8
s3://datalad-test0-versioned/2versions-removed-recreated.txt?versionId=zwW0b567gYO3puJeLMZsOmETqowJnv6l 8
s3://datalad-test0-versioned/3versions-allversioned.txt?versionId=b.qCuh7Sg58VIYj8TVHzbRS97EvejzEl 8
s3://datalad-test0-versioned/3versions-allversioned.txt?versionId=pNsV5jJrnGATkmNrP8.i_xNH6CY4Mo5s 8
Invisible files: 6 / Size: 48 (48 Bytes)

@yarikoptic yarikoptic merged commit b813347 into master Mar 17, 2021
@yarikoptic yarikoptic deleted the gh-470 branch March 17, 2021 14:42
Linked issue: tools/s3-gc-stats helper script