
Question rather than issue #17

Open · 80kk opened this issue Oct 26, 2018 · 11 comments

Comments

80kk commented Oct 26, 2018

What will happen if I run your script with the option to split into 4GB chunks against a folder containing both large files (>4GB) and files <4GB? Will it fail on the 'small' files and continue splitting the large ones? Let's say that I have the following folder structure:

/data/files/x/
/data/files/y/
/data/files/

All of them contain files ranging from 300MB to 100GB. Ideally I'd like to run the script in a tmux/screen session and check on it once a day.

numblr (Owner) commented Oct 26, 2018

Hi,

It should work fine even if some (or all) of the files are smaller than the split size. The chunk size only sets a maximum size for each chunk in the multipart upload; a file that fits into a single chunk below that maximum won't break anything.

Btw, I changed the code in the meantime to support multiple files, so you can now use wildcards to invoke the script. The wildcard expressions are expanded by the shell, so you can test what they cover by trying them with ls first. In your case ./glacierupload -v myvault /data/files/**/* /data/files/* should work, but you can test the wildcards with ls /data/files/**/* /data/files/*.

Best,
Thomas
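
A minimal sketch of that check, assuming bash (where the ** pattern only recurses into subdirectories when globstar is enabled):

    # preview what the wildcards expand to before committing to an upload
    shopt -s globstar
    ls /data/files/**/* /data/files/*
    # then run the script against the same expressions
    ./glacierupload -v myvault /data/files/**/* /data/files/*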

80kk (Author) commented Oct 26, 2018

Why is the maximum value of the split-size 2^22, which means 4TB, whilst the maximum chunk accepted by Glacier is 4GB (2^12)?

numblr (Owner) commented Oct 26, 2018

Sorry, that is a typo in the documentation; it should indeed be 12.
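
For reference, Glacier multipart uploads accept part sizes of 1MB times a power of two, from 1MB (2^0) up to 4GB (2^12), so the exponent maps to bytes roughly like this (variable names here are illustrative, not the script's own):

    # illustrative only: bytes for split-size exponent N, i.e. 1MB * 2^N; N=12 is Glacier's 4GB cap
    N=12
    PART_SIZE=$(( (1 << N) * 1024 * 1024 ))
    echo "$PART_SIZE"   # 4294967296 bytes = 4GB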

numblr (Owner) commented Oct 26, 2018

Btw, the advantage of a split size smaller than the file size is that the multiple parts are uploaded in parallel. If you specify multiple files in the glacierupload command, however, those are still processed in sequence (this might be a point for improvement). That means if you want to speed up the upload, you might consider not uploading all files in one command invocation, but starting several upload commands in parallel, each on a subset of your files.
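
A rough sketch of that suggestion with plain shell job control, using the directories from the earlier example (the log file names are made up):

    # one invocation per subset of files, each running in the background with its own log
    ./glacierupload -v myvault /data/files/x/* >x.log 2>&1 &
    ./glacierupload -v myvault /data/files/y/* >y.log 2>&1 &
    wait   # block until both background uploads have finished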

80kk (Author) commented Oct 26, 2018

What if I have 8TB of data to upload and the total space available on the local hard drive for cache/split is 250GB? Will the script clean up after each upload completes?

80kk (Author) commented Oct 26, 2018

It would also be great if the script could write to a log file. It can be txt, not necessarily json.

numblr (Owner) commented Oct 26, 2018

That should be fine: it only caches on disk the parts that are currently being uploaded, i.e. at most 4GB * (number of parallel uploads), and it cleans up after each part completes (in case of an error there might be some data left in tmp folders, but that should be cleaned up on a restart of the operating system). The number of parallel uploads for a single invocation of glacierupload is determined by the parallel command and is, I think, the number of CPUs available.

To get a log file you can just redirect the output to a file: ./glacierupload -v myvault * > result.txt 2> upload.log, or ./glacierupload -v myvault * >upload.log 2>&1 to get everything in one file. I can't tell off the top of my head, though, what output goes to stderr and what to stdout. The final result is already stored as a json file in the folder from which you start the upload (see the docs).
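
Tying that back to the tmux/screen plan from the first comment, one way to set it up (session and file names are arbitrary, and as noted above the split between stdout and stderr isn't guaranteed):

    # start a named tmux session so the upload survives a dropped SSH connection
    tmux new -s glacier
    # inside the session, send everything to a single log file
    ./glacierupload -v myvault /data/files/**/* /data/files/* >upload.log 2>&1
    # detach with Ctrl-b d; check progress later with:
    tail -f upload.log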

80kk (Author) commented Oct 26, 2018

Thanks. I am testing it now.

numblr (Owner) commented Oct 30, 2018

Just out of curiosity, did it work properly or did you run into any problems? If it worked fine I'll tag the current state as a release :)

80kk (Author) commented Oct 31, 2018

It is currently running (500GB out of 8TB so far). It is slow because the source data is mounted onto the EC2 instance using s3fs, and then the files are split into 2GB chunks.
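
For context, the setup described would look roughly like this (bucket name is a placeholder, and s3fs credentials are assumed to be configured separately):

    # mount the S3 bucket on the EC2 instance, then upload from the mount
    s3fs my-source-bucket /data/files
    ./glacierupload -v myvault /data/files/**/* /data/files/*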

numblr (Owner) commented Oct 31, 2018

Then fingers crossed ;) Btw, if you already have the data in S3(?), there might be easier options to get it into Glacier. I'm not really an expert on this, but I found, for example, this question.
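
One option that usually comes up in that context is an S3 lifecycle rule that transitions objects to the GLACIER storage class (note that this is S3's Glacier storage class, not a Glacier vault as used by the script); a sketch with a placeholder bucket name:

    # transition every object in the bucket to the GLACIER storage class as soon as possible
    aws s3api put-bucket-lifecycle-configuration --bucket my-source-bucket \
      --lifecycle-configuration '{"Rules": [{"ID": "to-glacier", "Status": "Enabled",
        "Filter": {"Prefix": ""}, "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}]}]}'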
