
Question rather than issue #17

Open · 80kk opened this issue Oct 26, 2018 · 11 comments

Comments

80kk commented Oct 26, 2018

What will happen if I run your script with the option to split into 4GB chunks against a folder containing both large files (>4GB) and files <4GB? Will it fail on the 'small' files and continue splitting the large ones? Let's say that I have the following folder structure:

/data/files/x/
/data/files/y/
/data/files/

All of them contain files ranging from 300MB to 100GB. Ideally I'd like to run the script in a tmux/screen session and check on it once a day.

numblr (Owner) commented Oct 26, 2018

Hi,

It should work fine even if some (or all) of the files are smaller than the split size. The chunk size only sets a maximum size for each chunk in the multipart upload; a file that fits into a single chunk below that maximum won't break anything.

Btw, I changed the code in the meantime to support multiple files, so you can now use wildcards to invoke the script. The wildcard expressions are expanded by the shell, so you can test what they cover by trying them with ls first. In your case ./glacierupload -v myvault /data/files/**/* /data/files/* should work, but you can test the wildcards with ls /data/files/**/* /data/files/*.

Best,
Thomas
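
A minimal sketch of that check, assuming bash (where the ** pattern only recurses into subdirectories when globstar is enabled):

    # preview what the wildcards expand to before committing to an upload
    shopt -s globstar
    ls /data/files/**/* /data/files/*
    # then run the script against the same expressions
    ./glacierupload -v myvault /data/files/**/* /data/files/*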

80kk (Author) commented Oct 26, 2018

Why is the maximum value of the split-size 2^22, which means 4TB, whilst the maximum chunk accepted by Glacier is 4GB (2^12)?

numblr (Owner) commented Oct 26, 2018

Sorry, that is a typo in the documentation; it should indeed be 12.
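
For reference, Glacier multipart uploads accept part sizes of 1MB times a power of two, from 1MB (2^0) up to 4GB (2^12), so the exponent maps to bytes roughly like this (variable names here are illustrative, not the script's own):

    # illustrative only: bytes for split-size exponent N, i.e. 1MB * 2^N; N=12 is Glacier's 4GB cap
    N=12
    PART_SIZE=$(( (1 << N) * 1024 * 1024 ))
    echo "$PART_SIZE"   # 4294967296 bytes = 4GB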

numblr (Owner) commented Oct 26, 2018

Btw, the advantage of a split size smaller than the file size is that the multiple parts are uploaded in parallel. If you specify multiple files in the glacierupload command, however, those are still processed in sequence (this might be a point for improvement). That means if you want to speed up the upload, you might consider not uploading all files in one command invocation, but starting several upload commands in parallel, each on a subset of your files.
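
A rough sketch of that suggestion with plain shell job control, using the directories from the earlier example (the log file names are made up):

    # one invocation per subset of files, each running in the background with its own log
    ./glacierupload -v myvault /data/files/x/* >x.log 2>&1 &
    ./glacierupload -v myvault /data/files/y/* >y.log 2>&1 &
    wait   # block until both background uploads have finished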

80kk (Author) commented Oct 26, 2018

What if I have 8TB of data to upload and the total space available on the local hard drive for cache/split is 250GB? Will the script clean up after each upload completes?

80kk (Author) commented Oct 26, 2018

It would also be great if the script could write to a log file. It can be txt, not necessarily json.

numblr (Owner) commented Oct 26, 2018

That should be fine: it only caches on disk the parts that are currently being uploaded, i.e. at most 4GB * (number of parallel uploads), and it cleans up after each part completes (in case of an error there might be some data left in tmp folders, but that should be cleaned up on a restart of the operating system). The number of parallel uploads for a single invocation of glacierupload is determined by the parallel command and is, I think, the number of CPUs available.

To get a log file you can just redirect the output to a file: ./glacierupload -v myvault * > result.txt 2> upload.log, or ./glacierupload -v myvault * >upload.log 2>&1 to get everything in one file. I can't tell off the top of my head, though, what output goes to stderr and what to stdout. The final result is already stored as a json file in the folder from which you start the upload (see the docs).
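
Tying that back to the tmux/screen plan from the first comment, one way to set it up (session and file names are arbitrary, and as noted above the split between stdout and stderr isn't guaranteed):

    # start a named tmux session so the upload survives a dropped SSH connection
    tmux new -s glacier
    # inside the session, send everything to a single log file
    ./glacierupload -v myvault /data/files/**/* /data/files/* >upload.log 2>&1
    # detach with Ctrl-b d; check progress later with:
    tail -f upload.log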

80kk (Author) commented Oct 26, 2018

Thanks. I am testing it now.

numblr (Owner) commented Oct 30, 2018

Just out of curiosity, did it work properly or did you run into any problems? If it worked fine I'll tag the current state as a release :)

80kk (Author) commented Oct 31, 2018

It is currently running (500GB out of 8TB so far). It is slow because the source data is mounted onto the EC2 instance using s3fs, and then the files are split into 2GB chunks.
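
For context, the setup described would look roughly like this (bucket name is a placeholder, and s3fs credentials are assumed to be configured separately):

    # mount the S3 bucket on the EC2 instance, then upload from the mount
    s3fs my-source-bucket /data/files
    ./glacierupload -v myvault /data/files/**/* /data/files/*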

numblr (Owner) commented Oct 31, 2018

Then fingers crossed ;) Btw, if you already have the data in S3(?), there might be easier options to get it into Glacier. I'm not really an expert on this, but I found, for example, this question.
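
One option that usually comes up in that context is an S3 lifecycle rule that transitions objects to the GLACIER storage class (note that this is S3's Glacier storage class, not a Glacier vault as used by the script); a sketch with a placeholder bucket name:

    # transition every object in the bucket to the GLACIER storage class as soon as possible
    aws s3api put-bucket-lifecycle-configuration --bucket my-source-bucket \
      --lifecycle-configuration '{"Rules": [{"ID": "to-glacier", "Status": "Enabled",
        "Filter": {"Prefix": ""}, "Transitions": [{"Days": 0, "StorageClass": "GLACIER"}]}]}'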
