Cleaning up duplicate files #95

Closed
smcgivern opened this issue Nov 4, 2014 · 4 comments

Comments

@smcgivern

I'm not sure how, but my vault has a lot of duplicate files (not all files have been duplicated, and not all have been duplicated the same number of times):

$ cat Music.journal | cut -f 5,6,7,8 | sort | uniq -c | sort -nr | head -n 1
      7 43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac

So all seven of these files in the archive have the same size, mtime, treehash, and filename. I don't see a way to delete files by archive ID, but if I do that manually and remove the entries from the journal, is that a safe operation?

(I won't be deleting the files for a little while as it's still < 3 months since I put them there.)
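
The pipeline above counts journal lines that agree on fields 5 to 8. A rough Python equivalent that also collects the archive IDs in each group is sketched here. It assumes the tab-separated layout implied by `cut -f` and by the journal lines quoted later in this thread (record type, timestamp, `CREATED`, archive ID, size, mtime, treehash, filename); `find_duplicates` is a hypothetical helper name, not part of mtglacier.

```python
from collections import defaultdict


def find_duplicates(journal_lines):
    """Group CREATED records by (size, mtime, treehash, filename).

    Returns {key: [archive_id, ...]} for every key that has more than one
    distinct archive ID. Assumes 8 tab-separated fields per record, which
    is an assumption based on the journal excerpt in this thread.
    """
    groups = defaultdict(list)
    for line in journal_lines:
        # Split at most 7 times so the filename stays in one piece.
        fields = line.rstrip("\n").split("\t", 7)
        if len(fields) < 8 or fields[2] != "CREATED":
            continue  # not a record this sketch understands
        archive_id = fields[3]
        key = tuple(fields[4:8])
        if archive_id not in groups[key]:
            groups[key].append(archive_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Because the function only iterates over lines, an open file handle for `Music.journal` can be passed in directly. A group with several distinct archive IDs corresponds to several separate archives stored in the vault for the same file.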

@vsespb
Owner

vsespb commented Nov 4, 2014

but if I do that manually and remove the entries from the journal, is that a safe operation?

Yes. You can extract the entries for the files to delete into a new journal and run purge-vault with that new journal.
Then you can wait 24 hours and rebuild your journal with retrieve-inventory followed by download-inventory: https://github.com/vsespb/mt-aws-glacier#restoring-journal

I am not sure why you have duplicates, so make sure those are real duplicates on the server side (i.e. they have different archive IDs), and that you keep at least one archive for each filename.
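
One way to build the separate journal of entries to delete, as a hedged sketch rather than anything mtglacier provides: keep the first record for each (size, mtime, treehash, filename) combination and move later records that carry a different archive ID into a second list. The field layout (8 tab-separated fields with the archive ID in field 4) is an assumption taken from the journal lines shown in this thread, and `split_journal` is a hypothetical name. The output is worth checking by hand before it is used for anything destructive.

```python
def split_journal(journal_lines):
    """Return (keep, purge) as two lists of journal lines.

    keep:  the first CREATED record per (size, mtime, treehash, filename),
           plus every line this sketch does not understand (left untouched).
    purge: later records for the same key that carry a new archive ID.
    """
    seen = {}  # key -> set of archive IDs already assigned to keep or purge
    keep, purge = [], []
    for line in journal_lines:
        fields = line.rstrip("\n").split("\t", 7)
        if len(fields) < 8 or fields[2] != "CREATED":
            keep.append(line)  # pass unknown lines through unchanged
            continue
        archive_id, key = fields[3], tuple(fields[4:8])
        if key not in seen:
            seen[key] = {archive_id}
            keep.append(line)  # always keep one archive per file
        elif archive_id not in seen[key]:
            seen[key].add(archive_id)
            purge.append(line)  # a real second archive on the server
        # Otherwise the archive ID was already handled: this is a repeated
        # journal line, not another archive, so it goes in neither list.
    return keep, purge
```

Writing `purge` to a new file gives a journal that lists only the surplus archives, while `keep` still has one archive for every filename.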

my vault has a lot of duplicate files

Possible ways to run into this:

  1. Use always-positive

  2. Drop your journal file, then start a backup to a new journal. Then download-inventory. Repeat 7 times.

If you can remember in detail how you worked with the journal, I can try to investigate and find out why this happened.

@smcgivern
Author

These definitely appear to be duplicates on the server side: I retrieved a new journal file to check, and the vault size is bigger than expected (although not so big that it's really expensive, just annoying).

I'm sorry that I don't have more details on how this happened - I only noticed once I got my bill. My best guess at the moment is that I set up cron wrongly, and I was running multiple mtglacier instances against the same journal at once. The duplicate files are close to each other in the journal file, which might indicate that (I don't know if mtglacier locks the journal file).

Thanks so much for the quick response, let me know if there's anything else you want me to do to help debug this.

$ grep -n 'Björk/Volta/Björk-02-Wanderlust.flac' Music.journal
10007:B 1410599059  CREATED 0b7GVI-Lxjp_jVyYiMizZZb084aLN-uktjP6IbmG7iLlJxd289C6CHfKWwEP8IF_TDFbHI7KWZPr24paLrKPTzAIH8qYzAUcpDTjanSNBAxhjfcNbst4zMsPPm2edP3i5AODZibJBQ  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10016:B 1410599806  CREATED QkwdbLDjEeaS8D32x1wk0WIBcXfuSKd6kizW7rNEuAzu85f-eNO52XXQqS7i98RRNB8sDLRoLEnFmwpZ6d9NnYKR-JyfbxciUYbpcW-HKBLrdAtMtrPUtNNCoHJdjpq11L8s__ONNQ  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10018:B 1410600114  CREATED o3see3QYM4ikqWFEpgZyRT7T0OYAemlk2sxJvCIn7fJtxxGkNwNivg_G30_m4WXF1i81vZoNJzV0uKb_m1INOC42jQM-pfJ1lx17tdMpNol5b6qbqIRjGdiHX5U-O-h1zYA4CyGpdg  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10023:B 1410600344  CREATED ir7_cl3xLY3NhkLbCx6Ei69lg28kqj6cQeDcrTwQQnqKz-1i2n1aDJ78H8rCnm08g_fDiSD4kjxiL7GfT8sh5RKKt20t9-bopY1g4lMDHn_NrXCtVF6PbK15-hrJi6DtngeafNHUsA  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10033:B 1410601229  CREATED PVYLZscYQi_5mbPH-xDNklbXzXLHDIG3-SeONzUcGJioychyWWhoidvScZUtVTi3t0cuE-Q79TjQXGjM39TqCjvvt66iXl4ohdeEOQkdk8g3m4Q1KvYiwwGZ8aMCtTflX_MppnokoA  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10045:B 1410602363  CREATED Ckp7h9pP3oyM-cZLGfdXGyQ9-wY6U0qnDCDhXx0C2nwpvFEsY8vXFJjaLm3XyugENdfkuYWVj1w95Hab3LIKCBoOYK3jKk4ALekYMgS7z5Ez1WcmZpF6hlU0esD1nDFPtKsL67HZ8A  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac
10073:B 1410604494  CREATED NbrmfrGWwPkV_lS9_nFPbuzahskcvmoTvkoTGWgAq_K1pqu88SignldoWR-4hEj_m0L1LKJpWJNJjE0i3MkXVZzjR_tNPfS2-9LWa-Zv6gsJL3FLTdI7uHgJj5n2H6RYAAhOzXajXA  43807793    1214763861  40206cb83dfdf929900013b839d69c2618b68fa6c47716bd65ec28b6176fcf8e    Björk/Volta/Björk-02-Wanderlust.flac

@vsespb
Owner

vsespb commented Nov 4, 2014

I was running multiple mtglacier instances against the same journal at once

yes, that could be a reason.

I don't know if mtglacier locks the journal file

No :( I opened issue #96 as an enhancement. In the meantime you could use flock whenever you're not sure whether concurrent processes are running against the same journal (I use flock for my own backups with mtglacier).
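
The same idea can be applied from a wrapper script. The following is a minimal sketch that uses Python's standard `fcntl.flock` (Unix only) instead of the `flock` command-line tool; the `run_exclusively` helper and the lock-file path are made up for illustration, and the command to run is supplied by the caller.

```python
import fcntl
import subprocess


def run_exclusively(lock_path, command):
    """Run `command` only if nobody else holds the lock on `lock_path`.

    Returns the command's exit code, or None if the lock was already taken.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Exclusive and non-blocking: give up at once instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when lock_file is closed on leaving the block.
        return subprocess.call(command)
```

A cron job could call `run_exclusively("/tmp/Music.journal.lock", [...])` with its usual mtglacier command line as the list. A run that overlaps with a previous one then returns `None` without touching the journal.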

The duplicate files are close to each other in the journal file

1410604494 - 1410599059 is a range of about 90 minutes. That could happen if the whole backup process takes longer than 90 minutes and 7 mtglacier instances were running at the same time.
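
As a quick check of that figure, assuming (as the values suggest) that the second journal field is a Unix timestamp in seconds:

```python
first_upload = 1410599059  # timestamp on journal line 10007
last_upload = 1410604494   # timestamp on journal line 10073

elapsed_seconds = last_upload - first_upload
elapsed_minutes = elapsed_seconds / 60

print(elapsed_seconds)            # 5435
print(round(elapsed_minutes, 1))  # 90.6
```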

let me know if there's anything else you want me to do to help debug this

Probably not; concurrent access to the journal explains this.

@vsespb vsespb closed this as completed Nov 4, 2014
@smcgivern
Author

Thanks! I'll be more careful in future 😃
