download: download into a temp file + rename / clean upon failed attempt #198

Closed
yarikoptic opened this issue Aug 13, 2020 · 7 comments · Fixed by #247
@yarikoptic (Member)

As of the 0.6.0 implementation, download writes directly into the target location.

In common scenarios it is better to download into a temporary location and then move the file into the target location upon successful completion. That avoids users ending up with partial downloads, i.e. broken .nwb files, without knowing that they are partial. A similar strategy is used e.g. by the Chrome browser and by girder-client, which we stopped using for downloads (we only use it to initiate the streamed download).

Unlike girder_client, we do not want to download into a TMPDIR folder, since the file could be large, /tmp could lack that much space, and the transfer into the target location could entail an additional delay. I think the best option would be to download into a temp file in the target folder, marked with some prefix or suffix. E.g. for the file a/blah.nwb, keep the original file (if present) until renaming over it, and download into a/blah.nwb-dandidownload, to be renamed to a/blah.nwb upon successful completion.
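A minimal sketch of that scheme, assuming a plain HTTP GET via requests (the function and file names are illustrative, not dandi's actual API):

```python
import os
import requests

def download_with_rename(url, target):
    """Download `url` next to `target` and rename it into place only on success."""
    tmp = target + "-dandidownload"          # temp file lives in the target folder
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(tmp, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    # Atomic on the same filesystem: the original a/blah.nwb (if any) stays
    # intact until this point, and readers never see a half-written file.
    os.replace(tmp, target)
```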

What should we do if the download is interrupted? We could (my preference is the second option):

  • remove the partial download: but that might not happen if the process is hard-killed or the power goes down, so even if implemented we would still need the logic below to deal with partial downloads found on the drive
  • keep the partial download: this could be beneficial, e.g. for establishing a resumed download, so I think we should proceed this way

At the beginning of a download, what to do if a partial download is found:

  • it could mean that a second dandi client is currently running for the same dandiset. I think the easiest would be for a download to establish a lock file, e.g. using the fasteners library, probably residing alongside as blah.nwb.lck (well, there are write_locked helpers, so maybe a separate lck file is not even needed)
  • easiest option -- just remove it and start anew: since we will be dealing with large files, this could be very wasteful/undesired
  • with "Add "resume" functionality to download" (#189) we would be able to resume the download. BUT before resuming, it would be necessary to first establish that the file on the remote end is still the same as in the original (interrupted) download. For that reason, while locking the file for the download, we should ideally store somewhere (maybe in the filename, as in -dandidownload-<CHECKSUM>) the checksum we obtain from the metadata. If no checksum was recorded, download from the start. If there was a checksum and it matches what is about to be downloaded, just roll back a chunk and try to resume; if resuming fails (because the re-read "check" chunk differs, as in #189), start from the beginning. See the sketch after this list.
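A sketch of that decision logic, assuming the checksum is recorded in the temp file name as proposed (the helper name and rollback size are made up for illustration):

```python
import glob
import os

ROLLBACK = 1 << 20  # re-read this much before the resume point as a sanity check

def plan_download(target, expected_checksum):
    """Return ("resume", path, offset) or ("fresh", path, 0) for `target`."""
    prefix = target + "-dandidownload"
    for tmp in glob.glob(glob.escape(prefix) + "*"):
        recorded = tmp[len(prefix):].lstrip("-") or None
        if recorded == expected_checksum:
            # Same remote file as the interrupted attempt: roll back one chunk
            # and resume; the rolled-back chunk gets re-verified on resume.
            offset = max(os.path.getsize(tmp) - ROLLBACK, 0)
            return "resume", tmp, offset
        # No checksum recorded, or it no longer matches: discard and start over.
        os.unlink(tmp)
    return "fresh", f"{prefix}-{expected_checksum}", 0
```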

But maybe, if we rely on the checksum, we should just download into some .dandi/tmp/<CHECKSUM> and lock there? What do you think @jwodder?

jwodder commented Aug 13, 2020

We could do what Safari and Firefox do: When downloading to a/blah.nwb, create a directory a/blah.nwb.download containing the output file plus a file of metadata about the download (checksum etc.). If we need a lockfile, we can put it in the directory as well. When the download is finished, move the file to its final location and delete the rest of the directory.
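A sketch of that scheme, assuming a metadata JSON holding just the expected digest (the layout and names are illustrative, not what dandi necessarily ended up with):

```python
import hashlib
import json
import os
import shutil

def download_via_dir(chunks, target, expected_digest):
    """Write chunks into `target + ".download"`/file, then move into place."""
    workdir = target + ".download"
    os.makedirs(workdir, exist_ok=True)
    with open(os.path.join(workdir, "metadata.json"), "w") as f:
        json.dump({"digest": expected_digest}, f)   # enough to validate a later resume
    partial = os.path.join(workdir, "file")
    digest = hashlib.sha256()
    with open(partial, "wb") as f:
        for chunk in chunks:                        # `chunks` yields bytes
            f.write(chunk)
            digest.update(chunk)
    if expected_digest and digest.hexdigest() != expected_digest:
        raise RuntimeError(f"checksum mismatch; keeping {workdir} for inspection")
    os.replace(partial, target)                     # move file to its final location
    shutil.rmtree(workdir)                          # drop metadata (and lockfile, if any)
```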

@yarikoptic (Member, Author)

Sounds good @jwodder !

@yarikoptic (Member, Author)

Please implement this one against current master. This logic is agnostic of the underlying service (API vs. girder), so we should be good even after going API all the way.

jwodder commented Sep 16, 2020

  • What should happen if the downloaded file already exists (in its final location, not in a special .download folder) at the start of a download?

  • What should happen if another process already has a lock on the lockfile?

@yarikoptic (Member, Author)

What should happen if the downloaded file already exists (in its final location, not in a special .download folder) at the start of a download?

Download into the .download folder; when ready to replace the existing file, unlink it and move the new download into its place.
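For what it's worth, os.replace already overwrites an existing destination on both POSIX and Windows, so the unlink-and-move step can be a single call (paths below are illustrative):

```python
import os

# Replaces an existing a/blah.nwb in place; on the same filesystem this is
# atomic on POSIX, so readers see either the old file or the new one.
os.replace("a/blah.nwb.download/file", "a/blah.nwb")
```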

What should happen if another process already has a lock on the lockfile?

Raise an exception. We already have LockingError, which we use in girder-related code; I think it is OK to reuse it for now, since the girder code will go away later.
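A sketch of that behaviour with fasteners, using a non-blocking acquire so a held lock fails fast (LockingError here stands in for the existing exception; the helper name is made up):

```python
import fasteners

class LockingError(RuntimeError):
    """Stand-in for the existing exception mentioned above."""

def acquire_download_lock(lock_path):
    lock = fasteners.InterProcessLock(lock_path)
    # blocking=False: if another dandi process holds the lock, give up
    # immediately instead of waiting for it to finish.
    if not lock.acquire(blocking=False):
        raise LockingError(f"another process is already downloading ({lock_path})")
    return lock  # caller is responsible for lock.release() when done
```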

jwodder commented Sep 23, 2020

Exactly what metadata should be stored in the directory and checked when resuming a download? Just the checksum for the complete file?

@yarikoptic (Member, Author)

Storing a checksum should be sufficient, but the PR for this issue does not necessarily have to address resumed downloads (although it could, if only for the API downloader; girder will be gone anyway).
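Assuming the metadata file holds just that checksum, the check before resuming could be as small as this (the file layout and field name are hypothetical):

```python
import json

def can_resume(metadata_path, current_digest):
    """True if the recorded checksum still matches the asset's current digest."""
    try:
        with open(metadata_path) as f:
            recorded = json.load(f).get("digest")
    except FileNotFoundError:
        return False
    return recorded is not None and recorded == current_digest
```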

yarikoptic added a commit that referenced this issue Oct 2, 2020
Download files to temporary directory containing metadata