Merge pull request #1175 from dandi/update-zarr-checksums
Update Zarr design doc for changes to checksum format
jjnesbitt committed Jul 26, 2022
2 parents b0e593d + be31a75 commit 8c66fae
Showing 1 changed file with 12 additions and 9 deletions.
21 changes: 12 additions & 9 deletions doc/design/zarr-support-3.md
@@ -122,23 +122,26 @@ If a zarr archive `c1223302-aff4-44aa-bd4b-952aed997c78` contained only a single

The last `.checksum` contains the final tree hash for the entire zarr archive.

### .checksum file format
Each zarr file and directory in the archive has a path and a checksum (`md5`).
For files, this is simply the ETag.
### Zarr entry checksum format
Each zarr file and directory in the archive has a path and a checksum.
For files, this is simply the MD5 ETag.
For directories, it is calculated like so:
1. List all immediate child files and directories in JSON objects like `{"md5":"12345...67890","path":"foo/bar"}` (key order matters).
1. Create an object `{"directories":[...],"files":[...]}` (key order matters) containing those child objects, ordered alphabetically by `path`.
1. Take the resulting JSON list, serialize it to a string, and take the MD5 hash of that string.
1. List all immediate child files and directories in JSON objects like `{"digest":"12345...67890","name":"foo","size":69105}` (key order matters).
   - The size of a directory is the sum of the sizes of all files recursively within it.
1. Create an object `{"directories":[...],"files":[...]}` (key order matters) containing those child objects, ordered alphabetically by `name`.
1. Take the resulting JSON object, serialize it to a string (escaping all non-ASCII characters and with no spaces between tokens), and take the MD5 hash of that string.
1. The final checksum for the directory is then a string of the form `{md5_digest}-{file_count}--{size}`, where `file_count` is the total number of files recursively within the directory and `size` is the sum of their sizes.
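
The directory checksum steps in the updated format (`digest`, `name`, `size`) can be sketched in Python as follows. This is a minimal illustration, not the API server's implementation: the function name `directory_checksum` and the in-memory list-of-dicts representation are hypothetical, and "alphabetically" is assumed to mean Python's default string ordering.

```python
import hashlib
import json


def directory_checksum(files, directories):
    """Compute a directory checksum from its immediate children.

    `files` and `directories` are lists of dicts with the keys
    "digest", "name", and "size" (hypothetical representation).
    """
    by_name = lambda entry: entry["name"]
    listing = {
        "directories": sorted(directories, key=by_name),
        "files": sorted(files, key=by_name),
    }
    # sort_keys=True gives the key orders shown in the text
    # (directories, files and digest, name, size); separators removes
    # spaces between tokens; ensure_ascii escapes non-ASCII characters.
    serialized = json.dumps(
        listing, sort_keys=True, ensure_ascii=True, separators=(",", ":")
    )
    md5_digest = hashlib.md5(serialized.encode("utf-8")).hexdigest()

    # A child directory digest has the form {md5}-{file_count}--{size},
    # so its recursive file count is the second "-"-separated field.
    file_count = len(files) + sum(
        int(d["digest"].split("-")[1]) for d in directories
    )
    size = sum(e["size"] for e in files) + sum(e["size"] for e in directories)
    return f"{md5_digest}-{file_count}--{size}"
```

For example, a directory holding two files of sizes 2 and 3 and no subdirectories yields a checksum ending in `-2--5`, regardless of the order in which the children are passed in.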

### .checksum file format
For every directory in the zarr archive, the API server maintains a `.checksum` file which contains the checksum of the directory, and also the checksums of the directory contents for easier updates.
The `.checksum` file is stored in JSON format, exactly like the format used to calculate the checksum:
```
{"checksums":{"directories":[{"md5":"abc...def","path":"foo/baz"},...],"files":[{"md5":"12345...67890","path":"foo/bar"},...]},"md5":"09876...54321"}
{"checksums":{"directories":[{"digest":"abc...def-10--23","name":"foo","size":69105},...],"files":[{"digest":"12345...67890","name":"bar","size":42},...]},"digest":"09876...54321-501--65537"}
```

To update a `.checksum` file, the API server simply needs to read the existing contents, modify `checksums` to reflect the new state of the zarr archive, serialize and calculate the MD5, then save the new contents.
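
A rough sketch of that read, modify, serialize, and save cycle for the updated format is shown below. The names `update_checksum_file` and `_digest` are hypothetical, the file contents are passed around as strings rather than read from storage, and the sketch assumes the new digest carries the file count and size as described for directory checksums.

```python
import hashlib
import json


def _digest(listing):
    """Checksum of a {"directories": [...], "files": [...]} listing."""
    serialized = json.dumps(
        listing, sort_keys=True, ensure_ascii=True, separators=(",", ":")
    )
    md5 = hashlib.md5(serialized.encode("utf-8")).hexdigest()
    files, dirs = listing["files"], listing["directories"]
    # Child directory digests look like {md5}-{file_count}--{size}.
    count = len(files) + sum(int(d["digest"].split("-")[1]) for d in dirs)
    size = sum(e["size"] for e in files + dirs)
    return f"{md5}-{count}--{size}"


def update_checksum_file(contents, *, kind, entry):
    """Return new `.checksum` contents with `entry` added or replaced.

    `kind` is "files" or "directories"; `entry` is a dict with the keys
    "digest", "name", and "size".
    """
    listing = json.loads(contents)["checksums"]
    others = [e for e in listing[kind] if e["name"] != entry["name"]]
    listing[kind] = sorted(others + [entry], key=lambda e: e["name"])
    return json.dumps(
        {"checksums": listing, "digest": _digest(listing)},
        sort_keys=True,
        ensure_ascii=True,
        separators=(",", ":"),
    )
```

Because the stored `checksums` object is the same structure that gets hashed, the update never has to re-list the directory's children from storage.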

Every update to a `.checksum` file also requires updating the `.checksum` of the parent directory, since the `md5` of the child has change.
Every update to a `.checksum` file also requires updating the `.checksum` of the parent directory, since the `digest` of the child has changed.
This bubbles up to the top of the zarr archive, where the final `.checksum` for the entire archive can be found.
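
To illustrate the bubbling, the sketch below lists the `.checksum` files touched by a change to a single file, deepest directory first. The helper name `checksum_files_to_update` is hypothetical, and the paths are shown relative to the zarr archive root purely for illustration; where the server actually stores these files is not assumed here.

```python
from pathlib import PurePosixPath


def checksum_files_to_update(changed_path):
    """List the `.checksum` files affected by a change, deepest first."""
    # .parents walks from the containing directory up to the archive root.
    parents = PurePosixPath(changed_path).parents
    return [str(parent / ".checksum") for parent in parents]
```

A change to `foo/bar/0.0` therefore touches `foo/bar/.checksum`, then `foo/.checksum`, and finally the top-level `.checksum`.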

No spaces are used in JSON encodings.
@@ -149,4 +152,4 @@ This ensures that published dandisets are truly immutable.

Immutability is enforced by disabling the upload and delete endpoints for the zarr archive.

The client needs to aggressively inform users that publishing a dandiset with a zarr archive will render that zarr archive immutable.
