Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify specification around endianness and header padding #306

Merged
merged 3 commits into from
Aug 8, 2023

Conversation

akx
Copy link
Contributor

@akx akx commented Aug 1, 2023

What does this PR do?

This PR clarifies the readme around the spec a little; the endianness of the header size was not specified.

Further, as discussed in #291 (comment) serde_json (and in fact any spec-compliant JSON parser, I suppose) allows padding the JSON data with whitespace characters, so note that – and add a test for that, too.

EDIT: As discussed in the comments below, this also nails down the spec that the padding may only be trailing.


As an aside: it's not very clear for a contributor that you shouldn't modify README.md directly... 😁

@akx akx force-pushed the spec-clarifications branch 2 times, most recently from e3e11f7 to 8dc7f5f Compare August 1, 2023 14:13
@Narsil
Copy link
Collaborator

Narsil commented Aug 2, 2023

Hi @akx ,

I'm very fine adding a test, and the endianness precision.
I don't feel easy adding the whitespace to the SPEC. the spec says VALID JSON which should imo be plenty.
The additional notes do mention that there is no one single valid JSON spec, so weird tricks like this are not really defined on purpose (so we can revoke clearly ill defined JSON that cause actual security issues).
The space padding seems ok in the long term but it doesn't really suit my notion of "obviously valid json".
So adding in the notes feels better instead.

Currently the space padding is used to get alignment and is a "trick" and implementation detail, not something that should ever be required.
Mentioning in the docs that this particular implementation uses that is fine in my book ! I would keep only right hand side padding maybe (despite many more options being available as you mention) just so that implementors try to stick with that.

Are you ok with the proposed change about padding ?

@akx
Copy link
Contributor Author

akx commented Aug 2, 2023

I'm totally okay with specifying that only trailing whitespace (or null, but that's presently not accepted by serde-json) padding is allowed, but that doesn't reflect the reality of this current canonical implementation (which evidently allows leading whitespace padding).

Whether it's okay that this implementation is more lenient than the spec is up to you - if we do hammer it down in the spec that only trailing spacing is allowed, I would suggest we add a check that the first byte of the header is { in this implementation too, to make it less probable that leading-padded safetensors files begin appearing in the wild.

Ps. I'm just boarding a plane so I'm not able to reply immediately 😁

@Narsil
Copy link
Collaborator

Narsil commented Aug 2, 2023

null, but that's presently not accepted by serde-json

Educated guess but this prevents polyglot files because all C-strings are null terminated.

we do hammer it down in the spec that only trailing spacing is allowed,
I would suggest we add a check that the first byte of the header is { i

That's actually a neat idea.

@akx
Copy link
Contributor Author

akx commented Aug 2, 2023

Educated guess but this prevents polyglot files because all C-strings are null terminated.

Or maybe just that NUL isn't at all in the JSON spec whereas the documented ws characters are, and it's easier for a parser to just consume them when outside a string literal regardless of other parser state :)

Slightly sleep-addled idea - flights are being late - but the { check also opens up the possibility for other header encodings provided they don't start with the same character.

@akx
Copy link
Contributor Author

akx commented Aug 3, 2023

@Narsil Okiedoke, I amended the second commit to

  • write the spec more clearly
  • add the constraint that the first character of the decoded UTF-8 string must be { (i.e. leading padding is disallowed)

WDYT?

Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just the nit of removing other paddings in the spec.

README.md Outdated Show resolved Hide resolved
@akx akx requested a review from Narsil August 3, 2023 14:58
akx added 2 commits August 7, 2023 11:23
* note the data must begin with a `{` byte
* note the data may be padded at the end with whitespace
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
@akx
Copy link
Contributor Author

akx commented Aug 7, 2023

@Narsil Rebased + ran make.

@Narsil
Copy link
Collaborator

Narsil commented Aug 7, 2023

I'm making a release without this in. So we can have the deepspeed fix (DS shares tensors without actually shared tensors).

I will then go ahead and merge this so and advertise to other safetensors implementors see if what we're doing here is actually break free for them. (Which it shouldn't , if it is we can just revert).

@Narsil Narsil merged commit 27fa035 into huggingface:main Aug 8, 2023
9 checks passed
akx added a commit to akx/safetensors that referenced this pull request Dec 18, 2023
akx added a commit to akx/safetensors that referenced this pull request Jan 27, 2024
akx added a commit to akx/safetensors that referenced this pull request Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants