Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behavior with Vec::set_len #111

Closed
stoneeric opened this issue Apr 6, 2021 · 6 comments
Closed

Undefined behavior with Vec::set_len #111

stoneeric opened this issue Apr 6, 2021 · 6 comments

Comments

@stoneeric
Copy link

Hello,

There are several instances within zstd-rs where Vec::set_len are used:

buffer.set_len(buffer_len);

buffer.set_len(capacity);

self.buffer.set_len(capacity);

result.set_len(max_size);

Documentation for set_len safety states:

Safety
new_len must be less than or equal to capacity().
The elements at old_len..new_len must be initialized.

The second condition is not met in the instances linked above.

It would also be great if there were more documentation about safety invariants - the samples in the documentation above include examples. That would help establish confidence in the set_len calls following the FFI calls.

@gyscos
Copy link
Owner

gyscos commented Apr 6, 2021

Hi, and thanks for the report!

The safety requirements listed in the documentation are necessary (and sufficient) to guarantee the safety of calling Vec::set_len in isolation. Indeed, I agree that if Vec::set_len were alone in the unsafe block, uninitialised memory would become observable and indeed cause unsoundness.

However in our case we do not end the unsafe block there. Instead, we write to this buffer without reading from it, and then, for the final call, sets the length to the initialized portion.

The question is then: is it safe to have &mut [u8] to uninitialized data if we only write to it? It turns out that it is, and it is the current logic behind Initializer::nop for some std::io::Read implementations that never read from the given buffers.

@anp
Copy link

anp commented Apr 7, 2021

IIUC, what actually happens in these examples is a bit more subtle:

  1. set_len(max_buffer_size)
  2. initialize some of the backing bytes
  3. set_len(actually_initialized_size)

My concern is that even if std uses this technique today, I think it's UB according to the reference, which says that it's UB to "Producing an invalid value, even in private fields and locals. "Producing" a value happens any time a value is assigned to or read from a place, passed to a function/primitive operation or returned from a function/primitive operation."

It goes on to list some ways a value can be invalid, including "An integer (i*/u*), floating point value (f*), or raw pointer obtained from uninitialized memory, or uninitialized memory in a str." and "A reference or Box that is dangling, unaligned, or points to an invalid value."

My reasoning here is that in order to perform (2) above we produce a &mut [u8], some of which refers to invalid (uninitialized) memory based on Vec's docs. References pointing to invalid values are themselves invalid, which (I think) means it's UB to produce this reference in the process of writing to its backing bytes for initialization.

I think it's possible that the unsafe code guidelines will eventually say that this is allowed behavior. I think that to conform to the current stable language rules these functions would need to operate on &[mut MaybeUninit<u8>] to be sound in light of other possible future language changes where this is not only illegal but exploited for optimizations.

While I feel pretty confident in my read of the reference here, I should note that miri does not currently complain at the pattern you describe @gyscos. In particular, see the commented out line 20 which will cause miri to barf if uncommented. I'm nearly positive that the current Rust aliasing rules allow the compiler to insert reads like that one without needing a programmer to write it in the source (because &mut T is always readable). miri might not error without an explicit read here because it doesn't understand Vec's semantics yet, or it may not understand &mut [u8]s validity invariants, or it may be because the UCG effort's decision process is ahead of the reference and they're about to make me wrong :P.

EDIT: I filed rust-lang/miri#1762 for clarification on miri not catching this.

EDIT2: Ah, miri doesn't attempt to catch this: "In particular, Miri does currently not check that integers/floats are initialized or that references point to valid data."

@gyscos
Copy link
Owner

gyscos commented Apr 7, 2021

Well, that's unfortunate.

I suppose if we want to keep the interface from zstd-safe actually safe to use, we can either:

  • Zero out the buffer at the beginning. Should be relatively low overhead for long-lived contexts, but might be more noticeable when creating lots of contexts. Also not strictly zero-cost, makes me sad. :(
  • Require unsafe in the caller (we already use unsafe in the call-site anyway) by making the zstd-safe API work on &mut [MaybeUninit<u8>]. Haven't come up with a clean way to do that yet without being terrible to use.
  • Add functions to zstd-safe (under the std feature) taking a &mut Vec, taking a raw pointer directly rather than going through slices, and setting the size there. Con: it duplicates most of the current zstd-safe API using OutBuffer, and is specific to Vec (so other hypothetical resizable containers wouldn't work). I suppose we could define a trait for "resizable containers around uninitialized bytes" and use that instead? Something like that. At least it'd still be usable in no-std environments, and would not require duplicating code.

@anp
Copy link

anp commented Apr 7, 2021

Yeah, these all seem like valid options to me. Your third point reminded me of the ReadBuf RFC, which apparently has a candidate implementation now.

@gyscos
Copy link
Owner

gyscos commented Apr 8, 2021

I started a container branch to try out this trait idea: #113

It removes most of the unsafe calls from the top-level crate (but on the other hand it now forces unsafe for the implementation, and therefore adds 2 minor unsafe calls to the - mostly unused - NoOp operation).

It does make the zstd-safe API a bit less readable with generics everywhere, but does help simplify a bit the top-level zstd crate.

Yeah, these all seem like valid options to me. Your third point reminded me of the ReadBuf RFC, which apparently has a candidate implementation now.

Indeed, though here we want to directly write to the user's container (like Vec), so using ReadBuf or another dedicated type wouldn't help directly. The underlying design is very similar however (only we have a simplified case here where we don't need to distinguish "filled" and "initialized", which is why Vec, with a single "cursor", still works for us).

@gyscos
Copy link
Owner

gyscos commented May 4, 2021

That branch has now been merged, the potential UB should no longer apply.

@gyscos gyscos closed this as completed May 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants