Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interaction of isize / usize with value layout #102

Closed
gnzlbg opened this issue Mar 15, 2019 · 19 comments
Closed

Interaction of isize / usize with value layout #102

gnzlbg opened this issue Mar 15, 2019 · 19 comments
Labels
A-layout Topic: Related to data structure layout (`#[repr]`) C-open-question Category: An open question that we should revisit own-rfc-worthy This discussion should likely turn into an RFC of its own

Comments

@gnzlbg
Copy link
Contributor

gnzlbg commented Mar 15, 2019

EDIT: I'm recycling this issue to track how isize and usize's layout interact with value layout in general. The old comment can be read below.

The reference of isize/usize currently documents the limits of the current implementation in a non-normative way:

  • the maximum value size is isize::max_value(),
  • the maximum number of elements in an array is usize::max_value(),
  • the largest pointer offset is usize::max_value().

These limits are unspecified (not implementation-defined), that is, we don't guarantee anything about these and the Rust implementation does not have to document them at all.

Changing these to either implementation defined, or to some guarantee that all implementations must satisfy probably would require an RFC on its own that answers the questions:

  • What's the size of the largest Rust value?
  • What's the length of the largest Rust array / slice ? Do we differentiatiate between array / slices to ZSTs and non-ZSTs ?
  • What's the largest pointer offset ?

These limits could be implementation-defined, unspecified, or set to some value that all implementations must uphold (isize::MAX, usize::MAX, etc.). And the trade-offs involved here are, among others, the kinds of code-generation backends Rust can easily support (e.g. LLVM and GCC limit some of these to isize::MAX, although these might be worked around), the kinds of optimizations these backends can do, and language complexity (e.g. how easy it is for users to write portable code, etc.).


old comment:

Layout isize/usize

The current reference says that the layout of usize determines the maximum size of Rust objects (size_of and size_of_val return usize), the maximum number of elements in an array ([T; N: usize]), and how much a pointer of a certain type can be offseted. It also says that the maximum size of any single value must fit within usize to ensure that pointer diff is representable.

There are two issues with this:

  • unclear what's meant by "offseted". Pointer add operates on usize, but pointer offset operates on isize. If by "offseted" we mean ptr.add we should probably be more explicit about this.

  • contradiction: we state that the layout of usize determines the maximum size of an object, and then argue that this is to ensure that pointer diff is representable, but AFAICT if the maximum size of an object is larger than isize then pointer diff will overflow and is not representable. IIRC @nikomatsakis argued that currently all Rust objects can be at most isize::max_value() large because of this, but this is not what the text says.

@gnzlbg gnzlbg added the A-layout Topic: Related to data structure layout (`#[repr]`) label Mar 15, 2019
@RalfJung
Copy link
Member

RalfJung commented Mar 15, 2019

Yeah, that statement in the reference is incorrect. The maximal size of an object, even of an allocation, in LLVM is isize::MAX. Cc @rkruppe

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 15, 2019

Is isize::MAX is also the maximum number of elements that an array can have? AFAICT not necessarily (e.g. an array of ZSTs is able to have usize::MAX elements without issues).

I wonder whether these are things that we should just "document" vs "guarantee".

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 15, 2019

So I chatted with @eddyb and they recommended also pinging @sunfishcode to ask whether Cranelift has these limitations as well.

@eddyb recommended not guaranteeing this, the maximum object size should be implementation-defined, and we identified the need for an intrinsic to compute the offset of a field within a struct, similar to C's offsetof intrinsic, which returns the offset as an usize (this would allow the maximum object size to be usize::MAX). ptr.offset_from is intended to be bidirectional (which is why returns isize) and is not intended to be used to compute field offsets.

EDIT: it seems that the need for offset_of is not new: rust-lang/rfcs#2421 , https://internals.rust-lang.org/t/discussion-on-offset-of/7440, https://internals.rust-lang.org/t/pre-rfc-add-a-new-offset-of-macro-to-core-mem/9273, https://internals.rust-lang.org/t/pre-pre-rfc-field-offsets/7845, etc. A lack of offset_of should not motivate guaranteeing all objects to be limited by isize::MAX.

@hanna-kruppe
Copy link

In my book offset_from and related concerns are not why allocations can't be larger than isize::MAX. Instead, the first-order reason is that we can't use inbounds GEP instructions in the translation to LLVM: the inbounds requires that the address calculation does not wrap in signed arithmetic. There are multiple reasons for why LLVM wants that and I'm not sure if I can do them justice, but in any case it's useful for enabling optimizations at IR level and at codegen level, so it seems like something that's generally a reasonable restriction for optimizing compilers, not a LLVM-ism. Indeed GCC doesn't support object sizes larger than isize::MAX either.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 15, 2019

I'm not very familiar with GEP, but can't we just pass it wider signed integer types as offsets? E.g. if the base pointer is an unsigned i64, instead of using signed i64 offsets can't we use signed i128 offsets?

@RalfJung
Copy link
Member

Is isize::MAX is also the maximum number of elements that an array can have? AFAICT not necessarily (e.g. an array of ZSTs is able to have usize::MAX elements without issues).

Agreed, arrays of ZSTs can have usize::MAX elements.

I'm not very familiar with GEP, but can't we just pass it wider signed integer types as offsets? E.g. if the base pointer is an unsigned i64, instead of using signed i64 offsets can't we use signed i128 offsets?

No, LLVM fixes the offset to be isize, basically. In hindsight this looks like a design mistake to me, but well, here we are.

gnzlbg added a commit to gnzlbg/unsafe-code-guidelines that referenced this issue Mar 16, 2019
gnzlbg added a commit to gnzlbg/unsafe-code-guidelines that referenced this issue Mar 16, 2019
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 16, 2019

So #98 should now fix this issue. It document these things as implementation-defined, mentioning that the upper limit on object size, array length, and pointer offset is usize::max_value(). It then documents these implementation-defined limits for the current implementation (isize::max_value() for object size, and usize::max_value() for the others).

Instead of making array length and pointer offset implementation defined, we could guarantee that usize::max_value is always ok - but we can always add that guarantee later AFAICT. cc @eddyb

@hanna-kruppe
Copy link

I believe there's two distinct open questions here:

  1. Do we want to lift this restriction at all in some implementations? It's far from obvious to me that we do, it enables some optimizations and AFAIK has never actually been an issue for any programs except deliberately pathological toy ones. Even if e.g. Cranelift doesn't actually do anything that benefits from limiting allocation sizes, it may not be worth the conceptual complexity of carrying around another implementation-defined parameter.
  2. If we wanted to, how could do that even for the LLVM backend? Larger integer types don't work off the bat as @RalfJung said, but even if they did I think it would be a bad idea: using larger-than-native integers invites tons of potential for code quality problems. What would work today is simply not emitting inbounds. We'd lose the other aspect of inbounds, that the resulting pointer is poison if it would be out-of-bounds, but it's unclear how useful that is (even ordinary GEP without inbounds has provenance rules that give a fair amount of aliasing information). But in any case we could also try to salvage it by changing LLVM to separate the two aspects, e.g. splitting off the signed overflow part into an nsw flag.

@gnzlbg gnzlbg changed the title Layout of isize / usize Interaction of isize / usize with value layout Mar 16, 2019
@gnzlbg gnzlbg added the own-rfc-worthy This discussion should likely turn into an RFC of its own label Mar 16, 2019
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 16, 2019

So I've just documented in #98 what the implementation currently does, but the behavior remains unspecified.

Guaranteeing anything about the maximum size of Rust objects is something that's probably worth doing on its own RFC (see the top comment for a brief summary). It would be nice to make progress towards that, and we can keep discussing the trade-offs here, but I feel that it is not worth blocking an MVP of the UCGs RFC on this issue so I'm tagging this as such.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 25, 2019

@rkruppe

Do we want to lift this restriction at all in some implementations? It's far from obvious to me that we do, it enables some optimizations and AFAIK has never actually been an issue for any programs except deliberately pathological toy ones. Even if e.g. Cranelift doesn't actually do anything that benefits from limiting allocation sizes, it may not be worth the conceptual complexity of carrying around another implementation-defined parameter.

I agree that the pathological toy programs, e.g., that create arrays with usize::MAX elements just to see the compiler fail, are worth fixing. There is unsafe generic code in the standard collections trying to prevent UB even when users do so, which I think is more interesting.

There, whether the limit can actually achieved in practice on a particular platform (e.g. due to LLVM, lack of memory / stack space / etc.), or whether the limit is usize::MAX or isize::MAX, might not be as relevant for unsafe code writers as the fact that there is a limit that they want their unsafe code to be robust against (e.g. Vec and the other collections contain code to handle these limits).


@joshtriplett @briansmith Another constraint when choosing Rust largest object size is being able to map to all C types in C FFI.

With usize = uintptr_t, and if we fix the largest allowed layout to usize::MAX, then because uintptr_t can be larger than C's size_t, then repr(Rust) types can be larger than repr(C) types. AFAICT this is ok, we'll just have to enforce an extra layout constraint for repr(C) types (e.g. that their size is at most size_t::MAX). CHERI would be such a platform (size_t = u64::MAX and uintptr_t = u128::MAX).

OTOH if we fix the maximum allowed value size to isize::MAX, then there is a subset of C types (those with a larger size) that we can't interface with via C FFI. This might be a trade-off worth making, e.g., if choosing isize::MAX simplifies Rust unsafe code, or if the usize::MAX strategy is not implementable.

@RalfJung
Copy link
Member

OTOH if we fix the maximum allowed value size to isize::MAX, then there is a subset of C types (those with a larger size) that we can't interface with via C FFI.

Notice that at least with LLVM, it is impossible in C as well to have values/typs larger than isize::MAX. Creating such objects in assembly and handing them to LLVM-generated code is effectively UB.

This doesn't change the lawyering much, but it means in practice this is not just a Rust limitation.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 26, 2019

Lawyering-wise, we could guarantee that Rust does not allow objects larger than X, which is not the same as guaranteeing that all Rust implementations must allow objects of size X. I think we might have been conflating the two.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Mar 26, 2019

For example, we could guarantee that objects larger than usize::MAX (or isize::MAX) are illegal (compilation error required), and just noting that the largest object size supported by an implementation must be smaller than that value, but not necessarily equal. E.g. isize::MAX or 42 would both be ok, but if you exceed that, you get a compilation error.

Guaranteeing how big the largest supported object is on all implementations would be a different guarantee that sounds unnecessarily constraining to me, so I doubt that's worth doing. Being able to query this information is, however, useful, so it might just be better to provide a core::mem::LARGEST_VALUE_SIZE: usize = isize::MAX as usize; or something and guaranteeing that this is the correct value for that.

@gnzlbg gnzlbg closed this as completed in 228258b Apr 11, 2019
@comex
Copy link

comex commented Apr 12, 2019

Wait, so we're saying that this code would now not be considered portable?

unsafe fn get_elem(slice: &[u8], idx: usize) -> *const u8 {
    return slice.as_ptr().offset(idx as isize);
}

But the equivalent with add would?

unsafe fn get_elem(slice: &[u8], idx: usize) -> *const u8 {
    return slice.as_ptr().add(idx);
}

This is somewhat concerning, since

  • add was introduced much later than offset, so there's presumably still some old code that uses offset. (There is a clippy lint suggesting converting offset to add, but not everyone uses clippy.)
  • add is currently documented as being "convenience for .offset(count as isize)".

At minimum the documentation needs to be changed: the documentation for add should remove the "convenience" remark, and the documentation for offset should warn against using it. Perhaps offset should also be marked as deprecated, to encourage the aforementioned old code to be updated. It is still useful if you have a negative index, but we could make a new function as in the before_exec/pre_exec switcheroo, and in any case, code that uses negative offsets has a decent chance of being incorrect, as it may have been assuming you can always form an isize offset between two array indices.

@gnzlbg gnzlbg reopened this Apr 12, 2019
@gnzlbg
Copy link
Contributor Author

gnzlbg commented Apr 12, 2019

Wait, so we're saying that this code would now not be considered portable? [...] But the equivalent with add would?

Why do you think that? I don't think so, the docs of offset and add both say:

The computed offset, in bytes, cannot overflow an isize.

This isize::MAX limit is a limit of the current LLVM implementation. We guarantee that this code works correctly, so if anything, this means that Rust can't support backends where the maximum allowed object size is smaller than isize::MAX without breaking all unsafe code that relies on this.

This also probably means that guaranteeing that usize and isize are pointer-sized prevents us from supporting CHERI, since CHERI has 128-bit wide pointers, but we can't use 128-bit integers with getelementptr, so we can't allow offset/add to go up to isize::MAX. C does not have this problem because uintptr_t/intptr_t != size_t/ptrdiff_t.

@comex
Copy link

comex commented Apr 12, 2019

Why do you think that? I don't think so, the docs of offset and add both say:

I'm just saying that based on:

  • the maximum number of elements in an array is implementation-defined, but
    can at most be usize::max_value() since [T; N: usize],
  • the maximum value by which a pointer can be offseted is
    implementation-defined, but can at most be usize::max_value() since
    ptr.add(count: usize).

which implies that a Rust implementation could support slices larger than isize::max_value(), therefore any (non-standard-library) code that assumes they can't be is relying on implementation-defined behavior and is thus non-portable.

Edit: And my claim about there being a difference between add and offset is because the above quote specifically calls out add as potentially (in some implementation) accepting up to usize::max_value(). I'm assuming that in such an implementation, if a valid offset is positive but greater than isize::max_value(), it would not be legal to just cast it to isize and pass the resulting negative number to offset, since that would be considered wrapping around.

@gnzlbg
Copy link
Contributor Author

gnzlbg commented Apr 12, 2019

@comex I think i see your point now (@rkruppe helped a lot). The document says "implementation-defined" in those two clauses, but IMO we should change that to "unspecified".

Specifying those to be isize::MAX, usize::MAX, or "implementation-defined" would probably require an RFC, and @comex example shows some consequences of how the choice affects some of the pointer methods.

add is currently documented as being "convenience for .offset(count as isize)".

"unspecified" would mean that we don't know whether using .offset is worse than using .add here yet. If we fix the maximum object size to isize::MAX, then using .offset is ok, and saying that .add is a convenience for .offset(value as isize) is also ok, but it hasn't been decided whether that will be the case.

@JakobDegen
Copy link
Contributor

Closing. It's not clear what parts of this remain unanswered, folks are free to open new issues with precise questions

@RalfJung
Copy link
Member

The part about max. object size is answered these days, isize::MAX being the limit is documented (e.g. in slice::from_raw_parts).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-layout Topic: Related to data structure layout (`#[repr]`) C-open-question Category: An open question that we should revisit own-rfc-worthy This discussion should likely turn into an RFC of its own
Projects
None yet
Development

No branches or pull requests

5 participants