
_mm_movemask_epi8 unexpectedly triggers Undefined Behavior #2617

Closed

Nugine opened this issue Oct 25, 2022 · 42 comments

@Nugine

Nugine commented Oct 25, 2022

playground

use core::arch::x86_64::*;

#[target_feature(enable = "sse2")]
pub unsafe fn is_ascii_sse2(a: &[u8; 16]) -> bool {
    let a = _mm_loadu_si128(a.as_ptr().cast());
    let m = _mm_movemask_epi8(a);
    m == 0
}

fn main() {
    assert!(cfg!(target_feature = "sse2"));
    let s = b"helloworld123456";
    assert!(unsafe { is_ascii_sse2(s) });
}
error: Undefined Behavior: each element of a SIMD mask must be all-0-bits or all-1-bits
    --> /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/../../stdarch/crates/core_arch/src/x86/sse2.rs:1381:5
     |
1381 |     simd_bitmask::<_, u16>(a.as_i8x16()) as u32 as i32
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ each element of a SIMD mask must be all-0-bits or all-1-bits
     |
     = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
     = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
     = note: BACKTRACE:
     = note: inside `std::arch::x86_64::_mm_movemask_epi8` at /playground/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/../../stdarch/crates/core_arch/src/x86/sse2.rs:1381:5
note: inside `is_ascii_sse2` at src/main.rs:6:13


@RalfJung
Member

I am fairly sure I was told that simd_bitmask requires each element to be all-0 or all-1... so is that _mm_movemask_epi8 buggy? Cc @workingjubilee

@bjorn3
Member

bjorn3 commented Oct 25, 2022

I am fairly sure I was told that simd_bitmask requires each element to be all-0 or all-1... so is that _mm_movemask_epi8 buggy?

Didn't know that. If that is indeed the case please revert rust-lang/stdarch#1331.

@RalfJung
Member

RalfJung commented Oct 25, 2022

https://doc.rust-lang.org/nightly/std/simd/struct.Mask.html#method.from_int_unchecked indicates that this is a precondition of from_int_unchecked, which I took as a sign that the underlying intrinsic has that requirement.

Cc @calebzulawski

@calebzulawski
Member

It isn't documented anywhere (none of the SIMD intrinsics are, which isn't good), but yes, my understanding is that the various bitmask intrinsics expect lanes to be all-0 or all-1.

@Nugine
Author

Nugine commented Oct 26, 2022

_mm[256]_movemask_epi8 has defined behavior:

Returns a mask of the most significant bit of each element in a.

(regardless of the other bits)

Changing the requirement may introduce unsoundness into existing crates.
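
For reference, here is a minimal scalar sketch of that documented behavior (the function name is mine, for illustration): only the most significant bit of each byte contributes to the result, lane 0 in bit 0.

// Scalar model of the documented _mm_movemask_epi8 behavior: collect
// the most significant bit of each of the 16 bytes into the low 16
// bits of the result.
fn movemask_epi8_model(a: [u8; 16]) -> i32 {
    let mut m = 0i32;
    for (i, &b) in a.iter().enumerate() {
        m |= ((b >> 7) as i32) << i;
    }
    m
}

fn main() {
    // Only the MSB matters: 0x05 contributes 0, 0x85 contributes 1.
    let mut v = [0x05u8; 16];
    v[3] = 0x85;
    assert_eq!(movemask_epi8_model(v), 1 << 3);
}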

@calebzulawski
Member

To clarify, I mean the simd_* compiler intrinsics, not the vendor intrinsics. So I don't think it's appropriate to use it here.

@Nugine
Author

Nugine commented Oct 26, 2022

Closing as it is not an issue with Miri.
Moved to rust-lang/stdarch#1347

@Nugine closed this as not planned on Oct 26, 2022
@Amanieu
Member

Amanieu commented Oct 26, 2022

The documentation in rustc's codegen says that simd_bitmask only looks at the MSB:

https://github.com/rust-lang/rust/blob/bed4ad65bf7a1cef39e3d66b3670189581b3b073/compiler/rustc_codegen_llvm/src/intrinsic.rs#L1081-L1089

@Nugine reopened this on Oct 26, 2022
@RalfJung
Member

That's internal documentation, not necessarily a stable API-side promise.

https://github.com/rust-lang/portable-simd/blob/master/crates/core_simd/src/intrinsics.rs is the closest thing we have to API docs for these intrinsics. It's not terribly detailed, but it is clear on this question:

    // truncate integer vector to bitmask
    // `fn simd_bitmask(vector) -> unsigned integer` takes a vector of integers and
    // returns either an unsigned integer or array of `u8`.
    // Every element in the vector becomes a single bit in the returned bitmask.
    // If the vector has less than 8 lanes, a u8 is returned with zeroed trailing bits.
    // The bit order of the result depends on the byte endianness. LSB-first for little
    // endian and MSB-first for big endian.
    //
    // UB if called on a vector with values other than 0 and -1.
    #[allow(unused)]
    pub(crate) fn simd_bitmask<T, U>(x: T) -> U;
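
Read literally, that contract admits a simple scalar model (a sketch for 16 i8 lanes on a little-endian target; the panic stands in for the UB case):

// Scalar model of the documented simd_bitmask contract: lane i becomes
// bit i (little endian), and any lane other than 0 or -1 is UB.
fn simd_bitmask_model(x: [i8; 16]) -> u16 {
    let mut m = 0u16;
    for (i, &lane) in x.iter().enumerate() {
        match lane {
            0 => {}
            -1 => m |= 1 << i,
            _ => panic!("UB: mask lane must be all-0 or all-1 bits"),
        }
    }
    m
}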

@RalfJung
Member

I'm not sure who is in charge of these intrinsics, btw -- usually intrinsics are t-lang territory, but I don't think they are involved here. This is so far just @rust-lang/project-portable-simd experimenting. As long as that is the case, the intrinsics should probably not be used outside of https://github.com/rust-lang/portable-simd/. Everybody else should use the APIs exposed in core::simd instead.

@thomcc
Member

thomcc commented Oct 26, 2022

I don't think core::simd should have an explicit _mm_movemask_epi8-style operation (IOW, I don't think core::arch should be implemented in terms of core::simd), since that implies a lot about the underlying representation of the mask -- we want the Mask types to be usable on architectures that use bitmasks instead of full-width masks.

I think we'd say just to use core::arch for this... Using the simd_bitmask intrinsic is probably fine as far as portable-simd is concerned (although I can't speak for everybody), so long as it has tests for the behavior (that we'd trip if we changed something there).

@hsivonen
Member

hsivonen commented Oct 26, 2022

I think the precondition for _mm_movemask_epi8 is simply wrong. The intrinsic is for an architecture-specific operation, the architecture-specific operation is documented to only look at the high bit of each byte, and there are legitimate use cases for relying on the architecture-specific behavior. In particular, the architecture-specific documented behavior works for checking if every byte in the vector is an ASCII byte.

@thomcc
Member

thomcc commented Oct 26, 2022

Yes, definitely enforcing this on _mm_movemask_epi8 is wrong.

@RalfJung
Member

RalfJung commented Oct 26, 2022

Yeah I don't think anyone meant to say that _mm_movemask_epi8 has such a precondition.

The question is whether simd_bitmask has such a precondition, which then impacts whether _mm_movemask_epi8 can be implemented using simd_bitmask. And I would say:

  • Yes, it does, based on this.
  • And regardless, no code outside of portable-simd should be calling these intrinsics directly.

@Amanieu
Member

Amanieu commented Oct 26, 2022

I'm just saying that simd_bitmask was always defined to look at the MSB since it was introduced in rust-lang/rust#57269. It's fine if we want to redefine it to only accept proper bitmasks (it's unstable after all), but then we should also update codegen and all places that use it first.

@workingjubilee
Member

The reason it does that is because that's how _mm_movemask_epi8 is defined, but unless every single architecture implements its equivalent of that function in the exact same way, we don't want the generic intrinsics to be defined in terms of that. It's no secret that until very recently the SIMD implementations have been heavily Intel-slanted, for what at the time were very understandable reasons... but those reasons are becoming charmingly antiquated as we speak.

For note, I'm fine with saying the definition for simd_bitmask is "UB if you feed it anything else, but in practice architecture-defined", since that's really the obvious logical definition and encapsulates the obvious reason why it only looks at the MSB in the current implementation. I agree that we should not impose on _mm_movemask_epi8 further. I don't know what we expect of Miri here, however: do we really expect Miri to know how different architectures work? I don't really think so?

@saethlin
Member

Miri accepts a --target, so if Miri doesn't accurately emulate the behavior of the indicated target, that would be rather strange.

@RalfJung
Member

"UB if you feed it anything else, but in practice architecture-defined"

That sounds terrible, since it muddies the water around what UB is. We have two choices here:

  • We say it's UB to pass anything but 0 and -1, and optimizations are allowed to exploit that.
  • We say that the result of passing in anything else is target-specific. This means optimizations cannot deduce that the input is 0 or -1. Note that I think this would be the first bit of target-specific behavior in Rust, aside from endianness and pointer size, which are much more general concepts. We currently don't have good infrastructure for handling target-specific behavior, such as documenting exhaustively what the behavior actually is for each target (which we would have to do in this case).

@thomcc
Member

thomcc commented Oct 27, 2022

It's an internal function never intended for stabilization, so we have some more options I think, including defining it to be the MSB for now and changing that in the future if needed. Both miri and stdarch are tied to releases of the compiler, so they can rely on internal details.

I'm not necessarily saying this is the approach we should take (it has maintenance downsides), but it's definitely an option.

(I also think calling this the only piece of target-specific behavior in Rust is fairly dubious, and requires ignoring the behavior of many builtin attributes, floating point numbers, much of std, nearly everything in core::arch, and so on. I can see how you could argue that many of these aren't parts of the language and are just related to it, but I don't think it would be very convincing in all cases. That said, this debate would be off topic).

@Nugine
Author

Nugine commented Oct 27, 2022

For now:

  1. simd_bitmask keeps its UB precondition.
  2. _mm_movemask_epi8 uses the generic compiler intrinsics so that different codegen backends can support it.
  3. Miri reports the UB as specified.

Each of the three parts works well on its own, but they cannot be put together. What should we do to solve this?

BTW, there is another potential conflict: _mm_shuffle_epi8, aarch64's vqtbl1q_u8, and wasm's u8x16_swizzle. They can all be used to implement "swizzle" or "table lookup", but _mm_shuffle_epi8 behaves differently from the others.
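
For context, the divergence is in how out-of-range indices are handled; here is a scalar sketch of the documented per-lane semantics (function names are mine, for illustration):

// x86 _mm_shuffle_epi8 (pshufb): if the high bit of the index byte is
// set, the lane becomes 0; otherwise only the low 4 bits are used.
fn pshufb_lane(table: &[u8; 16], idx: u8) -> u8 {
    if idx & 0x80 != 0 { 0 } else { table[(idx & 0x0f) as usize] }
}

// aarch64 vqtbl1q_u8 and wasm u8x16_swizzle: any index >= 16 yields 0.
fn tbl_lane(table: &[u8; 16], idx: u8) -> u8 {
    if idx < 16 { table[idx as usize] } else { 0 }
}

For example, index 0x12 selects table[2] under pshufb (high bit clear, low nibble 2) but yields 0 under the other two.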

@thomcc
Member

thomcc commented Oct 27, 2022

BTW there is another potential conflict: _mm_shuffle_epi8, aarch64's vqtbl1q_u8 and wasm's u8x16_swizzle. They can be used to implement "swizzle" or "table lookup". But _mm_shuffle_epi8 has different behavior with the others.

Do we implement these using a common intrinsic? I believe they all have different underlying behavior in some cases.

@Nugine
Author

Nugine commented Oct 27, 2022

BTW there is another potential conflict: _mm_shuffle_epi8, aarch64's vqtbl1q_u8 and wasm's u8x16_swizzle. They can be used to implement "swizzle" or "table lookup". But _mm_shuffle_epi8 has different behavior with the others.

Do we implement these using a common intrinsic? I believe they all have different underlying behavior in some cases.

No. They are implemented with dedicated LLVM intrinsics now.

@thomcc
Member

thomcc commented Oct 27, 2022

Then there's no conflict?

@Nugine
Author

Nugine commented Oct 27, 2022

Then there's no conflict?

Yes. I mean that a conflict like the one between simd_bitmask and _mm_movemask_epi8 may not be a special case.

@thomcc
Member

thomcc commented Oct 27, 2022

The simd_foo intrinsics are definitely not intended to behave in all cases like the underlying platform intrinsics. There are plenty of situations where doing so would be undesirable or impossible, and has caused problems in the past.

@RalfJung
Member

RalfJung commented Oct 27, 2022

I also think calling this the only piece of target-specific behavior in Rust is fairly dubious

Fair -- but it is the only piece of target-specific behavior that a MIR-level Rust interpreter or verifier or model like MiniRust has to worry about, so far. Miri is literally entirely target-agnostic except for endianness and pointer size. (The differences in type layouts are factored into a separate part of rustc, so Miri just sees the results of that computation -- this is hiding some other target-specific bits like the alignment of u64, but arguably those are just part of the generally unspecified data layout.)

It's an internal function never intended for stabilization, so we have some more options I think, including defining it to be the MSB for now and changing that in the future if needed. Both miri and stdarch are tied to releases of the compiler, so they can rely on internal details.

So is that the behavior of LLVM on all targets right now? Also what about other 'boolean' SIMD intrinsics? IMO they should all behave the same.

  • simd_reduce_{any,all}
  • simd_select
  • simd_gather
  • simd_scatter
  • simd_bitmask

Do they all look at the MSB only on all targets?

@thomcc
Member

thomcc commented Oct 27, 2022

IMO they should all behave the same.

Most/all of the others would suffer from significantly pessimized codegen on some/all platforms by defining them this way, so I think that's a non-starter.

@thomcc
Member

thomcc commented Oct 27, 2022

I think my stance is that ideally we wouldn't be implementing stdarch functions in terms of core::intrinsics::simd_.*, although I know that we already do in several places, and I know that pushes maintenance burden onto @bjorn3's cranelift backend (and possibly more burden than is reasonable).

I can't speak for anybody else, but as a member of wg-portable-simd (not one that implemented these functions, though), it wouldn't really bother me that much to take an approach of carving out a set that's useful for both in a piecemeal fashion, e.g. possibly starting with what I described here.

What would bother me is defining the simd_.* intrinsics so that they are forced to be less efficient for the sake of consistency with other simd_.* intrinsics, since that not only negatively impacts core::simd (discouraging its use), it also defeats some of the purpose of these being internal impl details (which are allowed to be messier than what is exposed as public API).

@RalfJung
Member

RalfJung commented Oct 27, 2022

Most/all of the others would suffer from significantly pessimized codegen on some/all platforms by defining them this way, so I think that's a non-starter.

Ah I see. I think it'd be odd if different SIMD intrinsics had different rules for what the representation of a "SIMD boolean stored as an int vector" is. To me as Miri maintainer and language specifier, these intrinsics are the interface and I don't like seeing them become messy.

We could, however, have a simd_msb2bool (or whatever) that has the effect of turning an int vector into a bool vector (i.e. a vector where all elements are 0 or -1) where all bits are set to the MSB. That intrinsic combined with simd_bitmask would be able to express the desired intel intrinsic semantics. (We could have another simd_int2bool where the bool vector becomes -1 for any non-0 elements, if that's useful in other situations.) Not sure if LLVM has a good way to express that though -- ideally simd_msb2bool followed by simd_bitmask would compile to _mm_movemask_epi8 on intel.
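
As a scalar sketch of the per-lane semantics such a hypothetical simd_msb2bool could have (the intrinsic is a proposal in this comment, not an existing API):

// Hypothetical per-lane model of simd_msb2bool: arithmetic shift right
// by (width - 1) broadcasts the most significant bit, so the lane
// becomes 0 or -1 -- a valid full-width bool for simd_bitmask.
fn msb2bool_lane(x: i8) -> i8 {
    x >> 7
}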

I think my stance is that ideally we wouldn't be implementing stdarch functions in terms of core::intrinsics::simd_.*,

Yeah I think so too.

pushes maintenance burden onto @bjorn3's cranelift backend

Ah, fair... I didn't realize that these changes were done to have cranelift (and Miri!) support more of stdarch.

@bjorn3
Member

bjorn3 commented Oct 27, 2022

Would it make sense to split simd_bitmask into one intrinsic which only looks at the MSB of every lane and one which makes it UB to have lanes that are not all zero or all one? If not, as I already said, I'm fine with reverting this specific stdarch change.

For most simd_* cases in stdarch there is only one possible behavior of the simd_* intrinsic that makes sense. For example, for simd_add anything other than a regular two's-complement lanewise addition for integer vector types should in all likelihood use a different name like simd_add_unchecked. For those cases IMO stdarch should keep using them.

I have tried to use simd_* intrinsics for float ops in stdarch, but their behavior is more ambiguous, and in fact tests failed around NaN. As such I think the float operations in stdarch should keep using LLVM intrinsics. They aren't all that important for cg_clif anyway, as integer and bitwise operations are much more common.

@Amanieu
Member

Amanieu commented Oct 27, 2022

Would it make sense to split simd_bitmask into one intrinsic which only looks at the MSB of every lane and one which makes it UB to have lanes that are not all zero or all one?

I think that's a good solution for now.

@RalfJung
Member

RalfJung commented Oct 27, 2022

If having a "convert MSB-bool-vector to full-width-bool-vector" intrinsic does not give the desired result then that seems like the 2nd best thing, yeah.

@Amanieu
Member

Amanieu commented Oct 27, 2022

If having a "convert MSB-bool-vector to full-width-bool-vector" intrinsic does not give the desired result then that seems like the 2nd best thing, yeah.

Arithmetic shift right by N-1, where N is the element width. But I'm not sure how well LLVM can optimize through that.
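
In portable-simd terms, the normalize-then-bitmask pattern might look like this sketch (nightly APIs of the same era as the code later in this thread; whether LLVM folds it back into a single movemask is exactly the open question):

#![feature(portable_simd)]
use core::simd::{Mask, Simd, ToBitMask};

// Sketch: arithmetic shift right by 7 broadcasts each i8 lane's MSB,
// so every lane becomes 0 or -1 and the mask conversion is valid.
pub fn msb_bitmask(x: Simd<i8, 16>) -> u16 {
    let bools = x >> Simd::splat(7);
    Mask::from_int(bools).to_bitmask()
}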

@thomcc
Member

thomcc commented Oct 27, 2022

Yeah, I'm not sure LLVM can see through that. My experience is that LLVM often gets confused with bool vector operations, frustratingly.

Note that this is often a very hot operation in inner loops, and people tend to reach for SIMD in code that is performance critical, so I think having two intrinsics would be the better option. Having LLVM codegen two instructions here when it should emit one would be highly undesirable.

To me as Miri maintainer and language specifier, these intrinsics are the interface

That's kind of unfortunate. The nature of SIMD is that it's inherently extremely performance sensitive, and fairly platform specific. There is some amount of consistency between the actual operations exposed by different SIMD ISAs, but it's somewhat expected that those might not have the internal consistency that you desire.

I also think that coming up with an efficient and consistent user-facing API surface for this is hard enough (the semantics of boolean vectors and masks in particular took us a very long time to work out, with a lot of heated debate) that I would really want to push back on adding similar consistency requirements to private internal APIs.

@calebzulawski
Member

As an example, we're already facing a performance issue with masks in rust-lang/portable-simd#312, which is likely going to require a codegen change to one of the simd_* intrinsics. It would be unfortunate if we didn't have the freedom to make aggressive changes because of stricter internal behavior than required by std::simd's API. I'd definitely favor having separate intrinsics with different invariants.

@RalfJung
Member

RalfJung commented Oct 27, 2022

private internals APIs

I think these are not quite as private as you might like -- any tool that wants to analyze or verify Rust code for correctness (e.g. Miri, but also all the static analysis and verification tools being built) needs to have proper models of these functions. So they are about as private as the rest of MIR, I would say. (Intrinsics are part of the MIR syntax for all intents and purposes.) We are certainly not making stability promises, but we should keep in mind that not all consumers of this API are in-tree.

For SIMD specifically, I don't know for sure that there is an out-of-tree consumer, but I know for sure that there are people that would like to run their static analysis and verification tools on SIMD code and they asked about how to best handle that. Given that modeling all of stdarch is not feasible, the portable-simd intrinsics are a great way to increase the amount of code these tools can handle. That's why Miri supports them.

Anyway, two intrinsics seem like a reasonable compromise to me for this particular case.

@workingjubilee
Member

What we want ultimately is for there to be specified-in-Rust models of how these intrinsics can be executed using scalar primitives, and for this case, I think we should easily have the option to wrap more basal intrinsics with higher level abstractions. In this case, the x86 movemask definitely isn't appropriate for all platforms but can be expressed as a thin wrapper around simd_bitmask. Ideally we would have ways of indicating "just... use this asm, please" in a more codegen neutral way (as the only reason to use anything but the asm instruction is because it was optimized out).

@Nugine
Author

Nugine commented Oct 28, 2022

It looks like the pattern simd_lt then simd_bitmask can be correctly optimized to one instruction. I wouldn't be surprised if _mm_movemask_epi8 were implemented this way. But I'm not sure whether it would affect other optimizations.

https://godbolt.org/z/7KKrcojY9

#![feature(portable_simd)]

extern crate core;

use core::arch::x86_64::*;
use core::simd::*;

#[inline(always)]
fn i8xn_bitmask<const N: usize>(x: Simd<i8, N>) -> <Mask<i8, N> as ToBitMask>::BitMask
where
    LaneCount<N>: SupportedLaneCount,
    Mask<i8, N>: ToBitMask,
{
    // A lane is negative exactly when its MSB is set, so this compare
    // produces a valid all-0/all-1 mask before taking the bitmask.
    x.simd_lt(Simd::splat(0)).to_bitmask()
}

#[target_feature(enable = "sse2")]
pub unsafe fn is_ascii_sse2(a: &[u8; 16]) -> bool {
    let a = _mm_loadu_si128(a.as_ptr().cast());
    let m = i8xn_bitmask(a.into());
    m == 0
}

#[target_feature(enable = "avx2")]
pub unsafe fn is_ascii_avx2(a: &[u8; 32]) -> bool {
    let a = _mm256_loadu_si256(a.as_ptr().cast());
    let m = i8xn_bitmask(a.into());
    m == 0
}

@workingjubilee
Member

Hm, if memory serves, as it's part of std, the result will be precompiled anyway, won't it?

@thomcc
Member

thomcc commented Oct 28, 2022

I'd like us to have codegen tests that ensure this is optimized appropriately if we're going to go this route. I wouldn't feel good about shipping a perf regression to all _mm_movemask_epi8 users, since there are a lot (that's quite an important function).

@Nugine
Author

Nugine commented Oct 31, 2022

Closing as Miri does not need to change for this.
Moved to rust-lang/stdarch#1347
