Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strings RFC #1848

Closed
wants to merge 1 commit into from
Closed

Strings RFC #1848

wants to merge 1 commit into from

Conversation

rylev
Copy link
Contributor

@rylev rylev commented Jun 24, 2022

This is an RFC for how we'll handle the many types of Strings in Rust and Windows and how they interact with each other.

This is a complicated question, so I imagine we'll probably not get it right, but I do believe this is an improvement over the status quo not only in terms of useability but also safety.

fn as_wide(&self) -> &[u16] {}
// String data with the trailing 0
fn as_wide_with_nul(&self) -> &[u16] {}
fn to_string(&self) -> Result<String, FromUtf16Error> {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This function collides with ToString::to_string, since HSTRING implements Display below.

Perhaps this could be renamed try_to_string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. That's a shame. We could use into_string which mirrors CString::into_string directly. I originally didn't do this as it's not exactly trivial to reuse the String's buffer for the HSTRING buffer and thus I didn't want to take self.


/// The wide equivalent to std::ffi::CString;
#[repr(transparent)]
struct CWString(..);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a matching borrowed CWStr, like CString and CStr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps... I'm not sure how often this will be useful in practice though it might be. PCWSTR would just be like a raw pointer so ideally the user would only rarely use it, but I'm unsure how often you'd want a &CWStr and not a &CWString

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rylev
One reason would be so there isn't a double-indirection.
It can also be used to convert a PCWSTR into a safely-used reference without having to copy or own the data. (See for example CStr::from_ptr)


These function build null terminated string data of the appropriate width (`u8` in the case of `PCSTR` and `u16` in the other cases).

**question**: for all these types built from static strings, could we special case them so that they don't use reference counting at all?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just special case for a string literal, then use that with concat!($str, "\0"), etc. For wide strings, you'd probably need something like const_heap (nightly api very far from stable) or compiler support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refers to whether we need a special version of HSTRING and the like that doesn't track reference counts since its backing memory is statically allocated. Actually statically allocating enough memory for a wide string is already a solved problem (e.g., see here).

Copy link
Contributor

@mbartlett21 mbartlett21 Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I didn't realise that the const on stable had got that far. EDIT: And I forgot that you could use it to get a length for a constant array...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So essentially, you'd get a wide string reference (PCWSTR), then pass that to WindowsCreateStringReference (in windows::Win32::System::WinRT), then use the resultant HSTRING structure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using WindowsCreateStringReference is generally a pessimization unless the calling language natively supports null terminated wide string literals, like C# and C++. For languages like Rust, that's not worth it as the resultant HSTRING typically ends up being copied almost immediately.


It's clear that "IN" params are the scenario where ownership differs from the status quo. Therefore, the `windows` crate will treat the types as the following.

* Introduce a new type `CWString` which is the wide string equivalent to `CString`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review more carefully when I get a moment but my most obvious question is what CWString gives you that HSTRING doesn't already provide. Obviously, there are already a dizzying number of string types and I'd love to not make that dizzier. 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? The difference is that HSTRING has the added complexity of being reference counted. CWString would have the added benefit of being symmetrical with the Rust std library CString type, but I agree that we want to try to limit the explosion of types if we can...


#### Windows

Windows has the following string types:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to look at this is that Windows has a variety of string types that need to be supported for interop with various APIs but that languages can and should only use them on the ABI boundaries as much as is practical. The extreme example of this is C# where only System.String exists in C# and all the different string types are marshaled away behind the scenes. While I'm not suggesting we go quite that far, we should try to think of these as interop types needed only for the ABI boundary and put our energy into making it as efficient and seamless as possible to go from and get back to native and idiomatic Rust string types, and not try to make Windows replacements with all the bells and whistles of Rust's string types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PCSTR, we can just use *const c_char directly, and CStr provides from_ptr and as_ptr methods to convert to and from it. For PSTR, we would need CString and CStr to allow mutability.

For PWSTR and PCWSTR, it would need roughly the same, except with wide characters. PCWSTR (I think) should be able to be created from a BSTR or HSTRING.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the ideal situation.

However, the unfortunate situation with Rust is, that Rust does not provide a zero-overhead string type that's compatible with Windows' native string encoding. Vec<u16>/&[u16] is the best Rust has to offer in this respect. I don't see any way "to go from and get back to native and idiomatic Rust string types" at all.

Being pragmatic here, and assuming that unmatched surrogate pairs won't appear in practice is a dangerous route to go down. Doing so will open up code based on the windows crate to attacks, even if only DoS attacks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed that's the ideal hence the "as much as is practical". So we need a null terminated wide char but we need such things as little as humanly possible with the emphasis on interop rather than string features. And since we need HSTRING anyway and HSTRING is a null terminated wide char string that should cover those needs. Even if we end up calling it CWString, for symmetry, that just happens to be implemented as an HSTRING we can at least minimize the number of implementations we're carrying around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I am a bit skeptical that we'll actually save much code by using an HSTRING as a backing implementation detail, but if this is truly just an implementation detail then it doesn't really matter. We can keep playing with the implementation without impacting users.


* `PSTR`: a mutable, nul-terminated string (i.e., composed of "characters" a.k.a. `u8`s) that is often but not always ANSI encoded.
* `PCSTR`: an immutable version of `PSTR`.
* **QUESTION**: what are the practical differences between this and `PSTR`?
Copy link
Contributor

@mbartlett21 mbartlett21 Jun 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that PCSTR can be created from a CStr reference, and can also be converted back.
EDIT: That is because CStr doesn't allow mutability.

@ChrisDenton
Copy link
Collaborator

I would also add that there are some more rough edges with functions that take even more variations on strings.

There's UNICODE_STRING which is roughly equivalent to the Rust types &[u16], &mut [u16] or Vec<u16> depending on the function being used and the value of MaximumLength. This can support interior nulls.

Also there are functions such as CompareStringOrdinal that can take a null terminated LPCWCH string but also optionally take a length so the strings need not be null terminated (and may therefore include nulls).

And some functions such as SetFileInformationByHandle that take structures which require both a null terminated string and the length of the string.

While these should not be too difficult to support, I think it's worth noting there are a number of variations on the basic ptr + length string.

@kennykerr
Copy link
Collaborator

Right, which is why I'm reluctant to add some new synthetic string type like CWString when we already have so many to contend with. If anything, something like CWString belongs in the standard library alongside CString while the windows crate should focus on those that are specific to Windows, like HSTRING and BSTR, as it does.

To @ChrisDenton's point, it is also not the goal of the window crate to smooth over every last detail of string handling in the Windows API. There will always be APIs that are harder to call directly and possibly deserve some specialized crate to provide a more idiomatic experience if the API is that important. Many are not. 😏

```rust
let x: CWString = w!("hello");
let y: CString = c!("hello");
let z: HSTRING = h!("hello");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need these macros to produce allocated strings. That requires runtime support. They should all be able to return const string literals of PCWSTR and PCSTR respectively. Then they can simply be used to call functions like MessageBoxA/W directly and efficiently.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common Windows string types like HSTRING and BSTR can then just provide From<&T> implementations for PCWSTR for cases where a string needs to be computed at runtime and in future if/when CString and CWString are available in the std library we can just provide From implementations for those as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we shouldn't need the Into<T> below, unless I'm forgetting something.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is compatible with the existing state of the windows crate so would make it easier to land the Borrowed PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal is a subset of what I had in mind so I think it's fine to try it out. We can always expand from there to have dedicated CWString types and such.

* `PSTR`: a mutable, nul-terminated string (i.e., composed of "characters" a.k.a. `u8`s) that is often but not always ANSI encoded.
* `PCSTR`: an immutable version of `PSTR`.
* **QUESTION**: what are the practical differences between this and `PSTR`?
* `PWSTR`: a mutable, nul-terminated "wide" string (i.e., composed of "wide characters" a.k.a. `u16`s) that is often but not always UTF-16 encoded.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] that is often but not always UTF-16 encoded.

Can you expand on this? I'm not aware of any non-UTF-16 encoded PWSTRs. Are you instead trying to convey that a grapheme may span 1 or 2 code 16-bit units (2-4 bytes)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible to create invalid UTF-16 strings because the kernel does not check validity. This is exceedingly rare in practice but it is possible, especially if there's a malicious app causing mischief.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prime candidate here is the filesystem. NTFS allows names to be just about any sequence of 16-bit values (with a few reserved code units). Those names will inevitably be returned from FindFirstFile/FindNextFile, so PWSTR/PWCSTR cannot make any assumptions about the validity of the UTF-16 encoding.

Likewise, named kernel objects (events, mutexes, pipes, ...) can have names that aren't valid UTF-16.

Not quite the norm, but invalid UTF-16 sequences are observed in the wild, usually as a result of an accident/bug, or, more frequently, in attacks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no use trying to define what these are. They're just pointer typedefs, which is why windows-rs defines them as simply as possible. We can provide safe ways to create them and pass them to API calls, but we cannot make any guarantees about them or provide generally safe conversions from them.

@kennykerr
Copy link
Collaborator

This PR's pretty old. Can we move it along or just close it for now?

@kennykerr kennykerr closed this Aug 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants