Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Z3_get_lstring now adds escaping of some bytes. #5615

Closed
carstimon opened this issue Oct 21, 2021 · 5 comments
Closed

Z3_get_lstring now adds escaping of some bytes. #5615

carstimon opened this issue Oct 21, 2021 · 5 comments

Comments

@carstimon
Copy link

For #2286 the function Z3_get_lstring was introduced, and its documentation says it returns an unescaped string. We use this in the C++ API via "get_string" which was introduced in a second commit. Our team uses this function to get the string in a C++ program, and we need to get the string to be exactly the string produced by z3 (e.g. if we are asserting the length is 3 we expect the length of the produced string to be 3). Furthermore, the result is now ambiguous: "\u{a}" could mean that Z3 found the one-character string "\n" or the 5 character string "\u{a}".

As of this commit and this commit the function now does escaping. In particular c.string_val("\n", 1).get_string() == "\n" is no longer true. Is there a version to preserve the round-trip?

I don't have the context of the first commit that escaped non-ascii bytes, but I would expect that that round trip works for even unicode if that is now supported.

@NikolajBjorner
Copy link
Contributor

The get_lstring API didn't age well: Z3 used to only support ASCII (8 bits per character). It now supports up to 32 bits and defaults to a Unicode range as defined by SMTLIB string definition.
A proper API that returns unescaped characters would have to return an array of unsigned numbers.
Internally, we currently use get_lstring only in the Python API where it is wrapped with decoding.
To fix this, I would add a different API function altogether and leave it to consumers to convert unsigned into wchar or char or other encoding.

NikolajBjorner added a commit that referenced this issue Oct 21, 2021
…string, #5615

Signed-off-by: Nikolaj Bjorner <nbjorner@microsoft.com>
NikolajBjorner added a commit that referenced this issue Oct 21, 2021
Signed-off-by: Nikolaj Bjorner <nbjorner@microsoft.com>
@NikolajBjorner
Copy link
Contributor

I updated the C++ API by removing the "escaped" getter because they are both escaped (and there was another bug in one of the escape conversions to fix). There is a new function get_wstring() that returns a vector of unsigned. I am not sure if there is any more useful API (maybe one of the C++ versions for building Unicode strings works, but wchar definitely doesn't have enough bits with only 16).

@carstimon
Copy link
Author

Thanks for the quick turnaround, this is great!

I am only half familiar with this, but C++ has

  • std::string (sequence of char): Array of bytes
  • std::wstring (sequence of wchar_t): Array of an unpromised number of bits (depend on platform, ugh)
  • std::u16string (sequence of char16_t): Array of 16 bits (thankfully independent of platform)
  • std::u32string (sequence of char32_t): Array of 32 bits (thankfully independent of platform)

In my (again limit) experience most situations have std::string with an encoded utf8 rather then a std::u32string.

The conversion between a utf8-encoded std::string and a std::u32string of codepoints in the stl is shown here: https://stackoverflow.com/a/43302460/9517687
This is relatively small amount of code so I think it is reasonable to not do it in Z3, applying the principle of "making the fewest choices in a library" since the data is already basically in the std::u32string format.

If you think it makes sense, I can make the changes to implement something like

  std::u32string expr::get_unescaped_string() // Maybe instead of std::vector<unsigned> version?
  expr context::unescaped_utf8_string_val(const std::u32string& str);

The change between std::u32string and std::vector versions is not a big difference, just saves an extra conversion step if you want to use the conversion code I linked above.

The new version for creating exprs avoids some questions about how to properly call the string_val method with a potentially unicode string, I think? If I understand correctly the current methods force it to ASCII, and I'm not sure how that interacts with other UTF8 strings. (E.g. if i made an ascii constant and compared it to a free utf8 variable what happens).

NikolajBjorner added a commit that referenced this issue Oct 22, 2021
Signed-off-by: Nikolaj Bjorner <nbjorner@microsoft.com>
NikolajBjorner added a commit that referenced this issue Oct 23, 2021
…haracters in get_lstring

Signed-off-by: Nikolaj Bjorner <nbjorner@microsoft.com>
@NikolajBjorner
Copy link
Contributor

I have iterated on this with a few updates. While I haven't adapted the more descriptive naming convention with "unescaped_utf8_string_val" I believe the functionality now addresses accessing characters properly.

@NikolajBjorner
Copy link
Contributor

I am closing this for now. If there is something that needs fixing add a comment, but otherwise I have to assume it is handled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants