Z3_get_lstring now adds escaping of some bytes. #5615

carstimon · 2021-10-21T14:29:24Z

For #2286 the function Z3_get_lstring was introduced, and its documentation says it returns an unescaped string. We use this in the C++ API via "get_string", which was introduced in a second commit. Our team uses this function to retrieve the string in a C++ program, and we need the result to be exactly the string produced by Z3 (e.g. if we assert that the length is 3, we expect the produced string to have length 3). Furthermore, the result is now ambiguous: "\u{a}" could mean that Z3 found the one-character string "\n" or the 5-character string "\u{a}".

As of this commit and this commit the function now does escaping. In particular, c.string_val("\n", 1).get_string() == "\n" is no longer true. Is there a version that preserves the round trip?
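To make the ambiguity concrete, here is a minimal self-contained sketch. It does not call Z3; `escape_bytes` is a hypothetical stand-in that assumes printable ASCII (including the backslash) is copied through unchanged and every other byte becomes `\u{<hex>}`. Z3's real escaping rules may differ in detail.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical escaper, NOT Z3's implementation: printable ASCII is copied
// through unchanged, every other byte is written as \u{<hex>}.
std::string escape_bytes(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (c >= 0x20 && c < 0x7f) {
            out += static_cast<char>(c);
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\u{%x}", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// Two different inputs collapse to the same escaped output, so the original
// string (and its length) cannot be recovered from the escaped form alone.
void check_ambiguity() {
    const std::string newline = "\n";      // 1 character
    const std::string literal = "\\u{a}";  // 5 characters: \ u { a }
    assert(newline.size() == 1 && literal.size() == 5);
    assert(escape_bytes(newline) == escape_bytes(literal));
}
```

Under these assumptions, a caller that only sees the escaped text cannot tell which of the two strings the solver produced.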

I don't have the context of the first commit that escaped non-ASCII bytes, but I would expect the round trip to work even for Unicode, if that is now supported.

NikolajBjorner · 2021-10-21T15:06:11Z

The get_lstring API didn't age well: Z3 used to support only ASCII (8 bits per character). It now supports up to 32 bits and defaults to a Unicode range as defined by the SMTLIB string definition.
A proper API that returns unescaped characters would have to return an array of unsigned numbers.
Internally, we currently use get_lstring only in the Python API where it is wrapped with decoding.
To fix this, I would add a different API function altogether and leave it to consumers to convert unsigned into wchar or char or other encoding.
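As an illustration of what "leave it to consumers" could look like, here is a hedged sketch of the consumer side: it takes the code points as a `std::vector<unsigned>` and encodes them as UTF-8. The names `append_utf8` and `to_utf8` are made up for this example and are not part of Z3.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Append the UTF-8 encoding of one code point to out.
// Throws for surrogates and for values above U+10FFFF.
void append_utf8(std::string& out, unsigned cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole sequence of code points (hypothetical consumer helper).
std::string to_utf8(const std::vector<unsigned>& cps) {
    std::string out;
    for (unsigned cp : cps) append_utf8(out, cp);
    return out;
}

// A one-character string stays one character long before encoding,
// and a newline is emitted as a raw byte rather than an escape.
void check_newline() {
    const std::vector<unsigned> cps = {0x0A};
    assert(cps.size() == 1);
    assert(to_utf8(cps) == "\n");
}
```

The length the solver reasons about is the number of code points, `cps.size()`, not the number of bytes in the encoded result.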

NikolajBjorner · 2021-10-21T22:32:01Z

I updated the C++ API by removing the "escaped" getter because they are both escaped (and there was another bug in one of the escape conversions to fix). There is a new function get_wstring() that returns a vector of unsigned. I am not sure if there is any more useful API (maybe one of the C++ versions for building Unicode strings works, but wchar definitely doesn't have enough bits with only 16).

carstimon · 2021-10-22T00:17:58Z

Thanks for the quick turnaround, this is great!

I am only half familiar with this, but C++ has

std::string (sequence of char): Array of bytes
std::wstring (sequence of wchar_t): Array of an unpromised number of bits (depends on platform, ugh)
std::u16string (sequence of char16_t): Array of 16 bits (thankfully independent of platform)
std::u32string (sequence of char32_t): Array of 32 bits (thankfully independent of platform)

In my (again limited) experience, most situations use a std::string holding UTF-8-encoded text rather than a std::u32string.

The conversion between a utf8-encoded std::string and a std::u32string of codepoints in the stl is shown here: https://stackoverflow.com/a/43302460/9517687
This is a relatively small amount of code, so I think it is reasonable not to do it in Z3, applying the principle of "making the fewest choices in a library", since the data is already basically in the std::u32string format.
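For a sense of how small that conversion is, here is a hand-rolled sketch of the UTF-8 to code point direction. It is not the code from the linked answer, `utf8_to_u32` is a name invented for this example, and the decoder is deliberately minimal: it rejects malformed lead and continuation bytes but does not reject overlong encodings or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into a sequence of code points (minimal validation).
std::u32string utf8_to_u32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9: two bytes, one code point.
void check_decode() {
    const std::u32string cps = utf8_to_u32("\xC3\xA9");
    assert(cps.size() == 1 && cps[0] == 0xE9);
}
```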

If you think it makes sense, I can make the changes to implement something like

  std::u32string expr::get_unescaped_string() // Maybe instead of std::vector<unsigned> version?
  expr context::unescaped_utf8_string_val(const std::u32string& str);

The change between std::u32string and std::vector versions is not a big difference, just saves an extra conversion step if you want to use the conversion code I linked above.
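The extra step in question is roughly the following, sketched under the assumption that the getter hands back a `std::vector<unsigned>` of code points; `to_u32string` is a hypothetical helper name, not a Z3 function.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Copy code points from a vector of unsigned into a std::u32string.
std::u32string to_u32string(const std::vector<unsigned>& cps) {
    std::u32string out;
    out.reserve(cps.size());
    for (unsigned cp : cps) out.push_back(static_cast<char32_t>(cp));
    return out;
}

// The element count is unchanged; only the container type differs.
void check_same_length() {
    const std::vector<unsigned> cps = {0x61, 0x1F600};
    assert(to_u32string(cps).size() == cps.size());
}
```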

The new version for creating exprs avoids some questions about how to properly call the string_val method with a potentially Unicode string, I think? If I understand correctly, the current methods force it to ASCII, and I'm not sure how that interacts with other UTF-8 strings (e.g. what happens if I make an ASCII constant and compare it to a free UTF-8 variable?).

NikolajBjorner · 2021-10-23T16:23:08Z

I have iterated on this with a few updates. While I haven't adopted the more descriptive naming convention with "unescaped_utf8_string_val", I believe the functionality now addresses accessing characters properly.

NikolajBjorner · 2021-10-26T16:01:08Z

I am closing this for now. If there is something that needs fixing add a comment, but otherwise I have to assume it is handled.

NikolajBjorner added a commit that referenced this issue Oct 21, 2021

add API to access unescaped strings, update documentation of Z3_get_l…

05e7ed9

…string, #5615 Signed-off-by: Nikolaj Bjorner <nbjorner@microsoft.com>

NikolajBjorner added a commit that referenced this issue Oct 21, 2021

updated C++ API for escaped and unescaped strings #5615

f05ac8a

Signed-off-by: Nikolaj Bjorner <nbjorner@microsoft.com>

NikolajBjorner added a commit that referenced this issue Oct 22, 2021

use some suggestions from #5615

7f41d61

Signed-off-by: Nikolaj Bjorner <nbjorner@microsoft.com>

NikolajBjorner added a commit that referenced this issue Oct 23, 2021

#5615 - update documentation and use non-encoded versions for ASCII c…

3a3cef8

…haracters in get_lstring Signed-off-by: Nikolaj Bjorner <nbjorner@microsoft.com>

NikolajBjorner closed this as completed Oct 26, 2021
