[BUG] Inconsistent std::regex_replace results on x64 Linux and aarch64 Android #1911

Closed
zheng-yu-yang opened this issue Aug 3, 2023 · 5 comments

@zheng-yu-yang

Description

The code to reproduce the issue:
https://gist.github.com/zheng-yu-yang/a225cc68350ae828cf68b2591730871c

With NDK r25c/r26b1 x64, the output is:

src: e0 b8 81 e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38 20 20
dst: e0 b8 81 e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38

where the 2 trailing spaces (20 20) are removed from the source content.

With NDK r25c/r26b1 aarch64, the output is:

src: e0 b8 81 e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38 20 20
dst: e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38

where the trailing 2 spaces (20, 20), as well as the first 3 bytes (e0 b8 81) from the source content are removed.

I did not use a build system; I manually compiled the source code to a static ELF binary with either
clang++ regex_test.cpp -o regex_test -static -std=c++11
or
g++ regex_test.cpp -o regex_test -static -std=c++11

I also tried a cross gcc (arm-linux-gnueabihf-g++, 11.4.0) and a native gcc (g++, 11.4.0); both produce the same output (only the trailing spaces are removed).

I ran the test program on a MI MAX 3 (Android 9) and on Ubuntu 22.04 in WSL.

Affected versions

r25, r26

Canary version

No response

Host OS

Linux

Host OS version

Ubuntu 22.04 in WSL

Affected ABIs

arm64-v8a

Build system

Other (specify below)

Other build system

manual build from bash command line

minSdkVersion

30

Device API level

28

@zheng-yu-yang
Author

BTW, this issue is not specific to wchar_t. When I keep the source string as a UTF-8 encoded std::string and modify everything else accordingly, the results are still inconsistent, only in a different way: besides losing its trailing spaces, the resulting string has some illegal UTF-8 bytes inserted. I can provide the source code to reproduce this if needed.

@rprichard
Collaborator

So far I'm not seeing a difference between arm64 and x86_64 behavior. On both a P and an Sv2 emulator, I see this:

src: e0 b8 81 e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38 20 20 
dst: e0 b8 b3 e0 b8 a5 e0 b8 b1 e0 b8 87 e0 b8 88 e0 b8 b0 e0 b8 a1 e0 b8 b2 e0 b8 96 e0 b8 b6 e0 b8 87 e0 b9 83 e0 b8 99 20 e0 b9 80 e0 b8 a7 e0 b8 a5 e0 b8 b2 20 31 35 3a 34 38 

That does seem wrong though? I reduced it to:

#include <cstdio>
#include <regex>
#include <string>
int main() {
    std::string src = "A";
    std::string dst = std::regex_replace(src, std::regex(""), "x");
    printf("[%s]\n", dst.c_str());
    return 0;
}
// libstdc++ output: [xAx]
// libc++ output:    [xx]

I see some comments in http://eel.is/c++draft/re about a "zero-length match", so I'm guessing these test cases have defined behavior, and maybe there's a libc++ bug here.
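To make the term concrete: a zero-length match is one that succeeds without consuming any characters, as an empty pattern does at position 0. The sketch below is only an illustration, not code from this thread; `MatchSpan` and `first_match` are hypothetical names that report where the first match starts and how long it is.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper type for illustration; not part of the original report.
struct MatchSpan {
    bool found;
    std::ptrdiff_t position;  // offset of the first match in the input
    std::ptrdiff_t length;    // number of characters the match consumed
};

// Returns the span of the first match of `pattern` in `text`
// (ECMAScript syntax, the std::regex default).
MatchSpan first_match(const std::string& text, const std::string& pattern) {
    std::smatch m;
    if (!std::regex_search(text, m, std::regex(pattern))) {
        return {false, 0, 0};
    }
    return {true, m.position(0), m.length(0)};
}
```

With this helper, `first_match("A", "")` reports a match at position 0 with length 0, and so does `first_match("A  ", "^\\s*|\\s*$")`, because the `^\s*` alternative succeeds on an input with no leading whitespace. The replacement step has to handle such empty matches without dropping the characters that follow them.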

@zheng-yu-yang
Author

Prichard,

Thanks for trying it.

Actually, you got consistent but incorrect results. They are incorrect because the first 3 bytes (e0 b8 81) should not be removed: the regular expression "^\s*|\s*$" matches only leading or trailing whitespace, and e0 b8 81 is the UTF-8 encoding of the Thai character 'ก'.
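For callers who only need the trimming that this pattern is meant to perform, one possible workaround is to avoid std::regex altogether. The following is a minimal sketch, assuming the input is a UTF-8 encoded std::string and that only ASCII whitespace has to be stripped; `trim_ascii_whitespace` is a hypothetical name, not something from the gist.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strips leading and trailing ASCII whitespace.
// Every byte of a multi-byte UTF-8 sequence is >= 0x80, so it can never
// be mistaken for one of the ASCII whitespace characters listed here.
std::string trim_ascii_whitespace(const std::string& text) {
    static const char kWhitespace[] = " \t\n\r\f\v";
    const std::size_t first = text.find_first_not_of(kWhitespace);
    if (first == std::string::npos) {
        return std::string();  // input is empty or all whitespace
    }
    const std::size_t last = text.find_last_not_of(kWhitespace);
    return text.substr(first, last - first + 1);
}
```

Applied to a string that starts with the bytes e0 b8 81 and ends with two spaces, this keeps the leading Thai character and drops only the trailing spaces. It does not reproduce `\s` for non-ASCII whitespace, which is a deliberate simplification.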

@rprichard
Collaborator

I reported the issue to LLVM, llvm/llvm-project#64451.

@pirama-arumuga-nainar
Collaborator

The upstream issue was fixed in llvm/llvm-project#94550. Will try to cherry-pick it into the next prebuilt drop for r27.

Projects
Status: Prebuilts submitted
Status: Merged