Getting "Invalid UTF-8 character" for UTF control characters #362

Closed
paresh-panda opened this issue Jul 20, 2022 · 7 comments

Comments

@paresh-panda

Hi,
When I compile an expression that contains UTF-8 characters, I get the error below. If I remove the HS_FLAG_UTF8 flag, it compiles fine. Is there any restriction on UTF-8 control characters?
"bob logged in from �"

Code snippet:
if (hs_compile(test1, HS_FLAG_DOTALL | HS_FLAG_UTF8, HS_MODE_BLOCK, NULL, &database,
               &compile_err) != HS_SUCCESS) {
    /* Report the compiler's message, then release the error object. */
    fprintf(stderr, "ERROR: Unable to compile pattern \"%s\": %s\n",
            test1, compile_err->message);
    hs_free_compile_error(compile_err);
}

ERROR: Unable to compile pattern "bob logged in from ": Expression is not valid UTF-8.

@hongyang7
Contributor

Hyperscan supports UTF-8 patterns in two ways: by setting the HS_FLAG_UTF8 flag, or by placing the (*UTF8) control verb at the beginning of the pattern.
Either way tells Hyperscan that the user intends to compile a UTF-8 pattern.
However, Hyperscan can only handle the pattern as UTF-8 if the expression itself is a valid UTF-8 string.

For reference, the following code checks the UTF-8 validity of the expression body:

if (expr.utf8 && !isValidUtf8(expression, len)) {
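
To make concrete what a validity check of this kind involves, here is a minimal sketch in C. It is not Hyperscan's isValidUtf8: the name is_valid_utf8_sketch is hypothetical, and the check is deliberately simplified (it only verifies lead bytes and continuation bytes, and assumes overlong encodings and surrogate code points do not need to be rejected).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, NOT Hyperscan's isValidUtf8(): a simplified structural
 * check that ignores overlong encodings and surrogate code points. */
static bool is_valid_utf8_sketch(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        size_t extra; /* number of continuation bytes expected after s[i] */
        if (s[i] <= 0x7f) {
            extra = 0; /* one octet; this range includes DEL (0x7f) */
        } else if ((s[i] & 0xe0) == 0xc0) {
            extra = 1; /* two-octet sequence */
        } else if ((s[i] & 0xf0) == 0xe0) {
            extra = 2; /* three-octet sequence */
        } else if ((s[i] & 0xf8) == 0xf0) {
            extra = 3; /* four-octet sequence */
        } else {
            return false; /* stray continuation byte or invalid lead byte */
        }
        if (extra > len - i - 1) {
            return false; /* sequence is cut off by the end of the input */
        }
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xc0) != 0x80) {
                return false; /* continuation bytes must look like 10xxxxxx */
            }
        }
        i += extra + 1;
    }
    return true;
}
```

Under these assumptions, a pattern that ends in a DEL byte, such as "bob logged in from \x7f", passes the check, while a lone 0x80 byte or a truncated two-byte sequence does not.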

@paresh-panda
Author

Thank you for the quick response!
I have set the HS_FLAG_UTF8 flag, and the UTF-8 control character gives me an invalid expression error whether it is at the beginning, middle, or end of the pattern.

if (hs_compile(test1, HS_FLAG_DOTALL|HS_FLAG_UTF8, HS_MODE_BLOCK, NULL, &database,


Can you please suggest how to proceed?
Thank you!

@hongyang7
Contributor

hongyang7 commented Jul 25, 2022

Can you provide the full test code, preferably as a .txt attachment?
Simply copying the expression from your original question does not seem to reproduce the error.

@paresh-panda
Author

Hi,
Please find the attached txt file for the sample code.

Thank you!
HSPoc_cpp.txt

@paresh-panda
Author

Hi
Could you please comment on the sample code? The first byte of the char string is the UTF-8 DEL control character.

@hongyang7
Contributor

Hey, sorry for the late reply. Your code should be fine: the error comes from a bug in our UTF-8 validity function, which mistakenly treats 0x7f as an invalid one-byte UTF-8 character:

// One octet.
if (s[i] < 0x7f) {
DEBUG_PRINTF("one octet\n");

The condition here should be "s[i] <= 0x7f".

The first byte of your char string happens to hit this corner case. We'll push the fix soon. In the meantime, you can apply the change manually if needed.
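
To see the effect of that boundary in isolation, here is a small sketch in C comparing the two conditions. Only the comparisons themselves come from the snippet above; the function names are hypothetical and this is not Hyperscan's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names. The condition as quoted in the thread: 0x7f is not
 * accepted as a one-octet character. */
static bool one_octet_before_fix(unsigned char c) { return c < 0x7f; }

/* The condition after the suggested change: 0x7f (DEL) is accepted. */
static bool one_octet_after_fix(unsigned char c) { return c <= 0x7f; }

/* Counts how many of the 256 byte values the two conditions disagree on. */
static int count_disagreements(void) {
    int n = 0;
    for (int c = 0; c <= 0xff; c++) {
        if (one_octet_before_fix((unsigned char)c) !=
            one_octet_after_fix((unsigned char)c)) {
            n++;
        }
    }
    return n;
}

/* Self-check documenting the expected behavior at the boundary. */
static void check_boundary(void) {
    assert(one_octet_before_fix(0x7e) && one_octet_after_fix(0x7e));
    assert(!one_octet_before_fix(0x7f) && one_octet_after_fix(0x7f));
    assert(!one_octet_before_fix(0x80) && !one_octet_after_fix(0x80));
    assert(count_disagreements() == 1); /* only 0x7f differs */
}
```

The two conditions differ for exactly one byte value, 0x7f, which is why only patterns containing DEL hit the error.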

fatchanghao pushed a commit that referenced this issue Oct 27, 2022
@hongyang7
Contributor

Please refer to the latest develop branch.
Commit id: 062c390

fatchanghao pushed a commit that referenced this issue Feb 15, 2023
markos pushed a commit to VectorCamp/vectorscan that referenced this issue Sep 5, 2023