Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upper and Lower do not handle invalid UTF-8 #2148

Closed
Riolku opened this issue Oct 4, 2023 · 7 comments · Fixed by #2208
Closed

Upper and Lower do not handle invalid UTF-8 #2148

Riolku opened this issue Oct 4, 2023 · 7 comments · Fixed by #2208
Assignees
Labels
bug Something isn't working high-priority

Comments

@Riolku
Copy link
Contributor

Riolku commented Oct 4, 2023

If we call Upper() or Lower() with binary data (anything non-UTF-8), our application simply hangs.

This is because we don't check the return value of utf8proc_codepoint, which is -1 when UTF-8 is invalid. Our size goes unmodified, and we loop forever. Worse, the assert we have simply does nothing, since (as far as I can tell), it doesn't check that the codepoint is non-negative.

Reproduce with:

LOAD FROM "test.csv" RETURN upper(column0);

Where you can create test.csv by running:

python -c "open('test.csv', 'wb').write(b'\xff')"

@Riolku Riolku assigned Riolku and andyfengHKU and unassigned Riolku Oct 4, 2023
Riolku added a commit that referenced this issue Oct 5, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 5, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 5, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
@Riolku
Copy link
Contributor Author

Riolku commented Oct 6, 2023

#2156 adds a hack to make sure we don't hang in this case, but we should fix this properly by ensuring STRING types are always valid UTF-8, which is a much bigger task. This is compliant with what postgres does.

We should still keep this check just in case, it's always better to throw an unexpected error than crash.

Riolku added a commit that referenced this issue Oct 6, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 6, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 6, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 6, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 6, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 7, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
@ray6080 ray6080 mentioned this issue Oct 10, 2023
10 tasks
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
Riolku added a commit that referenced this issue Oct 10, 2023
clang-tidy is a static analyzer, and can find many different issues.
So far I have integrated a few simple checks, and disabled the remaining
ones. They will be enabled in future PRs.

Closes #2152. Ref #2148.
@acquamarin
Copy link
Collaborator

Solved in #2208

@Riolku
Copy link
Contributor Author

Riolku commented Oct 13, 2023

After a bunch of investigation, here's the state of things:

  • Copy from CSV is broken and could be fixed by Throw copy exception on invalid utf8 string #2208.
  • Copy from parquet doesn't support strings.
  • Copy from NPY also doesn't support strings.
  • ANTLR4 throws on invalid UTF-8, so we can't insert bad strings via any of the APIs (right now).

This implies that #2208 would fix the issue. However, I would instead propose that we treat all data in CSV and other files as BLOB, and then call an appropriate cast function. We already have DECODE for the latter task. It'd be good to make UTF-8 checks centralized, instead of duplicating and inevitably forgetting them here and there in the codebase.

Finally, our current LIST_SLICE function doesn't respect UTF-8, and thus means that this issue is not fixed by #2208.

kuzu> MATCH (u:user) set u.name = "成績評価の甘い授業が高く評価";
-
(0 tuples)
Time: 0.76ms (compiling), 3.70ms (executing)
kuzu> MATCH (u:user) RETURN upper(u.name[:4]);
-------------------------------------------
| UPPER(LIST_SLICE(u.name,TO_INT64(0),4)) |
-------------------------------------------
| 成侀                                    |
-------------------------------------------
(1 tuple)

@acquamarin
Copy link
Collaborator

acquamarin commented Oct 13, 2023

However, I would instead propose that we treat all data in CSV and other files as BLOB, and then call an appropriate cast function.
Does this mean we should implement casting functions from blob to all other types that we support?

@Riolku
Copy link
Contributor Author

Riolku commented Oct 13, 2023

I don't know, because that seems silly. It definitely makes sense to implement BLOB -> String. I suppose BLOB -> INT64 only makes sense internally. We don't need a UTF-8 check for those cases, anything non-ascii isn't a valid integer anyway.

On second thought I'm no longer sure what to do. Maybe having per-filetype checks is reasonable since they are so different (and we don't want to pay the price of a copy).

@semihsalihoglu-uw semihsalihoglu-uw added bug Something isn't working high-priority labels Jan 8, 2024
@semihsalihoglu-uw
Copy link
Contributor

My 2 cents is to have per filetype checks but I'll leave the solution to the implementer. I suggest we prioritize this. Calling upper and lower functions on UTF-8 could be a common scenario.

@andyfengHKU
Copy link
Contributor

Test for all string functions on utf-8 is done in #3287

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high-priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants