UTF-8 string #1037

anuchak · 2022-11-16T13:46:38Z

utf8 changes to be merged with master

acquamarin

There is a bug in the lpad/rpad operations when the count is a negative number. Please also add tests to cover these two cases.
Shell bugs:
a. When typing UTF-8 strings in the shell, ctrl-left and ctrl-right do not work.
b. When typing UTF-8 strings in the shell, the number of characters that fit on the screen seems incorrect:
If you type a long UTF-8 string in the shell, the leftmost UTF-8 characters are hidden even though there is still space left on the screen.

acquamarin · 2022-11-16T20:01:13Z

dataset/tinysnb/schema.cypher

@@ -1,5 +1,6 @@
 create node table person (ID INt64, fName StRING, gender INT64, isStudent BoOLEAN, isWorker BOOLEAN, age INT64, eyeSight DOUBLE, birthdate DATE, registerTime TIMESTAMP, lastJobDuration interval, workedHours INT64[], usedNames STRING[], courseScoresPerTerm INT64[][], PRIMARY KEY (ID));
 create node table organisation (ID INT64, name STRING, orgCode INT64, mark DOUBLE, score INT64, history STRING, licenseValidInterval INTERVAL, rating DOUBLE, PRIMARY KEY (ID));
+create node table movies (ID INT64, name STRING, PRIMARY KEY (ID));


I think this dataset is only used to test UTF-8 strings. To make the testing simpler, you can have a single column, name, and use it as the primary key.

Yes, previously ID didn't support chars, so I had to add that.
I think 2 columns are fine; otherwise I'll have to go back and change all the tests for this.

We always use the minimum code/dataset for testing, so it is better to keep just one column in the movies table.

I've dropped the extra column; name is now the only column and the primary key.

third_party/utf8proc/BUILD.bazel

src/function/string/operations/include/base_lower_upper_operation.h

src/function/string/operations/include/lpad_operation.h

test/test_files/tinySNB/function/string.test

anuchak · 2022-11-18T02:05:07Z

The bug for lpad & rpad has been handled.

The ctrl + left / right bug has been handled (now we are jumping properly between words).

For the characters rendering issue, I have opened an issue here #1042

acquamarin

I want to take another look after you fix the bug in left & right.

acquamarin · 2022-11-18T03:51:53Z

src/function/string/operations/include/left_operation.h

-                                ((uint32_t)max(left.len + right, (int64_t)0));
+        int64_t leftLen;
+        Length::operation(left, leftLen);
+        int64_t len = (right > 0) ? min(leftLen, right) : max(leftLen + right, (int64_t)0);


I think this code still has a bug:

kuzu> return left('123456', 0);
---------------------
| left('123456', 0) |
---------------------
| 123456            |
---------------------
(1 tuple)
Time: 1.18ms (compiling), 0.68ms (executing)

Duckdb:

D select left('123456', 0) from test;
┌───────────────────┐
│ left('123456', 0) │
├───────────────────┤
│                   │
└───────────────────┘

Fixed the issue; it prints an empty value now.
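The zero case can be traced to the `right > 0` test in the diff: with a count of 0 the negative branch runs and `max(leftLen + 0, 0)` keeps the whole string. A minimal sketch of one way to handle it, assuming the character count is already known; `leftLength` is a hypothetical helper name, not the function in the PR:

```cpp
#include <algorithm>
#include <cstdint>

// Number of characters left(s, count) should keep, given the character
// length of s. Hypothetical helper for illustration only.
int64_t leftLength(int64_t totalChars, int64_t count) {
    // Using >= 0 (not > 0) sends count == 0 down the "keep at most count"
    // branch, so nothing is kept.
    if (count >= 0) {
        return std::min(totalChars, count);
    }
    // A negative count drops |count| characters from the end, never going
    // below zero.
    return std::max(totalChars + count, int64_t{0});
}
```

With this version, `leftLength(6, 0)` is 0, which matches the empty DuckDB result above.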

acquamarin · 2022-11-18T03:52:21Z

src/function/string/operations/include/length_operation.h

+        for (auto i = 0; i < totalByteLength; i++) {
+            if (inputString[i] & 0x80) {
+                int64_t length = 0;
+                // use grapheme iterator to identify bytes of utf8 char and increment once for each


Capitalize the first character 'u', and add a period at the end of the comment.
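For readers following the length logic, here is a simplified sketch of counting characters in a UTF-8 string. It counts code points by skipping continuation bytes, which is an assumption made to keep the example dependency-free; the diff above uses a grapheme iterator instead, which also merges combining sequences. `utf8Length` is a hypothetical name.

```cpp
#include <cstdint>
#include <string>

// Character count of a UTF-8 string, measured in code points.
// Assumes the input is valid UTF-8.
int64_t utf8Length(const std::string& input) {
    int64_t length = 0;
    for (unsigned char byte : input) {
        // Continuation bytes have the form 10xxxxxx and never start a
        // character, so only the other bytes are counted.
        if ((byte & 0xC0) != 0x80) {
            length++;
        }
    }
    return length;
}
```

For pure ASCII input this returns the byte length, which is why the diff only takes the slower path once it sees a byte with the high bit set.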

acquamarin · 2022-11-18T03:54:36Z

src/function/string/operations/include/right_operation.h

+        int64_t leftLen;
+        Length::operation(left, leftLen);
+        int64_t len = (right > 0) ? min(leftLen, right) : max(leftLen + right, (int64_t)0);
+        SubStr::operation(left, leftLen - len + 1, len, result, resultValueVector);


Right has similar bugs to left; see my comments on left to reproduce them.
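A sketch of the same calculation for right, assuming (as the `leftLen - len + 1` argument in the diff suggests) that the substring start is a 1-based character position. `Slice` and `rightSlice` are hypothetical names used only for illustration:

```cpp
#include <algorithm>
#include <cstdint>

// Start position (1-based, in characters) and length of the substring
// that right(s, count) would return.
struct Slice {
    int64_t start;
    int64_t len;
};

Slice rightSlice(int64_t totalChars, int64_t count) {
    // count >= 0 keeps at most `count` trailing characters, so 0 keeps none.
    // A negative count drops |count| leading characters instead.
    int64_t len = count >= 0 ? std::min(totalChars, count)
                             : std::max(totalChars + count, int64_t{0});
    return Slice{totalChars - len + 1, len};
}
```

For a 6-character string, a count of 2 gives start 5 and length 2, and a count of 0 gives length 0.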

acquamarin · 2022-11-18T03:56:02Z

src/function/string/operations/include/upper_operation.h

-            str[i] = toupper(str[i]);
-        }
-        return len;
+        BaseLowerUpperOperation::operation(input, result, resultValueVector, /* isUpper */ true);


The comment should appear after the constant true.

acquamarin

kuzu> match (t:test) return t;
----------
| t.name |
----------
| alice  |
----------
| 人    |
----------

I think we need to change the result printer as well. If it is easy, you can do it in your current PR.
Otherwise, open an issue for this and let aziz handle it.

anuchak · 2022-11-18T18:43:32Z

I've fixed the left & right operation bugs.

For the result printer, I tried to fix it but couldn't find a solution to one issue: even if we count A and 大 as 1 character each, the latter takes up more space than A when printed. We can use the utf8proc library to count chars properly, but it provides no facility to find how much space a utf8 character takes up.
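To make the width problem concrete, here is a rough sketch that estimates terminal columns from code point ranges. The range list is a deliberately incomplete approximation of the East Asian Wide blocks, chosen for illustration, and `isWide` and `displayWidth` are hypothetical names. Upstream utf8proc does document a `utf8proc_charwidth` function for this purpose; whether the copy vendored under third_party/utf8proc exposes it has not been checked here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Rough, incomplete test for code points that usually occupy two columns.
bool isWide(uint32_t cp) {
    return (cp >= 0x1100 && cp <= 0x115F) ||
           (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) ||
           (cp >= 0xAC00 && cp <= 0xD7A3) || (cp >= 0xF900 && cp <= 0xFAFF) ||
           (cp >= 0xFE30 && cp <= 0xFE4F) || (cp >= 0xFF00 && cp <= 0xFF60) ||
           (cp >= 0xFFE0 && cp <= 0xFFE6) || (cp >= 0x20000 && cp <= 0x3FFFD);
}

// Estimated number of terminal columns for a valid UTF-8 string.
int displayWidth(const std::string& s) {
    int width = 0;
    size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        uint32_t cp;
        size_t extra; // number of continuation bytes that follow the lead byte
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        i++;
        for (size_t k = 0; k < extra && i < s.size(); k++, i++) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        width += isWide(cp) ? 2 : 1;
    }
    return width;
}
```

Under this approximation A is one column and 大 is two, so a printer would pad each cell by its display width rather than by its character count.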

I've raised an issue here #1047

anuchak · 2022-11-18T18:49:39Z

I ran some queries in DuckDB with long UTF-8 strings, and it seems they don't print all of the characters.
They just cut the output short with an ellipsis (...).

acquamarin

kuzu> return array_extract('都是大叔大的撒的', 4);
------------------------------------------------
| array_extract('都是大叔大的撒的', 4) |
------------------------------------------------
| ?                                            |
------------------------------------------------
(1 tuple)
Time: 14.64ms (compiling), 0.70ms (executing)
kuzu>

Have you checked the array_extract function? Please be patient and check the correctness of every string function carefully. All functions supported by our system are under: src/function/string/operations/include.
BTW: we also need to update the find operation to make it compatible with UTF8.

anuchak · 2022-11-18T23:38:00Z

Both list_extract and array_extract didn't have UTF8 char support, that's why they were not working.
I've fixed both of them:

    ------------------------------------------------
    | array_extract('都是大叔大的撒的', 4) |
    ------------------------------------------------
    | 叔                                          |
    ------------------------------------------------
    (1 tuple)
    Time: 19.24ms (compiling), 1.39ms (executing)
    
    -----------------------------------------------
    | list_extract('都是大叔大的撒的', 4) |
    -----------------------------------------------
    | 叔                                         |
    -----------------------------------------------
    (1 tuple)
    Time: 8.62ms (compiling), 0.60ms (executing)

Also added tests for positive, negative, and zero indexes for the extract functions.
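As an illustration of what character-aware extraction involves, here is a minimal sketch that returns the n-th code point of a UTF-8 string. It assumes 1-based indexing (consistent with index 4 returning 叔 above) and, as a simplification, returns an empty string for zero, negative, or out-of-range indexes; the actual behavior of the PR for those cases is not shown in this thread. `extractCodePoint` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the 1-based `index`-th code point of `s` as a UTF-8 string.
// Assumes valid UTF-8 input.
std::string extractCodePoint(const std::string& s, int64_t index) {
    if (index < 1) {
        return ""; // simplification: zero and negative indexes are not handled
    }
    size_t i = 0;
    int64_t current = 0;
    while (i < s.size()) {
        size_t start = i;
        i++;
        // Skip the continuation bytes (10xxxxxx) of the current character.
        while (i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            i++;
        }
        current++;
        if (current == index) {
            return s.substr(start, i - start);
        }
    }
    return ""; // index is past the end of the string
}
```

Indexing by byte instead would return one byte of a multi-byte character, which is consistent with the `?` printed before the fix.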

For the find function, I'm not sure which function you are referring to. There is no such function in the catalog currently.
Also, I checked DuckDB's find function code (the needle-haystack algorithm) and it has no UTF8-specific code.
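One reason a substring search can get by without UTF-8-specific code: UTF-8 is self-synchronizing, so a byte-wise match of a valid needle in a valid haystack never starts in the middle of a character. What does need care is the position that gets reported. A minimal sketch, assuming a 1-based character position and 0 for "not found" (both conventions are assumptions, as is the name `findCharPosition`):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the 1-based character position of `needle` in `haystack`,
// or 0 if it does not occur. Assumes both strings are valid UTF-8.
int64_t findCharPosition(const std::string& haystack, const std::string& needle) {
    // The search itself is purely byte-wise.
    size_t bytePos = haystack.find(needle);
    if (bytePos == std::string::npos) {
        return 0;
    }
    // Convert the byte offset to a character offset by counting the
    // non-continuation bytes that precede the match.
    int64_t chars = 0;
    for (size_t i = 0; i < bytePos; i++) {
        if ((static_cast<unsigned char>(haystack[i]) & 0xC0) != 0x80) {
            chars++;
        }
    }
    return chars + 1;
}
```

For the haystack 大abc and the needle abc, the byte offset is 3 but the character position is 2.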

acquamarin · 2022-11-20T01:18:54Z

tools/shell/linenoise.cpp

-                buf[l.pos] = aux;
-                if (l.pos != l.len - 1)
-                    l.pos++;
+                char tempBuffer[128];


We always use tmp instead of temp in variable names.

anuchak force-pushed the utf8 branch from 3bedc52 to 8926807 Compare November 16, 2022 13:50

anuchak marked this pull request as ready for review November 16, 2022 13:51

anuchak requested a review from acquamarin November 16, 2022 13:51

acquamarin requested changes Nov 16, 2022

anuchak force-pushed the utf8 branch from 6361d9f to caa5e3a Compare November 18, 2022 01:56

anuchak requested a review from acquamarin November 18, 2022 02:06

acquamarin requested changes Nov 18, 2022

acquamarin reviewed Nov 18, 2022

anuchak force-pushed the utf8 branch from caa5e3a to 38c0b92 Compare November 18, 2022 18:37

anuchak requested a review from acquamarin November 18, 2022 18:50

acquamarin requested changes Nov 18, 2022

anuchak force-pushed the utf8 branch from 38c0b92 to fc7d0f5 Compare November 18, 2022 23:33

anuchak requested a review from acquamarin November 18, 2022 23:42

anuchak force-pushed the utf8 branch from fc7d0f5 to 20fb489 Compare November 20, 2022 00:23

acquamarin approved these changes Nov 20, 2022

merge utf8 to master

a8d46b6

anuchak force-pushed the utf8 branch from 20fb489 to a8d46b6 Compare November 20, 2022 14:42

anuchak merged commit 025be59 into master Nov 20, 2022

anuchak deleted the utf8 branch November 20, 2022 14:43

anuchak restored the utf8 branch November 20, 2022 14:43

anuchak deleted the utf8 branch November 21, 2022 01:14

ray6080 changed the title ~~merge utf8 to master~~ UTF-8 string Jan 29, 2023
