Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 string #1037

Merged
merged 1 commit into from
Nov 20, 2022
Merged

UTF-8 string #1037

merged 1 commit into from
Nov 20, 2022

Conversation

anuchak
Copy link
Collaborator

@anuchak anuchak commented Nov 16, 2022

utf8 changes to be merged with master

Copy link
Collaborator

@acquamarin acquamarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. There is a bug in lpad/rpad operation when the count is a negative number. Also add tests to cover these two cases.
  2. Shell bug:
    a. If we are typing utf-8 strings in the shell, the ctrl-left,ctrl-right is not working.
    b. If we are typing utf-8 strings in the shell, the number of characters that can fit in the screen seems incorrect:
    If you are typing a long utf-8 string in the shell, the left most utf-8 characters will be hidden even if there are still spaces left in the screen.

@@ -1,5 +1,6 @@
create node table person (ID INt64, fName StRING, gender INT64, isStudent BoOLEAN, isWorker BOOLEAN, age INT64, eyeSight DOUBLE, birthdate DATE, registerTime TIMESTAMP, lastJobDuration interval, workedHours INT64[], usedNames STRING[], courseScoresPerTerm INT64[][], PRIMARY KEY (ID));
create node table organisation (ID INT64, name STRING, orgCode INT64, mark DOUBLE, score INT64, history STRING, licenseValidInterval INTERVAL, rating DOUBLE, PRIMARY KEY (ID));
create node table movies (ID INT64, name STRING, PRIMARY KEY (ID));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this dataset is only used to test utf-8 strings. To make the testing simpler, you can have one simply column name and use name as primary key.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes previously ID didn't support chars so had to add that.
I think 2 columns are fine, or else I'll have to go back and change all the tests for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always use the minimum code/dataset for testing, so it is better for you to just keep one column in movies table.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've dropped the extra column, now name is the only column and primary key.

third_party/utf8proc/BUILD.bazel Outdated Show resolved Hide resolved
test/test_files/tinySNB/function/string.test Show resolved Hide resolved
@anuchak
Copy link
Collaborator Author

anuchak commented Nov 18, 2022

The bug for lpad & rpad has been handled.

The ctrl + left / right bug has been handled (now we are jumping properly between words).

For the characters rendering issue, I have opened an issue here #1042

Copy link
Collaborator

@acquamarin acquamarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to take another look after you fixed the bug in left & right

((uint32_t)max(left.len + right, (int64_t)0));
int64_t leftLen;
Length::operation(left, leftLen);
int64_t len = (right > 0) ? min(leftLen, right) : max(leftLen + right, (int64_t)0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code still has a bug:

kuzu> return left('123456', 0);
---------------------
| left('123456', 0) |
---------------------
| 123456            |
---------------------
(1 tuple)
Time: 1.18ms (compiling), 0.68ms (executing)

Duckdb:

D select left('123456', 0) from test;
┌───────────────────┐
│ left('123456', 0) │
├───────────────────┤
│                   │
└───────────────────┘

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the issue, it is printing empty val now.

for (auto i = 0; i < totalByteLength; i++) {
if (inputString[i] & 0x80) {
int64_t length = 0;
// use grapheme iterator to identify bytes of utf8 char and increment once for each
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize the first character 'u', and add a period at the end of the comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

int64_t leftLen;
Length::operation(left, leftLen);
int64_t len = (right > 0) ? min(leftLen, right) : max(leftLen + right, (int64_t)0);
SubStr::operation(left, leftLen - len + 1, len, result, resultValueVector);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right has similar bugs to left, see my comments in left to reproduce it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it

str[i] = toupper(str[i]);
}
return len;
BaseLowerUpperOperation::operation(input, result, resultValueVector, /* isUpper */ true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment should appear after the const true

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Copy link
Collaborator

@acquamarin acquamarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kuzu> match (t:test) return t;
----------
| t.name |
----------
| alice  |
----------
| 人    |
----------

I think we need to change the result printer as well. If it is easy, you can do in your current PR.
Otherwise, open an issue for this, and let aziz handle this.

@anuchak
Copy link
Collaborator Author

anuchak commented Nov 18, 2022

I've fixed the left & right operation bugs.

For the result printer, I tried to fix it but can't find the solution for an issue, even if we count A and as 1 character, when printing them, the later takes up more space than A. We can use the utf8proc library to count chars properly, but it provides no facility find how much space a utf8 character takes up.

I've raised an issue here #1047

@anuchak
Copy link
Collaborator Author

anuchak commented Nov 18, 2022

I ran some queries with duckdb for long utf8 characters and it seems they don't print the characters.
They just cut it short with ellipsis ...

Copy link
Collaborator

@acquamarin acquamarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kuzu> return array_extract('都是大叔大的撒的', 4);
------------------------------------------------
| array_extract('都是大叔大的撒的', 4) |
------------------------------------------------
| ?                                            |
------------------------------------------------
(1 tuple)
Time: 14.64ms (compiling), 0.70ms (executing)
kuzu> 

Have you checked the array_extract function? Please be patient and check the correctness every string functions carefully. All functions that supported by our system are under: src/function/string/operations/include.
BTW: we also need to update the find operation to make it compatible with UTF8.

@anuchak
Copy link
Collaborator Author

anuchak commented Nov 18, 2022

Both list_extract and array_extract didn't have UTF8 char support, that's why they were not working.
I've fixed both of them:

    ------------------------------------------------
    | array_extract('都是大叔大的撒的', 4) |
    ------------------------------------------------
    | 叔                                          |
    ------------------------------------------------
    (1 tuple)
    Time: 19.24ms (compiling), 1.39ms (executing)
    
    -----------------------------------------------
    | list_extract('都是大叔大的撒的', 4) |
    -----------------------------------------------
    | 叔                                         |
    -----------------------------------------------
    (1 tuple)
    Time: 8.62ms (compiling), 0.60ms (executing)

Also added tests for +ve, -ve & zero index for extract functions.

For the find function, not sure which function are you referring to ? There is no such function in the catalog currently.
Also I checked DuckDB's find function code (the needle haystack algorithm) and they have no UTF8 specific code there.

buf[l.pos] = aux;
if (l.pos != l.len - 1)
l.pos++;
char tempBuffer[128];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we always use tmp instead of temp for variable names

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed it

@anuchak anuchak merged commit 025be59 into master Nov 20, 2022
@anuchak anuchak deleted the utf8 branch November 20, 2022 14:43
@anuchak anuchak restored the utf8 branch November 20, 2022 14:43
@anuchak anuchak deleted the utf8 branch November 21, 2022 01:14
@ray6080 ray6080 changed the title merge utf8 to master UTF-8 string Jan 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants