-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF-8 string #1037
UTF-8 string #1037
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- There is a bug in lpad/rpad operation when the count is a negative number. Also add tests to cover these two cases.
- Shell bug:
a. If we are typing utf-8 strings in the shell, the ctrl-left,ctrl-right is not working.
b. If we are typing utf-8 strings in the shell, the number of characters that can fit in the screen seems incorrect:
If you are typing a long utf-8 string in the shell, the left most utf-8 characters will be hidden even if there are still spaces left in the screen.
dataset/tinysnb/schema.cypher
Outdated
@@ -1,5 +1,6 @@ | |||
create node table person (ID INt64, fName StRING, gender INT64, isStudent BoOLEAN, isWorker BOOLEAN, age INT64, eyeSight DOUBLE, birthdate DATE, registerTime TIMESTAMP, lastJobDuration interval, workedHours INT64[], usedNames STRING[], courseScoresPerTerm INT64[][], PRIMARY KEY (ID)); | |||
create node table organisation (ID INT64, name STRING, orgCode INT64, mark DOUBLE, score INT64, history STRING, licenseValidInterval INTERVAL, rating DOUBLE, PRIMARY KEY (ID)); | |||
create node table movies (ID INT64, name STRING, PRIMARY KEY (ID)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this dataset is only used to test utf-8 strings. To make the testing simpler, you can have one simply column name and use name as primary key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes previously ID didn't support chars so had to add that.
I think 2 columns are fine, or else I'll have to go back and change all the tests for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always use the minimum code/dataset for testing, so it is better for you to just keep one column in movies table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've dropped the extra column, now name is the only column and primary key.
The bug for lpad & rpad has been handled. The ctrl + left / right bug has been handled (now we are jumping properly between words). For the characters rendering issue, I have opened an issue here #1042 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to take another look after you fixed the bug in left & right
((uint32_t)max(left.len + right, (int64_t)0)); | ||
int64_t leftLen; | ||
Length::operation(left, leftLen); | ||
int64_t len = (right > 0) ? min(leftLen, right) : max(leftLen + right, (int64_t)0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code still has a bug:
kuzu> return left('123456', 0);
---------------------
| left('123456', 0) |
---------------------
| 123456 |
---------------------
(1 tuple)
Time: 1.18ms (compiling), 0.68ms (executing)
Duckdb:
D select left('123456', 0) from test;
┌───────────────────┐
│ left('123456', 0) │
├───────────────────┤
│ │
└───────────────────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the issue, it is printing empty val now.
for (auto i = 0; i < totalByteLength; i++) { | ||
if (inputString[i] & 0x80) { | ||
int64_t length = 0; | ||
// use grapheme iterator to identify bytes of utf8 char and increment once for each |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capitalize the first character 'u', and add a period at the end of the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
int64_t leftLen; | ||
Length::operation(left, leftLen); | ||
int64_t len = (right > 0) ? min(leftLen, right) : max(leftLen + right, (int64_t)0); | ||
SubStr::operation(left, leftLen - len + 1, len, result, resultValueVector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right has similar bugs to left, see my comments in left to reproduce it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it
str[i] = toupper(str[i]); | ||
} | ||
return len; | ||
BaseLowerUpperOperation::operation(input, result, resultValueVector, /* isUpper */ true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment should appear after the const true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kuzu> match (t:test) return t;
----------
| t.name |
----------
| alice |
----------
| 人 |
----------
I think we need to change the result printer as well. If it is easy, you can do in your current PR.
Otherwise, open an issue for this, and let aziz handle this.
I've fixed the left & right operation bugs. For the result printer, I tried to fix it but can't find the solution for an issue, even if we count I've raised an issue here #1047 |
I ran some queries with duckdb for long utf8 characters and it seems they don't print the characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kuzu> return array_extract('都是大叔大的撒的', 4);
------------------------------------------------
| array_extract('都是大叔大的撒的', 4) |
------------------------------------------------
| ? |
------------------------------------------------
(1 tuple)
Time: 14.64ms (compiling), 0.70ms (executing)
kuzu>
Have you checked the array_extract
function? Please be patient and check the correctness every string functions carefully. All functions that supported by our system are under: src/function/string/operations/include
.
BTW: we also need to update the find
operation to make it compatible with UTF8.
Both list_extract and array_extract didn't have UTF8 char support, that's why they were not working.
Also added tests for +ve, -ve & zero index for extract functions. For the |
tools/shell/linenoise.cpp
Outdated
buf[l.pos] = aux; | ||
if (l.pos != l.len - 1) | ||
l.pos++; | ||
char tempBuffer[128]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we always use tmp
instead of temp
for variable names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed it
utf8 changes to be merged with master