Improve UTF-8 decoding and encoding functions #410

chqrlie · 2024-05-19T12:46:49Z

Ensure proper UTF-8 encoding (1 to 4 bytes).
Handle invalid encodings (return 0xFFFD and consume a single byte). Individually encoded surrogate code points are accepted.
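
The decoding rules above can be illustrated with a minimal sketch. This is not the code from this PR: `sketch_utf8_decode`, its signature and the `used` out-parameter are hypothetical names chosen for the example, and the exact validation order in QuickJS-ng may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the QuickJS-ng implementation.
 * Decodes one code point from p[0..len) and stores the number of
 * bytes consumed in *used. Any invalid sequence yields 0xFFFD and
 * consumes exactly one byte. */
static uint32_t sketch_utf8_decode(const uint8_t *p, size_t len, size_t *used)
{
    /* smallest code point that legitimately needs n continuation bytes */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t n, i;

    assert(len > 0);                 /* caller must supply at least one byte */
    c = p[0];
    *used = 1;                       /* default: consume a single byte */
    if (c < 0x80)
        return c;                    /* plain ASCII */
    if (c >= 0xC2 && c <= 0xDF) {
        n = 1; c &= 0x1F;            /* 2-byte sequence */
    } else if (c >= 0xE0 && c <= 0xEF) {
        n = 2; c &= 0x0F;            /* 3-byte sequence */
    } else if (c >= 0xF0 && c <= 0xF4) {
        n = 3; c &= 0x07;            /* 4-byte sequence */
    } else {
        return 0xFFFD;               /* stray continuation or invalid lead byte */
    }
    if (n >= len)
        return 0xFFFD;               /* truncated sequence */
    for (i = 1; i <= n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0xFFFD;           /* missing continuation byte */
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min_cp[n] || c > 0x10FFFF)
        return 0xFFFD;               /* overlong form or out of range */
    *used = n + 1;
    return c;                        /* surrogates 0xD800..0xDFFF pass through */
}
```

A caller would loop over a buffer, advancing by `used` after each call, so a single bad byte produces one 0xFFFD and decoding resynchronizes on the next byte.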

- add `utf8_scan()` to analyze a byte array for UTF-8 contents: it detects invalid encodings and computes the number of codepoints and the content kind (plain ASCII, 8-bit, 16-bit or larger codepoints).
- add `utf8_encode_len(c)` to compute the number of bytes needed to encode `c`
- rename `unicode_to_utf8` as `utf8_encode`
- rename `unicode_from_utf8` as `utf8_decode`
- add `utf8_decode_buf8(dest, size, src, len)` to decode a UTF-8 encoded byte array known to contain only ASCII and 8-bit codepoints.
- add `utf8_decode_buf16(dest, size, src, len)` to decode a UTF-8 encoded byte array into an array of 16-bit codepoints, using UTF-16 surrogate pairs for non-BMP codepoints.
- add `utf8_encode_buf8(dest, size, src, len)` to encode an array of 8-bit codepoints as a UTF-8 encoded null-terminated string
- add `utf16_encode_buf8(dest, size, src, len)` to encode an array of 16-bit codepoints (including surrogate pairs) as a UTF-8 encoded null-terminated string
- detect invalid UTF-8 encoding in the RegExp parser
- simplify `JS_AtomGetStrRT` and `JS_NewStringLen` using the above functions
- simplify UTF-8 decoding and error testing
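
As a rough illustration of what a length helper and a single-codepoint encoder in this style might look like, here is a minimal sketch. The names `sketch_utf8_encode_len` and `sketch_utf8_encode` are hypothetical, and treating out-of-range values as U+FFFD is an assumption of this example, not something stated in the PR description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the QuickJS-ng implementation.
 * Returns the number of bytes needed to encode c as UTF-8. */
static size_t sketch_utf8_encode_len(uint32_t c)
{
    if (c < 0x80)
        return 1;
    if (c < 0x800)
        return 2;
    if (c < 0x10000)
        return 3;
    if (c < 0x110000)
        return 4;
    return 3;   /* assumption: out-of-range values are encoded as U+FFFD */
}

/* Encodes c into buf (at least 4 bytes) and returns the byte count.
 * Surrogate code points are encoded individually as 3 bytes. */
static size_t sketch_utf8_encode(uint8_t *buf, uint32_t c)
{
    assert(buf != NULL);
    if (c > 0x10FFFF)
        c = 0xFFFD;             /* assumption: replace invalid input */
    if (c < 0x80) {
        buf[0] = (uint8_t)c;
        return 1;
    }
    if (c < 0x800) {
        buf[0] = (uint8_t)(0xC0 | (c >> 6));
        buf[1] = (uint8_t)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        buf[0] = (uint8_t)(0xE0 | (c >> 12));
        buf[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = (uint8_t)(0x80 | (c & 0x3F));
        return 3;
    }
    buf[0] = (uint8_t)(0xF0 | (c >> 18));
    buf[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
    buf[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
    buf[3] = (uint8_t)(0x80 | (c & 0x3F));
    return 4;
}
```

Keeping the length computation separate lets a caller size a destination buffer in a first pass and fill it in a second pass without reallocating.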

This commit is a preliminary step for another PR that fixes some JSAtom creation inconsistencies and inefficiencies.

saghul

I only did a shallow review, but I trust you and the tests are happy :-)

chqrlie force-pushed the improve-utf8-functions branch from 50da583 to 1c6a98a · May 19, 2024 12:50

saghul approved these changes May 21, 2024

chqrlie merged commit 1baa676 into quickjs-ng:master May 21, 2024
47 checks passed
