Add support for utf8mb3 #3007

danielbeardsley · 2024-09-03T22:42:53Z

Yes, it's an old encoding that mysql is phasing out. However, some DBs out there still use it and this library shouldn't crash in unexpected ways. Some servers (like ours) have some default connection settings that are still set to utf8mb3 even though all our columns / tables are in utf8mb4.

Ironically, if you run a query that has no results (REPLACE, DELETE, ...) then the metadata in the empty resultset is set to the server's default charset. If that happens to be utf8mb3, this library crashes.

Closes #1398

Co-Author @davidrans
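
To illustrate the failure mode described above, here is a minimal sketch of a collation-id lookup that fails with a descriptive error when an id has no mapping. It is not mysql2's actual code: the table name `knownEncodings`, the helper `resolveEncoding`, and the encoding names are assumptions made for illustration. Only the ids 8 (latin1_swedish_ci), 63 (binary), and 192 (utf8mb3_unicode_ci) come from this thread.

```javascript
// Hypothetical table, not mysql2's real one: collation id -> encoding name.
const knownEncodings = new Map([
  [8, 'latin1'],  // latin1_swedish_ci
  [63, 'binary'], // binary
  // 192 (utf8mb3_unicode_ci) is intentionally absent to mimic the bug.
]);

// Resolve the encoding for a column, failing with a clear message
// instead of handing `undefined` to the decoder.
function resolveEncoding(collationId) {
  const encoding = knownEncodings.get(collationId);
  if (encoding === undefined) {
    throw new RangeError(`No encoding mapped for collation id ${collationId}`);
  }
  return encoding;
}
```

With this table, `resolveEncoding(8)` returns `'latin1'`, while `resolveEncoding(192)` throws a `RangeError`, which is roughly the situation an empty result set with utf8mb3 metadata would trigger.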

codecov · 2024-09-03T23:02:14Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.13%. Comparing base (30064f4) to head (264dfdd).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #3007   +/-   ##
=======================================
  Coverage   88.13%   88.13%           
=======================================
  Files          71       71           
  Lines       12889    12890    +1     
  Branches     1352     1353    +1     
=======================================
+ Hits        11360    11361    +1     
  Misses       1529     1529

Flag | Coverage Δ
-----|-----------
compression-0 | `88.13% <100.00%> (+<0.01%)` ⬆️
compression-1 | `88.13% <100.00%> (+<0.01%)` ⬆️
tls-0 | `87.55% <100.00%> (+<0.01%)` ⬆️
tls-1 | `87.89% <100.00%> (+<0.01%)` ⬆️

wellwelwel · 2024-09-03T23:08:04Z

Thanks, @danielbeardsley 🙋🏻‍♂️

> Closes #2640

Could you explain how these changes affect #2640?

danielbeardsley · 2024-09-03T23:32:22Z

> Could you explain how these changes affect

I realize there likely could be several different causes of that error, but that's the error I got when I had this same utf8mb3 problem in single-connection mode. While using a pool connection, I received the error from #1398.

You're welcome to drop that "Closes" if you think it should stay open.

sidorares · 2024-09-03T23:42:48Z

Let's drop #2640, I'm not convinced it's directly related.
Thanks for the PR, @danielbeardsley!

sidorares · 2024-09-03T23:51:36Z

I'm trying to find some references to confirm the MySQL charset name <-> code <-> iconv charset name mapping.

Looking at https://github.com/mysql/mysql-server/blob/596f0d238489a9cf9f43ce1ff905984f58d227b6/sql/protocol_classic.cc#L406:

  MySQL has a very flexible character set support as documented in
  [Character Set Support](http://dev.mysql.com/doc/refman/5.7/en/charset.html).
  The list of character sets and their IDs can be queried as follows:

<pre>
  SELECT id, collation_name FROM information_schema.collations ORDER BY id;
  +----+-------------------+
  | id | collation_name    |
  +----+-------------------+
  |  1 | big5_chinese_ci   |
  |  2 | latin2_czech_cs   |
  |  3 | dec8_swedish_ci   |
  |  4 | cp850_general_ci  |
  |  5 | latin1_german1_ci |
  |  6 | hp8_english_ci    |
  |  7 | koi8r_general_ci  |
  |  8 | latin1_swedish_ci |
  |  9 | latin2_general_ci |
  | 10 | swe7_swedish_ci   |
  +----+-------------------+
</pre>

  The following table shows a few common character sets.

  Number |  Hex  | Character Set Name
  -------|-------|-------------------
       8 |  0x08 | @ref my_charset_latin1 "latin1_swedish_ci"
      33 |  0x21 | @ref my_charset_utf8mb3_general_ci "utf8mb3_general_ci"
      63 |  0x3f | @ref my_charset_bin "binary"

utf8mb3_general_ci has code 33, which we currently map to cesu8.

sidorares · 2024-09-03T23:53:01Z

For context, CESU-8: https://en.wikipedia.org/wiki/CESU-8
Also see the discussion in #374 (comment).
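
As a rough illustration of why the cesu8 vs. utf8 choice matters, the sketch below encodes a string as CESU-8 by treating every UTF-16 code unit as its own 1- to 3-byte sequence. The helper name `encodeCesu8` is an assumption for this example and is not part of mysql2. For characters in the Basic Multilingual Plane the output is byte-identical to UTF-8; only supplementary characters (surrogate pairs) differ.

```javascript
// Encode a string as CESU-8: each UTF-16 code unit is encoded on its own,
// so a surrogate pair becomes 3 + 3 bytes instead of one 4-byte sequence.
function encodeCesu8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes.push(unit);
    } else if (unit < 0x800) {
      bytes.push(0xc0 | (unit >> 6), 0x80 | (unit & 0x3f));
    } else {
      bytes.push(
        0xe0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3f),
        0x80 | (unit & 0x3f)
      );
    }
  }
  return Buffer.from(bytes);
}

// BMP text: same bytes as UTF-8.
const sameForBmp = encodeCesu8('é€').equals(Buffer.from('é€', 'utf8'));
// Supplementary character: 6 bytes in CESU-8, 4 bytes in UTF-8.
const cesuLength = encodeCesu8('😀').length;
const utf8Length = Buffer.from('😀', 'utf8').length;
```

Here `sameForBmp` is `true`, `cesuLength` is 6, and `utf8Length` is 4.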

davidrans · 2024-09-04T15:21:16Z

This is what I get on our DB. utf8mb3_unicode_ci maps to 192, which is where I got that number:

mysql> show collation WHERE charset = 'utf8mb3';
+-----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation                   | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+-----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb3_bin                 | utf8mb3 |  83 |         | Yes      |       1 | PAD SPACE     |
| utf8mb3_croatian_ci         | utf8mb3 | 213 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_czech_ci            | utf8mb3 | 202 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_danish_ci           | utf8mb3 | 203 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_esperanto_ci        | utf8mb3 | 209 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_estonian_ci         | utf8mb3 | 198 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_general_ci          | utf8mb3 |  33 | Yes     | Yes      |       1 | PAD SPACE     |
| utf8mb3_general_mysql500_ci | utf8mb3 | 223 |         | Yes      |       1 | PAD SPACE     |
| utf8mb3_german2_ci          | utf8mb3 | 212 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_hungarian_ci        | utf8mb3 | 210 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_icelandic_ci        | utf8mb3 | 193 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_latvian_ci          | utf8mb3 | 194 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_lithuanian_ci       | utf8mb3 | 204 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_persian_ci          | utf8mb3 | 208 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_polish_ci           | utf8mb3 | 197 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_romanian_ci         | utf8mb3 | 195 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_roman_ci            | utf8mb3 | 207 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_sinhala_ci          | utf8mb3 | 211 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_slovak_ci           | utf8mb3 | 205 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_slovenian_ci        | utf8mb3 | 196 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_spanish2_ci         | utf8mb3 | 206 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_spanish_ci          | utf8mb3 | 199 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_swedish_ci          | utf8mb3 | 200 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_tolower_ci          | utf8mb3 |  76 |         | Yes      |       1 | PAD SPACE     |
| utf8mb3_turkish_ci          | utf8mb3 | 201 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_unicode_520_ci      | utf8mb3 | 214 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_unicode_ci          | utf8mb3 | 192 |         | Yes      |       8 | PAD SPACE     |
| utf8mb3_vietnamese_ci       | utf8mb3 | 215 |         | Yes      |       8 | PAD SPACE     |
+-----------------------------+---------+-----+---------+----------+---------+---------------+

utf8mb3_unicode_ci is the one our db is using:

mysql> SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_server | utf8mb3 |
+----------------------+---------+
1 row in set (0.00 sec)

+------------------+--------------------+
| Variable_name    | Value              |
+------------------+--------------------+
| collation_server | utf8mb3_unicode_ci |
+------------------+--------------------+
1 row in set (0.00 sec)

sidorares · 2024-09-04T22:41:28Z

@danielbeardsley could you check if there is any missing charset id in addition to 192? I believe everything from your table needs to map to utf8; they differ only in collation / case sensitivity, which does not apply to the driver.
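
The idea of mapping every utf8mb3 collation to a single encoding could be sketched as follows. The ids are copied from the `show collation` output earlier in this thread; the helper `mapCollations`, the variable names, and the choice of `'utf8'` as the target name are assumptions for illustration, not mysql2's actual table format.

```javascript
// Collation ids copied from the `show collation WHERE charset = 'utf8mb3'`
// output in this thread (28 rows).
const UTF8MB3_COLLATION_IDS = [
  33, 76, 83,
  192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203,
  204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215,
  223,
];

// Hypothetical helper: register one encoding name for a group of ids.
function mapCollations(table, ids, encoding) {
  for (const id of ids) {
    table.set(id, encoding);
  }
  return table;
}

// Collation only affects sorting and comparison on the server, so every id
// in the group can share one decoder encoding on the client.
const utf8mb3Encodings = mapCollations(new Map(), UTF8MB3_COLLATION_IDS, 'utf8');
```

After this, `utf8mb3Encodings.get(192)` and `utf8mb3Encodings.get(33)` both return `'utf8'`.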

danielbeardsley · 2024-09-04T23:55:51Z

> could you check if there is any missing charset id in addition to 192

I'm not sure I understand. Missing from where? The lib/constants/encoding_charset.js file only has 40 or so out of 300, so yes.

Oh wait, you mean utilize tools/generate... to print the missing mappings.

Here, I included their ids too:
{"dec8":3,"eucjpms":97,"geostd8":92,"hp8":6,"keybcs2":37,"swe7":10}

Are you suggesting I add these with their ids to encoding_charset.js? I'm not entirely sure I understand this system and what I'm adding.
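
For readers unfamiliar with the check being discussed, the sketch below shows one way to list server charsets that have no entry in a driver-side table. It is not the repository's generator script; `findUnmappedCharsets`, `sampleRows`, and the set of mapped ids are hypothetical, and the sample ids (latin1 8, dec8 3, hp8 6, swe7 10) are taken from this thread.

```javascript
// Hypothetical helper: report server charsets whose id has no entry in a
// driver-side table. `rows` mimics { charset, id } pairs read from the server.
function findUnmappedCharsets(rows, mappedIds) {
  const missing = {};
  for (const { charset, id } of rows) {
    if (!mappedIds.has(id)) {
      missing[charset] = id;
    }
  }
  return missing;
}

// Sample input: pretend only id 8 (latin1) is mapped on the driver side.
const sampleRows = [
  { charset: 'latin1', id: 8 },
  { charset: 'dec8', id: 3 },
  { charset: 'hp8', id: 6 },
  { charset: 'swe7', id: 10 },
];
const report = findUnmappedCharsets(sampleRows, new Set([8]));
console.log(JSON.stringify(report)); // {"dec8":3,"hp8":6,"swe7":10}
```

The output has the same shape as the JSON above: charset name as the key, id as the value.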

danielbeardsley mentioned this pull request Sep 3, 2024

mysql2 ignores charset on pool options #1325

Open

danielbeardsley mentioned this pull request Sep 3, 2024

DB: fix DB connection and query format iFixit/pulldasher#413

Merged
