Allow attribute DISABLE-COMPRESSIONS in UNICODE collations #6915

asfernandes · 2021-08-02T20:52:18Z

Written by @javihonza in firebird-support list https://groups.google.com/g/firebird-support/c/VCXnWp0IZVw:

Hello,
we have a speed problem when using national COLLATE on UTF8 columns.

If we use UTF8 without COLLATE the speed problem does not arise. The problem is not when using ANSI with national COLLATE.

The speed problem is both in CASE SENSITIVE and CASE INSENSITIVE. Tested on FB3 (UCI 6.9) and FB 4 (default UCI).

The speed problem prevents us from switching to unicode (ANSI -> UTF8).
Please who should we contact to solve the problem?

How to simulate the problem:

create collation UNICODE_CSCZ_CI
   for UTF8  
   from UNICODE  
   case insensitive  
   'LOCALE=cs_CZ'
;
   
create collation UNICODE_CSCZ_CS
   for UTF8  
   from UNICODE  
   case sensitive  
   'LOCALE=cs_CZ'
;

CREATE TABLE TEST1M (
  ANSI_CZ VARCHAR(10)  CHARACTER SET WIN1250 COLLATE PXW_CSY,
  UNICODE_CS_CZ VARCHAR(10)  CHARACTER SET UTF8 COLLATE UNICODE_CSCZ_CS,
  UNICODE_CI_CZ VARCHAR(10) CHARACTER SET UTF8 COLLATE UNICODE_CSCZ_CI,
  UNICODE_CS VARCHAR(10) CHARACTER SET UTF8,
  UNICODE_CI VARCHAR(10) CHARACTER SET UTF8 COLLATE UNICODE_CI
);

CREATE OR ALTER PROCEDURE GetStr(AORDERID BIGINT)
RETURNS (AResult CHAR(10)) AS
declare variable Base36Chars CHAR(36);
declare variable mResult VARCHAR(10);
declare variable ID BIGINT;
declare variable I INT;
BEGIN
    Base36Chars = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    mResult = '';
    AResult = mResult;
    ID = AORDERID;
    WHILE (ID > 0) DO
    BEGIN
      I = MOD(ID, 36);
      ID = ID / 36;
      mResult = mResult || SubString(Base36Chars from I + 1 for 1);
    END
    AResult = LEFT(mResult || '0000000', 7);
  Suspend;
END;

-- Generate test string data
-- 000000, 100000...900000...A00000...Z00000,
-- 010000, 110000...910000...A10000...Z10000,
-- ...

EXECUTE BLOCK
AS
  DECLARE ROWSCOUNT INT = 1000000;
  DECLARE I INT = 0;
  DECLARE C INT = 0;
  DECLARE Str VARCHAR(10);
BEGIN
  WHILE (C < ROWSCOUNT) DO
  BEGIN
    SELECT AResult from GetStr(:I) into :Str;
    -- Skip Y, Z
    IF ((LEFT(Str, 1) <> 'Y') AND (LEFT(Str, 1) <> 'Z')) THEN BEGIN
      INSERT INTO TEST1M(ANSI_CZ, UNICODE_CS_CZ, UNICODE_CI_CZ, UNICODE_CS, UNICODE_CI) VALUES (:Str, :Str, :Str, :Str, :Str);
      C = C + 1;
    END
    I = I + 1;
  END
END;

CREATE INDEX TEST1M_ANSI_CZ ON TEST1M (ANSI_CZ);
CREATE INDEX TEST1M_UNICODE_CS_CZ ON TEST1M (UNICODE_CS_CZ);
CREATE INDEX TEST1M_UNICODE_CI_CZ ON TEST1M (UNICODE_CI_CZ);
CREATE INDEX TEST1M_UNICODE_CS ON TEST1M (UNICODE_CS);
CREATE INDEX TEST1M_UNICODE_CI ON TEST1M (UNICODE_CI);

SELECT COUNT(*) FROM TEST1M;
  -- Time 269ms                           <<<<<<< OK >>>>>>>
  -- Result 1000000

--######################################### Scenario use WHERE >= #########################################
-- Problem only in CZECH collate case insensitive
   
  SELECT ANSI_CZ FROM TEST1M
  WHERE ANSI_CZ >= 'Z'
  ORDER BY ANSI_CZ;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result nothing                     
 
  SELECT UNICODE_CS FROM TEST1M
  WHERE UNICODE_CS >= 'Z'
  ORDER BY UNICODE_CS;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result nothing                     
 
  SELECT UNICODE_CI FROM TEST1M
  WHERE UNICODE_CI >= 'Z'
  ORDER BY UNICODE_CI;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result nothing

  SELECT UNICODE_CS_CZ FROM TEST1M
  WHERE UNICODE_CS_CZ >= 'Z'
  ORDER BY UNICODE_CS_CZ;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result nothing

  SELECT UNICODE_CI_CZ FROM TEST1M        
  WHERE UNICODE_CI_CZ >= 'Z'
  ORDER BY UNICODE_CI_CZ;
    -- Time: 4,294s                       <<<<<<< COLLATE CASE INSENSITIVE HEAR IS PERFORMANCE PROBLEM FOR VALUE "Z" 4,294s WHY? VALUE "Y" IS OK 0ms >>>>>>>
    -- Result nothing

  SELECT UNICODE_CI_CZ FROM TEST1M
  WHERE UNICODE_CI_CZ >= 'Y'
  ORDER BY UNICODE_CI_CZ;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result nothing
 
  SELECT FIRST 1 UNICODE_CI_CZ FROM TEST1M        
  WHERE UNICODE_CI_CZ >= 'C'
  ORDER BY UNICODE_CI_CZ;
    -- Time: 1,531s                       <<<<<<< collate case INSENSITIVE NEXT PERFOMANCE PROBLEM HAVE VALUE "C" 1,531s WHY? VALUE "D" IS OK 0ms >>>>>>>
    -- Result C000000   

  SELECT FIRST 1 UNICODE_CI_CZ FROM TEST1M        
  WHERE UNICODE_CI_CZ >= 'D'
  ORDER BY UNICODE_CI_CZ;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result D000000      
 
--######################################### Scenario use WHERE like #########################################   
-- Problem in CZECH collate case sensitive and insensitive    
  SELECT ANSI_CZ FROM TEST1M
  WHERE ANSI_CZ LIKE 'Z%'
  ORDER BY ANSI_CZ;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result nothing      
                   
  SELECT UNICODE_CS FROM TEST1M
  WHERE UNICODE_CS LIKE 'Z%'
  ORDER BY UNICODE_CS;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result nothing                     

  SELECT UNICODE_CI FROM TEST1M
  WHERE UNICODE_CI LIKE 'Z%'
  ORDER BY UNICODE_CI;
    -- Time: 0ms                          <<<<<<< OK >>>>>>>
    -- Result nothing

  SELECT UNICODE_CS_CZ FROM TEST1M
  WHERE UNICODE_CS_CZ LIKE 'Z%'
  ORDER BY UNICODE_CS_CZ;
    -- Time: 4,247s                       <<<<<<< collate case SENSITIVE HEAR IS PERFOMANCE PROBLEM FOR VALUE "Z" 4,247s WHY? VALUE "Y" IS OK 0ms >>>>>>>
    -- Result nothing

  SELECT UNICODE_CS_CZ FROM TEST1M
  WHERE UNICODE_CS_CZ LIKE 'Y%'
  ORDER BY UNICODE_CS_CZ;
    -- Time: 0ms                          <<<<<<< FOR letter "Y" IS NOT PROBLEM :) WHY? >>>>>>>
    -- Result nothing
 
  SELECT UNICODE_CI_CZ FROM TEST1M        
  WHERE UNICODE_CI_CZ LIKE 'Z%'
  ORDER BY UNICODE_CI_CZ;
    -- Time: 4,52s                        <<<<<<< HEAR IS PERFORMANCE PROBLEM WHY? >>>>>>>
    -- Result nothing

  SELECT UNICODE_CI_CZ FROM TEST1M
  WHERE UNICODE_CI_CZ LIKE 'Y%'
  ORDER BY UNICODE_CI_CZ;
    -- Time: 0ms                          <<<<<<< FOR letter "Y" IS NOT PROBLEM :) WHY? >>>>>>>
    -- Result nothing

The text was updated successfully, but these errors were encountered:

asfernandes · 2021-08-04T14:57:08Z

I've changed the title of this issue so it may be considered an improvement instead of a bug.

Attribute DISABLE-COMPRESSIONS will be allowed for UNICODE collations, hence, without compressions (contractions), the problem with search keys do not exist.

Of course, that will change sort keys. One must analyze pros and cons of this.

Usage will be:

create collation UNICODE_CSCZ_CI
   for UTF8  
   from UNICODE  
   case insensitive  
   'LOCALE=cs_CZ;DISABLE-COMPRESSIONS=1';

create collation UNICODE_CSCZ_CS
   for UTF8  
   from UNICODE  
   case sensitive  
   'LOCALE=cs_CZ;DISABLE-COMPRESSIONS=1';

…ollations.

javihonza · 2021-08-23T13:07:00Z

Hi,
i try use DISABLE-COMPRESSIONS=1 (Snapshot Builds Firebird-4.0.1.2571-0_x64).

Speed is now OK.

Unfortunately, we cannot use the "DISABLE-COMPRESSIONS=1" option because it changes the sorting behavior, which is then incorrect (https://firebirdsql.org/refdocs/langrefupd21-ddl-collation.html Disables compressions (aka contractions). Compressions cause certain character sequences to be sorted as atomic units, e.g. Spanish c+h as a single character ch.).

How to simulate the problem:
`
create collation UNICODE_CSCZ_CI
for UTF8
from UNICODE
case insensitive
'LOCALE=cs_CZ;DISABLE-COMPRESSIONS=1';

create collation UNICODE_CSCZ_CS
for UTF8
from UNICODE
case sensitive
'LOCALE=cs_CZ;DISABLE-COMPRESSIONS=1';

CREATE TABLE TEST_UNICODE_COLLATE (
FIELD_WIN1250_PXW_CSY_CS VARCHAR(30) CHARACTER SET WIN1250 COLLATE PXW_CSY,
FIELD_UTF8_UNICODE_CSCZ_CI VARCHAR(30) CHARACTER SET UTF8 COLLATE UNICODE_CSCZ_CI,
FIELD_UTF8_UNICODE_CSCZ_CS VARCHAR(30) CHARACTER SET UTF8 COLLATE UNICODE_CSCZ_CI
);

INSERT INTO TEST_UNICODE_COLLATE (FIELD_WIN1250_PXW_CSY_CS, FIELD_UTF8_UNICODE_CSCZ_CI, FIELD_UTF8_UNICODE_CSCZ_CS)
VALUES ('centimetr', 'centimetr', 'centimetr');

INSERT INTO TEST_UNICODE_COLLATE (FIELD_WIN1250_PXW_CSY_CS, FIELD_UTF8_UNICODE_CSCZ_CI, FIELD_UTF8_UNICODE_CSCZ_CS)
VALUES ('číslo', 'číslo', 'číslo');

INSERT INTO TEST_UNICODE_COLLATE (FIELD_WIN1250_PXW_CSY_CS, FIELD_UTF8_UNICODE_CSCZ_CI, FIELD_UTF8_UNICODE_CSCZ_CS)
VALUES ('daň', 'daň', 'daň');

INSERT INTO TEST_UNICODE_COLLATE (FIELD_WIN1250_PXW_CSY_CS, FIELD_UTF8_UNICODE_CSCZ_CI, FIELD_UTF8_UNICODE_CSCZ_CS)
VALUES ('chléb', 'chléb', 'chléb');

INSERT INTO TEST_UNICODE_COLLATE (FIELD_WIN1250_PXW_CSY_CS, FIELD_UTF8_UNICODE_CSCZ_CI, FIELD_UTF8_UNICODE_CSCZ_CS)
VALUES ('efekt', 'efekt', 'efekt');

SELECT FIELD_WIN1250_PXW_CSY_CS
FROM TEST_UNICODE_COLLATE
ORDER BY FIELD_WIN1250_PXW_CSY_CS;
-- Result OK
-- centimetr
-- číslo
-- daň
-- efekt
-- chléb <<<<<<< OK >>>>>>>

SELECT FIELD_UTF8_UNICODE_CSCZ_CI
FROM TEST_UNICODE_COLLATE
ORDER BY FIELD_UTF8_UNICODE_CSCZ_CI;
-- Result bad sorting text "chléb"
-- centimetr
-- chléb <<<<<<< WRONG POSSITION >>>>>>>
-- číslo
-- daň
-- efekt

SELECT FIELD_UTF8_UNICODE_CSCZ_CS
FROM TEST_UNICODE_COLLATE
ORDER BY FIELD_UTF8_UNICODE_CSCZ_CS;
-- Result bad sorting text "chléb"
-- centimetr
-- chléb <<<<<<< WRONG POSSITION >>>>>>>
-- číslo
-- daň
-- efekt

`

asfernandes · 2021-08-24T19:55:24Z

I explained the problem and the trade-off.

Your initial problem (test) was also a bit artificial, when you wanted to search starting with a single letter.

Depending on your data volume (or your artificial test data), you will have problem with whatever implementation.

pcisar · 2021-08-25T09:39:11Z

Adriano, the problem is that the trade-off is unacceptable for certain languages. So it's more like a workaround band-aid around "speed problem" than real solution.

pcisar · 2021-09-03T15:12:10Z

@asfernandes, did you consider a better solution? Changing sort keys is really unacceptable for any language that uses contractions (like Czech), so users that use it don't have a real fix as the trade-off decision is between very bad performance or incorrect sorting. It would be much appreciated if better solution would appear in next maintenance release.

pavel-zotov · 2021-11-04T08:48:35Z

::: test details :::
Only ability to use 'DISABLE-COMPRESSION' in attributes list is checked here.
Performance comparison with and without this attribute will be checked in separate test.

asfernandes added affect-version: 3.0.7 affect-version: 4.0.0 affect-version: 5.0 Initial labels Aug 2, 2021

asfernandes self-assigned this Aug 2, 2021

asfernandes mentioned this issue Aug 2, 2021

Faster execution of indexed STARTING WITH with UNICODE collation #6872

Closed

asfernandes changed the title ~~Slow indexed access when last characters of the key is prefix of collation's contractions with UNICODE collation~~ Allow attribute DISABLE-COMPRESSIONS in UNICODE collations Aug 4, 2021

asfernandes added the type: improvement label Aug 4, 2021

asfernandes added a commit that referenced this issue Aug 4, 2021

Improvement #6915 - Allow attribute DISABLE-COMPRESSIONS in UNICODE c…

2af9ded

…ollations.

asfernandes added a commit that referenced this issue Aug 4, 2021

Improvement #6915 - Allow attribute DISABLE-COMPRESSIONS in UNICODE c…

cfc09f7

…ollations.

asfernandes added fix-version: 4.0.1 fix-version: 5.0 Beta 1 labels Aug 4, 2021

asfernandes closed this as completed Aug 4, 2021

pavel-zotov added the qa: done successfully label Aug 5, 2021

pavel-zotov added qa: done with caveats and removed qa: done successfully labels Nov 4, 2021

asfernandes mentioned this issue Jan 4, 2022

Improve indexed lookup speed of strings when the last keys characters are part of collated contractions #7093

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow attribute DISABLE-COMPRESSIONS in UNICODE collations #6915

Allow attribute DISABLE-COMPRESSIONS in UNICODE collations #6915

asfernandes commented Aug 2, 2021 •

edited

Loading

asfernandes commented Aug 4, 2021

javihonza commented Aug 23, 2021

asfernandes commented Aug 24, 2021

pcisar commented Aug 25, 2021

pcisar commented Sep 3, 2021

pavel-zotov commented Nov 4, 2021

Allow attribute DISABLE-COMPRESSIONS in UNICODE collations #6915

Allow attribute DISABLE-COMPRESSIONS in UNICODE collations #6915

Comments

asfernandes commented Aug 2, 2021 • edited Loading

asfernandes commented Aug 4, 2021

javihonza commented Aug 23, 2021

asfernandes commented Aug 24, 2021

pcisar commented Aug 25, 2021

pcisar commented Sep 3, 2021

pavel-zotov commented Nov 4, 2021

asfernandes commented Aug 2, 2021 •

edited

Loading