Issue-630: Changes to define the "CASEI" function for supporting case-insensitive string comparisons. #641

pvretano · 2021-10-25T03:27:30Z

Still in progress pending discussion in the SWG. Need to find out:

Is the name CASEI OK?
Do we need a similar function named ACCENTI for accent-insensitive matching? I have included it in the BNF and JSON encodings of CQL but I have not discussed it in the specification.
If we do need ACCENTI, should that be in its own conformance class or parse of the same class as CASEI?

NOTE: In the processed of adding the CASEI function I noticed that the way we had defined the UPPER/LOWER functions in the JSON was not correct. UPPER/LOWER and CASEI are suppose to be built-in functions and so they should be encoded as a function would be but that was not the case. So, I fixed that.

…-insensitive string comparisons.

cportele

I have added a few topics for discussion.

Regarding the questions:

The name CASEI works for me.
Good question whether we want to support ACCENTI, too. Probably yes. If we add it, I think it should be in its own conformance class. Some backends may not support accent collation.

cql2/standard/schema/cql2.bnf

cql2/standard/clause_7_enhanced.adoc

cql2/standard/schema/cql2.yml

cportele · 2021-10-25T14:24:42Z

Meeting 2021-10-25: See the review comments for their resolution. Regarding the open questions:

We will use the name CASEI.
We will include ACCENTI in its own requirement/conformance class. (ACCENTI includes all kinds of diacritics.)
We also discussed including a combined operator, but decided to not do this in this version. Combining both requires CASEI(ACCENTI("Größer")).
@pvretano will update the PR based on the decision. After that it should be ready to merge.

philvarner · 2021-12-03T17:08:25Z

I have a concern about the semantics of and complexity of implementing CASEI. My main concern is that the use of CASEI in an expression has a non-local effect on the evaluation of the expression, such that it's not actually a function. With any user-defined function, the function evaluates to a value. However, CASEI doesn't evaluate to a single value, but rather modifies the semantics of the operators in the expression it's used in.

One of the examples is:

CASEI(road_class) IN (CASEI('Οδος'),CASEI('Straße'))

but this has the same semantics as:

CASEI(road_class) IN ('Οδος','Straße')

since road_class=straße would evaluate "true" in either expression.

The presence of CASEI changes the semantics of the operators, rather than the value of the argument to the function. I feel like this is going to make it very difficult for implementers to get right, much more so than just having UPPER and LOWER functions. This gets complicated, especially when multiple functions are involved. For example, I'm not sure what the semantics of this expression would be:

FOO(BAR(CASEI(road_class))) IN ('Οδος','Straße')

The CASEI should do something, but it seems odd that FOO(BAR(CASEI(road_class))) and CASEI(FOO(BAR(road_class))) would have the same semantics.

pvretano · 2021-12-03T17:29:39Z

@philvarner where did you get that CASEI() has a non-local effect? That is not the intent and CASEI() has no effect on the semantics of the operator. The intent of CASEI() is to convert its argument into a case insensitive form (using the appropriate case folding rules) that can be compared to another similarly converted value. So, CASEI('straße')<> 'Straße'.

philvarner · 2021-12-03T18:55:33Z

Ah, so CASEI returns a new type that is this insensitive format, and then that can only be compared against other values of that same type. So something like CASEI('straße') == 'Straße' would yield a type-checking error because the left hand is a CASEI object and the right is a string.

philvarner · 2021-12-03T18:58:30Z

One place in the spec that made me think it was a non-local effect is "Although the CASEI function is used in CQL2 to trigger case insensitive matching" -- the use of the word "trigger" here makes me think it's putting the expression into a different mode, rather than the CASEI function returning a (new) special type.

pvretano · 2021-12-03T19:22:27Z

@philvarner CASEI('straße') == 'Straße' would evaluate to false. The CASEI() function returns a string but one to which case folding rules have been applied. So CASEI('straße') would generate the string 'STRASSE' and CASEI('Οδος') would generate the string 'ΟΔΟΣ'. Clearly I have more explanation to add to the text of CQL2! ... and also to clarify the bit that made you think that the scope of CASEI was beyond its argument. I will endeavour to clarify a bit more.

philvarner · 2021-12-06T13:21:59Z

Yes, I think describing case folding would be good, since I've been at this a long time I've never used it, and instead just always done lowercasing, which is probably a fault of my focus primarily on English. Glad I know about it now!

The way I would describe it is that it returns a string typed representation of the input string that is guaranteed to be equal to any other case insensitive representation of that string. The Unicode specs I read say this is typically lowercase (in contrast to your example of it being uppercase 'STRASSE'), but I would stress to users that they should ONLY compare the results of two CASEI function expressions if they actually want to ensure a correct comparison, because the exact rules of conversion are opaque to them. So the only "durable" comparison is:

CASEI(prop1) == CASEI('Straße')

CASEI(prop1) == 'strasse' might work, but it's not guaranteed to work across implementations or between versions of the same implemenation.

cql2/standard/schema/examples/example27.txt

Co-authored-by: Phil Varner <phil@philvarner.com>

comparisons.

pvretano · 2021-12-06T18:41:15Z

06-DEC-2021: There is an open question about whether CASEI() and ACCENTI() should be in their own conformance classes and we need feedback on that. @cportele and @pvretano feel that English implementations probably only care about CASEI() while European implementors would want ACCENTI() too. So based on that, separate conformance classes would make sense. If anyone has any feeling about that, please add your comments here and the SWG can circle back in the a future meeting.

philvarner · 2021-12-06T20:04:32Z

I'm usually an advocate finer-grained conformance classes, but I don't see a strong case here for separating them. English-only speakers may only see the use case for needing case-insensitive comparison, but the implementations they make will likely also have European language users and data who need the accent-insensitive comparison. I don't know of any databases that only have support for case and not accent. There's always the fall-back of simply implementing CASEI or ACCENTI as a custom function.

m-mohr · 2021-12-07T11:50:44Z

From a German POV, I find the differentiation really weird. ß is supported in CASEI, but ä, ö and ü are not, which is surprising as from a German POV they are basically equivalent. The naming is also not ideal. In German (might be biased), I usually think about the French accents like á and à etc and not about ä and ö if I speak about accents. We call these umlauts and I doubt people will really make the connection.

Overall, it feels like an implementation detail is shifted towards the user. I'd strongly step back from the English centric standpoint and think more global here. Aren't most environments nowadays supporting Unicode related functions? And if a back-end really can't support it and thinks he only delivers in English, then he can still simply not support it in CASEI and will not run into problems.

So my +1 to merging ACCENTI and CASEI.

m-mohr · 2021-12-07T12:55:21Z

I'm coming from a (non-developer) user point of view here. Why does a novice user need to understand the difference between CASEI and ACCENTI? As a normal user I just want to say that I don't care about casing and all the details should be handled by the implementation. If I let the users think about it, I make a complexity of the implementation (to support accents etc) be a complexity for the user... But maybe I'm oversimplifying things?

rsmith013 · 2021-12-07T12:57:58Z

@m-mohr for a user, this would definitely be nicer.

cportele · 2021-12-07T13:06:35Z

But maybe I'm oversimplifying things?

I think so. Normalization of characters is orthogonal to case folding. If we merge the two, this would create issues for the cases where one wants one or the other, but not both.

Do databases usually offer a convenience function that combines both? Even if not, should we provide one?

aznan2 · 2021-12-07T13:12:53Z

In Swedish, ÅÄÖ are not just A and O with accents but their own separate letters. The words "Al" and "Ål" have completely different meaning, so it would make sense to me to want to want to search in a case insensitive way while still keeping the accents, so that I don't get a hit on "Ål" when searching for "al".

m-mohr · 2021-12-07T13:25:28Z

Wait, I was only talking about casing, not about allowing ä for a etc. I wasn't even aware that the latter is tackled by those functions and it's indeed completely different thing and should likely be a separate function, but casing is casing regardless of what I pass in (I think).

Maybe I misunderstood what the functions do.

I'd expect that with CASEI:

Süd is equal to SÜD, but is NOT equal to Sud (or Sued)

What I could imagine is that ACCENTI (or whatever it's then called) allows:

Süd is equal to Sud, but is NOT equal to SÜD (or SUED)

That seperates two different use cases clearly and makes it easy to use (which might not be easy to implement).
That's just my not very informed opinion on how I'd expect it to work from a German speaker POV. This might not be what others expect...

I don't have a strong preference whether it's one or two conformance classes.

pvretano · 2021-12-07T13:45:35Z

@cportele et all, FYI, most databases that I surveyed before creating this PR handle case insensitivity separately from accent insensitivity and many databases have a convention where letters such as '_AI' and '_CI' are appended to the collation name to indicate whether the collation is case insensitive, accent insensitive or both. The default collation for MySQL, for example, is 'latin1_swedish_ci' so it is a case insensitive -- but not accent insensitive -- collation.

aznan2 · 2021-12-07T14:01:27Z

@m-mohr Sorry, my bad. I read "+1 to merging ACCENTI and CASEI" as merging the two functions, not the conformance classes.

pvretano · 2021-12-07T14:10:27Z

I'm agnostic about whether CASEI and ACCENTI are in one conformance class or two. I can see both sides. On the one hand these are related functions and so existing in one conformance class makes sense. One the other hand, languages have a mix of case and accent requirements and that sorta points to separate conformance classes ... although it seems kinda short sighted to develop software for a specific language or set of languages ... no!? ;) I'm happy to go with the consensus.

philvarner · 2021-12-07T16:13:15Z

To clarify, I think they must be two distinct functions, but I think it's okay to put them in the same conformance class, but I also don't have a problem with putting them in different conformance classes because of the uncertainty about whether users can or want to support both of them.

philvarner · 2021-12-07T16:23:59Z

One other comment to add about what the purpose of these two functions is. Both are useful to operate across data that has not been normalized or has been normalized to values that are different than they should be.

CASEI is useful when I have different data that sets a field value to "PLANET", "Planet", or "planet" and I want to match either without having to enumerate all the variations.

ACCENTI is useful when either:
(1) accents (or, more generally, diacritics not available in ASCII) were dropped when indexing the data, where the value is actually "wrong" in many cases -- e.g., even in Spanish (the language with the smallest set of characters outside ASCII), papa is potato and papá is father, so they're not interchangeable. Some of my data could have either been rewritten to ASCII and dropped the á, so I want to match on either even though what I really want is only "papá", or
(2) I'm an English-only speaker who doesn't know how to type a á and just want to use ASCII because I know that my data doesn't have both of these words in it.

heidivanparys · 2022-01-17T10:37:57Z

cql2/standard/clause_4_terms_and_definitions.adoc

+[[collation-def]]
+==== collation
+a set of rules that indicate how to compare and sort character string data
+the process of ordering units of textual information https://www.unicode.org/glossary/#collation[Glossary of Unicode Terms]


The term “collation” does not seem to be used in the remainder of the document. Should it be removed from the terms & definitions?

Meeting 2022-01-17: Use the term in the description and keep the definition.

cql2/standard/requirements/accent-case-insensitive-comparison/REQ_accenti-builtin-function.adoc

heidivanparys · 2022-01-17T10:42:57Z

cql2/standard/requirements/accent-case-insensitive-comparison/REQ_accenti-builtin-function.adoc

+^|A |The server SHALL support a built-in function named `ACCENTI`.
+^|B |The function SHALL accept one argument that can be a character string literal, the name of a property that evaluates to a string literal or a function that returns a string literal (see rules `characterLiteral`, `propertyName`, `function`).
+^|C |The function SHALL return a character string literal.
+^|D |The function SHALL implement https://www.w3.org/TR/charmod-norm/#unicodeNormalization[unicode normalization] described in the implementation guidelines of https://www.unicode.org/versions/Unicode14.0.0[The Unicode Standard, Version 14.0] (see https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf[clause 5.6 Normalization]).


Consider adding the term “Unicode normalization” to the terms & definitions.

Unicode normalization
normalization
process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence

Source: https://www.unicode.org/glossary/#normalization

Meeting 2022-01-17: Agreed.

cql2/standard/requirements/accent-case-insensitive-comparison/REQ_casei-builtin-function.adoc

heidivanparys · 2022-01-17T10:46:22Z

cql2/standard/requirements/accent-case-insensitive-comparison/REQ_casei-builtin-function.adoc

+^|A |The server SHALL support a built-in function named `CASEI`.
+^|B |The function SHALL accept one argument that can be a character string literal, the name of a property that evaluates to a string literal or a function that returns a string literal (see rules `characterLiteral`, `propertyName`, `function`).
+^|C |The function SHALL return a character string literal.
+^|D |The function SHALL implement the https://www.w3.org/TR/charmod-norm/#definitionCaseFolding[full case folding] algorithm defined in the implementation guidelines of https://www.unicode.org/versions/Unicode14.0.0[The Unicode Standard, Version 14.0] (see https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf[clause 5.18 Case Mappings, sub-clause Caseless Matching]).


Consider adding the term “Unicode case folding” to the terms & definitions:

Unicode case folding
case folding
process of making two texts which differ only in case identical for comparison purposes

Note: Case folding is meant for the purpose of string matching.

Source: https://www.w3.org/TR/charmod-norm/#definitionCaseFolding

Consider adding and using the term “Unicode case-insensitive matching” as well.

Meeting 2022-01-17: Agreed.

…REQ_accenti-builtin-function.adoc Meeting 2022-01-17: Agreed. Co-authored-by: Heidi Vanparys <heidi.vanparys@gmail.com>

…REQ_casei-builtin-function.adoc Meeting 2022-01-17: Agreed. Co-authored-by: Heidi Vanparys <heidi.vanparys@gmail.com>

cportele · 2022-01-17T15:32:20Z

Meeting 2022-01-17: Define both functions in one requirements/conformance class. In most cases, implementing the functions will simply require calling the function in the underlying datastore, but if someone wants/needs to implement the capability themselves, this is a complex implementation. If it turns out that it would be better to have separate conformance declarations, we can add two additional conformance classes, one for each function.

@pvretano will resolve the outstanding comments/edits and then merge the PR.

philvarner · 2022-01-31T14:03:42Z

cql2/standard/clause_7_enhanced.adoc

+====
+
+----
+etat_vol = ACCENTI('débárquér')


@pvretano I know you just merged this, but shouldn't this be

ACCENTI(etat_vol) = ACCENTI('débárquér')

as as it is for CASEI? same for the JSON example.

cportele · 2022-01-31T14:05:54Z

@philvarner - I agree.

There are also a few other comments/edits that were discussed in this PR that are not yet included. I am working on a follow-up PR. I will make that change there, too.

cc @pvretano

pvretano · 2022-01-31T14:09:39Z

@philvarner yes, it should be ACCENTI(etat_vol) = ACCENTI('débárquér'). @cportele do you want me to fix the example or will you do it in the followup PR?

cportele · 2022-01-31T14:11:49Z

@pvretano - I have already fixed it in my edits (the property references in the JSON also needed an update).

There were still some outstanding comments/edits from the discussion in PR #641: * Add explanatory text (#641 (comment)) * Removed definition of "collation" - it was not used in the text and it is not necessary to introduce the term (#641 (comment)) * Add "unicode normalization" as a term (#641 (comment)) * Add "unicode case folding" as a term (#641 (comment)) * Correct examples (#641 (comment))

follow-up edits from PR #641 Meeting 2022-01-31: Merge the PR. Any additional comment would be addressed in a new PR.

Issue-630: Changes to define the "CASEI" function for supporting case…

9d5e464

…-insensitive string comparisons.

pvretano linked an issue Oct 25, 2021 that may be closed by this pull request

Case-insensitive string comparison #630

Closed

pvretano requested a review from cportele October 25, 2021 03:28

cportele reviewed Oct 25, 2021

View reviewed changes

cql2/standard/schema/cql2.bnf Outdated Show resolved Hide resolved

cql2/standard/clause_7_enhanced.adoc Outdated Show resolved Hide resolved

cql2/standard/schema/cql2.yml Outdated Show resolved Hide resolved

Oops ... forgot to add these files.

e906e71

pvretano mentioned this pull request Dec 2, 2021

CQL2 Case-insensitive Comparison upper/lower #656

Closed

philvarner mentioned this pull request Dec 2, 2021

Updates to Filter Extension for CQL2 radiantearth/stac-api-spec#225

Merged

4 tasks

Align the definitons of CASEI and ACCENTI in the BNF and JSON encodings.

1ba1aec

philvarner mentioned this pull request Dec 3, 2021

Add Filter Extension ACCENTI conformance class (WIP in CQL2) #657

Closed

pvretano added 6 commits December 5, 2021 22:19

Fix typo.

3b544b3

Add more content about accent insensitive comparisons.

dd2519c

Update accent-insensitive comparison example.

5530ad6

Restore previous name to resolve merge conflict.

4719095

Merge branch 'master' into issue-630

80d81a6

Rename.

a7eb3a8

philvarner reviewed Dec 6, 2021

View reviewed changes

cql2/standard/schema/examples/example27.txt Outdated Show resolved Hide resolved

pvretano and others added 4 commits December 6, 2021 11:39

Update cql2/standard/schema/examples/example27.txt

93fb966

Co-authored-by: Phil Varner <phil@philvarner.com>

Add some informative text about durable case/accent insensitive

f5d9b05

comparisons.

Fix typo.

8354a5c

Fix typo.

95b426d

cportele mentioned this pull request Dec 13, 2021

update the CQL implementation interactive-instruments/ldproxy#432

Closed

heidivanparys reviewed Jan 17, 2022

View reviewed changes

cql2/standard/requirements/accent-case-insensitive-comparison/REQ_accenti-builtin-function.adoc Outdated Show resolved Hide resolved

heidivanparys reviewed Jan 17, 2022

View reviewed changes

cql2/standard/requirements/accent-case-insensitive-comparison/REQ_casei-builtin-function.adoc Outdated Show resolved Hide resolved

heidivanparys reviewed Jan 17, 2022

View reviewed changes

cportele and others added 2 commits January 17, 2022 16:30

Update cql2/standard/requirements/accent-case-insensitive-comparison/…

f68100d

…REQ_accenti-builtin-function.adoc Meeting 2022-01-17: Agreed. Co-authored-by: Heidi Vanparys <heidi.vanparys@gmail.com>

Update cql2/standard/requirements/accent-case-insensitive-comparison/…

63ad704

…REQ_casei-builtin-function.adoc Meeting 2022-01-17: Agreed. Co-authored-by: Heidi Vanparys <heidi.vanparys@gmail.com>

cportele mentioned this pull request Jan 26, 2022

Case-insensitive string comparison #630

Closed

Merge branch 'master' into issue-630

5928904

pvretano merged commit 3580a4f into opengeospatial:master Jan 30, 2022

philvarner reviewed Jan 31, 2022

View reviewed changes

cportele mentioned this pull request Jan 31, 2022

follow-up edits from PR #641 #668

Merged

cportele added a commit that referenced this pull request Jan 31, 2022

Merge pull request #668 from opengeospatial/issue-630-pt2

ed0a671

follow-up edits from PR #641 Meeting 2022-01-31: Merge the PR. Any additional comment would be addressed in a new PR.

pvretano mentioned this pull request Feb 2, 2022

CQL: give CASEI and ACCENTI distinct conformance classes #672

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue-630: Changes to define the "CASEI" function for supporting case-insensitive string comparisons. #641

Issue-630: Changes to define the "CASEI" function for supporting case-insensitive string comparisons. #641

pvretano commented Oct 25, 2021 •

edited

Loading

cportele left a comment

cportele commented Oct 25, 2021 •

edited

Loading

philvarner commented Dec 3, 2021

pvretano commented Dec 3, 2021 •

edited

Loading

philvarner commented Dec 3, 2021

philvarner commented Dec 3, 2021

pvretano commented Dec 3, 2021 •

edited

Loading

philvarner commented Dec 6, 2021

pvretano commented Dec 6, 2021 •

edited

Loading

philvarner commented Dec 6, 2021

m-mohr commented Dec 7, 2021

m-mohr commented Dec 7, 2021 •

edited

Loading

rsmith013 commented Dec 7, 2021 •

edited

Loading

cportele commented Dec 7, 2021

aznan2 commented Dec 7, 2021

m-mohr commented Dec 7, 2021 •

edited

Loading

pvretano commented Dec 7, 2021 •

edited

Loading

aznan2 commented Dec 7, 2021

pvretano commented Dec 7, 2021

philvarner commented Dec 7, 2021

philvarner commented Dec 7, 2021

heidivanparys Jan 17, 2022

cportele Jan 17, 2022

heidivanparys Jan 17, 2022

cportele Jan 17, 2022

heidivanparys Jan 17, 2022

cportele Jan 17, 2022

cportele commented Jan 17, 2022

philvarner Jan 31, 2022

cportele commented Jan 31, 2022

pvretano commented Jan 31, 2022

cportele commented Jan 31, 2022

Issue-630: Changes to define the "CASEI" function for supporting case-insensitive string comparisons. #641

Issue-630: Changes to define the "CASEI" function for supporting case-insensitive string comparisons. #641

Conversation

pvretano commented Oct 25, 2021 • edited Loading

cportele left a comment

Choose a reason for hiding this comment

cportele commented Oct 25, 2021 • edited Loading

philvarner commented Dec 3, 2021

pvretano commented Dec 3, 2021 • edited Loading

philvarner commented Dec 3, 2021

philvarner commented Dec 3, 2021

pvretano commented Dec 3, 2021 • edited Loading

philvarner commented Dec 6, 2021

pvretano commented Dec 6, 2021 • edited Loading

philvarner commented Dec 6, 2021

m-mohr commented Dec 7, 2021

m-mohr commented Dec 7, 2021 • edited Loading

rsmith013 commented Dec 7, 2021 • edited Loading

cportele commented Dec 7, 2021

aznan2 commented Dec 7, 2021

m-mohr commented Dec 7, 2021 • edited Loading

pvretano commented Dec 7, 2021 • edited Loading

aznan2 commented Dec 7, 2021

pvretano commented Dec 7, 2021

philvarner commented Dec 7, 2021

philvarner commented Dec 7, 2021

heidivanparys Jan 17, 2022

Choose a reason for hiding this comment

cportele Jan 17, 2022

Choose a reason for hiding this comment

heidivanparys Jan 17, 2022

Choose a reason for hiding this comment

cportele Jan 17, 2022

Choose a reason for hiding this comment

heidivanparys Jan 17, 2022

Choose a reason for hiding this comment

cportele Jan 17, 2022

Choose a reason for hiding this comment

cportele commented Jan 17, 2022

philvarner Jan 31, 2022

Choose a reason for hiding this comment

cportele commented Jan 31, 2022

pvretano commented Jan 31, 2022

cportele commented Jan 31, 2022

pvretano commented Oct 25, 2021 •

edited

Loading

cportele commented Oct 25, 2021 •

edited

Loading

pvretano commented Dec 3, 2021 •

edited

Loading

pvretano commented Dec 3, 2021 •

edited

Loading

pvretano commented Dec 6, 2021 •

edited

Loading

m-mohr commented Dec 7, 2021 •

edited

Loading

rsmith013 commented Dec 7, 2021 •

edited

Loading

m-mohr commented Dec 7, 2021 •

edited

Loading

pvretano commented Dec 7, 2021 •

edited

Loading