DRAFT: MF2: fix various parser bugs and add more tests #3092

catamorphism · 2024-08-08T21:47:23Z

DRAFT wip, do not review, depends on #3063

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-_____
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

This also updates the spec tests from the current version of the MFWG repository.

Update both ICU4C and ICU4J parsers to follow the current test schema in the conformance repository. This includes adding code to both parsers to allow `src` to be either a single string or an array of strings (per unicode-org/conformance#255 ), and eliminating `srcs` in tests.

…ange

…rved annotation

…n patterns

…r than code points

…ing an empty reserved-body

…grammar

mihnita · 2024-08-09T17:47:54Z

Sorry, the sharing of the test files PR started when I was in vacation and somehow didn't get on my radar when I got back.

But I just found that the change broke Eclipse, and that now the json test files are packaged in the release jar files of ICU.

I general having Maven consume files from outside it's own folder tree is hacky and error prone.
And when it finally works, we discover that it breaks the import to various IDEs.
I think that we would want to support at least Eclipse, IntelliJ, VS Code.

We currently have that problem with the LICENSE file (I can't import icu4j in Eclipse anymore)
I'm working on that...

The other problem with sharing is that it forces the c/c++ and Java implementations to be 100% in sync at all times.
Having them work the same is in general a good thing, of course.

But becomes a PITA when something changes and we need to update the code.
Because (often) the C++ and Java devs might not be the same.

This PR being an example.

TLDR: I am tempted to keep two copies of the test data.
There is an ant task in tools/cldr/ that (copy-cldr-testdata) that copies some test data from CLDR.

These json files are probably in the same bucket: they live in CLDR, but must be tested against in ICU.

But the idea is that the ant task can copy the files in two places, icu4c and icu4j.
If code breaks, but one has time / expertise for C/C++ only, they can open a ticket against icu4j, priority zero,
but revert the test files in icu4j until that issue is fixed.

I don't know if that is a good idea or not.
But this is what we already do for other tests.

Here is a fragment of the ant script:

        <fileset id="cldrTestData" dir="${cldrDir}/common/testData">
            <!-- Add directories here to control which test data is installed. -->
            <include name="localeIdentifiers/**"/> <!-- ... -->
            <include name="personNameTest/**"/> <!-- Used in ExhaustivePersonNameTest -->
            <include name="units/**"/> <!-- Used in UnitsTest tests -->
       </fileset>

        <copy todir="${testDataDir4C}">
            <fileset refid="cldrTestData"/>
        </copy>
        <copy todir="${testDataDir4J}">
            <fileset refid="cldrTestData"/>
        </copy>

We copy the same test files in icu4c/source/test/testdata/cldr and icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/cldr

mihnita · 2024-08-09T17:50:24Z

Try:

diff -r icu4c/source/test/testdata/cldr icu4j/main/core/src/test/resources/com/ibm/icu/dev/data/cldr

There are 3 sub-folders there: localeIdentifiers, personNameTest, units
I would propose we add another one (messageformat2) and do the same thing.

mihnita · 2024-08-09T18:28:28Z

icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java

@@ -62,10 +62,22 @@ int readCodePoint() {
        return c;
    }

-    // Backup a number of characters.
+    // Backup a number of code points.
    void backup(int amount) {


This method works on code units (Java char), not code points.

That's because int readCodePoint() also works on code units.
It returns a code point, but the current offset in the input (cursor) is expresses in code units.
So when it finds a code point above BMP it advances with 2

Reasons:

When one wants to parse something from an offset, it is faster to express the offset in the same kind of units we use for storage. We store using char, offset in code units. Otherwise we must iterate all the way to the offset in code points. Our input buffer uses char

Easier to deal with not-properly paired surrogates. The spec does not try to enforce the surrogate correctness, and I don't think it is the job of the MF2 to do that. If I pass ".....\uDC00...{$user}! there is no reason to not format to ".....\uDC00...John!

Even if there is disagreement about the reasons, and we decide to express offsets in code points, this fix is not sufficient.

We would also have to change getPosition(), skip(int amount), gotoPosition(int position)
Because we either work on code points, or on code units, we should not mix and match.

Thanks for the explanation. I see that the issue isn't really with backup(), but how it's used in conjunction with readCodePoint() (reading a code point that happens to be a wide char, and then calling backup(1) to "push it back", is a bug).

In eee547c I reverted the change to backup() and fixed the three places I could find where readCodePoint() and backup() were being used in that way.

mihnita · 2024-08-09T18:30:59Z

icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java

+        }
+    }
+
+    void backupOneCodePoint() {


What happens if it is an unpaired low surrogate?
There is nothing to prevent such a thing from showing up in a message.

True, it is incorrect.
But we are not trying to reject that, we accept it.
And this code would misbehave.

Sounds like we need a test for this! (I'll add one.)

mihnita

First, thank you very much for taking a stab at updating the Java implementation to the latest spec.
I was planning to do that, but it looks like trying to share the test files precipitated things somewhat :-)

I added some comments.

But if there is a decisions to treat the mf2 test files the same way we treat other CLDR test files (units, people names, locale ids) and separate the C++ / Java tests, I can take over the Java part. And in a different PR.

Or you can continue it, as you already did a big chunk of it (all?)
Your call.

Thanks again,
M

mihnita · 2024-08-09T18:45:38Z

icu4j/main/core/src/main/java/com/ibm/icu/message2/MFDataModelFormatter.java

-            if (nr != null) {
-                return nr;
-            }
+            // "The resolution of a text or literal MUST resolve to a string."


I think that this change breaks selection as it is currently implemented.
Because it means that "1.0" does not match "1.00"

Might be desired, to not match, I don't know.
But any change here should be accompanied by a change in the plural matcher.

One might say that this is desired.
But the format is locale sensitive.
For example formatting 1 dollar I might get "1.00 dollars", and does not match |1|.
But formatting 1 Japanese yen will result in "1 ...", and that matches |1|.

So the developer using |1| in their message creates something that behaves differently on different locales.

I proposed to interpret |1| as a string, and 1 as a number, and it is not resolved:
unicode-org/message-format-wg#712

And Eemeli argued for it recently:
unicode-org/message-format-wg#842

I don't think the plural matcher has to be changed? This change is to address situations where a number literal isn't annotated, like:

{ "src": "Format {123456789.9876} number", "locale": "en-IN", "exp": "Format 123456789.9876 number", "comment": "Number literals are not formatted as numbers by default" }

This change doesn't change the behavior for .match on something with a :number annotation, or at least, there aren't any tests that show that.

icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java

mihnita · 2024-08-09T18:51:57Z

icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java

@@ -712,7 +721,7 @@ private MFDataModel.Declaration getDeclaration() throws MFParseException {
        MFDataModel.Expression expression;
        switch (declName) {
            case "input":
-                skipMandatoryWhitespaces();
+                skipOptionalWhitespaces();


The rule is input-declaration = input [s] variable-expression
So the space is mandatory.

I don't think so? If it was mandatory, it would say input-declaration = input s variable-expression (without the square brackets.)

mihnita · 2024-08-09T19:00:46Z

icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/FirstReleaseTests.java

I would rather not delete these extra tests.
They cover functionality not covered by the tests in the spec.
And that can't be covered by the spec.

They test that the implementation works end-to-end.

Some might be moved to the standard (CLDR).

But some can't, because they test that custom functions work, or that icu:skeletons work (namespaced),
or that various types work (Date, Calendar, java.time, or even long with :dateformat.

And that the type "magically selects" the proper function (For example "...{$exp}..." formats as a date if the exp is a date-like type (Date / Calendar / java.time)

This comment also (or mostly?) apply to the other deleted files (FunctionsTests, IcuFunctionsTest)

It might be less confusing to look at #3063 first. This PR includes that PR, plus additional changes.

In the description for #3063 I explained that I didn't delete any tests (unless they were duplicates), but rather, moved all the JSON filenames being read into a single file, because now the schema is the same for all the tests, so the same code can be re-used to read all of the files.

mihnita · 2024-08-09T19:04:23Z

icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/ParserSmokeTest.java

@@ -29,16 +29,5 @@ public void testNullInput() throws Exception {
        MFParser.parse(null);
    }

-    @Test


There is not much / anything left if we remove this.

So the same comment that I had for deleted files applies: do we really want to delete extra tests?
I think that the more tests we have the better.
Even some don't come from the official CLDR spec.

We can separate these tests in a different folder for cleanliness.
But not delete them, unless they are redundant.

See #3092 (comment)

mihnita · 2024-08-09T19:04:46Z

...main/core/src/test/java/com/ibm/icu/dev/test/message2/SelectorsWithVariousArgumentsTest.java

Same comment about deleting tests.

catamorphism · 2024-08-09T19:14:11Z

Sorry, the sharing of the test files PR started when I was in vacation and somehow didn't get on my radar when I got back.

But I just found that the change broke Eclipse, and that now the json test files are packaged in the release jar files of ICU.

I general having Maven consume files from outside it's own folder tree is hacky and error prone. And when it finally works, we discover that it breaks the import to various IDEs. I think that we would want to support at least Eclipse, IntelliJ, VS Code.

Of course; is there a way to add automated testing that changes don't break IDEs?

We currently have that problem with the LICENSE file (I can't import icu4j in Eclipse anymore) I'm working on that...

The other problem with sharing is that it forces the c/c++ and Java implementations to be 100% in sync at all times. Having them work the same is in general a good thing, of course.

But becomes a PITA when something changes and we need to update the code. Because (often) the C++ and Java devs might not be the same.

This PR being an example.

TLDR: I am tempted to keep two copies of the test data. There is an ant task in tools/cldr/ that (copy-cldr-testdata) that copies some test data from CLDR.

I don't want to unilaterally say yes or no to this; is it something to discuss in the TC meeting, maybe? (Note: I'll be on vacation from early next week until September 2, so I won't be at the next few meetings, but I can go back and read minutes later.)

…Point() consistent Fix three locations where `cp = readCodePoint()` was followed by `backup(1)`, in a context where `cp` might be a wide char: `getIdentifier()`: Use `peekChar()` rather than `readCodePoint()` followed by `backup()` (an identifier could be followed by a wide char that is not a name-char) `skipWhitespaces()`: The same change (a whitespace could be followed by a wide char) `getName()`: names can include wide chars The tests are in 4429088

catamorphism · 2024-08-09T20:54:12Z

First, thank you very much for taking a stab at updating the Java implementation to the latest spec. I was planning to do that, but it looks like trying to share the test files precipitated things somewhat :-)

Right, first it was precipitated by sharing test files; then I wrote a random test generator and decided it was only fair to run ICU4J on some of the generated tests as well as ICU4C, so that's where some of the other changes came from.

I added some comments.

But if there is a decisions to treat the mf2 test files the same way we treat other CLDR test files (units, people names, locale ids) and separate the C++ / Java tests, I can take over the Java part. And in a different PR.

Certainly (like I said in another comment, I think that has to be discussed with the TC).

I think I've addressed all your comments, but let me know if I missed anything!

catamorphism added 25 commits August 8, 2024 14:31

ICU-22834 Update tests to reflect MF2 schema in conformance repo

6920ffc

This also updates the spec tests from the current version of the MFWG repository.

ICU4C: Allow trailing whitespace for complex messages, due to spec ch…

8fb824f

…ange

ICU4C: Parse number literals correctly in Number::format

029cee8

ICU4J: Allow trailing whitespace after complex body, per spec change

696755f

Add ignoreJava annotations for tests with runtime errors

1555928

Update testdata/message2/README.txt

612387f

Parse text and escape chars according to spec

724770a

Handle markup with space after the initial left curly brace

b070bb5

ICU4J: Don't format unannotated number literals as numbers

18655f2

ICU4J: Treat whitespace after .input keyword as optional

aba1a43

ICU4C: Fix bug that was assuming an .input variable can't have a rese…

3ad4450

…rved annotation

Escape curly braces when serializing and normalizing

cb94e36

ICU4C: Escape '|' in patterns

e46e8c4

ICU4C: Fix bug where unsupported '.i' was parsed as an '.input'

3a022cc

ICU4C: Make an invalid test valid (unsupported statement parsing)

6b73fe1

ICU4C: When normalizing input, escape optionally-escaped characters i…

7ff3559

…n patterns

ICU4C/ICU4J: Allow trailing whitespace after a match

6d6135a

ICU4J: handle trailing whitespace after a .match

aeeac80

ICU4C: Fix parser to iterate over code points, not code units

7dd36f3

ICU4J: Fix parser bug where backup() was counting in code units rathe…

4429088

…r than code points

ICU4J: Fix bug where whitespace wasn't getting pushed back after read…

5c8ccb4

…ing an empty reserved-body

ICU4J: Fix reserved escape chars to match grammar

4564da6

ICU4J: Fix bug with '||' in a reserved statement being misparsed

a752a33

ICU4J: Allow any type of expression in an unsupported statement, per …

bb4a03d

…grammar

mihnita reviewed Aug 9, 2024

View reviewed changes

catamorphism added 2 commits August 9, 2024 12:29

ICU4J: Fix grammar rule in comment

cb3594b

catamorphism mentioned this pull request Aug 13, 2024

ICU-22834 MF2: make tests compliant with schema and update spec tests #3063

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRAFT: MF2: fix various parser bugs and add more tests #3092

DRAFT: MF2: fix various parser bugs and add more tests #3092

catamorphism commented Aug 8, 2024

mihnita commented Aug 9, 2024

mihnita commented Aug 9, 2024

mihnita Aug 9, 2024

catamorphism Aug 9, 2024 •

edited

Loading

mihnita Aug 9, 2024

catamorphism Aug 9, 2024

mihnita left a comment

mihnita Aug 9, 2024

catamorphism Aug 9, 2024

mihnita Aug 9, 2024

catamorphism Aug 9, 2024

mihnita Aug 9, 2024

catamorphism Aug 9, 2024

mihnita Aug 9, 2024

catamorphism Aug 9, 2024

mihnita Aug 9, 2024

catamorphism commented Aug 9, 2024

catamorphism commented Aug 9, 2024 •

edited

Loading

DRAFT: MF2: fix various parser bugs and add more tests #3092

Are you sure you want to change the base?

DRAFT: MF2: fix various parser bugs and add more tests #3092

Conversation

catamorphism commented Aug 8, 2024

Checklist

mihnita commented Aug 9, 2024

mihnita commented Aug 9, 2024

Choose a reason for hiding this comment

catamorphism Aug 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mihnita left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

catamorphism commented Aug 9, 2024

catamorphism commented Aug 9, 2024 • edited Loading

catamorphism Aug 9, 2024 •

edited

Loading

catamorphism commented Aug 9, 2024 •

edited

Loading