Upstream data, `jyutping` should not be an array. #4

nathanhammond · 2021-01-28T11:15:32Z

I've spent today reviewing https://github.com/lshk-org/jyutping-table and I believe that there are issues in the upstream source data.

First, I believe that jyutping should be a single field, not an array. All of the items which appear as an array are listed here:
lshk-org/jyutping-table#3 (comment)

With the exception of these, which I believe are actually multi-syllable:

ch	ucs2	pronunciation
𠯢	U+E064	saa1 aa6
〺	U+5345	saa1 aa6
卌	U+534C	sei3 aa6

I believe that the remainder of the list should be interpreted as follows (this is an excerpt, not the entire list):

ch	ucs2	descriptor	pronunciation
籿	U+7C7F	fan1	mai5
粀	U+7C80	sap6	mai5
粁	U+7C81	cin1	mai5
粌	U+7C8C	baak3	mai5
粍	U+7C8D	hou4	mai5
粨	U+7CA8	baak3	mai5
糎	U+7CCE	lei4	mai5

㖊	U+358A	jing1	cam4
吋	U+540B	jing1	cyun3
呎	U+544E	jing1	cek3
哩	U+54E9	jing1	lei5
唡	U+5521	jing1	loeng2
啢	U+5562	jing1	loeng2
噚	U+565A	jing1	cam4
𠺖	U+F45A	jing1	mau5
𠰴	U+F4C0	jing1	sek6

Of the full list, only 6 are marked as having different phonetics, and in every case the marked value matches the second Jyutping component.

ch	ucs2	pronunciation
吋	U+540B	cyun3
呎	U+544E	cek3
哩	U+54E9	(li1, le1, lei5)
浬	U+6D6C	lei5
𠺖	U+F45A	mau5
𠰴	U+F4C0	sek6
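That observation can be checked mechanically. The following is a minimal sketch that uses only rows quoted in this issue (the 浬 pair is `hoi2 lei5` in the upstream list); the names `array_values`, `marked`, and `mismatches` are made up for illustration and are not part of any library.

```python
# Two-element `jyutping` arrays as quoted in this issue, restricted to the
# six characters that also carry a separately marked pronunciation.
array_values = {
    "吋": ["jing1", "cyun3"],
    "呎": ["jing1", "cek3"],
    "哩": ["jing1", "lei5"],
    "浬": ["hoi2", "lei5"],
    "𠺖": ["jing1", "mau5"],
    "𠰴": ["jing1", "sek6"],
}

# Pronunciations marked separately for the same characters.
marked = {
    "吋": {"cyun3"},
    "呎": {"cek3"},
    "哩": {"li1", "le1", "lei5"},
    "浬": {"lei5"},
    "𠺖": {"mau5"},
    "𠰴": {"sek6"},
}


def mismatches(arrays, marked_values):
    """Return characters whose second array element is not a marked pronunciation."""
    return [
        ch
        for ch, (_, second) in arrays.items()
        if second not in marked_values[ch]
    ]


print(mismatches(array_values, marked))
```

An empty result means every second component appears among the marked pronunciations for its character.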

Given that the characters' construction matches the description field, and based on a conversation with a native speaker and another with a Cantonese linguist who is a non-native speaker (and who also consulted a native speaker), I believe that:

The first Jyutping value represents how the character would be described.
The second Jyutping value represents the actual pronunciation.

Further, it's also possible that the descriptor field pronunciation is wrong in a couple of cases, as I mention here: lshk-org/jyutping-table#4.

I propose that the object shape for this library be modified to account for this finding, and that post-processing be added to adjust the data for correctness.
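As a rough illustration of that proposal, here is a minimal sketch in Python. The `Entry` shape, the `normalize` helper, and the `MULTI_SYLLABLE` set are all hypothetical names, not this library's actual API, and the sketch assumes the upstream `jyutping` value arrives as a list of syllable strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    ch: str
    ucs2: str
    jyutping: str                      # single field: the actual pronunciation
    descriptor: Optional[str] = None   # how the character would be described, if any


# Code points this issue treats as genuinely multi-syllable.
MULTI_SYLLABLE = {"U+E064", "U+5345", "U+534C"}


def normalize(ch: str, ucs2: str, jyutping: List[str]) -> Entry:
    """Post-process one upstream row into the proposed shape (assumed input: a list)."""
    if len(jyutping) == 1:
        return Entry(ch, ucs2, jyutping[0])
    if ucs2 in MULTI_SYLLABLE:
        # Keep both syllables as one space-separated pronunciation.
        return Entry(ch, ucs2, " ".join(jyutping))
    # Otherwise expect exactly [descriptor, pronunciation];
    # unpacking raises ValueError for any other length.
    descriptor, pronunciation = jyutping
    return Entry(ch, ucs2, pronunciation, descriptor)


print(normalize("吋", "U+540B", ["jing1", "cyun3"]))
print(normalize("卌", "U+534C", ["sei3", "aa6"]))
```

Keeping `descriptor` optional means single-reading characters need no special casing on the consumer side.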


nathanhammond · 2021-01-28T11:26:56Z

Output updated in my independent implementation.

chaklim · 2021-01-28T14:16:45Z

𠯢	U+E064	saa1 aa6
〺	U+5345	saa1 aa6
卌	U+534C	sei3 aa6

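For the three characters above, here is a quote from section 4.9.2 of JPTableFull (U+E064 is missing there, but I believe that is a mistake):

"Because the two characters ‘卅’ (U+5345) and ‘卌’ (U+534C) involve colloquial Cantonese readings, this table does not follow the single-syllable readings given in dictionaries, and instead assigns them their actual two-syllable pronunciations."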

Thus I believe that they are actually multi-syllable.

籿	U+7C7F	fan1	mai5
粀	U+7C80	sap6	mai5
粁	U+7C81	cin1	mai5
粌	U+7C8C	baak3	mai5
粍	U+7C8D	hou4	mai5
粨	U+7CA8	baak3	mai5
糎	U+7CCE	lei4	mai5

For the characters above, I am not sure what the right interpretation is. If I had to guess, I think mai5 refers to the radical (米).

㖊	U+358A	jing1	cam4
吋	U+540B	jing1	cyun3
呎	U+544E	jing1	cek3
哩	U+54E9	jing1	lei5
唡	U+5521	jing1	loeng2
啢	U+5562	jing1	loeng2
噚	U+565A	jing1	cam4
𠺖	U+F45A	jing1	mau5
𠰴	U+F4C0	jing1	sek6

The characters above are interesting because I believe the leading jing1 refers to the character 英.
The three combinations that I know are 英吋, 英呎, and 英哩, and they are units of length.
Maybe jing1 needs to be removed in those cases.

nathanhammond · 2021-01-28T15:04:40Z

Here is what I’ve got from the linguist:

“The sap6 sing1 and baak3 mai5 ones are units of measurements. 竓 = 毫升 hou sing = millilitre.”

Looking again at lshk-org/jyutping-table#3 (comment):

They all have the same set of “prefixes.”
I’m able to find some evidence of historic numeric use for them.

I’m willing to bet that mai5, sing1, hak1, ngaa5 are all measurements.

Further, there could be alternate characters included for when people would choose a different character to represent the contraction, which might explain the thing I was treating as a data issue. For example: 䇉 U+41C9 baak3 sing1 could be bad handwriting that was repeated, or bad transcription (e.g. https://www.dampfkraft.com/ghost-characters.html). Some of the other "duplicate" prefixes could also fall into that category.

jing1 is weird.

These are also weird ones:

浬	U+6D6C	hoi2 lei5
粴	U+7CB4	gung1 lei5
嗧	U+55E7	gaa1 leon4

So, we could be looking at archaic contraction pronunciations that didn’t survive to modern Cantonese.

chaklim · 2021-01-28T15:38:27Z

浬	U+6D6C	hoi2 lei5
粴	U+7CB4	gung1 lei5
嗧	U+55E7	gaa1 leon4

These three characters are all units of measurement too:

浬	U+6D6C	海里
粴	U+7CB4	公里
嗧	U+55E7	加侖

chaklim · 2021-01-28T15:43:16Z

If I had to guess for the "米-X" characters now, they should mean "X-meters".

nathanhammond · 2021-01-28T15:43:41Z

Okay, so I think this is "mystery solved." They're archaic contractions that didn't make it to modern Cantonese. Some of them may be bad handwriting or bad transcriptions (e.g. 䇉 U+41C9 baak3 sing1; I updated my comment above with a fun link), but regardless, they ended up appearing in one of the source books that was used to populate the table.

Given that, I think we're looking at needing to treat all of them as archaic multi-syllable characters.
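Under that conclusion, the adjustment becomes simpler: every array is read as one multi-syllable value. A minimal sketch, assuming the upstream `jyutping` value is either a plain string or a list of syllables; `to_reading` is a hypothetical helper name, not an existing function.

```python
def to_reading(jyutping):
    """Collapse an upstream `jyutping` value into one space-separated string."""
    if isinstance(jyutping, str):
        # Already a single reading; nothing to join.
        return jyutping
    return " ".join(jyutping)


print(to_reading(["gung1", "lei5"]))
print(to_reading("lei5"))
```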
