Upstream data, jyutping should not be an array. #4

Open
nathanhammond opened this issue Jan 28, 2021 · 6 comments

Comments

@nathanhammond

I've spent today reviewing https://github.com/lshk-org/jyutping-table and I believe that there are issues in the upstream source data.

First, I believe that jyutping should be a single field, not an array. All of the items which appear as an array are listed here:
lshk-org/jyutping-table#3 (comment)

The exceptions are these three, which I believe are actually multi-syllable:

ch	ucs2	pronunciation
𠯢	U+E064	saa1 aa6
卅	U+5345	saa1 aa6
卌	U+534C	sei3 aa6

I believe that the remainder of the list should be interpreted as follows (this is an excerpt, not the entire list):

ch	ucs2	descriptor	pronunciation
籿	U+7C7F	fan1	mai5
粀	U+7C80	sap6	mai5
粁	U+7C81	cin1	mai5
粌	U+7C8C	baak3	mai5
粍	U+7C8D	hou4	mai5
粨	U+7CA8	baak3	mai5
糎	U+7CCE	lei4	mai5

㖊	U+358A	jing1	cam4
吋	U+540B	jing1	cyun3
呎	U+544E	jing1	cek3
哩	U+54E9	jing1	lei5
唡	U+5521	jing1	loeng2
啢	U+5562	jing1	loeng2
噚	U+565A	jing1	cam4
𠺖	U+F45A	jing1	mau5
𠰴	U+F4C0	jing1	sek6

Of the full list, only 6 are marked as having a different phonetic reading, and in every case that reading matches the second Jyutping component.

ch	ucs2	pronunciation
吋	U+540B	cyun3
呎	U+544E	cek3
哩	U+54E9	(li1, le1, lei5)
浬	U+6D6C	lei5
𠺖	U+F45A	mau5
𠰴	U+F4C0	sek6

Given that the characters' construction matches the description field, plus a conversation with a native speaker and a conversation with a non-native-speaker Cantonese linguist (who also consulted a native speaker), I believe:

  • The first Jyutping value represents how the character would be described.
  • The second Jyutping value represents the actual pronunciation.

Further, it's also possible that the descriptor field pronunciation is wrong in a couple of cases, as I mention here: lshk-org/jyutping-table#4.


I propose that the object shape for this library be modified to account for this finding and post-processing added to adjust the data for correctness.
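
As a rough illustration of the proposed shape change, here is a minimal Python sketch. It assumes upstream entries carry a `jyutping` list of syllables; the names `Entry`, `split_reading`, and `MULTI_SYLLABLE` are hypothetical and are not part of this library or of the upstream table. It only encodes the interpretation described above (first value is a descriptor, last value is the pronunciation), which may not hold for every entry.

```python
from dataclasses import dataclass
from typing import List, Optional

# Code points listed in this issue as genuinely multi-syllable.
MULTI_SYLLABLE = {"U+E064", "U+5345", "U+534C"}


@dataclass
class Entry:
    ucs2: str
    pronunciation: str                # always a single string, never a list
    descriptor: Optional[str] = None  # hypothetical extra field


def split_reading(ucs2: str, jyutping: List[str]) -> Entry:
    """Convert an upstream jyutping array into a single pronunciation field."""
    if len(jyutping) == 1:
        return Entry(ucs2, jyutping[0])
    if ucs2 in MULTI_SYLLABLE:
        # Keep every syllable: the reading itself is multi-syllable.
        return Entry(ucs2, " ".join(jyutping))
    # Assumption: leading values describe the character, the last one is the reading.
    *descriptor, pronunciation = jyutping
    return Entry(ucs2, pronunciation, " ".join(descriptor))
```

For example, `split_reading("U+540B", ["jing1", "cyun3"])` would yield a pronunciation of `cyun3` with `jing1` kept as the descriptor, while U+5345 would keep `saa1 aa6` as its pronunciation.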

@chaklim
Owner

chaklim commented Jan 28, 2021

𠯢	U+E064	saa1 aa6
卅	U+5345	saa1 aa6
卌	U+534C	sei3 aa6

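For the three characters above, here is a quote from section 4.9.2 of JPTableFull (U+E064 is missing there, but I believe that is a mistake):

‘卅’(U+5345) 及‘卌’(U+534C) 兩字由於涉及粵語口語讀法,所以本表沒有依從辭書所注的單音節音,而賦予實際的雙音節發音。
(Translation: because the two characters ‘卅’ (U+5345) and ‘卌’ (U+534C) involve colloquial Cantonese readings, this table does not follow the monosyllabic readings given in dictionaries and instead assigns the actual disyllabic pronunciations.)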

Thus I believe that they are actually multi-syllable.

籿	U+7C7F	fan1	mai5
粀	U+7C80	sap6	mai5
粁	U+7C81	cin1	mai5
粌	U+7C8C	baak3	mai5
粍	U+7C8D	hou4	mai5
粨	U+7CA8	baak3	mai5
糎	U+7CCE	lei4	mai5

For the characters above, I am not sure what the right interpretation is. If I had to guess, I think mai5 refers to the radical (米).

㖊	U+358A	jing1	cam4
吋	U+540B	jing1	cyun3
呎	U+544E	jing1	cek3
哩	U+54E9	jing1	lei5
唡	U+5521	jing1	loeng2
啢	U+5562	jing1	loeng2
噚	U+565A	jing1	cam4
𠺖	U+F45A	jing1	mau5
𠰴	U+F4C0	jing1	sek6

The characters above are interesting because I believe the leading jing1 refers to the character 英.
The three combinations that I know are 英吋, 英呎, and 英哩, and they are all units of length.
Maybe jing1 needs to be removed in those cases.

@nathanhammond
Author

nathanhammond commented Jan 28, 2021

Here is what I’ve got from the linguist:

“The sap6 sing1 and baak3 mai5 ones are units of measurements. 竓 = 毫升 hou sing = millilitre.”

Looking again at lshk-org/jyutping-table#3 (comment):

  • They all have the same set of “prefixes.”
  • I’m able to find some evidence of historic numeric use for them.

I’m willing to bet that mai5, sing1, hak1, ngaa5 are all measurements.

Further, there could be alternate characters included for when people would choose a different character to represent the contraction, which might explain the thing I was treating as a data issue. For example: 䇉 U+41C9 baak3 sing1 could be bad handwriting that was repeated, or bad transcription (e.g. https://www.dampfkraft.com/ghost-characters.html). Some of the other "duplicate" prefixes could also fall into that category.

jing1 is weird.

These are also weird ones:

浬	U+6D6C	hoi2 lei5
粴	U+7CB4	gung1 lei5
嗧	U+55E7	gaa1 leon4

So, we could be looking at archaic contraction pronunciations that didn’t survive to modern Cantonese.

@chaklim
Owner

chaklim commented Jan 28, 2021

浬	U+6D6C	hoi2 lei5
粴	U+7CB4	gung1 lei5
嗧	U+55E7	gaa1 leon4

These three characters are all units of measurement too:

浬	U+6D6C	海里
粴	U+7CB4	公里
嗧	U+55E7	加侖
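
One way to sanity-check the "contraction" reading is to confirm that each listed pronunciation has exactly one Jyutping syllable per character of the expanded word. The sketch below is only illustrative: the `EXPANSIONS` mapping and the `syllables_match` helper are hypothetical names, and the data is limited to the three rows quoted in this thread.

```python
# Readings and expansions quoted in this thread, keyed by code point.
EXPANSIONS = {
    "U+6D6C": ("hoi2 lei5", "海里"),
    "U+7CB4": ("gung1 lei5", "公里"),
    "U+55E7": ("gaa1 leon4", "加侖"),
}


def syllables_match(reading: str, expansion: str) -> bool:
    """True when the reading has one Jyutping syllable per expansion character."""
    return len(reading.split()) == len(expansion)


# Collect any entry whose syllable count does not line up with its expansion.
mismatches = [
    code_point
    for code_point, (reading, expansion) in EXPANSIONS.items()
    if not syllables_match(reading, expansion)
]
```

With these three rows `mismatches` stays empty, which is consistent with the readings being the pronunciation of the two-character word rather than of a single syllable.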

@chaklim
Owner

chaklim commented Jan 28, 2021

If I had to guess about the "米-X" characters now, they should mean "X meters".

@nathanhammond
Author

nathanhammond commented Jan 28, 2021

Okay, so I think this is "mystery solved." They're archaic contractions that didn't make it to modern Cantonese. Some of them may be bad handwriting or bad transcriptions (e.g. 䇉 U+41C9 baak3 sing1; I updated my comment above with a fun link), but regardless, they ended up appearing in one of the source books that was used to populate the table.

Given that, I think we're looking at needing to treat all of them as archaic multi-syllable characters.
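
Under that conclusion, the post-processing becomes simpler: every array is joined into one multi-syllable string. A minimal sketch, assuming the upstream `jyutping` value is either a string or a list of syllable strings; `normalize_jyutping` is a hypothetical helper name, not an existing function in this library.

```python
from typing import List, Union


def normalize_jyutping(jyutping: Union[str, List[str]]) -> str:
    """Return a single space-separated Jyutping string for any upstream value."""
    if isinstance(jyutping, str):
        # Already a single field; only tidy the whitespace.
        return " ".join(jyutping.split())
    # Treat every array as one multi-syllable reading.
    return " ".join(syllable.strip() for syllable in jyutping)
```

For instance, `normalize_jyutping(["jing1", "cyun3"])` gives `"jing1 cyun3"`, and a plain `"mai5"` passes through unchanged.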
