Data gathering #19

Closed
3 of 8 tasks
hugolpz opened this issue Mar 16, 2017 · 10 comments

Comments

@hugolpz
Collaborator

hugolpz commented Mar 16, 2017

We are currently looking for a database of entries shaped like { "glyph": "西", "phonetic": "xī" } (or xi1, or another romanization).
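Since sources may give either tone marks (xī) or tone numbers (xi1), here is a minimal sketch of converting one syllable from the first form to the second. The function name `toNumberedPinyin` is hypothetical, and writing the neutral tone as 5 is an assumption; a source may use 0 or omit the digit.

```javascript
// Combining tone marks as they appear after NFD normalization.
const TONE_MARKS = { "\u0304": 1, "\u0301": 2, "\u030C": 3, "\u0300": 4 };

// Convert one tone-marked syllable (e.g. "xī") to numbered form (e.g. "xi1").
// Hypothetical helper; handles a single syllable only.
function toNumberedPinyin(syllable) {
  let tone = 5; // assumption: neutral tone is written as 5
  const stripped = syllable
    .normalize("NFD") // split letters from their combining marks
    .replace(/[\u0304\u0301\u030C\u0300]/g, (mark) => {
      tone = TONE_MARKS[mark];
      return "";
    })
    .normalize("NFC"); // recompose what is left, e.g. "ü"
  return stripped + tone;
}

console.log({ glyph: "西", phonetic: toNumberedPinyin("xī") });
```

The diaeresis of ü is not in the stripped set, so it survives the round trip.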

Possible sources (information still to be completed):

Moedict

  • link (to complete)
  • json format
  • range: most common characters, traditional only?

Unicode :

  • link (to complete)
  • xml
  • range: traditional/modern; more complete for a font
  • which phonetic format is provided? ("glyph": "西", "phonetic": "xī" or xi1?)

CJKlib

@edouard-lopez
Member

edouard-lopez commented Mar 16, 2017

What about Unihan?

With the hexadecimal codepoint we can get the glyph like this in Python:

>>> print(chr(int('0x897F', 16)))
西

A JS solution would be better, but since this is outside the scope of the project, we can do it any way we see fit.

@hugolpz
Collaborator Author

hugolpz commented Mar 16, 2017

Please check out :

screenshot from 2017-03-16 17-59-36
screenshot from 2017-03-16 17-58-24

@edouard-lopez
Member

Thanks for the link; cjk-unihan might be useful for other projects.

I think it's better to limit the project to generating the font and to outsource the data gathering and validation to another project. This way we stay focused and efficient.

I'm closing this, as different users might have different needs and will therefore handcraft their own dictionaries.

@edouard-lopez
Member

I reckon the JS solution is in the tobei/unihan code:

const character = String.fromCodePoint(parseInt(code.substring(2), 16));
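To tie this back to the { glyph, phonetic } shape asked for at the top, here is a rough sketch that reuses the same conversion on a tab-separated Unihan-style line (code point, field name, value). The helper names `codeToGlyph` and `parseReadingLine` are hypothetical, the sample line is only illustrative, and keeping just the first reading when several are listed is an assumption.

```javascript
// Turn a code such as "U+897F" into its character.
function codeToGlyph(code) {
  return String.fromCodePoint(parseInt(code.substring(2), 16));
}

// Parse one tab-separated line: code point, field name, value.
// Returns null for comments and for any field other than kMandarin.
function parseReadingLine(line) {
  const [code, field, value] = line.split("\t");
  if (field !== "kMandarin") return null;
  // Assumption: when several readings are listed, keep the first one.
  return { glyph: codeToGlyph(code), phonetic: value.split(" ")[0] };
}

// Illustrative sample line, not copied from a data file.
const sample = "U+897F\tkMandarin\txī";
console.log(parseReadingLine(sample));
```

Filtering on a single field keeps the output small; other reading fields would need their own handling.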

@hugolpz
Collaborator Author

hugolpz commented Mar 16, 2017

Did you gather the data?

@edouard-lopez
Member

Not yet, could you work on a project to do so?

@hugolpz
Collaborator Author

hugolpz commented Mar 17, 2017

Yup. See also peterolson/hanzi-tools#1 (comment)

screenshot from 2017-03-17 11-16-25

@edouard-lopez
Member

@hugolpz I think you have a typo in your comment: there is a ratio of 1:10 between the node-pinyin and Unihan character/phonetic pairs. Can you confirm or correct this number?

@hugolpz
Collaborator Author

hugolpz commented Feb 11, 2018

@edouard-lopez
Member

We can get the code point using punycode.
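For reference, a dependency-free sketch of the same glyph-to-code-point direction. Note this swaps in the built-in `String.prototype.codePointAt` rather than punycode, and the helper name `glyphToCode` is hypothetical.

```javascript
// Return the "U+XXXX" notation for the first character of a string.
// Uses the built-in codePointAt instead of punycode (a swapped-in technique).
function glyphToCode(glyph) {
  const hex = glyph.codePointAt(0).toString(16).toUpperCase();
  return "U+" + hex.padStart(4, "0"); // pad short values to four digits
}

console.log(glyphToCode("西")); // "U+897F"
```

`codePointAt` reads a full surrogate pair, so characters outside the Basic Multilingual Plane come back as a single code point.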
