Question about grapheme set #14

kkp15 · 2022-01-13T14:51:46Z

Hello. Thank you for this amazing repository!
I have a question though. What’s the easiest way to get a unique grapheme set for a specific language? How did you get that list when training a multilingual model?

cschaefer26 · 2022-01-21T09:18:48Z

Hi, you can just extract it from the training data: collect the set of characters that occur in it and paste the result into the config. That's basically how I proceeded for the trained models (though I filtered out some graphemes).
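
A minimal sketch of that extraction in plain Python, assuming the training data is a list of (language, word, phonemes) tuples. The sample data, the helper name `collect_symbols`, and the `min_count` filter are hypothetical and not part of the repository:

```python
from collections import Counter

# Hypothetical training data: (language, word, phonemes) tuples.
train_data = [
    ('en', 'hello', 'həloʊ'),
    ('en', 'world', 'wɜːld'),
    ('de', 'über', 'yːbɐ'),
]

def collect_symbols(data, lang, min_count=1):
    """Return sorted unique graphemes and phoneme characters for one language.

    Symbols seen fewer than min_count times are dropped, which is one
    (assumed) way to filter out rare or noisy graphemes.
    """
    graphemes, phonemes = Counter(), Counter()
    for sample_lang, word, phons in data:
        if sample_lang == lang:
            graphemes.update(word)   # counts each character of the word
            phonemes.update(phons)   # counts each character of the transcription
    keep = lambda counts: sorted(s for s, n in counts.items() if n >= min_count)
    return keep(graphemes), keep(phonemes)

text_symbols, phoneme_symbols = collect_symbols(train_data, 'en')
print(''.join(text_symbols))
print(''.join(phoneme_symbols))
```

The printed strings can then be pasted into the config by hand.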

skanda1005 · 2022-06-07T06:14:00Z

Hi @cschaefer26, I want to train the model for Hindi, but I'm unsure how to set up the config file, especially the input and output, because I'm getting an index out of range error. Thanks!

cschaefer26 · 2022-06-09T08:24:08Z

Hi, you can use the standard config file, but you will have to adjust the language and:

text_symbols
phoneme_symbols

according to the symbols that occur in your dataset!
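
One way to check this before training is to compare the dataset against the configured symbols. The sketch below is plain Python with made-up data: the two symbol strings stand in for the `text_symbols` and `phoneme_symbols` config entries, and `missing_symbols` is a hypothetical helper. Whether an uncovered symbol is what triggers the index error above is an assumption.

```python
# Stand-ins for the config entries (illustrative values only).
text_symbols = 'कल'
phoneme_symbols = 'kl'

# Hypothetical dataset: (language, word, phonemes) tuples.
data = [('hi', 'कल', 'kəl')]

def missing_symbols(data, text_symbols, phoneme_symbols):
    """Return the characters used in the data but absent from the config."""
    used_text = {ch for _, word, _ in data for ch in word}
    used_phon = {ch for _, _, phons in data for ch in phons}
    return used_text - set(text_symbols), used_phon - set(phoneme_symbols)

missing_text, missing_phon = missing_symbols(data, text_symbols, phoneme_symbols)
print('missing graphemes:', missing_text)
print('missing phonemes:', missing_phon)
```

Both sets should be empty once the config covers the dataset.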

skanda1005 · 2022-06-09T08:27:05Z

Got it working, thanks!

cschaefer26 · 2022-06-09T08:47:07Z

Nice, let me know if you run into issues.

skanda1005 · 2022-06-10T10:11:03Z

Hi, I realized that if some phonemes in my phoneme set consist of multiple characters, they don't get parsed: those multi-character phones are either removed or replaced after preprocessing.
Any solutions to this issue?

cschaefer26 · 2022-06-10T10:31:32Z

Hi, multiple characters shouldn't be a problem, the cmudict model has multi-char phonemes: https://github.com/as-ideas/DeepPhonemizer#:~:text=en_us_cmudict_forward

You can pass each sample as a tuple of [str, str, list], e.g. ('en', 'word', ['p', 'h', 'o', 'neme'])
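
To illustrate the tuple layout described above with plain Python (no DeepPhonemizer calls; the words and ARPAbet-style transcriptions are illustrative samples): each multi-character phoneme is one list element and therefore stays intact, whereas iterating over a plain string yields single characters.

```python
# Illustrative samples in the (language, word, phoneme list) layout.
train_data = [
    ('en_us', 'cat', ['K', 'AE', 'T']),
    ('en_us', 'ship', ['SH', 'IH', 'P']),
]

# With lists, the phoneme inventory is the set of list elements.
phoneme_inventory = sorted({p for _, _, phons in train_data for p in phons})
print(phoneme_inventory)

# With a plain string, the same phoneme would be split into characters.
as_string = list('SH')
print(as_string)
```

The inventory built this way is presumably what would need to be listed under phoneme_symbols for such a dataset.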

skanda1005 · 2022-06-10T11:04:04Z

So, I am training it on Hindi, and phones like t͡ʃ and ẽː don't get parsed. I used these as inputs for the tokenizer and there is no output, meaning they don't get tokenized.
PS: t͡ʃ is actually 3 chars, not 2. Would that cause a problem?

cschaefer26 · 2022-06-10T11:27:06Z

No, that should be fine. Actually, your example looks more like there should be three phoneme chars as output instead of a single phoneme instance incorporating all three chars (t͡ʃ). Just make sure the symbols are present in the config (phoneme_symbols, e.g. '͡').

skanda1005 · 2022-06-10T12:45:14Z

Oh, so should I separate the chars of that phone into 3 different elements in the list?
e.g. ['t', '͡', 'ʃ']

cschaefer26 · 2022-06-10T15:08:26Z

Yes, that's also how the standard config is set. You can then simply provide the phonemized words as strings.
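
As a small illustration in plain Python (standard library only, no DeepPhonemizer calls): the affricate t͡ʃ is a sequence of three Unicode code points, so iterating over the string already yields the three separate elements discussed above. The phone is written with explicit escapes here to avoid copy-paste ambiguity.

```python
import unicodedata

phone = 't\u0361\u0283'  # t͡ʃ: t, combining tie bar, esh

# A Python str iterates by code point, so this splits the phone into 3 chars.
chars = list(phone)
for ch in chars:
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))

# Each of these chars would need to appear in phoneme_symbols.
needed_symbols = sorted(set(chars))
print(len(chars), needed_symbols)
```

Note that the tie bar is a combining character, so it is easy to miss when reading the symbol list by eye.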
