Allow automatic training for wakewords outside of pronunciation library #65

nh-kando · 2023-10-19T14:30:33Z

I am trying to train a custom wakeword in French. To do so, I've spelled the wakeword phonetically so that it more or less matches the French pronunciation.

The problem is that the line:

!{sys.executable} openwakeword/openwakeword/train.py --training_config my_model.yaml --generate_clips

does not work for such a word. The first error that I am getting is:

FileNotFoundError: [Errno 2] No such file or directory: '/content/openwakeword/openwakeword/resources/en_us_cmudict_forward.pt'

But even if I manually download the file and put it in the correct folder, I then get:

Traceback (most recent call last):
  File "/content/openwakeword/openwakeword/train.py", line 553, in <module>
    adversarial_texts.extend(generate_adversarial_texts(
  File "/content/openwakeword/openwakeword/data.py", line 989, in generate_adversarial_texts
    adversarial_texts.append(" ".join(np.random.choice(txts, size=n_words, replace=False)))
  File "mtrand.pyx", line 965, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

I am running the notebook on Google Colab, with a T4 GPU.
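
For context, the ValueError in the traceback is what `np.random.choice` raises when it is asked for more distinct items than the population holds, which here means `txts` had fewer than `n_words` entries. Below is a minimal sketch that reproduces the error in isolation and shows one possible guard. `sample_words` is a hypothetical helper written for illustration; it is not a function in openwakeword, and capping the sample size is only an assumption about what a reasonable fallback might look like.

```python
import numpy as np


def sample_words(txts, n_words, rng=None):
    """Join up to n_words distinct items from txts, never more than txts holds."""
    if rng is None:
        rng = np.random.default_rng()
    # Cap the sample size at the population size to avoid the ValueError.
    size = min(n_words, len(txts))
    if size == 0:
        return ""
    return " ".join(rng.choice(txts, size=size, replace=False))


# Reproduce the error from the traceback: 3 distinct items from a list of 2.
raised = False
try:
    np.random.choice(["a", "b"], size=3, replace=False)
except ValueError:
    raised = True

# The guarded helper returns what is available instead of raising.
print(sample_words(["alpha", "beta"], 5))
```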


dscripka · 2023-10-22T00:21:11Z

@nh-kando, after looking into this error I think I understand how it might have happened, but it's a very odd (and potentially rare) scenario. Can you share the wakeword you are trying to use so I can try to reproduce this error locally?

StuartIanNaylor · 2023-10-24T16:45:53Z

Have a look at https://github.com/StuartIanNaylor/ProjectEars/blob/main/dataset/ml-commons/create-db-syllable.py:

The Sonority Sequencing Principle (SSP) is a language agnostic algorithm proposed
by Otto Jespersen in 1904. The sonorous quality of a phoneme is judged by the
openness of the lips. Syllable breaks occur before troughs in sonority. For more
on the SSP see Selkirk (1984).

The default implementation uses the English alphabet, but the sonority_hierarchy
can be modified to IPA or any other alphabet for the use-case. The SSP is a
universal syllabification algorithm, but that does not mean it performs equally
across languages. Bartlett et al. (2009) is a good benchmark for English accuracy
if utilizing IPA (pg. 311).

Importantly, if a custom hierarchy is supplied and vowels span across more than
one level, they should be given separately to the vowels class attribute.

References:

Otto Jespersen. 1904. Lehrbuch der Phonetik.
Leipzig, Teubner. Chapter 13, Silbe, pp. 185-203.
Elisabeth Selkirk. 1984. On the major class features and syllable theory.
In Aronoff & Oehrle (eds.) Language Sound Structure: Studies in Phonology.
Cambridge, MIT Press. pp. 107-136.
Susan Bartlett, et al. 2009. On the Syllabification of Phonemes.
In HLT-NAACL. pp. 308-316.

You could use that; you just have to add a language-specific entry, as is done for en at https://github.com/StuartIanNaylor/ProjectEars/blob/fda5758ee37bbcb2be4dcab6a59dbb1407081139/dataset/ml-commons/create-db-syllable.py#L197
and for de at https://github.com/StuartIanNaylor/ProjectEars/blob/fda5758ee37bbcb2be4dcab6a59dbb1407081139/dataset/ml-commons/create-db-syllable.py#L204

It's likely easier to leave providing the sonority_hierarchy to native speakers of the language, who can google it.
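
To make the "break before troughs in sonority" idea concrete, here is a simplified sketch of SSP-style syllabification with a swappable hierarchy. It is not the implementation used in the linked script. The letter groupings in `DEFAULT_HIERARCHY` are an illustrative assumption rather than a vetted phonological table, and the function name `syllabify` is made up for this example.

```python
# Illustrative letter-based hierarchy, most sonorous group first.
# The grouping is an assumption for demonstration purposes only.
DEFAULT_HIERARCHY = [
    "aeiouy",        # vowels
    "lmnrw",         # nasals and approximants
    "zvsf",          # fricatives
    "bcdgtkpqxhj",   # stops and remaining letters
]


def syllabify(word, hierarchy=DEFAULT_HIERARCHY, vowels="aeiouy"):
    """Split word into syllables by breaking before troughs in sonority."""
    word = word.lower()
    # Map each character to a sonority level; unknown characters get 0.
    level = {ch: len(hierarchy) - rank
             for rank, group in enumerate(hierarchy) for ch in group}
    levels = [level.get(ch, 0) for ch in word]

    syllables, current = [], ""
    for i, ch in enumerate(word):
        # A trough is less sonorous than the next character and
        # no more sonorous than the previous one.
        is_trough = (0 < i < len(word) - 1
                     and levels[i - 1] >= levels[i] < levels[i + 1])
        # Require a vowel on both sides so every syllable keeps a nucleus.
        has_vowel_before = any(c in vowels for c in current)
        has_vowel_after = any(c in vowels for c in word[i:])
        if is_trough and has_vowel_before and has_vowel_after:
            syllables.append(current)
            current = ""
        current += ch
    syllables.append(current)
    return syllables


print(syllabify("sonority"))
print(syllabify("alexa"))
```

Supporting another language would then amount to passing a different `hierarchy` (and `vowels`) for that alphabet or for IPA, which is the part a native speaker is best placed to supply.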

dscripka · 2023-10-25T23:05:32Z

@nh-kando I have fixed a few bugs and made updates to the example Google Colab training notebooks that should resolve this issue.

Let me know if you are still facing any issues when trying to train a model.

dscripka mentioned this issue Oct 22, 2023

Automatic Training - Google Collab Errors #70

Closed

dscripka closed this as completed Nov 6, 2023
