Add documentation and scripts for 8 kHz model #632

Merged
merged 15 commits on May 20, 2020
Changes from all commits
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -78,6 +78,8 @@ To release a new version, please update the changelog as follows:
- Online audio augmentation notebook in ASR examples ([PR #605](https://github.com/NVIDIA/NeMo/pull/605)) - @titu1994
- ContextNet Encoder + Decoder Initial Support ([PR #630](https://github.com/NVIDIA/NeMo/pull/630)) - @titu1994
- Added finetuning with Megatron-LM ([PR #601](https://github.com/NVIDIA/NeMo/pull/601)) - @ekmb
- Added documentation for 8 kHz model ([PR #632](https://github.com/NVIDIA/NeMo/pull/632)) - @jbalam-nv


### Changed
- Syncs across workers at each step to check for NaN or inf loss. Terminates all workers if stop\_on\_nan\_loss is set (as before), lets Apex deal with it if apex.amp optimization level is O1 or higher, and skips the step across workers otherwise. ([PR #637](https://github.com/NVIDIA/NeMo/pull/637)) - @redoctopus
39 changes: 39 additions & 0 deletions docs/sources/source/asr/8kHz_models.rst
@@ -0,0 +1,39 @@
8kHz Models
===========

For applications based on telephony speech, models trained on narrowband audio data sampled at 8 kHz may perform better than models built with audio at a higher sampling rate. (Note that to use a model with audio at a sample rate different from your data, you would need to resample your data to match the sampling rate in the model's config file.) One approach to creating a large dataset for training a model suited to your application is to convert all audio data to the formats prevalent in your application. Here we detail one such approach, which we used to train a model on 8 kHz data.
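
For example, here is a minimal resampling sketch (assuming the third-party ``librosa`` and ``soundfile`` packages and placeholder file names; this is an illustration, not part of NeMo):

.. code-block:: python

    # Resample a wav file to 8 kHz so it matches an 8 kHz model's config.
    # Hypothetical file names; librosa resamples on load when sr is given.
    import librosa
    import soundfile as sf

    audio, sr = librosa.load("utterance_16k.wav", sr=8000)  # resample to 8 kHz
    sf.write("utterance_8k.wav", audio, 8000)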

To train a model suitable for recognizing telephony speech, we converted some of the datasets to G.711 :cite:`8kHz-mod-itu1988g711`. G.711 is a popular speech codec used in VoIP products; it encodes speech
at 64 kbps using PCM u-law companding. We converted audio from the LibriSpeech, Mozilla Common Voice, and WSJ datasets to G.711 format and combined them with the Fisher and Switchboard datasets to
train a :ref:`Quartznet15x5 <Quartznet_model>` model with about 4,000 hours of data. To convert your audio to G.711 format, you can use the script `convert_wav_to_g711wav.py` found in the `scripts` sub-directory of the NeMo base directory.
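
The bundled script wraps this kind of conversion. As a rough sketch of the underlying step (assuming the ``sox`` CLI is installed; the script's actual options may differ):

.. code-block:: python

    # Sketch of the G.711 conversion: resample to 8 kHz, downmix to mono,
    # and apply PCM u-law companding as used by G.711.
    # Assumes the `sox` CLI is on PATH; file names are hypothetical.
    import subprocess

    def wav_to_g711wav(src: str, dst: str) -> None:
        subprocess.run(
            ["sox", src, "-r", "8000", "-c", "1", "-e", "u-law", dst],
            check=True,
        )

    wav_to_g711wav("utterance.wav", "utterance_g711.wav")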

Among the experiments that we ran, we obtained the best accuracy with a model that used our 16 kHz Quartznet15x5 model's weights as pre-trained weights. We then
trained the model for 250 epochs on the five datasets mentioned above. Here are some results for our best model so far (note that all the test sets
were converted to G.711 format for the results below):

====================== =====================
Test set WER (%)
====================== =====================
LibriSpeech dev-clean 4.35
LibriSpeech dev-other 11.89
LibriSpeech test-clean 4.45
LibriSpeech test-other 12.02
Switchboard test 10.74
Switchboard dev 10.59
====================== =====================
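
WER here is the standard word error rate: the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal self-contained sketch of the computation (an illustration, not NeMo's implementation):

.. code-block:: python

    def wer(reference, hypothesis):
        """Word error rate: Levenshtein distance over words / reference length."""
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words.
        d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
        for i in range(len(reference) + 1):
            d[i][0] = i  # i deletions
        for j in range(len(hypothesis) + 1):
            d[0][j] = j  # j insertions
        for i in range(1, len(reference) + 1):
            for j in range(1, len(hypothesis) + 1):
                cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[-1][-1] / len(reference)

    print(wer("the cat sat".split(), "the cat sit".split()))  # ~0.33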

The model was first pretrained on 8 kHz LibriSpeech data for 134 epochs and then trained for another 250 epochs using G.711 audio from all five datasets listed above. For best accuracy
in your application, you may choose to :ref:`fine-tune <fine-tune>` this model using data collected from your application.
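
Fine-tuning expects your data described by a manifest file: one JSON object per line giving the audio path, duration, and transcript. A hypothetical helper for building such a manifest from audio collected in your application (file names and the transcript are placeholders):

.. code-block:: python

    # Append one entry per utterance to a NeMo-style training manifest.
    import json
    import wave

    def add_entry(manifest_path, audio_path, transcript):
        with wave.open(audio_path, "rb") as w:
            duration = w.getnframes() / float(w.getframerate())
        entry = {"audio_filepath": audio_path,
                 "duration": duration,
                 "text": transcript}
        with open(manifest_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    add_entry("train_manifest.json", "call_0001.wav", "thank you for calling")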

..
The pre-trained model is available for download `here <https://ngc.nvidia.com/models/nvidian:nemo:quartznet_15x5_8_khz_for_nemo>`_.

Comment from the author (collaborator): Commented this out, as we are going to wait for QA to approve the model for release.

Comment: Any update on this? Would like to test the model.

References
----------
.. bibliography:: asr_all.bib
:style: plain
:labelprefix: 8kHz-mod
:keyprefix: 8kHz-mod-
12 changes: 8 additions & 4 deletions docs/sources/source/asr/asr_all.bib
@@ -60,8 +60,6 @@ @misc{ardila2019common
primaryClass={cs.CL}
}



@article{graves2012,
title={Sequence Transduction with Recurrent Neural Networks},
author={Graves, Alex},
@@ -927,8 +925,14 @@ @article{novograd2019
}

@article{kriman2019quartznet,
title={Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions},
title={Quartznet: {Deep} automatic speech recognition with 1d time-channel separable convolutions},
author={Kriman, Samuel and Beliaev, Stanislav and Ginsburg, Boris and Huang, Jocelyn and Kuchaiev, Oleksii and Lavrukhin, Vitaly and Leary, Ryan and Li, Jason and Zhang, Yang},
journal={arXiv preprint arXiv:1910.10261},
year={2019}
}
}

@misc{itu1988g711,
title={{ITU-T} {G.711} - {Pulse} code modulation ({PCM}) of voice frequencies},
author={ITU-T Geneva Switzerland},
year={1988},
}
2 changes: 2 additions & 0 deletions docs/sources/source/asr/intro.rst
@@ -10,6 +10,8 @@ Speech Recognition
tutorial
datasets
models
8kHz_models




7 changes: 7 additions & 0 deletions docs/sources/source/asr/jasper.rst
@@ -23,3 +23,10 @@ Jasper10x5dr | Librispeech, `here <https://ngc.nvidia.com/catalog/mode
| Switchboard
Jasper15x5SEP Aishell2 `here <https://ngc.nvidia.com/catalog/models/nvidia:aishell2_jasper10x5dr>`__
============= ======================= =================================================================================

References
^^^^^^^^^^
.. bibliography:: asr_all.bib
:style: plain
:labelprefix: ASR-MODELS
:keyprefix: asr-models-
7 changes: 0 additions & 7 deletions docs/sources/source/asr/models.rst
@@ -7,10 +7,3 @@ Models
jasper
quartznet

References
-------------

.. bibliography:: asr_all.bib
:style: plain
:labelprefix: ASR-MODELS
:keyprefix: asr-models-
13 changes: 8 additions & 5 deletions docs/sources/source/asr/quartznet.rst
@@ -1,19 +1,19 @@
.. _Quartznet_model:

QuartzNet
---------

QuartzNet is a version of Jasper :cite:`asr-models-li2019jasper` model with separable convolutions and larger filters. It can achieve performance
QuartzNet :cite:`qtz-models-kriman2019quartznet` is a version of Jasper :cite:`qtz-models-li2019jasper` model with separable convolutions and larger filters. It can achieve performance
similar to Jasper but with an order of magnitude fewer parameters.
Like Jasper, the QuartzNet family of models is denoted QuartzNet_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D *separable* convolution, batch normalization, ReLU, and dropout:

.. image:: quartz_vertical.png
:align: center
:alt: quartznet model

.. note:: This checkpoint was trained on LibriSpeech :cite:`panayotov2015librispeech` and full "validated" part of En Mozilla Common Voice :cite:`ardila2019common`

`QuartzNet paper <https://arxiv.org/abs/1910.10261>`_.

Pretrained models can be found, `here <https://ngc.nvidia.com/catalog/models/nvidia:quartznet15x5>`_.
Pretrained models can be found at the following links:

============= ===================== ==============================================================================
Network Dataset Download Link
@@ -24,7 +24,10 @@ QuartzNet15x5 Aishell2 `here <https://ngc.nvidia.com/catalog/models
============= ===================== ==============================================================================

References
----------
^^^^^^^^^^

.. bibliography:: asr_all.bib
:style: plain
:labelprefix: QTZ-MODELS
:keyprefix: qtz-models-

1 change: 1 addition & 0 deletions docs/sources/source/asr/tutorial.rst
@@ -288,6 +288,7 @@ The command above should trigger 8-GPU training with mixed precision. In the com
.. tip::
You can pass several manifests (comma-separated) to train on a combined dataset like this: `--train_manifest=/manifests/librivox-train-all.json,/manifests/librivox-train-all-sp10pcnt.json,/manifests/cv/validated.json`. Here it combines 3 data sets: LibriSpeech, Mozilla Common Voice and LibriSpeech speed perturbed.

.. _fine-tune:

Fine-tuning
-----------
7 changes: 0 additions & 7 deletions docs/sources/source/speaker_recognition/models.rst
@@ -6,10 +6,3 @@ Models

quartznet

References
----------

.. bibliography:: speaker.bib
:style: plain
:labelprefix: SPEAKER-TUT
:keyprefix: speaker-tut-
8 changes: 0 additions & 8 deletions docs/sources/source/speaker_recognition/quartznet.rst
@@ -29,11 +29,3 @@ QuartzNet3x2 voxceleb1 ffsvc-dev 14.22% 7
voxceleb2
============== ================= ===================== ====================== ==========


References
----------

.. bibliography:: speaker.bib
:style: plain
:labelprefix: SPEAKER-TUT
:keyprefix: speaker-tut-
7 changes: 0 additions & 7 deletions docs/sources/source/speech_command/models.rst
@@ -6,10 +6,3 @@ Models

quartznet

References
-------------

.. bibliography:: speech_recognition_all.bib
:style: plain
:labelprefix: SPEECH-RECOGNITION-MODELS
:keyprefix: speech-recognition-models-
6 changes: 2 additions & 4 deletions docs/sources/source/speech_command/quartznet.rst
@@ -1,7 +1,7 @@
QuartzNet
---------

QuartzNet is a version of Jasper :cite:`asr-models-li2019jasper` model with separable convolutions and larger filters. It can achieve performance
QuartzNet is a version of Jasper :cite:`speech-recognition-models-li2019jasper` model with separable convolutions and larger filters. It can achieve performance
similar to Jasper but with an order of magnitude fewer parameters.
Like Jasper, the QuartzNet family of models is denoted QuartzNet_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D *separable* convolution, batch normalization, ReLU, and dropout:

@@ -11,8 +11,6 @@ These models are trained on Google Speech Commands dataset (V1 - all 30 classes)
:align: center
:alt: quartznet model

.. note:: This checkpoint was trained on LibriSpeech :cite:`panayotov2015librispeech` and full "validated" part of En Mozilla Common Voice :cite:`ardila2019common`

`QuartzNet paper <https://arxiv.org/abs/1910.10261>`_.

These QuartzNet models were trained for 200 epochs using mixed precision on 2 GPUs with a batch size of 128.
@@ -32,7 +30,7 @@ QuartzNet3x2 (93k params) Speech Commands V2 97.29% Test


References
----------
^^^^^^^^^^

.. bibliography:: speech_recognition_all.bib
:style: plain
7 changes: 7 additions & 0 deletions docs/sources/source/speech_command/speech_recognition_all.bib
@@ -40,4 +40,11 @@ @article{park2019
year = "2019",
eid = {arXiv:1904.08779},
eprint = {1904.08779},
}

@article{li2019jasper,
title={Jasper: An End-to-End Convolutional Neural Acoustic Model},
author={Li, Jason and Lavrukhin, Vitaly and Ginsburg, Boris and Leary, Ryan and Kuchaiev, Oleksii and Cohen, Jonathan M and Nguyen, Huyen and Gadde, Ravi Teja},
journal={arXiv preprint arXiv:1904.03288},
year={2019}
}