
Expressive Audio-Visual Talking head in Greek

This code accompanies the paper Video-realistic expressive audio-visual speech synthesis for the Greek language. You can get the preprint on ResearchGate.


Demo videos: Neutral, Angry, Happy, Sad

Project Structure

.
├── data                   # Code to extract audio-visual features for training. Modified from HTS (http://hts.sp.nitech.ac.jp/)
├── hts                    # Code to train an expressive audio-visual talking head. Modified from HTS (http://hts.sp.nitech.ac.jp/)
├── merlin                 # Code to train a DNN-based expressive audio-visual talking head. Modified from Merlin (https://github.com/CSTR-Edinburgh/merlin)
├── aam_model              # Code to synthesize the active appearance model from shape and texture features.
├── LICENSE
└── README.md

Prerequisites

  • You need to download the HTS toolkit and SPTK.
  • HTK and Festival are needed only for some special features (e.g., if you want to create your own labels).
  • For DNN-based synthesis you will need Theano and Python 2.7.

Getting Started

Installation

  • Download the CVSP-EAV dataset and extract it. To obtain the dataset, e-mail me at filby[at]central.ntua.gr.

  • Put the AAM model included in the dataset (the file all_emotions.mat) in the aam_model/model/ directory.

  • Download the STRAIGHT vocoder from https://github.com/HidekiKawahara/legacy_STRAIGHT.

  • Compile the MEX files in aam_model/mex, needed for the facial reconstruction, by calling:

make
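
Putting these steps together, a minimal shell walkthrough could look like the sketch below. CVSP_EAV and REPO are placeholders for your own paths, and the exact location of all_emotions.mat inside the dataset may differ:

CVSP_EAV=/path/to/CVSP-EAV                                 # extracted dataset (placeholder)
REPO=/path/to/expressive-audiovisual-speech-synthesis-GR   # this repository (placeholder)

# Copy the AAM model shipped with the dataset (adjust the source path if needed)
cp "$CVSP_EAV"/all_emotions.mat "$REPO"/aam_model/model/

# Fetch the STRAIGHT vocoder
git clone https://github.com/HidekiKawahara/legacy_STRAIGHT

# Build the MEX files used for the facial reconstruction
cd "$REPO"/aam_model/mex
make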

Feature Extraction

  • In the data/ subdirectory, edit the Makefile to point to your system paths (the section you need to edit is marked with comments) and to the desired feature types (e.g., emotion) and outputs; an illustrative sketch of this section follows the steps below.

  • Then, you need to extract STRAIGHT waveform features:

make straight

This step takes a long time, and the resulting features have a total size of around 110 GB.

  • Then, extract the mel-generalized cepstral coefficients, the pitch, and the band-aperiodicity components:

make features
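
For orientation, the user-editable section of data/Makefile amounts to a handful of path and feature-type variables. The sketch below is purely illustrative (the variable names are assumptions, not the actual ones in the Makefile, so follow its comments):

# Hypothetical sketch of the marked section in data/Makefile; real variable names may differ
SPTKDIR  = /usr/local/SPTK/bin          # SPTK binaries
STRAIGHT = /path/to/legacy_STRAIGHT     # STRAIGHT vocoder installation
EMOTION  = angry                        # desired feature type / emotion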

Training the HMM-based expressive audiovisual speech synthesis talking head

  • Copy the folder hts_style_labels from the CVSP-EAV dataset into the data/ subdirectory and rename it to labels (the sketch after this list recaps these steps).

  • In the data/ subdirectory, create some additional files needed for training by running:

make labels

  • In the hts/ subdirectory, edit the configuration script Configuration.pm to point to your system paths (the section you need to edit is marked with comments) and to your configuration choices (e.g., the selected emotion).

  • Train the HMM models:

./Training.pl Configuration.pm

This will take a long time (up to 5-6 hours) depending on your system specs. If an error occurs during training, the steps completed before it do not need to be repeated; you can select which training steps run via the switches in Configuration.pm.

  • You will find the output in hts/gen.
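
To recap this stage as shell commands, with CVSP_EAV and REPO as in the installation sketch above:

# Labels from the dataset, renamed to data/labels
cp -r "$CVSP_EAV"/hts_style_labels "$REPO"/data/labels

# Additional files needed for training
cd "$REPO"/data
make labels

# Edit hts/Configuration.pm (paths, emotion, step switches), then train
cd "$REPO"/hts
./Training.pl Configuration.pm    # output is written to hts/gen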

Training the DNN-based expressive audiovisual speech synthesis talking head

  • Edit merlin/egs/greek/s1/scripts/setup.sh to point to your system paths (the section you need to edit is marked with comments) and to your configuration choices (e.g., the selected emotion).

  • Train the DNN models and generate the output:

./run_full_voice.sh
  • You will find the output in merlin/egs/greek/s1/experiments.
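
The DNN pipeline condenses to the following (paths again illustrative; see the comments in setup.sh for what to change):

cd merlin/egs/greek/s1
# Edit scripts/setup.sh first (system paths and emotion), then:
./run_full_voice.sh    # output is written to experiments/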

Adaptation and Interpolation

The code for HMM adaptation and interpolation is not included. I may add it at some point, but it is currently not in my plans (it is old, buggy, and a hassle to package).

AAM training

As with adaptation and interpolation, the code to train the AAM model from scratch is not provided. If you want the hand-labelled images and landmarks I used to train it, you can e-mail me. I used AAMtools by George Papandreou.

Unit selection

The code for the unit selection part of the paper is not available (it is commercial software from Innoetics).

Author

Acknowledgments

Special thanks to

  • Nassos Katsamanis for his guidance during this project and for the initial codebase.
  • Pyrros Tsiakoulis for his help in the unit selection part of the paper.
  • George Papandreou for his code on active appearance models.
  • Dimitra Tarousi for the recording of the CVSP-EAV database.

License

This project is licensed under the GPL v3 License - see the LICENSE file for details.
