SF-CRL: Speech-Facial Contrastive Representation Learning for Speaker Feature Extraction

Recently, several studies have shown that various modalities can be used to extract features from audio data. For example, based on the CLIP methodology, the pre-trained AudioCLIP model achieved state-of-the-art (SOTA) performance on the ESC dataset by extracting generalized features from audio with text and image. However, most research focuses on general sounds such as rain or animal noises. In this reason, this study aims to extract unique features of individual voices using a dataset of human speech. Several previous studies have demonstrated the close correlation between human speech and facial features such as the jawline and oral structure. Leveraging this correlation, we performed cross-modal contrastive learning between pairs of human face images and speech. Recent advancements in multimodal learning have highlighted the potential of integrating various modalities to enhance feature extraction from audio data. However, most research has focused on non-human sounds. This study addresses this gap by extracting unique features of individual voices using human speech datasets. We propose SF-CRL (Speech-Facial Contrastive Representation Learning), a model leveraging the correlation between speech and facial features.

Model

The image shows the overall architecture of SF-CRL. Utilizing mel-spectrograms of speech and corresponding facial images, our model employs modified VGG-M architectures for audio and image encoders and incorporates a custom loss function to maximize the similarity between audio and visual features. Both the audio and image encoders utilized a modified VGG-M architecture. We implemented our model to perform cross-modal contrastive learning between audio and visual features. This model utilizes a custom loss function to maximize the similarity between corresponding audio and visual features. To train the model, run the following code.

python model.py

Experiment

For Experiment, we calculated speech-face matching accuracy by cosine similarity. Also, we performed retrieving image from audio. We assessed whether the model could match a randomly generated image among five images to a given speech on unseen data (GRID dataset). This approach aims to extract distinctive features of individual voices, potentially enhancing applications in speaker recognition and other areas requiring precise voice analysis.

Evaluations on the LRS3 and GRID datasets demonstrate that SF-CRL outperforms benchmark models such as Resemblyzer, Wav2Vec 2.0, and AudioCLIP in speaker similarity and cross-modal retrieval tasks. Our approach effectively captures distinctive voice features, with potential applications in speaker recognition and biometric authentication.

Evaluation Result
Speaker Scatter Plot
Encoder Attention Map
- Audio feature extractor
- Visual feature extractor
Ablation Study for Feature-Matching Loss

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
README.md		README.md
model.py		model.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SF-CRL: Speech-Facial Contrastive Representation Learning for Speaker Feature Extraction

Model

Experiment

About

Releases

Packages

Contributors 2

Languages

dxlabskku/2024capstone_SFCRL

Folders and files

Latest commit

History

Repository files navigation

SF-CRL: Speech-Facial Contrastive Representation Learning for Speaker Feature Extraction

Model

Experiment

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages