DiffTalk

The PyTorch implementation of our CVPR 2023 paper "DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation".

[Project] [Paper] [Video Demo]

Requirements

  • python 3.7.0
  • pytorch 1.10.0
  • pytorch-lightning 1.2.5
  • torchvision 0.11.0

For more details, please refer to requirements.txt. We conduct the experiments on 8 NVIDIA 3090 Ti GPUs.

Put the first-stage model in ./models.

Dataset

Please download the HDTF dataset for training and testing, and process it as follows.

Data Preprocessing:

  1. Resample all videos to 25 fps.
  2. Extract the audio signals and facial landmarks (a minimal sketch is given right after this list).
  3. Put the processed data in ./data/HDTF, and organize the data directory as follows.
  4. Construct data_train.txt and data_test.txt as follows.
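
Steps 1 and 2 can be scripted in many ways; the sketch below is one possible, unofficial way to do it. It assumes ffmpeg is installed and uses the third-party face_alignment package for the landmarks (an assumption, not a stated dependency of this repo), and it leaves the audio feature extraction that produces the audio_smooth .npy files as a placeholder, since the choice of audio encoder is not specified here.

# Hypothetical preprocessing sketch (not part of this repository).
# Assumptions: ffmpeg is on the PATH; the third-party face_alignment package
# provides 2D facial landmarks; the audio feature extractor is left unspecified.
import subprocess
from pathlib import Path

import numpy as np
from PIL import Image
import face_alignment  # assumption: pip install face-alignment

def preprocess_video(video_path, class_id, out_root="./data/HDTF"):
    out_root = Path(out_root)
    img_dir = out_root / "images"
    lmk_dir = out_root / "landmarks"
    for d in (img_dir, lmk_dir, out_root / "audio_smooth"):
        d.mkdir(parents=True, exist_ok=True)

    # Step 1: resample to 25 fps and dump frames named <class>_<frame>.jpg.
    frame_pattern = str(img_dir / f"{class_id}_%d.jpg")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video_path), "-r", "25",
         "-start_number", "0", "-qscale:v", "2", frame_pattern],
        check=True,
    )

    # Step 2a: facial landmarks for every frame, saved as plain-text .lms files.
    fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device="cpu")
    for img_path in sorted(img_dir.glob(f"{class_id}_*.jpg")):
        preds = fa.get_landmarks(np.array(Image.open(img_path)))
        if preds:
            np.savetxt(str(lmk_dir / (img_path.stem + ".lms")), preds[0], fmt="%.2f")

    # Step 2b: extract per-frame audio features with your chosen audio encoder and
    # save them as ./data/HDTF/audio_smooth/<class>_<frame>.npy (omitted here).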

./data/HDTF:

|——data/HDTF
   |——images
      |——0_0.jpg
      |——0_1.jpg
      |——...
      |——N_M.jpg
   |——landmarks
      |——0_0.lms
      |——0_1.lms
      |——...
      |——N_M.lms
   |——audio_smooth
      |——0_0.npy
      |——0_1.npy
      |——...
      |——N_M.npy

./data/data_train(test).txt:

0_0
0_1
0_2
...
N_M

N is the total number of classes, and M is the class size; each entry is of the form <class index>_<frame index> and names one processed frame.
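
To make the naming convention concrete, the hypothetical snippet below shows how one entry of data_train.txt maps onto the images, landmarks, and audio_smooth files above. It is documentation of the layout only; the repository ships its own data pipeline.

# Hypothetical illustration of the naming convention (not the repository's dataloader).
from pathlib import Path

import numpy as np
from PIL import Image

DATA_ROOT = Path("./data/HDTF")

def load_sample(entry):
    """entry is one line from data_train.txt, e.g. '0_12' = class 0, frame 12."""
    image = np.array(Image.open(DATA_ROOT / "images" / f"{entry}.jpg"))
    landmarks = np.loadtxt(DATA_ROOT / "landmarks" / f"{entry}.lms")  # facial landmark coordinates
    audio = np.load(DATA_ROOT / "audio_smooth" / f"{entry}.npy")      # per-frame audio feature
    return image, landmarks, audio

if __name__ == "__main__":
    with open("./data/data_train.txt") as f:
        entries = [line.strip() for line in f if line.strip()]
    image, landmarks, audio = load_sample(entries[0])
    print(image.shape, landmarks.shape, audio.shape)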

Training

sh run.sh

Test

sh inference.sh

Weaknesses

  1. DiffTalk models talking head generation as an iterative denoising process, so it needs more time to synthesize a frame than most GAN-based approaches. This is a common problem of LDM-based works.
  2. The model is trained on the HDTF dataset, and it sometimes fails on some identities from other datasets.
  3. When driving a portrait with more challenging cross-identity audio, the audio-lip synchronization of the synthesized video is slightly inferior to that under the self-driven setting.
  4. During inference, the network is also sensitive to the mask shape in z_T: the mask needs to cover the mouth region completely, and its shape cannot leak any lip shape information (see the sketch after this list).
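
As a concrete illustration of point 4, the sketch below builds the kind of mask described there: a plain rectangle derived from facial landmarks that fully covers the mouth region while revealing nothing about the lip contour. It is written in image coordinates for clarity and is not the repository's actual masking code.

# Illustrative only: a box-shaped mouth mask built from facial landmarks.
# It covers the lower half of the face completely and, being a plain rectangle,
# leaks no information about the actual lip contour.
import numpy as np

def mouth_box_mask(landmarks, height, width, margin=0.15):
    """landmarks: (N, 2) array of (x, y) coordinates; returns an HxW mask (1 = keep, 0 = masked)."""
    mask = np.ones((height, width), dtype=np.float32)
    x_min, y_min = landmarks.min(axis=0)
    x_max, y_max = landmarks.max(axis=0)
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    y_mid = (y_min + y_max) / 2.0  # mask everything below the vertical face midline
    x0 = int(max(x_min - pad_x, 0))
    x1 = int(min(x_max + pad_x, width))
    y0 = int(max(y_mid, 0))
    y1 = int(min(y_max + pad_y, height))
    mask[y0:y1, x0:x1] = 0.0
    return mask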

Acknowledgement

This code is built upon the publicly available latent-diffusion codebase. Thanks to the authors of latent-diffusion for making their excellent work and code publicly available.

Citation

Please cite the following paper if you use this repository in your research.

@inproceedings{shen2023difftalk,
   author={Shen, Shuai and Zhao, Wenliang and Meng, Zibin and Li, Wanhua and Zhu, Zheng and Zhou, Jie and Lu, Jiwen},
   title={DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation},
   booktitle={CVPR},
   year={2023}
}
