Audio-Visual Predictive Coding (AVPC)

This repository provides a PyTorch implementation of AVPC, which can be used to separate musical instrument sounds when the corresponding video frames are available.

Paper

Visually Guided Sound Source Separation with Audio-Visual Predictive Coding
Zengjie Song¹, Zhaoxiang Zhang²
¹School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, China
²New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
IEEE Transactions on Neural Networks and Learning Systems (T-NNLS), 2023
PDF | arXiv

Abstract: The framework of visually guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor the visual feature extractor for informative visual guidance and to separately devise a module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, as jointly optimizing and harmonizing the various model components is challenging. By contrast, this article presents a novel approach, dubbed audio-visual predictive coding (AVPC), that tackles this task in a parameter-efficient and more effective manner. The network of AVPC features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding (PC)-based sound separation network that extracts audio features, fuses multimodal information, and predicts sound separation masks within the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop a valid self-supervised learning strategy for AVPC by co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while significantly reducing the model size.
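
The core mechanism described above is an iterative fusion driven by prediction errors. The following is a minimal, hypothetical PyTorch sketch of that idea only; the module and variable names are illustrative assumptions and do not mirror the actual AVPC networks in this repository.

import torch
import torch.nn as nn

class ToyPCFusion(nn.Module):
    # Toy illustration of prediction-error-driven audio-visual fusion (not the AVPC model).
    def __init__(self, dim=512, steps=4):
        super().__init__()
        self.steps = steps
        self.predict = nn.Linear(dim, dim)  # top-down prediction of audio features from the fused state
        self.update = nn.Linear(dim, dim)   # maps the prediction error back to a state update

    def forward(self, visual_feat, audio_feat):
        state = visual_feat                   # initialize the fused state with visual guidance
        for _ in range(self.steps):
            pred = self.predict(state)        # predict the audio features
            err = audio_feat - pred           # prediction error
            state = state + self.update(err)  # refine the state to reduce the error
        return state                          # progressively refined audio-visual representation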

Dependencies

We have tested the code on the following environment:

  • Python 3.9.18 | PyTorch 2.0.0 | torchvision 0.15.0 | torchaudio 2.0.0 | CUDA 11.8 | Ubuntu 20.04.4
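
If you need to set up a similar environment, an installation along the following lines should work (the CUDA-specific wheel index is an assumption; adjust it to your local CUDA setup):

pip install torch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 --index-url https://download.pytorch.org/whl/cu118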

Download & pre-process videos

We train and test models on two music video datasets, MUSIC-11 and MUSIC-21, and additionally show some qualitative results on URMP. Videos are downloaded with youtube-dl when only the YouTube IDs are given. Please see the main text of the paper (Sec. V-A) for details on pre-processing the video frames and audio signals. To accelerate data loading, we split the full audio track of each video into 20-second segments, as illustrated by the sketch below.
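
The following sketch splits one extracted audio track into 20-second segments with torchaudio. It is not the repository's actual pre-processing script; the file paths and the segment naming scheme are assumptions.

import os
import torchaudio

def split_audio(in_path, out_dir, segment_sec=20):
    # Load the full audio track; waveform has shape (channels, samples).
    waveform, sr = torchaudio.load(in_path)
    seg_len = segment_sec * sr
    os.makedirs(out_dir, exist_ok=True)
    # Keep only complete 20-second segments and drop the trailing remainder.
    num_segments = waveform.shape[1] // seg_len
    for i in range(num_segments):
        segment = waveform[:, i * seg_len:(i + 1) * seg_len]
        torchaudio.save(os.path.join(out_dir, f"{i:04d}.wav"), segment, sr)
    return num_segments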

We provide the information of the MUSIC-11 training, validation, and test data in data/train516.csv, data/val11.csv, and data/test11.csv, respectively. In each .csv file, the first column gives the path to the audio segments; the second column gives the path to the video frames; the third column gives the number of audio segments; and the fourth column gives the number of video frames.
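
For illustration, the rows can be read as follows (a hypothetical sketch, not the repository's dataloader; the comma delimiter is an assumption):

import csv

with open('data/train516.csv') as f:
    for audio_path, frame_path, num_audio_segments, num_frames in csv.reader(f):
        # audio_path: directory of 20-second audio segments for one video
        # frame_path: directory of extracted frames for the same video
        print(audio_path, frame_path, int(num_audio_segments), int(num_frames))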

Training

To train AVPC on MUSIC-11 with the default settings, simply run:

python main.py

Note: For both model training and testing, you need to specify the system environment in the main() function.

Test

Training produces the checkpoints frame_best.pth and sound_best.pth, which need to be placed in models/pretrained_models/ before testing. To test AVPC on MUSIC-11 with the default settings, simply run:

python test.py

Citation

Please consider citing our paper in your publications if the project helps your research.

@article{song2023visually,
  title={Visually Guided Sound Source Separation With Audio-Visual Predictive Coding},
  author={Song, Zengjie and Zhang, Zhaoxiang},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  year={2023}
}

Acknowledgement

Our code is developed based on Sound-of-Pixels, and we thank the authors for sharing their code. We also thank Xudong Xu for his valuable suggestions on accelerating data loading.
