# 2312.08723.md
## Background

- **Background** This paper trains a contextual music generation model on a dataset split into individual audio parts, or stems. It notes that most existing music generation models are text-driven; the paper instead builds a model that integrates contextual musical information, which is more practical in many applications.

- **Existing Work** Current work concentrates on text-conditioned music generation and does not fully account for the contextual information and multi-channel (multi-stem) properties of music.

## Core Contributions

- Introduced a non-autoregressive, language-model-style music generation architecture
  - **Challenge 1: Handling multi-channel audio and contextual information.** The model supports multiple audio channels and is adapted to the multi-stem nature of music. It also uses novel sampling methods to improve music quality and contextual alignment (see the sketch after this list).
  - **Challenge 2: Improving music quality and contextual consistency.** The model is trained on an internal dataset, with Fréchet Audio Distance (FAD) and Mean Opinion Score (MOS) tests used to verify the quality and contextual consistency of the generated music; the experiments show that, with the best sampling parameters, the model produces music of quality comparable to existing state-of-the-art models.
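
The paper's code is not reproduced in these notes. As a rough illustration of what non-autoregressive, language-model-style generation over discrete audio tokens can look like, here is a MaskGIT-style iterative-unmasking sketch in PyTorch. Everything in it is an assumption for illustration: the `model` interface, `MASK_ID`, `SEQ_LEN`, and the cosine unmasking schedule are hypothetical, and the paper's own sampling refinements are not reproduced here.

```python
import math
import torch
import torch.nn.functional as F

MASK_ID = 1024   # hypothetical id of the special [MASK] token
SEQ_LEN = 512    # hypothetical number of audio tokens in the generated stem

@torch.no_grad()
def iterative_unmask(model, context_tokens, steps=8, temperature=1.0):
    """MaskGIT-style non-autoregressive decoding (illustrative sketch).

    `model(context_tokens, stem_tokens)` is assumed to return logits of
    shape (SEQ_LEN, vocab_size) for the stem being generated, conditioned
    on tokens of the musical context -- a hypothetical interface.
    """
    tokens = torch.full((SEQ_LEN,), MASK_ID, dtype=torch.long)
    for step in range(steps):
        probs = F.softmax(model(context_tokens, tokens) / temperature, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        # Positions committed in earlier steps keep their tokens.
        sampled = torch.where(tokens == MASK_ID, sampled, tokens)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        conf[tokens != MASK_ID] = float("inf")  # never re-mask committed tokens

        # Cosine schedule: fraction of positions left masked after this step.
        n_masked = int(math.cos(math.pi / 2 * (step + 1) / steps) * SEQ_LEN)

        # Commit the most confident positions, re-mask the rest.
        tokens = torch.full_like(tokens, MASK_ID)
        keep = conf.argsort(descending=True)[: SEQ_LEN - n_masked]
        tokens[keep] = sampled[keep]
    return tokens
```

In a real system the committed tokens would then be decoded back to audio by the codec that produced them; the point of the sketch is only that all positions are sampled in parallel and refined over a few steps, rather than one token at a time.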

## Implementation and Deployment

The paper primarily uses two evaluation metrics: Fréchet Audio Distance (FAD) for the objective quality of the music and Mean Opinion Score (MOS) for its subjective quality. Experiments show that music generated by the model trained on the internal dataset is comparable in quality to that of existing top music generation models, and the MOS results indicate that the model with the best sampling parameters produces musically plausible output.
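
For reference, here is the standard FAD definition (not the paper's exact evaluation pipeline): FAD is the Fréchet distance between two Gaussians fitted to embeddings of reference and generated audio, typically extracted with a pretrained model such as VGGish. A minimal sketch, assuming the embeddings have already been extracted:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    ref_emb, gen_emb: arrays of shape (n_clips, emb_dim), e.g. VGGish
    embeddings of reference and generated audio (extraction not shown).
    """
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(ref_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)

    # sqrtm can return tiny imaginary parts from numerical error; drop them.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

MOS, by contrast, is simply the mean of listeners' ratings (usually on a 1-5 scale) for each condition, typically reported with a confidence interval.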

## Summary

The paper presents a new non-autoregressive, language-model-style approach to music generation that handles multiple audio channels and keeps the generated music consistent with its musical context, and it demonstrates the quality of the generated music and its contextual alignment through both objective (FAD) and subjective (MOS) evaluation.