Peer reviewed · Open access
  • Multi-representation knowle...
    Gao, Liang; Xu, Kele; Wang, Huaimin; Peng, Yuxing

    Multimedia Tools and Applications, 02/2022, Volume 81, Issue 4
    Journal Article

    Audio classification aims to discriminate between different types of audio signals, and it has received intensive attention due to its wide range of applications. In deep learning-based audio classification methods, researchers usually transform the raw audio signal into different feature representations (such as the Short-Time Fourier Transform or Mel-Frequency Cepstral Coefficients) as inputs to the networks. However, selecting a feature representation requires expert knowledge and extensive experimental verification. Moreover, using a single type of feature representation may lead to suboptimal results, as the information carried by different kinds of feature representations can be complementary. Previous work shows that ensembling networks trained on different representations can greatly boost classification performance. However, making inferences with multiple networks is cumbersome and computationally expensive. In this paper, we propose a novel end-to-end collaborative training framework for the audio classification task. The framework takes multiple representations as inputs and trains the networks jointly with a knowledge distillation method. Consequently, our framework significantly improves the performance of the networks without increasing the computational overhead at the inference stage. Extensive experimental results demonstrate that the proposed approach improves classification performance and achieves competitive results on both acoustic scene classification and general audio tagging tasks.
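
    The record contains no code, so the following is only a minimal sketch of the kind of collaborative training the abstract describes: two small networks are trained jointly on two assumed representations (an STFT-like and an MFCC-like input) with a mutual knowledge-distillation term, so that either single network can be used alone at inference time. The network architecture (SmallCNN), the loss weighting alpha, and the temperature T are illustrative assumptions, not the authors' actual design.

    ```python
    # Sketch: joint training of two networks on different audio feature
    # representations with a mutual knowledge-distillation loss.
    # All names and hyperparameters here are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallCNN(nn.Module):
        """Toy classifier over a 2-D time-frequency representation."""
        def __init__(self, num_classes: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(16, num_classes)

        def forward(self, x):
            h = self.features(x).flatten(1)
            return self.classifier(h)

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        """KL divergence between softened predictions of two networks."""
        p_teacher = F.softmax(teacher_logits.detach() / T, dim=1)
        log_p_student = F.log_softmax(student_logits / T, dim=1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

    def joint_step(net_stft, net_mfcc, x_stft, x_mfcc, labels, alpha=0.5):
        """One collaborative step: each network learns from the labels and
        from the other network's softened predictions."""
        logits_a = net_stft(x_stft)
        logits_b = net_mfcc(x_mfcc)
        ce = F.cross_entropy(logits_a, labels) + F.cross_entropy(logits_b, labels)
        kd = distillation_loss(logits_a, logits_b) + distillation_loss(logits_b, logits_a)
        return (1 - alpha) * ce + alpha * kd

    # Usage with random stand-in features shaped like spectrograms and MFCCs.
    num_classes = 10
    net_stft, net_mfcc = SmallCNN(num_classes), SmallCNN(num_classes)
    params = list(net_stft.parameters()) + list(net_mfcc.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)

    x_stft = torch.randn(8, 1, 257, 100)   # batch of STFT-like inputs
    x_mfcc = torch.randn(8, 1, 40, 100)    # batch of MFCC-like inputs
    labels = torch.randint(0, num_classes, (8,))

    loss = joint_step(net_stft, net_mfcc, x_stft, x_mfcc, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # At inference only one of the networks is needed, which is how this style
    # of training avoids the cost of running a full ensemble.
    ```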