Nonnegative matrix factorization (NMF) is a powerful technique for extracting meaningful patterns from an observed matrix and has been used in many audio signal processing applications. In this article, the principle of NMF and some extensions based on a complex generative model are reviewed. Their application to audio source separation is also presented.
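As a rough illustration of the NMF principle mentioned above, the following minimal sketch factorizes a nonnegative matrix with the standard multiplicative updates for the squared Euclidean error. It is not taken from the article: the function name `nmf`, the toy matrix, and the parameters `rank`, `n_iter`, and `eps` are assumptions made for this example, and the reviewed article may use other divergences.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Approximate nonnegative V (F x N) as W @ H, with W (F x rank) and H (rank x N)."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps  # spectral patterns (basis vectors)
    H = rng.random((rank, n_frames)) + eps  # time-varying activations
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram" built from two nonnegative patterns (illustrative data only).
rng = np.random.default_rng(1)
V = rng.random((16, 2)) @ rng.random((2, 40))
W, H = nmf(V, rank=2)
print("relative error:", np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```

In an audio setting, the columns of `W` would play the role of recurring spectral patterns and the rows of `H` their activations over time.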
Independent low-rank matrix analysis (ILRMA) is the state-of-the-art algorithm for blind source separation (BSS) in the determined situation (the number of microphones is greater than or equal to that of source signals). ILRMA achieves a great separation performance by modeling the power spectrograms of the source signals via the nonnegative matrix factorization (NMF). Such a highly developed source model can solve the permutation problem of the frequency-domain BSS to a large extent, which is the reason for the excellence of ILRMA. In this paper, we further improve the separation performance of ILRMA by additionally considering the general structure of spectrograms, which is called consistency, and hence, we call the proposed method Consistent ILRMA. Since a spectrogram is calculated by an overlapping window (and a window function induces spectral smearing called main- and side-lobes), the time-frequency bins depend on each other. In other words, the time-frequency components are related to each other via the uncertainty principle. Such co-occurrence among the spectral components can function as an assistant for solving the permutation problem, which has been demonstrated by a recent study. On the basis of these facts, we propose an algorithm for realizing Consistent ILRMA by slightly modifying the original algorithm. Its performance was extensively evaluated through experiments performed with various window lengths and shift lengths. The results indicated several tendencies of the original and proposed ILRMA that include some topics not fully discussed in the literature. For example, the proposed Consistent ILRMA tends to outperform the original ILRMA when the window length is sufficiently long compared to the reverberation time of the mixing system.
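The notion of consistency described above can be illustrated with a small sketch. A complex array is a consistent spectrogram if it is the STFT of some time-domain signal, and an arbitrary array can be mapped to a consistent one by applying the inverse STFT followed by the STFT. The code below is only an illustration under assumptions: the helper names `stft`, `istft`, and `project_consistent`, the Hann window, and the window and hop sizes are chosen for this example and are not the paper's implementation.

```python
import numpy as np

def stft(x, win, hop):
    """Naive STFT: windowed frames every `hop` samples, returned as (freq bins, frames)."""
    n_frames = 1 + (len(x) - len(win)) // hop
    frames = np.stack([x[i * hop : i * hop + len(win)] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(X, win, hop):
    """Least-squares inverse STFT: windowed overlap-add divided by the summed squared window."""
    frames = np.fft.irfft(X.T, n=len(win), axis=1)
    n = hop * (X.shape[1] - 1) + len(win)
    x = np.zeros(n)
    norm = np.zeros(n)
    for i, frame in enumerate(frames):
        x[i * hop : i * hop + len(win)] += frame * win
        norm[i * hop : i * hop + len(win)] += win ** 2
    return x / np.maximum(norm, 1e-12)

def project_consistent(X, win, hop):
    """Map any complex array to a consistent spectrogram via STFT(iSTFT(X))."""
    return stft(istft(X, win, hop), win, hop)

win, hop = np.hanning(256), 64  # overlapping windows make neighboring bins dependent
rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
X = stft(x, win, hop)  # consistent by construction
Y = rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape)  # generally inconsistent
P = project_consistent(Y, win, hop)
print("change for a true spectrogram:", np.linalg.norm(project_consistent(X, win, hop) - X))
print("change for a random array:    ", np.linalg.norm(P - Y))
```

A true spectrogram is left essentially unchanged by the projection, whereas a random array is altered substantially, which is the kind of inter-bin dependency the abstract refers to.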
This paper addresses the determined blind source separation problem and proposes a new effective method unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF). IVA is a state-of-the-art technique that utilizes the statistical independence between sources in a mixture signal, and an efficient optimization scheme has been proposed for IVA. However, since the source model in IVA is based on a spherical multivariate distribution, IVA cannot utilize specific spectral structures such as the harmonic structures of pitched instrumental sounds. To solve this problem, we introduce NMF decomposition as the source model in IVA to capture the spectral structures. The formulation of the proposed method is derived from conventional multichannel NMF (MNMF), which reveals the relationship between MNMF and IVA. The proposed method can be optimized by the update rules of IVA and single-channel NMF. Experimental results show the efficacy of the proposed method compared with IVA and MNMF in terms of separation accuracy and convergence speed.
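To make the idea of an NMF source model more concrete, the sketch below fits a low-rank variance model to the power spectrogram of one separated source using majorization-minimization updates for the Itakura-Saito divergence. It covers only a single-channel NMF step and omits the demixing-matrix update entirely. The function names, the toy data, and the choice of the Itakura-Saito cost are assumptions made for illustration, not a transcription of the paper's algorithm.

```python
import numpy as np

def is_cost(P, R):
    """Itakura-Saito-type cost (up to constants) between power P and model variance R."""
    return np.sum(P / R + np.log(R))

def update_source_model(P, T, V, eps=1e-12):
    """One MM update of bases T (F x K) and activations V (K x N) so that T @ V fits P."""
    R = T @ V
    T = np.maximum(T * np.sqrt(((P / R**2) @ V.T) / ((1.0 / R) @ V.T)), eps)
    R = T @ V
    V = np.maximum(V * np.sqrt((T.T @ (P / R**2)) / (T.T @ (1.0 / R))), eps)
    return T, V

# Toy power spectrogram of one estimated source (illustrative data only).
rng = np.random.default_rng(0)
true_var = rng.random((32, 3)) @ rng.random((3, 60)) + 1e-3
P = true_var * rng.exponential(size=true_var.shape)  # |y_fn|^2 with low-rank variance

T = rng.random((32, 3)) + 0.1
V = rng.random((3, 60)) + 0.1
costs = [is_cost(P, T @ V)]
for _ in range(30):
    T, V = update_source_model(P, T, V)
    costs.append(is_cost(P, T @ V))
print("cost before/after:", costs[0], costs[-1])
```

In a full method of this kind, such a step would alternate with an IVA-style update of the demixing matrices, with `P` recomputed from the current separated signals.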
This paper proposes harmonic vector analysis (HVA) based on a general algorithmic framework of audio blind source separation (BSS) that is also presented in this paper. BSS for a convolutive audio mixture is usually performed by multichannel linear filtering when the numbers of microphones and sources are equal (determined situation). This paper addresses such determined BSS based on batch processing. To estimate the demixing filters, effective modeling of the source signals is important. One successful example is independent vector analysis (IVA), which models the signals via co-occurrence among the frequency components in each source. To give more freedom to the source modeling, a general framework of determined BSS is presented in this paper. It is based on the plug-and-play scheme using a primal-dual splitting algorithm and enables us to model the source signals implicitly through a time-frequency mask. By using the proposed framework, determined BSS algorithms can be developed by designing masks that enhance the source signals. As an example of its application, we propose HVA by defining a time-frequency mask that enhances the harmonic structure of audio signals via sparsity of the cepstrum. The experiments showed that HVA outperforms IVA and independent low-rank matrix analysis (ILRMA) for both speech and music signals. MATLAB code is provided along with the paper for reference.
This paper describes several important methods for the blind source separation of audio signals in an integrated manner. Two historically developed routes are featured. One started from independent component analysis and evolved to independent vector analysis (IVA) by extending the notion of independence from a scalar to a vector. In the other route, nonnegative matrix factorization (NMF) has been extended to multichannel NMF (MNMF). As a convergence point of these two routes, independent low-rank matrix analysis has been proposed, which integrates IVA and MNMF in a clever way. All the objective functions in these methods are efficiently optimized by majorization-minimization algorithms with appropriately designed auxiliary functions. Experimental results for a simple two-source two-microphone case are given to illustrate the characteristics of these five methods.
• We propose phase reconstruction methods from amplitude spectrograms using directional statistics deep neural networks (DNNs).
• The directional statistics DNN is a novel deep generative model that has a circular probability distribution as the conditional probability.
• We use the DNN to model not only phase of speech signals but also group delay that is strongly related to amplitude spectra.
• Experimental evaluation demonstrates that our method outperforms the conventional signal processing based method.
This paper presents a deep neural network (DNN)-based phase reconstruction method from amplitude spectrograms. In speech processing, an amplitude spectrogram is often used for processing, and the corresponding phases are reconstructed from the amplitude spectrogram by using the Griffin-Lim method. However, the Griffin-Lim method causes unnatural artifacts in synthetic speech. To solve this problem, we propose the directional-statistics DNNs for predicting phases from the amplitude spectrograms. We first propose the von Mises distribution DNN, which is a generative model having the von Mises distribution and models histograms of a periodic variable. We extend it for modeling group delay that has a stronger connection to the amplitude spectrograms. Furthermore, we generalize the group-delay modeling and propose another DNN called the sine-skewed generalized cardioid distribution DNN for modeling asymmetric histograms such as a group delay. Results from objective and subjective evaluations indicate that (1) our von Mises distribution DNN can predict group delay more accurately than predicting phases, (2) our DNN works as better initialization of the Griffin-Lim method, (3) the phase reconstruction methods based on our von Mises distribution DNN achieve better speech quality than the conventional Griffin-Lim method, and (4) our sine-skewed generalized cardioid distribution DNN models the group delay more accurately than our von Mises distribution DNN.
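The two quantities at the heart of this abstract can be sketched briefly: the negative log-likelihood of an angle under a von Mises distribution, which respects the 2π periodicity of phase, and the group delay computed as a negative wrapped phase difference along frequency. The function names and the toy signal below are assumptions for illustration. How the paper's DNN parameterizes and trains these quantities is not reproduced here.

```python
import numpy as np

def von_mises_nll(theta, mu, kappa):
    """Negative log-likelihood of angle theta under von Mises(mean mu, concentration kappa)."""
    return -kappa * np.cos(theta - mu) + np.log(2.0 * np.pi * np.i0(kappa))

def group_delay(phase):
    """Negative wrapped phase difference along frequency (axis 0), in radians per bin."""
    return -np.angle(np.exp(1j * np.diff(phase, axis=0)))

# The loss is periodic: shifting the target by 2*pi does not change it.
print(von_mises_nll(0.3, 0.1, 2.0), von_mises_nll(0.3 + 2 * np.pi, 0.1, 2.0))

# Toy check: an impulse delayed by 3 samples has a constant group delay.
x = np.zeros(64)
x[3] = 1.0
gd = group_delay(np.angle(np.fft.rfft(x)))
print(gd[:4], 2 * np.pi * 3 / 64)
```

With a fixed `kappa`, minimizing this loss reduces to maximizing `cos(theta - mu)`, so the loss depends only on the angular difference modulo 2π.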
Timbre conversion of musical instrument sounds, utilizing deep neural networks (DNNs), has been extensively researched and continues to generate significant interest in the development of more advanced techniques. We propose a novel algorithm for timbre conversion that utilizes a variational autoencoder. However, this system must be capable of predicting the amplitude spectrogram from the mel-frequency cepstrum coefficient (MFCC). This research aims to build a DNN-based decoder that utilizes the MFCC and time-frame-wise total amplitude as inputs to predict the amplitude spectrogram. Experiments conducted using a musical instrument sound dataset show that a decoder incorporating bidirectional long short-term memory yields accurate predictions of amplitude spectrograms.