Localizing audio sources is challenging in real reverberant environments, especially when several sources are active. We propose a neural network built from stacked convolutional and recurrent layers to estimate the directions of arrival of multiple sources from a first-order Ambisonics recording. It returns the directions of arrival of a known number of sources over a discrete grid. We propose to use features derived from the acoustic intensity vector as inputs. We analyze the behavior of the neural network by means of a visualization technique called layerwise relevance propagation, which highlights the parts of the input signal that are relevant in a given situation. We also conduct experiments to evaluate the performance of our system in various environments, from simulated rooms to real recordings, with one or two speech sources. The results show that the proposed features significantly improve performance with respect to raw Ambisonics inputs.
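As an illustration, active and reactive intensity features of the kind described above can be computed from the four first-order Ambisonics STFT channels. The sketch below is a hypothetical helper (function and variable names are not from the paper), assuming B-format channels W (pressure) and X, Y, Z (particle-velocity proxies):

```python
import numpy as np

def foa_intensity_features(W, X, Y, Z):
    # W: omnidirectional (pressure) channel; X, Y, Z: first-order
    # channels acting as particle-velocity proxies. All are complex
    # STFT arrays of shape (frames, freq_bins).
    p_conj = np.conj(W)
    v = np.stack([X, Y, Z])                    # (3, frames, freq_bins)
    active = np.real(p_conj * v)               # points toward the source
    reactive = np.imag(p_conj * v)
    # normalize per time-frequency bin so only direction information remains
    norm = np.linalg.norm(active, axis=0, keepdims=True) + 1e-12
    return active / norm, reactive / norm
```

The per-bin normalization keeps only directional information, which matches the idea of feeding the network direction-of-arrival cues rather than raw magnitudes.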
Sound Event Detection in Synthetic Domestic Environments Serizel, Romain; Turpault, Nicolas; Shah, Ankit ...
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
05/2020
Conference Proceeding
Open Access
We present a comparative analysis of the performance of state-of-the-art sound event detection systems. In particular, we study the robustness of the systems to noise and signal degradation, which is known to impact model generalization. Our analysis is based on the results of task 4 of the DCASE 2019 challenge, where submitted systems were evaluated on, in addition to real-world recordings, a series of synthetic soundscapes that allow us to carefully control for different soundscape characteristics. Our results show that while overall systems exhibit significant improvements compared to previous work, they still suffer from biases that could prevent them from generalizing to real-world scenarios.
•A constant residual noise power constraint for rank-1 MWF is proposed.
•Speech covariance matrix reconstruction to fulfill the rank-1 assumption.
•An extensive comparison of the multichannel linear filters supported by BLSTM.
•A feature variance metric that correlates with the word error rate.
Multichannel linear filters, such as the Multichannel Wiener Filter (MWF) and the Generalized Eigenvalue (GEV) beamformer, are popular signal processing techniques which can improve speech recognition performance. In this paper, we present an experimental study on these linear filters in a specific speech recognition task, namely the CHiME-4 challenge, which features real recordings in multiple noisy environments. Specifically, the rank-1 MWF is employed for noise reduction and a new constant residual noise power constraint is derived which enhances the recognition performance. To fulfill the underlying rank-1 assumption, the speech covariance matrix is reconstructed based on eigenvectors or generalized eigenvectors. Then the rank-1 constrained MWF is evaluated with alternative multichannel linear filters under the same framework, which involves a Bidirectional Long Short-Term Memory (BLSTM) network for mask estimation. The proposed filter outperforms alternative ones, leading to a 40% relative Word Error Rate (WER) reduction compared with the baseline Weighted Delay and Sum (WDAS) beamformer on the real test set, and a 15% relative WER reduction compared with the GEV-BAN method. The results also suggest that the speech recognition accuracy correlates more with the Mel-frequency cepstral coefficients (MFCC) feature variance than with the noise reduction or the speech distortion level.
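A rank-1 MWF of this general kind can be sketched as follows: the speech covariance is forced to rank one via its principal eigenpair, then plugged into the speech-distortion-weighted MWF solution. This is an illustrative generic form, not the authors' constant residual noise power variant:

```python
import numpy as np

def rank1_mwf(phi_s, phi_n, mu=1.0, ref=0):
    # Force the speech covariance to rank one via its principal eigenpair
    vals, vecs = np.linalg.eigh(phi_s)
    sigma, a = vals[-1], vecs[:, -1]
    phi_n_inv = np.linalg.inv(phi_n)
    # speech-distortion-weighted rank-1 MWF for reference channel `ref`;
    # mu trades noise reduction against speech distortion
    num = sigma * (phi_n_inv @ a) * np.conj(a[ref])
    den = mu + sigma * np.real(np.conj(a) @ phi_n_inv @ a)
    return num / den
```

Because the filter depends on the steering vector only through rank-1 products, the eigenvector sign ambiguity cancels out, which makes the eigenvector-based covariance reconstruction well defined.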
This paper introduces deep neural network (DNN)–hidden Markov model (HMM)-based methods to tackle speech recognition in heterogeneous groups of speakers including children. We target three speaker groups consisting of children, adult males and adult females. Two different kinds of approaches are introduced here: approaches based on DNN adaptation and approaches relying on vocal-tract length normalisation (VTLN). First, the recent approach that consists in adapting a general DNN to domain/language-specific data is extended to target age/gender groups in the context of DNN–HMM. Then, VTLN is investigated by training a DNN–HMM system using either mel-frequency cepstral coefficients normalised with standard VTLN, or mel-frequency cepstral coefficient-derived acoustic features combined with the posterior probabilities of the VTLN warping factors. In this latter, novel approach, the posterior probabilities of the warping factors are obtained with a separate DNN, and decoding can be operated in a single pass, whereas the standard VTLN approach requires two decoding passes. Finally, the different approaches presented here are combined to take advantage of their complementarity. The combination of several approaches is shown to improve the baseline phone error rate by thirty to thirty-five per cent relative and the baseline word error rate by about ten per cent relative.
Unsupervised speech enhancement based on variational autoencoders has shown promising performance compared with the commonly used supervised methods. This approach involves the use of a pre-trained deep speech prior along with a parametric noise model, where the noise parameters are learned from the noisy speech signal with an expectation-maximization (EM)-based method. The E-step involves an intractable latent posterior distribution. Existing algorithms for this step are based either on computationally heavy Markov chain Monte Carlo sampling methods and variational inference, or on inefficient optimization-based methods. In this paper, we propose a new approach based on Langevin dynamics that generates multiple sequences of samples and comes with a total-variation-based regularization to incorporate temporal correlations of latent vectors. Our experiments demonstrate that the developed framework makes an effective compromise between computational efficiency and enhancement quality, and outperforms existing methods.
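The core sampling idea can be sketched with an unadjusted Langevin iteration on a latent sequence, where a total-variation subgradient penalizes abrupt changes between consecutive latent vectors. All names and the exact update form here are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def langevin_tv_sampler(grad_log_post, z0, n_steps=100, eta=1e-2,
                        lam=0.1, rng=None):
    # z0: initial latent sequence, shape (T, latent_dim)
    rng = np.random.default_rng() if rng is None else rng
    z = z0.copy()
    for _ in range(n_steps):
        g = grad_log_post(z)
        # subgradient of the temporal total-variation penalty
        # lam * sum_t ||z[t+1] - z[t]||_1
        s = np.sign(np.diff(z, axis=0))
        g_tv = np.zeros_like(z)
        g_tv[1:] += s
        g_tv[:-1] -= s
        # unadjusted Langevin step on the regularized log-posterior
        z = (z + eta * (g - lam * g_tv)
             + np.sqrt(2 * eta) * rng.standard_normal(z.shape))
    return z
```

Running several such chains from different initializations yields the multiple sample sequences over which posterior expectations can be approximated in the E-step.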
This paper proposes a benchmark of submissions to the Detection and Classification of Acoustic Scenes and Events 2021 Challenge (DCASE) Task 4, representing a sampling of the state of the art in sound event detection. The submissions are evaluated according to the two polyphonic sound detection score scenarios proposed for DCASE 2021 Challenge Task 4, which allow an analysis of whether submissions are designed to perform fine-grained temporal segmentation, coarse-grained temporal segmentation, or have been designed to be polyvalent across the scenarios proposed. We study the solutions proposed by participants to analyze their robustness to varying target to non-target signal-to-noise ratios and to the temporal localization of target sound events. A final experiment studies the impact of non-target events on system outputs. Results show that systems adapted to provide coarse segmentation outputs are more robust to different target to non-target signal-to-noise ratios and, with the help of specific data augmentation methods, are more robust to the time localization of the original event. Results of the final experiment show that systems tend to spuriously predict short events when non-target events are present. This is particularly true for systems tailored to fine segmentation.
This paper presents low-rank approximation based multichannel Wiener filter algorithms for noise reduction in speech-plus-noise scenarios, with application to cochlear implants. In a single speech source scenario, the frequency-domain autocorrelation matrix of the speech signal is often assumed to be a rank-1 matrix, which allows the derivation of different rank-1 approximation based noise reduction filters. In practice, however, the rank of the autocorrelation matrix of the speech signal is usually greater than one. Firstly, the link between the different rank-1 approximation based noise reduction filters and the original speech distortion weighted multichannel Wiener filter is investigated when the rank of the autocorrelation matrix of the speech signal is indeed greater than one. Secondly, in low input signal-to-noise-ratio scenarios, due to noise non-stationarity, the estimation of the autocorrelation matrix of the speech signal can be problematic and the noise reduction filters can deliver unpredictable noise reduction performance. An eigenvalue decomposition based filter and a generalized eigenvalue decomposition based filter are introduced that include a more robust rank-1, or more generally rank-R, approximation of the autocorrelation matrix of the speech signal. These noise reduction filters are demonstrated to deliver better noise reduction performance, especially in low input signal-to-noise-ratio scenarios. The filters are especially useful in cochlear implants, where more speech distortion, and hence more aggressive noise reduction, can be tolerated.
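A generalized eigenvalue decomposition based rank-R approximation of the speech autocorrelation matrix can be sketched as follows, using noise-covariance whitening (a standard route to the GEVD of a matrix pair). This is a minimal sketch under the assumption that the noisy covariance is speech plus noise; names are illustrative:

```python
import numpy as np

def rank_r_speech_cov(phi_y, phi_n, R=1):
    # Whiten the noisy covariance by the noise covariance, then apply an
    # ordinary eigendecomposition (equivalent to a GEVD of the pair).
    L = np.linalg.cholesky(phi_n)
    L_inv = np.linalg.inv(L)
    white = L_inv @ phi_y @ L_inv.conj().T
    vals, V = np.linalg.eigh(white)            # eigenvalues ascending
    # noise contributes 1 to every whitened eigenvalue; the excess is speech
    d = np.maximum(vals - 1.0, 0.0)
    if R < len(d):
        d[:-R] = 0.0                           # keep the R dominant components
    B = L @ V
    return B @ np.diag(d) @ B.conj().T
```

Truncating to the R dominant generalized eigenvalues and clipping negative excesses is what makes the estimate robust when the raw covariance estimates are noisy at low input SNR.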
This paper presents weighted approaches for integrated active noise control and noise reduction in hearing aids. The unweighted integrated active noise control and noise reduction scheme introduced in previous work does not allow a trade-off between active noise control and noise reduction. In some circumstances it will, however, be useful to emphasize one of the functional blocks.
Changing the original optimisation problem to a constrained optimisation problem leads to a scheme based on a weighted mean squared error criterion that allows focusing either on the active noise control or on the noise reduction. It is similarly possible to derive a scheme that allows focusing either on reducing the speech distortion or on reducing the residual noise at the eardrum. In a single speech source scenario, when the number of sound sources (speech plus noise sources) is less than or equal to the number of microphones, a simple formula for the output signal-to-noise ratio of the latter scheme can be derived. It can then be shown that this scheme delivers a constant signal-to-noise ratio at the eardrum for any weighting factor.
•Introduce speech distortion weighted integrated ANC and NR.
•Theoretical derivation of the output SNR in the single speech source case.
•Theoretical validation of the previous results: weighted ANC and NR (EUSIPCO '09).
•Experimental comparison of un-weighted ANC and NR and weighted approaches.
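The speech-distortion-weighted trade-off underlying such schemes can be illustrated with the generic SDW filter, where a weighting factor shifts the emphasis between speech distortion and residual noise. This is a textbook-style sketch, not the paper's integrated ANC/NR formulation:

```python
import numpy as np

def sdw_filter(R_s, R_n, mu, ref=0):
    # w minimizes speech distortion + mu * residual noise power:
    # w = (R_s + mu * R_n)^{-1} R_s e_ref
    # Larger mu suppresses more noise at the cost of more distortion.
    e = np.zeros(R_s.shape[0])
    e[ref] = 1.0
    return np.linalg.solve(R_s + mu * R_n, R_s @ e)
```

As mu tends to zero the filter approaches a distortionless solution, while increasing mu trades distortion for lower residual noise power, which is the kind of emphasis shift the weighted criterion provides.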
In this paper, we study the usefulness of various matrix factorization methods for learning features for the acoustic scene classification (ASC) problem. A common way of addressing ASC has been to engineer features capable of capturing the specificities of acoustic environments. Instead, we show that better representations of the scenes can be learned automatically from time-frequency representations using matrix factorization techniques. We mainly focus on extensions of principal component analysis and nonnegative matrix factorization, including sparse, kernel-based and convolutive variants, as well as a novel supervised dictionary learning variant. An experimental evaluation is performed on two of the largest available ASC datasets in order to compare and discuss the usefulness of these methods for the task. We show that the unsupervised learning methods provide better representations of acoustic scenes than the best conventional hand-crafted features on both datasets. Furthermore, a novel nonnegative supervised matrix factorization model and deep neural networks trained on spectrograms allow us to reach further improvements.
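The basic unsupervised building block can be sketched with plain nonnegative matrix factorization using multiplicative updates; statistics of the learned activations can then serve as scene features. This is a minimal Frobenius-cost sketch, not the supervised variant proposed in the paper, and the function name is illustrative:

```python
import numpy as np

def nmf_features(V, k=16, n_iter=200, rng=None):
    # V: nonnegative time-frequency matrix, shape (freq, frames).
    # Returns a dictionary W (freq, k) and activations H (k, frames).
    rng = np.random.default_rng(0) if rng is None else rng
    F, T = V.shape
    W = rng.random((F, k)) + 1e-3
    H = rng.random((k, T)) + 1e-3
    for _ in range(n_iter):
        # multiplicative updates for the Frobenius cost ||V - WH||^2
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H
```

Because the updates are multiplicative, W and H stay nonnegative throughout, which is what makes the learned dictionary atoms interpretable as spectral patterns of the scenes.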
Performing an adequate evaluation of sound event detection (SED) systems is far from trivial and is still subject to ongoing research. The recently proposed polyphonic sound detection receiver operating characteristic (PSD-ROC) and PSD score (PSDS) are an important step toward an evaluation of SED systems that is independent of a particular decision threshold. This gives a more complete picture of the overall system behavior that is less biased by threshold tuning. Yet, the PSD-ROC is currently only approximated using a finite set of thresholds, and the choice of these thresholds can have a severe impact on the resulting PSDS. In this paper we propose a method for computing system performance on an evaluation set jointly for all possible thresholds, enabling accurate computation not only of the PSD-ROC and PSDS but also of other collar-based and intersection-based performance curves. It further allows selecting the threshold that best fulfills the requirements of a given application. Source code is publicly available in our SED evaluation package sed_scores_eval.
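The key idea of evaluating at all thresholds jointly can be illustrated at its simplest: sort the scores once, and every unique score becomes an operating point of the curve. The frame-level ROC below is a simplified sketch of that principle, not the intersection-based PSD-ROC computation itself:

```python
import numpy as np

def roc_all_thresholds(scores, labels):
    # scores: detection scores; labels: 0/1 ground truth per frame.
    # Sorting once yields the operating point at every possible threshold,
    # instead of approximating the curve on a fixed threshold grid.
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)           # true positives above each cut
    fp = np.cumsum(1 - labels)       # false positives above each cut
    tpr = tp / max(labels.sum(), 1)
    fpr = fp / max((1 - labels).sum(), 1)
    return fpr, tpr
```

Any statistic derived from the curve, such as an area or a best operating point for a given application, can then be read off exactly rather than approximated from a handful of thresholds.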