Objective: In a cochlear implant (CI) speech processor, noise reduction (NR) is a critical component for enabling CI users to attain improved speech perception under noisy conditions. Identifying an effective NR approach has long been a key topic in CI research. Method: Recently, a deep denoising autoencoder (DDAE) based NR approach was proposed and shown to be effective in restoring clean speech from noisy observations. It was also shown that DDAE could provide better performance than several existing NR methods in standardized objective evaluations. Following this success with normal speech, this paper further investigated the performance of DDAE-based NR in improving the intelligibility of envelope-based vocoded speech, which simulates the speech signal processing in existing CI devices. Results: We compared the speech intelligibility of DDAE-based NR and conventional single-microphone NR approaches using the noise vocoder simulation. The results of both objective evaluations and listening tests showed that, under nonstationary noise distortion, DDAE-based NR yielded higher intelligibility scores than conventional NR approaches. Conclusion and significance: This study confirmed that DDAE-based NR could potentially be integrated into a CI processor to provide greater benefits to CI users under noisy conditions.
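As a rough illustration of the denoising-autoencoder idea behind DDAE-based NR (not the paper's actual model or features), the sketch below trains a one-hidden-layer autoencoder in plain numpy to map synthetic "noisy" feature frames back to their "clean" counterparts; the 16-dimensional random vectors stand in for log-power spectral features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for log-power spectral features (dim 16):
# "clean" frames plus additive noise give "noisy" frames.
dim, n = 16, 256
clean = rng.normal(0.0, 1.0, (n, dim))
noisy = clean + rng.normal(0.0, 0.3, (n, dim))

# One-hidden-layer denoising autoencoder, trained to map noisy -> clean.
h = 32
W1 = rng.normal(0, 0.1, (dim, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (h, dim)); b2 = np.zeros(dim)

def forward(x):
    z = np.tanh(x @ W1 + b1)       # hidden representation
    return z, z @ W2 + b2          # reconstruction of clean features

def mse(a, b):
    return float(np.mean((a - b) ** 2))

lr = 0.05
loss0 = mse(forward(noisy)[1], clean)
for _ in range(300):
    z, out = forward(noisy)
    err = (out - clean) / n             # gradient of MSE w.r.t. output (up to a constant)
    gW2 = z.T @ err; gb2 = err.sum(0)
    dz = err @ W2.T * (1 - z ** 2)      # backprop through tanh
    gW1 = noisy.T @ dz; gb1 = dz.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
loss1 = mse(forward(noisy)[1], clean)
print(loss0, loss1)  # reconstruction loss drops as training proceeds
```

The real DDAE is deeper and operates on spectral features of speech; this sketch only demonstrates the noisy-to-clean regression objective at its core.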
Previous studies have proven that integrating video signals, as a complementary modality, can facilitate improved performance for speech enhancement (SE). However, video clips usually contain large amounts of data and pose a high cost in terms of computational resources, and thus may complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while manifesting speech-phoneme structures, and thus complements its air-conducted counterpart. In this study, we propose a novel multi-modal SE structure in the time domain that leverages bone- and air-conducted signals. In addition, we examine two ensemble-learning-based strategies, early fusion (EF) and late fusion (LF), to integrate the two types of speech signals, and adopt a deep learning-based fully convolutional network to conduct the enhancement. The experimental results on the Mandarin corpus indicate that this newly presented multi-modal (integrating bone- and air-conducted signals) SE structure significantly outperforms the single-source SE counterparts (with a bone- or air-conducted signal only) in various speech evaluation metrics. In addition, adopting an LF strategy rather than an EF strategy in this novel multi-modal SE structure achieves better results.
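To make the EF/LF distinction concrete, here is a minimal data-flow sketch, assuming two synthetic waveforms and a placeholder linear enhancer in place of the paper's fully convolutional network: EF combines the streams before a single joint model, while LF enhances each stream separately and merges the outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
air  = rng.normal(size=1024)   # hypothetical air-conducted waveform
bone = rng.normal(size=1024)   # hypothetical bone-conducted waveform

def enhance(x):
    """Placeholder single-stream enhancer (a moving average stands in
    for the paper's fully convolutional network)."""
    k = np.ones(5) / 5.0
    return np.convolve(x, k, mode="same")

# Early fusion (EF): combine the two streams, then run one joint model.
def early_fusion(a, b):
    joint = np.stack([a, b])            # (2, T) multi-channel input
    return enhance(joint.mean(axis=0))  # a real model would learn this combination

# Late fusion (LF): enhance each stream separately, then merge the outputs.
def late_fusion(a, b):
    return 0.5 * (enhance(a) + enhance(b))

ef = early_fusion(air, bone)
lf = late_fusion(air, bone)
print(ef.shape, lf.shape)  # both (1024,)
```

With learned, nonlinear enhancers the two strategies behave differently, which is what the paper's comparison evaluates; here the averaging steps are only placeholders for those learned components.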
Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this paper, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. We also propose a multitask learning framework for reconstructing audio and visual signals at the output layer. More precisely, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed using individual CNNs, and then fused into a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and parameters are jointly learned through backpropagation. We evaluate enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model also outperforms an existing audio-visual SE model, confirming its capability of effectively combining audio and visual information in SE.
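The multitask objective described above can be sketched as a weighted sum of two reconstruction losses, one for the primary (speech) task and one for the secondary (image) task. The outputs, targets, and the weight `alpha` below are all hypothetical placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical network outputs and targets for one mini-batch.
speech_out, speech_tgt = rng.normal(size=(8, 257)), rng.normal(size=(8, 257))
image_out,  image_tgt  = rng.normal(size=(8, 64)),  rng.normal(size=(8, 64))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

alpha = 0.8  # assumed weight favoring the primary (speech) task
loss = alpha * mse(speech_out, speech_tgt) + (1 - alpha) * mse(image_out, image_tgt)
print(loss)
```

In end-to-end training, backpropagating this combined scalar loss jointly updates the audio CNN, the visual CNN, and the fusion network.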
Deep learning techniques such as convolutional neural networks (CNNs) have been successfully applied to identify pathological voices. However, the major disadvantage of using these advanced models is the lack of interpretability in explaining the predicted outcomes. This drawback further introduces a bottleneck for promoting voice-disorder classification or detection systems, especially in this pandemic period. In this paper, we propose using a series of learnable sinc functions to replace the very first layer of a commonly used CNN to develop an explainable SincNet system for classifying or detecting pathological voices. The applied sinc filters, a front-end signal processor in SincNet, are critical for constructing the meaningful layer and are directly used to extract the acoustic features for the following networks to generate high-level voice information. We conducted our tests on three different Far Eastern Memorial Hospital voice datasets. From our evaluations, the proposed approach achieves up to 7% accuracy and 9% sensitivity improvements over conventional methods, and thus demonstrates the superior performance of the SincNet system in predicting input pathological waveforms. More importantly, we provide possible explanations of the relationship between the system output and the first-layer extracted speech features based on our evaluation results.
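A SincNet-style first layer is built from parametrized band-pass filters: each filter is the difference of two windowed sinc low-pass filters whose cutoff frequencies are the learnable parameters. The numpy sketch below constructs one such filter with fixed (assumed) cutoffs and checks that its frequency response is indeed band-pass; the real SincNet learns the cutoffs by backpropagation.

```python
import numpy as np

def sinc_bandpass(f1, f2, fs, taps=101):
    """One SincNet-style band-pass filter: difference of two windowed
    sinc low-pass filters. f1 < f2 are the cutoff frequencies in Hz
    (learnable in SincNet, fixed here for the sketch)."""
    n = np.arange(taps) - (taps - 1) / 2
    low1, low2 = f1 / fs, f2 / fs                    # normalized cutoffs
    h = 2 * low2 * np.sinc(2 * low2 * n) - 2 * low1 * np.sinc(2 * low1 * n)
    return h * np.hamming(taps)                      # window to reduce ripple

h = sinc_bandpass(300.0, 3400.0, fs=16000, taps=101)

# Check the frequency response: large in the pass band, small outside it.
H = np.abs(np.fft.rfft(h, 4096))
freqs = np.fft.rfftfreq(4096, d=1 / 16000)
inband  = H[(freqs > 1000) & (freqs < 2000)].mean()
outband = H[freqs > 6000].mean()
print(inband, outband)  # pass-band gain well above stop-band gain
```

Because each filter is fully described by two cutoff frequencies, the learned first layer can be inspected directly, which is the interpretability property the abstract emphasizes.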
Objective: This study focuses on first (S1) and second (S2) heart sound recognition based only on acoustic characteristics; the assumptions of the individual durations of S1 and S2 and the time intervals of S1-S2 and S2-S1 are not involved in the recognition process. The main objective is to investigate whether reliable S1 and S2 recognition performance can still be attained under situations where the duration and interval information might not be accessible. Methods: A deep neural network (DNN) method is proposed for recognizing S1 and S2 heart sounds. In the proposed method, heart sound signals are first converted into a sequence of Mel-frequency cepstral coefficients (MFCCs). The K-means algorithm is applied to cluster MFCC features into two groups to refine their representation and discriminative capability. The refined features are then fed to a DNN classifier to perform S1 and S2 recognition. We conducted experiments using actual heart sound signals recorded using an electronic stethoscope. Precision, recall, F-measure, and accuracy are used as the evaluation metrics. Results: The proposed DNN-based method can achieve high precision, recall, and F-measure scores, with an accuracy rate of more than 91%. Conclusion: The DNN classifier provides higher evaluation scores compared with other well-known pattern classification methods. Significance: The proposed DNN-based method can achieve reliable S1 and S2 recognition performance based on acoustic characteristics without using an ECG reference or incorporating the assumptions of the individual durations of S1 and S2 and the time intervals of S1-S2 and S2-S1.
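The K-means refinement step can be sketched as follows, with random 13-dimensional Gaussian clusters standing in for MFCC frames of S1 and S2 sounds (the real pipeline extracts MFCCs from stethoscope recordings and feeds the refined features to a DNN, which is omitted here).

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins for MFCC frames of S1 and S2 heart sounds:
# two well-separated Gaussian clusters in 13-D feature space.
s1 = rng.normal(+1.0, 0.3, (100, 13))
s2 = rng.normal(-1.0, 0.3, (100, 13))
X = np.vstack([s1, s2])
y = np.array([0] * 100 + [1] * 100)

def kmeans(X, k=2, iters=20, seed=0):
    """Minimal K-means, used here (as in the paper) to group the
    features into two clusters before classification."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, _ = kmeans(X)
# Cluster indices are arbitrary, so score against both label orders.
acc = max(np.mean(labels == y), np.mean(labels != y))
print(acc)  # close to 1.0 for well-separated clusters
```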
The detection of audio tampering plays a crucial role in ensuring the authenticity and integrity of multimedia files. This paper presents a novel approach to identifying tampered audio files by leveraging the unique Electric Network Frequency (ENF) signal, which is inherent to the power grid and serves as a reliable indicator of authenticity. The study begins by establishing a comprehensive Chinese ENF database containing diverse ENF signals extracted from audio files. The proposed methodology involves extracting the ENF signal, applying wavelet decomposition, and utilizing the autoregressive model to train effective classification models. Subsequently, the framework is employed to detect audio tampering and assess the influence of various environmental conditions and recording devices on the ENF signal. Experimental evaluations conducted on our Chinese ENF database demonstrate the efficacy of the proposed method, achieving impressive accuracy rates ranging from 91% to 93%. The results emphasize the significance of ENF-based approaches in enhancing audio file forensics and reaffirm the necessity of adopting reliable tamper detection techniques in multimedia authentication.
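The ENF extraction step can be illustrated with a short numpy sketch: synthesize a recording containing a slowly drifting 50 Hz mains hum, then track the per-frame ENF as the peak of a zero-padded FFT restricted to a narrow band around the nominal grid frequency. The sampling rate, frame length, and band limits below are illustrative assumptions, and the wavelet/autoregressive stages of the paper are not reproduced here.

```python
import numpy as np

fs = 1000                      # assumed sampling rate of the recording
t = np.arange(0, 8, 1 / fs)
# Synthetic mains hum: nominal 50 Hz ENF with a slow, small drift.
enf_true = 50.0 + 0.05 * np.sin(2 * np.pi * 0.1 * t)
phase = 2 * np.pi * np.cumsum(enf_true) / fs
x = np.sin(phase) + 0.1 * np.random.default_rng(4).normal(size=len(t))

def estimate_enf(x, fs, frame_len=1.0, nfft=2 ** 16):
    """Per-frame ENF estimate: peak of the zero-padded FFT magnitude
    restricted to a band around the 50 Hz nominal grid frequency."""
    n = int(frame_len * fs)
    freqs = np.fft.rfftfreq(nfft, d=1 / fs)
    band = (freqs > 49.0) & (freqs < 51.0)
    est = []
    for start in range(0, len(x) - n + 1, n):
        frame = x[start:start + n] * np.hanning(n)
        mag = np.abs(np.fft.rfft(frame, nfft))
        est.append(freqs[band][mag[band].argmax()])
    return np.array(est)

enf = estimate_enf(x, fs)
print(enf)  # each frame's estimate stays close to 50 Hz
```

Discontinuities or inconsistencies in such a per-frame ENF track are the kind of evidence a tamper detector builds on, since splices and insertions disturb the otherwise smooth grid-frequency trace.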
Attention-deficit/hyperactivity disorder (ADHD) is a prevalent neurodevelopmental disorder affecting children worldwide; however, diagnosing ADHD remains a complex task. The theta/beta ratio (TBR) derived from electroencephalography (EEG) recordings has been proposed as a potential biomarker for ADHD, but its effectiveness in children with ADHD remains controversial. Behavioral assessments, such as the Conners Continuous Performance Test, Third Edition (CPT-3), have been utilized to assess attentional capacity in individuals with ADHD. This study aims to investigate the correlation between TBR and CPT-3 scores in children and adolescents with ADHD.
In a retrospective analysis, we examined patients regularly monitored for ADHD at Taipei Tzu Chi Hospital, who underwent both EEG and CPT-3 assessments. Severity of ADHD was evaluated using parent- and teacher-completed Swanson, Nolan, and Pelham (SNAP)-IV rating scales.
The study encompassed 55 ADHD patients (41 with abnormal CPT-3 scores, 14 with normal CPT-3 scores) and 45 control subjects. TBR demonstrated elevation in ADHD patients with abnormal CPT-3 scores, indicating its potential to represent attentional capacity akin to behavioral assessments like CPT-3. However, significant correlations between TBR values and CPT-3 variables or SNAP-IV rating scales were not observed. Moreover, TBR values exhibited considerable overlap across the groups, leading to diminished sensitivity and negative predictive value as a potential neurophysiological ADHD biomarker.
While our study underscores the utility of both TBR and CPT-3 in assessing attentional capacity, their sensitivity in diagnosing ADHD is limited. A comprehensive evaluation, integrating clinical expertise, parental input, and detailed neuropsychometric tests, remains pivotal for a thorough and precise diagnosis of ADHD.
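For readers unfamiliar with the TBR metric examined in the study above, the sketch below computes it from a periodogram of synthetic EEG, using the conventional band definitions (theta 4-8 Hz, beta 13-30 Hz); the signals, sampling rate, and amplitudes are illustrative assumptions.

```python
import numpy as np

fs = 256
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(5)

def bandpower(x, fs, lo, hi):
    """Mean periodogram power of x in the [lo, hi) Hz band."""
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    sel = (freqs >= lo) & (freqs < hi)
    return float(psd[sel].mean())

def theta_beta_ratio(x, fs):
    # Conventional EEG bands: theta 4-8 Hz, beta 13-30 Hz.
    return bandpower(x, fs, 4, 8) / bandpower(x, fs, 13, 30)

# Two synthetic signals: one theta-dominant, one beta-dominant.
theta_dom = 2.0 * np.sin(2 * np.pi * 6 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
beta_dom  = 0.5 * np.sin(2 * np.pi * 6 * t) + 2.0 * np.sin(2 * np.pi * 20 * t)
theta_dom += 0.1 * rng.normal(size=len(t))
beta_dom  += 0.1 * rng.normal(size=len(t))

print(theta_beta_ratio(theta_dom, fs), theta_beta_ratio(beta_dom, fs))
```

A higher TBR reflects relatively more slow-wave (theta) than fast-wave (beta) activity, which is the pattern historically proposed as an ADHD marker; the study's point is that this single scalar overlaps too much across groups to diagnose ADHD on its own.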
Attention problems are frequently observed in patients with Prader-Willi syndrome (PWS); however, only a few studies have investigated the severity and mechanisms of attention problems in this population. In this study, we aim to evaluate dynamic changes in the quantitative electroencephalographic (EEG) spectrum during attention tasks in patients with PWS.
From January to June 2019, 10 patients with PWS and 10 age-matched neurotypical control participants were recruited at Taipei Tzu Chi Hospital. Each participant completed Conners' continuous performance test, third edition (CPT-3), tasks with simultaneous EEG monitoring. The dynamic changes in the quantitative EEG spectrum between the resting state and during CPT-3 tasks were compared.
Behaviorally, patients with PWS experienced significant attention problems, indicated by the high scores for several CPT-3 variables. The theta/beta ratio of the resting-state EEG spectrum revealed no significant differences between the control participants and patients with PWS. During CPT-3 tasks, a significant decrease in the alpha power was noted in controls compared with that in patients with PWS. The attention-to-resting alpha power ratio was positively correlated with many CPT-3 variables. After adjusting for genotype, age, intelligence, and body mass index, the attention-to-resting alpha power ratio was still significantly correlated with participants' commission errors.
This study provides evidence that attention problems are frequently observed in patients with PWS and that this attention impairment can be demonstrated by dynamic changes in the quantitative EEG spectrum.
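The attention-to-resting alpha power ratio used in the study above can be sketched as follows: alpha-band (8-12 Hz) power is computed from a periodogram in each condition and the task value is divided by the resting value. The synthetic signals model a control subject whose resting alpha desynchronizes (drops) during the task; all signal parameters are illustrative assumptions.

```python
import numpy as np

fs = 256
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(6)

def alpha_power(x, fs):
    """Mean periodogram power in the alpha band (8-12 Hz)."""
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    return float(psd[(freqs >= 8) & (freqs < 12)].mean())

# Synthetic control subject: strong resting alpha that is suppressed
# during the attention task.
rest = 2.0 * np.sin(2 * np.pi * 10 * t) + 0.3 * rng.normal(size=len(t))
task = 0.8 * np.sin(2 * np.pi * 10 * t) + 0.3 * rng.normal(size=len(t))

ratio = alpha_power(task, fs) / alpha_power(rest, fs)
print(ratio)  # < 1 when task-related alpha suppression is present
```

A ratio close to 1 indicates little task-related alpha suppression, which is the dynamic pattern the study links to attention problems in PWS.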