A 22-dBA Digital Optical MEMS Microphone
Milleri, Niccolo de; Valli, Luca; Fueldner, Marc; et al.
IEEE Journal of Solid-State Circuits, July 2024, Volume 59, Issue 7
Journal Article, Peer-reviewed
The market for and applications of micro-electromechanical systems (MEMS)-based microphones have grown continuously over the last decades. This article presents a promising acoustic-sensing technology that combines consolidated MEMS technology with an innovative optical transduction technique for acoustic signals. The proposed method significantly reduces the intrinsic noise of the system and increases its signal-to-noise ratio (SNR). The designed digital optical microphone reaches an SNR of 71.6 dBA in a 5 × 5 × 2 mm³ package with an output sensitivity of -21 dBFS/Pa. This article describes each section of the system-in-package (SiP) microphone, starting from the physics behind the transduction mechanism and covering the application-specific integrated circuit (ASIC) and package design as well as the optical stack structure. A final analysis of the experimental results is provided and compared with the state of the art reported in the literature.
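The 22-dBA figure in the record's title follows from the abstract's numbers under the standard microphone convention that SNR is referenced to 94 dB SPL (a 1 Pa tone). A minimal bookkeeping sketch, with function and variable names of my own choosing:

```python
# Back-of-envelope check of the headline figures, assuming the industry
# convention that microphone SNR is referenced to 94 dB SPL (1 Pa RMS).

REF_SPL_DBA = 94.0   # level of a 1 Pa tone, dB SPL

def equivalent_input_noise(snr_dba: float) -> float:
    """Self-noise expressed as an equivalent acoustic level, dB(A) SPL."""
    return REF_SPL_DBA - snr_dba

def noise_floor_dbfs(sensitivity_dbfs_pa: float, snr_dba: float) -> float:
    """Output-referred noise floor of a digital microphone, dBFS(A)."""
    return sensitivity_dbfs_pa - snr_dba

print(equivalent_input_noise(71.6))    # ~22.4 dBA, matching the title's 22 dBA
print(noise_floor_dbfs(-21.0, 71.6))   # ~-92.6 dBFS(A)
```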
A unidirectional microelectromechanical system (MEMS) microphone single module for hands-free and voice-recognition systems in automobiles is presented. Because in-cabin noise and temperature variation degrade hands-free and voice-recognition performance, noise suppression and an increased signal-to-noise ratio (SNR) of the microphone are required. In this study, a capacitive MEMS microphone module with a high SNR and a unidirectional characteristic is achieved by designing the structure, package, and module of the MEMS microphone. To improve the SNR, the microphone is developed using a slit-edged membrane; the slit structure releases the residual stress of the membrane to achieve improved sensitivity and SNR. The unidirectional characteristic of the microphone enables suppression of noise signals from undesired directions. The directional characteristic is realized by attaching a porous SU-8 filter that delays the sound arriving at one of the two acoustic ports on the package. Tests on the proposed unidirectional MEMS microphone package and module show that an SNR of 62.4 dB and a front-back ratio of 27.1 dB are achieved.
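The delay-based directionality described above can be illustrated with a first-order delay-and-subtract model, in which the internal delay is matched to the external port-to-port travel time to place a null at the rear. The port spacing and frequency below are assumed, illustrative values, and the SU-8 filter is idealized as a pure delay:

```python
import cmath
import math

C = 343.0      # speed of sound, m/s
D = 0.005      # assumed port spacing, 5 mm (illustrative, not the paper's value)
TAU = D / C    # internal delay chosen to null the rear direction (cardioid)

def response(theta: float, f: float) -> float:
    """Magnitude of the delay-and-subtract two-port response for a plane
    wave arriving from angle theta (0 = front, pi = back)."""
    w = 2 * math.pi * f
    ext = (D / C) * math.cos(theta)   # external inter-port delay difference
    return abs(1 - cmath.exp(-1j * w * (TAU + ext)))

f = 1000.0
front, back = response(0.0, f), response(math.pi, f)
# Front-back ratio in dB; huge for this ideal model (perfect rear null).
print(20 * math.log10(front / max(back, 1e-12)))
```

A real device's 27.1 dB front-back ratio reflects mismatch, diffraction, and the non-ideal acoustic filter that this ideal-delay sketch leaves out.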
The MEMS microphone is a representative device of the MEMS family that has attracted substantial research interest, and those tailored for the human voice have earned distinct success in commercialization. Although development is sustained, challenges such as residual stress, environmental noise, and structural innovation remain. To collect and summarize the recent advances in this subject, this paper presents a concise review of the transduction mechanisms, diverse mechanical structure topologies, and effective noise-reduction methods for high-performance MEMS microphones with a dynamic range akin to the audible spectrum, aiming to provide a comprehensive and adequate analysis of this scope.
This article presents a lumped-parameter model (LPM) providing a deeper understanding of the compliant backplate in capacitive micro-electromechanical systems (MEMS) microphones. Some previous models simplify the backplate as stationary, whereas others treat it as vibrating. This work not only models the backplate as vibrating but also considers the coupling effect between the mechanical and electrical domains. The extended model allows for a more detailed analysis of how the microphone converts sound into an electrical signal. Specifically, the theoretical derivations using Lagrange equations show how backplate motion can impact the microphone's performance. The analysis of the LPM aligns well with the results of finite element analysis (FEA) when the frequency is below the high-order resonance, validating the theoretical concepts. In particular, the model with electrical coupling of the vibrating backplate effectively captures the sensitivity dip resulting from the backplate resonance, unlike models lacking this coupling. The theoretical framework is also extended to the phenomenon of pull-in. A backplate that is overly compliant can narrow the operating frequency range and increase the likelihood of experiencing pull-in. Thus, there is a tradeoff between optimizing the microphone's acoustic performance and ensuring its mechanical robustness. This work provides valuable insights into navigating these tradeoffs.
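The pull-in phenomenon mentioned above can be illustrated with the textbook rigid parallel-plate result (a simplification of the paper's compliant-backplate analysis), in which the electrostatic instability occurs once the movable plate has traveled one third of the gap. The stiffness, gap, and area below are assumed, illustrative values:

```python
import math

EPS0 = 8.854e-12   # vacuum permittivity, F/m

def pull_in_voltage(k: float, g0: float, area: float) -> float:
    """Rigid parallel-plate pull-in voltage:
    V_pi = sqrt(8 * k * g0^3 / (27 * eps0 * A)),
    reached when the plate has moved by g0/3."""
    return math.sqrt(8 * k * g0 ** 3 / (27 * EPS0 * area))

# Illustrative values: 200 N/m stiffness, 2 um gap, 1 mm^2 plate area
print(pull_in_voltage(200.0, 2e-6, 1e-6))   # a few volts, typical MEMS scale
```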
Bone-conduction and in-ear microphones pick up bone-conducted (BC) speech, which has traveled through bones and soft tissues. This BC speech is less sensitive to surrounding noise than air-conducted (AC) speech, so there is interest in recording this signal. However, the intelligibility and quality of these microphones are known to limit their use. Previous work has noted confusion between vowel sounds in terms of intelligibility. To evaluate quality and intelligibility, studies rely on subjective and objective tests. This paper aims to determine whether the standard objective methods for rating speech quality and intelligibility can be applied to BC speech recorded through these microphones. To estimate intelligibility, a subjective test based on vowel recognition was compared with STOI and with a new criterion based on the second formant frequencies of oral vowels. For speech quality estimation, MUSHRA and PESQ tests were compared. The results show the difficulty of using objective methods in place of subjective ones; for ongoing studies, it is therefore suggested to use subjective methods to evaluate speech quality and intelligibility.
• Objective and subjective evaluation of speech quality do not correlate.
• Introduction of a new objective metric to evaluate vowel recognition.
• The new objective metric for vowel recognition works only for in-ear microphones.
• Objective and subjective evaluation of intelligibility do not correlate.
This paper studies the problem of frequency-invariant beamforming with concentric circular microphone arrays (CCMAs) and presents an approach to the design of frequency-invariant and symmetric beampatterns. We first apply the Jacobi-Anger expansion to each ring of the CCMA to approximate the beampattern. The beamformer is then designed by using all the expansions from the different rings. In contrast with existing work in the literature, where a Jacobi-Anger expansion of the same order is applied to every ring, in this contribution the order of the expansion at a ring is related to its number of sensors, so the expansion order may differ from ring to ring. The developed approach is rather general: it not only mitigates the deep-nulls problem in the directivity factor and the white noise gain that is common to circular microphone arrays (CMAs) and improves steering flexibility, but is also practical in configurations where a smaller ring has fewer microphones than a larger one. We discuss the conditions for the design of Nth-order symmetric beampatterns and give examples of frequency-invariant beampatterns with commonly used array geometries such as CMAs, CMAs with a sensor at the center, and CCMAs. We show the advantage of adding one microphone at the center of either a CMA or a CCMA, namely circumventing the deep-nulls problem caused by the 0th-order Bessel function.
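The Jacobi-Anger expansion underlying the beampattern approximation, exp(j*x*cos(theta)) = sum over n of j^n * J_n(x) * exp(j*n*theta), can be checked numerically in a self-contained sketch (the truncation order and quadrature step size are arbitrary choices of mine):

```python
import cmath
import math

def bessel_j(n: int, x: float, steps: int = 4000) -> float:
    """Integer-order Bessel function via its integral representation,
    J_n(x) = (1/pi) * integral_0^pi cos(n*t - x*sin(t)) dt (trapezoid rule)."""
    if n < 0:                       # symmetry: J_{-n}(x) = (-1)^n J_n(x)
        return (-1) ** (-n) * bessel_j(-n, x, steps)
    h = math.pi / steps
    total = 0.5 * (1.0 + math.cos(n * math.pi - x * math.sin(math.pi)))
    for k in range(1, steps):
        t = k * h
        total += math.cos(n * t - x * math.sin(t))
    return total * h / math.pi

def jacobi_anger(x: float, theta: float, order: int = 20) -> complex:
    """Truncated expansion: sum_n j^n * J_n(x) * exp(j*n*theta)."""
    return sum(1j ** n * bessel_j(n, x) * cmath.exp(1j * n * theta)
               for n in range(-order, order + 1))

x, theta = 3.0, 0.7
exact = cmath.exp(1j * x * math.cos(theta))
print(abs(exact - jacobi_anger(x, theta)))  # tiny truncation/quadrature error
```

The J_0 term in this sum is exactly the 0th-order Bessel function whose zeros cause the deep-nulls problem the abstract refers to.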
Speech is produced by a nonlinear, dynamical Vocal Tract (VT) system, and is transmitted through multiple (air, bone and skin conduction) modes, as captured by the air, bone and throat microphones, respectively. Speaker specific characteristics that capture this nonlinearity are rarely used as stand-alone features for speaker modeling, and at best have been used in tandem with well known linear spectral features to produce tangible results. This paper proposes Recurrent Plot (RP) embeddings as stand-alone, non-linear speaker-discriminating features. Two datasets, the continuous multimodal TIMIT speech corpus and the consonant-vowel unimodal syllable dataset, are used in this study for conducting closed-set speaker identification experiments. Experiments with unimodal speaker recognition systems show that RP embeddings capture the nonlinear dynamics of the VT system which are unique to every speaker, in all the modes of speech. The Air (A), Bone (B) and Throat (T) microphone systems, trained purely on RP embeddings, perform with an accuracy of 95.81%, 98.18% and 99.74%, respectively. Experiments using the joint feature space of combined RP embeddings for bimodal (A-T, A-B, B-T) and trimodal (A-B-T) systems show that the best trimodal system (99.84% accuracy) performs on par with trimodal systems using spectrogram (99.45%) and MFCC (99.98%). The 98.84% performance of the B-T bimodal system shows the efficacy of a speaker recognition system based entirely on alternate (bone and throat) speech, in the absence of the standard (air) speech. The results underscore the significance of the RP embedding, as a nonlinear feature representation of the dynamical VT system that can act independently for speaker recognition. It is envisaged that speech recognition too will benefit from this nonlinear feature.
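As a rough illustration of the recurrence-plot idea behind RP embeddings (the paper's actual embedding pipeline is not reproduced here), a recurrence matrix marks pairs of time indices whose signal states are close, so periodic dynamics appear as diagonal structure:

```python
import math

def recurrence_plot(signal, eps):
    """Binary recurrence matrix: R[i][j] = 1 iff |x_i - x_j| < eps.
    (Scalar states for simplicity; RP pipelines typically compare
    delay-embedded state vectors instead.)"""
    n = len(signal)
    return [[1 if abs(signal[i] - signal[j]) < eps else 0 for j in range(n)]
            for i in range(n)]

# Toy periodic signal: recurrences show up at multiples of the period (16).
sig = [math.sin(2 * math.pi * k / 16) for k in range(64)]
rp = recurrence_plot(sig, eps=0.1)
print(rp[0][16], rp[0][4])  # recurrent after one period, not after a quarter
```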
We propose an integrated end-to-end automatic speech recognition (ASR) paradigm with joint learning of front-end speech signal processing and back-end acoustic modeling. We believe that "only good signal processing can lead to top ASR performance" in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques: (i) a reverberation-time-aware DNN-based speech dereverberation architecture that can handle a wide range of reverberation times to enhance the quality of reverberant and noisy speech, followed by (ii) DNN-based multicondition training that takes both clean-condition and multicondition speech into consideration, leveraging data acquired and processed with multichannel microphone arrays, to improve ASR performance. The final end-to-end system is established by jointly optimizing the speech enhancement and recognition DNNs. The recent REverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge task is used as a test bed for evaluating the proposed framework. We first report objective measures of enhanced speech superior to those listed in the 2014 REVERB Challenge Workshop on the simulated-data test set. Moreover, we obtain the best single-system word error rate (WER) of 13.28% on the 1-channel REVERB simulated data with the proposed DNN-based pre-processing algorithm and clean-condition training. Leveraging joint training with more discriminative ASR features and improved neural-network-based language models, a low single-system WER of 4.46% is attained. Next, a new multi-channel-condition joint learning and testing scheme delivers a state-of-the-art WER of 3.76% on the 8-channel simulated data with a single ASR system. Finally, we also report a preliminary yet promising experiment with the REVERB real test data.
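The WER figures quoted in these abstracts are conventionally computed as the word-level edit distance (substitutions + insertions + deletions) divided by the reference length; a minimal sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i reference and j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```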
This study proposes a complex spectral mapping approach for single- and multi-channel speech enhancement, where deep neural networks (DNNs) are used to predict the real and imaginary (RI) components of the direct-path signal from noisy and reverberant ones. The proposed system contains two DNNs. The first performs single-channel complex spectral mapping; the estimated complex spectra are used to compute a minimum variance distortionless response (MVDR) beamformer. The RI components of the beamforming results, which encode spatial information, are then combined with the RI components of the mixture to train the second DNN for multi-channel complex spectral mapping. With the estimated complex spectra, we also propose a novel method of time-varying beamforming. State-of-the-art performance is obtained on the speech enhancement and recognition tasks of the CHiME-4 corpus. More specifically, our system obtains 6.82%, 3.19%, and 1.99% word error rates (WER), respectively, on the single-, two-, and six-microphone tasks of CHiME-4, significantly surpassing the current best results of 9.15%, 3.91%, and 2.24% WER.
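The MVDR beamformer used between the two DNNs has the closed form w = R^-1 d / (d^H R^-1 d), which minimizes output power subject to the distortionless constraint w^H d = 1. A toy two-microphone sketch (the covariance matrix and steering vector are made-up values, not estimates from any corpus):

```python
import cmath
import math

def mvdr_weights(R, d):
    """MVDR weights for 2 microphones: w = R^-1 d / (d^H R^-1 d).
    R is a 2x2 Hermitian spatial covariance (list of lists), d the steering
    vector toward the target."""
    (a, b), (c, e) = R
    det = a * e - b * c                      # 2x2 inverse by hand
    Rinv = [[e / det, -b / det], [-c / det, a / det]]
    Rd = [Rinv[0][0] * d[0] + Rinv[0][1] * d[1],
          Rinv[1][0] * d[0] + Rinv[1][1] * d[1]]
    denom = d[0].conjugate() * Rd[0] + d[1].conjugate() * Rd[1]
    return [Rd[0] / denom, Rd[1] / denom]

# Made-up steering vector and noise covariance for illustration
d = [1.0 + 0j, cmath.exp(-1j * math.pi / 4)]
R = [[1.0 + 0j, 0.3 + 0.1j], [0.3 - 0.1j, 1.0 + 0j]]
w = mvdr_weights(R, d)
# Distortionless constraint: w^H d should equal 1 (target passes unchanged)
print(w[0].conjugate() * d[0] + w[1].conjugate() * d[1])
```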
An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of microphones at the same locations, while the latter requires the system to process inputs of varying dimensions. Conventional optimization-based beamforming techniques satisfy these requirements by definition, whereas for deep-learning-based end-to-end systems these constraints are not fully addressed. In this paper, we propose transform-average-concatenate (TAC), a simple design paradigm for channel-permutation- and channel-number-invariant multi-channel speech separation. Based on the filter-and-sum network (FaSNet), a recently proposed end-to-end time-domain beamforming system, we show how TAC significantly improves separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays. Moreover, we show that TAC also significantly improves separation performance with a fixed-geometry array configuration, further proving the effectiveness of the proposed paradigm for the general problem of multi-microphone speech separation.
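The TAC idea can be sketched with toy stand-in transforms (the actual system uses learned layers): transform each channel independently, average the results across channels (an operation indifferent to channel order and count), and concatenate the average back onto every channel, which makes the module equivariant to channel permutations:

```python
def transform(x):
    """Toy stand-in for TAC's learned per-channel "transform" layer."""
    return [2.0 * v + 1.0 for v in x]

def tac(channels):
    """channels: list (any length, any order) of per-channel feature vectors."""
    z = [transform(c) for c in channels]
    # Average pooling across channels is permutation- and count-invariant
    mean = [sum(col) / len(z) for col in zip(*z)]
    # Concatenate the shared average back onto each channel's features
    return [c + mean for c in z]

chans = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = tac(chans)
perm = tac([chans[2], chans[0], chans[1]])
# Permuting the input channels merely permutes the outputs (equivariance)
print(out[0] == perm[1], out[2] == perm[0])
```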