The task of measuring the power spectral density of a speech signal in sliding observation window mode is examined. A parametric approach to this task based on an autoregressive data model is studied, together with the problem of optimizing the order of the autoregressive model under small-sample conditions. It is proposed to solve the problem using a hybrid method of spectral analysis based on sequential enumeration of a finite number of variants. The optimization criterion is formulated in terms of an inverse problem: from the speech signal to the voice source. It uses a scale-invariant measure of spectral distance as the objective function and the Schuster periodogram as the reference sample. The effectiveness of the hybrid method has been evaluated experimentally using the author's software. It is shown that, for an observation window no longer than 10 ms, the hybrid method increases the accuracy of spectral analysis by more than 30% compared with the well-known Burg method whose order is set according to the Akaike information criterion.
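For orientation, the baseline named in the abstract above (order selection by the Akaike information criterion) can be sketched as follows. This is a minimal illustration of the textbook criterion AIC(p) = N ln(sigma_p^2) + 2p, not the author's software; the function name `aic_order`, the residual variances, and the window length of 80 samples (10 ms at an assumed 8 kHz sampling rate) are all hypothetical.

```python
import math

def aic_order(residual_variances, n_samples):
    """Return the AR order p that minimizes AIC(p) = N * ln(sigma_p^2) + 2 * p.

    residual_variances[p] is the prediction-error variance of the order-p
    model (for example, as produced by Burg's recursion); index 0 is the
    variance of the raw signal.
    """
    scores = [n_samples * math.log(v) + 2 * p
              for p, v in enumerate(residual_variances)]
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical error variances for orders 0..4 on an 80-sample window.
variances = [1.0, 0.5, 0.2, 0.199, 0.1985]
best = aic_order(variances, n_samples=80)
```

With these illustrative numbers the penalty term 2p outweighs the small variance reductions beyond order 2, so the criterion stops there.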
This paper focuses on the automatic extraction of persons and their attributes (gender, year of birth) from albums of photos and videos. A two-stage approach is proposed. In the first stage, a convolutional neural network simultaneously predicts age and gender from all photos and additionally extracts facial representations suitable for face identification. Here MobileNet is modified and first pre-trained for face recognition so that it can additionally recognize age and gender. The age is estimated as the expected value of the top predictions of the neural network. In the second stage, the extracted faces are grouped using hierarchical agglomerative clustering. The birth year and gender of the person in each cluster are estimated by aggregating the predictions for individual photos. The proposed approach is implemented in an Android mobile application. It is experimentally demonstrated that the quality of facial clustering for the developed network is competitive with the state-of-the-art results achieved by deep neural networks, although the proposed approach is computationally much cheaper. Moreover, this approach provides more accurate age and gender recognition than the publicly available models.
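The age estimate described above (expected value over the top predictions) and the per-cluster aggregation can be illustrated with a short sketch. The function names, the choice of `top_k`, and the use of a plain mean over `photo year - age` are assumptions made for illustration; the abstract does not specify them.

```python
def expected_age(probs, ages, top_k=2):
    """Expected age over the top_k most probable age classes.

    The top_k probabilities are renormalized so that they sum to one.
    """
    ranked = sorted(zip(probs, ages), reverse=True)[:top_k]
    total = sum(p for p, _ in ranked)
    return sum(p * a for p, a in ranked) / total

def cluster_birth_year(photo_years, ages):
    """Aggregate per-photo estimates into one birth year for a cluster.

    Assumption: each photo votes with (year taken - estimated age) and the
    votes are averaged; other aggregation rules are equally plausible.
    """
    votes = [y - a for y, a in zip(photo_years, ages)]
    return round(sum(votes) / len(votes))

# Hypothetical softmax output over four age classes.
age = expected_age([0.1, 0.5, 0.3, 0.1], [20, 30, 40, 50], top_k=2)
year = cluster_birth_year([2018, 2019], [age, 35.0])
```

Restricting the expectation to the top classes keeps low-probability tails of the softmax output from pulling the estimate toward the middle of the age range.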
The problem of autoregressive modeling of a speech signal from discrete Fourier transform data in sliding-window mode with a short window (milliseconds) is considered. The problem of stability of the resulting autoregressive model is investigated. To overcome it, it is proposed to use the envelope of the Schuster periodogram as the reference spectral sample. A new method of autoregressive modeling has been developed in which the spectral envelope is detected using a recirculator of a sequence of samples in the frequency domain. An example of its practical implementation is considered, and a full-scale experiment is set up and carried out. Based on the results of the experiment, it is concluded that the method achieves a significant gain not only in the stability but also in the accuracy of the autoregressive model of the speech signal.
The problem of determining the accuracy of an autoregressive model of a speech signal is considered, and a method for measuring the accuracy index in sliding observation window mode is proposed. As the indicator of accuracy, a modified value of the COSH distance (hyperbolic cosine) of the autoregressive model relative to the eponymous (single-phoneme) Schuster periodogram as the reference spectral sample is used. To study the capabilities of the proposed method, a full-scale experiment was set up and carried out, in which the object of study was a set of autoregressive models of different orders. These models were obtained by Burg's method for the vowel speech sounds of a test speaker. From the measurements performed for each vowel, the optimal value of the autoregressive order and the corresponding optimal autoregressive model were found. It is shown that this optimization increases the accuracy of the autoregressive model of the speech signal by more than 60%, depending on the speech sound and the characteristics of the test speaker's vocal tract. The results obtained are intended for use in automatic speech processing systems and in digital speech transmission systems with radical data compression based on linear prediction coefficients.
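As a rough illustration of the quantities involved, the sketch below computes a Schuster periodogram, the power spectrum of an autoregressive model, and the standard (unmodified) COSH distance between two spectra. The modification used in the paper is not stated in the abstract and is not reproduced here; the function names, the `floor` guard against zero bins, and the frequency grid are assumptions.

```python
import cmath
import math

def periodogram(x):
    """Schuster periodogram |X(k)|^2 / N on bins k = 0 .. N/2 (direct DFT)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]

def ar_spectrum(a, sigma2, n):
    """AR power spectrum sigma2 / |A(e^{jw})|^2 on the same bins.

    a = [1, a1, ..., ap] are the coefficients of A(z) = sum a_i z^-i.
    """
    out = []
    for k in range(n // 2 + 1):
        w = 2.0 * math.pi * k / n
        denom = sum(c * cmath.exp(-1j * w * i) for i, c in enumerate(a))
        out.append(sigma2 / abs(denom) ** 2)
    return out

def cosh_distance(p, q, floor=1e-12):
    """Mean over bins of cosh(ln p - ln q) - 1 = (p/q + q/p) / 2 - 1."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = max(pi, floor), max(qi, floor)  # guard against empty bins
        total += 0.5 * (pi / qi + qi / pi) - 1.0
    return total / len(p)

# Hypothetical first-order model compared with a flat spectrum on 8 points.
model = ar_spectrum([1.0, -0.5], sigma2=1.0, n=8)
flat = ar_spectrum([1.0], sigma2=1.0, n=8)
d = cosh_distance(model, flat)
```

The distance is symmetric in its arguments and equals zero only when the two spectra coincide on every bin, which is why it can serve as an accuracy index when one argument is a reference periodogram.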
We consider the problem of determining the intelligibility of a speaker's speech from a finite fragment of the speech signal. It is shown that the main difficulties in solving this problem stem from the need to analyze small samples. To overcome the small-sample problem, we propose a new high-speed method for measuring the intelligibility of speech signals at the phonetic level of perception. The proposed method is based on an information indicator of speech intelligibility in terms of the Kullback-Leibler divergence. We consider an example of the practical implementation of the new method using an autoregressive model of the minimal sound units in a speaker's speech flow. The efficiency characteristics of the new method are analyzed. It is shown that, under certain conditions, the use of the information indicator makes it possible to realize the general systems principle of guaranteed result. Using software developed by the authors, we designed and performed full-scale experiments and obtained quantitative estimates of the speed of the method. It is shown that the method yields sufficiently accurate and reliable estimates of the information indicator for short (2–3 min) segments of speech signals. The results and the conclusions drawn from them are intended for the development of new systems, and the improvement of existing systems, for automatic speech processing and recognition operating in real time.
This paper is devoted to tracking the dynamics of a user's psycho-emotional state based on analysis of the user's facial video and voice. We propose a novel technology with personalized lightweight acoustic and visual neural network models that can run in real time on any laptop or even a mobile device. First, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion recognition and for facial expression recognition in video. The former extracts acoustic features with the OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. During real-time tracking of the emotional state, the user's face is identified in order to select the corresponding personalized neural networks. The final decision about the current emotion in a short time frame is made by blending the outputs of the personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach increases emotion recognition accuracy by 2–15%.
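The blending step mentioned above can be pictured as a weighted average of the two classifiers' class-probability vectors. The sketch below assumes a simple convex combination with a hypothetical weight `audio_weight`; the abstract does not say how the weights are chosen, and the probability values are made up.

```python
def blend(audio_probs, video_probs, audio_weight=0.5):
    """Blend two class-probability vectors and return (class index, scores).

    Assumption: a convex combination of the audio and video outputs;
    the winning class is the arg-max of the blended scores.
    """
    if len(audio_probs) != len(video_probs):
        raise ValueError("classifiers must share one label set")
    w = audio_weight
    mixed = [w * a + (1.0 - w) * v for a, v in zip(audio_probs, video_probs)]
    winner = max(range(len(mixed)), key=mixed.__getitem__)
    return winner, mixed

# Hypothetical outputs of the personalized classifiers over three emotions.
audio = [0.6, 0.3, 0.1]
video = [0.2, 0.7, 0.1]
label, scores = blend(audio, video, audio_weight=0.5)
```

With equal weights the video classifier's confident vote for the second class wins; raising `audio_weight` toward 1 shifts the decision to the audio classifier's preferred class.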
We developed a new method for measuring the pitch frequency of speech signals with improved noise immunity. Protection against intense background noise is achieved in this method by frequency selection of the voiced segments of speech signals using a scheme with a comb filter of interperiodic accumulation. The efficiency of the method is analyzed both theoretically and experimentally with the help of a multichannel frequency meter intended for acoustic speech analysis. It is shown that, for a signal-to-noise ratio of 10 dB and higher, the error of the method does not exceed 2%.
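The general idea of interperiodic accumulation can be sketched as a bank of comb filters, one per candidate pitch period: each channel sums copies of the signal delayed by multiples of its period, and the channel with the largest output energy indicates the pitch. This is a generic illustration under assumed parameters (`taps`, `f_min`, `f_max`, the tie tolerance), not the authors' multichannel meter.

```python
import math

def estimate_pitch(x, fs, f_min=60.0, f_max=400.0, taps=4):
    """Estimate pitch (Hz) with a bank of comb filters.

    Channel T computes y[n] = sum_{k < taps} x[n - k*T]; a periodic signal
    adds coherently only when T matches its period (or a multiple of it).
    Scanning from short to long periods and requiring a clear improvement
    keeps the shortest of several equally good periods.
    """
    t_min, t_max = int(fs / f_max), int(fs / f_min)
    start = (taps - 1) * t_max  # same analysis span for every channel
    if len(x) <= start:
        raise ValueError("signal too short for the requested pitch range")
    best_period, best_energy = None, 0.0
    for period in range(t_min, t_max + 1):
        energy = 0.0
        for n in range(start, len(x)):
            y = sum(x[n - k * period] for k in range(taps))
            energy += y * y
        if energy > best_energy * (1.0 + 1e-6):
            best_period, best_energy = period, energy
    if best_period is None:
        raise ValueError("signal has no energy")
    return fs / best_period

# Synthetic voiced segment: 100 Hz fundamental plus two harmonics at 8 kHz.
fs = 8000
x = [math.sin(2 * math.pi * 100 * n / fs)
     + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
     + 0.25 * math.sin(2 * math.pi * 300 * n / fs) for n in range(800)]
f0 = estimate_pitch(x, fs)
```

Because uncorrelated noise adds incoherently across the accumulated periods while the voiced component adds coherently, increasing `taps` improves the signal-to-noise ratio of the matching channel at the cost of a longer observation interval.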
This research relates to the field of speech technologies, where the key issue is the optimization of speech signal processing under conditions of a priori uncertainty about its fine structure. The problem of automatic (objective) analysis of the speaker's voice timbre using a speech signal of finite duration is considered. It is proposed to solve it using a universal information-theoretic approach. Based on the Kullback-Leibler divergence, an expression is obtained for the asymptotically optimal decision statistic for differentiating speech signals by voice timbre. The author highlights a serious obstacle to the practical implementation of such statistics, namely the synchronization of the sequence of observations with the pitch of the speech signal. To overcome this obstacle, an objective measure of timbre-based differences between speech signals is proposed in terms of the acoustic theory of speech production and its "acoustic tube" model of the speaker's vocal tract. The possibilities of practical implementation of the new measure based on an adaptive recursive filter are considered. A full-scale experiment was set up and carried out. The experimental results confirmed two main properties of the proposed measure: high sensitivity to differences between speech signals in terms of voice timbre and invariance with respect to the fundamental pitch frequency. The results can be used in designing and studying digital speech processing systems tuned to the speaker's voice, for example, digital voice communication systems and biometric and biomedical systems.
The problems of implementing systems with a voice interface for remote public services are examined. The effectiveness of such systems can be enhanced by automatic analysis of changes in the emotional state of the user during the dialogue. To measure an index of the dynamics of the emotional state in real time, it is proposed to use the effect of the sound (phonetic) variability of the user's speech over short observation intervals (fractions of a minute). Based on an information-theoretic approach, a method was developed for acoustic measurement of the dynamics of the emotional state under small-sample conditions, using a scale-invariant measure of the variations of the speech waveform in the frequency domain. An example of the practical implementation of this method in real-time conditions is examined. It is shown that in this case the delay in obtaining measurement results does not exceed 10–20 s. The results of experimental studies confirmed the rapid response of the proposed method and its sensitivity to changes in the dynamics of the emotional state under external perturbations. The developed method can be used to introduce automated monitoring of the quality of voice samples of users of unified biometric systems. The method will also be useful for enhancing security through noncontact detection of potentially dangerous persons with a short-term disturbance of the psychoemotional state.