In this article, the behaviour of students in an e-learning environment is analyzed. A novel pipeline based on video facial processing is proposed. First, face detection, tracking and clustering techniques are applied to extract the sequence of faces of each student. Next, a single efficient neural network is used to extract emotional features in each frame. This network is pre-trained on face identification and fine-tuned for facial expression recognition on static images from AffectNet using a specially developed robust optimization technique. It is shown that the resulting facial features can be used for fast simultaneous prediction of students' engagement levels (from disengaged to highly engaged), individual emotions (happy, sad, etc.) and group-level affect (positive, neutral or negative). This model can be used for real-time video processing even on each student's mobile device, without the need to send their facial video to a remote server or the teacher's PC. In addition, the possibility of preparing a summary of a lesson is demonstrated by saving short clips of the different emotions and engagement of all students. An experimental study on datasets from the EmotiW (Emotion Recognition in the Wild) challenges showed that the proposed network significantly outperforms existing single models.
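The simultaneous prediction described above can be sketched as several lightweight heads sharing one pooled facial descriptor. The sketch below is illustrative only: the backbone, feature dimension, class counts and random weights are all assumptions, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-frame emotional features from a (hypothetical) backbone, aggregated
# over one student's face sequence by mean pooling.
frame_features = rng.normal(size=(30, 256))      # 30 frames, 256-dim embeddings
clip_descriptor = frame_features.mean(axis=0)    # one descriptor per student

def linear_head(x, w, b):
    """A single linear classification head on top of the shared descriptor."""
    logits = x @ w + b
    return int(np.argmax(logits))

# Three hypothetical heads sharing the same features: engagement (4 levels),
# individual emotion (7 classes) and group-level affect (3 classes).
heads = {
    "engagement": (rng.normal(size=(256, 4)), np.zeros(4)),
    "emotion":    (rng.normal(size=(256, 7)), np.zeros(7)),
    "affect":     (rng.normal(size=(256, 3)), np.zeros(3)),
}
predictions = {name: linear_head(clip_descriptor, w, b)
               for name, (w, b) in heads.items()}
```

Because every head reuses the same shared features, the expensive backbone runs once per frame, which is what makes on-device real-time processing plausible.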
In this paper, we consider the problem of autoregressive modeling of a speech signal from the data of its discrete Fourier transform over intervals of one speech frame (several milliseconds). Based on the information-theoretic approach, a novel method was developed in which two computational procedures, iterative optimization of the autoregressive parameters and their automatic amplitude scaling, are separated from each other. A full-scale experiment was set up and carried out. The main advantage of the new method over its known analogs is shown to be the extremely high rate of convergence of the iterations to the optimal solution.
Starting from the definition of the fundamental tone of the speaker's speech as the minimum frequency of the linear power spectrum of the voiced segments of the speech signal, an estimate is made of the potentially achievable accuracy of its measurement under background interference such as white Gaussian noise. Based on this estimate, a suboptimal algorithm for measuring the pitch frequency from a short speech frame has been developed. The effectiveness of the developed algorithm is confirmed by the results of an experiment carried out with the author's software.
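The definition above ("the minimum frequency of the linear power spectrum of voiced segments") can be sketched as follows: pick the lowest prominent peak of the windowed power spectrum inside a plausible pitch band. This is a simplified reading, not the paper's suboptimal estimator; the search band and the relative threshold are assumptions.

```python
import numpy as np

def pitch_lowest_peak(frame, fs, fmin=60.0, fmax=400.0, rel_thresh=0.1):
    """Estimate pitch as the lowest spectral component in [fmin, fmax] whose
    power exceeds rel_thresh of the spectrum's maximum (illustrative only)."""
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(n))) ** 2
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    idx = np.flatnonzero(band & (spec >= rel_thresh * spec.max()))
    return float(freqs[idx[0]]) if idx.size else None   # None: no voiced peak

fs = 8000
t = np.arange(2048) / fs
# Voiced-like frame: 200 Hz fundamental, two harmonics, mild Gaussian noise
frame = (np.sin(2 * np.pi * 200 * t) + 0.6 * np.sin(2 * np.pi * 400 * t)
         + 0.3 * np.sin(2 * np.pi * 600 * t)
         + 0.05 * np.random.default_rng(1).normal(size=t.size))
f0 = pitch_lowest_peak(frame, fs)
```

The frequency resolution of a 2048-sample frame at 8 kHz is about 3.9 Hz, which illustrates why the abstract emphasizes accuracy limits for short frames.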
The article proposes a new algorithm for solving the problem of real-time detection of vowel speech sounds based on (R + 1)-element information and the whitening filter method. An example of the practical application of the algorithm is described, and an assessment of its efficiency is provided. A full-scale experiment is conducted; its results indicate that the proposed algorithm achieves sufficiently high speed and a guaranteed significance level of decisions with minimal demands on the computing equipment.
The article considers the problem of personal biometric data "aging" over time. A method is proposed to overcome this problem by automatically updating the stored data in the biometric system using the speech signals of registered users obtained during their latest requests for identification and online service. The proposed method uses a scale-invariant indicator of voice template quality and is therefore characterized by guaranteed reliability of the decisions made under a wide dynamic range of the speech signal. It is established that the use of the scale-invariant indicator provides a guaranteed significance level of the decisions made by a conventional observer. A full-scale experiment implementing the proposed method was set up and carried out using the author's software; a practical justification of the method's effectiveness on real speech data is given. The results obtained are intended for use in the development of new, and the modernization of existing, systems and technologies for automated quality control and updating of personal biometric data.
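One common way to realize such automatic template updating is an exponential moving average over amplitude-normalized voice embeddings, gated by a scale-invariant acceptance check. The sketch below uses cosine similarity as an illustrative stand-in for the paper's quality indicator; the threshold and update rate are assumptions.

```python
import numpy as np

def maybe_update_template(template, new_embedding, sim_thresh=0.6, alpha=0.1):
    """Refresh a stored voice template from a newly verified utterance.
    Both vectors are L2-normalized first, so the decision is invariant to
    the recording's amplitude (the scale-invariance property the abstract
    relies on). Returns the (possibly updated) template and an accept flag."""
    t = template / np.linalg.norm(template)
    e = new_embedding / np.linalg.norm(new_embedding)
    if float(t @ e) < sim_thresh:
        return template, False                  # too dissimilar: keep old template
    updated = (1 - alpha) * t + alpha * e       # exponential moving average
    return updated / np.linalg.norm(updated), True
```

Because normalization removes the signal's scale, an utterance recorded five times louder produces exactly the same update decision, which is the point of a scale-invariant indicator.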
The problem of determining the fundamental tone frequency of a speech signal in the presence of white Gaussian noise is examined. A method for measuring this frequency is proposed that takes into account the periodic structure of the power spectrum of voiced speech frames and is based on the principle of harmonic energy accumulation in the frequency domain. For this purpose, a procedure for equalizing the envelope of the power spectrum is introduced into the speech-processing algorithm using a two-level autoregression model of the observations: within a single period of the fundamental tone and within an interval of several such periods. The order of the lower-level autoregression is adapted to the observed frame. An example of a practical realization of the adaptive method based on the Burg method is examined. The basic advantages of the adaptive method compared to its known analogs are high speed and enhanced noise stability, which are confirmed in a full-scale experiment. A gain in threshold signals of 5–10 dB was obtained through use of the adaptive method.
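The autoregressive building block mentioned here can be sketched with the textbook Burg recursion, which fits AR coefficients by minimizing the combined forward and backward prediction-error power. This is a minimal single-level sketch, not the paper's two-level adaptive variant.

```python
import numpy as np

def burg_ar(x, order):
    """Textbook Burg's method: estimate coefficients of
    A(z) = 1 + a[0] z^-1 + ... + a[order-1] z^-order."""
    f = np.asarray(x, float)[1:].copy()    # forward prediction errors
    b = np.asarray(x, float)[:-1].copy()   # backward prediction errors (lag 1)
    a = np.zeros(0)
    for _ in range(order):
        k = -2.0 * (f @ b) / (f @ f + b @ b)                 # reflection coefficient
        a = np.append(a, 0.0) + k * np.append(a[::-1], 1.0)  # Levinson update
        f, b = (f + k * b)[1:], (b + k * f)[:-1]             # update errors, re-lag
    return a

# Synthetic AR(2) process: x[n] = 1.5 x[n-1] - 0.7 x[n-2] + w[n]
rng = np.random.default_rng(0)
n = 4000
x = np.zeros(n)
w = rng.normal(size=n)
for i in range(2, n):
    x[i] = 1.5 * x[i - 1] - 0.7 * x[i - 2] + w[i]
coeffs = burg_ar(x, 2)   # should recover approximately [-1.5, 0.7]
```

Burg's method works well on short frames because it never needs the autocorrelation of the full signal, which matches the abstract's emphasis on speed.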
A novel image recognition algorithm based on sequential three-way decisions is introduced to speed up inference in a convolutional neural network. In contrast to the majority of existing studies, our approach does not require a special procedure to train the neural network, so it can be used with arbitrary architectures, including pre-trained convolutional nets. Each image is associated with a sequence of features extracted at different layers of the neural network. Features from earlier layers provide a coarse-grained image representation; fine-grained representations are embeddings from one of the later layers. Confidence scores of the classifiers representing the input image at each granularity level are computed in order to populate a set of unlikely classes with low confidence scores. The thresholds for these scores are chosen using the step-up multiple testing procedure. The categories from this set are not considered at the next, finer granularity levels. The algorithm selecting the granularity levels and the thresholds for each level is trained on a small sample. An experimental study on several datasets and neural architectures demonstrated that the proposed approach reduces the running time by up to 40% with a controllable decrease in accuracy.
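The level-by-level pruning can be sketched as follows. The sketch uses fixed per-level thresholds instead of the step-up multiple-testing procedure, which is not reproduced here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sequential_three_way(level_logits, thresholds):
    """Prune unlikely classes level by level: a class whose confidence falls
    below the current level's threshold is discarded and never scored again.
    level_logits[i] are the classifier logits at granularity level i."""
    candidates = set(range(len(level_logits[0])))
    for logits, thr in zip(level_logits, thresholds):
        scores = softmax(np.asarray(logits, float))
        candidates = {c for c in candidates if scores[c] >= thr}
        if not candidates:                 # nothing passes: fall back to best score
            return int(np.argmax(scores))
        if len(candidates) == 1:           # early exit: decision is settled
            break
    # final decision among the surviving classes at the finest level reached
    return int(max(candidates, key=lambda c: scores[c]))

# Coarse level keeps classes 0 and 1; the finer level settles on class 1.
coarse = [2.0, 1.9, -3.0, -3.0]
fine = [0.1, 3.0, 0.0, 0.0]
decision = sequential_three_way([coarse, fine], thresholds=[0.05, 0.5])
```

The saving comes from the break: classes pruned at a coarse level never reach the more expensive fine-grained classifier.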
This paper addresses the face recognition task for offline mobile applications. Using AutoML techniques, we propose a novel approach to developing a fast neural network-based facial feature extractor for a specific device. First, the Once-for-All SuperNet is trained on a large facial dataset. Each device is characterized by its lookup table, which contains the inference running times of each layer of the SuperNet. An evolutionary search is then used to select the most accurate subnetwork within a limit on the maximum expected latency. We propose training a neural architecture comparator using Gradient Boosted Trees to choose the better subnetwork in this search. Experimental face verification and recognition results demonstrate our approach's robustness to various facial region positions. Our best model achieves an identification accuracy of 98.7% on the LFW dataset in less than 5 ms on the Qualcomm Snapdragon 865 GPU.
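The latency-constrained search can be sketched with a per-layer lookup table and a random search standing in for the evolutionary one. Everything below is a toy: the table values, the two width options per layer and the accuracy proxy are all assumptions, not the paper's comparator.

```python
import random

# Hypothetical per-layer latency lookup table (ms) measured on one device;
# each layer offers two width options with different costs.
LOOKUP = [
    {"narrow": 0.4, "wide": 1.1},   # layer 1
    {"narrow": 0.6, "wide": 1.5},   # layer 2
    {"narrow": 0.5, "wide": 1.3},   # layer 3
]

def latency(config):
    """Expected latency of a subnetwork = sum of its layers' table entries."""
    return sum(LOOKUP[i][opt] for i, opt in enumerate(config))

def proxy_accuracy(config):
    """Toy stand-in for the trained comparator: reward wider layers."""
    return sum(1.0 if opt == "wide" else 0.5 for opt in config)

def search_subnetwork(budget_ms, trials=200, seed=0):
    """Keep the most 'accurate' sampled configuration that fits the budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        cfg = tuple(rng.choice(list(layer)) for layer in LOOKUP)
        if latency(cfg) <= budget_ms and (
                best is None or proxy_accuracy(cfg) > proxy_accuracy(best)):
            best = cfg
    return best

best = search_subnetwork(budget_ms=3.0)
```

The key design point is that the lookup table makes latency evaluation a table sum rather than an on-device measurement, so thousands of candidates can be scored per second.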
The paper addresses the insufficient speed of image recognition methods when the number of classes is rather large. We propose a novel algorithm based on sequential three-way decisions and a formal description of granular computing. Each image is associated with the principal component scores of the high-dimensional features extracted by a deep convolutional neural network. A small number of principal components stands for the coarse-grained granules, while fine-grained granules include all components. Initially, the first principal components of an observed image and all training instances are matched at the coarsest granularity level. Next, negative decisions are defined using multiple comparisons theory and the asymptotic distribution of the Kullback-Leibler divergence. Namely, the distance factors (the ratios of all other distances to the minimum distance) are evaluated. The set of negative decisions is populated by the instances for which the distance factors exceed a certain threshold. The images from this set are not examined at the next, finer granularity levels. In the experiments, unconstrained face recognition and image categorisation are considered using state-of-the-art deep learning-based feature extractors. We demonstrate that the proposed approach decreases the running time by a factor of 1.5–10 compared to conventional classifiers and the known multi-class decision-theoretic rough sets.
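The negative-decision step can be sketched directly from the definition of the distance factor. A fixed threshold is used here for illustration; the paper derives it from the asymptotic distribution of the Kullback-Leibler divergence.

```python
import numpy as np

def negative_set(query, gallery, threshold=1.5):
    """Return indices of training instances ruled out at the current
    granularity level: those whose distance factor (distance divided by the
    minimum distance over the gallery) exceeds the threshold. Euclidean
    distance over the first principal components stands in for the paper's
    dissimilarity measure."""
    d = np.linalg.norm(gallery - query, axis=1)   # distances at current level
    factors = d / d.min()                         # distance factors
    return set(np.flatnonzero(factors > threshold))

# Instance 0 is the clear nearest neighbour; 1 and 2 are ruled out early
query = np.zeros(2)
gallery = np.array([[0.1, 0.0], [0.2, 0.0], [5.0, 0.0]])
negatives = negative_set(query, gallery)
```

Instances in the negative set are simply skipped when matching is repeated with more principal components, which is where the 1.5–10x speed-up comes from.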
This article introduces a novel technique to reduce the computation time for classifying a sequence of observations (frames), such as a video stream, where each observation is described by high-dimensional embeddings extracted by a deep neural network. Using the methodology of granular computing, an observed sequence is represented at various scales using different frame rates. The coarse-grained granule is described as an aggregation (mean pooling) of the deep embeddings of an object from a few frames extracted at a low frame rate. A descriptor for a fine-grained granule is computed using the embeddings of most frames. A classifier is learned for every granularity level. At the classification phase, the coarse-grained descriptor of the input sequence is fed into the first classifier, and the classes with high confidence scores fill the positive set of the three-way decisions. The decision-making procedure terminates at the granularity level at which only one category remains in the positive set, or when the last fine-grained granule is reached. It is experimentally shown for the video-based facial expression recognition problem that our technique is up to 30 times faster than traditional processing of all frames, without significant accuracy degradation.
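The coarse-to-fine procedure can be sketched as follows. A fixed confidence threshold replaces the full three-way decision machinery, and the frame-rate schedule is an assumption.

```python
import numpy as np

def classify_sequence(frames_emb, classifiers, rates=(8, 2, 1), conf=0.7):
    """Coarse-to-fine video classification: pool embeddings sampled at a low
    frame rate first, and move to a finer granule only while the positive set
    (classes whose confidence exceeds the threshold) holds several candidates.
    frames_emb: (num_frames, dim) array of per-frame deep embeddings;
    classifiers: one score function per granularity level."""
    for rate, clf in zip(rates, classifiers):
        granule = frames_emb[::rate].mean(axis=0)   # mean-pooled descriptor
        scores = clf(granule)                       # per-class confidence scores
        positive = np.flatnonzero(scores >= conf)
        if len(positive) == 1:                      # decision settled early
            return int(positive[0])
    return int(np.argmax(scores))                   # finest-level fallback

frames = np.tile(np.arange(4, dtype=float), (16, 1))   # 16 dummy frame embeddings
clf_coarse = lambda g: np.array([0.9, 0.8, 0.1])       # ambiguous at coarse level
clf_fine = lambda g: np.array([0.9, 0.2, 0.1])         # resolved at fine level
label = classify_sequence(frames, [clf_coarse, clf_fine, clf_fine])
```

When the coarsest granule already singles out one class, only a fraction of the frames is ever embedded, which is the source of the reported speed-up.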