In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, extracted from YouTube videos, are manually labelled with one or more audio tags but without time stamps of the audio events, and are hence referred to as weakly labelled data. Two subtasks are defined in this challenge on this weakly labelled data: audio tagging and sound event detection. We propose a convolutional recurrent neural network (CRNN) with learnable gated linear unit (GLU) non-linearities applied on the log Mel spectrogram. In addition, we propose a temporal attention method along the frames to predict the locations of each audio event in a chunk from the weakly labelled data. Our systems ranked 1st and 2nd as a team in these two subtasks of the DCASE 2017 challenge, with an F-score of 55.6% and an equal error rate of 0.73, respectively.
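The GLU non-linearity mentioned above can be illustrated with a minimal sketch: a linear transform modulated elementwise by a learned sigmoid gate. The abstract does not give the layer details, so the dense form and the parameter names `W`, `b`, `V`, `c` here are illustrative assumptions (in the paper the gating is applied within convolutional layers).

```python
import numpy as np

def glu(x, W, b, V, c):
    """Gated linear unit sketch: Y = (x @ W + b) * sigmoid(x @ V + c).
    The sigmoid branch acts as a learned, input-dependent gate on the
    linear branch. W, b, V, c are illustrative parameter names.
    """
    gate = 1.0 / (1.0 + np.exp(-(x @ V + c)))  # elementwise gate in (0, 1)
    return (x @ W + b) * gate
```

With the gate's parameters at zero, the gate is uniformly 0.5, so the layer halves the linear branch; during training the gate learns which time-frequency regions to pass through.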
In this paper we present an approach to polyphonic sound event detection in real-life recordings based on bi-directional long short-term memory (BLSTM) recurrent neural networks (RNNs). A single multilabel BLSTM RNN is trained to map acoustic features of a mixture signal, consisting of sounds from multiple classes, to binary activity indicators for each event class. Our method is tested on a large database of real-life recordings with 61 classes (e.g., music, car, speech) from 10 different everyday contexts. The proposed method outperforms previous approaches by a large margin, and the results are further improved using data augmentation techniques. Overall, our system reports an average F1-score of 65.5% on 1-second blocks and 64.7% on single frames, a relative improvement of 6.8% and 15.1%, respectively, over the previous state-of-the-art approach.
Anomalous event detection in surveillance videos is a challenging and practical research problem in the image and video processing community. Compared to frame-level annotations of anomalous events, obtaining video-level annotations is fast and cheap, though such high-level labels may contain significant noise. More specifically, a video labelled as anomalous may actually contain an anomaly only for a short duration while the rest of its frames are normal. In the current work, we propose a weakly supervised anomaly detection framework based on deep neural networks that is trained in a self-reasoning fashion using only video-level labels. To carry out the self-reasoning-based training, we generate pseudo labels by binary clustering of spatio-temporal video features, which helps mitigate the noise in the labels of anomalous videos. Our formulation encourages the main network and the clustering to complement each other in achieving more accurate anomaly detection. The proposed framework has been evaluated on publicly available real-world anomaly detection datasets, including UCF-Crime, ShanghaiTech and UCSD Ped2. The experiments demonstrate the superiority of our framework over current state-of-the-art methods.
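The binary clustering step above can be sketched with a generic two-means (Lloyd's) procedure over per-segment feature vectors: within a video labelled anomalous, one cluster is treated as pseudo-normal and the other as pseudo-anomalous. This is a minimal sketch under that assumption; the paper's exact clustering and feature extraction are not specified in the abstract.

```python
import random

def squared_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def binary_cluster(features, iters=20, seed=0):
    """Two-means clustering of per-segment feature vectors.
    Returns one 0/1 pseudo-label per segment.
    """
    rng = random.Random(seed)
    centers = rng.sample(features, 2)  # pick two segments as initial centers
    labels = [0] * len(features)
    for _ in range(iters):
        # Assign each segment to its nearest center.
        labels = [0 if squared_dist(f, centers[0]) <= squared_dist(f, centers[1])
                  else 1 for f in features]
        # Recompute each center as the mean of its members.
        for k in (0, 1):
            members = [f for f, l in zip(features, labels) if l == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

On well-separated features the split stabilizes in a few iterations; the resulting pseudo-labels can then supervise the main network segment by segment.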
Real-time situational awareness and event analysis are crucial to the security of the modern power grid, a complicated nonlinear system that is hard to model completely. Massive data is collected, but the information has not been sufficiently leveraged. To effectively extract event features, this paper proposes a framework for event detection, localization, and classification in power grids based on semi-supervised learning. Specifically, event detection is realized by an invertible neural network (INN), which learns complex distributions of real-world measurements in a flexible way. The INN is trained on abundant normal measurements, and explicit log-likelihoods then serve as the indicator to distinguish events with adequate sensitivity. Moreover, risks induced by events are assessed and their spatial locations are determined. Since the majority of power system events are recorded without labels in practice, a pseudo-label (PL) technique is leveraged to classify events with limited labels. The PL-based approach has an enhanced separating capability for events and outperforms other approaches under a low labeling rate. Case studies with simulated data on the IEEE 39-bus system and real-world measurements verify the effectiveness of the proposed framework.
This paper presents and discusses various metrics proposed for the evaluation of polyphonic sound event detection systems used in realistic situations, where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of the presented metrics.
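The segment-based, instance-averaged F-score described above can be sketched as follows. Activity is represented as sets of (segment index, event class) pairs, which naturally handles overlapping events; this is a simplified illustration rather than the toolbox's implementation.

```python
def segment_f1(reference, estimated):
    """Segment-based, instance-averaged F-score.
    reference, estimated: sets of (segment_index, event_class) pairs
    marking which classes are active in each fixed-length segment.
    """
    tp = len(reference & estimated)   # active in both reference and output
    fp = len(estimated - reference)   # detected but not in the reference
    fn = len(reference - estimated)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Class-based averaging, by contrast, would compute this per event class and then average the per-class scores, which weights rare classes more heavily.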
Gait analysis serves as a pivotal tool for identifying abnormalities associated with various disorders. Recently, inertial measurement units (IMUs) have emerged as a feasible tool, showing promising results for continuous gait monitoring. However, current gait analysis algorithms often overlook the importance of sensor placement and the corresponding motion characteristics. Moreover, there has been limited effort to tailor gait analysis algorithms for optimal performance with sensors placed in specific locations. In response, we propose a novel gait analysis algorithm designed for heel-mounted IMUs. Our algorithm employs refined methods to accurately assess heel dynamics and calculates a comprehensive range of spatiotemporal gait parameters as well as parameters related to symmetry and variability. Experiments with straight walking and simulated daily activities were performed, with an optical motion capture (OMC) system used as the reference. The results demonstrated strong correlation (r > 0.9) and good agreement for common gait parameters even in daily conditions (stride length -0.009 ± 0.055 m, stride time -0.002 ± 0.023 s and walking speed -0.004 ± 0.048 m/s). All spatiotemporal gait parameters exhibited high reliability, as indicated by a minimum intraclass correlation coefficient (ICC) of 0.921. The findings affirm the potential of the proposed algorithm for daily gait analysis and monitoring tasks, offering a reliable tool for professionals in the field. By addressing the shortcomings of existing algorithms and focusing on the heel, our approach contributes to the advancement of gait analysis, paving the way for more accessible and accurate gait assessment in real life.
Gait event detection is an essential step toward accurate gait recognition, and many studies use portable and reliable IMUs for it. Popular methods mainly rely on rules applied to specific signals or build machine learning models of event occurrence, both of which overlook the differences in characteristics coupled across multiple inputs. In this paper, we propose a fuzzy-logic-based method to quantify the possibility of an event and use it to detect gait events from the angular velocities and accelerations of the lower limbs measured by IMUs. The proposed method identifies the events at which the heel and toe contact or leave the ground, making full use of the distribution characteristics of all extracted inputs without complex calculation. The mean absolute time differences between the detected and actual events for heel strike (HS), toe strike (TS), heel off (HO) and toe off (TO) are 34 ms, 23 ms, 28 ms and 38 ms, respectively, during walking. We aim to provide an analysis method and a reference for gait recognition in assisted-walking exoskeleton robots for healthy individuals, such as soldiers and workers.
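The fuzzy-logic idea of quantifying "event possibility" from several inputs can be sketched as follows. The triangular membership shape and the min (fuzzy AND) aggregation here are common textbook choices and assumptions on my part; the abstract does not specify the paper's membership functions or rule base.

```python
def tri_membership(x, a, b, c):
    """Triangular fuzzy membership: 0 outside [a, c], rising to 1 at peak b.
    Maps a raw sensor value to a degree in [0, 1].
    """
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def event_possibility(memberships):
    """Combine per-input membership degrees into one event possibility.
    Uses the minimum (fuzzy AND): the event is only as possible as its
    least-supporting input.
    """
    return min(memberships)
```

A heel-strike detector might, for example, feed shank angular velocity and heel acceleration through such memberships and flag the event when the combined possibility crosses a threshold.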
Objective: In this paper, we accurately detect the state sequence first heart sound (S1)-systole-second heart sound (S2)-diastole, i.e., the positions of S1 and S2, in heart sound recordings. We propose an event detection approach that does not explicitly incorporate a priori information on state durations. This also renders it applicable to recordings with cardiac arrhythmia and extendable to the detection of extra heart sounds (third and fourth heart sounds), heart murmurs, and other acoustic events. Methods: We use data from the 2016 PhysioNet/CinC Challenge, containing heart sound recordings and annotations of the heart sound states. From the recordings, we extract spectral and envelope features and investigate the performance of different deep recurrent neural network (DRNN) architectures for detecting the state sequence. We use virtual adversarial training, dropout, and data augmentation for regularization. Results: We compare our results with the state-of-the-art method and achieve an average score over the four events of the state sequence of F1 ≈ 96% on an independent test set. Conclusion: Our approach shows state-of-the-art performance, carefully evaluated on the 2016 PhysioNet/CinC Challenge dataset. Significance: In this work, we introduce a new methodology for the segmentation of heart sounds, suggesting an event detection approach with DRNNs using spectral or envelope features.
Sound event detection and sound localization or tracking have historically been two separate areas of research. Recent sound event detection methods also approach the localization side, but lack a consistent way of measuring the joint performance of the system; instead, they measure the abilities for detection and for localization separately. This paper proposes augmenting the localization metrics with a condition related to detection and, conversely, using location information in calculating the true positives for detection. An extensive evaluation example is provided to illustrate the behavior of such joint metrics. Comparison with detection-only and localization-only performance shows that the proposed joint metrics operate in a consistent and logical manner and adequately characterize both aspects.
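The core idea of a location-aware true positive can be sketched as below: a detection only counts if the class matches and the localization error is within a threshold. For brevity this sketch uses a scalar location and absolute error; the pairing of events and the actual distance measure (e.g., angular distance) are assumptions not detailed in the abstract.

```python
def location_aware_tp(reference, estimated, max_error):
    """Count true positives under a joint detection + localization criterion.
    reference, estimated: dicts mapping event id -> (event_class, location),
    where location is a scalar here for simplicity.
    max_error: largest localization error still counted as correct.
    """
    tp = 0
    for key, (ref_cls, ref_loc) in reference.items():
        if key in estimated:
            est_cls, est_loc = estimated[key]
            # Both the class and the location must be right.
            if est_cls == ref_cls and abs(est_loc - ref_loc) <= max_error:
                tp += 1
    return tp
```

A detection-only metric would count the second event below as correct despite its large localization error; the joint criterion rejects it.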
Sound event detection (SED) methods are tasked with labeling segments of audio recordings by the presence of active sound sources. SED is typically posed as a supervised machine learning problem, requiring strong annotations for the presence or absence of each sound source at every time instant within the recording. However, strong annotations of this type are both labor- and cost-intensive for human annotators to produce, which limits the practical scalability of SED methods. In this paper, we treat SED as a multiple instance learning (MIL) problem, where training labels are static over a short excerpt, indicating the presence or absence of sound sources but not their temporal locality. The models, however, must still produce temporally dynamic predictions, which must be aggregated (pooled) when compared against the static labels during training. To facilitate this aggregation, we develop a family of adaptive pooling operators, referred to as autopool, which smoothly interpolate between common pooling operators such as min-, max-, or average-pooling and automatically adapt to the characteristics of the sound sources in question. We evaluate the proposed pooling operators on three datasets and demonstrate that in each case the proposed methods outperform non-adaptive pooling operators for static prediction and nearly match the performance of models trained with strong, dynamic annotations. The proposed method is evaluated in conjunction with convolutional neural networks, but can be readily applied to any differentiable model for time-series label prediction. While this paper focuses on SED applications, the proposed methods are general and could be applied widely to MIL problems in any domain.
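The interpolation behavior of autopool can be illustrated with a softmax-weighted average over frame-level probabilities controlled by a scalar parameter alpha: alpha = 0 recovers average-pooling, large positive alpha approaches max-pooling, and large negative alpha approaches min-pooling. This is a minimal forward-pass sketch; in the paper alpha is a learnable per-class parameter inside a differentiable model.

```python
import math

def autopool(probs, alpha):
    """Softmax-weighted average of frame-level probabilities.
    alpha = 0    -> mean-pooling (uniform weights)
    alpha -> +inf -> approaches max-pooling
    alpha -> -inf -> approaches min-pooling
    """
    m = max(alpha * p for p in probs)              # stabilize the exponentials
    weights = [math.exp(alpha * p - m) for p in probs]
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total
```

Because the operator is smooth in alpha, the pooling behavior itself can be tuned by gradient descent to match each sound class, e.g., near-max pooling for brief transient events and near-mean pooling for sustained ones.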