Visual speech information from the speaker's mouth region has been shown to improve the noise robustness of automatic speech recognizers, thus promising to extend their usability in the human-computer interface. In this paper, we review the main components of audiovisual automatic speech recognition (ASR) and present novel contributions in two main areas: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration. On the latter topic, we discuss new work on feature and decision fusion combination, the modeling of audiovisual speech asynchrony, and the incorporation of modality reliability estimates into the bimodal recognition process. We also briefly touch upon the issue of audiovisual adaptation. We apply our algorithms to three multisubject bimodal databases, ranging from small- to large-vocabulary recognition tasks, recorded in both visually controlled and challenging environments. Our experiments demonstrate that the visual modality improves ASR over all conditions and data considered, though less so for visually challenging environments and large-vocabulary tasks.
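The reliability-weighted decision fusion mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the stream-weight scheme, function names, and toy numbers are all hypothetical.

```python
import numpy as np

def fuse_decisions(audio_loglik, visual_loglik, audio_weight):
    """Decision fusion sketch: combine per-class log-likelihoods from the
    audio and visual streams with a reliability-based stream weight.
    audio_weight in [0, 1]; the visual stream gets (1 - audio_weight)."""
    combined = audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik
    return int(np.argmax(combined))

# Toy example: three candidate words, with the audio stream degraded by noise.
audio = np.array([-2.0, -1.9, -2.1])    # nearly uninformative
visual = np.array([-3.0, -0.5, -4.0])   # confidently favors class 1
choice = fuse_decisions(audio, visual, audio_weight=0.3)
print(choice)  # → 1
```

Lowering `audio_weight` when an acoustic reliability estimate signals noise is what lets the visual stream dominate in exactly the conditions where it helps most.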
Human action recognition in 3D skeleton sequences has attracted a lot of research attention. Recently, long short-term memory (LSTM) networks have shown promising performance in this task due to their strengths in modeling the dependencies and dynamics in sequential data. As not all skeletal joints are informative for action recognition, and the irrelevant joints often bring noise which can degrade the performance, we need to pay more attention to the informative ones. However, the original LSTM network does not have explicit attention ability. In this paper, we propose a new class of LSTM network, global context-aware attention LSTM, for skeleton-based action recognition, which is capable of selectively focusing on the informative joints in each frame by using a global context memory cell. To further improve the attention capability, we also introduce a recurrent attention mechanism, with which the attention performance of our network can be enhanced progressively. In addition, a two-stream framework, which leverages coarse-grained attention and fine-grained attention, is also introduced. The proposed method achieves state-of-the-art performance on five challenging datasets for skeleton-based action recognition.
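The core idea of scoring joints against a global context can be sketched as a simple softmax attention, assuming per-joint feature vectors and a context vector of matching dimension; the shapes and names below are illustrative, not the paper's architecture.

```python
import numpy as np

def joint_attention(joint_feats, global_context):
    """Soft attention over skeletal joints: score each joint feature against a
    global context vector, then reweight the joints by their softmax scores.
    joint_feats: (num_joints, dim); global_context: (dim,)."""
    scores = joint_feats @ global_context            # informativeness of each joint
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over joints
    attended = weights[:, None] * joint_feats        # reweighted joint features
    return weights, attended

rng = np.random.default_rng(0)
feats = rng.standard_normal((25, 8))                 # e.g. 25 joints, 8-dim features
context = rng.standard_normal(8)
w, out = joint_attention(feats, context)
```

In the paper's recurrent scheme, the attended output would feed back into updating the global context memory, refining the weights over several iterations.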
In this paper, a 2-D noncausal Markov model is proposed for passive digital image-splicing detection. Different from the traditional Markov model, the proposed approach models an image as a 2-D noncausal signal and captures the underlying dependencies between the current node and its neighbors. The model parameters are treated as the discriminative features to differentiate the spliced images from the natural ones. We apply the model in the block discrete cosine transformation domain and the discrete Meyer wavelet transform domain, and the cross-domain features are treated as the final discriminative features for classification. The support vector machine (SVM), the most popular classifier in image-splicing detection, is used for classification. To evaluate the performance of the proposed method, all the experiments are conducted on public image-splicing detection evaluation data sets, and the experimental results show that the proposed approach outperforms some state-of-the-art methods.
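Markov-based splicing features of this family are typically transition probabilities of thresholded coefficient differences. The sketch below shows a causal, horizontal-only version for intuition; the paper's model is noncausal and uses multiple neighbors, so the threshold, direction, and function name here are illustrative assumptions.

```python
import numpy as np

def markov_features(coeffs, T=3):
    """One-step transition-probability features from a 2-D coefficient array
    (e.g. block-DCT magnitudes): clip horizontal differences to [-T, T], then
    estimate P(next state = j | current state = i) along the rows."""
    diff = np.clip(coeffs[:, :-1] - coeffs[:, 1:], -T, T).astype(int) + T
    num_states = 2 * T + 1
    trans = np.zeros((num_states, num_states))
    for i, j in zip(diff[:, :-1].ravel(), diff[:, 1:].ravel()):
        trans[i, j] += 1
    row_sums = trans.sum(axis=1, keepdims=True)
    trans = np.divide(trans, row_sums, out=np.zeros_like(trans),
                      where=row_sums > 0)
    return trans.ravel()  # (2T+1)^2 features for the classifier

feats = markov_features(np.arange(64).reshape(8, 8))
```

The resulting feature vector (here 49-dimensional) is what would be fed to the SVM; splicing disturbs the natural transition statistics, which is what the classifier picks up on.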
Activity-based travel demand models are becoming essential tools used in transportation planning and regional development scenario evaluation. They describe travel itineraries of individual travelers, namely, what activities they are participating in, when they perform these activities, and how they choose to travel to the activity locales. However, data collection for activity-based models is performed through travel surveys that are infrequent, expensive, and reflect the changes in transportation with significant delays. Thanks to ubiquitous cell phone data, we see an opportunity to substantially complement these surveys with data extracted from network carrier mobile phone usage logs, such as call detail records (CDRs). In this paper, we develop input-output hidden Markov models to infer travelers' activity patterns from CDRs. We apply the model to the data collected by a major network carrier serving millions of users in the San Francisco Bay Area. Our approach delivers an end-to-end actionable solution to the practitioners in the form of a modular and interpretable activity-based travel demand model. It is experimentally validated with three independent data sources: aggregated statistics from travel surveys, a set of collected ground truth activities, and the results of a traffic micro-simulation informed with the travel plans synthesized from the developed generative model.
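Inferring latent activities from observation logs is, at its core, hidden-state decoding. The sketch below uses plain Viterbi decoding on a two-state toy HMM; the paper's input-output HMM additionally conditions transitions and emissions on contextual inputs (e.g. time of day), so treat the states, probabilities, and names here as hypothetical.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden activity sequence for an observation sequence.
    pi: initial probs (S,); A: transitions (S, S); B: emissions (S, O)."""
    S, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = logd[:, None] + np.log(A)      # indexed (from_state, to_state)
        back[t] = cand.argmax(axis=0)
        logd = cand.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# States: 0 = home, 1 = work; observations: nearest cell tower id (0 or 1).
pi = np.array([0.9, 0.1])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi([0, 0, 1, 1, 1], pi, A, B)
print(path)  # → [0, 0, 1, 1, 1]
```

Even this toy version shows the key property exploited in the paper: noisy tower observations are smoothed into a coherent activity sequence by the transition structure.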
Multimodal Machine Learning: A Survey and Taxonomy
Baltrusaitis, Tadas; Ahuja, Chaitanya; Morency, Louis-Philippe
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, February 2019. Journal article, peer reviewed, open access.
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
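The "typical early and late fusion" categorization that the survey goes beyond can be made concrete in a few lines. This is a generic sketch of the two textbook strategies, with hypothetical shapes and a hypothetical equal weighting.

```python
import numpy as np

def early_fusion(audio_feat, text_feat):
    """Early (feature-level) fusion: concatenate unimodal feature vectors
    before a single joint model sees them."""
    return np.concatenate([audio_feat, text_feat])

def late_fusion(audio_probs, text_probs, w=0.5):
    """Late (decision-level) fusion: combine per-modality class posteriors
    after each unimodal model has made its own prediction."""
    return w * audio_probs + (1 - w) * text_probs

a, t = np.ones(4), np.zeros(3)
joint = early_fusion(a, t)                              # 7-dim joint feature
probs = late_fusion(np.array([0.8, 0.2]), np.array([0.4, 0.6]))
```

Representation, translation, alignment, and co-learning are precisely the challenges this simple dichotomy leaves out: neither function above says anything about how the unimodal features were learned, mapped, or time-aligned in the first place.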
Hidden Markov and semi-Markov models (H(S)MMs) constitute useful tools for modeling observations subject to certain dependency structures. The hidden states render these models very flexible and allow them to capture many different types of latent patterns and dynamics present in the data. This has led to the increased popularity of these models, which have been applied to a variety of problems in various domains and settings, including longitudinal data. In many longitudinal studies, the response variable is categorical or count-type. Generalized linear mixed models (GLMMs) can be used to analyze a wide range of variables, including categorical and count. The present study proposes a model that combines HSMMs with GLMMs, leading to generalized linear mixed hidden semi-Markov models (GLM-HSMMs). These models can account for time-varying unobserved heterogeneity and handle different response types. Parameter estimation is achieved using a Monte Carlo Newton-Raphson (MCNR)-like algorithm. In our proposed model, the distribution of the random effects depends on hidden states. We illustrate the applicability of GLM-HSMMs with an example in the field of occupational health, where the response variable consists of count values. Furthermore, we assess the performance of our MCNR-like algorithm through a simulation study.
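The defining feature of a semi-Markov chain, as opposed to a plain Markov chain, is the explicit state-duration distribution. The generative sketch below illustrates only that one idea with a shifted Poisson dwell time; it is not the paper's GLM-HSMM or its MCNR estimation, and all distributions and names are illustrative assumptions.

```python
import numpy as np

def sample_hsmm_states(pi, A, mean_dur, T, seed=0):
    """Sample a hidden state path of length T from a semi-Markov chain:
    each visited state persists for an explicitly sampled duration
    (here 1 + Poisson(mean_dur[s] - 1)), unlike an HMM's implicit
    geometric dwell time."""
    rng = np.random.default_rng(seed)
    states = []
    s = rng.choice(len(pi), p=pi)
    while len(states) < T:
        d = 1 + rng.poisson(mean_dur[s] - 1)   # duration is at least 1
        states.extend([s] * d)
        s = rng.choice(len(A), p=A[s])         # A has zero self-transitions
    return states[:T]

path = sample_hsmm_states(
    pi=np.array([1.0, 0.0]),
    A=np.array([[0.0, 1.0], [1.0, 0.0]]),     # strict alternation of states
    mean_dur=np.array([5.0, 3.0]),
    T=20,
)
```

In the GLM-HSMM, each hidden state would additionally index a GLMM (with state-dependent random effects) that generates the observed counts for the time steps it covers.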
For offering proactive services (e.g., personalized exercise recommendation) to the students in computer supported intelligent education, one of the fundamental tasks is predicting student performance (e.g., scores) on future exercises, where it is necessary to track the change of each student's knowledge acquisition during her exercising activities. Unfortunately, to the best of our knowledge, existing approaches can only exploit the exercising records of students, and the problem of extracting the rich information present in the materials (e.g., knowledge concepts, exercise content) of exercises to achieve both more precise prediction of student performance and more interpretable analysis of knowledge acquisition remains underexplored. To this end, in this paper, we present a holistic study of student performance prediction. To directly achieve the primary goal of performance prediction, we first propose a general Exercise-Enhanced Recurrent Neural Network (EERNN) framework by exploring both students' exercising records and the text content of corresponding exercises. In EERNN, we simply summarize each student's state into an integrated vector and trace it with a recurrent neural network, where we design a bidirectional LSTM to learn the encoding of each exercise from its content. For making final predictions, we design two implementations on the basis of EERNN with different prediction strategies, i.e., EERNNM with Markov property and EERNNA with Attention mechanism. Then, to explicitly track students' knowledge acquisition on multiple knowledge concepts, we extend EERNN to an explainable Exercise-aware Knowledge Tracing (EKT) framework by incorporating the knowledge concept information, where the student's integrated state vector is now extended to a knowledge state matrix. In EKT, we further develop a memory network for quantifying how much each exercise can affect the mastery of students on multiple knowledge concepts during the exercising process.
Finally, we conduct extensive experiments and evaluate both EERNN and EKT frameworks on a large-scale real-world data. The results in both general and cold-start scenarios clearly demonstrate the effectiveness of two frameworks in student performance prediction as well as the superior interpretability of EKT.
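The contrast between the two prediction strategies can be sketched simply: the Markov variant would use only the latest hidden state, while the attention variant reweights all past states by their exercises' similarity to the current one. Everything below (shapes, the sigmoid readout, names) is a hypothetical illustration, not the EERNN architecture itself.

```python
import numpy as np

def attention_predict(past_states, past_encodings, current_encoding):
    """EERNNA-style sketch: weight each past hidden state by the softmax
    similarity between its exercise encoding and the current exercise's
    encoding, then read a score off the attended state.
    (An EERNNM-style sketch would instead use past_states[-1] alone.)"""
    sims = past_encodings @ current_encoding
    w = np.exp(sims - sims.max())
    w /= w.sum()                                    # attention over history
    attended = w @ past_states                      # (hidden_dim,)
    return 1.0 / (1.0 + np.exp(-attended.mean()))   # toy score in (0, 1)

rng = np.random.default_rng(1)
states = rng.standard_normal((10, 16))   # states after 10 past exercises
encs = rng.standard_normal((10, 8))      # content encodings of those exercises
score = attention_predict(states, encs, rng.standard_normal(8))
```

The attention weights are also what makes the framework interpretable: they indicate which past exercises most influenced the prediction for the current one.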
Recurrent sequence-to-sequence models using encoder-decoder architecture have made great progress in the speech recognition task. However, they suffer from the drawback of slow training speed because the internal recurrence limits training parallelization. In this paper, we present the Speech-Transformer, a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn the positional dependencies, and which can be trained faster and more efficiently. We also propose a 2D-Attention mechanism, which can jointly attend to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer. Evaluated on the Wall Street Journal (WSJ) speech recognition dataset, our best model achieves a competitive word error rate (WER) of 10.9%, while the whole training process takes only 1.2 days on 1 GPU, significantly faster than the published results of recurrent sequence-to-sequence models.
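The idea of attending along both axes of a spectrogram can be loosely sketched with plain scaled dot-product self-attention applied twice, once per axis. This is only a shape-level illustration under simplifying assumptions; the paper's 2D-Attention uses learned convolutional projections rather than the raw input as query, key, and value.

```python
import numpy as np

def scaled_dot_attention(x):
    """Self-attention along the first axis of x, shape (positions, dim)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def two_d_attention(spec):
    """Attend over time (rows) and, separately, over frequency (columns)
    of a spectrogram-like input, then merge the two views by summation."""
    time_view = scaled_dot_attention(spec)        # attend across time frames
    freq_view = scaled_dot_attention(spec.T).T    # attend across frequency bins
    return time_view + freq_view

spec = np.random.default_rng(2).standard_normal((50, 40))  # 50 frames, 40 bins
out = two_d_attention(spec)
```

Because neither pass involves recurrence, every position is processed in parallel, which is the source of the training-speed advantage the abstract reports.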
Surveillance videos are able to capture a variety of realistic anomalies. In this paper, we propose to learn anomalies by exploiting both normal and anomalous videos. To avoid annotating the anomalous segments or clips in training videos, which is very time consuming, we propose to learn anomalies through a deep multiple instance ranking framework by leveraging weakly labeled training videos, i.e., the training labels (anomalous or normal) are at video-level instead of clip-level. In our approach, we consider normal and anomalous videos as bags and video segments as instances in multiple instance learning (MIL), and automatically learn a deep anomaly ranking model that predicts high anomaly scores for anomalous video segments. Furthermore, we introduce sparsity and temporal smoothness constraints in the ranking loss function to better localize anomalies during training. We also introduce a new large-scale, first-of-its-kind dataset of 128 hours of videos. It consists of 1900 long and untrimmed real-world surveillance videos, with 13 realistic anomalies such as fighting, road accident, burglary, robbery, etc., as well as normal activities. This dataset can be used for two tasks: first, general anomaly detection considering all anomalies in one group and all normal activities in another group; second, recognizing each of the 13 anomalous activities. Our experimental results show that our MIL method for anomaly detection achieves significant improvement in anomaly detection performance as compared to the state-of-the-art approaches. We provide the results of several recent deep learning baselines on anomalous activity recognition. The low recognition performance of these baselines reveals that our dataset is very challenging and opens more opportunities for future work. The dataset is available at: http://crcv.ucf.edu/projects/real-world/
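The ranking objective described above can be sketched on raw segment scores: a hinge term pushes the top segment of an anomalous bag above the top segment of a normal bag, with smoothness and sparsity regularizers on the anomalous bag. The regularizer weights and names below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def mil_ranking_loss(pos_scores, neg_scores, lam1=8e-5, lam2=8e-5):
    """Deep MIL ranking objective sketch for one bag pair.
    pos_scores/neg_scores: segment anomaly scores of an anomalous/normal
    video; lam1 and lam2 are hypothetical regularizer weights."""
    # Hinge: the max-scoring anomalous segment should outrank
    # the max-scoring normal segment by a margin of 1.
    hinge = max(0.0, 1.0 - pos_scores.max() + neg_scores.max())
    smooth = np.sum(np.diff(pos_scores) ** 2)   # adjacent segments score alike
    sparse = np.sum(pos_scores)                 # anomalies are rare in the bag
    return hinge + lam1 * smooth + lam2 * sparse

pos = np.array([0.1, 0.9, 0.2, 0.1])   # segment scores of an anomalous video
neg = np.array([0.2, 0.3, 0.1, 0.2])   # segment scores of a normal video
loss = mil_ranking_loss(pos, neg)
```

Taking the max over each bag is what lets video-level labels supervise segment-level scores: only the most anomalous-looking segment in each video enters the ranking term.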
It is very difficult to establish and maintain end-to-end connections in a vehicle ad hoc network (VANET) as a result of high vehicle speed, long inter-vehicle distance, and varying vehicle density. Instead, a store-and-forward strategy has been considered for vehicle communications. The success of this strategy, however, depends heavily on the cooperation among nodes. Different from existing store-and-forward solutions, we propose predictive routing based on the hidden Markov model (PRHMM) for VANETs, which exploits the regularity of vehicle moving behaviors to increase the transmission performance. As vehicle movements often exhibit a high degree of repetition, including regular visits to certain places and regular contacts during daily activities, we can predict a vehicle's future locations based on the knowledge of past traces and the hidden Markov model. Consequently, the short-term route of a vehicle and its packet delivery probability for a specific mobile destination can be predicted. Moreover, PRHMM enables seamless handoff between vehicle-to-vehicle and vehicle-to-infrastructure communications so that the transmission performance will not be constrained by the vehicle density and moving speed. Simulation evaluation demonstrates that PRHMM outperforms existing solutions in terms of delivery ratio, end-to-end delay, traffic overhead, and buffer occupancy.
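The regularity that PRHMM exploits can be illustrated with a much simpler stand-in: estimating a first-order transition matrix from a vehicle's past location trace and predicting the next location from it. This is only a toy version of the idea; the paper's model uses hidden states, and the location ids and trace below are hypothetical.

```python
import numpy as np

def next_location_probs(trace, num_locs):
    """Estimate a first-order transition matrix from a past location trace
    and return the distribution over the vehicle's next location."""
    counts = np.zeros((num_locs, num_locs))
    for a, b in zip(trace[:-1], trace[1:]):
        counts[a, b] += 1
    row = counts[trace[-1]]
    if row.sum() == 0:                       # current location never seen before
        return np.full(num_locs, 1.0 / num_locs)
    return row / row.sum()

# Daily commute: home(0) -> road(1) -> work(2) -> road(1) -> home(0), repeated.
trace = [0, 1, 2, 1, 0, 1, 2, 1, 0]
probs = next_location_probs(trace, num_locs=3)
print(probs.argmax())  # → 1 (from home, the vehicle always heads to the road)
```

A predicted location distribution like this is what lets a router estimate, ahead of time, whether a candidate relay is likely to come within range of the destination.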