Advances in Variational Inference. Zhang, Cheng; Butepage, Judith; Kjellstrom, Hedvig ...
IEEE Transactions on Pattern Analysis and Machine Intelligence, 08/2019, Volume 41, Issue 8
Journal Article · Peer reviewed · Open access
Many modern unsupervised or semi-supervised machine learning algorithms rely on Bayesian probabilistic models. These models are usually intractable and thus require approximate inference. Variational inference (VI) lets us approximate a high-dimensional Bayesian posterior with a simpler variational distribution by solving an optimization problem. This approach has been successfully applied to various models and large-scale applications. In this review, we give an overview of recent trends in variational inference. We first introduce standard mean field variational inference, then review recent advances focusing on the following aspects: (a) scalable VI, which includes stochastic approximations, (b) generic VI, which extends the applicability of VI to a large class of otherwise intractable models, such as non-conjugate models, (c) accurate VI, which includes variational models beyond the mean field approximation or with atypical divergences, and (d) amortized VI, which implements the inference over local latent variables with inference networks. Finally, we provide a summary of promising future research directions.
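The abstract above describes VI as posterior approximation by optimization. A minimal sketch of that idea, using a toy conjugate model (not from the paper) whose exact posterior is known, so the variational fit can be checked; the model, step sizes, and sample counts are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative only):
#   prior  theta ~ N(0, 1)
#   data   x_i | theta ~ N(theta, 1),  i = 1..n
# The exact posterior is N(sum(x)/(n+1), 1/(n+1)), which lets us
# verify the variational approximation.
x = rng.normal(2.0, 1.0, size=50)
n = len(x)
exact_mean = x.sum() / (n + 1)
exact_var = 1.0 / (n + 1)

# Variational family q(theta) = N(m, s^2); maximize the ELBO by
# stochastic gradient ascent with the reparameterization trick.
m, log_s = 0.0, 0.0
lr, n_mc = 0.01, 32
for _ in range(3000):
    eps = rng.normal(size=n_mc)
    s = np.exp(log_s)
    theta = m + s * eps                            # reparameterized samples
    # d/dtheta log p(x, theta) = -theta + sum_i (x_i - theta)
    dlogp = x.sum() - (n + 1) * theta
    grad_m = dlogp.mean()                          # pathwise gradient wrt m
    grad_log_s = (dlogp * eps * s).mean() + 1.0    # + d(entropy)/d(log_s)
    m += lr * grad_m
    log_s += lr * grad_log_s

print(m, np.exp(2 * log_s))   # should approach exact_mean, exact_var
```

In a conjugate model like this the optimum is available in closed form; the stochastic-gradient loop stands in for the scalable, generic VI machinery the review surveys for models where it is not.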
This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyze the importance of smoothing of the produced motion. Our results indicated that the proposed method improved on our baseline in terms of objective measures. For example, it better captured the motion dynamics and better matched the motion-speed distribution. Moreover, we performed user studies on two different datasets. The studies confirmed that our proposed method is perceived as more natural than the baseline, although the difference in the studies was eliminated by appropriate post-processing: hip-centering and smoothing. We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
Orthopedic disorders are common among horses, often leading to euthanasia that could have been avoided with earlier detection. These conditions often create varying degrees of subtle long-term pain. It is challenging to train a visual pain recognition method with video data depicting such pain, since the resulting pain behavior is also subtle, sparsely appearing, and varying, making it challenging even for an expert human labeller to provide accurate ground truth for the data. We show that a model trained solely on a dataset of horses with acute experimental pain (where labeling is less ambiguous) can aid recognition of the more subtle displays of orthopedic pain. Moreover, we present a human expert baseline for the problem, as well as an extensive empirical study of various domain transfer methods and of what is detected in the orthopedic dataset by the pain recognition method trained on clean experimental pain. Finally, this is accompanied by a discussion of the challenges posed by real-world animal behavior datasets and how best practices can be established for similar fine-grained action recognition tasks. Our code is available at https://github.com/sofiabroome/painface-recognition.
Automated recognition of human facial expressions of pain and emotions is to a certain degree a solved problem, using approaches based on computer vision and machine learning. However, the application of such methods to horses has proven difficult. Major barriers are the lack of sufficiently large, annotated databases for horses and difficulties in obtaining correct classifications of pain because horses are non-verbal. This review describes our work to overcome these barriers, using two different approaches. One involves the use of a manual, but relatively objective, classification system for facial activity (Facial Action Coding System), where data are analyzed for pain expressions after coding using machine learning principles. We have devised tools that can aid manual labeling by identifying the faces and facial keypoints of horses. This approach provides promising results in the automated recognition of facial action units from images. The second approach, recurrent neural network end-to-end learning, requires less extraction of features and representations from the video but instead depends on large volumes of video data with ground truth. Our preliminary results clearly suggest that dynamics are important for pain recognition and show that combinations of recurrent neural networks can classify experimental pain in a small number of horses better than human raters.
An essential task for computer vision-based assistive technologies is to help visually impaired people to recognize objects in constrained environments, for instance, recognizing food items in grocery stores. In this paper, we introduce a novel dataset with natural images of groceries—fruits, vegetables, and packaged products—where all images have been taken inside grocery stores to resemble a shopping scenario. Additionally, we download iconic images and text descriptions for each item that can be utilized for better representation learning of groceries. We select a multi-view generative model, which can combine the different item information into lower-dimensional representations. The experiments show that utilizing the additional information yields higher accuracies on classifying grocery items than only using the natural images. We observe that iconic images help to construct representations separated by visual differences of the items, while text descriptions enable the model to distinguish between visually similar items by different ingredients.
•Introducing a dataset of grocery items with real images and web-scraped information
•Best model performance is achieved when all information in the dataset is combined
•Web-scraped images help to group the grocery items based on their color and shape
•Web-scraped text separates groceries based on differences in ingredients and flavor
In recent years, several computer vision-based assistive technologies for helping visually impaired people have been released on the market. We study a special case in which visual capability is important when searching for objects: grocery shopping. To enable assistive vision devices for grocery shopping, data representing the grocery items have to be available. We, therefore, provide a challenging dataset of smartphone images of grocery items resembling the shopping scenario with an assistive vision device. Our dataset is publicly available to encourage other researchers to evaluate their computer vision models on grocery item classification in real-world environments. The next step would be to deploy the trained models into mobile devices, such as smartphone applications, to evaluate whether the models can perform effectively in real time for human users. This dataset is a step toward enabling these technologies to make everyday life easier for the visually impaired.
Assistive vision devices are used for recognizing food items to help visually impaired people with grocery shopping on their own. We have collected smartphone images of groceries using an assistive device when shopping. Our results show that utilizing both smartphone images and data downloaded from supermarket websites can enhance a system's grocery recognition capabilities compared with only using smartphone images. Our study is thus a step toward fully image-based assistive devices for helping the visually impaired with grocery shopping.
In the spirit of recent work on contextual recognition and estimation, we present a method for estimating the pose of human hands, employing information about the shape of the object in the hand. Despite the fact that most applications of human hand tracking involve grasping and manipulation of objects, the majority of methods in the literature assume a free hand, isolated from the surrounding environment. Occlusion of the hand by grasped objects does in fact often pose a severe challenge to the estimation of hand pose. In the presented method, object occlusion is not only compensated for, it contributes to the pose estimation in a contextual fashion, without an explicit model of object shape. Our hand tracking method is non-parametric, performing a nearest neighbor search in a large database (.. entries) of hand poses with and without grasped objects. The system, which operates in real time, is robust to self-occlusions, object occlusions, and segmentation errors, and provides full hand pose reconstruction from monocular video. Temporal consistency in hand pose is taken into account without explicitly tracking the hand in the high-dimensional pose space. Experiments show the non-parametric method to outperform other state-of-the-art regression methods, while operating at a significantly lower computational cost than comparable model-based hand tracking methods.
•We developed a system that estimates in real-time the articulated pose of the hand.
•Our approach is discriminative, with low computational load and fast error recovery.
•Our system is robust to occlusions, implicitly extracting information from them.
•The system is thoroughly evaluated with quantitative and qualitative experiments.
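The core mechanism the abstract describes, nearest-neighbor lookup in a pose database with temporal consistency handled by re-weighting rather than tracking in pose space, can be sketched roughly as follows. All names, dimensions, and the weighting scheme are illustrative assumptions (random vectors stand in for real appearance descriptors and joint-angle poses):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical database: each entry pairs an appearance feature (here a
# random 16-D vector standing in for a hand image descriptor) with a
# pose vector (e.g. joint angles). Sizes are illustrative.
n_db, feat_dim, pose_dim, k = 5000, 16, 10, 5
db_feats = rng.normal(size=(n_db, feat_dim))
db_poses = rng.normal(size=(n_db, pose_dim))

def estimate_pose(query_feat, prev_pose=None, alpha=0.5):
    """k-NN pose lookup; temporal consistency via re-weighting neighbors
    by closeness of their pose to the previous frame's estimate, instead
    of explicitly tracking in the high-dimensional pose space."""
    d = np.linalg.norm(db_feats - query_feat, axis=1)
    idx = np.argpartition(d, k)[:k]          # indices of k nearest features
    w = 1.0 / (d[idx] + 1e-8)                # closer neighbors weigh more
    if prev_pose is not None:
        pose_d = np.linalg.norm(db_poses[idx] - prev_pose, axis=1)
        w *= np.exp(-alpha * pose_d)         # favor temporally consistent poses
    w /= w.sum()
    return w @ db_poses[idx]                 # weighted average of neighbor poses

pose0 = estimate_pose(db_feats[42])                  # first frame
pose1 = estimate_pose(db_feats[43], prev_pose=pose0) # next frame, with context
```

Brute-force search as above is linear in the database size; a real-time system over a very large database would typically swap in an approximate nearest-neighbor index, which is an implementation choice the abstract does not specify.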
The generalized linear mixed model for binary outcomes with the probit link function is used in many fields but has a computationally challenging likelihood when there are many random effects. We extend a previously used importance sampler, making it much faster in the context of estimating heritability and related effects from family data, by adding a gradient and a Hessian approximation and providing a faster implementation. Additionally, a graph-based method is suggested to simplify the likelihood when there are thousands of individuals in each family. Simulation studies show that the resulting method is orders of magnitude faster, has a negligible efficiency loss, and yields confidence intervals with nominal coverage. We also analyze data from a large study of obsessive-compulsive disorder based on Swedish multi-generational data. In this analysis, the proposed method yielded similar results to a previous analysis, but was much faster.
Supplementary materials for this article are available online.
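The computational difficulty the abstract refers to is that each cluster's probit likelihood is an integral over its random effects. A minimal sketch for a single random intercept, using plain Monte Carlo with the prior as proposal (not the paper's tuned importance sampler, and without its gradient/Hessian machinery); model, parameter values, and cluster size are all illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Illustrative probit GLMM with one Gaussian random intercept per cluster:
#   y_ij ~ Bernoulli(Phi(x_ij * beta + u_i)),   u_i ~ N(0, sigma^2).
# The per-cluster likelihood integrates out u_i:
#   L = E_{u ~ N(0, sigma^2)} [ prod_j Phi((2 y_j - 1)(x_j beta + u)) ].
beta, sigma = 0.8, 1.2
x = rng.normal(size=6)                               # one small cluster
u_true = rng.normal(0.0, sigma)
y = (rng.random(6) < norm.cdf(x * beta + u_true)).astype(int)

def cluster_loglik(beta, sigma, n_mc=20000):
    u = rng.normal(0.0, sigma, size=n_mc)            # proposal = prior
    eta = x[None, :] * beta + u[:, None]             # (n_mc, n_obs)
    sign = 2 * y - 1
    logp = norm.logcdf(sign * eta).sum(axis=1)       # log integrand per draw
    m = logp.max()                                   # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(logp - m)))

ll = cluster_loglik(beta, sigma)
```

For one scalar random effect this integral could equally be done by Gauss-Hermite quadrature; the sampling formulation is what generalizes to the many-random-effect, thousands-per-family setting the paper targets, where the proposal and the graph-based likelihood simplification matter.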