In this paper, we propose an algorithm for online, real-time tracking of arbitrary objects in videos from unconstrained environments. The method is based on a particle filter framework using different visual features and motion prediction models. We effectively integrate a discriminative online learning classifier into the model and propose a new method to collect negative training examples for updating the classifier at each video frame. Instead of taking negative examples only from the surroundings of the object region, or from specific background regions, our algorithm samples the negatives from a contextual motion density function in order to learn to discriminate the target as early as possible from potential distracting image regions. We experimentally show that this learning scheme improves the overall performance of the tracking algorithm. Moreover, we present quantitative and qualitative results on four challenging public data sets that show the robustness of the tracking algorithm with respect to appearance and view changes, lighting variations, partial occlusions, as well as object deformations. Finally, we compare the results with more than 30 state-of-the-art methods using two public benchmarks, showing very competitive results.
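As a minimal, illustrative sketch of the particle-filter backbone mentioned in this abstract — not the paper's actual observation model, motion priors, or negative-sampling scheme; the function names and parameter values here are assumptions — one predict/reweight/resample cycle could look like:

```python
import numpy as np

def particle_filter_step(particles, weights, measure_fn, motion_std=2.0, rng=None):
    """One predict-reweight-resample cycle of a bootstrap particle filter.

    particles:  (N, 2) array of candidate object positions.
    weights:    (N,) normalised importance weights.
    measure_fn: maps a position to a likelihood score; a stand-in for a
                visual observation model (e.g. colour/gradient similarity).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    # Predict: diffuse particles with a simple Gaussian motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: reweight each particle by its observation likelihood.
    weights = weights * np.array([measure_fn(p) for p in particles])
    weights = weights / weights.sum()
    # Resample (systematic) when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(weights):
        positions = (rng.random() + np.arange(len(weights))) / len(weights)
        idx = np.searchsorted(np.cumsum(weights), positions)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    estimate = np.average(particles, axis=0, weights=weights)
    return particles, weights, estimate
```

The state estimate is the weighted particle mean; a real tracker would add richer state (scale, velocity) and the learned discriminative classifier as part of `measure_fn`.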
This paper presents a framework using siamese Multi-layer Perceptrons (MLP) for supervised dimensionality reduction and face identification. Compared with the classical MLP that trains on fully labeled data, the siamese MLP learns from side information only, i.e., how similar data examples are to each other. In this study, we compare it with the classical MLP on the problem of face identification. Experimental results on the Extended Yale B database demonstrate that the siamese MLP, trained with side information only, achieves classification performance comparable to that of the classical MLP trained on fully labeled data. Moreover, while the classical MLP fixes the dimension of the output space, the siamese MLP allows a flexible output dimension; we therefore also apply it to visualize dimensionality reduction into 2-D and 3-D spaces.
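To make the "side information only" idea concrete, a standard contrastive loss on a pair of embeddings is sketched below — a generic formulation, assumed here for illustration rather than taken from the paper:

```python
import numpy as np

def contrastive_loss(za, zb, same, margin=1.0):
    """Hinge-style contrastive loss on one pair of embeddings.

    za, zb: embedding vectors produced by the two weight-shared branches.
    same:   1 if the inputs belong to the same identity, else 0.
    Only this similar/dissimilar side information is needed, not class labels:
    similar pairs are pulled together, dissimilar pairs pushed apart
    until they are at least `margin` away from each other.
    """
    d = np.linalg.norm(za - zb)
    return same * d ** 2 + (1 - same) * max(0.0, margin - d) ** 2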
A self-organization hydrodynamic process has recently been proposed to partially explain the formation of femtosecond laser-induced nanopatterns on Nickel, which have important applications in optics, microbiology, medicine, etc. Exploring laser pattern space is difficult, however, which simultaneously (i) motivates using machine learning (ML) to search for novel patterns and (ii) hinders it, because of the few data available from costly and time-consuming experiments. In this paper, we use ML to predict novel patterns by integrating partial physical knowledge in the form of the Swift-Hohenberg (SH) partial differential equation (PDE). To do so, we propose a framework to learn with few data, in the absence of initial conditions, by benefiting from background knowledge in the form of a PDE solver. We show that in the case of a self-organization process, a feature mapping exists in which initial conditions can safely be ignored and patterns can be described in terms of PDE parameters alone, which drastically simplifies the problem. In order to apply this framework, we develop a second-order pseudospectral solver of the SH equation which offers a good compromise between accuracy and speed. Our method allows us to predict new nanopatterns in good agreement with experimental data. Moreover, we show that pattern features are related, which imposes constraints on novel pattern design, and suggest an efficient procedure of acquiring experimental data iteratively to improve the generalization of the learned model. It also allows us to identify the limitations of the SH equation as a partial model and suggests an improvement to the physical model itself.
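For orientation, here is a deliberately simplified pseudospectral time step of the Swift-Hohenberg equation — 1-D, first-order semi-implicit in time, with an assumed cubic nonlinearity and assumed parameter values — whereas the paper develops a second-order solver; this is only a sketch of the technique, not the paper's implementation:

```python
import numpy as np

def sh_step(u, dt=0.1, eps=0.2, L=32 * np.pi):
    """One semi-implicit pseudospectral step of the 1-D Swift-Hohenberg
    equation  u_t = eps*u - (1 + d^2/dx^2)^2 u - u^3  on a periodic domain.

    The linear operator is diagonal in Fourier space and treated
    implicitly (unconditionally stable); the cubic term is evaluated
    explicitly in real space.
    """
    n = u.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)   # angular wavenumbers
    lin = eps - (1 - k ** 2) ** 2                # linear symbol of the SH operator
    u_hat = np.fft.fft(u)
    n_hat = np.fft.fft(-u ** 3)                  # explicit nonlinearity
    u_hat = (u_hat + dt * n_hat) / (1 - dt * lin)
    return np.real(np.fft.ifft(u_hat))
```

Starting from small random noise, modes near wavenumber 1 grow and saturate into a pattern whose features depend only on the PDE parameters — the property the framework above exploits to drop the initial conditions.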
In this paper, we propose a new method for estimating the visual focus of attention (VFOA) in a video stream captured by a single distant camera showing several persons sitting around a table, as in formal meeting or video conferencing settings. The visual targets for a given person are automatically extracted online by an unsupervised algorithm that incrementally learns appearance clusters from low-level visual features computed on face patches provided by a face tracker, without the intermediate, error-prone head pose estimation step used in classical approaches. The clusters learned in this way can then be used to classify the different visual attention targets of the person during a tracking run, without any prior knowledge of the environment, the configuration of the room, or the visible persons. Experiments on public datasets containing almost 2 h of annotated videos from meetings and video conferencing show that the proposed algorithm produces state-of-the-art results and even outperforms a traditional supervised method that is based on head orientation estimation and classifies VFOA using Gaussian mixture models.
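The incremental, unsupervised discovery of appearance clusters can be sketched with a toy nearest-mean scheme — a stand-in illustration (threshold, feature space, and update rule are assumptions, not the paper's algorithm):

```python
import numpy as np

def online_cluster(features, thresh=1.0):
    """Incremental clustering: assign each feature vector to the nearest
    existing cluster mean, or spawn a new cluster when no mean is closer
    than `thresh`. Cluster means are updated as running averages."""
    means, counts, labels = [], [], []
    for f in features:
        if means:
            d = [np.linalg.norm(f - m) for m in means]
            j = int(np.argmin(d))
        if not means or d[j] > thresh:
            means.append(np.array(f, dtype=float))   # new appearance cluster
            counts.append(1)
            j = len(means) - 1
        else:
            counts[j] += 1
            means[j] += (f - means[j]) / counts[j]   # running-mean update
        labels.append(j)
    return np.array(means), labels
```

Each discovered cluster would then act as one candidate visual attention target when classifying where a person is looking.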
In this paper, we present a new algorithm for real-time single-object tracking in videos in unconstrained environments. The algorithm comprises two different components that are trained "in one shot" at the first video frame: a detector that makes use of the generalized Hough transform with color and gradient descriptors and a probabilistic segmentation method based on global models for foreground and background color distributions. Both components work at pixel level and are used for tracking in a combined way adapting each other in a co-training manner. Moreover, we propose an adaptive shape model as well as a new probabilistic method for updating the scale of the tracker. Through effective model adaptation and segmentation, the algorithm is able to track objects that undergo rigid and non-rigid deformations and considerable shape and appearance variations. The proposed tracking method has been thoroughly evaluated on challenging benchmarks, and outperforms the state-of-the-art tracking methods designed for the same task. Finally, a very efficient implementation of the proposed models allows for extremely fast tracking.
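The global color-model segmentation component can be illustrated with per-pixel Bayes' rule over foreground/background histograms — a minimal sketch under assumed inputs (pre-binned pixels, normalised histograms), not the paper's exact formulation:

```python
import numpy as np

def foreground_posterior(pixels, fg_hist, bg_hist, prior_fg=0.5):
    """Per-pixel foreground probability from global colour histograms,
    via Bayes' rule: P(fg | c) ∝ P(c | fg) P(fg).

    pixels:  array of histogram-bin indices (one per pixel).
    fg_hist, bg_hist: normalised colour histograms of foreground/background.
    """
    p_fg = fg_hist[pixels] * prior_fg
    p_bg = bg_hist[pixels] * (1.0 - prior_fg)
    return p_fg / (p_fg + p_bg + 1e-12)   # epsilon guards empty bins
```

In the co-training spirit described above, the resulting pixel-wise posterior map and the Hough-based detector would refine each other's training regions frame by frame.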
•We introduce a new method to measure periodicity in any video, even outside the “daily-life” domain.
•Our framework can create a periodic latent representation for any video (conventional or 4D videos) without supervision.
•Our method can convert periodic videos into simple periodic signals.
•We introduce a robust peak detection and counting algorithm for periodic signals.
We introduce a context-agnostic unsupervised method to count periodicity in videos. Current methods estimate periodicity for a specific type of application (e.g. some repetitive human motion). We propose a novel method with strong generalisation ability, since it is not biased towards specific visual features: it is applicable to a range of diverse domains without adaptation, relying on a deep neural network trained in a completely unsupervised way. More specifically, the network is trained to transform the periodic temporal data into a lower-dimensional latent encoding in such a way that it forms a cyclic path in this latent space. We also introduce a novel algorithm that reliably detects and counts periods in complex time series. Despite being unsupervised and competing against supervised methods with complex architectures, our experimental results demonstrate that our approach reaches state-of-the-art performance for periodicity counting on the challenging QUVA video benchmark.
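Once a periodic video has been reduced to a simple 1-D signal, counting comes down to robust peak detection. The toy counter below uses a prominence criterion — an illustrative stand-in (threshold and logic are assumptions), not the paper's algorithm:

```python
import numpy as np

def count_periods(signal, min_prominence=0.5):
    """Count periods in a 1-D signal by detecting local maxima that rise
    at least `min_prominence` above the lowest value seen since the
    previous counted peak, which suppresses small noise wiggles."""
    s = np.asarray(signal, dtype=float)
    peaks = 0
    last_min = s[0]
    for i in range(1, len(s) - 1):
        last_min = min(last_min, s[i])
        if s[i] > s[i - 1] and s[i] >= s[i + 1] and s[i] - last_min >= min_prominence:
            peaks += 1
            last_min = s[i]   # reset the reference after each counted peak
    return peaks
```

One counted peak per cycle gives the period count directly; a flat or weakly varying signal yields zero.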
•Original siamese neural network objective function.
•Polar sine-based angular reformulation for cosine dissimilarity learning.
•Application on a multimodal human action dataset.
•New evaluations of 3 siamese neural networks using input data pairs, triplets and tuples.
•Projection space analysis and computation complexity.
This paper focuses on metric learning with Siamese Neural Networks (SNN). Without any prior, SNNs learn to compute a non-linear metric using only similarity and dissimilarity relationships between input data. Our SNN model introduces three contributions: a tuple-based architecture, an objective function with a norm regularisation and a polar sine-based angular reformulation for cosine dissimilarity learning. Applying our SNN model to Human Action Recognition (HAR) gives very competitive results using only one accelerometer or one motion capture point on the Multimodal Human Action Dataset (MHAD). Performances and properties of our proposals in terms of accuracy, convergence and complexity are assessed, with very favourable results. Additional experiments on the “Challenge for Multimodal Mid-Air Gesture Recognition for Close Human Computer Interaction” Dataset (ChAirGest) confirm the competitive comparison of our proposals with state-of-the-art models.
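The polar sine mentioned above generalises the sine of the angle between two vectors to a set of vectors; a minimal implementation of the standard definition (this illustrates the quantity itself, not how the paper embeds it in the objective):

```python
import numpy as np

def polar_sine(V):
    """Polar sine of the n row vectors of V: sqrt(det(V V^T)) divided by
    the product of the vector norms. For two vectors it reduces to
    sin(angle); it equals 1 exactly when the vectors are mutually
    orthogonal, making it a natural angular-spread measure."""
    V = np.asarray(V, dtype=float)
    gram = V @ V.T                       # Gram matrix of the vectors
    norms = np.linalg.norm(V, axis=1)
    return np.sqrt(max(np.linalg.det(gram), 0.0)) / np.prod(norms)
```

Maximising such an angular-spread term over the projections of dissimilar examples pushes them apart in direction rather than only in distance, which is the intuition behind using it for cosine dissimilarity learning.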
•Rethink the problem of robust facial landmark detection, bridging research and practical use.
•Novel method based on the Wasserstein loss to significantly improve the robustness of facial landmark detection.
•Several modifications to the current evaluation metrics to reflect the robustness of state-of-the-art methods more effectively.
The recent performance of facial landmark detection has been significantly improved by using deep Convolutional Neural Networks (CNNs), especially the Heatmap Regression Models (HRMs). Although their performance on common benchmark datasets has reached a high level, the robustness of these models remains a challenging problem in practical use under the noisy conditions of realistic environments. Contrary to most existing work focusing on the design of new models, we argue that improving robustness requires rethinking many other aspects, including the use of datasets, the format of landmark annotation, the evaluation metric, as well as the training and detection algorithm itself. In this paper, we propose a novel method for robust facial landmark detection, using a loss function based on the 2D Wasserstein distance combined with a new landmark coordinate sampling relying on the barycenter of the individual probability distributions. Our method is plug-and-play on most state-of-the-art HRMs, with neither additional complexity nor structural modifications of the models. Furthermore, given this large performance increase, we find that current evaluation metrics can no longer fully reflect the robustness of these models. Therefore, we propose several improvements to the standard evaluation protocol. Extensive experimental results on both traditional evaluation metrics and our evaluation metrics demonstrate that our approach significantly improves the robustness of state-of-the-art facial landmark detection models.
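The barycenter-based coordinate sampling can be illustrated in a few lines: instead of the argmax of the heatmap, the landmark is read out as the expected position under the normalised heatmap. This is a minimal sketch of that readout only (the 2D Wasserstein loss itself is not shown):

```python
import numpy as np

def heatmap_barycenter(hm):
    """Landmark coordinate as the barycenter (expected position) of a
    non-negative heatmap, rather than its argmax. Returns (x, y).
    Using the full distribution makes the readout sub-pixel accurate
    and less sensitive to noisy, multi-modal heatmaps."""
    p = hm / hm.sum()            # normalise to a probability distribution
    ys, xs = np.indices(hm.shape)
    return float((p * xs).sum()), float((p * ys).sum())
```

For a heatmap with a single sharp peak the barycenter coincides with the argmax; for a spread-out or bimodal response it averages over the mass instead of committing to one pixel.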