CoReS: Compatible Representations via Stationarity Biondi, Niccolo; Pernici, Federico; Bruni, Matteo ...
IEEE transactions on pattern analysis and machine intelligence, 08/2023, Volume 45, Issue 8
Journal Article
Peer reviewed
Compatible features enable the direct comparison of old and new learned features, allowing them to be used interchangeably over time. In visual search systems, this eliminates the need to extract new features from the gallery-set when the representation model is upgraded with novel data. This is of great value in real applications, as re-indexing the gallery-set can be computationally expensive when the gallery-set is large, or even infeasible due to privacy or other concerns of the application. In this paper, we propose CoReS, a new training procedure to learn representations that are compatible with those previously learned, grounded on the stationarity of the features as provided by fixed classifiers based on polytopes. With this solution, classes are maximally separated in the representation space and maintain their spatial configuration stationary as new classes are added, so that there is no need to learn any mapping between representations nor to impose pairwise training with the previously learned model. We demonstrate that our training procedure largely outperforms the current state of the art and is particularly effective in the case of multiple upgrades of the training-set, which is the typical case in real applications. Code will be available upon publication.
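As a concrete illustration of the fixed-classifier idea this abstract builds on, the sketch below freezes a linear classifier at the vertices of a regular simplex so that class directions stay stationary across model upgrades. It is a minimal PyTorch sketch of the general technique; the class and parameter names are illustrative, and the exact polytope construction and scaling used by CoReS may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedSimplexClassifier(nn.Module):
    """Linear classifier whose weights are frozen at the vertices of a
    regular simplex, so class directions never move as training proceeds.
    A minimal sketch of the fixed-polytope-classifier idea; not the
    paper's exact construction."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        assert feat_dim >= num_classes, "simplex embedding needs feat_dim >= num_classes"
        # K vertices of a regular simplex: e_i minus the centroid, normalized.
        eye = torch.eye(num_classes)
        vertices = F.normalize(eye - eye.mean(dim=0, keepdim=True), dim=1)
        # Embed into feat_dim by zero-padding the remaining coordinates.
        weight = torch.zeros(num_classes, feat_dim)
        weight[:, :num_classes] = vertices
        self.register_buffer("weight", weight)  # buffer, so it gets no gradient updates

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Cosine-style logits against the frozen class prototypes.
        return F.normalize(features, dim=1) @ self.weight.t()
```

Because the prototypes are fixed, features learned for old classes keep pointing at the same vertices after an upgrade, which is the geometric property compatibility relies on.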
Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems (i.e., image tag assignment, refinement, and tag-based image retrieval) is presented. While existing works vary in terms of their targeted tasks and methodology, they rely on the key functionality of tag relevance, that is, estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how such information is exploited, this article introduces a two-dimensional taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and differences, and recognize their merits and limitations. For a head-to-head comparison with the state of the art, a new experimental protocol is presented, with training sets containing 10,000, 100,000, and 1 million images, and an evaluation on three test sets contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and foster progress for the near future.
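One of the simplest tag relevance functions in this line of work is neighbor voting: count how often a tag occurs among the visual neighbors of an image and discount the tag's prior frequency. The NumPy sketch below shows the idea with illustrative names; it is not the survey's protocol or any specific evaluated implementation.

```python
import numpy as np

def tag_relevance(query_feat, gallery_feats, gallery_tags, tag, k=100):
    """Neighbor-voting tag relevance: votes from the k visual neighbors
    of the query, minus the number of votes expected by chance given the
    tag's overall frequency. All names here are illustrative."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    neighbors = np.argsort(dists)[:k]
    votes = sum(tag in gallery_tags[i] for i in neighbors)
    prior = sum(tag in tags for tags in gallery_tags) * k / len(gallery_tags)
    return votes - prior
```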
Object Tracking by Oversampling Local Features Pernici, Federico; Del Bimbo, Alberto
IEEE transactions on pattern analysis and machine intelligence, 12/2014, Volume 36, Issue 12
Journal Article
Peer reviewed
In this paper, we present the ALIEN tracking method that exploits oversampling of local invariant representations to build a robust object/context discriminative classifier. To this end, we use multiple instances of scale-invariant local features weakly aligned along the object template. This makes it possible to take into account the 3D shape deviations from planarity and their interactions with shadows, occlusions, and sensor quantization, for which no invariant representations can be defined. A non-parametric learning algorithm based on the transitive matching property discriminates the object from the context and prevents improper object template updating during occlusion. We show that our learning rule has asymptotic stability under mild conditions, confirming the drift-free capability of the method in long-term tracking. A real-time implementation of the ALIEN tracker has been evaluated against state-of-the-art tracking systems on an extensive set of publicly available video sequences that represent most of the critical conditions occurring in real tracking environments. We report superior or equal performance in most cases and verified tracking with no drift in very long video sequences.
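The transitive matching property mentioned above can be approximated, in its simplest form, by keeping only cycle-consistent nearest-neighbor matches between two sets of local descriptors. The NumPy sketch below illustrates that consistency check; it is a heavily simplified stand-in for ALIEN's actual non-parametric learning rule, and the function name is invented for illustration.

```python
import numpy as np

def mutual_matches(desc_a, desc_b):
    """Keep only cycle-consistent nearest-neighbor matches between two
    descriptor sets: i in A matches j in B only if j's best match in A
    is i again. A simplified illustration of transitive consistency."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    ab = d.argmin(axis=1)  # best match in B for each feature in A
    ba = d.argmin(axis=0)  # best match in A for each feature in B
    return [(i, ab[i]) for i in range(len(desc_a)) if ba[ab[i]] == i]
```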
Am I Done? Predicting Action Progress in Videos Becattini, Federico; Uricchio, Tiberio; Seidenari, Lorenzo ...
ACM transactions on multimedia computing communications and applications, 01/2021, Volume 16, Issue 4
Journal Article
Peer reviewed
Open access
In this article, we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task, since it can be valuable for a wide range of interaction applications. To this end, we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. Motivated by the recent success obtained from the interaction of Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make framewise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets.
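A minimal way to realize the recurrent part of such a model is an LSTM over per-frame features with a sigmoid head regressing progress in [0, 1]. The PyTorch sketch below shows this shape; the layer sizes and names are assumptions, not the paper's ProgressNet architecture.

```python
import torch
import torch.nn as nn

class ProgressHead(nn.Module):
    """Sketch of a recurrent progress estimator: an LSTM consumes
    per-frame features (e.g. pooled from a detection backbone) and
    regresses action progress in [0, 1] for every frame."""
    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.progress = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, time, feat_dim) -> progress: (batch, time)
        h, _ = self.lstm(frame_feats)
        return self.progress(h).squeeze(-1)
```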
In this article, we address the problem of 4D facial expression generation. This is usually addressed by animating a neutral 3D face to reach an expression peak and then return to the neutral state. In the real world, though, people show more complex expressions and switch from one expression to another. We thus propose a new model that generates transitions between different expressions and synthesizes long, composed 4D expressions. This involves three sub-problems: (1) modeling the temporal dynamics of expressions, (2) learning transitions between them, and (3) deforming a generic mesh. We propose to encode the temporal evolution of expressions using the motion of a set of 3D landmarks, which we learn to generate by training a manifold-valued GAN (Motion3DGAN). To allow the generation of composed expressions, this model accepts two labels encoding the starting and ending expressions. The final sequence of meshes is generated by a Sparse2Dense mesh Decoder (S2D-Dec) that maps the landmark displacements to a dense, per-vertex displacement of a known mesh topology. By explicitly working with motion trajectories, the model is totally independent of identity. Extensive experiments on five public datasets show that our proposed approach brings significant improvements with respect to previous solutions, while retaining good generalization to unseen data.
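The sparse-to-dense step can be pictured as a decoder that turns the displacements of a handful of landmarks into per-vertex displacements added to a neutral mesh. The PyTorch sketch below captures that mapping with a plain MLP; the architecture, vertex count, and names are illustrative assumptions, and the actual S2D-Dec is more elaborate.

```python
import torch
import torch.nn as nn

class Sparse2DenseSketch(nn.Module):
    """Illustrative sparse-to-dense decoder: map landmark displacements
    to per-vertex displacements of a fixed-topology mesh, then add them
    to a neutral mesh so identity is preserved by construction."""
    def __init__(self, n_landmarks: int = 68, n_vertices: int = 5023, hidden: int = 512):
        super().__init__()
        self.n_vertices = n_vertices
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 3, hidden), nn.ReLU(),
            nn.Linear(hidden, n_vertices * 3),
        )

    def forward(self, lmk_disp: torch.Tensor, neutral_verts: torch.Tensor) -> torch.Tensor:
        # lmk_disp: (batch, n_landmarks, 3); neutral_verts: (n_vertices, 3)
        dense = self.net(lmk_disp.flatten(1)).view(-1, self.n_vertices, 3)
        return neutral_verts + dense  # animated mesh for each frame
```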
In this paper, we address the problem of content-based image retrieval (CBIR) by learning image representations based on the activations of a Convolutional Neural Network. We propose an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on the trainable aggregation layer NetVLAD (Arandjelovic et al., in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016) and bags of local features obtained by splitting the activations, making it possible to reduce the dimensionality of the descriptor and to increase retrieval performance. Training is performed using an improved triplet mining procedure that selects samples based on their difficulty to obtain an effective image representation, reducing the risk of overfitting and loss of generalization. Extensive experiments show that our approach, which can be effectively used with different CNN architectures, obtains state-of-the-art results on standard and challenging CBIR datasets.
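A common concrete instance of difficulty-based triplet mining is the batch-hard recipe: for each anchor, take its hardest positive and hardest negative within the batch. The PyTorch sketch below shows that standard recipe as a stand-in; it is not the paper's exact sample-selection procedure.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, penalize when its
    farthest same-label sample is not at least `margin` closer than its
    nearest different-label sample."""
    emb = F.normalize(embeddings, dim=1)
    dist = torch.cdist(emb, emb)                    # pairwise distances
    same = labels[:, None] == labels[None, :]
    pos = (dist - 1e9 * (~same)).max(dim=1).values  # hardest positive
    neg = (dist + 1e9 * same).min(dim=1).values     # hardest negative
    return F.relu(pos - neg + margin).mean()
```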
Facial Action Units (AUs) correspond to the deformation/contraction of individual facial muscles or their combinations. As such, each AU affects just a small portion of the face, with deformations that are asymmetric in many cases. Generating and analyzing AUs in 3D is particularly relevant for the potential applications it can enable. In this paper, we propose a solution for 3D AU detection and synthesis by building on a newly defined 3D Morphable Model (3DMM) of the face. Differently from most of the 3DMMs existing in the literature, which mainly model global variations of the face and show limitations in adapting to local and asymmetric deformations, the proposed solution is specifically devised to cope with such difficult morphings. During a training phase, deformation coefficients are learned that enable the 3DMM to deform to 3D target scans showing neutral and expressive faces of the same individual, thus decoupling expression from identity deformations. Such deformation coefficients are then used, on the one hand, to train an AU classifier and, on the other, to deform a 3D neutral scan and generate AU deformations in a subject-independent manner. The proposed approach for AU detection is validated on the Bosphorus dataset, reporting competitive results with respect to the state of the art, even in a challenging cross-dataset setting. We further show that the learned coefficients are general enough to synthesize realistic 3D face instances with AU activations.
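The subject-independent synthesis step amounts to adding a coefficient-weighted combination of the learned deformation components to a neutral scan. A NumPy sketch, with illustrative shapes and names:

```python
import numpy as np

def apply_au_deformation(neutral_verts, components, coeffs):
    """Deform a neutral 3D scan with learned deformation coefficients.
    neutral_verts: (V, 3) neutral mesh vertices
    components:    (K, V, 3) learned 3DMM deformation bases
    coeffs:        (K,) deformation coefficients for the target AU"""
    return neutral_verts + np.tensordot(coeffs, components, axes=1)
```

Because the coefficients encode only the expression deformation, the same coefficient vector can be applied to any subject's neutral scan, which is what makes the synthesis subject-independent.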
Face analysis from 2D images and videos is a central task in many multimedia applications. Methods developed to this end perform either face recognition or facial expression recognition, and in both cases results are negatively influenced by variations in pose, illumination, and resolution of the face. Such variations have a lower impact on 3D face data, which has given rise to the idea of using a 3D morphable model as an intermediate tool to enhance face analysis on 2D data. In this paper, we propose a new approach for constructing a 3D morphable shape model (called DL-3DMM) and show that our solution can reach the accuracy of deformation required in applications where fine details of the face are concerned. To construct the model, we start from a set of 3D face scans with large variability in terms of ethnicity and expressions. Across these training scans, we compute a point-to-point dense alignment, which is accurate also in the presence of topological variations of the face. The DL-3DMM is constructed by learning a dictionary of basis components on the aligned scans. The model is then fitted to 2D target faces using an efficient regularized ridge-regression guided by 2D/3D facial landmark correspondences in order to generate pose-normalized face images. Comparison between the DL-3DMM and the standard PCA-based 3DMM demonstrates that, in general, a lower reconstruction error can be obtained with our solution. Application to action unit detection and emotion recognition from 2D images and videos shows competitive results with state-of-the-art methods on two benchmark datasets.
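The fitting step described above admits a closed-form sketch: solve a ridge-regularized least-squares problem mapping landmark residuals to deformation coefficients. The NumPy code below shows that closed form under assumed input shapes; names are illustrative, and the paper's pipeline includes pose estimation and projection details omitted here.

```python
import numpy as np

def fit_3dmm_ridge(components_2d, target_lmk_2d, model_lmk_2d, lam=1.0):
    """Ridge-regression fit of deformation coefficients from landmark
    correspondences.
    components_2d: (L*2, K) deformation bases projected at the landmarks
    target_lmk_2d: (L, 2) detected 2D landmarks
    model_lmk_2d:  (L, 2) current model landmarks in the image"""
    residual = (target_lmk_2d - model_lmk_2d).reshape(-1)  # (L*2,)
    A = components_2d
    # Closed-form ridge solution: (A^T A + lam I)^{-1} A^T r
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ residual)
```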
Automatic Emotion Recognition for Cultural Heritage Baecchi, Claudio; Ferracani, Andrea; Del Bimbo, Alberto
IOP conference series. Materials Science and Engineering, 11/2020, Volume 949, Issue 1
Journal Article
Peer reviewed
Open access
In this work we present an automatic emotion recognition system for the re-use of multimedia content and storytelling for cultural heritage. A huge amount of heterogeneous multimedia data on cultural heritage is available in online and offline databases that can be used and adapted to produce new content. In the real world, human video editors may want to select the video sequences composing the final video with the intention of inducing an emotional reaction in the viewer (e.g., happiness, excitement, sadness). Usually they try to achieve this result following their personal judgement. However, this video selection task could benefit greatly from an automatic sentiment classification system. Our system can help the editor choose the video sequences that best fit the desired emotion to be induced. First, the system splits the video into scenes. Then it classifies them using a multimodal classifier which combines temporal features extracted from an LSTM, sentiment-related features obtained through a DNN, audio features, and motion-related features. The system learns which features are more important and exploits them to classify the scenes in terms of valence and arousal, which are well known to correlate with induced emotions. Finally, it provides an online video composer which allows the editor to search, filter, and compose the scenes into a new video using sentiment information. To train the classifier we also collected and annotated a small dataset of both user-recorded videos and professional ones downloaded from the web.
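The multimodal classification step can be pictured as late fusion: concatenate the temporal, sentiment, audio, and motion features of a scene and predict valence and arousal. A PyTorch sketch with assumed feature sizes and an invented class name; the paper's fusion and feature-weighting scheme may differ.

```python
import torch
import torch.nn as nn

class SceneEmotionClassifier(nn.Module):
    """Late-fusion sketch: fuse per-scene temporal (LSTM), sentiment
    (DNN), audio, and motion features and output [valence, arousal]."""
    def __init__(self, dims=(256, 128, 64, 64), hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # valence and arousal scores
        )

    def forward(self, temporal, sentiment, audio, motion):
        return self.fuse(torch.cat([temporal, sentiment, audio, motion], dim=1))
```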