In this paper, we explore how to leverage readily available unlabeled data to improve semi-supervised human detection performance. For this purpose, we specifically modify the region proposal network (RPN) for learning on a partially labeled dataset. Based on commonly observed false positive types, a verification module is developed to assess foreground human objects in the candidate regions to provide an important cue for filtering the RPN's proposals. The remaining proposals with high confidence scores are then used as pseudo annotations for re-training our detection model. To reduce the risk of error propagation in the training process, we adopt a self-paced training strategy to progressively include more pseudo annotations generated by the previous model over multiple training rounds. The resulting detector re-trained on the augmented data can be expected to have better detection performance. The effectiveness of the main components of this framework is verified through extensive experiments, and the proposed approach achieves state-of-the-art detection results on multiple scene-specific human detection benchmarks in the semi-supervised setting.
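The self-paced selection of pseudo annotations described above can be illustrated with a minimal sketch. The abstract does not give the selection rule, so everything here is an assumption made for illustration: the function name, the `verified` flag standing in for the verification module's decision, and the linearly relaxing score threshold.

```python
def select_pseudo_annotations(proposals, round_index,
                              start_threshold=0.95, step=0.05, floor=0.7):
    """Keep verified proposals whose score clears a per-round threshold.

    Hypothetical rule: the threshold starts high and is lowered by `step`
    each round (never below `floor`), so later rounds admit more pseudo
    annotations than earlier ones.
    """
    threshold = max(floor, start_threshold - step * round_index)
    return [p for p in proposals
            if p["verified"] and p["score"] >= threshold]


# Toy proposals: box is (x1, y1, x2, y2); "verified" mimics the output of
# a verification module that rejects likely false positives.
proposals = [
    {"box": (10, 20, 50, 120), "score": 0.97, "verified": True},
    {"box": (200, 40, 240, 130), "score": 0.88, "verified": True},
    {"box": (90, 10, 120, 60), "score": 0.99, "verified": False},
    {"box": (300, 50, 330, 140), "score": 0.72, "verified": True},
]

for round_index in range(3):
    kept = select_pseudo_annotations(proposals, round_index)
    print(round_index, [p["box"] for p in kept])
```

In this sketch a proposal rejected by verification is never used, however high its score. That reflects the abstract's ordering of the two steps: filter first, then keep the high-confidence remainder.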
We address Unsupervised Video Object Segmentation (UVOS), the task of automatically generating accurate pixel masks for salient objects in a video sequence and of tracking these objects consistently through time, without any input about which objects should be tracked. Towards solving this task, we present UnOVOST (Unsupervised Offline Video Object Segmentation and Tracking) as a simple and generic algorithm which is able to track and segment a large variety of objects. This algorithm builds up tracks in a number of stages, first grouping segments into short tracklets that are spatio-temporally consistent, before merging these tracklets into long-term consistent object tracks based on their visual similarity. In order to achieve this, we introduce a novel tracklet-based Forest Path Cutting data association algorithm which builds up a decision forest of track hypotheses before cutting this forest into paths that form long-term consistent object tracks. When evaluating our approach on the DAVIS 2017 Unsupervised dataset we obtain state-of-the-art performance with a mean ℱ score of 67.9% on the val, 58% on the test-dev and 56.4% on the test-challenge benchmarks, obtaining first place in the DAVIS 2019 Unsupervised Video Object Segmentation Challenge. UnOVOST also performs competitively with many semi-supervised video object segmentation algorithms, even though it is not given any input as to which objects should be tracked and segmented.
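The first stage described above, grouping per-frame segments into short spatio-temporally consistent tracklets, can be sketched as follows. This is not the Forest Path Cutting algorithm and not the authors' implementation. It is a simplified greedy linker that assumes axis-aligned boxes in place of segment masks and uses box overlap (IoU) between consecutive frames as the consistency test. The names `iou`, `build_tracklets`, and `min_iou` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_tracklets(frames, min_iou=0.5):
    """Greedily link boxes across consecutive frames.

    frames: list (one entry per frame) of lists of boxes.
    Returns a list of tracklets, each a list of (frame_index, box).
    A tracklet ends as soon as no unused box in the next frame
    overlaps its last box by at least `min_iou`.
    """
    tracklets = []
    active = []  # tracklets that were extended or started in the previous frame
    for t, boxes in enumerate(frames):
        next_active = []
        used = set()
        for track in active:
            last_box = track[-1][1]
            best_j, best_score = None, min_iou
            for j, box in enumerate(boxes):
                if j in used:
                    continue
                score = iou(last_box, box)
                if score >= best_score:
                    best_j, best_score = j, score
            if best_j is not None:
                track.append((t, boxes[best_j]))
                used.add(best_j)
                next_active.append(track)
        # Every unmatched box starts a new tracklet.
        for j, box in enumerate(boxes):
            if j not in used:
                track = [(t, box)]
                tracklets.append(track)
                next_active.append(track)
        active = next_active
    return tracklets


frames = [[(0, 0, 10, 10)], [(1, 0, 11, 10)], [(50, 50, 60, 60)]]
for track in build_tracklets(frames):
    print(track)
```

A deliberately conservative linker like this yields short but reliable tracklets. Joining them across gaps is left to a later stage, which in the abstract relies on visual similarity.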
Classifying the sub-categories of an object from the same super-category (e.g., bird species and cars) in fine-grained visual classification (FGVC) highly relies on discriminative feature representation and accurate region localization. Existing approaches mainly focus on distilling information from high-level features. In this paper, by contrast, we show that by integrating low-level information (e.g., color, edge junctions, texture patterns), performance can be improved with enhanced feature representation and accurately located discriminative regions. Our solution, named Attention Pyramid Convolutional Neural Network (AP-CNN), consists of 1) a dual pathway hierarchy structure with a top-down feature pathway and a bottom-up attention pathway, hence learning both high-level semantic and low-level detailed feature representation, and 2) an ROI-guided refinement strategy with ROI-guided dropblock and ROI-guided zoom-in operation, which refines features with discriminative local regions enhanced and background noise eliminated. The proposed AP-CNN can be trained end-to-end, without the need for any additional bounding box/part annotation. Extensive experiments on three popularly tested FGVC datasets (CUB-200-2011, Stanford Cars, and FGVC-Aircraft) demonstrate that our approach achieves state-of-the-art performance. Models and code are available at https://github.com/PRIS-CV/AP-CNN_Pytorch-master.
The emerging application of attosecond techniques to molecular systems allows the role of electronic coherence in the control of chemical reactions to be investigated. Prompt ionization of molecules by an attosecond pulse may induce charge migration across a molecular structure on attosecond to few-femtosecond timescales, thereby possibly determining the subsequent relaxation pathways that a molecule may take. We discuss how proposals for this 'charge-directed reactivity' fit within the current understanding of quantum control and review the current state of the art of attosecond molecular science. Specifically, we review the role of electronic coherence and coupling of the electronic and nuclear degrees of freedom in high-harmonic spectroscopy and in the first attosecond pump-probe experiments on molecular systems.
Human action analytics has attracted a lot of attention for decades in computer vision. It is important to extract discriminative spatio-temporal features to model the spatial and temporal evolutions of different actions. In this paper, we propose a spatial and temporal attention model to explore the spatial and temporal discriminative features for human action recognition and detection from skeleton data. We build our networks on recurrent neural networks with long short-term memory units. The learned model is capable of selectively focusing on discriminative joints of skeletons within each input frame and paying different levels of attention to the outputs of different frames. To ensure effective training of the network for action recognition, we propose a regularized cross-entropy loss to drive the learning process and develop a joint training strategy accordingly. Moreover, based on temporal attention, we develop a method to generate the action temporal proposals for action detection. We evaluate the proposed method on the SBU Kinect Interaction data set, the NTU RGB + D data set, and the PKU-MMD data set. Experimental results demonstrate the effectiveness of our proposed model on both action recognition and action detection.
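The idea of selectively focusing on discriminative joints within a frame can be shown with a small sketch. The abstract does not say how the attention weights are computed or applied, so this example assumes one common choice: per-joint scores are normalized with a softmax and each joint's feature vector is scaled by its weight. In the actual model the scores would come from a learned LSTM-based subnetwork. Here they are plain inputs, and `softmax` and `scale_joints` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scale_joints(joint_features, joint_scores):
    """Scale each joint's feature vector by its attention weight.

    joint_features: list of per-joint feature vectors (lists of floats).
    joint_scores:   one unnormalized attention score per joint.
    """
    weights = softmax(joint_scores)
    return [[w * value for value in features]
            for w, features in zip(weights, joint_features)]


# Three joints with 3-D coordinates; the second joint gets the highest score.
joints = [[0.1, 0.5, 1.2], [0.4, 0.9, 1.1], [0.3, 0.2, 1.0]]
scores = [0.2, 2.0, -1.0]
print(softmax(scores))
print(scale_joints(joints, scores))
```

Temporal attention can be pictured the same way, with one weight per frame output instead of one per joint. The abstract does not say whether those weights are normalized.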
Recently there have been significant advances in image upscaling or image super-resolution based on a dictionary of low and high resolution exemplars. The running time of the methods is often ignored despite the fact that it is a critical factor for real applications. This paper proposes fast super-resolution methods while making no compromise on quality. First, we support the use of sparse learned dictionaries in combination with neighbor embedding methods. In this case, the nearest neighbors are computed using the correlation with the dictionary atoms rather than the Euclidean distance. Moreover, we show that most of the current approaches reach top performance for the right parameters. Second, we show that using global collaborative coding has considerable speed advantages, reducing the super-resolution mapping to a precomputed projective matrix. Third, we propose the anchored neighborhood regression. That is, we anchor the neighborhood embedding of a low resolution patch to the nearest atom in the dictionary and precompute the corresponding embedding matrix. These proposals are contrasted with current state-of-the-art methods on standard images. We obtain similar or improved quality and one or two orders of magnitude speed improvements.
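The reduction of the super-resolution mapping to a precomputed matrix can be made concrete with a short derivation. It assumes that the collaborative coding step is an ℓ2-regularized least-squares problem, which the abstract does not state explicitly. The symbols are introduced here for illustration: y is a low resolution patch feature, x the reconstructed high resolution patch, N_l and N_h the low and high resolution atoms of the neighborhood (the whole dictionary in the global case), and λ a regularization weight.

```latex
\begin{align}
\beta^{*} &= \arg\min_{\beta}\; \lVert y - N_l \beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
           = \left( N_l^{\top} N_l + \lambda I \right)^{-1} N_l^{\top} y, \\
x &= N_h \beta^{*} = P\, y,
\qquad
P = N_h \left( N_l^{\top} N_l + \lambda I \right)^{-1} N_l^{\top}.
\end{align}
```

Under this assumption P depends only on the dictionary atoms and λ, not on the input patch. One matrix per anchor atom can therefore be computed offline, and at test time each patch needs only a nearest-atom lookup and a single matrix-vector product.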