We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are computed merely on a frame-by-frame basis and are therefore not applicable to many video analytic tasks where spatio-temporal features prevail. In this paper we propose a novel self-supervised approach to learn spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (fast-motion region and the corresponding dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both spatial and temporal domains. Unlike prior puzzle-style tasks that are hard even for humans to solve, the proposed task is consistent with inherent human visual habits and therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of our proposed approach. The experiments show that our approach can significantly improve the performance of C3D when applied to video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas.
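As a rough illustration of the "fast-motion region and the corresponding dominant direction" statistic named above, the following minimal sketch derives such a label from a dense flow field. It is not the authors' implementation: the function name `fast_motion_label`, the block size, the number of direction bins, and the magnitude-weighted histogram are all assumptions made for this example.

```python
import math

def fast_motion_label(flow, block=2, bins=8):
    """Return (block_row, block_col, direction_bin) for the fastest block.

    flow is a grid (list of rows) of (dx, dy) motion vectors. The block
    size and bin count are illustrative assumptions, not values from
    the paper.
    """
    rows, cols = len(flow), len(flow[0])
    best = None
    for br in range(0, rows, block):
        for bc in range(0, cols, block):
            cells = [flow[r][c]
                     for r in range(br, min(br + block, rows))
                     for c in range(bc, min(bc + block, cols))]
            mean_mag = sum(math.hypot(dx, dy) for dx, dy in cells) / len(cells)
            # Keep the first block with the strictly largest mean magnitude.
            if best is None or mean_mag > best[0]:
                best = (mean_mag, br // block, bc // block, cells)
    _, block_row, block_col, cells = best
    # Magnitude-weighted histogram of motion directions inside that block.
    hist = [0.0] * bins
    for dx, dy in cells:
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int(angle / (2 * math.pi) * bins) % bins] += math.hypot(dx, dy)
    return block_row, block_col, hist.index(max(hist))

# A 4x4 flow field that is static except for the bottom-right 2x2 block.
still, moving = (0.0, 0.0), (-1.0, 3.0)
flow = [[still] * 4, [still] * 4,
        [still, still, moving, moving],
        [still, still, moving, moving]]
label = fast_motion_label(flow)
```

A network trained on this kind of target would regress the block index and direction bin from raw frames, which is the sense in which the pretext task is "easy to answer".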
We present a new efficient edge-preserving filter, the "tree filter", to achieve strong image smoothing. The proposed filter can smooth out high-contrast details while preserving major edges, which is not achievable for bilateral-filter-like techniques. The tree filter is a weighted-average filter whose kernel is derived by viewing pixel affinity in a probabilistic framework that simultaneously considers pixel spatial distance, color/intensity difference, and connectedness. Pixel connectedness is acquired by treating pixels as nodes in a minimum spanning tree (MST) extracted from the image. The fact that an MST connects all image pixels through the tree endows the filter with the power to smooth out high-contrast, fine-scale details while preserving major image structures, since pixels in a small isolated region will be closely connected to the surrounding majority pixels through the tree, while pixels inside a large homogeneous region will be automatically dragged away from pixels outside the region. The tree filter can be separated into two other filters, both of which turn out to have fast algorithms. We also propose an efficient linear-time MST extraction algorithm to further improve the overall filtering speed. The algorithms give the tree filter a great advantage in low computational complexity (linear in the number of image pixels) and high speed: it can process a 1-megapixel 8-bit image in about 0.25 s on an Intel 3.4 GHz Core i7 CPU (including the construction of the MST). The proposed tree filter is demonstrated on a variety of applications.
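To make the idea of "pixels as nodes in a minimum spanning tree" concrete, here is a minimal sketch that extracts an MST from a small grayscale image. It uses plain Kruskal with union-find, which runs in O(n log n) and is not the linear-time extraction algorithm the abstract refers to. The 4-connected grid, the absolute intensity difference as edge weight, and the name `grid_mst` are assumptions for illustration.

```python
def grid_mst(image):
    """Kruskal MST over a 4-connected pixel grid.

    Edge weight is the absolute intensity difference between neighbors
    (an illustrative choice). Returns a list of (weight, node_a, node_b)
    with nodes numbered row * cols + col.
    """
    rows, cols = len(image), len(image[0])
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:  # right neighbor
                edges.append((abs(image[r][c] - image[r][c + 1]), node, node + 1))
            if r + 1 < rows:  # bottom neighbor
                edges.append((abs(image[r][c] - image[r + 1][c]), node, node + cols))
    edges.sort()

    parent = list(range(rows * cols))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so keep it
            parent[root_a] = root_b
            tree.append((weight, a, b))
    return tree

# Two homogeneous regions separated by one strong vertical edge.
image = [[10, 10, 200],
         [10, 10, 200],
         [10, 10, 200]]
tree = grid_mst(image)
```

In this toy image the tree crosses the strong edge exactly once, so any path between the two regions pays the full contrast cost. That is the sense in which pixels inside a homogeneous region end up far, along the tree, from pixels outside it.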
The speed of optical flow algorithms is crucial for many video editing tasks such as slow motion synthesis, selection propagation, tone adjustment propagation, and so on. Variational coarse-to-fine optical flow algorithms can generally produce high-quality results but cannot fulfil the speed requirement of many practical applications. Besides, large motions in real-world videos also pose a difficult problem for coarse-to-fine variational approaches. In this paper, we present a fast optical flow algorithm that can handle large displacement motions. Our algorithm is inspired by recent successes of local methods in visual correspondence searching as well as approximate nearest neighbor field algorithms. The main novelty is a fast randomized edge-preserving approximate nearest neighbor field algorithm, which propagates self-similarity patterns in addition to offsets. Experimental results on public optical flow benchmarks show that our method is significantly faster than state-of-the-art methods without compromising on quality, especially when scenes contain large motions. Finally, we show some demo applications by applying our technique to real-world video editing tasks.
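For readers unfamiliar with the term, a nearest neighbor field stores, for every patch in one signal, the offset to its most similar patch in another. The sketch below computes an exact field by exhaustive search on 1D signals. It is only meant to show what the field contains: it is neither randomized nor edge-preserving, and it does not propagate self-similarity patterns as the abstract's algorithm does. The function name, the squared-difference cost, and the tie-breaking rule are assumptions.

```python
def nearest_neighbor_field(a, b, patch=3):
    """Exhaustive 1D nearest neighbor field from signal a to signal b.

    For each patch start i in a, return the offset (j - i) of the patch
    in b with the smallest sum of squared differences. Ties go to the
    smaller displacement (an illustrative choice).
    """
    def cost(i, j):
        return sum((a[i + k] - b[j + k]) ** 2 for k in range(patch))

    candidates = range(len(b) - patch + 1)
    offsets = []
    for i in range(len(a) - patch + 1):
        best_j = min(candidates, key=lambda j: (cost(i, j), abs(j - i)))
        offsets.append(best_j - i)
    return offsets

# b is a copy of a shifted two samples to the right.
a = [0, 0, 5, 9, 5, 0, 0, 0]
b = [0, 0, 0, 0, 5, 9, 5, 0]
offsets = nearest_neighbor_field(a, b)
```

The exhaustive search costs time proportional to the product of the two signal lengths, which is why approximate, randomized search matters for the speed goal the abstract describes.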
Face Anti-Spoofing: Model Matters, so Does Data Yang, Xiao; Luo, Wenhan; Bao, Linchao ...
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
06/2019
Conference Proceeding
Face anti-spoofing is an important task in full-stack face applications including face detection, verification, and recognition. Previous approaches build models on datasets which do not simulate real-world data well (e.g., small scale, insignificant variance, etc.). Existing models may rely on auxiliary information, which prevents these anti-spoofing solutions from generalizing well in practice. In this paper, we present a data collection solution along with a data synthesis technique to simulate digital medium-based face spoofing attacks, which can easily help us obtain a large amount of training data that reflects real-world scenarios well. By exploiting a novel Spatio-Temporal Anti-Spoof Network (STASN), we are able to push the performance on public face anti-spoofing datasets over state-of-the-art methods by a large margin. Since the proposed model can automatically attend to discriminative regions, it makes analyzing the behaviors of the network possible. We conduct extensive experiments and show that the proposed model can distinguish spoof faces by extracting features from a variety of regions to seek out subtle evidence such as borders, moire patterns, reflection artifacts, etc.
We present a fast optical flow algorithm that can handle large displacement motions. Our algorithm is inspired by recent successes of local methods in visual correspondence searching as well as approximate nearest neighbor field algorithms. The main novelty is a fast randomized edge-preserving approximate nearest neighbor field algorithm which propagates self-similarity patterns in addition to offsets. Experimental results on public optical flow benchmarks show that our method is significantly faster than state-of-the-art methods without compromising on quality, especially when scenes contain large motions.
This paper addresses the problem of video object segmentation, where the initial object mask is given in the first frame of an input video. We propose a novel spatiotemporal Markov Random Field (MRF) model defined over pixels to handle this problem. Unlike conventional MRF models, the spatial dependencies among pixels in our model are encoded by a Convolutional Neural Network (CNN). Specifically, for a given object, the probability of a labeling of a set of spatially neighboring pixels can be predicted by a CNN trained for this specific object. As a result, higher-order, richer dependencies among pixels in the set can be implicitly modeled by the CNN. With temporal dependencies established by optical flow, the resulting MRF model combines both spatial and temporal cues for tackling video object segmentation. However, performing inference in the MRF model is very difficult due to the very high-order dependencies. To this end, we propose a novel CNN-embedded algorithm to perform approximate inference in the MRF. This algorithm proceeds by alternating between a temporal fusion step and a feed-forward CNN step. When initialized with an appearance-based one-shot segmentation CNN, our model outperforms the winning entries of the DAVIS 2017 Challenge, without resorting to model ensembling or any dedicated detectors.
In this article, we present an end-to-end learning framework for detailed 3D face reconstruction from a single image. Our approach uses a 3DMM-based coarse model and a displacement map in UV-space to represent a 3D face. Unlike previous work addressing the problem, our learning framework does not require supervision of surrogate ground-truth 3D models computed with traditional approaches. Instead, we utilize the input image itself as supervision during learning. In the first stage, we combine a photometric loss and a facial perceptual loss between the input face and the rendered face to regress a 3DMM-based coarse model. In the second stage, both the input image and the regressed texture of the coarse model are unwrapped into UV-space and then sent through an image-to-image translation network to predict a displacement map in UV-space. The displacement map and the coarse model are used to render a final detailed face, which again can be compared with the original input image to serve as a photometric loss for the second stage. The advantage of learning the displacement map in UV-space is that face alignment can be explicitly done during the unwrapping, so facial details are easier to learn from a large amount of data. Extensive experiments demonstrate the superiority of our method over previous work.
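The abstract does not give the exact form of its photometric loss. A common choice, assumed here purely for illustration, is the mean per-pixel color distance between the input image and the rendered face, restricted to pixels covered by the face. The function name `photometric_loss`, the Euclidean norm, and the binary mask are hypothetical.

```python
import math

def photometric_loss(image, rendered, mask):
    """Mean Euclidean color distance over masked pixels.

    image and rendered are grids of (r, g, b) tuples; mask is a grid of
    0/1 flags marking face pixels. The norm and the masking are assumed
    details, not taken from the paper.
    """
    total, count = 0.0, 0
    for img_row, ren_row, mask_row in zip(image, rendered, mask):
        for pixel, rendered_pixel, inside in zip(img_row, ren_row, mask_row):
            if inside:
                total += math.sqrt(sum((p - q) ** 2
                                       for p, q in zip(pixel, rendered_pixel)))
                count += 1
    return total / count if count else 0.0

image = [[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
rendered = [[(1.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
loss_all = photometric_loss(image, rendered, [[1, 1]])   # both pixels count
loss_face = photometric_loss(image, rendered, [[1, 0]])  # second pixel masked out
```

Because the loss compares the rendering with the input image itself, no ground-truth 3D model is needed, which matches the self-supervised setup the abstract describes.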
Accurate 3D reconstruction of the hand and object shape from a hand-object image is important for understanding human-object interaction as well as human daily activities. Unlike bare-hand pose estimation, hand-object interaction poses a strong constraint on both the hand and its manipulated object, which suggests that hand configuration may be crucial contextual information for the object, and vice versa. However, current approaches address this task by training a two-branch network to reconstruct the hand and object separately with little communication between the two branches. In this work, we propose to consider hand and object jointly in feature space and explore the reciprocity of the two branches. We extensively investigate cross-branch feature fusion architectures with MLP or LSTM units. Among the investigated architectures, a variant with LSTM units that enhances the object feature with the hand feature shows the best performance gain. Moreover, we employ an auxiliary depth estimation module to augment the input RGB image with the estimated depth map, which further improves the reconstruction accuracy. Experiments conducted on public datasets demonstrate that our approach significantly outperforms existing approaches in terms of the reconstruction accuracy of objects.
This paper presents a real-time decolorization method. Given the human visual system's preference for luminance information, the luminance should be preserved as much as possible during decolorization. As a result, the proposed decolorization method measures the amount of color contrast/detail lost when converting color to luminance. The detail loss is estimated by computing the difference between two intermediate images: one obtained by applying a bilateral filter to the original color image, and the other obtained by applying a joint bilateral filter to the original color image with its luminance as the guidance image. The estimated detail loss is then mapped to a grayscale image, named the residual image, by minimizing the difference between the image gradients of the input color image and the objective grayscale image, which is the sum of the residual image and the luminance. Evidently, the residual image will contain only zero-valued pixels (that is, the two intermediate images will be the same) only when no visual detail is missing in the luminance. Unlike most previous methods, the proposed decolorization method preserves both the contrast in the color image and the luminance. Quantitative evaluation shows that it is the top performer on the standard test suite. It is also very robust and can be directly used to convert videos while maintaining temporal coherence. Specifically, it can convert a high-resolution video (1280 × 720) in real time (about 28 Hz) on a 3.4 GHz i7 CPU.
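The contrast loss that motivates the method can be seen with two pixels. The sketch below converts RGB to luminance using the Rec. 601 weights, which are an assumption because the abstract does not say which luminance definition is used. It then compares how far apart two colors are before and after conversion. It illustrates the problem only and is not the bilateral-filter-based estimate described above.

```python
import math

# Rec. 601 luma weights; assumed here, not stated in the abstract.
WEIGHTS = (0.299, 0.587, 0.114)

def luminance(rgb):
    """Weighted sum of the R, G, B channels."""
    return sum(w * channel for w, channel in zip(WEIGHTS, rgb))

def color_gap(c1, c2):
    """Euclidean distance between two colors in RGB space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def luminance_gap(c1, c2):
    """Absolute difference between the luminances of two colors."""
    return abs(luminance(c1) - luminance(c2))

red, green = (255, 0, 0), (0, 130, 0)
gap_in_color = color_gap(red, green)         # large: clearly different colors
gap_in_luminance = luminance_gap(red, green) # tiny: nearly the same gray
```

A saturated red next to this mid green forms a strong color edge that almost vanishes in the luminance channel. A residual image added to the luminance is one way to restore such lost contrast.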