Video Enhancement with Task-Oriented Flow Xue, Tianfan; Chen, Baian; Wu, Jiajun ...
International Journal of Computer Vision, Volume 127, Issue 8
Journal Article
Peer-reviewed
Open access
Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is, however, intractable, and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.
MaskGIT: Masked Generative Image Transformer Chang, Huiwen; Zhang, Han; Jiang, Lu ...
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
June 2022
Conference Proceeding
Open access
Generative transformers have experienced rapid popularity growth in the computer vision community in synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 48x. Besides, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation. Project page: masked-generative-image-transformer.github.io.
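The iterative parallel decoding loop described in this abstract can be sketched in a few lines. In the toy below, the bidirectional transformer is replaced by a stub that returns random logits, and the cosine mask schedule and confidence-based re-masking follow the paper's description only loosely; treat it as an illustration of the decoding loop, not the authors' implementation.

```python
# Toy sketch of MaskGIT-style iterative parallel decoding.
# Assumption: a stub predictor stands in for the real bidirectional transformer.
import math
import numpy as np

MASK = -1  # sentinel token id for masked positions

def predict_logits(tokens, vocab_size, rng):
    # Stand-in for the transformer: random logits for every position.
    return rng.standard_normal((len(tokens), vocab_size))

def maskgit_decode(seq_len=16, vocab_size=8, num_steps=4, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for t in range(1, num_steps + 1):
        logits = predict_logits(tokens, vocab_size, rng)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        candidates = probs.argmax(axis=1)   # most likely token per position
        confidence = probs.max(axis=1)
        # Cosine schedule: fraction of positions still masked after step t.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * t / num_steps))
        still_masked = tokens == MASK
        confidence[~still_masked] = np.inf  # already-committed tokens stay fixed
        # Commit candidates at masked positions, then re-mask the
        # least-confident `keep_masked` of them for the next iteration.
        tokens = np.where(still_masked, candidates, tokens)
        if keep_masked > 0:
            tokens[np.argsort(confidence)[:keep_masked]] = MASK
    return tokens
```

At the final step the schedule reaches zero, so every position holds a committed token and the image (here, the token sequence) is complete.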
We present a unified computational approach for taking photos through reflecting or occluding elements such as windows and fences. Rather than capturing a single image, we instruct the user to take a short image sequence while slightly moving the camera. Differences that often exist in the relative position of the background and the obstructing elements from the camera allow us to separate them based on their motions, and to recover the desired background scene as if the visual obstructions were not there. We show results on controlled experiments and many real and practical scenarios, including shooting through reflections, fences, and raindrop-covered windows.
We introduce a technique to manipulate small movements in videos based on an analysis of motion in complex-valued image pyramids. Phase variations of the coefficients of a complex-valued steerable pyramid over time correspond to motion, and can be temporally processed and amplified to reveal imperceptible motions, or attenuated to remove distracting changes. This processing does not involve the computation of optical flow, and in comparison to the previous Eulerian Video Magnification method it supports larger amplification factors and is significantly less sensitive to noise. These improved capabilities broaden the set of applications for motion processing in videos. We demonstrate the advantages of this approach on synthetic and natural video sequences, and explore applications in scientific analysis, visualization and video enhancement.
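The core idea, that small motions show up as phase changes of complex coefficients, can be illustrated in one dimension. The sketch below uses a plain FFT instead of the paper's complex steerable pyramid, so it is a simplified stand-in: amplifying the per-frequency phase difference between two frames amplifies the translation between them.

```python
# 1-D illustration of phase-based motion magnification.
# Assumption: plain FFT replaces the complex steerable pyramid of the paper.
import numpy as np

def magnify_shift(frame0, frame1, alpha):
    """Amplify the motion between two 1-D frames by a factor of (1 + alpha)."""
    F0, F1 = np.fft.fft(frame0), np.fft.fft(frame1)
    dphase = np.angle(F1) - np.angle(F0)        # per-frequency phase change
    # Amplify the phase change while keeping the original coefficients' magnitudes.
    return np.fft.ifft(F0 * np.exp(1j * (1 + alpha) * dphase)).real

n = 128
t = np.arange(n)
frame0 = np.sin(2 * np.pi * t / n)        # a single sinusoid
frame1 = np.sin(2 * np.pi * (t - 1) / n)  # the same sinusoid shifted by 1 sample
magnified = magnify_shift(frame0, frame1, alpha=4)
# For a pure sinusoid the output is shifted by (1 + alpha) * 1 = 5 samples.
```

On real images the pyramid localizes this phase analysis in space and scale, which is what allows different motions in different image regions.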
Learning and Using the Arrow of Time Wei, Donglai; Lim, Joseph; Zisserman, Andrew ...
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Conference Proceeding
Open access
We seek to understand the arrow of time in videos - what makes videos look like they are playing forwards or backwards? Can we visualize the cues? Can the arrow of time be a supervisory signal useful for activity analysis? To this end, we build three large-scale video datasets and apply a learning-based approach to these tasks. To learn the arrow of time efficiently and reliably, we design a ConvNet suitable for extended temporal footprints and for class activation visualization, and study the effect of artificial cues, such as cinematographic conventions, on learning. Our trained model achieves state-of-the-art performance on large-scale real-world video datasets. Through cluster analysis and localization of important regions for the prediction, we examine learned visual cues that are consistent among many samples and show when and where they occur. Lastly, we use the trained ConvNet for two applications: self-supervision for action recognition, and video forensics - determining whether Hollywood film clips have been deliberately reversed in time, often used as special effects.
With the advent of the Internet, billions of images are now freely available online and constitute a dense sampling of the visual world. Using a variety of non-parametric methods, we explore this world with the aid of a large dataset of 79,302,017 images collected from the Internet. Motivated by psychophysical results showing the remarkable tolerance of the human visual system to degradations in image resolution, the images in the dataset are stored as 32 x 32 color images. Each image is loosely labeled with one of the 75,062 non-abstract nouns in English, as listed in the Wordnet lexical database. Hence the image database gives a comprehensive coverage of all object categories and scenes. The semantic information from Wordnet can be used in conjunction with nearest-neighbor methods to perform object classification over a range of semantic levels minimizing the effects of labeling noise. For certain classes that are particularly prevalent in the dataset, such as people, we are able to demonstrate a recognition performance comparable to class-specific Viola-Jones style detectors.
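The nearest-neighbor classification at the heart of this approach is simple to sketch: label a query image with the label of its closest 32 x 32 image under sum-of-squared-differences. The gallery below is random toy data; the real system searches 79 million images and uses Wordnet to aggregate neighbor votes across semantic levels.

```python
# Minimal sketch of nearest-neighbor labeling over tiny images.
# Assumption: random toy data stands in for the 79-million-image dataset.
import numpy as np

rng = np.random.default_rng(0)
# Toy gallery: 100 flattened 32x32x3 "images", each with an integer label.
gallery = rng.random((100, 32 * 32 * 3))
labels = rng.integers(0, 5, size=100)

def classify(query):
    ssd = ((gallery - query) ** 2).sum(axis=1)  # SSD to every gallery image
    return labels[ssd.argmin()]                 # label of the nearest neighbor
```

With a dense enough sampling of the visual world, even this crude matcher recovers the correct label often enough to be useful, which is the paper's central observation.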
We propose a novel method for template matching in unconstrained environments. Its essence is the Best-Buddies Similarity (BBS), a useful, robust, and parameter-free similarity measure between two sets of points. BBS is based on counting the number of Best-Buddies Pairs (BBPs)-pairs of points in source and target sets that are mutual nearest neighbours, i.e., each point is the nearest neighbour of the other. BBS has several key features that make it robust against complex geometric deformations and high levels of outliers, such as those arising from background clutter and occlusions. We study these properties, provide a statistical analysis that justifies them, and demonstrate the consistent success of BBS on a challenging real-world dataset while using different types of features.
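The mutual-nearest-neighbour counting described above can be sketched directly. The version below uses plain Euclidean distance between feature points and normalizes the count by the smaller set size; the paper combines appearance and location distances, which this sketch omits for brevity.

```python
# Sketch of the Best-Buddies Similarity (BBS) measure.
# Assumption: plain Euclidean distance; the paper's joint appearance/location
# distance is omitted.
import numpy as np

def bbs(P, Q):
    """BBS between point sets P (n, d) and Q (m, d): the number of mutual
    nearest-neighbour pairs, normalised by min(n, m)."""
    # Pairwise squared Euclidean distances, shape (n, m).
    D = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
    nn_PQ = D.argmin(axis=1)  # for each point in P, its nearest point in Q
    nn_QP = D.argmin(axis=0)  # for each point in Q, its nearest point in P
    # (i, j) is a Best-Buddies Pair iff nn_PQ[i] == j and nn_QP[j] == i.
    bbp = sum(1 for i, j in enumerate(nn_PQ) if nn_QP[j] == i)
    return bbp / min(len(P), len(Q))
```

Because outliers rarely form mutual pairs, their contribution to the count is small, which is the intuition behind the measure's robustness to clutter and occlusion.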
We consider the problem of detecting a large number of different classes of objects in cluttered scenes. Traditional approaches require applying a battery of different classifiers to the image, at multiple locations and scales. This can be slow and can require a lot of training data since each classifier requires the computation of many different image features. In particular, for independently trained detectors, the (runtime) computational complexity and the (training-time) sample complexity scale linearly with the number of classes to be detected. We present a multitask learning procedure, based on boosted decision stumps, that reduces the computational and sample complexity by finding common features that can be shared across the classes (and/or views). The detectors for each class are trained jointly, rather than independently. For a given performance level, the total number of features required and, therefore, the runtime cost of the classifier, is observed to scale approximately logarithmically with the number of classes. The features selected by joint training are generic edge-like features, whereas the features chosen by training each class separately tend to be more object-specific. The generic features generalize better and considerably reduce the computational cost of multiclass object detection.
Visual testing, as one of the oldest methods for nondestructive testing (NDT), plays a large role in the inspection of civil infrastructure. As NDT has evolved, more quantitative techniques have emerged, such as vibration analysis. New computer vision techniques for analyzing the small motions in videos, collectively called motion magnification, have been recently developed, allowing quantitative measurement of the vibration behavior of structures from videos. Video cameras offer the benefit of long range measurement and can collect a large amount of data at once because each pixel is effectively a sensor. This paper presents a video camera-based vibration measurement methodology for civil infrastructure. As a proof of concept, measurements are made of an antenna tower on top of the Green Building on the campus of the Massachusetts Institute of Technology (MIT) from a distance of over 175 m, and the resonant frequency of the antenna tower on the roof is identified with an amplitude of 0.21 mm, which was less than 1/170th of a pixel. Methods for improving the noise floor of the measurement are discussed, especially for motion compensation and the effects of video downsampling, and suggestions are given for implementing the methodology into a structural health monitoring (SHM) scheme for existing and new structures.
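The final step of such a pipeline is easy to illustrate: once motion magnification (or optical flow) yields a per-frame displacement signal, the resonant frequency is the dominant peak of its spectrum. The 2.5 Hz resonance, 30 fps frame rate, and noise level below are synthetic illustration values, not measurements of the MIT antenna tower.

```python
# Sketch: resonant frequency from a camera-derived displacement signal.
# Assumption: synthetic signal; real data would come from motion magnification.
import numpy as np

fps = 30.0                        # camera frame rate (Hz)
t = np.arange(0, 20, 1 / fps)     # 20 s of video
f_resonant = 2.5                  # synthetic resonance (Hz)
rng = np.random.default_rng(1)
# Sub-pixel displacement signal with measurement noise.
disp = 0.005 * np.sin(2 * np.pi * f_resonant * t) + 0.001 * rng.standard_normal(t.size)

spectrum = np.abs(np.fft.rfft(disp - disp.mean()))  # remove DC, take magnitude
freqs = np.fft.rfftfreq(t.size, d=1 / fps)
peak = freqs[spectrum.argmax()]   # dominant spectral peak ≈ resonant frequency
```

A 20 s record gives 0.05 Hz frequency resolution, so longer recordings sharpen the estimate, which matters when nearby structural modes must be separated.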
Video cameras offer the unique capability of collecting high density spatial data from a distant scene of interest. They can be employed as remote monitoring or inspection sensors for structures because of their commonplace availability, simplicity, and potentially low cost. An issue is that video data is difficult to interpret into a format familiar to engineers such as displacement. A methodology called motion magnification has been developed for visualizing exaggerated versions of small displacements with an extension of the methodology to obtain the optical flow to measure displacements. In this paper, these methods are extended to modal identification in structures and the measurement of structural vibrations. Camera-based measurements of displacement are compared against laser vibrometer and accelerometer measurements for verification. The methodology is demonstrated on simple structures, a cantilever beam and a pipe, to identify and visualize the operational deflection shapes. Suggestions for applications of this methodology and challenges in real-world implementation are given.