Sparsity in the Fourier domain is an important property that enables the dense reconstruction of signals, such as 4D light fields, from a small set of samples. The sparsity of natural spectra is often derived from continuous arguments, but reconstruction algorithms typically work in the discrete Fourier domain. These algorithms usually assume that sparsity derived from continuous principles will hold under discrete sampling. This article makes the critical observation that sparsity is much greater in the continuous Fourier spectrum than in the discrete spectrum. This difference is caused by a windowing effect. When we sample a signal over a finite window, we convolve its spectrum with an infinite sinc, which destroys much of the sparsity that was in the continuous domain. Based on this observation, we propose an approach to reconstruction that optimizes for sparsity in the continuous Fourier spectrum. We describe the theory behind our approach and discuss how it can be used to reduce sampling requirements and improve reconstruction quality. Finally, we demonstrate the power of our approach by showing how it can be applied to the task of recovering non-Lambertian light fields from a small number of 1D viewpoint trajectories.
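The windowing effect described above is easy to reproduce numerically. The short sketch below (an illustration of the phenomenon, not the article's reconstruction method) compares the DFT of a sinusoid whose frequency lands exactly on a DFT bin with one that falls between bins: the off-bin signal is perfectly sparse in the continuous spectrum, yet its energy leaks across many discrete coefficients.

```python
import numpy as np

n = 64
t = np.arange(n)

# On-bin sinusoid: an integer number of cycles fits the window exactly.
on_bin = np.cos(2 * np.pi * 8.0 * t / n)
# Off-bin sinusoid: the finite window cuts it mid-cycle.
off_bin = np.cos(2 * np.pi * 8.5 * t / n)

for name, x in [("on-bin", on_bin), ("off-bin", off_bin)]:
    mag = np.abs(np.fft.rfft(x))
    count = np.sum(mag > 0.01 * mag.max())
    print(f"{name}: {count} coefficients above 1% of the peak")

# Typical output: the on-bin signal has a single significant
# coefficient, while the off-bin signal spreads energy across most
# of the 33 bins -- convolution with the window's sinc ("spectral
# leakage") has destroyed the discrete-domain sparsity.
```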
Visual Rhythm and Beat. Davis, Abe; Agrawala, Maneesh. ACM Transactions on Graphics, 08/2018, Volume 37, Issue 4. Journal article (peer reviewed).
We present a visual analogue for musical rhythm derived from an analysis of motion in video, and show that alignment of visual rhythm with its musical counterpart results in the appearance of dance. Central to our work is the concept of visual beats: patterns of motion that can be shifted in time to control visual rhythm. By warping visual beats into alignment with musical beats, we can create or manipulate the appearance of dance in video. Using this approach we demonstrate a variety of retargeting applications that control musical synchronization of audio and video: we can change what song performers are dancing to, warp irregular motion into alignment with music so that it appears to be dancing, or search collections of video for moments of accidentally dance-like motion that can be used to synthesize musical performances.
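As a rough sketch of the warping step (hypothetical inputs, not the paper's implementation), given detected visual beat times and target musical beat times, a piecewise-linear time warp can pull each visual beat onto its musical counterpart:

```python
import numpy as np

def piecewise_linear_warp(visual_beats, music_beats):
    """Return a function mapping source (video) time to warped time
    so that each visual beat lands on the corresponding music beat.
    Assumes the two lists are the same length and sorted."""
    vb = np.asarray(visual_beats, dtype=float)
    mb = np.asarray(music_beats, dtype=float)
    def warp(t):
        # np.interp linearly interpolates between matched beat pairs
        # and clamps outside the matched range.
        return np.interp(t, vb, mb)
    return warp

# Hypothetical beat times (seconds): the video's motion beats are
# slightly irregular; the song's beats sit on a steady 0.5 s grid.
warp = piecewise_linear_warp([0.9, 1.6, 2.4, 3.1], [1.0, 1.5, 2.0, 2.5])
print(warp(1.6))   # 1.5 -- this visual beat is pulled onto the music beat
```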
We present a system for interactively acquiring and rendering light fields using a hand-held commodity camera. The main challenge we address is assisting a user in achieving good coverage of the 4D domain despite the challenges of hand-held acquisition. We define coverage by bounding reprojection error between viewpoints, which accounts for all 4 dimensions of the light field. We use this criterion together with a recent Simultaneous Localization and Mapping technique to compute a coverage map on the space of viewpoints. We provide users with real-time feedback and direct them toward under-sampled parts of the light field. Our system is lightweight and has allowed us to capture hundreds of light fields. We further present a new rendering algorithm that is tailored to the unstructured yet dense data we capture. Our method can achieve piecewise-bicubic reconstruction using a triangulation of the captured viewpoints and subdivision rules applied to reconstruction weights.
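The coverage criterion can be illustrated with a simplified planar-depth model (an assumption for this sketch; the paper bounds reprojection error over all four light-field dimensions): a captured view covers a nearby viewpoint when the worst-case screen-space parallax over the scene's depth range stays below a tolerance.

```python
def max_reprojection_error_px(baseline_m, focal_px, z_min_m, z_max_m):
    """Worst-case screen-space parallax (pixels) between two viewpoints
    separated by `baseline_m`, for scene content anywhere in
    [z_min_m, z_max_m]. A simplified pinhole model, not the paper's
    exact criterion: disparity = focal * baseline / depth, and the
    spread of disparities over the depth range bounds the error of
    reusing one view in place of the other."""
    d_near = focal_px * baseline_m / z_min_m
    d_far = focal_px * baseline_m / z_max_m
    return d_near - d_far

def covered(baseline_m, focal_px=1000.0, z_min_m=1.0, z_max_m=5.0,
            tol_px=1.0):
    """A viewpoint counts as covered when some captured view is close
    enough that worst-case reprojection error stays under tol_px."""
    return max_reprojection_error_px(baseline_m, focal_px,
                                     z_min_m, z_max_m) <= tol_px

print(covered(0.001))  # True: 1 mm baseline, error ~0.8 px
print(covered(0.01))   # False: 1 cm baseline, error ~8 px
```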
The estimation of material properties is important for scene understanding, with many applications in vision, robotics, and structural engineering. This paper connects fundamentals of vibration mechanics with computer vision techniques in order to infer material properties from small, often imperceptible motions in video. Objects tend to vibrate in a set of preferred modes. The frequencies of these modes depend on the structure and material properties of an object. We show that by extracting these frequencies from video of a vibrating object, we can often make inferences about that object's material properties. We demonstrate our approach by estimating material properties for a variety of objects by observing their motion in high-speed and regular frame rate video.
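A minimal sketch of the frequency-extraction step, assuming a per-frame motion signal has already been pulled from the video (the paper's pipeline is more involved):

```python
import numpy as np

def dominant_frequencies(motion_signal, fps, k=3):
    """Find the k strongest vibration frequencies in a per-frame
    motion signal (e.g., average displacement extracted from video)."""
    x = motion_signal - np.mean(motion_signal)   # remove DC offset
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    peaks = np.argsort(spectrum)[::-1][:k]        # strongest bins
    return sorted(freqs[peaks])

# Hypothetical: a 2 s clip at 1000 fps of an object ringing at 40 Hz.
t = np.arange(2000) / 1000.0
signal = np.sin(2 * np.pi * 40 * t) * np.exp(-t)  # decaying 40 Hz mode
print(dominant_frequencies(signal, fps=1000))     # bins cluster near 40 Hz
```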
The estimation of material properties is important for scene understanding, with many applications in vision, robotics, and structural engineering. This paper connects fundamentals of vibration mechanics with computer vision techniques in order to infer material properties from small, often imperceptible motion in video. Objects tend to vibrate in a set of preferred modes. The shapes and frequencies of these modes depend on the structure and material properties of an object. Focusing on the case where geometry is known or fixed, we show how information about an object's modes of vibration can be extracted from video and used to make inferences about that object's material properties. We demonstrate our approach by estimating material properties for a variety of rods and fabrics by passively observing their motion in high-speed and regular frame rate video.
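For the rod case, the link from frequency to material is a textbook beam model. The sketch below inverts the Euler-Bernoulli cantilever formula to estimate Young's modulus from a measured fundamental frequency; the geometry and density values are hypothetical, and this stands in for, rather than reproduces, the paper's inference step.

```python
import numpy as np

def youngs_modulus_from_frequency(f1_hz, length_m, radius_m, density):
    """Invert the Euler-Bernoulli cantilever relation
        f1 = (beta1^2 / (2*pi)) * sqrt(E*I / (rho*A*L^4))
    to estimate Young's modulus E from a measured fundamental
    frequency. Assumes a uniform circular rod clamped at one end."""
    beta1 = 1.8751                      # first clamped-free eigenvalue
    A = np.pi * radius_m**2             # cross-sectional area
    I = np.pi * radius_m**4 / 4.0       # second moment of area
    omega_term = 2.0 * np.pi * f1_hz * length_m**2 / beta1**2
    return (omega_term**2) * density * A / I

# Hypothetical: a 30 cm rod of radius 3 mm vibrating at 20 Hz,
# density 1200 kg/m^3 (plastic-like).
E = youngs_modulus_from_frequency(20.0, 0.30, 0.003, 1200.0)
print(f"Estimated E ~ {E / 1e9:.2f} GPa")   # ~5.5 GPa, a stiff plastic
```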
We present a system for efficiently editing video of dialogue-driven scenes. The input to our system is a standard film script and multiple video takes, each capturing a different camera framing or performance of the complete scene. Our system then automatically selects the most appropriate clip from one of the input takes, for each line of dialogue, based on a user-specified set of film-editing idioms. Our system starts by segmenting the input script into lines of dialogue and then splitting each input take into a sequence of clips time-aligned with each line. Next, it labels the script and the clips with high-level structural information (e.g., emotional sentiment of dialogue, camera framing of clip, etc.). After this pre-process, our interface offers a set of basic idioms that users can combine in a variety of ways to build custom editing styles. Our system encodes each basic idiom as a Hidden Markov Model that relates editing decisions to the labels extracted in the pre-process. For short scenes (< 2 minutes, 8-16 takes, 6-27 lines of dialogue), applying the user-specified combination of idioms to the pre-processed inputs generates an edited sequence in 2-3 seconds. We show that this is significantly faster than the hours of user time skilled editors typically require to produce such edits and that the quick feedback lets users iteratively explore the space of edit designs.
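The decoding step can be sketched as standard dynamic programming: each line of dialogue picks one take, unary costs encode how well a clip fits the chosen idioms, and transition costs penalize cuts. The cost values and structure below are illustrative, not the paper's exact HMM formulation.

```python
import numpy as np

def choose_clips(unary_cost, transition_cost):
    """Viterbi-style decoding: pick one take per line of dialogue.
    unary_cost[i, t]      -- how poorly take t fits line i under the
                             chosen idioms (framing, sentiment, ...).
    transition_cost[s, t] -- penalty for cutting from take s to take t
                             between consecutive lines."""
    n_lines, n_takes = unary_cost.shape
    cost = unary_cost[0].copy()
    back = np.zeros((n_lines, n_takes), dtype=int)
    for i in range(1, n_lines):
        total = cost[:, None] + transition_cost + unary_cost[i]
        back[i] = np.argmin(total, axis=0)
        cost = np.min(total, axis=0)
    # Trace back the cheapest sequence of takes.
    seq = [int(np.argmin(cost))]
    for i in range(n_lines - 1, 0, -1):
        seq.append(int(back[i, seq[-1]]))
    return seq[::-1]

# Hypothetical scene: 4 lines, 3 takes, small cost for every cut.
rng = np.random.default_rng(0)
unary = rng.random((4, 3))
trans = 0.3 * (1 - np.eye(3))   # prefer staying on the same take
print(choose_clips(unary, trans))   # one take index per line
```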
We propose Factor Matting, an alternative formulation of the video matting problem in terms of counterfactual video synthesis that is better suited for re-composition tasks. The goal of factor matting is to separate the contents of a video into independent components, each representing a counterfactual version of the scene where the contents of other components have been removed. We show that factor matting maps well to a more general Bayesian framing of the matting problem that accounts for complex conditional interactions between layers. Based on this observation, we present a method for solving the factor matting problem that learns augmented patch-based appearance priors to produce useful decompositions even for video with complex cross-layer interactions like splashes, shadows, and reflections. Our method is trained per-video and does not require external training data or any knowledge about the 3D structure of the scene. Through extensive experiments, we show that it is able to produce useful decompositions of scenes with such complex interactions while performing competitively on classical matting tasks as well. We also demonstrate the benefits of our approach on a wide range of downstream video editing tasks. Our project website is at: https://factormatte.github.io/.
Eventfulness for Interactive Video Alignment. Sun, Jiatian; Deng, Longxiulin; Afouras, Triantafyllos; et al. ACM Transactions on Graphics, 08/2023, Volume 42, Issue 4. Journal article (peer reviewed).
Humans are remarkably sensitive to the alignment of visual events with other stimuli, which makes synchronization one of the hardest tasks in video editing. A key observation of our work is that most of the alignment we do involves salient localizable events that occur sparsely in time. By learning how to recognize these events, we can greatly reduce the space of possible synchronizations that an editor or algorithm has to consider. Furthermore, by learning descriptors of these events that capture additional properties of visible motion, we can build active tools that adapt their notion of eventfulness to a given task as they are being used. Rather than learning an automatic solution to one specific problem, our goal is to make a much broader class of interactive alignment tasks significantly easier and less time-consuming. We show that a suitable visual event descriptor can be learned entirely from stochastically generated synthetic video. We then demonstrate the usefulness of learned and adaptive eventfulness by integrating it in novel interactive tools for applications including audio-driven time warping of video and the extraction and application of sound effects across different videos.
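To see why sparse events shrink the problem, consider a toy matcher (a deliberately simple stand-in for the learned descriptors): once events are localized in time, synchronization reduces to pairing a handful of visual events with nearby audio events rather than searching over every frame.

```python
def match_events(video_events, audio_events, max_offset=0.25):
    """Greedily pair each detected visual event with the nearest
    unused audio event within max_offset seconds."""
    pairs = []
    used = set()
    for v in video_events:
        candidates = [(abs(v - a), j) for j, a in enumerate(audio_events)
                      if j not in used]
        if not candidates:
            break
        d, j = min(candidates)
        if d <= max_offset:
            pairs.append((v, audio_events[j]))
            used.add(j)
    return pairs

# Hypothetical event times (seconds) from a video and an audio track.
video_ev = [0.52, 1.38, 2.71]
audio_ev = [0.50, 1.50, 2.75, 3.10]
print(match_events(video_ev, audio_ev))
# [(0.52, 0.5), (1.38, 1.5), (2.71, 2.75)]
```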
Foundation paper piecing is a popular technique for constructing fabric patchwork quilts using printed paper patterns. But the construction process imposes constraints on the geometry of the pattern and the order in which the fabric pieces are attached to the quilt. Manually designing foundation paper pieceable patterns that meet all of these constraints is challenging. In this work we mathematically formalize the foundation paper piecing process and use this formalization to develop an algorithm that can automatically check whether an input pattern geometry is foundation paper pieceable. Our key insight is that we can represent the geometric pattern design using a certain type of dual hypergraph where nodes represent faces and hyperedges represent seams connecting two or more nodes. We show that determining whether the pattern is paper pieceable is equivalent to checking whether this hypergraph is acyclic, and if it is acyclic, we can apply a leaf-plucking algorithm to the hypergraph to generate viable sewing orders for the pattern geometry. We implement this algorithm in a design tool that allows quilt designers to focus on producing the geometric design of their pattern and let the tool handle the tedious task of determining whether the pattern is foundation paper pieceable.
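A simplified version of the acyclicity check and leaf-plucking step might look like the following, where each seam is a set of face labels; this is an illustrative reduction, not the paper's exact dual-hypergraph formulation.

```python
def sewing_order(seams):
    """Check whether a pattern's seam hypergraph is acyclic by
    repeatedly plucking 'leaf' seams, i.e. seams that share at most
    one face with all other remaining seams. Returns a viable sewing
    order (the reversed pluck order) or None if a cycle blocks it."""
    remaining = [set(s) for s in seams]
    order = []
    while remaining:
        leaf = None
        for i, seam in enumerate(remaining):
            others = set().union(*(s for j, s in enumerate(remaining)
                                   if j != i))
            if len(seam & others) <= 1:   # leaf seam: pluckable
                leaf = i
                break
        if leaf is None:
            return None                   # no leaf left: cyclic pattern
        order.append(frozenset(remaining.pop(leaf)))
    return list(reversed(order))          # sew in reverse pluck order

# Hypothetical 4-face strip pattern: three seams forming a chain.
print(sewing_order([{"A", "B"}, {"B", "C"}, {"C", "D"}]))
# A triangle of seams is cyclic and cannot be paper pieced:
print(sewing_order([{"A", "B"}, {"B", "C"}, {"C", "A"}]))   # None
```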
When sound hits an object, it causes small vibrations of the object's surface. We show how, using only high-speed video of the object, we can extract those minute vibrations and partially recover the sound that produced them, allowing us to turn everyday objects (a glass of water, a potted plant, a box of tissues, or a bag of chips) into visual microphones. We recover sounds from high-speed footage of a variety of objects with different properties, and use both real and simulated data to examine some of the factors that affect our ability to visually recover sound. We evaluate the quality of recovered sounds using intelligibility and SNR metrics and provide input and recovered audio samples for direct comparison. We also explore how to leverage the rolling shutter in regular consumer cameras to recover audio from standard frame-rate videos, and use the spatial resolution of our method to visualize how sound-related vibrations vary over an object's surface, which we can use to recover the vibration modes of an object.
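A toy version of the core idea (a one-pixel optical-flow step on synthetic frames; the actual system uses a multi-scale, phase-based motion analysis) shows how sub-pixel shifts of a textured patch can be read out as an audio-rate signal:

```python
import numpy as np

def motion_to_audio(frames):
    """Convert tiny frame-to-frame shifts of a textured patch into a
    1D signal sampled at the video frame rate. `frames` is a
    (num_frames, h, w) float array."""
    ref = frames[0]
    gx = np.gradient(ref, axis=1)           # horizontal spatial gradient
    norm = np.sum(gx * gx) + 1e-8
    # Brightness constancy: f - ref ~= -shift * gx, so the
    # least-squares shift is -sum((f - ref) * gx) / sum(gx^2).
    shifts = np.array([-np.sum((f - ref) * gx) / norm for f in frames])
    return shifts - shifts.mean()           # remove DC offset

# Hypothetical input: 2200 frames at 2200 fps of a sinusoidal texture
# vibrating horizontally at 440 Hz with tiny amplitude.
h, w = 32, 32
x = np.linspace(0, 4 * np.pi, w)
t = np.arange(2200) / 2200.0
frames = np.stack([np.tile(np.sin(x + 0.01 * np.sin(2 * np.pi * 440 * tt)),
                           (h, 1)) for tt in t])

audio = motion_to_audio(frames)
freqs = np.fft.rfftfreq(len(audio), d=1 / 2200.0)
print(freqs[np.argmax(np.abs(np.fft.rfft(audio)))])   # ~440.0 Hz
```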