The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on the Transformer, which has achieved great success in natural language processing and shows great potential in image processing. It is inherently permutation invariant when processing a sequence of points, making it well suited to point cloud learning. To better capture local context within the point cloud, we enhance the input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that PCT achieves state-of-the-art performance on shape classification, part segmentation, semantic segmentation, and normal estimation tasks.
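The two building blocks named in the abstract above, farthest point sampling and nearest neighbor search, can be illustrated with a minimal standard-library sketch. The function names, the toy points, and the choice of starting index are assumptions made for illustration; this is not the PCT implementation.

```python
import math

def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedily pick indices so each new point is farthest from those already chosen."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    selected = [start_index]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected

def k_nearest_neighbors(points, center_index, k):
    """Return the indices of the k points closest to the given center (center included)."""
    center = points[center_index]
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]

# Hypothetical toy cloud: two tight clusters and one point in between.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0),
          (5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
centers = farthest_point_sampling(points, 3)
groups = {c: k_nearest_neighbors(points, c, 2) for c in centers}
```

Each sampled center together with its neighbors forms a local group whose features could then be aggregated into one embedding, which is one way to expose local context to a permutation-invariant model.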
Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multimodal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention, and branch attention; a related repository, https://github.com/MenghaoGuo/Awesome-Vision-Attentions, is dedicated to collecting related work. We also suggest future directions for attention mechanism research.
Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This article proposes a novel attention mechanism which we call external attention, based on two external, small, learnable, shared memories, which can be implemented easily by simply using two cascaded linear layers and two normalization layers; it conveniently replaces self-attention in existing popular architectures. External attention has linear complexity and implicitly considers the correlations between all data samples. We further incorporate the multi-head mechanism into external attention to provide an all-MLP architecture, external attention MLP (EAMLP), for image classification. Extensive experiments on image classification, object detection, semantic segmentation, instance segmentation, image generation, and point cloud analysis reveal that our method provides results comparable or superior to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
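A rough sketch of the structure described above (two linear maps against shared memories, with two normalization steps in between) might look as follows in plain Python. The abstract does not say which normalizations are used; the softmax over positions followed by an L1 normalization over memory slots is an assumption of this sketch, as are all names and sizes.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def external_attention(features, mem_key, mem_value):
    """features: N x d. mem_key, mem_value: S x d memories shared by all samples."""
    scores = matmul(features, transpose(mem_key))  # first linear layer, N x S
    # Normalization 1 (assumed): softmax over the N positions, per memory slot.
    columns = []
    for col in zip(*scores):
        peak = max(col)
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        columns.append([e / total for e in exps])
    attn = transpose(columns)  # back to N x S
    # Normalization 2 (assumed): L1 over the S memory slots, per position.
    attn = [[v / (1e-9 + sum(row)) for v in row] for row in attn]
    return matmul(attn, mem_value)  # second linear layer, N x d

random.seed(0)
num_points, channels, memory_slots = 6, 4, 3  # hypothetical sizes
features = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(num_points)]
mem_key = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(memory_slots)]
mem_value = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(memory_slots)]
output = external_attention(features, mem_key, mem_value)
print(len(output), len(output[0]))  # 6 4
```

Because the memories have a fixed number of slots S, the cost grows as N x S x d, which is linear in the number of positions N, in contrast to the N x N affinity matrix of self-attention.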
Global Contrast Based Salient Region Detection
Cheng, Ming-Ming; Mitra, Niloy J.; Huang, Xiaolei ...
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015-03-01, Volume 37, Issue 3
Journal Article, Peer reviewed, Open access
Automatic estimation of salient object regions across images, without any prior assumption or knowledge of the contents of the corresponding scenes, enhances many computer vision and computer graphics applications. We introduce a regional contrast based salient object detection algorithm, which simultaneously evaluates global contrast differences and spatial weighted coherence scores. The proposed algorithm is simple, efficient, naturally multi-scale, and produces full-resolution, high-quality saliency maps. These saliency maps are further used to initialize a novel iterative version of GrabCut, namely SaliencyCut, for high quality unsupervised salient object segmentation. We extensively evaluated our algorithm using traditional salient object detection datasets, as well as a more challenging Internet image dataset. Our experimental results demonstrate that our algorithm consistently outperforms 15 existing salient object detection and segmentation methods, yielding higher precision and better recall rates. We also show that our algorithm can be used to efficiently extract salient object masks from Internet images, enabling effective sketch-based image retrieval (SBIR) via simple shape comparisons. Despite such noisy internet images, where the saliency regions are ambiguous, our saliency guided image retrieval achieves a superior retrieval rate compared with state-of-the-art SBIR methods, and additionally provides important target object region information.
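The idea of scoring a region by its color contrast to all other regions, weighted by spatial proximity, can be sketched as below. This is a simplified illustration and not the paper's exact formulation: the weighting function, the parameter `sigma_s_sq`, the region representation, and the toy data are all assumptions.

```python
import math

def region_contrast_saliency(regions, sigma_s_sq=0.4):
    """Score each region by its color contrast to every other region.

    regions: list of dicts with 'color' (tuple), 'centroid' ((x, y) in [0, 1]),
    and 'size' (pixel count). Nearby and larger regions contribute more.
    """
    saliency = []
    for k, region_k in enumerate(regions):
        score = 0.0
        for i, region_i in enumerate(regions):
            if i == k:
                continue
            color_dist = math.dist(region_k["color"], region_i["color"])
            spatial_dist = math.dist(region_k["centroid"], region_i["centroid"])
            score += math.exp(-spatial_dist / sigma_s_sq) * region_i["size"] * color_dist
        saliency.append(score)
    peak = max(saliency) or 1.0  # avoid dividing by zero when all regions match
    return [s / peak for s in saliency]

# Hypothetical segmentation: two large grayish background regions and a small red object.
regions = [
    {"color": (0.50, 0.5, 0.5), "centroid": (0.25, 0.5), "size": 400},
    {"color": (0.55, 0.5, 0.5), "centroid": (0.75, 0.5), "size": 500},
    {"color": (1.00, 0.0, 0.0), "centroid": (0.50, 0.5), "size": 100},
]
scores = region_contrast_saliency(regions)
most_salient = scores.index(max(scores))
```

In this toy input the small red region differs strongly in color from both large neighbors, so it receives the highest normalized score.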
Photorealistic Audio-driven Video Portraits
Wen, Xin; Wang, Miao; Richardt, Christian ...
IEEE Transactions on Visualization and Computer Graphics, 12/2020, Volume 26, Issue 12
Journal Article, Peer reviewed, Open access
Video portraits are common in a variety of applications, such as videoconferencing, news broadcasting, and virtual education and training. We present a novel method to synthesize photorealistic video portraits for an input portrait video, automatically driven by a person's voice. The main challenge in this task is the hallucination of plausible, photorealistic facial expressions from input speech audio. To address this challenge, we employ a parametric 3D face model represented by geometry, facial expression, illumination, etc., and learn a mapping from audio features to model parameters. The input source audio is first represented as a high-dimensional feature, which is used to predict facial expression parameters of the 3D face model. We then replace the expression parameters computed from the original target video with the predicted ones, and rerender the reenacted face. Finally, we generate a photorealistic video portrait from the reenacted synthetic face sequence via a neural face renderer. One appealing feature of our approach is its ability to generalize to various kinds of input speech audio, including synthetic speech audio from text-to-speech software. Extensive experimental results show that our approach outperforms previous general-purpose audio-driven video portrait methods. This includes a user study demonstrating that our results are rated as more realistic than those of previous methods.
Spatially and temporally adaptive algorithms can substantially improve the computational efficiency of many numerical schemes in computational mechanics and physics‐based animation. Recently, a crucial need for temporal adaptivity in the Material Point Method (MPM) is emerging due to the potentially substantial variation of material stiffness and velocities in multi‐material scenes. In this work, we propose a novel temporally adaptive symplectic Euler scheme for MPM with regional time stepping (RTS), where different time steps are used in different regions. We design a time stepping scheduler operating at the granularity of small blocks to maintain a natural consistency with the hybrid particle/grid nature of MPM. Our method utilizes the Sparse Paged Grid (SPGrid) data structure and simultaneously offers high efficiency and notable ease of implementation with a practical multi‐threaded particle‐grid transfer strategy. We demonstrate the efficacy of our asynchronous MPM method on various examples including elastic objects, granular media, and fluids.
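The notion of using different time steps in different regions can be shown on a toy problem: independent spring oscillators of different stiffness, each advanced with symplectic Euler using its own power-of-two number of substeps. This is only a sketch of per-region subcycling under stated assumptions; the regions here are decoupled, and the stability rule, names, and parameters are hypothetical. It does not model MPM particle/grid transfers or the paper's block scheduler.

```python
import math

def symplectic_euler_step(x, v, stiffness, mass, dt):
    """One symplectic Euler step for a spring: velocity first, then position."""
    v = v - dt * (stiffness / mass) * x  # update velocity from the current position
    x = x + dt * v                       # update position with the new velocity
    return x, v

def pick_substeps(stiffness, mass, dt_max, safety=0.1):
    """Smallest power-of-two substep count n such that dt_max / n <= safety / omega."""
    omega = math.sqrt(stiffness / mass)
    n = 1
    while dt_max / n > safety / omega:
        n *= 2
    return n

def advance_regions(regions, dt_max):
    """Advance every region by dt_max, letting stiffer regions take more substeps."""
    for region in regions:
        n = pick_substeps(region["k"], region["m"], dt_max)
        dt = dt_max / n
        for _ in range(n):
            region["x"], region["v"] = symplectic_euler_step(
                region["x"], region["v"], region["k"], region["m"], dt)
        region["substeps"] = n

def energy(region):
    return 0.5 * region["m"] * region["v"] ** 2 + 0.5 * region["k"] * region["x"] ** 2

# Hypothetical regions: one soft and one stiff material.
regions = [
    {"k": 1.0, "m": 1.0, "x": 1.0, "v": 0.0},
    {"k": 100.0, "m": 1.0, "x": 1.0, "v": 0.0},
]
initial_energy = [energy(r) for r in regions]
for _ in range(200):
    advance_regions(regions, dt_max=0.05)
```

With these values the soft region takes one step per macro step while the stiff region takes eight, so only the stiff region pays for the small time step.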
Video stabilization techniques are essential for most hand-held captured videos due to high-frequency shakes. Several 2D-, 2.5D-, and 3D-based stabilization techniques have been presented previously, but to the best of our knowledge, no solutions based on deep neural networks had been proposed to date. The main reasons for this omission are a shortage of training data and the challenge of modeling the problem using neural networks. In this paper, we present a video stabilization technique using a convolutional neural network. Previous works usually propose an off-line algorithm that smooths a holistic camera path based on feature matching. Instead, we focus on low-latency, real-time camera path smoothing that does not explicitly represent the camera path and does not use future frames. Our neural network model, called StabNet, learns a set of mesh-grid transformations progressively for each input frame from the previous set of stabilized camera frames and creates stable corresponding latent camera paths implicitly. To train the network, we collect a dataset of synchronized steady and unsteady video pairs using specially designed hand-held hardware. Experimental results show that our proposed online method performs comparably to traditional off-line video stabilization methods without using future frames while running about 10 times faster. More importantly, our proposed StabNet is able to handle low-quality videos, such as night-scene videos, watermarked videos, blurry videos, and noisy videos, where the existing methods fail in feature extraction or matching.
Two-Layer QR Codes
Tailing Yuan; Yili Wang; Kun Xu ...
IEEE Transactions on Image Processing, 09/2019, Volume 28, Issue 9
Journal Article, Peer reviewed, Open access
A quick-response code (QR code) is a two-dimensional code akin to a barcode that encodes a message of limited length. In this paper, we present a variant of QR code, a two-layer QR code. Its two-layer structure can display two alternative messages when scanned from two different directions. We propose a method to generate such two-layer QR codes encoding two given messages in a few seconds. We also demonstrate the robustness of our method on both synthetic and fabricated examples. All source code will be made publicly available (https://github.com/yuantailing/two-layer-qrcode).
Patch-based image synthesis methods have been successfully applied to various editing tasks on still images, videos, and stereo pairs. In this work we extend patch-based synthesis to plenoptic images captured by consumer-level lenslet-based devices for interactive, efficient light field editing. In our method the light field is represented as a set of images captured from different viewpoints. We decompose the central view into different depth layers, and present it to the user for specifying the editing goals. Given an editing task, our method performs patch-based image synthesis on all affected layers of the central view, and then propagates the edits to all other views. Interaction is done through a conventional 2D image editing user interface that is familiar to novice users. Our method correctly handles object boundary occlusion with semi-transparency, and can thus generate more realistic results than previous methods. We demonstrate compelling results on a wide range of applications such as hole-filling, object reshuffling and resizing, changing object depth, light field upscaling, and parallax magnification.