Talking face generation aims to synthesize a lip-synchronized talking face video from an arbitrary face image and corresponding audio clips. Current talking face models can be divided into four parts: visual feature extraction, audio feature processing, multimodal feature fusion, and a rendering module. For visual feature extraction, existing methods face the challenge of a complex learning task with noisy features; this paper introduces an attention-based disentanglement module that uses speech-related facial action unit (AU) information to disentangle the face into an Audio-face and an Identity-face. For multimodal feature fusion, existing methods ignore not only the interaction and relationship of cross-modal information but also the local driving information of the mouth muscles. This study proposes a novel generative framework that incorporates a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to enhance the learning of cross-modal features. The proposed method employs both audio and speech-related AUs as driving information, since speech-related AU information can facilitate more accurate mouth movements. Given the high correlation between speech and speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information. Finally, we present a diffusion model for synthesizing talking face images. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets, and an ablation study verifies the contribution of each component. Quantitative and qualitative experiments demonstrate that our method outperforms existing methods in both image quality and lip-sync accuracy. Code is available at https://mftfg-au.github.io/Multimodal_Fusion/.
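To make the fusion design concrete, here is a minimal, hypothetical sketch of a dilated non-causal temporal convolution combined with self-attention, in the spirit of the fusion module described above; the class name, the additive fusion of audio and AU features, and all hyperparameters are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DilatedTemporalSelfAttentionFusion(nn.Module):
    """Illustrative sketch (not the authors' code): a dilated, non-causal
    1-D convolution over the time axis followed by self-attention across
    the fused per-frame audio/AU feature sequence."""

    def __init__(self, dim=256, dilation=2, heads=4):
        super().__init__()
        # Symmetric padding keeps the convolution non-causal
        # (each frame sees both past and future frames).
        self.tcn = nn.Conv1d(dim, dim, kernel_size=3,
                             dilation=dilation, padding=dilation)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feat, au_feat):
        # audio_feat, au_feat: (B, T, dim) per-frame features per modality.
        x = audio_feat + au_feat                # simple additive fusion (an assumption)
        x = self.tcn(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (B, C, T)
        out, _ = self.attn(x, x, x)             # cross-frame self-attention
        return self.norm(x + out)
```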
Denoising diffusion probabilistic models (DDPMs) have shown impressive performance in various domains as a class of deep generative models. In this paper, we introduce the Mixture noise-based DDPM (Mix-DDPM), which treats the Markov diffusion posterior as a Gaussian mixture model. Specifically, Mix-DDPM randomly selects a Gaussian component and then adds the chosen Gaussian noise, which we show to be a more efficient way to perturb the signals into a simple known distribution. We further define the reverse probabilistic model as a parameterized Gaussian mixture kernel. Because the KL divergence between Gaussian mixture models is intractable, we derive a variational bound on the likelihood, offering a concise formulation for optimizing the denoising model and valuable insights for designing sampling strategies. Our theoretical derivation highlights that Mix-DDPM only needs to shift images by a global stochastic offset in both the diffusion and reverse processes, which can be implemented efficiently with just a few lines of code. The global stochastic offset effectively fits a Gaussian mixture distribution, enhancing the degrees of freedom of the entire diffusion model. Furthermore, we present three streamlined sampling strategies that interface with diverse fast dedicated solvers for diffusion ordinary differential equations, boosting the efficacy of image representation in the sampling phase and alleviating the issue of slow generation speed, thereby enhancing both efficiency and accuracy. Extensive experiments on benchmark datasets demonstrate the effectiveness of Mix-DDPM and its superiority over the original DDPM.
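The abstract notes that the global stochastic offset "can be implemented efficiently with just a few lines of code." The following is a hedged sketch of how such a shifted forward step might look; the tensor layout of the component means and the way the offset enters the noise term are our assumptions, not the paper's exact formulation.

```python
import torch

def mix_ddpm_forward(x0, t, alpha_bar, means):
    """Hedged sketch of a Mix-DDPM forward step (not the authors' code).

    Instead of diffusing toward N(0, I) as in vanilla DDPM, randomly pick one
    component mean mu_k of a Gaussian mixture and shift the noise toward it:
    the "global stochastic offset" described in the abstract.

    x0:        (B, C, H, W) clean images
    t:         (B,) integer timesteps
    alpha_bar: (T,) cumulative products of (1 - beta_t)
    means:     (K, C, H, W) mixture component means (assumed layout)
    """
    a = alpha_bar[t].view(-1, 1, 1, 1)
    k = torch.randint(0, means.shape[0], (x0.shape[0],), device=x0.device)
    mu = means[k]                      # per-sample global offset
    eps = torch.randn_like(x0)         # standard Gaussian noise
    # Shift the terminal distribution from N(0, I) toward N(mu_k, I):
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * (eps + mu)
    return x_t, eps, k
```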
We present the incorporation of a surrogate Gaussian process regression (GPR) atomistic model to greatly accelerate the convergence of classical nudged elastic band (NEB) calculations. In our surrogate-model approach, the cost of converging the elastic band no longer scales with the number of moving images on the path, providing a far more efficient and robust transition-state search. In contrast to a conventional NEB calculation, the algorithm presented here eliminates any need to manipulate the number of images to obtain a converged result. This is achieved by introducing a new convergence criterion that exploits the probabilistic nature of the GPR, using the uncertainty estimates of all images in combination with the force at the saddle point in the target model potential. Our method is an order of magnitude faster than the conventional NEB method in terms of function evaluations, with no loss of accuracy in the converged energy-barrier values.
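A minimal sketch of what such an uncertainty-aware stopping test could look like is shown below; the function name, thresholds, and units are illustrative placeholders, not the paper's actual criterion.

```python
import numpy as np

def neb_converged(image_uncertainties, saddle_force, f_tol=0.05, u_tol=0.02):
    """Illustrative convergence test in the spirit of the GPR-accelerated NEB:
    stop only when (a) the force at the saddle-point image, evaluated in the
    target model potential, is below the force tolerance and (b) the GPR's
    predictive uncertainty at *every* image on the band is small enough that
    the surrogate path can be trusted. Thresholds are made-up placeholders.
    """
    force_ok = np.linalg.norm(saddle_force) < f_tol       # e.g. eV/Angstrom
    uncertainty_ok = max(image_uncertainties) < u_tol     # predictive std. dev.
    return force_ok and uncertainty_ok
```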
Synthesizing a high dynamic range (HDR) image from multi-exposure images has recently been extensively studied using convolutional neural networks (CNNs). Despite the remarkable progress, existing CNN-based methods have the intrinsic limitation of a local receptive field, which hinders the model's ability to capture long-range correspondence and large motions across under-/over-exposure images, resulting in ghosting artifacts in dynamic scenes. To address this challenge, we propose a novel Edge-guided Transformer framework (EdiTor) customized for ghost-free HDR reconstruction, where the long-range motions across different exposures can be delicately modeled by incorporating an edge prior. Specifically, EdiTor calculates patch-wise correlation maps on both the image and edge domains, enabling the network to effectively model global movements and fine-grained shifts across multiple exposures. Based on this framework, we further propose an exposure-masked loss to adaptively compensate for severely distorted regions (e.g., highlights and shadows). Experiments demonstrate that EdiTor outperforms state-of-the-art methods both quantitatively and qualitatively, achieving appealing HDR visualization with unified textures and colors.
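As a reading aid, here is a toy sketch of an exposure-masked L1 loss consistent with the description above; the saturation thresholds, the extra weight on masked regions, and the choice of L1 are our assumptions rather than the released implementation.

```python
import torch

def exposure_masked_loss(pred, target, reference, low=0.05, high=0.95):
    """Toy exposure-masked L1 loss (our reading of the abstract, not the
    released implementation). Pixels where the reference exposure is nearly
    black (shadows) or nearly saturated (highlights) are the most severely
    distorted, so their reconstruction error is up-weighted. The thresholds
    and the extra weight of 2.0 are illustrative values."""
    mask = ((reference < low) | (reference > high)).float()   # distorted regions
    err = (pred - target).abs()
    return (err * (1.0 + 2.0 * mask)).mean()
```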
Consistent video depth estimation. Luo, Xuan; Huang, Jia-Bin; Szeliski, Richard, et al. ACM Transactions on Graphics, 07/2020, Volume 39, Issue 4.
We present an algorithm for reconstructing dense, geometrically consistent depth for all pixels in a monocular video. We leverage a conventional structure-from-motion reconstruction to establish geometric constraints on pixels in the video. Unlike the ad hoc priors in classical reconstruction, we use a learning-based prior, i.e., a convolutional neural network trained for single-image depth estimation. At test time, we fine-tune this network to satisfy the geometric constraints of a particular input video, while retaining its ability to synthesize plausible depth details in parts of the video that are less constrained. We show through quantitative validation that our method achieves higher accuracy and a higher degree of geometric consistency than previous monocular reconstruction methods; visually, our results also appear more stable. Our algorithm can handle challenging hand-held captured input videos with a moderate degree of dynamic motion. The improved quality of the reconstruction enables several applications, such as scene reconstruction and advanced video-based visual effects.
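For intuition, the sketch below paraphrases the kind of geometric consistency objective such test-time fine-tuning could minimize, combining an image-space reprojection term with a disparity (inverse-depth) term; the argument names and equal weighting are our assumptions, not the authors' code.

```python
import torch

def geometric_consistency_loss(uv_proj, uv_match, z_proj, z_match):
    """Paraphrased sketch of a geometric consistency objective (not the
    released code). SfM supplies pixel correspondences between two frames;
    lifting a pixel with the depth predicted in frame i and reprojecting it
    into frame j yields a projected pixel `uv_proj` and depth `z_proj`,
    which should agree with the matched pixel `uv_match` and the depth
    `z_match` predicted directly in frame j."""
    spatial = (uv_proj - uv_match).abs().mean()                # reprojection error
    disparity = (1.0 / z_proj - 1.0 / z_match).abs().mean()    # inverse-depth error
    return spatial + disparity
```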
Generative adversarial networks (GANs) have been extensively studied in the past few years. Arguably their most significant impact has been in computer vision, where great advances have been made in challenges such as plausible image generation, image-to-image translation, facial attribute manipulation, and similar domains. Despite the significant successes achieved to date, applying GANs to real-world problems still poses significant challenges, three of which we focus on here: (1) generating high-quality images, (2) achieving diversity in image generation, and (3) stabilizing training. Focusing on the degree to which popular GAN technologies have made progress against these challenges, we provide a detailed review of the state of the art in GAN-related research in the published scientific literature. We further structure this review through a convenient taxonomy based on variations in GAN architectures and loss functions. While several reviews of GANs have been presented to date, none have considered the status of this field based on progress toward addressing practical challenges relevant to computer vision. Accordingly, we review and critically discuss the most popular architecture-variant and loss-variant GANs for tackling these challenges. Our objective is to provide both an overview and a critical analysis of the status of GAN research in terms of relevant progress toward critical computer vision application requirements. We also discuss the most compelling computer vision applications in which GANs have demonstrated considerable success, along with some suggestions for future research directions. Code related to the GAN variants studied in this work is summarized at https://github.com/sheqi/GAN_Review.
Over the last few decades, rapid progress in AI, machine learning, and deep learning has resulted in new techniques and tools for manipulating multimedia. Though the technology has mostly been used in legitimate applications such as entertainment and education, malicious users have also exploited it for unlawful or nefarious purposes. For example, high-quality and realistic fake videos, images, and audio clips have been created to spread misinformation and propaganda, foment political discord and hate, or even harass and blackmail people. Such manipulated, high-quality, realistic videos have recently become known as Deepfakes. Various approaches have since been described in the literature to deal with the problems raised by Deepfakes. To provide an updated overview of research on Deepfake detection, we conduct a systematic literature review (SLR) in this paper, summarizing 112 relevant articles from 2018 to 2020 that presented a variety of methodologies. We analyze them by grouping them into four categories: deep learning-based techniques, classical machine learning-based methods, statistical techniques, and blockchain-based techniques. We also evaluate the detection performance of the various methods on different datasets and conclude that deep learning-based methods outperform the others in Deepfake detection.
Images can be maliciously manipulated to hide content or duplicate certain objects. Detecting an elaborate copy-move forgery is very challenging for both humans and machines, and existing methods cannot detect copy-move images with the required precision, especially pixel-level tampered images. In this paper we present our own dataset (CMF58K), the first pixel-level copy-move dataset, which consists of 580,000 images covering copy-move tampered objects in various life scenes with more than 32 object classes. Furthermore, we propose a network for detecting and locating copy-move forgeries: the Efficient and Robust Identification Network (ERINet). It comprises four main modules: the efficient feature pyramid network (EFPN), the residual receptive field block (RRFB), hierarchical decoding identification (HDI), and cascaded group-reversal attention (GRA) blocks. Considering that the inevitable external factors of rotation, scaling, blurring, compression, and noise can hide traces of tampering and increase the difficulty of detection, we apply MaxBlurPool in our network and obtain strong robustness. ERINet outperforms various state-of-the-art manipulation detection baselines on four image manipulation datasets, and its inference speed is ∼49 fps on a single GPU, excluding I/O time, on the test dataset.
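MaxBlurPool itself is a documented anti-aliased downsampling operator (max-pool densely, then blur and subsample, after Zhang, 2019). A minimal PyTorch rendition, illustrative rather than ERINet's exact module, might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaxBlurPool2d(nn.Module):
    """Minimal MaxBlurPool sketch: dense max-pooling (stride 1) followed by a
    fixed low-pass blur and subsampling, which makes downsampling far less
    sensitive to small shifts, blur, and noise than a plain strided max-pool."""

    def __init__(self, channels):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        k = torch.outer(k, k)                    # 3x3 binomial blur kernel
        k = k / k.sum()
        self.register_buffer("kernel", k.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x):
        x = F.max_pool2d(x, kernel_size=2, stride=1)   # dense max, no subsampling yet
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        # Depthwise blur + stride-2 subsampling:
        return F.conv2d(x, self.kernel, stride=2, groups=self.channels)
```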
We tackle the problem of semantic image layout manipulation, which aims to manipulate an input image by editing its semantic label map. A core problem of this task is how to transfer visual details from the input image to the new semantic layout while keeping the resulting image visually realistic. Recent work on learning cross-domain correspondence has shown promising results for global layout transfer with dense attention-based warping. However, this approach tends to lose texture details due to its resolution limitation and the lack of a smoothness constraint on the correspondence. To adapt this paradigm to the layout manipulation task, we propose a high-resolution sparse attention module that effectively transfers visual details to new layouts at resolutions up to 512×512. To further improve visual quality, we introduce a novel generator architecture consisting of a semantic encoder and a two-stage decoder for coarse-to-fine synthesis. Experiments on the ADE20k and Places365 datasets demonstrate that our proposed approach achieves substantial improvements over existing inpainting and layout manipulation methods.
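To illustrate the core idea, here is a toy top-k formulation of sparse attention-based detail transfer; the top-k selection, patch-descriptor shapes, and function name are our own simplifications and are not claimed to match the paper's module.

```python
import torch

def sparse_attention_transfer(query, key, value, topk=8):
    """Toy sketch of sparse attention-based detail transfer (our
    simplification, not the paper's module). Each target-layout patch
    attends only to its top-k most correlated source patches, keeping
    memory tractable at high resolution while still copying fine texture.

    query: (N_tgt, D) target patch descriptors
    key, value: (N_src, D) source patch descriptors / features
    """
    corr = query @ key.t()                                   # patch-wise correlation map
    scores, idx = corr.topk(topk, dim=-1)                    # keep k best source matches
    weights = torch.softmax(scores, dim=-1)                  # normalize over the k matches
    return (weights.unsqueeze(-1) * value[idx]).sum(dim=1)   # weighted copy of features
```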
Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of the recently introduced Contrastive Language-Image Pre-training (CLIP) models to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping text prompts to input-agnostic directions in StyleGAN's style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.
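A hedged sketch of the first ingredient, latent optimization under a CLIP-based loss, is given below; the generator and CLIP interfaces are assumed, CLIP's input preprocessing is elided, and the identity-preservation weight is an illustrative placeholder rather than the paper's exact objective.

```python
import torch

def optimize_latent(w, generator, clip_model, text_features,
                    steps=200, lr=0.1, lam=0.008):
    """Sketch of text-driven latent optimization (illustrative, not the
    paper's exact recipe). `generator` maps a latent to an image (assumed
    API); `clip_model` is an OpenAI CLIP model; `text_features` is the
    encoded prompt. Resizing/normalizing the image to CLIP's expected
    input is elided here."""
    w = w.clone().requires_grad_(True)
    w0 = w.detach().clone()
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = generator(w)                               # latent -> image
        img_features = clip_model.encode_image(img)      # CLIP image embedding
        sim = torch.cosine_similarity(img_features, text_features, dim=-1)
        # Pull the image embedding toward the text embedding while anchoring
        # the latent near its starting point to preserve identity.
        loss = (1.0 - sim).mean() + lam * (w - w0).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```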