Owing to their fast inference and good performance, discriminative learning methods have been widely studied in image denoising. However, these methods mostly learn a specific model for each noise level and require multiple models for denoising images with different noise levels. They also lack the flexibility to deal with spatially variant noise, limiting their application in practical denoising. To address these issues, we present a fast and flexible denoising convolutional neural network, namely FFDNet, with a tunable noise level map as the input. The proposed FFDNet works on downsampled sub-images, achieving a good trade-off between inference speed and denoising performance. In contrast to existing discriminative denoisers, FFDNet enjoys several desirable properties, including: 1) the ability to handle a wide range of noise levels (i.e., [0, 75]) effectively with a single network; 2) the ability to remove spatially variant noise by specifying a non-uniform noise level map; and 3) faster speed than the benchmark BM3D even on CPU without sacrificing denoising performance. Extensive experiments on synthetic and real noisy images are conducted to evaluate FFDNet in comparison with state-of-the-art denoisers. The results show that FFDNet is effective and efficient, making it highly attractive for practical denoising applications.
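As an illustration of how a noisy image and a tunable noise level map could be combined into one network input, the following is a minimal sketch in plain Python. It assumes a factor-2 rearrangement into four sub-images and a noise map sampled at sub-image resolution; the abstract does not state these details, and the function names `to_sub_images` and `build_network_input` are hypothetical.

```python
def to_sub_images(image, factor=2):
    """Rearrange an H x W image (list of rows) into factor**2 sub-images.

    Each sub-image keeps every factor-th pixel, starting at a different
    (dy, dx) offset, so no pixel is discarded (assumed layout).
    """
    h, w = len(image), len(image[0])
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    subs = []
    for dy in range(factor):
        for dx in range(factor):
            subs.append([row[dx::factor] for row in image[dy::factor]])
    return subs


def build_network_input(image, noise_level_map, factor=2):
    """Stack the sub-images with a noise level map as one extra channel.

    Assumption: the full-resolution noise map is sampled at the top-left
    pixel of each factor x factor block to match the sub-image size.
    """
    subs = to_sub_images(image, factor)
    sigma = [row[::factor] for row in noise_level_map[::factor]]
    return subs + [sigma]


# Spatially variant noise: weaker on the left half, stronger on the right.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
noise_map = [[10, 10, 50, 50] for _ in range(4)]
channels = build_network_input(image, noise_map)
```

Passing a constant map corresponds to uniform noise, while a non-uniform map such as `noise_map` above expresses spatially variant noise without changing the network.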
Facial attribute editing aims to manipulate single or multiple attributes of a given face image, i.e., to generate a new face image with the desired attributes while preserving other details. Recently, the generative adversarial net (GAN) and the encoder-decoder architecture have commonly been combined to handle this task, with promising results. Based on the encoder-decoder architecture, facial attribute editing is achieved by decoding the latent representation of a given face conditioned on the desired attributes. Some existing methods attempt to establish an attribute-independent latent representation for further attribute editing. However, such an attribute-independent constraint on the latent representation is excessive because it restricts the capacity of the latent representation and may result in information loss, leading to over-smooth or distorted generation. Instead of imposing constraints on the latent representation, in this work we propose to apply an attribute classification constraint to the generated image to guarantee only the correct change of the desired attributes, i.e., to change what you want. Meanwhile, reconstruction learning is introduced to preserve attribute-excluding details, in other words, to only change what you want. In addition, adversarial learning is employed for visually realistic editing. These three components cooperate with each other, forming an effective framework for high-quality facial attribute editing, referred to as AttGAN. Furthermore, the proposed method is extended to attribute style manipulation in an unsupervised manner. Experiments on two wild datasets, CelebA and LFW, show that the proposed method outperforms the state of the art on realistic attribute editing with other facial details well preserved.
Extracting informative image features and learning effective approximate hashing functions are two crucial steps in image retrieval. Conventional methods often study these two steps separately, e.g., learning hash functions from a predefined hand-crafted feature space. Meanwhile, the bit lengths of the output hashing codes are preset in most previous methods, neglecting the significance level of different bits and restricting their practical flexibility. To address these issues, we propose a supervised learning framework to generate compact and bit-scalable hashing codes directly from raw images. We pose hashing learning as a problem of regularized similarity learning. In particular, we organize the training images into a batch of triplet samples, each sample containing two images with the same label and one with a different label. With these triplet samples, we maximize the margin between the matched pairs and the mismatched pairs in the Hamming space. In addition, a regularization term is introduced to enforce adjacency consistency, i.e., images of similar appearance should have similar codes. A deep convolutional neural network is utilized to train the model in an end-to-end fashion, where discriminative image features and hash functions are simultaneously optimized. Furthermore, each bit of our hashing codes is unequally weighted, so that we can manipulate the code lengths by truncating the insignificant bits. Our framework outperforms state-of-the-art methods on public benchmarks of similar image search and also achieves promising results in the application of person re-identification in surveillance. It is also shown that the generated bit-scalable hashing codes preserve their discriminative power well at shorter code lengths.
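The triplet margin and the bit truncation described above can be illustrated with a minimal sketch. It assumes binary codes stored as lists, a hinge-style margin loss, and a weighted Hamming distance; the abstract does not give the exact loss, so the function names, the `margin` value, and the weights below are hypothetical and not the framework's actual formulation.

```python
def weighted_hamming(a, b, weights):
    """Sum the weights of the bit positions where two codes differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def triplet_hinge(anchor, positive, negative, weights, margin=1.0):
    """Hinge-style triplet loss (assumed form): the matched pair should be
    closer than the mismatched pair by at least `margin`."""
    d_pos = weighted_hamming(anchor, positive, weights)
    d_neg = weighted_hamming(anchor, negative, weights)
    return max(0.0, margin + d_pos - d_neg)


def truncate(code, weights, n_bits):
    """Keep the n_bits most heavily weighted bits, in their original order."""
    keep = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n_bits]
    keep.sort()
    return [code[i] for i in keep], [weights[i] for i in keep]


# Hypothetical 4-bit codes with per-bit weights (largest = most significant).
weights = [4, 3, 2, 1]
anchor, positive, negative = [1, 0, 1, 1], [1, 0, 1, 0], [0, 1, 1, 1]
loss = triplet_hinge(anchor, positive, negative, weights)
short_code, short_weights = truncate(anchor, weights, 2)
```

Because the low-weight bits contribute little to the distance, dropping them shortens the code while changing the ranking of neighbours only slightly, which is the intuition behind bit scalability.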
We propose an effective online background subtraction method, which can be robustly applied to practical videos that have variations in both foreground and background. Unlike previous methods, which often model the foreground as Gaussian or Laplacian distributions, we model the foreground for each frame with a specific mixture of Gaussians (MoG) distribution, which is updated online frame by frame. In particular, our MoG model in each frame is regularized by the foreground/background knowledge learned in previous frames. This makes our online MoG model highly robust, stable, and adaptive to practical foreground and background variations. The proposed model can be formulated as a concise probabilistic MAP model, which can be readily solved by the EM algorithm. We further embed an affine transformation operator into the proposed model, which can be automatically adjusted to fit a wide range of video background transformations and makes the method more robust to camera movements. Using a sub-sampling technique, the proposed method can be accelerated to process more than 250 frames per second on average, meeting the requirement of real-time background subtraction for practical video processing tasks. The superiority of the proposed method is substantiated by extensive experiments on synthetic and real videos, in comparison with state-of-the-art online and offline background subtraction methods.
Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this article, we propose complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and nonmaximum suppression (NMS), leading to notable gains in average precision (AP) and average recall (AR) without sacrificing inference efficiency. In particular, we consider three geometric factors, that is: 1) overlap area; 2) normalized central-point distance; and 3) aspect ratio, which are crucial for measuring bounding-box regression in object detection and instance segmentation. The three geometric factors are then incorporated into CIoU loss to better distinguish difficult regression cases. Training deep models with CIoU loss results in consistent AP and AR improvements in comparison to the widely adopted $\ell_n$-norm loss and IoU-based loss. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes and usually requires fewer iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT and BlendMask-RT) and object detection (e.g., YOLO v3, SSD, and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains of +1.7 AP and +6.2 AR$_{100}$ for object detection, and +1.1 AP and +3.5 AR$_{100}$ for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at https://github.com/Zzh-tju/CIoU.
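To make the three geometric factors concrete, the following is a minimal sketch of a CIoU-style loss for a single pair of axis-aligned boxes. The abstract does not state the formula; the sketch assumes the commonly cited form, 1 - IoU plus the squared centre distance normalized by the squared diagonal of the smallest enclosing box, plus a weighted aspect-ratio term. The `(x1, y1, x2, y2)` box layout, the function name `ciou_loss`, and the `eps` stabilizer are assumptions.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU-style loss for two boxes given as (x1, y1, x2, y2) (assumed form)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # 1) Overlap area: plain intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # 2) Central-point distance, normalized by the squared diagonal of the
    #    smallest box enclosing both inputs.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2

    # 3) Aspect-ratio consistency and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / (c2 + eps) + alpha * v


# Two unit squares that do not overlap: IoU is zero, yet the distance term
# still separates nearby boxes from distant ones.
near = ciou_loss((0, 0, 1, 1), (1.5, 0, 2.5, 1))
far = ciou_loss((0, 0, 1, 1), (5, 0, 6, 1))
```

Unlike a pure IoU loss, which is constant for all non-overlapping pairs, the distance term here makes `far` larger than `near`, so the loss still varies with box position when the boxes are disjoint.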
We propose a novel reconstruction-based transfer learning method called latent sparse domain transfer (LSDT) for domain adaptation and visual categorization of heterogeneous data. To handle cross-domain distribution mismatch, we advocate reconstructing the target domain data with the combined source and target domain data points based on ℓ1-norm sparse coding. Furthermore, we propose a joint learning model for simultaneous optimization of the sparse coding and the optimal subspace representation. In addition, we generalize the proposed LSDT model into a kernel-based linear/nonlinear basis transformation learning framework for tackling nonlinear subspace shifts in a reproducing kernel Hilbert space. The proposed methods have three advantages: 1) the latent space and the reconstruction are jointly learned in pursuit of an optimal subspace transfer; 2) following the theory of sparse subspace clustering, a few valuable source and target data points are used to reconstruct the target data, with noise (outliers) from the source domain removed during domain adaptation, such that robustness is guaranteed; and 3) a nonlinear projection of some latent space with a kernel is easily generalized for dealing with highly nonlinear domain shift (e.g., face poses). Extensive experiments on several benchmark vision data sets demonstrate that the proposed approaches outperform other state-of-the-art representation-based domain adaptation methods.
We propose a multi-task end-to-end optimized deep neural network (MEON) for blind image quality assessment (BIQA). MEON consists of two sub-networks, a distortion identification network and a quality prediction network, which share the early layers. Unlike traditional methods used for training multi-task networks, our training process is performed in two steps. In the first step, we train a distortion type identification sub-network, for which large-scale training samples are readily available. In the second step, starting from the pre-trained early layers and the outputs of the first sub-network, we train a quality prediction sub-network using a variant of the stochastic gradient descent method. Unlike most deep neural networks, we choose the biologically inspired generalized divisive normalization (GDN) instead of the rectified linear unit as the activation function. We empirically demonstrate that GDN is effective at reducing model parameters/layers while achieving similar quality prediction performance. With modest model complexity, the proposed MEON index achieves state-of-the-art performance on four publicly available benchmarks. Moreover, we demonstrate the strong competitiveness of MEON against state-of-the-art BIQA models using the group maximum differentiation competition methodology.
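For readers unfamiliar with GDN, the following minimal sketch shows the normalization applied to one vector of channel responses. It assumes the commonly used form in which each response is divided by the square root of a bias plus a weighted sum of squared responses across channels; the abstract does not specify MEON's exact parameterization, and the names `gdn`, `beta`, and `gamma` are illustrative.

```python
import math


def gdn(x, beta, gamma):
    """Generalized divisive normalization across channels (assumed form).

    x     : channel responses at one spatial location
    beta  : per-channel positive bias
    gamma : gamma[i][j] weights the influence of channel j on channel i
    """
    out = []
    for i, xi in enumerate(x):
        denom = beta[i] + sum(gamma[i][j] * xj * xj for j, xj in enumerate(x))
        out.append(xi / math.sqrt(denom))
    return out


# Hypothetical two-channel example; in a network, beta and gamma are learned.
y = gdn([3.0, 4.0], beta=[11.0, 11.0], gamma=[[1.0, 1.0], [1.0, 1.0]])
```

In contrast to the rectified linear unit, which acts on each channel independently, this normalization couples channels through `gamma`, so each output depends on the joint energy of its neighbours.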