Semantic segmentation is a crucial image understanding task in which each pixel of an image is assigned a corresponding label. Since pixel-wise ground-truth labeling is tedious and labor-intensive, many practical applications exploit synthetic images to train models for real-world image semantic segmentation, i.e., Synthetic-to-Real Semantic Segmentation (SRSS). However, Deep Convolutional Neural Networks (CNNs) trained on the source synthetic data may not generalize well to the target real-world data. To address this problem, there has been rapidly growing interest in Domain Adaptation techniques to mitigate the domain mismatch between synthetic and real-world images. Domain Generalization is another solution to SRSS: in contrast to Domain Adaptation, it seeks to address SRSS without accessing any data of the target domain during training. In this work, we propose two simple yet effective texture randomization mechanisms, Global Texture Randomization (GTR) and Local Texture Randomization (LTR), for Domain Generalization based SRSS. GTR randomizes the texture of source images into diverse unreal texture styles. It aims to alleviate the network's reliance on texture while promoting the learning of domain-invariant cues. In addition, we find that texture differences do not always occur across the entire image and may appear only in some local areas. Therefore, we further propose an LTR mechanism that generates diverse local regions for partially stylizing the source images. Finally, we impose a regularization of Consistency between GTR and LTR (CGL), aiming to harmonize the two proposed mechanisms during training. Extensive experiments on five publicly available datasets (i.e., GTA5, SYNTHIA, Cityscapes, BDDS and Mapillary) with various SRSS settings (i.e., GTA5/SYNTHIA to Cityscapes/BDDS/Mapillary) demonstrate that the proposed method is superior to the state-of-the-art methods for Domain Generalization based SRSS.
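To make the three mechanisms concrete, here is a minimal PyTorch sketch. It assumes an AdaIN-style statistics swap as a lightweight stand-in for the style-transfer step used for texture randomization; the function names and the symmetric-KL form of the consistency term are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adain(content, style, eps=1e-5):
    # Replace per-channel mean/std of the content image with those of the
    # style (texture) image -- a lightweight stand-in for the style-transfer
    # step that randomizes texture.
    c_mean, c_std = content.mean((2, 3), keepdim=True), content.std((2, 3), keepdim=True) + eps
    s_mean, s_std = style.mean((2, 3), keepdim=True), style.std((2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

def global_texture_randomization(img, texture):
    return adain(img, texture)                      # stylize the whole image (GTR)

def local_texture_randomization(img, texture, mask):
    # mask: (B,1,H,W) binary map of randomly generated local regions (LTR)
    return mask * adain(img, texture) + (1 - mask) * img

def cgl_consistency(logits_gtr, logits_ltr):
    # Consistency between GTR and LTR predictions (CGL), sketched here as a
    # symmetric KL divergence over the per-pixel class distributions.
    p, q = F.log_softmax(logits_gtr, 1), F.log_softmax(logits_ltr, 1)
    return 0.5 * (F.kl_div(p, q.exp(), reduction="batchmean")
                  + F.kl_div(q, p.exp(), reduction="batchmean"))
```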
To address the high computational and memory cost of 3-D volumetric convolutional neural networks (CNNs), we propose an approach to train binary volumetric CNNs for 3-D object recognition. Our method is specifically designed for 3-D data: it transforms the inputs and weights of convolutional/fully connected layers into binary values, which can potentially accelerate the networks through efficient bitwise operations. Two loss calculation methods are designed to counter the accuracy drop that occurs when the weights in the last layer are binarized. Four binary volumetric CNNs are obtained from their corresponding floating-point networks using our approach. Evaluations on three public datasets from different domains (Computer Aided Design (CAD), light detection and ranging (LiDAR), and RGB-D) show that our binary volumetric CNNs achieve recognition performance comparable to their floating-point counterparts while consuming fewer computational and memory resources.
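A minimal sketch of the binarization idea in PyTorch, assuming the common sign/straight-through-estimator formulation; the paper's specific loss designs for the last layer are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    # Sign binarization with a straight-through estimator: the forward pass
    # quantizes to {-1, +1}; the backward pass passes gradients through for
    # inputs within [-1, 1] and clips them elsewhere.
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

class BinaryConv3d(nn.Conv3d):
    # 3-D convolution whose inputs and weights are binarized; at deployment
    # the inner products can be replaced by XNOR/popcount bitwise operations.
    def forward(self, x):
        bx = BinarizeSTE.apply(x)
        bw = BinarizeSTE.apply(self.weight)
        return F.conv3d(bx, bw, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```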
Mask-Aware Networks for Crowd Counting. Jiang, Shengqin; Lu, Xiaobo; Lei, Yinjie. IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 9, September 2020. Journal article, peer-reviewed, open access.
The crowd counting problem aims to count the number of objects within an image or video frame, and is usually solved by estimating a density map generated from object location annotations. The values in the density map take one of two states by nature: zero, indicating no object nearby, or a non-zero value, indicating the presence of objects, with the value denoting the local object density. In contrast to traditional methods, which do not differentiate the density prediction of these two states, we propose to use a dedicated network branch to predict the object/non-object mask and then combine its prediction with the input image to produce the density map. Our rationale is that mask prediction can be better modeled as a binary segmentation problem, and that the difficulty of estimating the density is reduced once the mask is known. A key to the proposed scheme is the strategy for incorporating the mask prediction into the density map estimator. To this end, we study five possible solutions, and via analysis and experimental validation we identify the most effective one. Through extensive experiments on three public datasets, we demonstrate the superior performance of the proposed approach over the baselines and show that our network achieves state-of-the-art performance.
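A minimal sketch of the two-branch idea, assuming multiplicative gating as the incorporation strategy; the paper compares five such strategies, so this shows only one, with illustrative layer sizes.

```python
import torch
import torch.nn as nn

class MaskAwareCounter(nn.Module):
    # Shared backbone, a binary object/non-object mask branch, and a density
    # branch whose features are gated by the predicted mask. Train with BCE
    # on the mask and MSE on the density map.
    def __init__(self, ch=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.mask_head = nn.Conv2d(ch, 1, 1)       # object/non-object logits
        self.density_head = nn.Conv2d(ch, 1, 1)    # per-pixel density

    def forward(self, x):
        feat = self.backbone(x)
        mask = torch.sigmoid(self.mask_head(feat))
        density = self.density_head(feat * mask)   # gate features by the mask
        return density, mask
```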
Street Scene Change Detection (SSCD) aims to locate the changed regions between a given pair of street-view images captured at different times, an important yet challenging task in the computer vision community. The intuitive way to solve the SSCD task is to fuse the extracted image feature pairs and then directly measure the dissimilar parts to produce a change map. The key to the SSCD task is therefore to design an effective feature fusion method that improves the accuracy of the resulting change maps. To this end, we present a novel Hierarchical Paired Channel Fusion Network (HPCFNet), which utilizes the adaptive fusion of paired feature channels. Specifically, the features of a given image pair are jointly extracted by a Siamese Convolutional Neural Network (SCNN) and hierarchically combined by exploring the fusion of channel pairs at multiple feature levels. In addition, based on the observation that the distribution of scene changes is diverse, we further propose a Multi-Part Feature Learning (MPFL) strategy to detect diverse changes. Based on the MPFL strategy, our framework adapts to the scale and location diversity of scene change regions. Extensive experiments on three public datasets (i.e., PCD, VL-CMU-CD and CDnet2014) demonstrate that the proposed framework outperforms other state-of-the-art methods by a considerable margin.
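An illustrative sketch of the channel-pairing idea: channel i of one branch's feature map is interleaved with channel i of the other's, and each pair is fused by a grouped 1x1 convolution. The actual HPCFNet fusion is adaptive and hierarchical across levels; this only demonstrates the pairing mechanism.

```python
import torch
import torch.nn as nn

class PairedChannelFusion(nn.Module):
    # Fuse paired channels from the two Siamese branches: stack the two
    # feature maps channel-wise so pairs (A_i, B_i) are adjacent, then apply
    # a grouped 1x1 conv with one filter per pair.
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, 1, groups=ch)  # one filter per pair

    def forward(self, fa, fb):
        b, c, h, w = fa.shape
        paired = torch.stack((fa, fb), dim=2).reshape(b, 2 * c, h, w)
        return self.fuse(paired)
```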
Street Scene Parsing (SSP) is a fundamental and important step for autonomous driving and traffic scene understanding. Recently, Fully Convolutional Network (FCN) based methods have delivered impressive performance with the help of large-scale densely labeled datasets. However, in urban traffic environments, not all labels contribute equally to making control decisions. Labels such as pedestrian, car, bicyclist, road lane or sidewalk are more important than labels for vegetation, sky or building. Based on this fact, in this paper we propose a novel deep learning framework, named Residual Atrous Pyramid Network (RAPNet), for importance-aware SSP. More specifically, to incorporate the importance of the various object classes, we propose an Importance-Aware Feature Selection (IAFS) mechanism which automatically selects the important features for label prediction. The IAFS can operate in each convolutional block; semantic features of different importance are captured in different channels and automatically assigned corresponding weights. To enhance labeling coherence, we also propose a Residual Atrous Spatial Pyramid (RASP) module to sequentially aggregate global-to-local context information in a residual refinement manner. Extensive experiments on two public benchmarks show that our approach achieves new state-of-the-art performance and consistently obtains more accurate results on the semantic classes with high importance levels.
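A sketch of the importance-aware feature selection idea, modeled after squeeze-and-excitation-style channel reweighting; the exact IAFS design in the paper may differ, and the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class ImportanceAwareSelection(nn.Module):
    # Predict per-channel weights from globally pooled features and re-scale
    # the channels, so that features tied to high-importance classes can be
    # emphasized inside each convolutional block.
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)   # channel-wise importance weighting
```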
In recent years, Salient Object Detection (SOD) has shown great success thanks to large-scale benchmarks and deep learning techniques. However, existing SOD methods mainly focus on natural images at low resolutions, e.g., 400×400 or less. This drawback hinders their use in advanced practical applications, which need high-resolution, detail-aware results. Moreover, the lack of boundary detail and semantic context of salient objects is also a key concern for accurate SOD. To address these issues, in this work we focus on the High-Resolution Salient Object Detection (HRSOD) task. Technically, we propose the first end-to-end learnable framework, named Dual ReFinement Network (DRFNet), for fully automatic HRSOD. More specifically, the proposed DRFNet consists of a shared feature extractor and two effective refinement heads. By decoupling detail and context information, one refinement head adopts a global-aware feature pyramid. Without adding much computational burden, it boosts spatial detail information, narrowing the gap between high-level semantics and low-level details. In parallel, the other refinement head adopts hybrid dilated convolutional blocks and group-wise upsampling, which are very efficient for extracting contextual information. Based on the dual refinements, our approach enlarges receptive fields and obtains more discriminative features from high-resolution images. Experimental results on high-resolution benchmarks (the public DUT-HRSOD and the proposed DAVIS-SOD) demonstrate that our method is not only efficient but also more accurate than other state-of-the-art methods. Moreover, our method generalizes well on typical low-resolution benchmarks.
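A sketch of the context-refinement side of the design: stacked 3x3 convolutions with increasing dilation rates enlarge the receptive field cheaply, which suits high-resolution inputs. The rates (1, 2, 5) are a common choice to avoid gridding artifacts; the exact DRFNet configuration is not specified here.

```python
import torch.nn as nn

class HybridDilatedBlock(nn.Module):
    # Residual block of dilated 3x3 convolutions; each successive layer
    # sees a larger context window while the spatial resolution is kept.
    def __init__(self, ch, rates=(1, 2, 5)):
        super().__init__()
        self.body = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=r, dilation=r),
                          nn.ReLU())
            for r in rates])

    def forward(self, x):
        return x + self.body(x)   # residual refinement
```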
Unsupervised Domain Adaptation (UDA) is quite challenging due to the large distribution discrepancy between the source domain and the target domain. Inspired by diffusion models, which have a strong capability to gradually convert data distributions across a large gap, we explore the diffusion technique to handle the challenging UDA task. However, using diffusion models to convert data distributions across different domains is non-trivial, as standard diffusion models generally perform conversion from the Gaussian distribution rather than from a specific domain distribution. Moreover, during the conversion, the semantics of the source-domain data need to be preserved for correct classification in the target domain. To tackle these problems, we propose a novel Domain-Adaptive Diffusion (DAD) module accompanied by a Mutual Learning Strategy (MLS), which gradually converts the data distribution from the source domain to the target domain while enabling the classification model to learn along the domain transition process. Consequently, our method eases the challenge of UDA by decomposing the large domain gap into small ones and gradually enhancing the capacity of the classification model to finally adapt to the target domain. Our method outperforms the current state-of-the-art methods by a large margin on three widely used UDA datasets.
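A conceptual stand-in for the gradual source-to-target conversion: interpolate feature statistics from the source toward the target over small steps and train the classifier on each intermediate distribution. The actual DAD module is a learned diffusion process; this sketch only illustrates the decompose-the-gap idea, and all names below are assumptions.

```python
import torch

def domain_transition_batches(src_feats, tgt_feats, steps=10):
    # Yield re-normalized source batches whose first/second moments move
    # from the source statistics (a=0) to the target statistics (a=1),
    # giving the classifier a sequence of small domain shifts to learn on.
    s_mu, s_std = src_feats.mean(0), src_feats.std(0) + 1e-5
    t_mu, t_std = tgt_feats.mean(0), tgt_feats.std(0) + 1e-5
    for k in range(steps + 1):
        a = k / steps
        mu, std = (1 - a) * s_mu + a * t_mu, (1 - a) * s_std + a * t_std
        yield (src_feats - s_mu) / s_std * std + mu
```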
Novel view synthesis aims at rendering images at arbitrary poses from sparse observations of a scene. Recently, neural radiance fields (NeRF) have demonstrated their effectiveness in synthesizing novel views of a bounded scene. However, most existing methods cannot be directly extended to 360° unbounded scenes, where the camera orientations and scene depths are unconstrained and vary widely. In this paper, we present a spherical radiance field (SRF) for efficient novel view synthesis in 360° unbounded scenes. Specifically, we represent a 3D scene as multiple concentric spheres with different radii. Each sphere encodes its corresponding layered scene into implicit representations and is parameterized with an equirectangular projection image. A shallow multi-layer perceptron (MLP) is then used to infer density and color from these sphere representations for volume rendering. Moreover, an occupancy grid is introduced to cache the density field and guide ray sampling, which accelerates training and rendering by reducing the number of samples along each ray. Experiments show that our method fits 360° unbounded scenes well and produces state-of-the-art results on three benchmark datasets with less than 30 minutes of training time on a 3090 GPU, surpassing Mip-NeRF 360 with a 400× speedup. In addition, our method achieves competitive accuracy and efficiency on a bounded dataset. Project page: https://minglin-chen.github.io/SphericalRF
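A small sketch of the parameterization the abstract describes: mapping a 3-D sample point to the nearest concentric sphere and its equirectangular texture coordinates on that sphere. Nearest-sphere lookup and the coordinate conventions are illustrative assumptions.

```python
import numpy as np

def equirect_coords(p, radii):
    # p: 3-D point; radii: radii of the concentric spheres.
    # Returns the index of the nearest sphere and (u, v) in [0, 1]^2,
    # i.e., longitude/latitude mapped to equirectangular texture space.
    r = np.linalg.norm(p)
    idx = int(np.argmin(np.abs(np.asarray(radii) - r)))   # nearest sphere
    lon = np.arctan2(p[1], p[0])                          # [-pi, pi]
    lat = np.arcsin(np.clip(p[2] / max(r, 1e-9), -1, 1))  # [-pi/2, pi/2]
    u = (lon / np.pi + 1) / 2
    v = lat / np.pi + 0.5
    return idx, u, v
```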
We propose a weakly supervised approach for salient object detection from multi-modal RGB-D data. Our approach relies only on scribble labels, which are much easier to annotate than the dense labels used in the conventional fully supervised setting. In contrast to existing methods that apply supervision signals in the output space, our design regularizes the intermediate latent space to enhance discrimination between salient and non-salient objects. We further introduce a contour detection branch to implicitly constrain the semantic boundaries and achieve precise edges for the detected salient objects. To enhance the long-range dependencies among local features, we introduce a Cross-Padding Attention Block (CPAB). Extensive experiments on seven benchmark datasets demonstrate that our method not only outperforms existing weakly supervised methods, but is also on par with several fully supervised state-of-the-art models. Code is available at https://github.com/leolyj/DHFR-SOD.
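For context, the standard recipe behind scribble supervision is a partial cross-entropy that only receives gradients at annotated pixels; the sketch below illustrates that recipe, not the paper's full loss or the CPAB design.

```python
import torch.nn.functional as F

def scribble_partial_ce(logits, scribble, ignore_index=255):
    # logits: (B,2,H,W) salient/background scores; scribble: (B,H,W) long
    # tensor in {0, 1, ignore_index}. Cross-entropy is computed only on the
    # sparse pixels the annotator marked; unlabeled pixels give no gradient.
    return F.cross_entropy(logits, scribble, ignore_index=ignore_index)
```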
This paper presents a computationally efficient 3D face recognition system based on a novel facial signature called the Angular Radial Signature (ARS), which is extracted from the semi-rigid region of the face. Kernel Principal Component Analysis (KPCA) is then used to extract mid-level features from the extracted ARSs to improve their discriminative power. The mid-level features are concatenated into a single feature vector and fed into a Support Vector Machine (SVM) to perform face recognition. The proposed approach addresses the expression variation problem by training on facial scans of different individuals with various expressions. We conducted experiments on the Face Recognition Grand Challenge (FRGC v2.0) and the 3D track of the Shape Retrieval Contest (SHREC 2008) datasets, achieving superior recognition performance. Our experimental results show that the proposed system achieves high Verification Rates (VRs) of 97.8% and 88.5% at a 0.1% False Acceptance Rate (FAR) in the “neutral vs. nonneutral” experiments on the FRGC v2.0 and SHREC 2008 datasets respectively, and 96.7% for the ROC III experiment of the FRGC v2.0 dataset. Our experiments also demonstrate the computational efficiency of the proposed approach.
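A sketch of the ARS idea: starting at a facial landmark (e.g., the nose tip), sample the depth map along radial lines cast at equally spaced angles. The parameter values and nearest-neighbor sampling are illustrative assumptions. The resulting signatures would then be mapped by KPCA and classified with an SVM, e.g., via scikit-learn's KernelPCA and SVC.

```python
import numpy as np

def angular_radial_signatures(depth, center, n_angles=12, n_samples=16, radius=40):
    # depth: 2-D range image; center: (row, col) of the landmark.
    # Each radial line yields one signature vector of depth samples.
    cy, cx = center
    sigs = []
    for a in np.linspace(0, 2 * np.pi, n_angles, endpoint=False):
        rs = np.linspace(0, radius, n_samples)
        ys = np.clip((cy + rs * np.sin(a)).astype(int), 0, depth.shape[0] - 1)
        xs = np.clip((cx + rs * np.cos(a)).astype(int), 0, depth.shape[1] - 1)
        sigs.append(depth[ys, xs])   # nearest-neighbor sampling for brevity
    return np.stack(sigs)            # (n_angles, n_samples) -> KPCA + SVM
```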
• Novel facial Angular Radial Signatures (ARSs) are proposed for 3D face recognition.
• The signatures are extracted from the semi-rigid facial regions.
• A two-stage mapping-based classification strategy is used to perform face recognition.
• ARSs combined with machine learning techniques can handle expression variations.
• State-of-the-art performance on two public datasets with high efficiency is achieved.