Semantic segmentation is a fundamental task in computer vision with various applications in fields such as robotic sensing, video surveillance, and autonomous driving. A major research topic in urban road semantic segmentation is the effective integration and use of cross-modal information for fusion. Here, we leverage inherent multimodal information and acquire graded features to develop a novel multilabel-learning network for RGB-thermal urban scene semantic segmentation. Specifically, we propose a graded-feature extraction strategy that splits multilevel features into junior, intermediate, and senior levels. Then, we integrate the RGB and thermal modalities with two distinct fusion modules, namely a shallow feature fusion module and a deep feature fusion module, for junior and senior features, respectively. Finally, we use multilabel supervision to optimize the network in terms of semantic, binary, and boundary characteristics. Experimental results confirm that the proposed architecture, the graded-feature multilabel-learning network, outperforms state-of-the-art methods for urban scene semantic segmentation, and it can be generalized to depth data.
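The abstract does not spell out how the two fusion modules differ, so the following is a minimal PyTorch sketch under assumptions: module names and exact operations are illustrative, not the authors' released code. The idea shown is that junior (shallow) features can take a cheap element-wise fusion that preserves spatial detail, while senior (deep) features can afford an attention-based fusion that reweighs the modalities.

```python
import torch
import torch.nn as nn

class ShallowFusion(nn.Module):
    """Junior (low-level) features: cheap element-wise fusion preserves detail."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, rgb, thermal):
        return self.conv(rgb + thermal)

class DeepFusion(nn.Module):
    """Senior (high-level) features: channel attention reweighs the modalities."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb, thermal):
        w = self.gate(torch.cat([rgb, thermal], dim=1))
        return w * rgb + (1.0 - w) * thermal

# Hypothetical grading of a five-stage backbone: stages [0, 1] are junior
# (ShallowFusion), stage [2] intermediate, stages [3, 4] senior (DeepFusion).
```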
Under ideal environmental conditions, RGB-based deep convolutional neural networks can achieve high performance in salient object detection (SOD). In scenes with cluttered backgrounds and many objects, depth maps have been combined with RGB images to better distinguish spatial positions and structures, achieving high SOD accuracy. However, under low-light and uneven lighting conditions, RGB and depth information may be insufficient for detection. Thermal images are insensitive to lighting and weather conditions and can capture important objects even at night. Combining thermal and RGB images, we propose an effective and consistent feature fusion network (ECFFNet) for RGB-T SOD. In ECFFNet, an effective cross-modality fusion module fully fuses features of corresponding sizes from the RGB and thermal modalities. Then, a bilateral reversal fusion module fuses foreground and background information bilaterally, enabling the full extraction of salient object boundaries. Finally, a multilevel consistent fusion module combines features across levels to obtain complementary information. Comprehensive experiments on three RGB-T SOD datasets show that the proposed ECFFNet outperforms 12 state-of-the-art methods under different evaluation metrics.
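A plausible reading of "bilateral reversal fusion" is that a coarse saliency estimate gates a foreground stream while its reversal (one minus the estimate) gates a background stream, and both streams jointly sharpen boundaries. The sketch below encodes that reading in PyTorch; the class name and operations are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BilateralReversalFusion(nn.Module):
    """Fuse foreground- and background-attended views of the same features."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, feat, coarse_saliency):
        s = torch.sigmoid(coarse_saliency)        # foreground probability
        fg = feat * s                             # foreground-attended stream
        bg = feat * (1.0 - s)                     # reversed (background) stream
        # boundaries are where the two streams hand over; concatenating both
        # lets the 3x3 convolution respond strongly along that transition
        return self.fuse(torch.cat([fg, bg], dim=1))
```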
Most recent methods for RGB (red-green-blue)-thermal salient object detection (SOD) involve numerous floating-point operations and parameters, resulting in slow inference, especially on common processors, and impeding deployment on mobile devices for practical applications. To address these problems, we propose a lightweight spatial boosting network (LSNet) for efficient RGB-thermal SOD with a lightweight MobileNetV2 backbone replacing a conventional backbone (e.g., VGG, ResNet). To improve feature extraction with a lightweight backbone, we propose a boundary boosting algorithm that optimizes the predicted saliency maps and reduces information collapse in low-dimensional features. The algorithm generates boundary maps from the predicted saliency maps without incurring additional calculations or complexity. As multimodality processing is essential for high-performance SOD, we adopt attentive feature distillation and selection and propose semantic and geometric transfer learning to enhance the backbone without increasing complexity during testing. Experimental results demonstrate that the proposed LSNet achieves state-of-the-art performance compared with 14 RGB-thermal SOD methods on three datasets while requiring fewer floating-point operations (1.025G) and parameters (5.39M), a smaller model size (22.1 MB), and offering fast inference (9.95 fps for PyTorch, batch size of 1, and Intel i5-7500 processor; 93.53 fps for PyTorch, batch size of 1, and NVIDIA TITAN V graphics processor; 936.68 fps for PyTorch, batch size of 20, and graphics processor; 538.01 fps for TensorRT and batch size of 1; and 903.01 fps for TensorRT/FP16 and batch size of 1). The code and results are available at https://github.com/zyrant/LSNet.
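One standard, parameter-free way to derive a boundary map from a predicted saliency map, consistent with "without incurring additional calculations or complexity", is a morphological gradient implemented with max pooling: dilation minus erosion fires only along object contours. This is a sketch of that generic trick, not necessarily the exact algorithm in LSNet.

```python
import torch
import torch.nn.functional as F

def boundary_from_saliency(saliency, kernel_size=3):
    """saliency: (N, 1, H, W) probabilities in [0, 1]; returns a boundary map."""
    pad = kernel_size // 2
    dilated = F.max_pool2d(saliency, kernel_size, stride=1, padding=pad)
    eroded = -F.max_pool2d(-saliency, kernel_size, stride=1, padding=pad)
    return dilated - eroded  # high response only near object boundaries

# Usage: turn a predicted map into a boundary target with no learned weights.
pred = torch.rand(1, 1, 224, 224)           # stand-in for a network output
boundary = boundary_from_saliency(torch.sigmoid(pred))
```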
Infrared and visible image fusion, which highlights radiometric information and detailed textures to describe objects completely and accurately, is a long-standing and well-studied task in computer vision. Existing convolutional neural network-based approaches that leverage end-to-end networks to fuse infrared and visible images have made significant progress. However, most approaches extract features only in the encoder segment and apply a coarse fusion strategy. Unlike these algorithms, this study proposes a multiscale receptive field amplification fusion network (MRANet) to effectively extract local and global features from images. In particular, we extract long-range information in the encoder segment using a convolutional residual structure as the main backbone and a simplified uniformer as an auxiliary backbone, both of which are ResNet-inspired. Additionally, we propose an effective multiscale fusion strategy based on an attention mechanism to integrate the two modalities. Extensive experiments demonstrate that MRANet performs effectively and efficiently on image fusion datasets.
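As an illustration of what "a multiscale fusion strategy based on an attention mechanism" can look like, the sketch below pools both modalities at several scales and lets a spatial attention map decide, per pixel, how much infrared versus visible signal to keep. All names and design choices here are assumptions for illustration, not MRANet's actual fusion rule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleAttentionFusion(nn.Module):
    """Per-pixel blend of infrared and visible features from multiscale context."""
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.Conv2d(2 * channels * len(scales), 1, 3, padding=1)

    def forward(self, ir, vis):
        h, w = ir.shape[-2:]
        pyramid = []
        for s in self.scales:  # gather context at several receptive-field sizes
            pooled = F.avg_pool2d(torch.cat([ir, vis], dim=1), s)
            pyramid.append(F.interpolate(pooled, size=(h, w),
                                         mode="bilinear", align_corners=False))
        a = torch.sigmoid(self.attn(torch.cat(pyramid, dim=1)))  # per-pixel weight
        return a * ir + (1.0 - a) * vis
```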
Scene parsing has recently demonstrated remarkable performance, and one aspect shown to be relevant to this performance is the generation of multilevel feature representations. However, most existing scene parsing methods obtain multilevel feature representations with weak distinctions and large spans; therefore, despite using complex mechanisms, their effect on the feature representations is minimal. To address this, we leverage the inherent multilevel cross-modal data and backpropagation to develop a novel feature reconstruction network (FRNet) for RGB-D indoor scene parsing. Specifically, a feature construction encoder is proposed to obtain features layerwise in a top-down manner, where feature nodes in a higher layer flow to the adjacent lower layer by dynamically changing their structure. In addition, we propose a cross-level enriching module in the encoder to selectively refine and weight the features in each layer of the RGB and depth modalities, as well as a cross-modality awareness module to generate feature nodes containing the modality data. Finally, we integrate the multilevel feature representations simply via dilated convolutions at different rates. Extensive quantitative and qualitative experiments demonstrate that the proposed FRNet is comparable to state-of-the-art RGB-D indoor scene parsing methods on two public indoor datasets.
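The final step, integrating multilevel representations "via dilated convolutions at different rates", matches the well-known parallel dilated-branch pattern (ASPP-style). Below is a minimal sketch of that pattern; the class name and rates are assumptions rather than FRNet's exact configuration.

```python
import torch
import torch.nn as nn

class DilatedAggregation(nn.Module):
    """Merge features with parallel dilated convolutions at different rates."""
    def __init__(self, in_channels, out_channels, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, 3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, 1)

    def forward(self, x):
        # each branch covers a different context size without downsampling;
        # the 1x1 projection mixes the multi-rate responses
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```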
In this article, to solve a time-varying quadratic programming problem with an equality constraint, a new time-specified zeroing neural network (TSZNN) is proposed and analyzed. Unlike existing methods, such as the Zhang neural network with different activation functions and finite-time neural networks, the TSZNN model incorporates a terminal attractor, which guarantees that the convergence error reduces to zero at a prespecified time (rather than merely in finite time). The main advantage of the TSZNN model is that its convergence is independent of the initial state of the system dynamics, in contrast to finite-time convergence, which depends on the initial conditions; this substantially improves the convergence performance. Mathematical analyses substantiate the prespecified-time convergence of the TSZNN model for the quadratic programming problem and its high convergence precision under various convergence time and constant settings. In addition, simulations on repeatable trajectory planning for a redundant manipulator demonstrate the validity of the proposed TSZNN model.
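To make the contrast concrete, here is a generic worked sketch, assuming standard ZNN notation (the symbols $E$, $\gamma$, $t_s$ and the specific gain schedule are illustrative, not necessarily the paper's exact model). The KKT conditions of the time-varying QP are stacked into a linear system $W(t)\,y(t) = g(t)$, and the network zeroes the error $E(t) = W(t)\,y(t) - g(t)$:

```latex
\begin{align}
  \dot{E}(t) &= -\gamma\, E(t)
    &&\Rightarrow\; E(t) = E(0)\, e^{-\gamma t}
    \quad \text{(classical ZNN: exponential, never exactly zero)} \\
  \dot{E}(t) &= -\frac{\gamma}{t_s - t}\, E(t), \quad 0 \le t < t_s
    &&\Rightarrow\; E(t) = E(0) \left( \frac{t_s - t}{t_s} \right)^{\gamma}
\end{align}
```

The second dynamics is a terminal-attractor-style construction: its gain grows unboundedly as $t \to t_s$, so $E(t_s) = 0$ exactly at the prespecified time $t_s$ for any initial error $E(0)$, which is what "independent of the initial state" means here.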
Scene parsing of high spatial resolution (HSR) remote sensing images has achieved notable progress in recent years through the adoption of convolutional neural networks. However, for scene parsing of multimodal remote sensing images, effectively integrating complementary information remains challenging. For instance, the decrease in feature map resolution through a neural network causes a loss of spatial information, likely leading to blurred object boundaries and misclassification of small objects. In addition, object scales in a remote sensing image vary substantially, undermining parsing performance. To solve these problems, we propose an end-to-end common extraction and gate fusion network (CEGFNet) to capture both high-level semantic features and low-level spatial details for scene parsing of remote sensing images. Specifically, we introduce a gate fusion module to extract complementary features from spectral data and digital surface model (DSM) data. A gate mechanism removes redundant features in the data stream and extracts complementary features that improve multimodal feature fusion. In addition, a global context module and a multilayer aggregation decoder handle scale variations between objects and the loss of spatial details due to downsampling, respectively. The proposed CEGFNet was quantitatively evaluated on benchmark scene parsing datasets containing HSR remote sensing images, achieving state-of-the-art performance.
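A common realization of such a gate mechanism computes a sigmoid gate from both streams and lets the elevation (DSM) stream contribute only where it complements the spectral stream. The PyTorch sketch below shows that pattern under assumptions; it is not CEGFNet's published module.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Sigmoid gate passes complementary DSM features into the spectral stream."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, spectral, dsm):
        # the gate is conditioned on both modalities, so it can suppress DSM
        # responses that merely duplicate what the spectral stream already has
        g = self.gate(torch.cat([spectral, dsm], dim=1))
        return spectral + g * dsm
```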
Semantic segmentation of remote sensing images has received increasing attention in recent years; however, using a single imaging modality limits segmentation performance. Thus, digital surface models (DSMs) have been integrated into semantic segmentation to improve performance. Nevertheless, existing methods based on neural networks simply combine data from the two modalities, mostly neglecting the similarities and differences between multimodal features. Consequently, the complementarity between multimodal features cannot be exploited, and excess noise is introduced during feature processing. To solve these problems, we propose a multimodal fusion module that explores the similarities and differences between features from the two modalities for adequate fusion. In addition, although downsampling operations such as pooling and striding can improve feature representativeness, they discard spatial details and often lead to segmentation errors. Thus, we introduce hierarchical feature interactions to mitigate the adverse effects of downsampling and a two-way interactive pyramid pooling module to extract multiscale context features for guiding feature fusion. Extensive experiments performed on two benchmark datasets show that the proposed network integrating our novel modules substantially outperforms state-of-the-art semantic segmentation methods. The code and results can be found at https://github.com/NIT-JJH/CIMFNet.
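One direct way to make similarities and differences explicit, in the spirit of the abstract, is to fuse an element-wise product (agreement between modalities) with an absolute difference (modality-specific cues). The sketch below is an assumed illustration of that decomposition, not the CIMFNet code from the linked repository.

```python
import torch
import torch.nn as nn

class SimilarityDifferenceFusion(nn.Module):
    """Fuse explicit similarity and difference terms of two modality features."""
    def __init__(self, channels):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_rgb, f_dsm):
        similar = f_rgb * f_dsm             # strong only where both agree
        different = (f_rgb - f_dsm).abs()   # strong where one modality adds info
        return self.merge(torch.cat([similar, different], dim=1))
```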
Salient object detection (SOD) based on convolutional neural networks has achieved remarkable success. However, further improving detection performance in challenging scenes (e.g., low-light scenes) requires additional investigation. Thermal infrared imaging captures thermal radiation from object surfaces; thus, it is insensitive to lighting conditions and provides uniform imaging of objects. Accordingly, we propose a two-stage fusion network (TSFNet) integrating RGB and thermal information for RGB-T SOD. In the first fusion stage, we propose a feature-wise fusion module that captures and aggregates united and intersecting information in each local region of the RGB and thermal images, after which independent decoding is applied to the RGB and thermal features. In the second fusion stage, we propose a bilateral auxiliary fusion module that extracts auxiliary spatial features from the foreground and background of the thermal and RGB modalities. Finally, we apply multiple supervision to further improve SOD performance. Comprehensive experiments demonstrate that TSFNet outperforms 11 state-of-the-art models under various metrics on three RGB-T SOD datasets.
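A natural reading of "united and intersecting information" is a union-like term (either modality responds) alongside an intersection-like term (both respond). The sketch below realizes that with element-wise maximum and minimum; this interpretation and the module name are assumptions, not TSFNet's definition.

```python
import torch
import torch.nn as nn

class FeatureWiseFusion(nn.Module):
    """Aggregate union- and intersection-like responses of RGB and thermal."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, rgb, thermal):
        united = torch.maximum(rgb, thermal)     # either modality responds
        intersect = torch.minimum(rgb, thermal)  # both modalities agree
        return self.conv(torch.cat([united, intersect], dim=1))
```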
Exploiting RGB and depth information can boost the performance of semantic segmentation. However, owing to the differences between RGB images and the corresponding depth maps, such multimodal information must be used and combined effectively. Most existing methods apply the same fusion strategy to explore complementary information at all levels, likely ignoring the different contributions of features at various levels to segmentation. To address this problem, we propose a network with a two-stage cascaded decoder (TCD), embedding a detail polishing module, to effectively integrate high- and low-level features and suppress noise from low-level details. Additionally, we introduce a depth filter and fusion module to extract informative regions from depth cues under the guidance of RGB images. The proposed TCD network achieves performance comparable to state-of-the-art RGB-D semantic segmentation methods on the benchmark NYUDv2 and SUN RGB-D datasets.
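Filtering depth cues "with the guidance of RGB images" is often realized as an RGB-derived spatial attention map that keeps informative depth regions and suppresses noisy ones before fusion. The following is a minimal sketch of that idea under assumptions; the name and gating form are illustrative, not the paper's module.

```python
import torch
import torch.nn as nn

class DepthFilterFusion(nn.Module):
    """RGB-guided spatial mask filters depth features before fusion."""
    def __init__(self, channels):
        super().__init__()
        self.guide = nn.Sequential(
            nn.Conv2d(channels, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb, depth):
        mask = self.guide(rgb)       # where the RGB stream trusts depth cues
        return rgb + mask * depth    # only the filtered depth augments RGB
```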