Recent mask proposal models have significantly improved the performance of open-vocabulary semantic segmentation. However, the use of a 'background' embedding during training in these methods is problematic, as the resulting model tends to overfit and assign all unseen classes to the background class instead of their correct labels. Furthermore, these methods ignore the semantic relationships among text embeddings, which can be highly informative for open-vocabulary prediction, since some classes are closely related to others. To this end, this paper proposes novel class enhancement losses that bypass the 'background' embedding during training and simultaneously exploit the semantic relationship between text embeddings and mask proposals by ranking their similarity scores. To further capture the relationship between base and novel classes, we propose an effective pseudo-label generation pipeline using a pretrained vision-language model. Extensive experiments on several benchmark datasets show that our method achieves the best overall performance for open-vocabulary semantic segmentation. Our method is flexible and can also be applied to the zero-shot semantic segmentation problem.
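The core idea of classifying mask proposals by ranking similarity scores against class text embeddings, with no learned 'background' embedding, can be illustrated with a minimal numpy sketch. The function name `classify_masks` and the toy embeddings are hypothetical; the paper's actual losses and proposal network are not reproduced here.

```python
import numpy as np

def classify_masks(mask_embeds, text_embeds, class_names):
    """Assign each mask proposal the class whose text embedding is most
    similar; no 'background' embedding takes part in the ranking."""
    # L2-normalize both sides so the dot product is cosine similarity.
    m = mask_embeds / np.linalg.norm(mask_embeds, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = m @ t.T                        # (num_masks, num_classes)
    ranking = np.argsort(-sim, axis=1)   # classes ranked per mask, best first
    labels = [class_names[i] for i in ranking[:, 0]]
    return labels, sim
```

Because every class (seen or unseen) competes on the same similarity scale, an unseen class can win the ranking at test time instead of being absorbed by a background bucket.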
New classification networks emerge constantly, and different classification networks used as the backbone of a semantic segmentation network can yield different performance. This paper uses the road extraction dataset from the CVPR DeepGlobe challenge and compares VGG-16 as the backbone of U-Net against ResNet34, ResNet101, and Xception as backbones of AD-LinkNet. When VGG-16 is used as the backbone of the semantic segmentation network, it performs better at extracting long, wide roads. With ResNet as the backbone, the network is better at extracting small roads. When Xception is used as the backbone, it not only retains the strengths of ResNet34 but also effectively handles the difficult case of extraction targets covered by occlusions.
This paper focuses on unsupervised domain adaptation for semantic segmentation, i.e., transferring knowledge from a source domain to a target domain. Existing approaches usually regard pseudo labels as ground truth to fully exploit the unlabeled target-domain data. Yet the pseudo labels of the target-domain data are usually predicted by a model trained on the source domain. Thus, the generated labels inevitably contain incorrect predictions due to the discrepancy between the training domain and the test domain; these errors can be transferred to the final adapted model and largely compromise the training process. To overcome this problem, this paper proposes to explicitly estimate the prediction uncertainty during training to rectify pseudo-label learning for unsupervised semantic segmentation adaptation. Given an input image, the model outputs the semantic segmentation prediction as well as the uncertainty of that prediction. Specifically, we model the uncertainty via the prediction variance and incorporate the uncertainty into the optimization objective. To verify the effectiveness of the proposed method, we evaluate it on two prevalent synthetic-to-real semantic segmentation benchmarks, i.e., GTA5 → Cityscapes and SYNTHIA → Cityscapes, as well as one cross-city benchmark, i.e., Cityscapes → Oxford RobotCar. We demonstrate through extensive experiments that the proposed approach (1) dynamically sets different confidence thresholds according to the prediction variance, (2) rectifies the learning from noisy pseudo labels, and (3) achieves significant improvements over conventional pseudo-label learning and yields competitive performance on all three benchmarks.
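A common way to fold prediction variance into the objective, consistent with the description above, is to discount the per-pixel cross-entropy of uncertain pixels while penalizing the variance itself. The following numpy sketch assumes that exact form (exp(-variance) weighting plus a variance regularizer); the paper's precise formulation may differ.

```python
import numpy as np

def rectified_pseudo_label_loss(ce_loss, variance):
    """Uncertainty-rectified pseudo-label loss (sketch).

    ce_loss  -- per-pixel cross-entropy against the pseudo label
    variance -- per-pixel prediction variance (uncertainty estimate)
    """
    # exp(-variance) acts as a soft, dynamically-set confidence threshold:
    # high-variance (uncertain) pixels contribute less to the loss, while
    # the +variance term keeps the model from declaring every pixel
    # uncertain to trivially shrink the first term.
    return float(np.mean(np.exp(-variance) * ce_loss + variance))
```

With zero variance everywhere, the loss reduces to the plain cross-entropy mean, so confident regions are trained exactly as in conventional pseudo-label learning.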
• A novel learning method, i.e., Transformer based Refinement Learning (TRL), is proposed.
• A Dual-Cross Transformer Network (DCTN) is designed.
• State-of-the-art results on two public datasets reveal the superiority of our method.
This paper studies a new yet practical setting of semi-supervised semantic segmentation, i.e., hybrid-supervised semantic segmentation, where a small number of pixel-level (strong) annotations and a large number of image-level (weak) annotations are provided. It is common practice to utilize pseudo labels to mitigate the lack of strong annotations. However, most existing works focus on improving the model representation with unlabeled data while ignoring the quality of the pseudo labels, leading to poor segmentation performance. It is difficult to directly learn a model that produces high-quality pseudo labels from limited images. To address this problem, we propose a novel learning method, i.e., Transformer based Refinement Learning (TRL), which explores a learning process assisted by weak annotations and supervised by strong annotations. TRL progressively refines heat maps from poor quality to better quality to obtain satisfactory pseudo labels. Specifically, we propose a Dual-Cross Transformer Network (DCTN) to perform the refinement learning. DCTN extracts features from both images and heat maps with a dual-stream network, and the cross attentions inside DCTN hierarchically fuse the dual-stream features.
The experiments on the PASCAL VOC and COCO datasets show that TRL outperforms the state-of-the-art methods for hybrid-supervised semantic segmentation.
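The cross-attention fusion step inside DCTN, where one stream's features attend to the other's, can be sketched in a few lines of numpy. This is a minimal single-head version with no learned projections, multi-head splitting, or hierarchy, so it illustrates only the fusion mechanism, not the actual DCTN layers.

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Single-head cross attention: one stream's tokens (queries) attend
    to the other stream's tokens (keys and values)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (nq, nk)
    # numerically stable softmax over the key positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ keys_values                          # (nq, d)

# Image-stream tokens attend to heat-map-stream tokens, and vice versa,
# mirroring the "dual-cross" fusion between the two streams.
rng = np.random.default_rng(0)
img_tokens = rng.standard_normal((16, 32))    # 16 tokens, 32-d features
heat_tokens = rng.standard_normal((16, 32))
fused_img = cross_attention(img_tokens, heat_tokens)
fused_heat = cross_attention(heat_tokens, img_tokens)
```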
Semantic segmentation is a fundamental task in computer vision with applications in fields such as robotic sensing, video surveillance, and autonomous driving. A major research topic in urban road semantic segmentation is the proper integration and use of cross-modal information for fusion. Here, we leverage the inherent multimodal information and acquire graded features to develop a novel multilabel-learning network for RGB-thermal urban scene semantic segmentation. Specifically, we propose a graded-feature extraction strategy that splits multilevel features into junior, intermediate, and senior levels. We then integrate the RGB and thermal modalities with two distinct fusion modules, a shallow feature fusion module for junior features and a deep feature fusion module for senior features. Finally, we use multilabel supervision to optimize the network in terms of semantic, binary, and boundary characteristics. Experimental results confirm that the proposed architecture, the graded-feature multilabel-learning network, outperforms state-of-the-art methods for urban scene semantic segmentation and generalizes to depth data.
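The graded-feature idea, splitting backbone levels into junior/intermediate/senior grades and fusing RGB and thermal features differently per grade, can be sketched as follows. The 2/1/2 split over five backbone levels and the specific fusion rules (element-wise addition for junior features, a scalar gate for senior features) are illustrative assumptions, not the paper's actual modules.

```python
import numpy as np

def grade_features(levels):
    """Split a 5-level backbone's feature maps into junior, intermediate,
    and senior grades (hypothetical 2/1/2 split)."""
    return levels[:2], levels[2:3], levels[3:]

def shallow_fusion(rgb, thermal):
    # Junior features: simple element-wise addition preserves the fine
    # spatial detail present at shallow layers.
    return rgb + thermal

def deep_fusion(rgb, thermal):
    # Senior features: a sigmoid gate weighs the two modalities, so the
    # fused map is a convex combination of the RGB and thermal features.
    gate = 1.0 / (1.0 + np.exp(-(rgb.mean() - thermal.mean())))
    return gate * rgb + (1.0 - gate) * thermal
```

Under multilabel supervision, each grade's fused features would then feed separate semantic, binary, and boundary heads.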