Monocular depth estimation is a fundamental task in computer vision and has drawn increasing attention. Recently, attention-based models and encoder-decoder architectures have led to great improvements in monocular depth estimation. However, most previous methods rely on repeated simple up-sampling operations during decoding, which may not make full use of the properties of the features extracted by the encoder, and their predictions are often inaccurate at edges and in regions of maximum depth. We propose an attention-based feature fusion module for the encoder and decoder. We treat monocular depth estimation as a pixel-level optimization problem: the coarsest encoder feature initializes the optimization, which is then refined to higher resolution by the proposed attentional feature fusion (AFF) module. We formulate the prediction as ordinal regression over bin centers that discretize the continuous depth range. The bin distribution adapts to each input image, and the bins are predicted at the coarsest level using global pooling and MLP layers. On the NYUv2 dataset, the proposed architecture improves on the original model by 2.5% and 1.1% in terms of Log10 and absolute relative error, respectively.
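The adaptive-bins idea above (global pooling and MLP layers over the coarsest feature, followed by regression over bin centers) can be made concrete. Below is a minimal PyTorch sketch, not the paper's implementation; the module and parameter names (BinsHead, n_bins, d_min, d_max) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BinsHead(nn.Module):
    """Sketch: predict per-image depth bins from the coarsest encoder feature.

    Global pooling + an MLP yield normalized bin widths; their cumulative
    midpoints serve as bin centers over the range [d_min, d_max].
    """
    def __init__(self, in_channels, n_bins=256, d_min=1e-3, d_max=10.0):
        super().__init__()
        self.d_min, self.d_max = d_min, d_max
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, 256), nn.ReLU(),
            nn.Linear(256, n_bins),
        )

    def forward(self, feat):                             # feat: (B, C, H, W)
        pooled = feat.mean(dim=(2, 3))                   # global average pooling
        widths = torch.softmax(self.mlp(pooled), dim=1)  # widths sum to 1
        edges = torch.cumsum(widths, dim=1)              # normalized bin edges
        centers = edges - 0.5 * widths                   # midpoint of each bin
        return self.d_min + (self.d_max - self.d_min) * centers  # (B, n_bins)

# The final depth can then be an expectation over bin centers with per-pixel
# probabilities p of shape (B, n_bins, H, W):
#   depth = (p * centers[:, :, None, None]).sum(dim=1)
```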
Precise estimation of 3D human pose from monocular video is a formidable challenge, primarily due to depth ambiguity and self-occlusion. We observed significant differences in the movement of the same joint at different times, yet previous methods could not effectively model the correspondences of the same joint across time. To this end, we propose a Transformer-based design, which we dub the Cross Connection Transformer (CCT). It learns the dependencies between joints at different times and stages, and then gathers information from multiple stages for cross attention and feature fusion. The task is decomposed into two stages: first, a spatio-temporal encoder captures the temporal motion patterns of individual joints and learns the spatial correlations between joints; second, the model learns cross-stage communication and fuses diverse spatial features to produce the final 3D pose. Through two novel interaction modules, the model explicitly encodes local and global dependencies between body joints, providing a rich representation of the body joints. This is crucial for capturing small changes across frames, namely the relationships between features. In rigorous experiments, our CCT model shows strong performance on the demanding Human3.6M and MPI-INF-3DHP datasets, surpassing the advanced pure-Transformer method MixSTE. Notably, our model improves on the best prior results on Human3.6M by 2%.
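As a rough illustration of the cross-stage fusion described above, the sketch below lets joint features from an earlier stage serve as keys and values for cross attention with the current stage. It is a hedged approximation, not the CCT architecture itself; all names and the residual/normalization choices are assumptions.

```python
import torch
import torch.nn as nn

class CrossStageAttention(nn.Module):
    """Sketch: fuse joint features from two stages via cross attention.

    Queries come from the current stage, keys/values from an earlier stage,
    so each joint can attend to earlier representations of all joints.
    """
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, current, earlier):    # each (B, J, dim), J = joints
        fused, _ = self.attn(query=current, key=earlier, value=earlier)
        return self.norm(current + fused)   # residual connection
```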
Semantic segmentation is one of the most important research directions in computer vision, with wide applications in autonomous driving, medical imaging, intelligent security, etc. Unsupervised domain adaptation has become a mainstream research topic in recent years; it uses a large number of labeled source samples to perform segmentation in the target domain without labeled target samples. In this paper, we propose a prototype-guided unsupervised domain adaptation method for semantic segmentation built on the ProDA model. Due to the lack of labeled target samples and prior probabilities, we propose a prototype distance loss on the target domain, which optimizes the feature distribution by measuring the distance between features and the continuously updated prototypes and by designing an adaptive threshold strategy. Meanwhile, a smoothing loss is proposed to alleviate the impact of source samples on the model and improve the prediction performance of the network. Experiments on the GTA5-to-Cityscapes scenario show that, compared with the original model, the loss optimization improves mIoU by 1.52.
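A minimal sketch of how a prototype distance loss with a confidence cut-off might look, assuming target-domain feature vectors and externally maintained (e.g. EMA-updated) class prototypes. The paper's adaptive threshold strategy is replaced here by a fixed threshold for brevity, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def prototype_distance_loss(features, prototypes, threshold=0.9):
    """Sketch of a prototype distance loss for target-domain features.

    features:   (N, D) target-domain feature vectors
    prototypes: (K, D) running class prototypes (updated elsewhere)
    Only confident pseudo-assignments contribute, standing in for an
    adaptive-threshold strategy.
    """
    dists = torch.cdist(features, prototypes)    # (N, K) L2 distances
    probs = F.softmax(-dists, dim=1)             # closer => higher probability
    conf, assign = probs.max(dim=1)              # pseudo-assignments
    idx = (conf > threshold).nonzero(as_tuple=True)[0]
    if idx.numel() == 0:                         # no confident pixels this batch
        return features.new_zeros(())
    # pull each confident feature toward its assigned prototype
    return dists[idx, assign[idx]].mean()
```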
The crow search algorithm (CSA) simulates the intelligent behavior of crows to solve multi-dimensional, linear, and nonlinear problems with appreciable success. Despite the high performance of CSA, stagnation in local optima and slow convergence are two likely problems when solving challenging optimization problems. In this paper, the standard CSA is improved to enhance its exploration and exploitation capabilities and its convergence speed by introducing an adaptive inertia weight factor and a roulette wheel selection scheme. The performance of the improved CSA (ICSA) is assessed on a range of standard unconstrained benchmark functions with different characteristics. The optimization results obtained with ICSA are validated by comparing them with those of the basic CSA and other optimization algorithms from the literature.
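The two modifications, an adaptive inertia weight and roulette wheel selection of the crow to follow, can be sketched as below for a minimization problem. This is a hedged reconstruction, not the paper's exact update rule; the parameter names (fl for flight length, ap for awareness probability, w_max/w_min) are conventional CSA notation used here as assumptions.

```python
import numpy as np

def icsa_step(positions, memory, fitness, iteration, max_iter,
              fl=2.0, ap=0.1, w_max=0.9, w_min=0.4, rng=np.random):
    """Sketch of one improved-CSA update (minimization).

    positions, memory: (n, d) current positions and best-found memories
    fitness:           (n,) fitness of each crow's memory (lower is better)
    """
    n, d = positions.shape
    w = w_max - (w_max - w_min) * iteration / max_iter  # adaptive inertia weight
    # roulette wheel: selection probability proportional to inverted fitness
    inv = fitness.max() - fitness + 1e-12
    probs = inv / inv.sum()
    new_pos = np.empty_like(positions)
    for i in range(n):
        j = rng.choice(n, p=probs)          # crow i follows roulette-picked crow j
        if rng.rand() >= ap:                # crow j unaware: move toward its memory
            new_pos[i] = (w * positions[i]
                          + rng.rand() * fl * (memory[j] - positions[i]))
        else:                               # crow j aware: random relocation,
            new_pos[i] = rng.uniform(-1, 1, d)  # assuming a [-1, 1]^d search space
    return new_pos
```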
Research on Multi-level Attention-based Human Pose Estimation Gao, Jun; Huang, Hua; Li, QiShen ...
2022 2nd International Conference on Algorithms, High Performance Computing and Artificial Intelligence (AHPCAI),
2022-10-21
Conference Proceeding
Detecting human keypoints from a single image is very challenging due to occlusion, blur, illumination, and scale changes. In this paper, this problem is addressed by designing an effective network structure. Since global and local information plays an important role in reasoning about human body structure and invisible keypoints, a Multi-level Attention Network (MAN) is proposed. First, compared with traditional multi-resolution networks, it generates the multi-resolution feature maps directly from the highest-resolution feature map, giving them greater information variance and increasing the richness of feature information after the final fusion. Second, it effectively integrates the global and local information in feature maps of different resolutions through the Feature Alignment Attention Block (FAAB) and strengthens them in a targeted manner. On the COCO dataset, with HRNet (Sun et al. [1]) as the baseline network, HRNet with MAN inserted improves on the baseline by 1.1-2.3 AP points.
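The first idea, deriving every lower-resolution branch directly from the highest-resolution map rather than cascading branch to branch, might be sketched as follows; the strided-convolution choice is an assumption, not the paper's design.

```python
import torch
import torch.nn as nn

class DirectMultiResolution(nn.Module):
    """Sketch: build each lower-resolution branch directly from the
    highest-resolution feature map, so branches carry more varied
    information before fusion.
    """
    def __init__(self, channels, n_branches=4):
        super().__init__()
        self.downs = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      stride=2 ** i, padding=1)     # strides 1, 2, 4, 8
            for i in range(n_branches)
        ])

    def forward(self, x_high):                      # x_high: (B, C, H, W)
        return [down(x_high) for down in self.downs]
```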
Instance segmentation is a comprehensive computer vision task that involves a wide range of other tasks. Recently, real-time instance segmentation methods have received more attention owing to the development of autonomous driving. Although existing real-time instance segmentation methods are fast, their accuracy does not meet practical needs. Most methods perform segmentation based on object detection, so their effectiveness depends heavily on detection quality. This paper proposes a new attention-based multiscale information fusion method building on Cheng et al. [1]. First, the PPM module of the baseline network is replaced with the Multiscale Context Attention (MSCA) module designed in this paper, which uses atrous convolutions with different dilation rates to obtain information at four scales and then uses non-local attention to enhance the features. It effectively suppresses the interference of redundant information in the instance segmentation results. Second, a new feature fusion approach is designed that no longer uses bilinear interpolation but sub-pixel upsampling combined with attention. Experiments on the COCO dataset demonstrate the effectiveness of this module, with a 0.5% improvement over the baseline network.
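A hedged sketch of an MSCA-style module: four parallel atrous convolutions at different dilation rates, concatenated and projected, then reweighted by attention. A simple channel gate stands in here for the paper's non-local attention, and all names and dilation rates are assumptions.

```python
import torch
import torch.nn as nn

class MSCA(nn.Module):
    """Sketch: multiscale context via parallel atrous convolutions,
    followed by an attention gate over the fused features.
    """
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates                       # one branch per scale
        ])
        self.project = nn.Conv2d(channels * len(rates), channels, 1)
        self.gate = nn.Sequential(               # channel-attention stand-in
            nn.AdaptiveAvgPool2d(1),             # for non-local attention
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        fused = self.project(multi)
        return fused * self.gate(fused)          # attention-reweighted features
```

The sub-pixel upsampling mentioned above corresponds, for example, to a channel-expanding convolution followed by torch.nn.PixelShuffle.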
Multi-scale Features Fusion Network for Single Image Deraining Lai, Yanming; Li, Qishen; Huang, Hua ...
2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC),
2022-06-17, Volume: 10
Conference Proceeding
Single-image rain removal is an important research direction in computer vision. In this paper, the Multi-scale Features Fusion Network (MFFN) is presented for rain removal. MFFN is mainly composed of a Multi-features Fusion Module (MFM) and a Dual Attention Module (DAM). In the MFM, an improved dense block and dilated convolutions form the feature extraction branches, which enlarges the receptive field of the network. A bottom-up connection scheme between branches helps the different branches make full use of the image information, and the branches are then merged to fuse their features. The DAM consists of the proposed SA block and a standard SE block; its purpose is to suppress the non-rain features extracted by the network. Experiments show that MFFN achieves better rain removal results than several advanced methods.
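As an illustration of the dual attention idea, the sketch below pairs a standard SE (channel) block with a simple spatial attention block; the paper's SA block is not specified in the abstract, so the spatial branch here is a generic stand-in and all layer choices are assumptions.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Sketch: channel attention (SE) followed by spatial attention,
    used to suppress non-rain features.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.se = nn.Sequential(                 # standard SE channel block
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.sa = nn.Sequential(                 # generic spatial attention
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.se(x)                                   # reweight channels
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)  # (B, 2, H, W)
        return x * self.sa(stats)                            # reweight positions
```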
Arbitrary-shape scene text detection is a challenging task due to background complexity and shape diversity. In this paper, we propose a dual-branch multi-resolution feature-aware enhancement network (DMFE): the lower branch constructs multi-resolution features through a weighted bidirectional feature pyramid network, and the upper branch enhances the perception of multi-scale text at each level through parallel pooling modules with receptive field enhancement. The joint action of global and local cues integrates high-level semantic information with low-level location information to generate high-quality feature maps. Extensive experiments on the ICDAR2015, CTW1500, and Total-Text datasets show that the proposed method effectively improves the detection performance on natural scene text.
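The weighted bidirectional fusion in the lower branch presumably combines feature levels with learnable weights; a minimal sketch of one such fusion step, in the spirit of BiFPN-style fast normalized fusion (an assumption on our part), is shown below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Sketch: fuse several feature maps with learnable non-negative
    weights, normalized so the result is a convex combination.
    Inputs are assumed already resized to a common shape.
    """
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):                 # feats: list of (B, C, H, W)
        w = F.relu(self.weights)              # keep weights non-negative
        w = w / (w.sum() + self.eps)          # normalize to sum to 1
        return sum(wi * f for wi, f in zip(w, feats))
```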
The purpose of few-shot learning is to learn a classification model with good generalization performance, so that the model can quickly generalize to new categories from only one or a few samples. In this paper, we use a metric-learning-based method to complete the few-shot image classification task. We introduce a feature learning module to improve the representation ability of the feature extraction network: through category traversal, the module extracts the intra-class common features and inter-class unique features of the images. Finally, classification is completed by an image-to-class measure that compares the similarity between the input image and the local descriptors of each category. The metric is computed by k-NN search over the local descriptors; in this way, all the local features of a category can be fully exploited, expressing the distribution of the class more richly and effectively.
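The image-to-class measure over local descriptors can be sketched as follows, in the spirit of DN4-style k-NN scoring (an assumption; the abstract does not name the exact formulation).

```python
import torch
import torch.nn.functional as F

def image_to_class_score(query_desc, class_desc, k=3):
    """Sketch: image-to-class measure via k-NN over local descriptors.

    query_desc: (M, D) local descriptors of the query image
    class_desc: (N, D) pooled local descriptors of one class's support set
    For each query descriptor, sum the cosine similarities of its k nearest
    descriptors from the class pool; higher means more similar.
    """
    q = F.normalize(query_desc, dim=1)
    c = F.normalize(class_desc, dim=1)
    sims = q @ c.t()                          # (M, N) cosine similarities
    topk = sims.topk(k, dim=1).values         # k nearest class descriptors
    return topk.sum()

# Classify by computing this score against each class and taking the argmax.
```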