•A novel unified end-to-end convolutional neural network architecture for small face detection is proposed.•A regression branch is introduced into the GAN-based architecture for further refining the locations of small faces in the wild.•New losses are designed to train the GAN-based network for small face detection in the wild.•Contextual information around face regions is further utilized to detect hard faces in real-world scenarios.•Our method outperforms previous state-of-the-art approaches by a large margin on the WIDER FACE dataset, especially on the most challenging Hard subset.
Face detection techniques have been developed for decades, and one of the remaining open challenges is detecting small faces in unconstrained conditions. The reason is that tiny faces often lack detailed information and are blurry. In this paper, we propose an algorithm to directly generate a clear high-resolution face from a small blurry one by adopting a generative adversarial network (GAN). Basic GAN formulations (e.g. SR-GAN and Cycle-GAN) achieve this by super-resolving and refining sequentially, whereas we design a novel network that addresses super-resolution and refinement jointly. Moreover, we introduce new training losses (i.e. a classification loss and a regression loss) to encourage the generator network to recover fine details of small faces and to guide the discriminator network to distinguish face vs. non-face and to refine locations simultaneously. Additionally, considering the importance of contextual information when detecting tiny faces in crowded scenes, the context around face regions is incorporated when training the proposed GAN-based network to mine very small faces from unconstrained scenarios. Extensive experiments on the challenging WIDER FACE and FDDB datasets demonstrate the effectiveness of the proposed method in restoring a clear high-resolution face from a small blurry one, and show that the achieved performance outperforms previous state-of-the-art methods by a large margin.
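The multi-task generator objective described above can be caricatured as a weighted sum of a pixel-wise reconstruction term, an adversarial term, and a face/non-face classification term. The NumPy sketch below is only an illustration of this loss structure, not the paper's implementation; the weights `w_adv` and `w_cls` and the exact form of each term are assumptions.

```python
import numpy as np

def generator_loss(sr_face, hr_face, d_adv_prob, d_face_prob, is_face,
                   w_adv=1e-3, w_cls=1e-2):
    """Toy multi-task generator loss (weights are illustrative)."""
    mse = np.mean((sr_face - hr_face) ** 2)                  # pixel reconstruction
    adv = -np.log(d_adv_prob + 1e-8)                         # fool the discriminator
    cls = -(is_face * np.log(d_face_prob + 1e-8)             # face vs. non-face
            + (1 - is_face) * np.log(1 - d_face_prob + 1e-8))
    return mse + w_adv * adv + w_cls * cls
```

A generator that reconstructs the high-resolution face perfectly and is classified as a real face incurs (near) zero loss; a misclassified face adds a penalty through the classification term.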
Due to the shortcomings of weakly supervised and fully supervised object detection (i.e., unsatisfactory performance and expensive annotations, respectively), leveraging partially labeled images in a cost-effective way to train an object detector has attracted much attention. In this paper, we formulate this challenging task as a missing bounding-box object detection problem. Specifically, we develop a pseudo ground truth mining procedure to automatically find the missing bounding boxes for the unlabeled instances in the training data, called pseudo ground truths here, and then combine the mined pseudo ground truths and the labeled annotations to train a fully supervised object detector. Furthermore, we propose an incremental learning framework that gradually incorporates the results of the trained fully supervised detector to improve the performance of missing bounding-box object detection. More importantly, we find an effective way to label massive numbers of images with limited labor and funding, which is crucial when building a large-scale weakly/webly labeled dataset for object detection. Extensive experiments on the PASCAL VOC and COCO benchmarks demonstrate that our proposed method can narrow the gap between fully supervised and weakly supervised object detectors, and outperforms the previous state-of-the-art weakly supervised detectors by a large margin (more than 3% absolute mAP) when the missing rate equals 0.9. Moreover, our proposed method with 30% missing bounding-box annotations can achieve performance comparable to some fully supervised detectors.
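A minimal sketch of such a mining step: keep high-confidence detections on a partially labeled image that do not overlap any existing annotation, and treat them as pseudo ground truths. The score and IoU thresholds below are illustrative assumptions; the paper's actual procedure is more involved.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mine_pseudo_gt(detections, labeled_boxes, score_thr=0.8, iou_thr=0.5):
    """Keep confident detections that do not match existing annotations."""
    pseudo = []
    for box, score in detections:
        if score < score_thr:
            continue  # low-confidence detections are discarded
        if all(iou(box, g) < iou_thr for g in labeled_boxes):
            pseudo.append(box)  # likely an unlabeled instance
    return pseudo
```

The mined boxes would then be merged with the labeled annotations to train a fully supervised detector, and the loop repeated as the detector improves.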
A fundamental and challenging problem in deep learning is catastrophic forgetting, i.e., the tendency of neural networks to fail to preserve the knowledge acquired from old tasks when learning new tasks. This problem has been widely investigated in the research community and several Incremental Learning (IL) approaches have been proposed in past years. While earlier works in computer vision have mostly focused on image classification and object detection, more recently some IL approaches for semantic segmentation have been introduced. These previous works showed that, despite its simplicity, knowledge distillation can be effectively employed to alleviate catastrophic forgetting. In this paper, we follow this research direction and, inspired by recent literature on contrastive learning, we propose a novel distillation framework, Uncertainty-aware Contrastive Distillation (UCD). In a nutshell, UCD operates by introducing a novel distillation loss that takes into account all the images in a mini-batch, enforcing similarity between features associated with all the pixels from the same classes and pulling apart those corresponding to pixels from different classes. In order to mitigate catastrophic forgetting, we contrast features of the new model with features extracted by a frozen model learned at the previous incremental step. Our experimental results demonstrate the advantage of the proposed distillation technique, which can be used in synergy with previous IL approaches, and leads to state-of-the-art performance on three commonly adopted benchmarks for incremental semantic segmentation.
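The attract/repel idea can be sketched as a supervised contrastive loss in which each new-model pixel feature treats same-class features of the frozen old model as positives and all other classes as negatives. This NumPy toy omits the uncertainty weighting and batch structure of UCD; the temperature `tau` is an assumption.

```python
import numpy as np

def contrastive_distillation(feat_new, feat_old, labels, tau=0.1):
    """Toy contrastive distillation over (N, D) pixel features."""
    f_new = feat_new / np.linalg.norm(feat_new, axis=1, keepdims=True)
    f_old = feat_old / np.linalg.norm(feat_old, axis=1, keepdims=True)
    sim = f_new @ f_old.T / tau          # (N, N) new-vs-frozen similarities
    loss = 0.0
    for i in range(len(labels)):
        pos = labels == labels[i]        # same-class old features are positives
        log_prob = sim[i] - np.log(np.exp(sim[i]).sum())
        loss += -log_prob[pos].mean()
    return loss / len(labels)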
In this paper, we study the task of source-free domain adaptation (SFDA), where the source data are not available during target adaptation. Previous works on SFDA mainly focus on aligning the cross-domain distributions. However, they ignore the generalization ability of the pretrained source model, which largely influences the initial target outputs that are vital to the target adaptation stage. To address this, we make the interesting observation that model accuracy is highly correlated with whether attention is focused on the objects in an image. To this end, we propose a generic and effective framework based on the Transformer, named TransDA, for learning a generalized model for SFDA. First, we apply Transformer blocks as the attention module and inject them into a convolutional network. By doing so, the model is encouraged to turn attention towards the object regions, which can effectively improve the model’s generalization ability on unseen target domains. Second, a novel self-supervised knowledge distillation approach is proposed to adapt the Transformer with target pseudo-labels, further encouraging the network to focus on the object regions. Extensive experiments conducted on three domain adaptation tasks, including closed-set, partial-set, and open-set adaptation, demonstrate that TransDA can significantly improve the accuracy over the source model and produces state-of-the-art results in all settings. The source code and pretrained models are publicly available at https://github.com/ygjwd12345/TransDA.
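The pseudo-label distillation step above can be caricatured as a soft cross-entropy between the adapted network's predictions and temperature-smoothed pseudo-label targets. The NumPy sketch below assumes plain softened-softmax targets and a temperature `T`; it is an illustration of the idea, not the TransDA implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pseudo_label_distill(student_logits, teacher_logits, T=2.0):
    """Soft cross-entropy against temperature-smoothed pseudo-labels."""
    p_t = softmax(teacher_logits / T)                 # pseudo-label targets
    log_p_s = np.log(softmax(student_logits / T) + 1e-12)
    return -(p_t * log_p_s).sum(axis=-1).mean()
```

A student that agrees with the pseudo-labels pays only the targets' entropy; disagreement increases the loss, pushing the adapted network toward the pseudo-labeled object regions.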
•A strong object detector is proposed to perform robust actor detection at the frame level.•A stabilized action localization network is designed for easier end-to-end training.•An anchor refine branch is introduced to generate deformable region proposals for modeling complicated actions.•Extensive experiments on widely used datasets demonstrate the effectiveness of the proposed method.
We address the problem of spatio-temporal action localization in videos. Current state-of-the-art methods for this challenging task rely on an object detector to first localize actors at the frame level, and then link or track the detections across time. Most of these methods pay more attention to leveraging the temporal context of videos for action detection while ignoring the importance of the object detector itself. In this paper, we demonstrate the importance of the object detector in the action localization pipeline, and propose a strong object detector, based on the single shot multibox detector (SSD) framework, for better action localization in videos. Different from SSD, we introduce an anchor refine branch at the end of the backbone network to refine the input anchors, and add a batch normalization layer before concatenating the intermediate feature maps at the frame level and after stacking feature maps at the clip level. The proposed strong detector makes two contributions: (1) reducing missed target objects at the frame level; (2) generating deformable anchor cuboids for modeling temporally dynamic actions. Extensive experiments on UCF-Sports, J-HMDB and UCF-101 validate our claims, and we outperform the previous state-of-the-art methods by a large margin in terms of frame-mAP and video-mAP, especially at higher overlap thresholds.
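Anchor refinement of the kind described above typically decodes predicted offsets into adjusted boxes. The sketch below uses the common (dx, dy, dw, dh) box-regression parameterization as an assumed stand-in for the paper's anchor refine branch.

```python
import numpy as np

def refine_anchors(anchors, deltas):
    """Apply (dx, dy, dw, dh) offsets to [x1, y1, x2, y2] anchors."""
    w = anchors[:, 2] - anchors[:, 0]
    h = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h
    cx2 = cx + deltas[:, 0] * w          # shift centers
    cy2 = cy + deltas[:, 1] * h
    w2 = w * np.exp(deltas[:, 2])        # rescale widths/heights
    h2 = h * np.exp(deltas[:, 3])
    return np.stack([cx2 - 0.5 * w2, cy2 - 0.5 * h2,
                     cx2 + 0.5 * w2, cy2 + 0.5 * h2], axis=1)
```

Applying per-frame deltas to the same initial anchor is one way such a branch can produce the deformable anchor cuboids mentioned in the abstract.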
•A novel Transformer-based architecture for object detection on very low-resolution images is proposed.•An image down-sampling module is designed that utilizes the feature extraction capabilities of CNNs to generate thumbnail images from the original ones.•A distillation-boost supervision strategy is introduced to maintain the detection performance of thumbnail images at the level of original-size inputs.•Our method achieves satisfactory detection performance (i.e. 32.3% mAP) while drastically reducing computation and memory requirements (1.26× speedup).•Our method outperforms the bicubic method by a large margin (3.2% mAP) and achieves results comparable to state-of-the-art methods on the COCO dataset.
Computer vision has witnessed great success thanks to deep convolutional neural networks (CNNs). However, state-of-the-art methods often benefit from large models and datasets, which introduce heavy parameter and computational requirements. Deploying such large models in real-world applications is very difficult because of limited computing resources. Although many researchers focus on designing efficient block structures to compress model parameters, they overlook that the scale of input images is also an important factor in algorithm efficiency. Reducing input resolution is a useful way to boost runtime efficiency; however, traditional interpolation methods assume a fixed degradation criterion that greatly hurts performance. To solve the above problems, in this paper, we propose a novel framework named ThumbDet for reducing model computation while maintaining detection accuracy. In our framework, we first design an image down-sampling module to learn a small-scale image that looks realistic and contains discriminative properties. Furthermore, we propose a distillation-boost supervision strategy to maintain the detection performance of small-scale images at the level of original-size inputs. Extensive experiments conducted on the standard object detection dataset MS COCO demonstrate the effectiveness of the proposed method when using very low-resolution images (i.e. 4× down-sampling) as inputs. In particular, ThumbDet achieves satisfactory detection performance (i.e. 32.3% mAP) while drastically reducing computation and memory requirements (i.e. a 1.26× speedup), outperforming traditional interpolation methods (e.g. bicubic) by +3.2% absolute mAP.
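For a concrete sense of the 4× down-sampling setting, the sketch below reduces an image to a thumbnail by plain 4×4 block averaging. This is only a fixed-degradation baseline of the kind the abstract criticizes; the paper's down-sampling module is a learned CNN, which this sketch does not reproduce.

```python
import numpy as np

def downsample4x(img):
    """Naive 4x block-average down-sampling of an (H, W, C) image."""
    h, w = img.shape[0] // 4 * 4, img.shape[1] // 4 * 4
    x = img[:h, :w]                       # crop to a multiple of 4
    return x.reshape(h // 4, 4, w // 4, 4, -1).mean(axis=(1, 3))
```

A learned module would replace this fixed averaging with a network trained, together with the distillation-boost supervision, to keep the thumbnail discriminative for detection.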
Regional socio-economics has multiple and far-reaching influences on the development of top-tier education, and the development of top-tier education in turn largely reflects the strength and level of regional socio-economics. The present study investigates the spatial and temporal differences in the distribution pattern of doctoral disciplines of Chinese “double first-class” construction universities between 1996 and 2022 and the influence of regional socio-economics from the two perspectives of economic regions and provinces. The study shows that there are differences in the development speed, scale, and level of top-tier education among different economic divisions and provinces, and the overall pattern of “fast in the east and slow in the west”, “more in the east and less in the west”, and “strong in the east and weak in the west” is unbalanced. At the same time, there is a high degree of concentration within different economic regions. The main reasons include the historical problem of the unbalanced allocation of China’s educational resources, the uneven geographical distribution of relevant national policies, and the influence of regional socio-economic development differences, especially GDP and literacy rate. In terms of the influence mechanism, the development mindset of universities themselves, especially the strategy for talent cultivation and introduction, is the fundamental factor in the formation of the top-tier education development gap among different economic regions and provinces.
Convolutional neural networks have enabled major progress in addressing pixel-level prediction tasks such as semantic segmentation, depth estimation, and surface normal prediction, benefiting from their powerful capabilities in visual representation learning. Typically, state-of-the-art models integrate attention mechanisms for improved deep feature representations. Recently, some works have demonstrated the significance of learning and combining both spatial- and channel-wise attentions for deep feature refinement. In this paper, we aim at effectively boosting previous approaches and propose a unified deep framework to jointly learn both spatial attention maps and channel attention vectors in a principled manner, so as to structure the resulting attention tensors and model interactions between these two types of attention. Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework, leading to VarIational STructured Attention networks (VISTA-Net). We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic parameters and the CNN front-end parameters. As demonstrated by our extensive empirical evaluation on six large-scale datasets for dense visual prediction, VISTA-Net outperforms the state of the art in multiple continuous and discrete prediction tasks, confirming the benefit of the proposed approach to joint structured spatial-channel attention estimation for deep representation learning. The code is available at https://github.com/ygjwd12345/VISTA-Net.
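The idea of combining a channel attention vector with a spatial attention map can be illustrated with a deterministic toy: derive both from feature averages and apply their product to the features. This is only the factorized-attention skeleton; VISTA-Net learns both attentions jointly within a variational framework, which this sketch does not attempt.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def structured_attention(feat):
    """Toy spatial-channel attention on a (C, H, W) feature tensor."""
    c_att = sigmoid(feat.mean(axis=(1, 2)))   # (C,) channel attention vector
    s_att = sigmoid(feat.mean(axis=0))        # (H, W) spatial attention map
    return feat * c_att[:, None, None] * s_att[None]
```

The outer product of the vector and the map forms a rank-one attention tensor over all (channel, position) pairs; the paper's contribution is modeling the interaction between the two factors rather than fixing it as here.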
Over the past years, semantic segmentation, like many other tasks in computer vision, has benefited from the progress in deep neural networks, resulting in significantly improved performance. However, deep architectures trained with gradient-based techniques suffer from catastrophic forgetting, which is the tendency to forget previously learned knowledge while learning new tasks. Aiming at devising strategies to counteract this effect, incremental learning approaches have gained popularity over the past years. However, the first incremental learning methods for semantic segmentation appeared only recently. While effective, these approaches do not account for a crucial aspect of pixel-level dense prediction problems, i.e., the role of attention mechanisms. To fill this gap, in this paper, we introduce a novel attentive feature distillation approach to mitigate catastrophic forgetting while accounting for semantic spatial- and channel-level dependencies. Furthermore, we propose a continual attentive fusion structure, which takes advantage of the attention learned from the new and the old tasks while learning features for the new task. Finally, we also introduce a novel strategy to account for the background class in the distillation loss, thus preventing biased predictions. We demonstrate the effectiveness of our approach with an extensive evaluation on Pascal-VOC 2012 and ADE20K, setting a new state of the art.
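One common way to account for the background class in an incremental-segmentation distillation loss is to fold the new classes' probabilities into the background channel before matching the old model, since the old model saw those pixels as background. The sketch below illustrates that folding step; the channel layout (background first) and the plain cross-entropy form are assumptions, not this paper's exact loss.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bg_aware_distill(old_logits, new_logits, n_old):
    """Distill (N, n_old) old outputs against folded (N, n_total) new outputs."""
    p_new = softmax(new_logits)
    # Fold new-class mass into the background channel (index 0).
    folded = np.concatenate(
        [p_new[:, :1] + p_new[:, n_old:].sum(axis=1, keepdims=True),
         p_new[:, 1:n_old]], axis=1)
    p_old = softmax(old_logits)
    return -(p_old * np.log(folded + 1e-12)).sum(axis=1).mean()
```

Without this remapping, pixels of newly added classes would be penalized for no longer looking like background, which is exactly the biased-prediction failure the abstract mentions.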