While analyzing the performance of state-of-the-art R-CNN based generic object detectors, we find that the detection performance for objects with low object-region-percentages (ORPs) of the bounding boxes is much lower than the overall average; elongated objects are a typical example. To address the problem of low ORPs in elongated object detection, we propose a hybrid approach which employs a Faster R-CNN to achieve robust detection of object parts, and a novel model-driven clustering algorithm to group the related partial detections and suppress false detections. First, we train a Faster R-CNN with partial region proposals of suitable and stable ORPs. Next, we introduce a deep CNN (DCNN) for orientation classification on the partial detections. Then, on the outputs of the Faster R-CNN and the DCNN, the adaptive model-driven clustering algorithm initializes a model of an elongated object with a data-driven process on local partial detections, and refines the model iteratively by model-driven clustering and data-driven model updating. By exploiting the Faster R-CNN to produce robust partial detections and model-driven clustering to form a global representation, our method generates a tight oriented bounding box for each elongated object. We evaluate the effectiveness of our approach on two typical elongated object categories in the COCO dataset, as well as other elongated objects, including rigid objects (pens, screwdrivers and wrenches) and non-rigid objects (cracks). Experimental results show that, compared with state-of-the-art approaches, our method improves both detection and localization of elongated objects in images by a large margin.
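As a rough illustration of the clustering stage, the sketch below (with hypothetical tolerances and interfaces; the paper's model update is more elaborate) seeds a line model from one partial detection, then alternates model-driven assignment with data-driven refitting until membership stabilizes:

```python
import numpy as np

def fit_line(points):
    """Least-squares line fit: returns unit direction and centroid."""
    c = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - c)   # principal direction via SVD
    return vt[0], c

def cluster_elongated(centers, angles, seed_idx,
                      dist_tol=20.0, ang_tol=15.0, iters=5):
    """Group partial detections into one elongated-object model.

    centers: (N, 2) array of partial-detection box centers;
    angles: (N,) predicted orientations in degrees (from the DCNN).
    Returns indices of detections assigned to the object model.
    """
    members = {seed_idx}
    for _ in range(iters):
        pts = centers[sorted(members)]
        if len(pts) < 2:
            # Data-driven initialization from the seed detection alone.
            theta = np.deg2rad(angles[seed_idx])
            d, c = np.array([np.cos(theta), np.sin(theta)]), centers[seed_idx]
        else:
            d, c = fit_line(pts)           # data-driven model update
        line_ang = np.rad2deg(np.arctan2(d[1], d[0])) % 180
        # Model-driven clustering: keep detections near the line with a
        # consistent orientation; everything else is a false detection.
        off = centers - c
        perp = np.abs(-off[:, 0] * d[1] + off[:, 1] * d[0])
        diff = np.abs(angles % 180 - line_ang)
        diff = np.minimum(diff, 180 - diff)
        new = set(np.nonzero((perp < dist_tol) & (diff < ang_tol))[0]) | {seed_idx}
        if new == members:
            break
        members = new
    return sorted(members)
```

A tight oriented bounding box can then be fitted to the member detections' extents along the final line direction.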
•A novel hybrid approach which integrates a Faster R-CNN for crack patch detection, a DCNN for crack orientation recognition, and a Bayesian algorithm for integration. It provides a novel framework to combine deep learning models and Bayesian analysis for challenging vision problems where deep learning with a simple end-to-end learning strategy might not be effective.
•A distinctive approach that applies a Faster R-CNN to the challenging task of crack detection by training it to detect crack patches of suitable SNR, and a semi-automatic method to annotate crack patches of suitable scales for training the Faster R-CNN.
•A new Bayesian integration algorithm based on local spatial proximity, orientation consistency and alignment consistency to connect associated neighboring crack patches and suppress false detections, as well as an efficient algorithm to learn the optimal parameters.
Vision-based crack detection is of crucial importance in various industries, and it is very challenging due to weak signals in noisy backgrounds. In this paper, we propose a novel hybrid approach for crack detection in raw images, which combines deep learning models and Bayesian probabilistic analysis for robust crack detection. First, we re-train a state-of-the-art object detector (a Faster R-CNN) to detect crack patches of suitable signal-to-noise ratio (SNR). We design a semi-automatic method to generate ground truths of crack patches along crack lines for training. To further improve the accuracy of crack detection over the whole image, we propose a Bayesian integration algorithm to suppress false detections. Specifically, we use a deep CNN to recognize the orientation of the crack segment in each detected patch. Then, a Bayesian probability is computed on the accumulated evidence from adjacent detected patches within a neighborhood, based on spatial proximity, orientation consistency and alignment consistency. A patch which lacks local support is suppressed as a false detection. An algorithm to learn the parameters of the Bayesian integration is also derived. Extensive experiments and evaluations are performed on a new comprehensive dataset of crack images. The results show that our approach outperforms a state-of-the-art baseline based on a deep CNN classifier. Ablation experiments are also conducted to show the effectiveness of the proposed techniques.
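To make the evidence-accumulation step concrete, here is a toy version (the kernels, priors and thresholds are our assumptions; the paper learns its parameters) that scores each patch by likelihood-ratio votes from its neighbors and suppresses unsupported patches:

```python
import numpy as np

def bayesian_support(centers, angles, prior=0.5, radius=60.0,
                     ang_sigma_deg=20.0, keep_thresh=0.6):
    """Keep a patch only if neighboring patches support it.

    centers: (N, 2) patch centers; angles: (N,) crack orientations in
    radians from the orientation DCNN. Each neighbor casts a
    likelihood-ratio vote built from spatial proximity (radius gate),
    orientation consistency (Gaussian kernel) and alignment consistency
    (the neighbor lies along, not beside, the crack direction).
    """
    sigma = np.deg2rad(ang_sigma_deg)
    keep = np.zeros(len(centers), dtype=bool)
    for i in range(len(centers)):
        log_odds = np.log(prior / (1.0 - prior))
        di = np.array([np.cos(angles[i]), np.sin(angles[i])])
        for j in range(len(centers)):
            if j == i:
                continue
            off = centers[j] - centers[i]
            dist = np.linalg.norm(off)
            if dist > radius:
                continue                       # outside the neighborhood
            ang = np.abs(angles[j] - angles[i]) % np.pi
            ang = min(ang, np.pi - ang)        # undirected orientation gap
            align = abs(off @ di) / (dist + 1e-9)
            evidence = np.exp(-ang**2 / (2 * sigma**2)) * align  # in [0, 1]
            log_odds += np.log((0.5 + evidence) / (1.5 - evidence))
        keep[i] = 1.0 / (1.0 + np.exp(-log_odds)) > keep_thresh
    return keep
```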
3. Learning Deep Hierarchical Visual Feature Coding. Hanlin Goh; Nicolas Thome; Matthieu Cord ... IEEE Transactions on Neural Networks and Learning Systems, Volume 25, Issue 12, 12/2014. Journal Article.
In this paper, we propose a hybrid architecture that combines the image modeling strengths of the bag of words framework with the representational power and adaptability of deep learning architectures. Local gradient-based descriptors, such as SIFT, are encoded via a hierarchical coding scheme composed of spatial aggregating restricted Boltzmann machines (RBM). For each coding layer, we regularize the RBM by encouraging representations to fit both sparse and selective distributions. Supervised fine-tuning is used to enhance the quality of the visual representation for the categorization task. We performed a thorough experimental evaluation using three image categorization data sets. The hierarchical coding scheme achieved competitive categorization accuracies of 79.7% and 86.4% on the Caltech-101 and 15-Scenes data sets, respectively. The visual representations learned are compact and the model's inference is fast, as compared with sparse coding methods. The low-level representations of descriptors that were learned using this method result in generic features that we empirically found to be transferable between different image data sets. Further analysis reveals the significance of supervised fine-tuning when the architecture has two layers of representations as opposed to a single layer.
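The sparse-and-selective regularization can be caricatured as follows. The paper fits hidden activations to target distributions; this stand-in simply pulls both the per-unit and per-sample mean activation rates toward a low target, which is our simplification:

```python
import torch

def sparse_selective_penalty(h, target=0.1):
    """Stand-in regularizer for RBM hidden activations h: (batch, units).

    Sparsity: each unit should fire rarely across the batch (column
    means near the target rate). Selectivity: each sample should
    activate few units (row means near the target rate). Squared error
    toward the target rate replaces the paper's distribution fitting.
    """
    sparsity = ((h.mean(dim=0) - target) ** 2).sum()     # per-unit rates
    selectivity = ((h.mean(dim=1) - target) ** 2).sum()  # per-sample rates
    return sparsity + selectivity
```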
Although object detection has achieved significant progress in the past decade, detecting small objects is still far from satisfactory due to the high variability of object scales and complex backgrounds. The common way to enhance small object detection is to use high-resolution (HR) images. However, this incurs a huge computational cost that grows quadratically with the image resolution. To achieve both accuracy and efficiency, we propose a novel reinforcement learning framework that employs an efficient policy network consisting of a Spatial Transformation Network to enhance state representation learning and a Transformer model with early convolution to improve feature extraction. Our method has two main steps: (1) coarse location query (CLQ), where an RL agent is trained to predict the locations of small objects on low-resolution (LR) images (down-sampled versions of the HR images); (2) context-sensitive object detection, where HR image patches are used to detect objects at the selected coarse locations and LR image patches are used on background areas (containing no small objects). In this way, we obtain high detection performance on small objects while avoiding unnecessary computation on background areas. The proposed method has been tested and benchmarked on various datasets. On the Caltech Pedestrians Detection and Web Pedestrians datasets, the proposed method improves the detection accuracy by 2% while reducing the number of processed pixels. On the Vision meets Drone object detection dataset and the Oil and Gas Storage Tank dataset, the proposed method outperforms the state-of-the-art (SotA) methods. On the MS COCO mini-val set, our method outperforms SotA methods on small object detection, while also achieving comparable performance on medium and large objects.
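A schematic of the two-step pipeline, where `policy` and `detector` are hypothetical callables standing in for the trained RL agent and the object detector, and `grid` and `tau` are illustrative choices:

```python
import numpy as np

def detect_with_clq(lr_image, hr_image, policy, detector, grid=4, tau=0.5):
    """Coarse location query, then mixed-resolution detection.

    policy(lr_image) -> (grid, grid) scores that a cell contains small
    objects; detector(patch) -> boxes in patch coordinates.
    """
    scores = policy(lr_image)                    # step 1: CLQ on the LR image
    H, W = hr_image.shape[:2]
    h, w = lr_image.shape[:2]
    detections = []
    for gy in range(grid):
        for gx in range(grid):
            if scores[gy, gx] > tau:             # likely small objects: use HR
                patch = hr_image[gy * H // grid:(gy + 1) * H // grid,
                                 gx * W // grid:(gx + 1) * W // grid]
            else:                                # background area: stay on LR
                patch = lr_image[gy * h // grid:(gy + 1) * h // grid,
                                 gx * w // grid:(gx + 1) * w // grid]
            detections.extend((gy, gx, box) for box in detector(patch))
    return detections
```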
Models for image semantics understanding, such as deep learning (DL) models and mathematical models, are often trained on a specific dataset or configured with specific parameters. Deploying such models on new tasks in a different test environment requires considerable effort to re-train the model, or extensive expertise to tune the parameters. In this paper, we propose a smart reinforcement learning (RL) agent that learns to tune parameters automatically to enhance model performance. The learning process is formulated as a generic control task for parameter adjustment and applied to two scenarios: (1) tuning image attributes to improve the object detection performance of a fixed DL model, and (2) tuning the parameters of a mathematical model (Level Set) for image segmentation. We design a novel dynamic threshold mechanism in a multi-branch RL agent to effectively tune the parameters of image qualities (for object detection) and of Level Set models (for object segmentation). We conduct experiments on the Pascal-VOC test set, the MS COCO validation set and a proprietary dataset of industrial components, where we achieve substantial improvements in object detection accuracy. We also perform experiments on the automatic parameter tuning of Level Set models. Results show that our method yields considerable performance improvements on public datasets compared with the baseline method.
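The generic control loop might look like the following sketch, where `agent` and `evaluate` are hypothetical interfaces and the decaying stopping bar stands in for the paper's dynamic threshold mechanism:

```python
def tune_parameters(params, agent, evaluate, max_steps=20, eps0=0.01):
    """Generic control loop for parameter tuning (hypothetical interfaces).

    agent(state) -> dict of per-parameter adjustments; evaluate(params) ->
    task score, e.g. detection mAP or a Level Set segmentation measure.
    """
    score = evaluate(params)
    best, best_score = dict(params), score
    for step in range(max_steps):
        deltas = agent({"params": dict(params), "score": score})
        params = {k: v + deltas.get(k, 0.0) for k, v in params.items()}
        new_score = evaluate(params)
        if new_score > best_score:
            best, best_score = dict(params), new_score
        # Dynamic threshold: stop once the per-step gain falls below a
        # bar that shrinks as tuning progresses.
        if new_score - score < eps0 / (1 + step):
            break
        score = new_score
    return best, best_score
```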
The introduction of wearable video cameras (e.g., GoPro) in the consumer market has promoted video life-logging, motivating users to generate large amounts of video data. This increasing flow of first-person video has led to a growing need for automatic video summarization adapted to the characteristics and applications of egocentric video. With this paper, we provide the first comprehensive survey of the techniques used specifically to summarize egocentric videos. We present a framework for first-person view summarization and compare the segmentation methods and selection algorithms used by the related work in the literature. Next, we describe the existing egocentric video datasets suitable for summarization and, then, the various evaluation methods. Finally, we analyze the challenges and opportunities in the field and propose new lines of research.
•An EGA-Net is proposed for weakly-supervised temporal action localization, which treats background frames as out-of-domain samples.
•A novel Entropy Guided Loss is proposed to leverage entropy to distinguish action from background.
•A new Global Similarity Loss is designed to enhance action features by pushing them toward their corresponding class centers.
•Extensive experiments are conducted on three challenging benchmark datasets with superior results.
One major challenge of Weakly-supervised Temporal Action Localization (WTAL) is to handle the diverse backgrounds in videos. To model background frames, most existing methods treat them as an additional action class. However, because background frames usually do not share common semantics, squeezing all the different background frames into a single class hinders network optimization. Moreover, the network is easily confused and tends to fail when tested on videos with unseen background frames. To address this problem, we propose an Entropy Guided Attention Network (EGA-Net) that treats background frames as out-of-domain samples. Specifically, we design a two-branch module, where a domain branch detects whether a frame is an action by learning a class-agnostic attention map, and an action branch recognizes the action category of the frame by learning a class-specific attention map. By aggregating the two attention maps to model the joint domain-class distribution of frames, our EGA-Net can handle varying backgrounds. To train the class-agnostic attention map with only video-level class labels, we propose an Entropy Guided Loss (EGL), which employs entropy as the supervision signal to distinguish action from background. Moreover, we propose a Global Similarity Loss (GSL) to enhance the action-specific attention map via action class centers. Extensive experiments on the THUMOS14, ActivityNet1.2 and ActivityNet1.3 datasets demonstrate the effectiveness of our EGA-Net.
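A simplified reading of the entropy-guided supervision (our construction, not the paper's exact loss): frames with confident class predictions are pushed toward high class-agnostic attention, uncertain frames toward low attention:

```python
import math
import torch
import torch.nn.functional as F

def entropy_guided_loss(cls_logits, domain_attn):
    """Toy entropy-guided objective for the class-agnostic attention.

    cls_logits: (T, C) per-frame action-class logits from the action
    branch; domain_attn: (T,) class-agnostic attention in (0, 1) from
    the domain branch. Low class entropy (confident action frame) maps
    to an attention target near 1; high entropy (likely background)
    maps to a target near 0, requiring only video-level labels.
    """
    p = F.softmax(cls_logits, dim=-1)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=-1)          # (T,)
    target = 1.0 - entropy / math.log(cls_logits.shape[-1])   # in [0, 1]
    return F.binary_cross_entropy(domain_attn, target.detach())
```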
This paper presents a visual saliency modeling technique that is efficient and tolerant to image scale variation. Different from existing approaches that rely on a large number of filters or complicated learning processes, the proposed technique computes saliency from image histograms. Several two-dimensional image co-occurrence histograms are used, which encode not only "how many" (occurrence) but also "where and how" (co-occurrence) image pixels are composed into a visual image, hence capturing the "unusualness" of an object or image region that is often perceived by either global "uncommonness" (i.e., low occurrence frequency) or local "discontinuity" with respect to the surroundings (i.e., low co-occurrence frequency). The proposed technique has a number of advantageous characteristics. It is fast and very easy to implement. At the same time, it involves minimal parameter tuning, requires no training, and is robust to image scale variation. Experiments on the AIM dataset show that a superior shuffled AUC (sAUC) of 0.7221 is obtained, which is higher than the state-of-the-art sAUC of 0.7187.
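The core computation is simple enough to sketch directly. This toy version (one horizontal offset and self-information as the rarity score are our simplifications) scores a pixel pair as salient when its quantized intensity pair is rare:

```python
import numpy as np

def cooccurrence_saliency(gray, bins=64):
    """Saliency from a 2D co-occurrence histogram (simplified sketch).

    gray: 2D uint8 array. Horizontally adjacent pixel pairs vote into a
    bins x bins histogram; rare pairs (low co-occurrence frequency) are
    scored as salient, so the map reflects both global uncommonness and
    local discontinuity with respect to the surroundings.
    """
    q = (gray.astype(np.int64) * bins) // 256            # quantize intensities
    a, b = q[:, :-1].ravel(), q[:, 1:].ravel()           # horizontal pairs
    hist = np.bincount(a * bins + b, minlength=bins * bins).astype(np.float64)
    hist /= hist.sum()
    # Self-information of each observed pair: rarer pairs score higher.
    info = -np.log(hist[a * bins + b] + 1e-12)
    sal = np.zeros(gray.shape, dtype=np.float64)
    sal[:, :-1] += info.reshape(q.shape[0], q.shape[1] - 1)
    sal[:, 1:] += info.reshape(q.shape[0], q.shape[1] - 1)
    return sal / (sal.max() + 1e-12)
```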
The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features without considering relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism to assign high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations between objects are relative, we integrate relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the mutual interference between spatial and temporal relation reasoning. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of our KRST over multiple state-of-the-art methods.
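The keyword-aware encoding step admits a compact sketch; the fixed boost of 2.0 and the mean-vector relevance score are our assumptions, not the paper's design:

```python
import torch
import torch.nn.functional as F

def keyword_aware_encoding(word_feats, keyword_mask, temperature=1.0):
    """Keyword-aware question pooling (simplified assumption).

    word_feats: (L, D) word features; keyword_mask: (L,) float tensor
    with 1.0 for keywords (e.g. tagged nouns/verbs), 0.0 otherwise.
    Attention logits are boosted for keywords so they dominate the
    pooled question feature that later guides video graph construction.
    """
    scores = word_feats @ word_feats.mean(dim=0)       # (L,) relevance logits
    scores = scores + 2.0 * keyword_mask               # boost keyword weights
    attn = F.softmax(scores / temperature, dim=0)
    return attn @ word_feats                           # (D,) question feature
```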
Scene graphs connect individual objects with visual relationships. They serve as a comprehensive scene representation for downstream multimodal tasks. However, in exploring recent progress in Scene Graph Generation (SGG), we find that the performance of recent works is highly limited by pairwise relationship modeling through naive feature concatenation. Such pairwise features lack sufficient object interaction due to mis-aligned object parts, resulting in non-discriminative pairwise features for visual relationship prediction. For example, naively concatenated pairwise features often make the model fail to discriminate between riding and feeding for the object pair person and horse. To this end, we design a meta-architecture, learning-to-align, for dynamic object feature concatenation, and call our model Align R-CNN. Specifically, we introduce a novel attention-based multiple-region alignment module that can be jointly optimized with SGG. Experiments on the large-scale SGG benchmark Visual Genome show that the proposed Align R-CNN can replace naive feature concatenation and thus boost all existing SGG methods.
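A plausible shape for such an alignment module (our guess at the design; the head count and pooling are assumptions, and `dim` must be divisible by `num_heads`):

```python
import torch
import torch.nn as nn

class RegionAlignConcat(nn.Module):
    """Learned alignment before pairwise concatenation (sketch).

    Instead of concatenating whole subject/object features, each of K
    subject sub-regions attends over the K object sub-regions, so the
    interacting parts (e.g. a rider's legs and a horse's back) are
    aligned before the pairwise feature is formed.
    """
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, subj, obj):
        # subj, obj: (B, K, D) sub-region features of the two objects.
        aligned, _ = self.attn(query=subj, key=obj, value=obj)
        pair = torch.cat([subj, aligned], dim=-1)  # (B, K, 2D) aligned pairs
        return pair.mean(dim=1)                    # (B, 2D) pairwise feature
```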