Devising automated procedures for accurate segmentation of retinal vessels is crucial for timely diagnosis of vision-threatening eye diseases. In this paper, a novel supervised deep learning-based approach is proposed which extends a variant of the fully convolutional neural network. Existing fully convolutional network-based counterparts suffer from critical drawbacks: a large number of tunable hyper-parameters and an increased end-to-end training time caused by their decoder structure. The proposed approach addresses these challenges through a skip-connection strategy in which the max-pooling indices of each encoder stage are shared with the corresponding decoder stage to enhance the resolution of the feature maps. This significantly reduces the number of required tunable hyper-parameters and the computational overhead of both the training and testing stages. Furthermore, the proposed approach eliminates the need for both pre-processing and post-processing steps. The retinal vessel segmentation problem is formulated as a semantic pixel-wise segmentation task, which helps bridge the gap between semantic segmentation and medical image segmentation. A prime contribution of the proposed approach is the introduction of an external skip-connection that passes preserved low-level semantic edge information in order to reliably detect tiny vessels in retinal fundus images. The performance of the proposed scheme is analyzed on three publicly available, widely used fundus image datasets, with the recognized evaluation metrics of specificity, sensitivity, accuracy, and Receiver Operating Characteristic (ROC) curves.
Based on the assessment of images in the DRIVE, CHASE_DB1, and STARE datasets, the proposed approach achieves a sensitivity of 0.8252, 0.8440, and 0.8397; a specificity of 0.9787, 0.9810, and 0.9792; an accuracy of 0.9649, 0.9722, and 0.9659; and an ROC performance of 0.9780, 0.9830, and 0.9810, respectively. The reduced computational complexity and memory overhead, along with improved segmentation performance, advocate employing the proposed approach in automated diagnostic systems for eye diseases.
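The index-sharing decoder described above follows a SegNet-style max-unpooling scheme: the encoder records where each pooled maximum came from, and the decoder reuses those positions for parameter-free upsampling. A minimal NumPy sketch of the idea (the function names and the 2×2 window size are illustrative, not taken from the paper):

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max pooling that also records the argmax position of each window."""
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    idx = np.zeros((h // k, w // k), dtype=int)  # flat index into the input
    for i in range(h // k):
        for j in range(w // k):
            win = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            r, c = np.unravel_index(np.argmax(win), win.shape)
            out[i, j] = win[r, c]
            idx[i, j] = (i * k + r) * w + (j * k + c)
    return out, idx

def max_unpool(pooled, idx, shape):
    """Parameter-free upsampling: place each value back at its recorded index."""
    up = np.zeros(shape)
    up.flat[idx.ravel()] = pooled.ravel()
    return up

x = np.array([[1., 2., 0., 3.],
              [4., 0., 1., 0.],
              [0., 5., 2., 0.],
              [1., 0., 0., 6.]])
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx, x.shape)
```

Because the upsampling reuses recorded indices instead of learned deconvolution filters, the decoder carries no trainable weights for this step, which is where the reduction in tunable parameters comes from.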
•Comprehensive comparison of state-of-the-art segmentation models on occluded branches.
•Generative adversarial network applied to segmentation of occluded branches.
•Development of effective difficulty indices for occlusion-oriented evaluation.
Fruit tree pruning and fruit thinning require a powerful vision system that can provide high-resolution segmentation of fruit trees and their branches. Recent works either consider only the dormant season, when there are minimal occlusions on the branches, or fit a polynomial curve to reconstruct branch shape, losing information about branch thickness. In this work, we apply two state-of-the-art supervised learning models, U-Net and DeepLabv3, and a conditional Generative Adversarial Network, Pix2Pix (with and without the discriminator), to segment partially occluded 2D-open-V apple trees. Binary accuracy, Mean IoU, Boundary F1 score, and Occluded branch recall are used to evaluate the performance of the models. DeepLabv3 outperforms the other models in Binary accuracy, Mean IoU, and Boundary F1 score, but is surpassed by Pix2Pix (without discriminator) and U-Net in Occluded branch recall. We define two difficulty indices to quantify the difficulty of the task: (1) an Occlusion Difficulty Index and (2) a Depth Difficulty Index. The ten hardest images under each difficulty index are analyzed by means of Branch Recall and Occluded Branch Recall, where U-Net outperforms the other two models on the current metrics. On the other hand, Pix2Pix (without discriminator) provides more information on branch paths, which is not reflected by the metrics; this highlights the need for metrics more specific to recovering occluded information. Future work is required to further enhance the models to recover more information from occlusions so that this technology can be applied to automating agricultural tasks in a commercial environment.
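As one concrete reading of the Occluded branch recall metric above (the paper's exact definition may differ), it can be computed as the fraction of ground-truth branch pixels lying under an occluder that the prediction recovers. A small NumPy sketch, alongside a two-class Mean IoU, with illustrative toy masks:

```python
import numpy as np

def occluded_branch_recall(pred, gt_branch, occlusion_mask):
    """Recall computed only over branch pixels hidden behind an occluder.

    pred, gt_branch, occlusion_mask: boolean arrays of the same shape.
    """
    occluded_gt = gt_branch & occlusion_mask
    if not occluded_gt.any():
        return 1.0  # nothing occluded: vacuously perfect
    return float((pred & occluded_gt).sum() / occluded_gt.sum())

def mean_iou(pred, gt):
    """Mean IoU over the two classes (branch / background)."""
    ious = []
    for cls in (True, False):
        p, g = pred == cls, gt == cls
        inter, union = (p & g).sum(), (p | g).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))

gt = np.array([[1, 1, 0, 0], [1, 1, 0, 0]], dtype=bool)   # branch pixels
occ = np.array([[0, 1, 0, 0], [0, 1, 0, 0]], dtype=bool)  # occluder region
pred = np.array([[1, 1, 0, 0], [1, 0, 0, 0]], dtype=bool) # misses one occluded pixel

recall = occluded_branch_recall(pred, gt, occ)  # 1 of 2 occluded pixels found
miou = mean_iou(pred, gt)
```

A metric restricted to occluded pixels rewards exactly the recovery behavior that overall accuracy and Mean IoU dilute, which is the gap the abstract points out.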
Over the past few years, deep convolutional neural network-based methods have made great progress in semantic segmentation of street scenes. Some recent methods align feature maps to alleviate the semantic gap between them and achieve high segmentation accuracy. However, they usually adopt feature alignment modules with the same network configuration throughout the decoder and thus ignore the different roles of the stages of the decoder during feature aggregation, leading to a complex decoder structure. Such a manner greatly affects the inference speed. In this paper, we present a novel Stage-aware Feature Alignment Network (SFANet) based on the encoder-decoder structure for real-time semantic segmentation of street scenes. Specifically, a Stage-aware Feature Alignment module (SFA) is proposed to align and aggregate two adjacent levels of feature maps effectively. In the SFA, by taking into account the unique role of each stage in the decoder, a novel stage-aware Feature Enhancement Block (FEB) is designed to enhance spatial details and contextual information of feature maps from the encoder. In this way, we are able to address the misalignment problem with a very simple and efficient multi-branch decoder structure. Moreover, an auxiliary training strategy is developed to explicitly alleviate the multi-scale object problem without bringing additional computational costs during the inference phase. Experimental results show that the proposed SFANet exhibits a good balance between accuracy and speed for real-time semantic segmentation of street scenes. In particular, based on ResNet-18, SFANet respectively obtains 78.1% and 74.7% mean of class-wise Intersection-over-Union (mIoU) at inference speeds of 37 FPS and 96 FPS on the challenging Cityscapes and CamVid test datasets by using only a single GTX 1080Ti GPU.
Masonry structures represent the highest proportion of building stock worldwide. Currently, the structural condition of such structures is predominantly inspected manually, which is a laborious, costly, and subjective process. With developments in computer vision, there is an opportunity to use digital images to automate the visual inspection process. The aim of this study is to examine deep learning techniques for crack detection on images from masonry walls. A dataset of photos from masonry structures is produced containing complex backgrounds and various crack types and sizes. Different deep learning networks are considered and, by leveraging the effect of transfer learning, crack detection on masonry surfaces is performed at patch level with 95.3% accuracy and at pixel level with a 79.6% F1 score. This is the first implementation of deep learning for pixel-level crack segmentation on masonry surfaces. Codes, data, and networks relevant to this study are available at: github.com/dimitrisdais/crack_detection_CNN_masonry.
•Crack detection on masonry surfaces performed on patch level with 95.3% accuracy.
•Crack detection on masonry surfaces performed on pixel level with 79.6% F1 score.
•DL for pixel-level masonry crack segmentation is implemented for the first time.
•Transfer learning boosts the performance of crack classification and segmentation.
Although deep neural networks have made significant progress in semantic segmentation, their speed and computational cost still cannot meet the strict requirements of real-world applications. In this paper, we present an enhanced asymmetric convolution network (EACNet) to seek a balance between accuracy and speed. Specifically, we design a pair of enhanced asymmetric convolution modules, constructed from depth-wise asymmetric convolution and dilated convolution, to extract short-range and long-range features efficiently and powerfully. Additionally, we apply a bilateral structure in which the detail branch preserves low-level spatial details while the semantic branch captures high-level context information. The two branches are merged at different stages of the network to strengthen information propagation between different levels. Experiments on the Cityscapes dataset show that our method achieves high accuracy and speed with relatively few parameters. Compared with other real-time semantic segmentation methods, our network attains a good trade-off among parameters, speed, and accuracy.
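The parameter savings behind the depth-wise asymmetric convolutions mentioned above can be illustrated with simple weight counting: a standard k×k convolution costs c_in·c_out·k² weights, while factorizing the kernel into k×1 and 1×k and applying one filter per channel costs only 2·c·k. A sketch with illustrative channel and kernel sizes (not the paper's actual configuration):

```python
def conv_params(c_in, c_out, kh, kw):
    """Weight count of a standard convolution (bias terms ignored)."""
    return c_in * c_out * kh * kw

def depthwise_asym_params(c, k):
    """Depth-wise k x 1 followed by 1 x k convolution on c channels:
    one k-element filter per channel for each orientation."""
    return c * k + c * k

c, k = 64, 3
standard = conv_params(c, c, k, k)  # 64 * 64 * 3 * 3 = 36864 weights
asym = depthwise_asym_params(c, k)  # 64 * 3 * 2      = 384 weights
```

In practice the depth-wise pair is usually followed by a cheap 1×1 (point-wise) convolution to mix channels, but even then the total stays far below the dense k×k count.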
It is well known that the Unet has been widely used in the area of medical image segmentation because of the skip connections in its up-sampling process. However, it does not perform well in dealing with complex medical images, such as brain MRI. To achieve better segmentation performance with the Unet, many researchers have turned to stacking it; however, the stacking process leads to a large increase in the number of parameters, which is a poor choice when considering the tradeoff between precision and efficiency. Another problem is that, as the depth of the network increases, excessive loss of information becomes a tricky issue. To address these problems, in this paper we improve the network structure of the Unet to make it more suitable for brain tumor segmentation. We propose a novel framework called the Stack Multi-Connection Simple Reducing Net (SMCSRNet), built by stacking our basic block, the Simple Reducing Net (SRNet). The SRNet is derived from the original Unet and consists of four downsampling and four upsampling operations during encoding and decoding, with only one convolution performed before each downsampling step; the copy-and-crop operation between encoding and decoding is preserved. The main advantage of the SRNet is that the number of parameters is reduced by 4/5 compared with the original Unet. Beyond reducing the parameter count, we also propose a series of bridge connections among the stacked cascade networks to mitigate the loss of information: before the pooling operation, each layer in one basic block receives a bridge connection from the layer with the same feature size in the previous basic block. It is also worth noting that the training time of the proposed framework is much less than that of the original stacked Unet.
Moreover, the performance of the proposed method is also improved compared to the stacked Unet. When further compared with other state-of-the-art segmentation networks, its performance is as good as that of the popular DenseNet or ResNet. Overall, evaluation of the proposed framework on BraTS 2015 demonstrates that the proposed segmentation network can accurately extract the brain tumor boundary and thus obtain higher recognition quality with high efficiency.
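To see how dropping one of the two convolutions per encoder stage shrinks the model, the following toy counter compares 3×3-conv weight counts for a Unet-like encoder with two versus one convolution per stage. The channel widths are illustrative, and only the encoder is counted; the 4/5 figure reported above also depends on the decoder and the paper's other design choices:

```python
def encoder_conv_params(channels, convs_per_stage):
    """3x3 conv weight count for an encoder (biases ignored).

    channels: output width of each stage; input is single-channel MRI.
    """
    total, c_in = 0, 1
    for c_out in channels:
        for _ in range(convs_per_stage):
            total += c_in * c_out * 9  # 3x3 kernel
            c_in = c_out
    return total

widths = [64, 128, 256, 512]              # typical doubling widths
two_convs = encoder_conv_params(widths, 2)  # Unet-style: two convs per stage
one_conv = encoder_conv_params(widths, 1)   # SRNet-style: one conv per stage
```

Even this crude count shows the one-conv encoder retains only about a third of the weights, since the second (width-preserving) convolution in each stage dominates the cost.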
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
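The windowed attention described above can be sketched as follows: the feature map is partitioned into non-overlapping M×M windows, self-attention is computed only within each window, and shifting the partition between consecutive layers lets information cross window boundaries. A minimal NumPy illustration of the partitioning step (the shapes and the roll-based shift are illustrative simplifications of the paper's implementation):

```python
import numpy as np

def window_partition(x, m):
    """Split an (H, W, C) feature map into non-overlapping (m, m, C) windows."""
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m, m, c)

h, w, c, m = 8, 8, 4, 4
x = np.arange(h * w * c, dtype=float).reshape(h, w, c)

windows = window_partition(x, m)
num_windows = windows.shape[0]  # H*W / m^2 windows

# Shifted layer: cyclically displace the map by m//2 before partitioning,
# so tokens near old window borders now attend across them.
shifted = np.roll(x, shift=(-m // 2, -m // 2), axis=(0, 1))
shifted_windows = window_partition(shifted, m)
```

Because attention within one window costs a fixed O(m⁴) regardless of image size, the total cost grows linearly with the number of windows, i.e. with H·W, which is the linear-complexity property the abstract highlights.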
Synthetic aperture radar (SAR) can be used to obtain remote sensing images of different growth stages of crops under all weather conditions. Such time-series SAR images can provide an abundance of temporal and spatial features for use in large-scale crop mapping and analysis. In this study, we propose a temporal feature-based segmentation (TFBS) model for accurate crop mapping using time-series SAR images. This model first extracts deep-seated temporal features and then learns the spatial context of the extracted temporal features for crop mapping. The results indicate that the TFBS model significantly outperforms traditional long short-term memory (LSTM), U-Net, and convolutional LSTM models in crop mapping based on time-series SAR images. TFBS demonstrates better generalizability than the other models in the study area, which makes it more transferable, and the results show that data augmentation can significantly improve this generalizability. Visualization of the temporal features extracted by TFBS shows a high degree of intraclass homogeneity among rice fields and interclass heterogeneity between rice fields and other features. TFBS also achieved the highest accuracy of the four deep learning models for multicrop classification in the study area. This study presents a feasible way of producing high-accuracy, large-scale crop maps based on the proposed model.
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linearity and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
Recently, significant improvement has been made in semantic object segmentation due to the development of deep convolutional neural networks (DCNNs). Training such a DCNN usually relies on a large number of images with pixel-level segmentation masks, and annotating these images is very costly in terms of both finance and human effort. In this paper, we propose a simple-to-complex (STC) framework in which only image-level annotations are utilized to learn DCNNs for semantic segmentation. Specifically, we first train an initial segmentation network, called Initial-DCNN, with the saliency maps of simple images (i.e., those with a single category of major object(s) and a clean background). These saliency maps can be automatically obtained by existing bottom-up salient object detection techniques, where no supervision information is needed. Then, a better network, called Enhanced-DCNN, is learned with supervision from the segmentation masks of simple images predicted by the Initial-DCNN, together with the image-level annotations. Finally, pixel-level segmentation masks of complex images (two or more categories of objects with cluttered backgrounds), inferred by the Enhanced-DCNN and the image-level annotations, are utilized as the supervision information to learn the Powerful-DCNN for semantic segmentation. Our method utilizes 40K simple images from Flickr.com and 10K complex images from PASCAL VOC to progressively boost the segmentation network. Extensive experimental results on the PASCAL VOC 2012 segmentation benchmark demonstrate the superiority of the proposed STC framework compared with other state-of-the-art methods.
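The Initial-DCNN stage above uses saliency maps of simple images as free supervision. One common way to turn a saliency map plus an image-level label into a pseudo pixel-level mask is simple thresholding; a minimal sketch (the threshold, label value, and function name are illustrative, not taken from the paper):

```python
import numpy as np

def saliency_to_pseudo_mask(saliency, image_label, threshold=0.5):
    """Convert a [0, 1] saliency map into a pseudo segmentation mask.

    Pixels above the threshold take the image-level class label;
    the rest are assigned background (0). The threshold is an
    illustrative choice, not a value from the paper.
    """
    return np.where(saliency > threshold, image_label, 0)

saliency = np.array([[0.9, 0.2],
                     [0.7, 0.1]])
mask = saliency_to_pseudo_mask(saliency, image_label=7)
```

Because the simple images contain a single dominant object on a clean background, this crude label transfer is far more reliable than it would be on complex images, which is exactly why the framework handles those only in the later Enhanced-DCNN and Powerful-DCNN stages.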