Deep Learning for 3D Point Clouds: A Survey
Guo, Yulan; Wang, Hanyun; Hu, Qingyong ...
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 43, Issue 12, 1 December 2021
Journal Article
Peer reviewed
Open access
Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominant technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges of processing point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks: 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.
Instance segmentation in videos, which aims to segment and track multiple objects in video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos, namely STC-Seg. Concretely, STC-Seg makes four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance mask generation, we devise a puzzle loss, which enables end-to-end training using box-level annotations. Third, our tracking module jointly uses bounding-box diagonal points and spatio-temporal discrepancy to model movements, which largely improves robustness to varying object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate on the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms the fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reveals only the tip of the iceberg of innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
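The abstract does not define the puzzle loss itself, so the sketch below illustrates only the general idea of supervising masks from box-level annotations: a projection-style loss (in the spirit of other box-supervised methods, not necessarily STC-Seg's formulation) that matches the predicted mask's axis-wise max-projections to the box extent. Function and variable names are ours.

    import torch

    def box_projection_loss(mask_logits, boxes):
        """Generic box-level mask supervision (NOT the STC-Seg puzzle loss):
        the soft mask, max-projected onto each image axis, should match the
        row/column occupancy of the ground-truth box.
        mask_logits: (N, H, W) per-instance logits; boxes: (N, 4) as x0, y0, x1, y1."""
        probs = mask_logits.sigmoid()
        n, h, w = probs.shape
        ys = torch.arange(h, device=probs.device)
        xs = torch.arange(w, device=probs.device)
        # Binary row/column targets derived from the box coordinates.
        col_t = ((xs >= boxes[:, 0:1]) & (xs <= boxes[:, 2:3])).float()  # (N, W)
        row_t = ((ys >= boxes[:, 1:2]) & (ys <= boxes[:, 3:4])).float()  # (N, H)
        # Max-project the soft mask onto each axis.
        col_p = probs.max(dim=1).values  # max over rows -> (N, W)
        row_p = probs.max(dim=2).values  # max over cols -> (N, H)
        def dice(p, t):
            return 1 - (2 * (p * t).sum(-1) + 1) / ((p * p).sum(-1) + (t * t).sum(-1) + 1)
        return (dice(col_p, col_t) + dice(row_p, row_t)).mean()

Losses of this family give the mask head a dense training signal from nothing more than box coordinates, which is what makes end-to-end training from box annotations possible.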
We propose an end-to-end learning framework for segmenting generic objects in both images and videos. Given a novel image or video, our approach produces a pixel-level mask for all "object-like" regions, even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented with a deep fully convolutional network. When applied to a video, our model further incorporates a motion stream, and the network learns to combine appearance and motion to extract all prominent objects, whether they are moving or not. Beyond the core model, a second contribution of our approach is how it leverages varying strengths of training annotations. Pixel-level annotations are difficult to obtain, yet crucial for training a deep segmentation network. We therefore propose ways to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value of mixing object category examples carrying only image-level labels with relatively few images carrying boundary-level annotations. For video, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state of the art for fully automatic segmentation of generic (unseen) objects. In addition, we demonstrate how our approach benefits image retrieval and image retargeting, both of which flourish when given our high-quality foreground maps. Code, models, and videos are at: http://vision.cs.utexas.edu/projects/pixelobjectness/.
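As a rough illustration of the two-stream design described above, the sketch below fuses an appearance stream and a motion stream by concatenation before a per-pixel object/background head; the fusion scheme, feature width, and layer choices are placeholders, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class TwoStreamObjectness(nn.Module):
        """Appearance + motion fusion sketch (placeholder architecture)."""
        def __init__(self, appearance_fcn: nn.Module, motion_fcn: nn.Module, feat_dim: int = 64):
            super().__init__()
            self.appearance_fcn = appearance_fcn  # RGB frame    -> (N, feat_dim, H, W)
            self.motion_fcn = motion_fcn          # optical flow -> (N, feat_dim, H, W)
            self.head = nn.Conv2d(2 * feat_dim, 2, kernel_size=1)  # object vs. background

        def forward(self, rgb, flow):
            fused = torch.cat([self.appearance_fcn(rgb), self.motion_fcn(flow)], dim=1)
            return self.head(fused)  # per-pixel object/background logits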
This paper conducts a systematic study on the role of visual attention in video object pattern understanding. By elaborately annotating three popular video segmentation datasets (DAVIS16, Youtube-Objects, and SegTrackV2) with dynamic eye-tracking data in the unsupervised video object segmentation (UVOS) setting, we quantitatively verified, for the first time, the high consistency of visual attention behavior among human observers, and found a strong correlation between human attention and explicit primary object judgments during dynamic, task-driven viewing. These novel observations provide in-depth insight into the underlying rationale behind video object patterns. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution enjoys three major advantages: 1) modular training without expensive video segmentation annotations; instead, more affordable dynamic fixation data train the initial video attention module, and existing fixation-segmentation paired static/image data train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically inspired and assessable attention. Experiments on four popular benchmarks show that, even without expensive video object mask annotations, our model achieves compelling performance compared with state-of-the-art methods and enjoys fast processing speed (10 fps on a single GPU). Our collected eye-tracking data and algorithm implementations have been made publicly available at https://github.com/wenguanwang/AGS .
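The DVAP/AGOS decoupling can be pictured as a two-module pipeline in which the attention module, trained on fixation data, conditions the segmentation module. The sketch below only mirrors that data flow; the module internals and the concatenation-based conditioning are assumptions, not the authors' architecture.

    import torch
    import torch.nn as nn

    class AttentionGuidedSegmenter(nn.Module):
        """Two-stage UVOS sketch: attention prediction, then attention-guided
        segmentation (interfaces assumed, not the paper's exact design)."""
        def __init__(self, attention_net: nn.Module, segmentation_net: nn.Module):
            super().__init__()
            self.attention_net = attention_net        # DVAP: frames -> fixation map
            self.segmentation_net = segmentation_net  # AGOS: frame + map -> object mask

        def forward(self, frames):                    # frames: (T, 3, H, W)
            saliency = self.attention_net(frames)     # (T, 1, H, W), trainable on eye-tracking data
            # Condition the per-frame segmenter on the predicted attention map.
            return self.segmentation_net(torch.cat([frames, saliency], dim=1))

Because the two modules are trained separately, neither stage needs per-pixel video object masks, which is the source of the annotation savings claimed above.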
Metal surface defect segmentation can play an important role in quality control during the production and manufacturing stages. Two major challenges remain in industrial applications. One is that the number of metal surface defect samples is often severely insufficient; the other is that most existing algorithms can only be used for specific surface defects and generalize poorly to other metal surfaces. In this work, few-shot metal generic surface defect segmentation is introduced to address these challenges. To realize it, we propose the Triplet-Graph Reasoning Network (TGRNet) and a novel dataset, Surface Defects-4^i. In TGRNet, a surface defect triplet (comprising a triplet encoder and a triplet loss) is proposed to segment the background and defect areas separately. Through the triplet, the few-shot metal surface defect segmentation problem is transformed into a few-shot semantic segmentation problem over the defect and background areas. For few-shot semantic segmentation, we propose a multi-graph reasoning method to explore the similarity relationships between different images, and, to improve segmentation performance in industrial scenes, an adaptive auxiliary prediction module. Surface Defects-4^i includes multiple categories of metal surface defect images to verify the generalization performance of TGRNet, and adds nonmetal categories (leather and tile) as extensions. Extensive comparative and ablation experiments show that our architecture achieves state-of-the-art results.
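To make the "transformed into a few-shot semantic segmentation problem" step concrete, the sketch below shows a standard prototype-matching baseline for defect-vs-background few-shot segmentation: masked average pooling builds one prototype per region from a support image, and query pixels are labeled by cosine similarity. This is a common baseline, not TGRNet's multi-graph reasoning.

    import torch
    import torch.nn.functional as F

    def few_shot_defect_segmentation(feat_support, mask_support, feat_query):
        """Prototype-matching baseline (illustrative; simpler than TGRNet).
        feat_*: (C, H, W) backbone features; mask_support: (H, W) in {0, 1},
        with 1 marking the defect area and 0 the background area."""
        c = feat_support.shape[0]
        m = mask_support.unsqueeze(0)                                   # (1, H, W)
        # Masked average pooling: one prototype per region.
        proto_defect = (feat_support * m).sum((1, 2)) / m.sum().clamp(min=1)
        proto_bg = (feat_support * (1 - m)).sum((1, 2)) / (1 - m).sum().clamp(min=1)
        protos = torch.stack([proto_bg, proto_defect])                  # (2, C)
        # Label every query location by its most similar prototype.
        sims = F.cosine_similarity(feat_query.permute(1, 2, 0).reshape(-1, 1, c),
                                   protos.unsqueeze(0), dim=-1)         # (H*W, 2)
        return sims.argmax(-1).reshape(feat_query.shape[1:])            # (H, W) labels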
Deep neural network-based semantic segmentation generally requires large-scale, costly annotations for training to obtain good performance. To avoid the pixel-wise segmentation annotations that most methods need, some researchers have recently attempted to use object-level labels (e.g., bounding boxes) or image-level labels (e.g., image categories). In this paper, we propose a novel recursive coarse-to-fine semantic segmentation framework based only on image-level category labels. For each image, an initial coarse mask is first generated by a convolutional neural network-based unsupervised foreground segmentation model and then enhanced by a graph model. The enhanced coarse mask is fed to a fully convolutional neural network to be recursively refined. Unlike existing image-level label-based semantic segmentation methods, which require labeling all categories for images that contain multiple types of objects, our framework needs only one label per image and can handle images containing multi-category objects. Trained only on ImageNet, our framework achieves performance on the PASCAL VOC dataset comparable to other state-of-the-art image-level label-based semantic segmentation methods. Furthermore, our framework can be easily extended to the foreground object segmentation task, where it achieves performance comparable to state-of-the-art supervised methods on the Internet object dataset.
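The recursive refinement described above can be summarized as a simple loop; the sketch below captures only the control flow, with the three models passed in as callables whose interfaces are our assumptions.

    def recursive_refinement(image, unsup_model, graph_enhance, fcn, steps=3):
        """Coarse-to-fine loop sketch: unsup_model, graph_enhance, and fcn
        stand in for the paper's unsupervised foreground model, graph-based
        enhancement, and fully convolutional refinement network."""
        mask = unsup_model(image)          # initial coarse foreground mask
        mask = graph_enhance(image, mask)  # e.g. a CRF-style graph model
        for _ in range(steps):
            # Fit/evaluate the FCN against the current mask, then adopt its
            # (typically sharper) prediction as the next pseudo-label.
            mask = fcn(image, pseudo_label=mask)
        return mask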
We propose an unsupervised video segmentation approach that simultaneously tracks multiple holistic figure-ground segments. Segment tracks are initialized from a pool of segment proposals generated by a figure-ground segmentation algorithm. Then, online non-local appearance models are trained incrementally for each track using a multi-output regularized least squares formulation. Because all segment tracks share the same set of training examples, a computational trick lets us track hundreds of segment tracks efficiently and perform optimal online updates in closed form. In addition, a new composite statistical inference approach is proposed for refining the obtained segment tracks: it breaks down the initial segment proposals and recombines them into better ones, using high-order statistics estimated from the appearance model and enforcing temporal consistency. To evaluate the algorithm, we collected a dataset, SegTrack v2, with about 1,000 frames with pixel-level annotations. The proposed framework outperforms state-of-the-art approaches on this dataset, showing its efficiency and robustness to the challenges of different video sequences.
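The "computational trick" is easy to see in the batch closed form of multi-output regularized least squares: when every segment track shares the same example matrix, one matrix factorization serves all tracks. The sketch below shows the shared-solve idea in NumPy (the paper's incremental online update is not reproduced here).

    import numpy as np

    def multi_output_rls(X, Y, lam=1.0):
        """Closed-form ridge regression shared across many outputs:
            W = (X^T X + lam * I)^{-1} X^T Y
        X: (n_samples, n_features) examples shared by ALL segment tracks;
        Y: (n_samples, n_tracks) one label column per track.
        The expensive solve depends only on X, so it is done once."""
        d = X.shape[1]
        A = X.T @ X + lam * np.eye(d)       # identical for every track
        return np.linalg.solve(A, X.T @ Y)  # (n_features, n_tracks)

    # Scoring a candidate segment x (n_features,) for all tracks at once:
    # scores = x @ W   # one matrix-vector product gives every track's response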
Incorporating prior knowledge about organ shape and location is key to improving the performance of image analysis approaches. In particular, priors can be useful when images are corrupted and contain artefacts due to limitations in image acquisition. The highly constrained nature of anatomical objects can be well captured with learning-based techniques. However, in most recent and promising techniques, such as CNN-based segmentation, it is not obvious how to incorporate such prior knowledge. State-of-the-art methods operate as pixel-wise classifiers whose training objectives do not incorporate the structure and inter-dependencies of the output. To overcome this limitation, we propose a generic training strategy that incorporates anatomical prior knowledge into CNNs through a new regularisation model, which is trained end-to-end. The new framework encourages models to follow the global anatomical properties of the underlying anatomy (e.g., shape, label structure) via learnt non-linear representations of the shape. We show that the proposed approach can be easily adapted to different analysis tasks (e.g., image enhancement, segmentation) and improves the prediction accuracy of state-of-the-art models. The applicability of our approach is shown on multi-modal cardiac datasets and public benchmarks. In addition, we demonstrate how the learnt deep models of 3-D shapes can be interpreted and used as biomarkers for classifying cardiac pathologies.
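One common way to realize such a learnt shape regulariser, consistent with the strategy described above, is to penalize the distance between prediction and ground truth in the latent space of a pre-trained shape model; the exact regulariser and encoder in the paper may differ, so treat this as a hedged sketch.

    import torch
    import torch.nn.functional as F

    def anatomically_constrained_loss(pred_logits, target, shape_encoder, weight=0.01):
        """Segmentation loss plus a latent-space shape prior (sketch).
        shape_encoder: assumed pre-trained (e.g. an autoencoder's encoder)
        mapping label maps to a compact code of anatomically plausible shapes."""
        seg_loss = F.cross_entropy(pred_logits, target)
        num_classes = pred_logits.shape[1]
        with torch.no_grad():
            one_hot = F.one_hot(target, num_classes).movedim(-1, 1).float()
            z_target = shape_encoder(one_hot)
        z_pred = shape_encoder(pred_logits.softmax(dim=1))
        # Predictions that violate global shape/label structure pay a penalty
        # even when they look plausible pixel-wise.
        prior_loss = F.mse_loss(z_pred, z_target)
        return seg_loss + weight * prior_loss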
The semantic image segmentation task consists of classifying each pixel of an image into an instance, where each instance corresponds to a class. This task is part of scene understanding, i.e., explaining the global context of an image. In the medical image analysis domain, image segmentation can be used for image-guided interventions, radiotherapy, or improved radiological diagnostics. In this review, we categorize the leading deep learning-based medical and non-medical image segmentation solutions into six main groups: deep architectural, data synthesis-based, loss function-based, sequenced models, weakly supervised, and multi-task methods. We provide a comprehensive review of the contributions in each of these groups, analyze their variants, discuss the limitations of current approaches, and present potential future research directions for semantic image segmentation.