Depth sensors of low-cost acquisition devices (e.g. Microsoft Kinect, Asus Xtion) are coming into widespread use; however, acquired 3D data are generally large, heterogeneous, and complex to analyse and interpret. In this context, our overall goal is the analysis of the action of a subject in a 3D video, e.g. the action of a human or the movement of its subparts. To this end, action classification is achieved through the analysis of the temporal variation of geometric (e.g. centroid path, volume variation, activated voxels) and kinematic (e.g. speed) properties in consecutive frames. These descriptors and the corresponding histograms are then used to search for a frame in a 3D video and to compare 3D videos. Our approach applies to 3D videos represented as triangle meshes or point sets, and optionally to an underlying skeleton or to markers (if available). Our tests on the MIT, Berkeley, i3DPost, NTU, and DUTH data sets confirm the usefulness of the proposed approach for the analysis and comparison of 3D videos, as well as for action classification.
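The per-frame descriptor extraction described in this abstract can be sketched for point-set frames. This is a minimal illustration only: the function names, the axis-aligned bounding-box volume proxy, and the default frame rate are assumptions, not the paper's exact formulation.

```python
import numpy as np

def frame_descriptors(frames, dt=1.0 / 30.0):
    """Per-frame geometric/kinematic descriptors for a 3D video.

    `frames` is a list of (N_i, 3) point arrays, one per frame.
    Illustrative sketch; not the paper's exact descriptor set.
    """
    centroids = np.array([f.mean(axis=0) for f in frames])   # centroid path
    # Axis-aligned bounding-box volume as a simple volume-variation proxy.
    volumes = np.array([np.prod(f.max(axis=0) - f.min(axis=0)) for f in frames])
    # Centroid speed between consecutive frames (kinematic property).
    speeds = np.linalg.norm(np.diff(centroids, axis=0), axis=1) / dt
    return centroids, volumes, speeds

# Toy example: a two-point cloud translating along x by one unit per frame.
frames = [np.array([[0.0 + t, 0.0, 0.0], [1.0 + t, 1.0, 1.0]]) for t in range(3)]
c, v, s = frame_descriptors(frames, dt=1.0)
print(c[:, 0])  # centroid x-coordinate advances by 1 per frame
```

Histograms of such per-frame values could then be compared across videos, e.g. with a histogram distance, to retrieve similar frames or actions.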
The High Efficiency Video Coding (HEVC) standard has recently been extended to support efficient representation of multiview video and depth-based 3D video formats. The multiview extension, MV-HEVC, allows efficient coding of multiple camera views and associated auxiliary pictures, and can be implemented by reusing single-layer decoders without changing the block-level processing modules, since block-level syntax and decoding processes remain unchanged. Bit rate savings compared with HEVC simulcast are achieved by enabling the use of inter-view references in motion-compensated prediction. The more advanced 3D video extension, 3D-HEVC, targets a coded representation consisting of multiple views and associated depth maps, as required for generating additional intermediate views in advanced 3D displays. Additional bit rate reduction compared with MV-HEVC is achieved by specifying new block-level video coding tools, which explicitly exploit statistical dependencies between video texture and depth and specifically adapt to the properties of depth maps. The technical concepts and features of both extensions are presented in this paper.
3D reconstruction is a longstanding ill-posed problem, which has been explored for decades by the computer vision, computer graphics, and machine learning communities. Since 2015, image-based 3D reconstruction using convolutional neural networks (CNN) has attracted increasing interest and demonstrated impressive performance. Given this new era of rapid evolution, this article provides a comprehensive survey of the recent developments in this field. We focus on works that use deep learning techniques to estimate the 3D shape of generic objects either from a single or multiple RGB images. We organize the literature based on the shape representations, the network architectures, and the training mechanisms they use. While this survey is intended for methods which reconstruct generic objects, we also review some of the recent works which focus on specific object classes such as human body shapes and faces. We provide an analysis and comparison of the performance of some key papers, summarize some of the open problems in this field, and discuss promising directions for future research.
Digital rights management of 3D content is a crucial open issue in the 3D video industry. A novel robust fingerprinting algorithm is proposed for protecting the copyright of 3D video. Unlike existing algorithms, which extract visual features separately from the 2D videos and the depth maps, our algorithm constructs a novel local stereo space according to the depth information of the pixels around the extracted local feature points in the 2D videos, and processes the 3D videos in a holistic manner. In the proposed space, the 3D-transform feature is extracted and aggregated into a feature matrix, and the compact 3D video fingerprints are then obtained from the eigenspace of the matrix. Our comprehensive experiments are conducted on a 3D video database, and the results demonstrate the robustness and discrimination of the proposed algorithm. Moreover, our fingerprints require less storage space than the existing approaches.
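The eigenspace step can be illustrated with a small sketch. The random feature matrix, the use of SVD right-singular vectors, and the choice of k below are hypothetical stand-ins for the paper's actual 3D-transform features and fingerprint construction.

```python
import numpy as np

def eigenspace_fingerprint(feature_matrix, k=8):
    """Compact fingerprint from the eigenspace of an aggregated feature matrix.

    Hedged sketch: rows of `feature_matrix` stand in for aggregated local
    3D-transform features; taking the top-k right-singular vectors is one
    plausible way to summarize the matrix's eigenspace compactly.
    """
    X = feature_matrix - feature_matrix.mean(axis=0)   # center the features
    # Right-singular vectors of X span the eigenspace of X^T X.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].ravel()                              # compact fingerprint vector

rng = np.random.default_rng(0)
fp = eigenspace_fingerprint(rng.standard_normal((200, 16)), k=4)
print(fp.shape)  # (64,)
```

Because the fingerprint derives from the dominant eigenspace rather than individual features, small perturbations of the video (the robustness requirement) leave it largely unchanged.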
Recent world events have caused a dramatic rise in the use of video conferencing solutions such as Zoom and FaceTime. Although 3D capture and display technologies are becoming common in consumer products (e.g., Apple iPhone TrueDepth sensors, Microsoft Kinect devices, and Meta Quest VR headsets), 3D telecommunication has not yet seen any appreciable adoption. Researchers have made great progress in developing advanced 3D telepresence systems, but often with burdensome hardware and network requirements. In this work, we present HoloKinect, an open-source, user-friendly, and GPU-accelerated platform for enabling live, two-way 3D video conferencing on commodity hardware and a standard broadband internet connection. A Microsoft Azure Kinect serves as the capture device and a Looking Glass Portrait multiscopically displays the final reconstructed 3D mesh for a hologram-like effect. HoloKinect packs color and depth information into a single video stream, leveraging multiwavelength depth (MWD) encoding to store depth maps in standard RGB video frames. The video stream is compressed with highly optimized and hardware-accelerated video codecs such as H.264. A search of the depth and video encoding parameter space was performed to analyze the quantitative and qualitative losses resulting from HoloKinect's lossy compression scheme. Visual results were acceptable at all tested bitrates (3–30 Mbps), while the best results were achieved with higher video bitrates and full 4:4:4 chroma sampling. RMSE values of the recovered depth measurements were low across all tested settings.
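The idea of packing a depth map into a standard RGB frame can be sketched as follows. The sin/cos channel layout, the wavelength value, and the normalization are illustrative assumptions rather than HoloKinect's exact MWD scheme, which the abstract does not detail.

```python
import numpy as np

def mwd_encode(depth, d_min, d_max, wavelength=0.1):
    """Pack a depth map into an RGB frame (multiwavelength-style sketch).

    Two fine sinusoidal channels carry precise wrapped phase; a coarse
    linear channel disambiguates the fringe order. Illustrative only.
    """
    d = (depth - d_min) / (d_max - d_min)                   # normalize to [0, 1]
    r = 0.5 + 0.5 * np.sin(2 * np.pi * d / wavelength)      # fine phase (sin)
    g = 0.5 + 0.5 * np.cos(2 * np.pi * d / wavelength)      # fine phase (cos)
    b = d                                                   # coarse channel
    return np.stack([r, g, b], axis=-1)

def mwd_decode(rgb, d_min, d_max, wavelength=0.1):
    """Invert the sketch encoder: wrapped phase + coarse fringe order -> depth."""
    phase = np.arctan2(rgb[..., 0] - 0.5, rgb[..., 1] - 0.5) / (2 * np.pi)
    k = np.round(rgb[..., 2] / wavelength - phase)          # integer fringe order
    d = (k + phase) * wavelength
    return d * (d_max - d_min) + d_min

depth = np.linspace(0.5, 4.5, 101)            # e.g. depth range in metres
recovered = mwd_decode(mwd_encode(depth, 0.5, 4.5), 0.5, 4.5)
```

The practical benefit mirrored here is that the high-precision signal lives in smooth sinusoidal channels, which survive lossy video codecs such as H.264 far better than raw high-bit-depth values would.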
This paper describes an extension of the high efficiency video coding (HEVC) standard for coding of multi-view video and depth data. In addition to the known concept of disparity-compensated prediction, inter-view motion parameter prediction and inter-view residual prediction for coding of the dependent video views are developed and integrated. Furthermore, for depth coding, new intra coding modes, modified motion compensation and motion vector coding, as well as the concept of motion parameter inheritance are part of the HEVC extension. A novel encoder control uses view synthesis optimization, which guarantees that high quality intermediate views can be generated based on the decoded data. The bitstream format supports the extraction of partial bitstreams, so that conventional 2D video, stereo video, and the full multi-view video plus depth format can be decoded from a single bitstream. Objective and subjective results are presented, demonstrating that the proposed approach provides 50% bit rate savings in comparison with HEVC simulcast and 20% in comparison with a straightforward multi-view extension of HEVC without the newly developed coding tools.
Previous works for LiDAR-based 3D object detection mainly focus on the single-frame paradigm. In this paper, we propose to detect 3D objects by exploiting temporal information in multiple frames, i.e., point cloud videos. We empirically categorize the temporal information into short-term and long-term patterns. To encode the short-term data, we present a Grid Message Passing Network (GMPNet), which considers each grid (i.e., the grouped points) as a node and constructs a k-NN graph with the neighbor grids. To update features for a grid, GMPNet iteratively collects information from its neighbors, thus mining the motion cues in grids from nearby frames. To further aggregate long-term frames, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU), which contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module. STA and TTA enhance the vanilla GRU to focus on small objects and better align moving objects. Our overall framework supports both online and offline video object detection in point clouds. We implement our algorithm based on prevalent anchor-based and anchor-free detectors. Evaluation results on the challenging nuScenes benchmark show the superior performance of our method, which ranked first on the leaderboard (at the time of paper submission) without any "bells and whistles." Our source code is available at https://github.com/shenjianbing/GMP3D.
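The k-NN grid graph at the core of GMPNet can be sketched as follows. The brute-force neighbor search and the unlearned mean-aggregation update are illustrative stand-ins for the learned message and update functions described in the paper.

```python
import numpy as np

def knn_graph(grid_centers, k=4):
    """Build the k-NN adjacency used for grid-to-grid message passing.

    Sketch: nodes are grid (voxel) centers, edges connect each grid to
    its k nearest neighbors by Euclidean distance. Returns (N, k) indices.
    """
    d = np.linalg.norm(grid_centers[:, None] - grid_centers[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-loops
    return np.argsort(d, axis=1)[:, :k]

def message_passing_step(features, neighbors):
    """One unlearned message-passing step: blend each grid's features
    with the mean of its neighbors' features (stand-in for GMPNet's
    learned message/update networks)."""
    return 0.5 * (features + features[neighbors].mean(axis=1))

# Toy example: four grid centers along the x-axis, one far away.
centers = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [10, 0, 0]])
nbrs = knn_graph(centers, k=2)                  # e.g. node 0's neighbors: grids 1, 2
```

Iterating such steps lets motion cues propagate between nearby grids across frames, which is the short-term pattern the abstract describes.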
This paper describes extensions to the High Efficiency Video Coding (HEVC) standard that are active areas of current development in the relevant international standardization committees. While the first version of HEVC is sufficient to cover a wide range of applications, needs for enhancing the standard in several ways have been identified, including work on range extensions for color format and bit depth enhancement, embedded-bitstream scalability, and 3D video. The standardization of extensions in each of these areas will be completed in 2014, and further work is also planned. The design for these extensions represents the latest state of the art for video coding and its applications.
This paper provides an overview of Scalable High Efficiency Video Coding (SHVC), the scalable extensions of the High Efficiency Video Coding (HEVC) standard, published in the second version of HEVC. In addition to the temporal scalability already provided by the first version of HEVC, SHVC further provides spatial, signal-to-noise ratio, bit depth, and color gamut scalability functionalities, as well as combinations of any of these. The SHVC architecture design enables SHVC implementations to be built using multiple repurposed single-layer HEVC codec cores, with the addition of interlayer reference picture processing modules. The general multilayer high-level syntax design common to all multilayer HEVC extensions, including SHVC, MV-HEVC, and 3D-HEVC, is described. The interlayer reference picture processing modules, including texture and motion resampling and color mapping, are also described. Performance comparisons are provided for SHVC versus simulcast HEVC and versus the scalable video coding extension to H.264/advanced video coding.
The increasing popularity of video (i.e., audio-visual) applications and services over both wired and wireless links has prompted growing interest in the investigation of quality of experience (QoE) in online video transmission. Conventional video quality metrics, such as peak signal-to-noise ratio and quality of service, focus only on the reception quality from a systematic perspective; as a result, they cannot represent the true visual experience of an individual user. QoE instead introduces a user-experience-driven strategy that puts special emphasis on the contextual and human factors in addition to the transmission system. This advantage has raised the popularity and widespread usage of QoE in video transmission. In this paper, we present an overview of selected issues pertaining to QoE and its recent applications in video transmission, with consideration of the compelling features of QoE (i.e., context and human factors). The selected issues include QoE modeling with influence factors in the end-to-end chain of video transmission, QoE assessment (including subjective tests and objective QoE monitoring), and QoE management of video transmission over different types of networks. Through the literature review, we observe that context and human factors in QoE-aware video transmission have attracted significant attention over the past two to three years. A large number of high-quality works have been published in this area and are highlighted in this survey. In addition to a thorough summary of recent progress, we also present an outlook on future developments in QoE assessment and management in video transmission, focusing especially on the context and human factors that have not yet been addressed and the technical challenges that have not been completely solved so far. We believe that our overview and findings can provide a timely perspective on the related issues and the future research directions in QoE-oriented services over video communications.