Despite significant progress, monocular 3D human pose estimation still faces challenges due to self-occlusions and depth ambiguities. To tackle these issues, we propose a novel Dynamic Graph Transformer (DGFormer) that exploits local and global relationships between skeleton joints for pose estimation. Specifically, DGFormer consists of three core modules: a Transformer Encoder (TE), an immobile Graph Convolutional Network (GCN), and a dynamic GCN. The TE module leverages the self-attention mechanism to learn the complex global relationships among skeleton joints. The immobile GCN captures the local physical connections between human joints, while the dynamic GCN learns the sparse dynamic K-nearest neighbor interactions according to different action poses. By adequately modeling the global long-range, local physical, and sparse dynamic dependencies of human joints, our method predicts 3D poses with lower errors on the Human3.6M and MPI-INF-3DHP datasets than recent state-of-the-art image-based methods. Furthermore, experiments on in-the-wild videos demonstrate the impressive generalization ability of our method. Code will be available at: https://github.com/czmmmm/DGFormer.
• We propose a novel dynamic Graph Transformer model for 3D human pose estimation (3D-HPE).
• The Transformer is applied to exploit the global relationships among skeleton joints.
• The proposed immobile GCN captures the local physical connections.
• The proposed dynamic GCN learns the sparse dynamic K-nearest neighbor interactions.
• Our method outperforms state-of-the-art methods for image-based 3D-HPE.
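The sparse K-nearest neighbor graph used by the dynamic GCN can be illustrated with a minimal sketch. The abstract does not say which feature space or distance DGFormer uses, so the 2D joint coordinates, the Euclidean distance, and the function name `knn_adjacency` below are all assumptions made for illustration, not the authors' implementation.

```python
import math

def knn_adjacency(joints, k):
    """Return a binary adjacency matrix linking each joint to its k nearest joints.

    joints: list of (x, y) coordinates, one per joint (hypothetical input format).
    """
    n = len(joints)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Sort all other joints by Euclidean distance to joint i (assumed metric).
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(joints[i], joints[j]),
        )
        for j in others[:k]:
            adj[i][j] = 1
    return adj

# Toy 4-joint pose: three joints clustered near the origin, one far away.
pose = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.5), (10.0, 10.0)]
adjacency = knn_adjacency(pose, k=2)
```

Because the neighbors are recomputed from the current joint positions, the graph changes with the pose, in contrast to a fixed skeleton adjacency. Note that the resulting matrix is not necessarily symmetric.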
Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that a PAF-only refinement rather than both PAF and body part location refinement results in a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an internal annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
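To make the association step more concrete, the sketch below scores a candidate limb by averaging the alignment between a 2D vector field and the segment joining two candidate part locations. This is a simplified illustration rather than OpenPose's code: the function name `paf_score`, the nearest-pixel sampling, and the sample count are assumptions.

```python
import math

def paf_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Average alignment between a vector field and the segment p1 -> p2.

    paf_x, paf_y: 2D lists (rows = y, columns = x) holding the field's components.
    p1, p2: (x, y) candidate part locations. num_samples must be at least 2.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for s in range(num_samples):
        t = s / (num_samples - 1)
        # Sample the field at the nearest pixel along the segment (assumed scheme).
        x = int(round(p1[0] + t * dx))
        y = int(round(p1[1] + t * dy))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples

# Toy 5x5 field pointing in +x along row 2 and zero elsewhere.
size = 5
field_x = [[1.0 if row == 2 else 0.0 for _ in range(size)] for row in range(size)]
field_y = [[0.0] * size for _ in range(size)]

aligned = paf_score(field_x, field_y, (0, 2), (4, 2))     # follows the field
misaligned = paf_score(field_x, field_y, (0, 0), (0, 4))  # perpendicular to it
```

A pair of parts whose connecting segment follows the field scores high, while a pair that cuts across it scores near zero, which is what lets a bottom-up system decide which detected parts belong to the same person.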
We propose UniPose+, a unified framework for 2D and 3D human pose estimation in images and videos. The UniPose+ architecture leverages multi-scale feature representations to increase the effectiveness of backbone feature extractors, with no significant increase in network size and no postprocessing. Current pose estimation methods heavily rely on statistical postprocessing or predefined anchor poses for joint localization. The UniPose+ framework incorporates contextual information across scales and joint localization with Gaussian heatmap modulation at the decoder output to estimate 2D and 3D human pose in a single stage with state-of-the-art accuracy, without relying on predefined anchor poses. The multi-scale representations allowed by the waterfall module in the UniPose+ framework leverage the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Our results on multiple datasets demonstrate that UniPose+, with an HRNet, ResNet or SENet backbone and waterfall module, is a robust and efficient architecture for single-person 2D and 3D pose estimation in single images and videos.
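The abstract does not spell out how UniPose+ applies its Gaussian heatmap modulation, so the following is only a generic sketch of the heatmap representation of a joint: a Gaussian is rendered around a joint location and the location is recovered as the peak. The function names and the `sigma` default are hypothetical.

```python
import math

def gaussian_heatmap(width, height, center, sigma=1.0):
    """Render a 2D Gaussian (peak value 1.0) centered on a joint location."""
    cx, cy = center
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]

def decode_peak(heatmap):
    """Return the (x, y) position of the largest heatmap value."""
    best_xy, best_val = (0, 0), heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_val:
                best_xy, best_val = (x, y), value
    return best_xy

# Encode a joint at pixel (5, 2) on an 8x6 grid, then decode it again.
heatmap = gaussian_heatmap(width=8, height=6, center=(5, 2))
joint_xy = decode_peak(heatmap)
```

Decoding by a plain argmax is the simplest choice and limits precision to whole pixels; it is used here only to keep the example short.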
Vision-based monocular human pose estimation, one of the most fundamental and challenging problems in computer vision, aims to obtain the posture of the human body from input images or video sequences. Recent developments in deep learning techniques have brought significant progress and remarkable breakthroughs in the field of human pose estimation. This survey extensively reviews the recent deep learning-based 2D and 3D human pose estimation methods published since 2014. This paper summarizes the challenges, main frameworks, benchmark datasets, evaluation metrics, and performance comparisons, and discusses some promising future research directions.
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) connect the high-to-low resolution convolution streams in parallel and (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. The code is available at https://github.com/HRNet .
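The idea of parallel streams that repeatedly exchange information can be sketched on a toy 1D signal. This is a loose illustration, not HRNet itself: pair averaging stands in for a strided convolution, nearest-neighbor repetition stands in for upsampling, fusion is a plain sum, and all function names are hypothetical.

```python
def downsample(signal):
    """Halve resolution by averaging adjacent pairs (stand-in for a strided convolution)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]

def upsample(signal):
    """Double resolution by nearest-neighbor repetition."""
    return [value for value in signal for _ in range(2)]

def exchange(high, low):
    """Fuse two parallel streams: each receives the other, resampled to its own resolution."""
    new_high = [h + u for h, u in zip(high, upsample(low))]
    new_low = [l + d for l, d in zip(low, downsample(high))]
    return new_high, new_low

# A high-resolution stream (4 samples) and a low-resolution stream (2 samples).
high = [1.0, 3.0, 5.0, 7.0]
low = [10.0, 20.0]
fused_high, fused_low = exchange(high, low)
```

The high-resolution stream keeps its length throughout, which mirrors the point that the representation is maintained rather than recovered from a low-resolution encoding.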
Deep 3D human pose estimation: A review Wang, Jinbao; Tan, Shujie; Zhen, Xiantong ...
Computer Vision and Image Understanding, September 2021, Volume 210
Journal Article
Peer reviewed
Open access
Three-dimensional (3D) human pose estimation involves estimating the articulated 3D joint locations of a human body from an image or video. Due to its widespread applications in areas such as human motion analysis, human–computer interaction, and robotics, 3D human pose estimation has recently attracted increasing attention in the computer vision community. However, it is a challenging task due to depth ambiguities and the lack of in-the-wild datasets. A large number of approaches, many based on deep learning, have been developed over the past decade, largely advancing the performance on existing benchmarks. To guide future development, a comprehensive literature review is highly desired in this area. However, existing surveys on 3D human pose estimation mainly focus on traditional methods, and a comprehensive review of deep learning based methods remains lacking in the literature. In this paper, we provide a thorough review of existing deep learning based works for 3D pose estimation, summarize the advantages and disadvantages of these methods and provide an in-depth understanding of this area. Furthermore, we also explore the commonly used benchmark datasets, on which we conduct a comprehensive study for comparison and analysis. Our study sheds light on the state of research development in 3D human pose estimation and provides insights that can facilitate the future design of models and algorithms.
• The recent methods for deep 3D pose estimation are categorized and thoroughly analyzed.
• An extensive review of related datasets and evaluation metrics is provided.
• The pros and cons of the deep 3D models evaluated on the datasets are compared and conclusions are drawn.
• Potential future research directions are discussed.