Physical rehabilitation plays a crucial role in restoring motor function following injuries or surgeries. However, overcrowded waiting lists often hamper doctors’ ability to monitor patients’ recovery progress in person. Deep Learning methods offer a solution by enabling doctors to optimize their time with each patient and distinguish between those requiring specific attention and those making positive progress. Doctors use the flexion angle of limbs as a cue to assess a patient’s mobility level during rehabilitation. From a Computer Vision perspective, this task can be framed as automatically estimating the pose of the target body limbs in an image. The objectives of this study can be summarized as follows: (i) evaluating and comparing multiple pose estimation methods; (ii) analyzing how the subject’s position and camera viewpoint impact the estimation; and (iii) determining whether 3D estimation methods are necessary or whether 2D estimation suffices for this purpose. To conduct this technical study, and given the limited availability of public datasets related to physical rehabilitation exercises, we introduce a new dataset featuring 27 individuals performing eight diverse physical rehabilitation exercises focusing on various limbs and body positions. Each exercise was recorded using five RGB cameras capturing different viewpoints of the person. An infrared tracking system, OptiTrack, was used to establish the ground-truth positions of the joints in the limbs under study. The results, supported by statistical tests, show that not all state-of-the-art pose estimators perform equally in the presented situations (e.g., patient lying on the stretcher vs. standing). Statistical differences exist between camera viewpoints, with the frontal view being the most convenient. Additionally, the study concludes that 2D pose estimators are adequate for estimating joint angles given the selected camera viewpoints.
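The flexion-angle cue mentioned above reduces to elementary geometry once joint positions are estimated. The following is a minimal sketch (not code from the paper) of how an angle at a joint can be computed from three 2D keypoints, e.g. hip, knee and ankle for knee flexion:

```python
import math

def flexion_angle(a, b, c):
    """Angle at joint b (degrees) formed by segments b->a and b->c.

    a, b, c are (x, y) image coordinates of three joints,
    e.g. hip, knee and ankle for knee flexion.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    # Clamp to guard against floating-point drift outside [-1, 1]
    cos_angle = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_angle))

# Fully extended leg: hip, knee and ankle collinear
print(round(flexion_angle((0, 0), (0, 1), (0, 2)), 1))  # 180.0
# Right-angle bend at the knee
print(round(flexion_angle((0, 0), (0, 1), (1, 1)), 1))  # 90.0
```

Note that an angle measured from 2D projections is viewpoint-dependent, which is precisely why the study compares camera viewpoints before concluding that 2D estimation suffices.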
Fiducial Objects: Custom Design and Evaluation García-Ruiz, Pablo; Romero-Ramirez, Francisco J; Muñoz-Salinas, Rafael ...
Sensors (Basel, Switzerland), 12/2023, Volume 23, Issue 24. Journal Article. Peer-reviewed. Open access.
Camera pose estimation is vital in fields like robotics, medical imaging, and augmented reality. Fiducial markers, specifically ArUco and Apriltag, are preferred for their efficiency. However, their accuracy and viewing angle are limited when used as single markers. Custom fiducial objects have been developed to address these limitations by attaching markers to 3D objects, enhancing visibility from multiple viewpoints and improving precision. Existing methods mainly use square markers on non-square object faces, leading to inefficient space use. This paper introduces a novel approach for creating fiducial objects with custom-shaped markers that optimize face coverage, enhancing space utilization and marker detectability at greater distances. Furthermore, we present a technique for the precise configuration estimation of these objects using multiviewpoint images. We provide the research community with our code, tutorials, and an application to facilitate the building and calibration of these objects. Our empirical analysis assesses the effectiveness of various fiducial objects for pose estimation across different conditions, such as noise levels, blur, and scale variations. The results suggest that our customized markers significantly outperform traditional square markers, marking a positive advancement in fiducial marker-based pose estimation methods.
Environment landmarks are generally employed by visual SLAM (vSLAM) methods in the form of keypoints. However, these landmarks are unstable over time because they belong to areas that tend to change, e.g., shadows or moving objects. To solve this, other authors have proposed combining keypoints with artificial markers distributed in the environment so as to facilitate the tracking process in the long run. Artificial markers are special elements (similar to beacons) that can be permanently placed in the environment to facilitate tracking. In any case, these systems keep a set of keypoints that is not likely to be reused, thus unnecessarily increasing the computing time required for tracking. This paper proposes a novel visual SLAM approach that efficiently combines keypoints and artificial markers, allowing for a substantial reduction in the computing time and memory required without noticeably degrading the tracking accuracy. In the first stage, our system creates a map of the environment using both keypoints and artificial markers, but once the map is created, the keypoints are removed and only the markers are kept. Thus, our map stores only long-lasting features of the environment (i.e., the markers). Then, for localization purposes, our algorithm uses the marker information along with temporary keypoints created just at the time of tracking, which are removed after a while. Since our algorithm keeps only a small subset of recent keypoints, it is faster than the state-of-the-art vSLAM approaches. The experimental results show that our proposed sSLAM compares favorably with ORB-SLAM2, ORB-SLAM3, OpenVSLAM and UcoSLAM in terms of speed, without statistically significant differences in accuracy.
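The map policy described above (markers kept forever, keypoints expiring shortly after creation) can be sketched with a simple time-to-live rule. All class and field names here are illustrative, not taken from the sSLAM code:

```python
import time

class HybridMap:
    """Illustrative map: permanent markers plus short-lived keypoints."""

    def __init__(self, keypoint_ttl=2.0):
        self.markers = {}        # marker_id -> pose; never removed
        self.keypoints = {}      # kp_id -> (3D point, creation time)
        self.keypoint_ttl = keypoint_ttl

    def add_marker(self, marker_id, pose):
        self.markers[marker_id] = pose

    def add_keypoint(self, kp_id, point, now=None):
        t = time.monotonic() if now is None else now
        self.keypoints[kp_id] = (point, t)

    def prune(self, now=None):
        """Drop keypoints older than the TTL; markers always survive."""
        now = time.monotonic() if now is None else now
        self.keypoints = {k: (p, t) for k, (p, t) in self.keypoints.items()
                          if now - t <= self.keypoint_ttl}

m = HybridMap(keypoint_ttl=1.0)
m.add_marker(7, "marker_pose")
m.add_keypoint(1, (0.0, 0.0, 0.0), now=0.0)
m.prune(now=5.0)
print(sorted(m.markers), sorted(m.keypoints))  # [7] []
```

The point of the sketch is the asymmetry: pruning touches only the keypoint store, which is why the retained map stays small and tracking stays fast.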
•This work tackles the marker identification process as a classification problem.
•Methodology proposed to train the classifiers with a synthetic dataset of markers.
•Our proposal can identify markers under very difficult image conditions.
•The proposed method performs significantly better than previous approaches.
Many intelligent systems, such as assistive robots, augmented reality trainers or unmanned vehicles, need to know their physical location in the environment in order to fulfill their task. While relying exclusively on natural landmarks for that task is the preferred option, their use is somewhat limited because the proposed methods are complex, require high computational power, and are not reliable in all environments. On the other hand, artificial landmarks can be placed in order to alleviate these problems. In particular, square fiducial markers are one of the most popular tools for camera pose estimation due to their high performance and precision. However, the state-of-the-art methods still perform poorly under difficult image conditions, such as camera defocus, motion blur, small scale or non-uniform lighting.
This paper proposes a method to robustly detect this type of landmark under challenging image conditions present in realistic scenarios. To do so, first, we re-define the marker identification problem as a classification one based on state-of-the-art machine learning techniques. Second, we propose a procedure to create a training dataset of synthetically generated images affected by several challenging transformations. Third, we show that, in this problem, a classifier can be trained using exclusively synthetic data, performing well in real and challenging conditions. Different types of classifiers have been tested to prove the validity of our proposal (namely, Multilayer Perceptron (MLP), Convolutional Neural Network (CNN) and Support Vector Machine (SVM)), and statistical analyses have been performed in order to determine the best approach for our problem. Finally, the obtained classifiers have been compared to the ArUco and AprilTags fiducial marker systems in challenging video sequences. The results obtained show that the proposed method performs significantly better than previous approaches, making the use of this technology more reliable in a wider range of realistic scenarios such as outdoor scenes or fast moving cameras.
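Generating synthetically degraded training images, as in the second step above, amounts to composing image-corruption operators. The sketch below is a stand-in, not the paper's pipeline: it applies two example transformations (salt-and-pepper noise and a box blur) to a grayscale image held as a list of rows of 0-255 ints; the parameter names and values are illustrative.

```python
import random

def degrade(img, noise_prob=0.05, blur_radius=1, seed=0):
    """Salt-and-pepper noise followed by a box blur (illustrative
    stand-ins for conditions such as defocus or sensor noise)."""
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    # Flip a fraction of pixels to pure black or white
    noisy = [[rng.choice((0, 255)) if rng.random() < noise_prob else v
              for v in row] for row in img]
    # Average each pixel with its neighbourhood
    blurred = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [noisy[j][i]
                    for j in range(max(0, y - blur_radius),
                                   min(h, y + blur_radius + 1))
                    for i in range(max(0, x - blur_radius),
                                   min(w, x + blur_radius + 1))]
            row.append(sum(vals) // len(vals))
        blurred.append(row)
    return blurred
```

Rendering clean marker bit-patterns and passing them through such operators (with randomized parameters) yields an arbitrarily large labeled training set without capturing a single real image, which is the property the abstract exploits.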
People identification in video based on the way they walk (i.e., gait) is a relevant task in computer vision with a noninvasive approach. Standard and current approaches typically derive gait signatures from sequences of binary energy maps of subjects extracted from images, but this process introduces a large amount of non-stationary noise, thus conditioning their efficacy. In contrast, in this paper we focus on the raw pixels, or simple functions derived from them, letting advanced learning techniques extract relevant features. Therefore, we present a comparative study of different convolutional neural network (CNN) architectures using three different modalities (i.e., gray pixels, optical flow channels and depth maps) on two widely adopted and challenging datasets: TUM-GAID and CASIA-B. In addition, we perform a comparative study between different early and late fusion methods used to combine the information obtained from each kind of modality. Our experimental results suggest that (1) the raw pixel values represent a competitive input modality compared to the traditional state-of-the-art silhouette-based features (e.g., GEI), since equivalent or better results are obtained; (2) fusing the raw pixel information with information from optical flow and depth maps yields state-of-the-art results on the gait recognition task with an image resolution several times smaller than in previously reported results; and (3) the selection and the design of the CNN architecture are critical points that can make the difference between state-of-the-art results and poor ones.
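Of the fusion strategies compared above, late fusion is the simplest to state: each modality produces per-class scores, and the scores are combined after classification. A minimal sketch (modality names and weights are illustrative):

```python
def late_fusion(score_dicts, weights=None):
    """Weighted average of per-class scores from several modalities.

    score_dicts: list of {class_label: score} dicts, one per modality
    (e.g. gray pixels, optical flow, depth maps).
    """
    if weights is None:
        weights = [1.0 / len(score_dicts)] * len(score_dicts)
    fused = {}
    for w, scores in zip(weights, score_dicts):
        for label, s in scores.items():
            fused[label] = fused.get(label, 0.0) + w * s
    return fused

gray  = {"subj_A": 0.7, "subj_B": 0.3}
flow  = {"subj_A": 0.4, "subj_B": 0.6}
depth = {"subj_A": 0.8, "subj_B": 0.2}
fused = late_fusion([gray, flow, depth])
print(max(fused, key=fused.get))  # subj_A
```

Early fusion, by contrast, concatenates the modality inputs (or intermediate features) before classification, so a single network must learn the cross-modal interactions; the paper evaluates both families empirically.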
Human interaction recognition (HIR) is a significant challenge in computer vision that focuses on identifying human interactions in images and videos. HIR presents great complexity due to factors such as pose diversity, varying scene conditions, or the presence of multiple individuals. Recent research has explored different approaches to address it, with an increasing emphasis on human pose estimation. In this work, we propose Proxemics-Net++, an extension of the Proxemics-Net model, capable of addressing the problem of recognizing human interactions in images through two different tasks: the identification of the types of “touch codes” or proxemics and the identification of the type of social relationship between pairs. To achieve this, we use RGB and body pose information together with the state-of-the-art deep learning architecture ConvNeXt as the backbone. We performed an ablative analysis to understand how the combination of RGB and body pose information affects these two tasks. Experimental results show that body pose information contributes significantly to proxemic recognition (the first task), improving on the existing state of the art, while its contribution to the classification of social relations (the second task) is limited due to the ambiguity of labelling in this problem, resulting in RGB information being more influential in this task.
•Universal fall detection pipeline that can be used in any dataset.
•Multi-task learning scheme to produce multiple outputs from a single input.
•State-of-the-art results on four different datasets.
Background and Objective: Fall detection is an important problem for vulnerable sectors of the population such as elderly people, who frequently live alone. Note that a fall can be very dangerous for them if they cannot ask for help. Hence, in those situations, an automatic system that detects the fall and informs emergency services of it and of the subject’s identity could help save lives. This way, they would know not only when but also who to help. Thus, our objective is to develop a new approach, based on deep learning, for fall detection and people identification that can be used on different datasets without any fine-tuning of the model parameters.
Methods: We present a dataset-independent deep learning-based model that, by employing a multi-task learning approach, uses raw inertial information as input to solve two tasks simultaneously: fall detection and subject identification. In this way, our approach is able to automatically learn the best representations without any constraint introduced by pre-processed features.
Results: Our cross-dataset classifier is able to detect falls with more than 98% accuracy in four datasets recorded under different conditions (i.e. accelerometer device, sampling rate, sequence length, age of the subjects, etc.). Moreover, the number of false positives is very low (less than 1.6% on average), establishing a new state of the art. Finally, our classifier is also capable of correctly identifying people with an average accuracy of 79.6%.
Conclusions: The presented approach performs both tasks (fall detection and people identification) using a single model and achieves real-time execution. The obtained results allow us to assert that a single model can handle both fall detection and people identification under different conditions, easing its practical deployment, as it is not necessary to retrain the model for new subjects.
LAEO-Net++: Revisiting People Looking at Each Other in Videos Marin-Jimenez, Manuel J.; Kalogeiton, Vicky; Medina-Suarez, Pablo ...
IEEE Transactions on Pattern Analysis and Machine Intelligence, 06/2022, Volume 44, Issue 6. Journal Article. Peer-reviewed. Open access.
Capturing the 'mutual gaze' of people is essential for understanding and interpreting the social interactions between them. To this end, this paper addresses the problem of detecting people Looking At Each Other (LAEO) in video sequences. For this purpose, we propose LAEO-Net++, a new deep CNN for determining LAEO in videos. In contrast to previous works, LAEO-Net++ takes spatio-temporal tracks as input and reasons about the whole track. It consists of three branches, one for each character's tracked head and one for their relative position. Moreover, we introduce two new LAEO datasets: UCO-LAEO and AVA-LAEO. A thorough experimental evaluation demonstrates the ability of LAEO-Net++ to successfully determine if two people are LAEO and the temporal window where it happens. Our model achieves state-of-the-art results on the existing TVHID-LAEO video dataset, significantly outperforming previous approaches. Finally, we apply LAEO-Net++ to a social network, where we automatically infer the social relationship between pairs of people based on the frequency and duration that they LAEO, and show that LAEO can be a useful tool for guided search of human interactions in videos.
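The LAEO condition itself is geometric: each person's gaze direction must point (within some tolerance) at the other person's head. The model above learns this from data, but the underlying test can be sketched directly; the 2D setup, function names and the 15-degree tolerance here are illustrative assumptions, not the paper's formulation:

```python
import math

def looking_at(p_from, p_to, gaze, max_angle_deg=15.0):
    """True if the gaze direction at p_from points towards p_to
    within max_angle_deg. Points and gaze are 2D (x, y) tuples."""
    dx, dy = p_to[0] - p_from[0], p_to[1] - p_from[1]
    dist = math.hypot(dx, dy)
    g = math.hypot(*gaze)
    if dist == 0 or g == 0:
        return False
    cos_angle = (dx * gaze[0] + dy * gaze[1]) / (dist * g)
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg

def laeo(head_a, head_b, gaze_a, gaze_b):
    """Two people are LAEO when each one's gaze points at the other's head."""
    return looking_at(head_a, head_b, gaze_a) and looking_at(head_b, head_a, gaze_b)

# Heads facing each other along the x axis
print(laeo((0, 0), (5, 0), (1, 0), (-1, 0)))  # True
# Second person looking away
print(laeo((0, 0), (5, 0), (1, 0), (1, 0)))   # False
```

A learned model such as LAEO-Net++ goes beyond this single-frame check by reasoning over whole head tracks, which makes it robust to noisy per-frame gaze estimates.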