Synthetic datasets, for which we propose the term synthsets, are not a novelty but have become a necessity. Although they have been used in computer vision since 1989, helping to solve the problem of collecting a sufficient amount of annotated data for supervised machine learning, the intensive development of methods and techniques for their generation belongs to the last decade. Nowadays, the question has shifted from whether to use synthetic datasets to how to create them optimally. Motivated by the idea of discovering best practices for building synthetic datasets that represent dynamic environments (such as traffic, crowds, and sports), this study provides an overview of existing synthsets in the computer vision domain. We have analyzed the methods and techniques of synthetic dataset generation: from the first low-resolution generators to the latest generative adversarial training methods, and from simple techniques for improving realism by adding global noise to those intended to close domain and distribution gaps. The analysis identifies nine distinct but potentially intertwined methods and yields a synthset generation diagram consisting of 17 individual processes that synthset creators can follow and choose from, depending on the specific requirements of their task.
Due to the growing number of people who engage in adrenaline activities or adventure tourism and stay in the mountains and other inaccessible places, there is an increasing need to organize search and rescue (SAR) operations to provide assistance and health care to the injured. The goal of a SAR operation is to search the largest possible area in the shortest possible time and find the lost or injured person. Today, unmanned aerial vehicles (UAVs, or drones) are increasingly involved in search operations, as they can cover a large, controlled area in a short amount of time. However, the detailed examination of a large amount of recorded material remains a problem. Even for an expert, it is not easy to find the people being searched for: they appear relatively small within the covered area, are often sheltered by vegetation or merged with the ground, and may be in unusual positions due to falls, injuries, or exhaustion. Therefore, the automatic detection of persons and objects in images and videos taken by drones in these operations is very significant. In this paper, the reliability of existing state-of-the-art detectors, such as Faster R-CNN, YOLOv4, RetinaNet, and Cascade R-CNN, was investigated on the VisDrone benchmark and the custom-made dataset SARD, built to simulate rescue scenes. After training the models on the selected datasets, the detection results were compared. Because of its high speed and accuracy and its small number of false detections, the YOLOv4 detector was chosen for further examination. YOLOv4 results for different network sizes, detection accuracies, and transfer learning settings were analyzed. The model's robustness to weather conditions and motion blur was also investigated. The paper proposes a model that can be used in SAR operations because of its excellent results in detecting people in search and rescue scenarios.
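To make the evaluation step concrete, the following is a minimal Python sketch, not the authors' actual code, of how detections can be scored against ground-truth boxes at an IoU threshold of 0.5, the basis of precision/recall comparisons like those reported above. The box format (x, y, w, h) and all names are our assumptions.

```python
# Minimal sketch (our assumption, not the paper's code) of matching
# detections to ground-truth boxes at IoU >= 0.5 and computing
# precision and recall for one image. Boxes are (x, y, w, h).
def iou(a, b):
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truth, thr=0.5):
    matched = set()  # ground-truth boxes already claimed by a detection
    tp = 0
    for det in detections:  # assume detections sorted by confidence
        best_j = max(range(len(ground_truth)),
                     key=lambda j: iou(det, ground_truth[j]),
                     default=None)
        if best_j is not None and best_j not in matched \
                and iou(det, ground_truth[best_j]) >= thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```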
Global terrorist threats and illegal migration have intensified concerns for the security of citizens, and every effort is made to exploit all available technological advances to prevent adverse events and protect people and their property. Because they can be used at night and in weather conditions where RGB cameras do not perform well, thermal cameras have become an important component of sophisticated video surveillance systems. In this paper, we investigate the task of automatic person detection in thermal images using convolutional neural network models originally intended for detection in RGB images. We compare the performance of standard state-of-the-art object detectors, such as Faster R-CNN, SSD, Cascade R-CNN, and YOLOv3, retrained on a dataset of thermal images extracted from videos that simulate illegal movements around the border and in protected areas. The videos were recorded at night in clear weather, rain, and fog, at different ranges, and with different movement types. YOLOv3 was significantly faster than the other detectors while achieving performance comparable to the best, so it was used in further experiments. We experimented with different training dataset settings in order to determine the minimum number of images needed to achieve good detection results on the test datasets, and achieved excellent average detection accuracy for all test scenarios even though a modest set of thermal images was used for training. We also test our trained model on several well-known and widely used thermal imaging datasets. In addition, we present results on recognizing humans and animals in thermal images, which is particularly important for detecting people sneaking around objects and crossing borders illegally. Finally, we present our original thermal dataset used for experimentation, which contains surveillance videos recorded under different weather and shooting conditions.
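The training-set-size experiments described above can be organized along the following lines; this is only an illustrative sketch in which train_and_evaluate is a hypothetical stub of ours, standing in for the actual retraining and evaluation pipeline (e.g., retraining YOLOv3 and measuring test performance).

```python
# Illustrative sketch of a training-set-size experiment: train on nested
# random subsets of growing size and record test performance for each.
import random

def train_and_evaluate(train_images):
    # Hypothetical placeholder: retrain the detector on train_images
    # and return its score on the fixed test set.
    return 0.0

def dataset_size_experiment(all_train_images, sizes, seed=42):
    rng = random.Random(seed)
    shuffled = all_train_images[:]
    rng.shuffle(shuffled)
    results = {}
    for n in sorted(sizes):
        # Nested subsets: each larger training set contains the smaller ones,
        # so differences in results reflect dataset size, not sampling.
        results[n] = train_and_evaluate(shuffled[:n])
    return results
```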
In team sports training scenes, it is common to have many players on the court, each with their own ball, performing different actions. Our goal is to detect all players on the handball court and determine the most active player, who performs the given handball technique. This is a very challenging task for which, apart from an accurate object detector able to deal with complex cluttered scenes, additional information is needed to determine the active player. We propose an active player detection method that combines the YOLO object detector, activity measures, and tracking methods to detect and track active players over time. Different ways of computing player activity were considered, and three activity measures are proposed, based on optical flow, spatiotemporal interest points, and convolutional neural networks. For tracking, we consider the Hungarian assignment algorithm and the more complex Deep SORT tracker, which uses additional visual appearance features to assist the assignment process. We also propose an evaluation measure to assess the performance of the proposed active player detection method. The method was successfully tested on a custom handball video dataset acquired in the wild and on basketball video sequences. The results are discussed, and some typical cases and issues are shown.
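As one concrete instance of the flow-based activity measure and the Hungarian assignment step, consider the following Python sketch (our simplification, not the authors' implementation): it averages optical-flow magnitude inside each detected bounding box and matches boxes between consecutive frames by centre distance.

```python
# Sketch of an optical-flow activity measure per detected player and a
# Hungarian assignment of boxes between consecutive frames. Boxes are
# integer (x, y, w, h); cost and thresholds are our assumptions.
import cv2
import numpy as np
from scipy.optimize import linear_sum_assignment

def flow_activity(prev_gray, curr_gray, boxes):
    """Mean dense optical-flow magnitude inside each (x, y, w, h) box."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)  # per-pixel motion magnitude
    return [mag[y:y + h, x:x + w].mean() for (x, y, w, h) in boxes]

def assign_boxes(prev_boxes, curr_boxes):
    """Frame-to-frame box assignment by centre distance (Hungarian algorithm)."""
    centres = lambda bs: np.array([[x + w / 2, y + h / 2] for x, y, w, h in bs])
    cost = np.linalg.norm(centres(prev_boxes)[:, None]
                          - centres(curr_boxes)[None, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```

The most active player would then be the tracked identity with the highest accumulated activity over time; Deep SORT additionally incorporates appearance features into the assignment cost.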
Human Action Recognition (HAR) is a challenging task used in sports such as volleyball, basketball, soccer, and tennis to detect players and recognize their actions and team activities during training, matches, warm-ups, or competitions. HAR aims to detect the person performing an action in an unknown video sequence, determine the action's duration, and identify the action type. The main idea of HAR in sports is to monitor a player's performance: to detect the player, track their movements, recognize the performed action, compare various actions, compare different kinds and skill levels of performance, or produce automatic statistical analyses.
An action in sports refers to a set of physical movements performed by a player in order to complete a task, using their body or interacting with objects or other persons; such actions can be of varying complexity. Because of that, a novel systematization of actions based on complexity, level of performance, and interactions is proposed.
Existing overviews of HAR research focus on various methods evaluated on publicly available datasets, including actions of everyday activities. That is a good starting point; however, HAR is increasingly applied in sports and is becoming more directed towards distinguishing similar actions within a particular sports domain. Therefore, this paper presents, as its main contribution, an overview of HAR applications in sports based primarily on computer vision, along with popular publicly available datasets for this purpose.
Keywords: Machine Learning, Human Action Recognition, Action Systematization, Sports Dataset, Human Action Recognition in Sports, Sport.
In this paper, we present automatic deep-learning methods for pipeline detection in underwater environments. Seafloor pipelines are critical infrastructure for oil and gas transport, and their inspection is required to verify their integrity and determine the need for maintenance. Underwater conditions are harsh and challenging for image recognition due to light refraction and absorption, poor visibility, scattering, and attenuation, which often cause poor image quality. Modern machine-learning object detectors utilize Convolutional Neural Networks (CNNs) and require a training dataset of sufficient quality. In this paper, six deep-learning CNN detectors for underwater object detection were trained and tested: five based on the You Only Look Once (YOLO) architecture (YOLOv4, YOLOv4-Tiny, CSP-YOLOv4, YOLOv4@Resnet, YOLOv4@DenseNet) and one on the Faster Region-based CNN (R-CNN) architecture. The models' performance was evaluated in terms of detection accuracy, mean average precision (mAP), and processing speed measured in Frames Per Second (FPS) on a custom dataset containing underwater pipeline images. In the study, YOLOv4 outperformed the other models for underwater pipeline object detection, achieving an mAP of 94.21% with the ability to detect objects in real time. Based on the literature review, this is one of the pioneering works in this field.
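Processing speed in FPS, as reported above, can be measured with a simple timing loop like the following sketch; detector stands for any callable inference model (a hypothetical placeholder of ours), and a few warm-up iterations are excluded, as is common practice.

```python
# Minimal FPS measurement sketch: time repeated inference over a list of
# frames, excluding warm-up runs. `detector` is any callable that takes a
# frame and returns detections (hypothetical placeholder).
import time

def measure_fps(detector, frames, warmup=5):
    for f in frames[:warmup]:
        detector(f)  # warm-up: caches, lazy initialization, GPU transfer
    start = time.perf_counter()
    for f in frames[warmup:]:
        detector(f)
    elapsed = time.perf_counter() - start
    return (len(frames) - warmup) / elapsed
```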
Automatic analysis, recognition, and prediction of the behaviour of large-scale crowds in video-surveillance data is a research field of paramount importance for the security of modern societies, serving to predict and help prevent disasters in public places where crowds gather. The paper proposes a novel method for generating meta-tracklets and recognizing dominant motion patterns as a basis for automatic crowd behaviour analysis at the macroscopic level, where a crowd is treated as a single entity. The basic characteristic of macroscopic crowd scenes is that it is impossible to detect and track individuals in the scene. The idea of the proposed method is to recognize dominant crowd motion patterns while avoiding time-consuming and error-prone crowd segmentation, crowd tracking, and detection of regions of interest, thereby accelerating the process of determining dominant motion patterns and recognizing crowd behaviour. The method is inspired by a quantum mechanical approach: it combines a set of particles, treated as particles in quantum mechanics; tracklets of particle advection in a video clip; and the interaction of wave functions spreading out from the particle positions. A wave function is expressed in the form of an asymmetric potential function. Peaks of the wave field define the most probable particle flow, which in turn defines a meta-tracklet. Dominant motion patterns are recognized by applying fuzzy predicate functions, which encode a combination of common-sense and human expert knowledge about crowd motion, to the meta-tracklets. Experimental results of the proposed method on a subset of the UCF dataset and on AGORASET crowd simulation videos show promising dominant motion pattern recognition.
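The wave-field idea can be loosely illustrated as follows. The kernel below is a hypothetical asymmetric potential of our own choosing (a forward-facing anisotropic Gaussian aligned with each particle's motion), not the paper's actual potential function; it serves only to show how field peaks could indicate the most probable flow.

```python
# Loose, hypothetical sketch of the wave-field idea: each particle deposits
# an asymmetric potential elongated along its motion direction, and peaks
# of the accumulated field suggest the most probable particle flow.
import numpy as np

def wave_field(h, w, positions, velocities, sigma_along=8.0, sigma_across=3.0):
    ys, xs = np.mgrid[0:h, 0:w]
    field = np.zeros((h, w))
    for (px, py), (vx, vy) in zip(positions, velocities):
        theta = np.arctan2(vy, vx)
        dx, dy = xs - px, ys - py
        # Rotate pixel offsets into the particle's motion frame.
        a = dx * np.cos(theta) + dy * np.sin(theta)   # along motion
        c = -dx * np.sin(theta) + dy * np.cos(theta)  # across motion
        # Asymmetric: only the forward half-plane (a >= 0) contributes.
        forward = (a >= 0).astype(float)
        field += forward * np.exp(-(a / sigma_along) ** 2
                                  - (c / sigma_across) ** 2)
    return field

# Usage: peak = np.unravel_index(field.argmax(), field.shape)
```

A meta-tracklet would then follow successive peaks of such a field over time.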
Player pose estimation is particularly important for sports because it enables more accurate monitoring of athlete movements and performance, recognition of player actions, analysis of techniques, and evaluation of action execution accuracy. All of these tasks are extremely demanding in sports that involve rapid movements with inconsistent speed and position changes at varying distances from the camera and with frequent occlusions, especially in team sports with many players on the field. A prerequisite for recognizing a player's actions in video footage and comparing their poses during the execution of an action is detecting the player's pose in each element of an action or technique. First, the player's 2D pose is estimated in each video frame and lifted to a 3D pose; then, using a tracking method, the player's poses are grouped into a sequence to construct the series of elements of a particular action. Considering that action recognition and comparison depend significantly on the accuracy of the methods used to estimate and track player poses in real-world conditions, this paper provides an overview and analysis of methods that can be used for player pose estimation and tracking with a monocular camera, along with evaluation metrics, using handball scenarios as an example. We have evaluated the applicability and robustness of 12 selected two-stage deep learning methods for 3D pose estimation on a public dataset and a custom dataset of handball jump shots, on which they were not trained and where never-before-seen poses may occur. Furthermore, this paper proposes methods for retargeting and smoothing the 3D sequence of poses, which experimentally improved performance for all tested models. Additionally, we evaluated the applicability and robustness of five state-of-the-art tracking methods on a public dataset and a custom dataset of handball training recorded with a monocular camera. The paper ends with a discussion highlighting the shortcomings of the pose estimation and tracking methods, reflected in problems with locating key skeletal points and in generating poses that violate plausible human anatomy, which consequently reduces the overall accuracy of action recognition.
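As a minimal example of the kind of temporal smoothing that can be applied to an estimated 3D pose sequence (the paper's retargeting and smoothing procedures are more elaborate; this centred moving average is only an illustration under our own assumptions about the data layout):

```python
# Simple centred moving-average smoothing of a (T, J, 3) 3D pose sequence:
# T frames, J joints, 3 coordinates per joint. Edge frames are padded by
# repetition so the output keeps the original length.
import numpy as np

def smooth_pose_sequence(poses, window=5):
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(poses, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    # Convolve each joint coordinate independently along the time axis.
    return np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="valid"), 0, padded)
```

Such smoothing suppresses frame-to-frame jitter in the lifted 3D joints before the sequence is used for action recognition or comparison.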
In recent years, multi-person pose forecasting has gained significant attention due to its potential applications in fields such as computer vision, robotics, sports analysis, and human-robot interaction. In this paper, we propose a novel deep learning model for multi-person pose forecasting called MPFSIR (multi-person pose forecasting and social interaction recognition) that achieves results comparable to state-of-the-art models with up to 30 times fewer parameters. In addition, the model includes a social interaction prediction component to model and predict interactions between individuals. We evaluate our model on three benchmark datasets, 3DPW, CMU-Mocap, and MuPoTS-3D, compare it with state-of-the-art methods, and provide an ablation study analyzing the impact of the different model components. Experimental results show the effectiveness of MPFSIR in accurately predicting future poses and capturing social interactions. Furthermore, we introduce the metric MW-MPJPE for evaluating pose forecasting performance, which focuses on motion dynamics. Overall, our results highlight the potential of MPFSIR for predicting the poses of multiple people and understanding social dynamics in complex scenes and in various practical applications, especially where computational resources are limited. The code is available at https://github.com/RomeoSajina/MPFSIR.
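The exact definition of MW-MPJPE is given in the paper (and in the linked repository); as a rough sketch of a motion-weighted variant of MPJPE, with the weighting scheme being entirely our own assumption, one could weight each frame's joint error by the magnitude of ground-truth motion in that frame:

```python
# Hypothetical motion-weighted MPJPE over (T, J, 3) sequences: frames with
# larger ground-truth motion get larger weights. This is our guess at the
# spirit of MW-MPJPE, not its published definition.
import numpy as np

def mw_mpjpe(pred, gt, eps=1e-8):
    # Per-frame MPJPE: mean joint position error in each frame, shape (T,).
    err = np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)
    # Per-frame motion magnitude from ground-truth displacement, shape (T-1,).
    motion = np.linalg.norm(np.diff(gt, axis=0), axis=-1).mean(axis=-1)
    # Repeat the first motion value so weights align with all T frames.
    w = np.concatenate([[motion[0]], motion]) + eps
    return float((w * err).sum() / w.sum())
```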
People enjoy spending time in the wilderness for numerous reasons. However, they occasionally get lost or injured, and their survival depends on being found and rescued efficiently, in the shortest possible time. A search and rescue (SAR) operation is launched after the accident is reported, and all possible resources are activated. The inclusion of drones in SAR operations has enabled the use of computer vision methods to automatically detect persons in aerial imagery. When searching by drone, preference is given to oblique photographs, which cover a larger area within a single image and thus reduce the search time. Unlike vertical photographs, oblique photographs involve a significant scale change, making it challenging to locate a person in the real world and determine their distance from the drone. To solve this problem, encouraged by our previous successful simulations, we explored the possibility of applying the raycast method to person geolocalization and distance determination in real-world scenarios. In this paper, we propose a system able to precisely geolocate persons automatically detected in offline-processed images recorded during a SAR mission. After a series of experiments on terrains of different configurations and complexity, using a custom-made 3D terrain generator and raycaster along with a deep neural network-based person detector trained on our custom dataset, we defined a raycast-based method for geolocating detected persons that allows the use of low-cost commercial drones with a monocular camera and no Real-Time Kinematic module, while emulating a laser rangefinder during offline image analysis. Our person geolocation method overcomes the problems faced by previous methods and, using a single flight sequence with only 4 consecutive detections, significantly outperforms the previous best results, with a reliability of 42.85% (geolocation error of 0.7 m on a recording from a 30 m height). Moreover, only 247 s are needed for offline processing of the data recorded during a 21-minute drone flight covering an area of approximately 10 ha, showing that the proposed method can be used effectively in actual SAR missions. We also propose a new evaluation metric (ErrDist) for person geolocalization and provide recommendations for using the proposed system for person detection and geolocation in real-world scenarios.
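The core raycasting step can be illustrated with a simple ray-marching sketch against a terrain heightmap. This is a simplification under our own assumptions (a uniform-grid heightmap, a fixed marching step, and a precomputed world-space ray direction derived from the detection pixel and the camera pose), not the authors' raycaster.

```python
# Sketch of ray marching from the camera through a detection's viewing ray
# until the ray passes below the terrain surface. heightmap[i, j] holds
# terrain elevation; cell is the grid spacing in metres; coordinates are
# (x, y, z) with z as elevation, x mapping to columns and y to rows.
import numpy as np

def raycast_to_terrain(cam_pos, ray_dir, heightmap,
                       cell=1.0, step=0.5, max_dist=500.0):
    ray_dir = ray_dir / np.linalg.norm(ray_dir)
    for t in np.arange(0.0, max_dist, step):
        p = cam_pos + t * ray_dir
        i, j = int(p[1] / cell), int(p[0] / cell)
        if 0 <= i < heightmap.shape[0] and 0 <= j < heightmap.shape[1]:
            if p[2] <= heightmap[i, j]:
                return p  # approximate ground hit: the person's position
    return None  # ray left the map or never reached the surface
```

The returned intersection point approximates the detected person's world position; a metric like ErrDist would then measure the distance between this estimate and the person's true location.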