A fixed CCTV camera leaves blind spots even when its visible range is maximized with pan-tilt and zoom functions. A representative solution is to operate multiple fixed CCTVs, but this requires a large amount of additional equipment (e.g., wiring, fixtures, monitors) in proportion to the number of cameras. Another solution is to use camera-equipped drones; however, a drone's operating time is much shorter than that of a fixed CCTV. The operating time can be extended by flying multiple drones that take over the mission one at a time: a drone that needs to recharge its battery returns to a drone port and re-enters a ready (fully recharged) state for the next mission, so that missions run continuously. In this paper, we propose a system for precise positioning and stable landing on the drone port using a small drone equipped with a fixed, forward-facing monocular camera, enabling efficient remote operation and seamless execution of follow-up missions when the drone serves as a security CCTV. We implement the proposed system, operate it, and verify its feasibility.
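The abstract does not detail the landing vision pipeline, so the sketch below only illustrates one common way to prototype drone-port detection with a fixed monocular camera: an ArUco fiducial printed on the port, detected and localized with OpenCV. The camera intrinsics, marker size, and dictionary are assumed values, not taken from the paper.

```python
# Hypothetical sketch of marker-based drone-port detection with a monocular
# camera (OpenCV >= 4.7 aruco API). Not the paper's actual pipeline.
import cv2
import numpy as np

# Assumed intrinsics from a prior calibration (placeholder values).
K = np.array([[920.0, 0.0, 640.0],
              [0.0, 920.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)            # assume negligible lens distortion
MARKER_SIDE_M = 0.15          # assumed side length of the port marker (m)

# Marker corners in the marker frame, in the order ArUco reports them
# (top-left, top-right, bottom-right, bottom-left).
s = MARKER_SIDE_M / 2.0
OBJ_PTS = np.array([[-s, s, 0], [s, s, 0], [s, -s, 0], [-s, -s, 0]],
                   dtype=np.float32)

detector = cv2.aruco.ArucoDetector(
    cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50))

def port_offset(frame):
    """Return the camera-frame (x, y, z) offset to the port marker, or None."""
    corners, ids, _ = detector.detectMarkers(frame)
    if ids is None:
        return None
    ok, _, tvec = cv2.solvePnP(OBJ_PTS, corners[0].reshape(4, 2), K, dist,
                               flags=cv2.SOLVEPNP_IPPE_SQUARE)
    return tvec.ravel() if ok else None
```

In a real system, this offset would feed the flight controller's position setpoints during the descent onto the port.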
We study the problem of unknown foreground discovery in image streaming scenarios, where no prior information about the dynamic scene is assumed. Contrary to existing co-segmentation principles, where the entire dataset is given, in streams new information emerges as content appears and disappears continually. Any object classes to be observed in the scene are unknown; therefore, no detection model can be trained for the specific class set. We also assume there is no available repository of trained features from convolutional neural nets, i.e., transfer learning is not applicable. We focus on the progressive discovery of foreground, which may or may not correspond to contextual objects of interest, depending on the camera trajectory or, in general, the perceived motion. Without any form of supervision, we construct, in a bottom-up fashion, dynamic graphs that capture region saliency and relative topology. Such graphs are continually updated over time, and along with occlusion information, as a fundamental property of the foreground–background relationship, foreground is computed for each frame of the stream. We validate our method using indoor and outdoor scenes of varying complexity with respect to content, object motion, camera trajectory, and occlusions.
•Dynamic, bottom-up construction of the foreground.
•Fully unsupervised, without any prior knowledge or training.
•Expands the existing notion of unsupervised detection to streaming data.
•Based on principles of co-segmentation while accommodating unseen data.
•Light computation, for integration with path planning/visual coverage in active exploration.
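The dynamic-graph construction cannot be reconstructed from the abstract alone. As a much simpler point of reference for the same fully unsupervised, streaming setting, the sketch below computes a per-frame foreground mask with a classical background-subtraction baseline; it stands in for, and is not equivalent to, the proposed method.

```python
# Minimal unsupervised foreground baseline for a video stream.
# Illustrates the training-free, streaming setting only; it is NOT the
# graph-based method described in the abstract.
import cv2

cap = cv2.VideoCapture("stream.mp4")     # hypothetical input stream
bg = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels departing from the continually updated background model are
    # flagged as foreground, with no supervision or pretrained features.
    mask = bg.apply(frame)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
cap.release()
```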
Cloud-based computing systems can become oversubscribed due to the budget constraints of their users or limitations of certain resource types. Oversubscription can, in turn, degrade the users' perceived Quality of Service (QoS). The approach we investigate to mitigate both the oversubscription and the incurred cost is based on smart reuse of the computation needed to process service requests (i.e., tasks). We propose a reusing paradigm for tasks that are waiting for execution. This paradigm can be particularly impactful in serverless platforms, where multiple users can request similar services simultaneously. Our motivating application is a multimedia streaming engine that processes media segments on demand. We propose a mechanism to identify various types of "mergeable" tasks and aggregate them to improve the QoS and mitigate the incurred cost. We develop novel approaches to determine when and how to perform task aggregation such that the QoS of other tasks is not affected. Evaluation results show that the proposed mechanism can improve the QoS by significantly reducing the percentage of tasks missing their deadlines, and can reduce the overall time (and thus the incurred cost) of utilizing cloud services by more than 9 percent.
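As a rough illustration of the aggregation idea (not the paper's exact mechanism), the sketch below merges queued tasks that share a work signature, keeps the tightest deadline among merged requests so the most urgent requester's QoS is not hurt, and fans the single result out to every requester. The signature format is a hypothetical example.

```python
# Illustrative task-merging queue; the paper's mergeability detection and
# aggregation policies are richer than this sketch.
from collections import defaultdict

class MergingQueue:
    def __init__(self):
        self.pending = {}                 # signature -> earliest deadline
        self.waiters = defaultdict(list)  # signature -> requesting users

    def submit(self, signature, user, deadline):
        """signature identifies identical work, e.g. (segment_id, codec, bitrate)."""
        self.waiters[signature].append(user)
        if signature in self.pending:
            # Mergeable with a waiting task: run once, keep the tightest
            # deadline so aggregation never delays the most urgent request.
            self.pending[signature] = min(self.pending[signature], deadline)
        else:
            self.pending[signature] = deadline

    def pop_next(self):
        """Dispatch the merged task with the earliest deadline."""
        sig = min(self.pending, key=self.pending.get)
        return sig, self.pending.pop(sig), self.waiters.pop(sig)
```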
•In this paper, we address the online action detection problem.
•This is challenging since only a limited amount of information is available.
•A future frame generation network is proposed to overcome this limitation.
•A video data augmentation method is also exploited to resolve temporal variation.
Online temporal action localization from an untrimmed video stream is a challenging problem in computer vision. It is challenging because i) more than one action instance, as well as background scenes, may appear in an untrimmed video stream, and ii) in online settings, only past and current information is available. Therefore, temporal priors, such as the average action duration of the training data, which have been exploited by previous action detection methods, are not suitable for this task because of the high intra-class variation in human actions. We propose a novel online action detection framework that considers actions as a set of temporally ordered subclasses and leverages a future frame generation network to cope with the limited-information issue outlined above. Additionally, we augment our data by varying the lengths of videos so that the proposed method learns the high intra-class variation in human actions. We evaluate our method on two benchmark datasets, THUMOS’14 and ActivityNet, in an online temporal action localization scenario and demonstrate that its performance is comparable to state-of-the-art methods proposed for offline settings.
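Of the components listed, the length-varying augmentation is simple enough to sketch. The following is one plausible reading of it, with an assumed scale range; the authors' actual procedure may differ.

```python
# Hypothetical sketch of length-varying video augmentation: resample the
# frame sequence to a random temporal scale so the model sees the same
# action at different durations. The scale range is an assumed parameter.
import numpy as np

def vary_length(frames, scale_range=(0.5, 1.5), rng=np.random):
    """frames: array of shape (T, H, W, C); returns a resampled copy."""
    t = len(frames)
    new_t = max(1, round(t * rng.uniform(*scale_range)))
    # Nearest-neighbour index resampling along the time axis.
    idx = np.linspace(0, t - 1, new_t).round().astype(int)
    return frames[idx]
```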
Querying for Interactions. Xarchakos, Ioannis; Koudas, Nick. IEEE Transactions on Knowledge and Data Engineering, 02/2023, Vol. 35, Issue 2. Journal article, peer reviewed.
Advances in deep learning and computer vision have enabled sophisticated information extraction from images and videos. Recent research aims to make objects, their types, and their relative locations first-class citizens for query processing purposes. We initiate research to explore declarative queries over real-time video streams involving objects and their interactions. We seek to efficiently identify frames in which an object is interacting with another in a specific way. We propose the progressive filters (PF) algorithm, which deploys a sequence of inexpensive and less accurate filters to detect the presence of query-specified objects in frames. We demonstrate that PF derives a least-cost sequence of filters given the query objects' current selectivities. Since selectivities may vary as the video evolves, we present a statistical test to determine when to trigger filter re-optimization. Finally, we present Interaction Sheave, a filtering approach that uses learned spatial information about objects and interactions to prune frames that are unlikely to involve the query-specified action between them, thus improving the frame processing rate. We present the results of a thorough experimental evaluation involving real datasets and experimentally demonstrate that our techniques can improve query performance (by up to an order of magnitude) while maintaining a competitive F1-score.
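The least-cost ordering claim evokes the classical predicate-ordering result: for independent filters in a conjunctive cascade, sorting by cost per discarded frame (cost / (1 − selectivity), ascending) minimizes the expected per-frame cost. The sketch below illustrates that rule with made-up costs and selectivities; the paper's own derivation and its statistical re-optimization test are not reproduced here.

```python
# Classical least-cost ordering for a cascade of independent filters.
# Costs and selectivities are assumed to be measured online; the numbers
# below are made up for illustration.

def order_filters(filters):
    """filters: list of (name, cost, selectivity); selectivity is the
    fraction of frames the filter passes (0 < s < 1)."""
    return sorted(filters, key=lambda f: f[1] / (1.0 - f[2]))

def expected_cost(ordered):
    """Expected per-frame cost of running the ordered cascade."""
    total, reach = 0.0, 1.0   # reach = fraction of frames still alive
    for _name, cost, sel in ordered:
        total += reach * cost
        reach *= sel
    return total

cascade = order_filters([("person", 2.0, 0.4),
                         ("car",    1.0, 0.7),
                         ("bike",   0.5, 0.9)])
print(expected_cost(cascade))
```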
Face recognition tasks have seen significantly improved performance due to ConvNets. However, less attention has been given to face verification from videos. This paper makes two contributions along these lines. First, we propose a method, called stream loss, for learning ConvNets using unlabeled videos in the wild. Second, we present an approach for generating a face verification dataset from videos in which the labeled streams can be created automatically, without human annotation. Using this approach, we have assembled a widely scalable dataset, FaceSequence, which includes 1.5M streams capturing ∼500K individuals. Using this dataset, we trained our network to minimize the stream loss. The network achieves accuracy comparable to the state of the art on the LFW and YTF datasets with much smaller model complexity. We also fine-tuned the network using the IJB-A dataset. The validation results show competitive accuracy compared with the best previous video face verification results.
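The abstract does not define the stream loss precisely. One plausible reading, sketched below, treats face embeddings from the same automatically labeled stream as positives in an InfoNCE-style contrastive objective; this is an interpretation for illustration, not the paper's formulation.

```python
# Hypothetical contrastive reading of a "stream loss": pull together
# embeddings of faces from the same stream (video track), push apart
# embeddings from different streams. Not the paper's exact objective.
import torch
import torch.nn.functional as F

def stream_contrastive_loss(emb, stream_ids, temperature=0.1):
    """emb: (N, D) embeddings; stream_ids: (N,) automatic track labels."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t() / temperature
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos = (stream_ids.unsqueeze(0) == stream_ids.unsqueeze(1)) & ~eye
    # Softmax over all non-self pairs; maximize probability of positives.
    logits = sim.masked_fill(eye, float("-inf"))
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    return -log_prob[pos].mean()
```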
Aim: The purpose of the article is to present the hypothesis that the use of discrepancies in audiovisual materials can significantly increase the effectiveness of detecting various types of deepfakes and related threats. To verify this hypothesis, the authors propose a new method that reveals inconsistencies both across multiple modalities simultaneously and within individual modalities separately, enabling them to effectively distinguish between authentic and altered public-speaking videos.

Project and methods: The proposed approach integrates audio and visual signals in a fine-grained manner and then carries out binary classification based on calculated adjustments to the classification results of each modality. The method has been tested with various network architectures, in particular Capsule networks for deep anomaly detection and the Swin Transformer for image classification. Pre-processing included frame extraction and face detection using the MTCNN algorithm, as well as conversion of audio to mel spectrograms to better reflect human auditory perception. The proposed technique was tested on the multimodal deepfake datasets FakeAVCeleb and TMC, along with a custom dataset containing 4,700 recordings. The method showed high performance in identifying deepfake threats in various test scenarios.

Results: The proposed method achieved better AUC and accuracy than other reference methods, confirming its effectiveness in the analysis of multimodal artefacts. The test results confirm that it is effective in detecting modified videos in a variety of test scenarios, which can be considered an advance over existing deepfake detection techniques. The results highlight the adaptability of the method to various feature extraction network architectures.

Conclusions: The presented method of audiovisual deepfake detection uses fine inconsistencies of multimodal features to distinguish authentic from synthetic material. It is distinguished by its ability to point out inconsistencies in different types of deepfakes and, within each individual modality, can effectively distinguish authentic content from manipulated counterparts. Its adaptability has been confirmed by successful application in various feature extraction network architectures, and its effectiveness has been proven in rigorous tests on two different audiovisual deepfake datasets.

Keywords: analysis of audio-video stream, detection of deepfake threats, analysis of public speeches
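Of the preprocessing steps named above, the audio branch is straightforward to illustrate. The sketch below converts a speech track to a log-mel spectrogram with librosa; the sampling rate and filterbank parameters are illustrative, since the article does not specify them.

```python
# Illustrative audio preprocessing: waveform -> log-mel spectrogram.
# Parameter values are assumptions, not taken from the article.
import librosa
import numpy as np

def to_log_mel(path, sr=16000, n_mels=80):
    wav, _ = librosa.load(path, sr=sr)              # mono, resampled
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)     # (n_mels, frames) in dB
```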