Human detection and pose estimation are essential for understanding human activities in images and videos. Mainstream multi-human pose estimation methods take a top-down approach, where human detection is first performed and each detected person bounding box is then fed into a pose estimation network. This top-down approach suffers from early commitment to the initial detections in crowded scenes and other cases with ambiguities or occlusions, leading to pose estimation failures. We propose DetPoseNet, an end-to-end multi-human detection and pose estimation framework in a unified three-stage network. Our method consists of a coarse-pose proposal extraction sub-net, a coarse-pose based proposal filtering module, and a multi-scale pose refinement sub-net. The coarse-pose proposal sub-net extracts whole-body bounding boxes and body keypoint proposals in a single shot. The coarse-pose filtering step, based on the person and keypoint proposals, can effectively rule out unlikely detections, improving subsequent processing. The pose refinement sub-net performs cascaded pose estimation on each refined proposal region. Multi-scale supervision and multi-scale regression are used in the pose refinement sub-net to strengthen context feature learning. A structure-aware loss and keypoint masking are applied to further improve the robustness of pose refinement. Our framework is flexible enough to accept most existing top-down pose estimators in the role of the pose refinement sub-net. Experiments on the COCO and OCHuman datasets demonstrate the effectiveness of the proposed framework. The proposed method is computationally efficient, estimating multi-person poses with refined bounding boxes in under a second (a 5-6x speedup).
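The coarse-pose filtering idea can be illustrated with a minimal sketch: keep a person proposal only if its box score and the number of confident keypoint proposals are both high enough. The data layout and all threshold values below are hypothetical placeholders for illustration, not settings from the paper.

```python
# Illustrative sketch of coarse-pose based proposal filtering.
# Each proposal carries a person-box score and per-keypoint confidences.
# Thresholds are hypothetical placeholders, not values from the paper.

def filter_proposals(proposals, box_thresh=0.3, kpt_thresh=0.2, min_kpts=4):
    """Keep proposals whose box score and confident-keypoint count are high enough."""
    kept = []
    for p in proposals:
        if p["box_score"] < box_thresh:
            continue  # weak person detection: likely a false positive
        confident = sum(1 for c in p["kpt_scores"] if c >= kpt_thresh)
        if confident >= min_kpts:
            kept.append(p)  # enough visible keypoints to warrant refinement
    return kept

proposals = [
    {"box_score": 0.9, "kpt_scores": [0.8, 0.7, 0.6, 0.5, 0.1]},  # good proposal
    {"box_score": 0.8, "kpt_scores": [0.1, 0.1, 0.1, 0.1, 0.1]},  # box without keypoints
    {"box_score": 0.2, "kpt_scores": [0.9, 0.9, 0.9, 0.9, 0.9]},  # weak box
]
print(len(filter_proposals(proposals)))  # → 1
```

Only the first proposal survives: the second has no confident keypoints and the third has a weak box score, so neither is passed on to the (more expensive) pose refinement stage.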
Generative Adversarial Network (GAN) based techniques can generate and synthesize realistic faces that raise profound social concerns and security problems. Existing methods for detecting GAN-generated faces perform well on limited public datasets. However, images from existing datasets do not represent real-world scenarios well in terms of view variations and data distributions, where real faces largely outnumber synthetic ones. State-of-the-art methods do not generalize well to real-world problems, lack interpretability of their detection results, and degrade further under data imbalance. To address these shortcomings, we propose a robust, attentive, end-to-end framework that spots GAN-generated faces by analyzing eye inconsistencies. Our model automatically learns to identify inconsistent eye components by localizing and comparing artifacts between the two eyes. After the iris regions are extracted by Mask-RCNN, we design a Residual Attention Network (RAN) to examine the consistency between the corneal specular highlights of the two eyes. Our method can effectively learn from imbalanced data using a joint loss function that combines the traditional cross-entropy loss with a relaxation of the ROC-AUC loss via the Wilcoxon-Mann-Whitney (WMW) statistic. Comprehensive evaluations on a newly created FFHQ-GAN dataset in both balanced and imbalanced scenarios demonstrate the superiority of our method.
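The joint loss can be sketched as follows. A WMW-style relaxation replaces the non-differentiable indicator in the AUC statistic with a smooth pairwise penalty on positive/negative score pairs whose margin falls below a threshold; the margin, exponent, and equal weighting with cross-entropy below are illustrative assumptions, not the paper's exact settings.

```python
import math

def wmw_auc_loss(pos_scores, neg_scores, gamma=0.3, p=2):
    """Smooth pairwise surrogate for 1 - AUC: penalize positive/negative
    score pairs whose margin falls below gamma (illustrative values)."""
    loss, pairs = 0.0, 0
    for s_pos in pos_scores:
        for s_neg in neg_scores:
            diff = s_pos - s_neg
            if diff < gamma:
                loss += (gamma - diff) ** p
            pairs += 1
    return loss / max(pairs, 1)

def cross_entropy(scores, labels, eps=1e-7):
    """Mean binary cross-entropy on probability scores in (0, 1)."""
    return -sum(
        y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
        for s, y in zip(scores, labels)
    ) / len(scores)

def joint_loss(scores, labels, alpha=0.5):
    """Hypothetical equal-weight combination of the two terms."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return alpha * cross_entropy(scores, labels) + (1 - alpha) * wmw_auc_loss(pos, neg)

# A well-separated batch incurs a smaller loss than a confused one.
good = joint_loss([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
bad = joint_loss([0.4, 0.3, 0.6, 0.7], [1, 1, 0, 0])
print(good < bad)  # → True
```

Because the pairwise term compares every positive against every negative, its gradient is driven by the ranking of the minority class rather than by class frequencies, which is what makes it attractive under heavy imbalance.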
This paper proposes the Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN) for fast and accurate single-shot object detection. The Feature Pyramid (FP) is widely used in recent visual detection; however, the top-down pathway of the FP cannot preserve accurate localization due to pooling shifting, and its advantage is weakened as deeper backbones with more layers are used. In addition, it cannot maintain accurate detection of both small and large objects at the same time. To address these issues, we propose a new parallel FP structure with bi-directional (top-down and bottom-up) fusion and associated improvements to retain high-quality features for accurate localization. We provide the following design improvements: 1) a parallel bi-fusion FP structure with a bottom-up fusion module (BFM) to detect both small and large objects at once with high accuracy; 2) a concatenation and re-organization (CORE) module that provides a bottom-up pathway for feature fusion, leading to a bi-directional fusion FP that can recover lost information from lower-layer feature maps; 3) further purification of the CORE feature to retain richer contextual information; such CORE purification in both top-down and bottom-up pathways can be finished in only a few iterations; 4) a residual design added to CORE, yielding a new Re-CORE module that enables easy training and integration with a wide range of deeper or lighter backbones. The proposed network achieves state-of-the-art performance on the UAVDT17 and MS COCO datasets.
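The bi-directional fusion idea can be sketched on toy 1-D "feature maps": a top-down pass propagates deep semantic context to shallow levels, then a bottom-up pass feeds fine localization detail back up. Real PRB-FPN operates on convolutional tensors with learned fusion modules; the plain lists, nearest-neighbor resampling, and element-wise adds below are simplifications for illustration only.

```python
# Toy sketch of bi-directional (top-down + bottom-up) pyramid fusion on
# 1-D "feature maps" (plain lists). A simplification of the real network,
# which uses learned conv-based fusion (BFM/CORE) on feature tensors.

def resample(feat, size):
    """Nearest-neighbor resize of a 1-D feature list to `size` elements."""
    return [feat[min(int(i * len(feat) / size), len(feat) - 1)] for i in range(size)]

def fuse(a, b):
    """Element-wise addition stands in for a learned fusion module."""
    return [x + y for x, y in zip(a, b)]

def bidirectional_fpn(features):
    """features[0] is the shallowest (largest) level, features[-1] the deepest."""
    # Top-down pass: propagate deep semantic context to shallow levels.
    td = features[:]
    for i in range(len(td) - 2, -1, -1):
        td[i] = fuse(td[i], resample(td[i + 1], len(td[i])))
    # Bottom-up pass: recover fine localization detail lost in downsampling.
    bu = td[:]
    for i in range(1, len(bu)):
        bu[i] = fuse(bu[i], resample(bu[i - 1], len(bu[i])))
    return bu

pyramid = [[1, 1, 1, 1], [2, 2], [4]]  # a 3-level toy pyramid
out = bidirectional_fpn(pyramid)
print([len(level) for level in out])  # → [4, 2, 1]
```

After both passes every level has seen information from every other level, which is the property the bi-directional fusion FP relies on to serve small and large objects at once.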
In this work, we present the RNN Tree (RNN-T), an adaptive learning framework for skeleton-based human action recognition. Our method categorizes action classes and uses multiple Recurrent Neural Networks (RNNs) in a tree-like hierarchy. The RNNs in RNN-T are co-trained with the action category hierarchy, which determines the structure of RNN-T. Actions in skeletal representations are recognized via a hierarchical inference process, during which individual RNNs differentiate finer-grained action classes with increasing confidence. Inference in RNN-T ends when any RNN in the tree recognizes the action with high confidence, or when a leaf node is reached. RNN-T effectively addresses two main challenges of large-scale action recognition: (i) distinguishing fine-grained action classes that are intractable for a single network, and (ii) adapting to new action classes by augmenting an existing model. We demonstrate the effectiveness of the RNN-T/ACH method and compare it with state-of-the-art methods on a large-scale dataset and several existing benchmarks.
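The hierarchical inference process can be sketched as a tree walk with early exit: each internal node's classifier routes the input toward a child category, and traversal stops as soon as a prediction is confident or a leaf is reached. The toy tree, the stand-in classifiers, and the fixed confidence threshold below are illustrative assumptions replacing the trained per-node RNNs.

```python
# Toy sketch of RNN-T style hierarchical inference with early exit.
# Each internal node holds a stand-in "classifier": a function mapping the
# input to (child_name, confidence). The real method runs an RNN per node.

def infer(node, x, conf_thresh=0.9):
    """Descend the category tree; stop at a confident prediction or a leaf."""
    while "children" in node:
        child_name, conf = node["classify"](x)
        node = node["children"][child_name]
        if conf >= conf_thresh:
            break  # this level's classifier is confident: commit early
    return node["label"]

# Hypothetical two-level category hierarchy with hard-coded "classifiers".
tree = {
    "label": "action",
    "classify": lambda x: ("locomotion", 0.95) if x == "walk" else ("gesture", 0.6),
    "children": {
        "locomotion": {"label": "walking"},
        "gesture": {
            "label": "gesture",
            "classify": lambda x: ("wave", 0.8),
            "children": {"wave": {"label": "waving"}},
        },
    },
}
print(infer(tree, "walk"))   # → walking  (early exit at the root)
print(infer(tree, "other"))  # → waving   (descends to a leaf)
```

The same structure also illustrates the adaptivity claim: a new action class is added by attaching a child under the relevant node, leaving the rest of the tree untouched.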
New developments in deep generative networks have significantly improved the quality and efficiency of generating realistic-looking fake face videos. In this work, we describe a new method to expose fake face videos generated with deep neural network models. Our method is based on detecting eye blinking in the videos, a physiological signal that is not well reproduced in synthesized fake videos. Our method is evaluated over benchmark eye-blinking detection datasets and shows promising performance in detecting videos generated with the DNN-based software DeepFake.
Geographical research using historical maps has progressed considerably, as the digitization of topographic maps across years provides valuable data and the advancement of AI and machine learning models provides powerful analytic tools. Nevertheless, analysis of historical maps based on supervised learning can be limited by laborious manual map annotation. In this work, we propose a semi-supervised learning method that can transfer map annotations across years, enabling map comparison and anthropogenic studies across time. Our novel two-stage framework first performs style transfer of topographic maps across years and versions, and then applies supervised learning on the synthesized maps with annotations. We investigate the proposed semi-supervised training with the style-transferred maps and annotations on four widely used deep neural networks (DNNs), namely U-Net, the fully convolutional network (FCN), DeepLabV3, and MobileNetV3. The best-performing network, U-Net, achieves Formula: see text and Formula: see text when trained on style-transfer-synthesized maps, indicating that the proposed framework is capable of detecting target features (bridges) on historical maps without annotations. In a comprehensive comparison, the Formula: see text of U-Net trained on the Contrastive Unpaired Translation (CUT) generated dataset (Formula: see text) is 57.3% higher than the comparative score (Formula: see text) of the least valid configuration (MobileNetV3 trained on the CycleGAN-synthesized dataset). We also discuss the remaining challenges and future research directions.
Automatic learning-feedback monitoring and analysis are becoming essential in modern education. We present a video analytic system capable of monitoring in-class student learning behaviors and providing feedback to the instructor. It is common practice nowadays for students to take electronic notes or browse online using laptops and cellphones in class. However, the use of technology can also impair student concentration and affect learning behaviors, which can seriously hinder learning progress if not controlled properly. In this pioneering study, we propose a non-intrusive deep-learning based computer vision system that monitors student concentration by extracting and inferring high-level visual behavior cues, including facial expressions, gestures, and activities. Our system can automatically assist instructors with situational awareness in real time. It takes only RGB color images as input and runs on edge devices for easy deployment. We propose two video analytic components for student behavior analysis: (1) a facial analysis component, based on Dlib face detection and facial landmark tracking, localizes each student and analyzes their face orientation, eye blinking, gaze, and facial expressions; (2) an activity detection and recognition component, based on OpenPose and COCO object detection, identifies eight types of in-class gestures and behaviors, including raising a hand, typing, phone answering, crooked head, desk napping, etc. Experiments are performed on a newly collected real-world In-Class Student Activity Dataset (ICSAD), on which we achieve a nearly 80% activity detection rate. Our system is view-independent in handling facial and pose orientations, with an average angular error < 10°. The source code of this work is at: https://github.com/YiZengHsieh/ICSAD .
Trastuzumab emtansine (T-DM1) is an antibody-drug conjugate (ADC) that was recently approved for the treatment of HER-2-positive metastatic breast cancer. The drug sensitivity of ADCs depends mainly on the internalization efficiency of the drug. Caveolin-1 has been shown to promote T-DM1 internalization and enhance drug sensitivity. Whether caveolin-1 can be overexpressed to improve T-DM1 efficacy is an interesting question with potential for clinical application. In this study, the diabetes drug metformin was investigated for its ability to induce caveolin-1 expression and thereby increase the efficacy of subsequent T-DM1 application. BT-474 cells were pretreated with metformin, followed by combined therapy with metformin and T-DM1. T-DM1 internalization and drug efficacy were determined, and the expression of signal transduction proteins was also monitored. Caveolin-1 shRNA was applied to suppress endogenous caveolin-1 expression, and the ability of metformin to promote T-DM1 efficacy was investigated. Results showed that in BT-474 cells pretreated with metformin, cellular caveolin-1 overexpression was induced, which then promoted drug efficacy by enhancing T-DM1 internalization. When cellular caveolin-1 was suppressed by shRNA, the metformin-enhanced T-DM1 cytotoxicity decreased. This study demonstrates that metformin can be applied prior to T-DM1 treatment to improve the clinical efficacy of T-DM1 by enhancing caveolin-1-mediated endocytosis.
We present a multimodal sensor system for wound assessment and pressure ulcer care. Multiple imaging modalities, including RGB, three-dimensional (3-D) depth, thermal, multispectral, and chemical sensing, are integrated into a portable hand-held probe for real-time wound assessment. Analytic and quantitative algorithms for various assessments, including tissue composition, wound measurement in 3-D, temperature profiling, spectral analysis, and chemical vapor analysis, are developed. After each assessment scan, 3-D models of the wound are generated on the fly for geometric measurement, while multimodal observations are analyzed to estimate healing progress. Developers and clinical practitioners collaborated at the Charlie Norwood VA Medical Center on in-field data collection and experimental evaluation. A total of 133 assessment sessions from 23 enrolled subjects were collected, on which the multimodal data were analyzed and validated against the clinical notes associated with each subject. The system can be operated by nontechnical caregivers on a regular basis to aid wound assessment and care. A web portal front-end was developed for clinical decision and telehealth support, where all historical patient data, including wound measurements and analysis, can be organized online.
Effective multi-object tracking (MOT) methods have been developed in recent years for a wide range of applications, including visual surveillance and behavior understanding. Existing performance evaluations of MOT methods usually separate the tracking step from the detection step by using a single predefined object detection setting for comparisons. In this work, we propose the new University at Albany DEtection and TRACking (UA-DETRAC) dataset for comprehensive performance evaluation of MOT systems, with a particular focus on detectors. The UA-DETRAC benchmark consists of 100 challenging videos captured from real-world traffic scenes (over 140,000 frames with rich annotations, including illumination, vehicle type, occlusion, truncation ratio, and vehicle bounding boxes) for multi-object detection and tracking. We evaluate complete MOT systems constructed from combinations of state-of-the-art object detection and tracking methods. Our analysis shows the complex effects of detection accuracy on MOT system performance. Based on these observations, we propose effective and informative evaluation metrics for MOT systems that consider the effect of object detection for comprehensive performance analysis.
• New large-scale dataset for both detection and multi-object tracking evaluation.
• New protocol and evaluation metrics for multi-object tracking.
• Comprehensive evaluation of complete multi-object tracking systems.
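The detection-aware evaluation idea can be illustrated by scoring a complete detector+tracker system across detector operating points, rather than at one fixed detection threshold. The averaging scheme and the numbers below are a simplified, hypothetical stand-in for the benchmark's actual metrics.

```python
def detection_aware_score(track_score_at, thresholds):
    """Average a tracking score over detector confidence thresholds,
    instead of evaluating at a single predefined operating point.
    `track_score_at` maps a detection threshold to a tracking score
    (e.g. MOTA) of the full detector+tracker system at that setting."""
    scores = [track_score_at(t) for t in thresholds]
    return sum(scores) / len(scores)

# Hypothetical system whose tracking quality peaks at a mid threshold and
# degrades as the detector threshold rises and recall drops.
system = {0.1: 0.40, 0.3: 0.55, 0.5: 0.50, 0.7: 0.30, 0.9: 0.10}
avg = detection_aware_score(lambda t: system[t], sorted(system))
print(round(avg, 2))  # → 0.37
```

Reporting only the peak (0.55) would hide how sharply this system's tracking collapses at other detector settings; integrating over operating points exposes exactly the detector-tracker coupling the benchmark is designed to measure.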