X-ray imaging machines are widely used at border control checkpoints and in public transportation for luggage scanning and inspection. Recent advances in deep learning have enabled automatic object detection on X-ray imaging results, largely reducing labor costs. Compared to tasks on natural images, object detection for X-ray inspection is typically more challenging due to the varied sizes and aspect ratios of X-ray images, the random locations of small target objects within the redundant background region, etc. In practice, we show that directly applying off-the-shelf deep learning-based detection algorithms to X-ray imagery can be highly time-consuming and ineffective. To this end, we propose a Task-Driven Cropping scheme, dubbed TDC, for improving deep image detection algorithms towards efficient and effective luggage inspection via X-ray images. Instead of processing whole X-ray images for object detection, we propose a two-stage strategy, which first adaptively crops X-ray images and preserves only the task-related regions, i.e., the luggage regions for security inspection. A task-specific deep feature extractor is used to rapidly identify the importance of each X-ray image pixel. Only the regions that are useful and related to the detection task are kept and passed to the follow-up deep detector. The varied-scale X-ray images are thus reduced to the same size and aspect ratio, which enables a more efficient deep detection pipeline. Besides, to benchmark the effectiveness of X-ray image detection algorithms, we propose a novel dataset for X-ray image detection, dubbed SIXray-D, based on the popular SIXray dataset. In SIXray-D, we provide complete and more accurate annotations of both object classes and bounding boxes, which enables model training for supervised X-ray detection methods. Our results show that the proposed TDC algorithm can effectively boost popular detection algorithms, achieving better detection mAPs or reducing the run time.
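The crop-then-resize idea in the first stage can be illustrated with a minimal sketch: given a per-pixel importance map (produced by some task-specific feature extractor, not implemented here), keep only the bounding region of sufficiently important pixels and resize it to a fixed square. The threshold, output size, and nearest-neighbour resampling are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def task_driven_crop(image, importance, threshold=0.5, out_size=256):
    """Crop an image to the bounding box of pixels whose task-driven
    importance exceeds a threshold, then resize the crop to a fixed
    square size (nearest-neighbour resampling)."""
    mask = importance >= threshold
    if mask.any():
        ys, xs = np.where(mask)
    else:  # nothing salient: fall back to the full image
        ys = np.arange(image.shape[0])
        xs = np.arange(image.shape[1])
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # nearest-neighbour resize to a uniform size and aspect ratio
    ry = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    rx = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[np.ix_(ry, rx)]
```

Because every image is reduced to the same size and aspect ratio, the downstream detector can run on uniform batches regardless of the original scan geometry.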
Inspired by how human beings determine whether a presented face example is genuine or not, i.e., by glancing at the example globally first and then carefully observing local regions to gain more discriminative information, we propose a novel framework for the face anti-spoofing problem based on the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). In particular, we model the behavior of exploring face-spoofing-related information from image sub-patches by leveraging deep reinforcement learning. We further introduce a recurrent mechanism to learn representations of local information sequentially from the explored sub-patches with an RNN. Finally, for classification, we fuse the local information with the global information, which is learned from the original input image through a CNN. Moreover, we conduct extensive experiments, including an ablation study and visualization analysis, to evaluate our proposed framework on various public databases. The experimental results show that our method generally achieves state-of-the-art performance across all scenarios, demonstrating its effectiveness.
Document recapturing is a presentation attack that covers the forensic traces in the digital domain. Document presentation attack detection (DPAD) is an important step in the document authentication pipeline. Existing DPAD methods suffer from low generalization performance under cross-domain scenarios with different types of documents. Data augmentation is a de facto technique for reducing the risk of overfitting the training data and improving the generalizability of a trained model. In this work, we improve the generalization performance of DPAD approaches by addressing two important limitations of existing frequency-domain augmentation (FDA) methods. First, contrary to existing FDA methods that treat different spectral bands equally, we establish a band-of-interest localization (BOIL) method that locates the spectral band-of-interest (BOI) related to the recapturing operation using domain knowledge from theoretical distortion models. Second, we propose a frequency-domain halftoning augmentation (FHAG) strategy that enhances the halftoning features in the BOI while accounting for different halftoning distortions. To evaluate the generalization performance of our FHAG with BOIL method on different types of document images, we have constructed a diverse recaptured document image dataset with 162 types of documents (RDID162), consisting of 5346 samples. The proposed method has been evaluated on generic deep learning models and a state-of-the-art DPAD approach under both cross-device and cross-domain protocols for the DPAD task. Compared to existing FDA methods, our method improves models with a ResNet50 backbone by reducing the EERs by more than 25% (relative) or 5 percentage points (absolute). The source code and data in this work are available at https://github.com/chenlewis/FHAG-with-BOIL .
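The core operation behind frequency-domain augmentation can be sketched as amplifying a radial band of the image spectrum. This is a minimal illustration only: the band limits and gain below are arbitrary placeholders, whereas the paper's BOIL derives the band-of-interest from theoretical distortion models.

```python
import numpy as np

def boost_band_of_interest(image, r_lo, r_hi, gain=1.5):
    """Amplify a radial frequency band [r_lo, r_hi) of a grayscale
    image (normalized frequency: 0 = DC, 0.5 = Nyquist)."""
    h, w = image.shape
    F = np.fft.fftshift(np.fft.fft2(image))
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    r = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)  # radial frequency
    band = (r >= r_lo) & (r < r_hi)
    F = np.where(band, F * gain, F)
    # back to the spatial domain; real part since the input is real
    return np.fft.ifft2(np.fft.ifftshift(F)).real
```

With gain > 1 this enhances mid-to-high-frequency content such as halftoning patterns, while gain = 1 leaves the image unchanged.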
Face Anti-Spoofing (FAS) is essential to secure face recognition systems and has been extensively studied in recent years. Although deep neural networks (DNNs) for the FAS task have achieved promising results in intra-dataset experiments with similar distributions of training and testing data, the DNNs' generalization ability is limited under cross-domain scenarios with different distributions of training and testing data. To improve the generalization ability, recent hybrid methods have explored extracting task-aware handcrafted features (e.g., Local Binary Patterns) as discriminative input information for DNNs. However, handcrafted feature extraction relies on experts' domain knowledge, and how to choose appropriate handcrafted features remains underexplored. To this end, we propose a learnable network to extract a Meta Pattern (MP) in our learning-to-learn framework. By replacing handcrafted features with the MP, the discriminative information captured by the MP enables learning a more generalized model. Moreover, we devise a two-stream network that hierarchically fuses the input RGB image and the extracted MP using our proposed Hierarchical Fusion Module (HFM). We conduct comprehensive experiments and show that our MP outperforms the compared handcrafted features. Also, our proposed method with the HFM and the MP achieves state-of-the-art performance on two different domain generalization evaluation benchmarks.
Recently, vision transformer (ViT) based multimodal learning methods have been proposed to improve the robustness of face anti-spoofing (FAS) systems. However, no prior work explores the fundamental natures (e.g., modality-aware inputs, suitable multimodal pre-training, and efficient finetuning) of the vanilla ViT for multimodal FAS. In this paper, we investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth. First, in terms of the ViT inputs, we find that leveraging local feature descriptors (such as histograms of oriented gradients) benefits the ViT on the IR modality but not the RGB or Depth modalities. Second, considering the task (FAS vs. generic object classification) and modality (multimodal vs. unimodal) gaps, ImageNet pre-trained models might be sub-optimal for the multimodal FAS task. Finally, observing the inefficiency of directly finetuning the whole or partial ViT, we design an adaptive multimodal adapter (AMA), which can efficiently aggregate local multimodal features while freezing the majority of ViT parameters. To bridge these gaps, we propose the modality-asymmetric masked autoencoder (M²A²E) for multimodal FAS self-supervised pre-training without costly annotated labels. Compared with the previous modality-symmetric autoencoder, the proposed M²A²E is able to learn more intrinsic task-aware representations and is compatible with modality-agnostic (e.g., unimodal, bimodal, and trimodal) downstream settings. Extensive experiments with both unimodal (RGB, Depth, IR) and multimodal (RGB+Depth, RGB+IR, Depth+IR, RGB+Depth+IR) settings conducted on multimodal FAS benchmarks demonstrate the superior performance of the proposed methods. One highlight is that the proposed method is robust under various missing-modality cases where previous multimodal FAS models suffer serious performance drops.
We hope these findings and solutions can facilitate future research on ViT-based multimodal FAS.
The traditional two-dimensional (2D) barcode has been employed in anti-counterfeiting systems as a storage medium for serial numbers. However, an attack can be initiated by simply copying the 2D barcode and attaching it to a counterfeit product. In this paper, we propose an authentication scheme for a 2D barcode captured with a mobile imaging device. This work presents a competitive solution among the 2D barcode authentication schemes that have been verified under mobile imaging conditions. The proposed copy-proof scheme is composed of two sets of features extracted by exploiting the characteristics of barcoding channel models. The proposed features identify the intrinsic differences between genuine and counterfeit barcode images in the frequency and spatial domains. An efficient two-stage barcode authentication framework is then proposed by combining the two sets of features in a cascading manner. To evaluate the practicality of the proposed authentication scheme, four databases with different devices (printers, scanners, mobile cameras), barcode sizes, and barcode designs are considered in the experiments. Compared with existing texture descriptors and some deep learning-based approaches, the proposed scheme shows higher authentication accuracy under various conditions, such as cross-database, cross-size, and cross-pattern experiments, which study the generality of a pre-trained model under challenging conditions commonly found in real-world scenarios. Last but not least, the proposed scheme has been evaluated under some state-of-the-art attack scenarios where the attacker employs several realizations of genuine patterns or a deep learning-based technique to produce a counterfeit copy. The source code and data for reproducing the results in our experiments are available at https://bit.ly/2FOlJH7 .
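The cascading idea of the two-stage framework can be sketched as follows: decide with the cheap first-stage (e.g., frequency-domain) score when it is confident, and only run the second stage otherwise. The score convention and thresholds below are illustrative assumptions, not values from the paper.

```python
def cascade_authenticate(freq_score, spatial_score_fn,
                         accept_thr=0.9, reject_thr=0.1):
    """Two-stage cascade: the first-stage score decides when it is
    confident; otherwise the (more expensive) second-stage score is
    computed. Scores are assumed to lie in [0, 1], higher = more
    likely genuine. Returns True for genuine, False for counterfeit."""
    if freq_score >= accept_thr:      # confidently genuine: stop early
        return True
    if freq_score <= reject_thr:      # confidently counterfeit: stop early
        return False
    # ambiguous: fall back to the second (spatial-domain) stage
    return spatial_score_fn() >= 0.5
```

The benefit of the cascade is that most samples terminate at the cheap first stage, so the average authentication cost stays low.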
Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces. State-of-the-art FAS techniques predominantly rely on deep learning models, but their cross-domain generalization capabilities are often hindered by the domain shift problem, which arises from differing distributions between training and testing data. In this study, we develop a generalized FAS method under the Efficient Parameter Transfer Learning (EPTL) paradigm, where we adapt pre-trained Vision Transformer (ViT) models for the FAS task. During training, adapter modules are inserted into the pre-trained ViT model, and the adapters are updated while the other pre-trained parameters remain fixed. We identify a limitation of previous vanilla adapters: they are based on linear layers, which lack a spoofing-aware inductive bias and thus restrict cross-domain generalization. To address this limitation and achieve cross-domain generalized FAS, we propose a novel Statistical Adapter (S-Adapter) that gathers local discriminative and statistical information from localized token histograms. To further improve the generalization of the statistical tokens, we propose a novel Token Style Regularization (TSR), which aims to reduce domain style variance by regularizing Gram matrices extracted from tokens across different domains. Our experimental results demonstrate that the proposed S-Adapter and TSR provide significant benefits in both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmarks. We will release the source code upon acceptance.
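A Gram-matrix style penalty of the kind TSR builds on can be sketched in a few lines: compute the channel-correlation (Gram) matrix of each domain's token features and penalize their squared difference. This is a generic style-regularization sketch under assumed (N tokens, C channels) layouts, not the paper's exact TSR formulation.

```python
import numpy as np

def gram(tokens):
    """Gram matrix of a token map with shape (N, C): channel-by-channel
    correlations, normalized by the number of tokens N."""
    return tokens.T @ tokens / tokens.shape[0]

def token_style_loss(tokens_a, tokens_b):
    """Style gap between two domains, measured as the mean squared
    difference of their token Gram matrices."""
    g_a, g_b = gram(tokens_a), gram(tokens_b)
    return float(np.mean((g_a - g_b) ** 2))
```

Minimizing such a loss across domains pushes the token statistics (the "style") of different domains together, which is the intuition behind reducing domain style variance.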
Face presentation attack detection (PAD) has been extensively studied by research communities to enhance the security of face recognition systems. Although existing methods have achieved good performance on testing data with a distribution similar to the training data, their performance degrades severely in application scenarios with data of unseen distributions. In situations where the training and testing data are drawn from different domains, a typical approach is to apply domain adaptation techniques to improve face PAD performance with the help of target domain data. However, it has always been a non-trivial challenge to collect sufficient data samples in the target domain, especially attack samples. This paper introduces a teacher-student framework to improve the cross-domain performance of face PAD with one-class domain adaptation. In addition to the source domain data, the framework utilizes only a few genuine face samples of the target domain. Under this framework, a teacher network is trained with source domain samples to provide discriminative feature representations for face PAD. Student networks are trained to mimic the teacher network and learn similar representations for genuine face samples of the target domain. In the test phase, the similarity score between the representations of the teacher and student networks is used to distinguish attacks from genuine faces. To evaluate the proposed framework under one-class domain adaptation settings, we devised two new protocols and conducted extensive experiments. The experimental results show that our method outperforms baselines under one-class domain adaptation settings and even state-of-the-art methods with unsupervised domain adaptation.
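The test-phase scoring can be sketched with a simple similarity measure: since the student was trained to mimic the teacher only on genuine target-domain faces, a high teacher-student agreement suggests a genuine face. Cosine similarity and the threshold value here are illustrative assumptions; the paper's exact similarity metric may differ.

```python
import numpy as np

def liveness_score(teacher_feat, student_feat):
    """Cosine similarity between teacher and student feature vectors.
    High agreement -> likely genuine; low agreement -> likely attack."""
    t = teacher_feat / np.linalg.norm(teacher_feat)
    s = student_feat / np.linalg.norm(student_feat)
    return float(t @ s)

def is_genuine(teacher_feat, student_feat, threshold=0.8):
    """Threshold the similarity score to make a genuine/attack decision.
    The threshold is a hypothetical operating point."""
    return liveness_score(teacher_feat, student_feat) >= threshold
```

Because only genuine target-domain samples are needed to train the student, this sidesteps the difficulty of collecting attack samples in the target domain.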
As deep learning-based inpainting methods have achieved increasingly better results, their malicious use, e.g., removing objects to fabricate fake news or to provide fake evidence, is becoming a serious threat. Previous works have provided rich discussions on network architectures, e.g., even performing Neural Architecture Search to obtain the optimal model architecture. However, there is still room for improvement in other aspects. In our work, we contribute comprehensive improvements from the data and feature aspects. From the data aspect, as harder samples in the training data usually lead to stronger detection models, we propose a near-original image augmentation that pushes the inpainted input images closer to the original ones (without distortion and inpainting), which is shown to improve the detection accuracy. From the feature aspect, we propose to extract an attentive pattern. With the designed attentive pattern, the knowledge of different inpainting methods can be better exploited during the training phase. Finally, extensive experiments are conducted. In our evaluation, we consider scenarios where the inpainting masks used to generate the testing set have a distribution gap from those used to produce the training set. Thus, the comparisons are conducted on a newly proposed dataset, where the testing masks are inconsistent with the training ones. The experimental results show the superiority of the proposed method and the effectiveness of each component. All our code and data will be made available online.
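One simple way to realize the "push the inpainted image toward the original" idea is a convex blend between the two, where a larger blending weight yields a harder training sample. This is only a sketch of the concept with an assumed blending scheme and an illustrative `alpha` knob; the paper's actual augmentation may differ.

```python
import numpy as np

def near_original_augment(inpainted, original, alpha=0.9):
    """Blend an inpainted training image toward its original counterpart.
    alpha in [0, 1): larger alpha makes the sample closer to the original
    and thus harder for the detector. alpha is a hypothetical knob."""
    assert inpainted.shape == original.shape
    return alpha * original + (1.0 - alpha) * inpainted
```

Training on such near-original samples forces the detector to rely on subtle inpainting traces rather than obvious visual artifacts.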