Visual Question Answering (VQA) is the vision-language task of answering text-based questions about the content of an image and has been advanced by the remarkable success of multimodal deep networks. Similar to unimodal networks, multimodal VQA models are also vulnerable to adversarial examples, which poses severe threats to the corresponding applications. Although several adversarial training methods have been proposed, most of them focus on improving the generalization ability of VQA models on clean samples rather than on mitigating adversarial attacks. In this paper, we systematically analyze the core structure of multimodal VQA networks and propose a novel adversarial training algorithm to mitigate adversarial attacks on VQA models. Specifically, our key component is a regularization term built on our carefully designed Contrastive Fusion Representation (CFR), which reduces the sensitivity of VQA models to adversarial perturbations of both the vision and language inputs. We further enhance the adversarial training with augmented CFRs. Comprehensive experimental results show that our method mitigates adversarial attacks while preserving the generalization ability on clean samples under various system settings, and outperforms other defense methods.
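The contrastive regularization idea above can be illustrated with a minimal sketch: pull the fused representation of a clean (image, question) pair toward the fused representation of its adversarially perturbed counterpart, while pushing it away from other samples in the batch. The InfoNCE-style loss below is an illustrative assumption, not the paper's exact CFR formulation; all function and variable names are hypothetical.

```python
import numpy as np

def contrastive_fusion_loss(clean_feats, perturbed_feats, temperature=0.1):
    """InfoNCE-style contrastive loss between fused representations.

    clean_feats, perturbed_feats: (N, D) arrays of fused multimodal
    features for N samples; row i of each array comes from the same
    (image, question) pair, once clean and once adversarially perturbed.
    """
    # L2-normalize so similarity is cosine similarity
    c = clean_feats / np.linalg.norm(clean_feats, axis=1, keepdims=True)
    p = perturbed_feats / np.linalg.norm(perturbed_feats, axis=1, keepdims=True)
    logits = c @ p.T / temperature                   # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # positives sit on the diagonal: clean_i pairs with perturbed_i
    return -np.log(np.diag(probs)).mean()
```

Minimizing this term encourages the fusion module to map clean and perturbed inputs to nearby points, which is the mechanism by which sensitivity to perturbations is reduced.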
Deep hiding, concealing secret information using Deep Neural Networks (DNNs), can significantly increase the embedding rate and improve the efficiency of secret sharing. Existing works mainly focus on designing DNNs with higher embedding rates or fancy functionalities. In this paper, we want to answer some fundamental questions: how to increase, and what determines, the embedding rate of deep hiding. To this end, we first propose a novel Local Deep Hiding (LDH) scheme that significantly increases the embedding rate by hiding large secret images into small local regions of cover images. Our scheme consists of three DNNs: hiding, locating, and revealing. We use the hiding network to convert a secret image into a small, imperceptible, compact secret code that is embedded into a random local region of a cover image. The locating network assists the revealing process by identifying the position of secret codes in the stego image, while the revealing network recovers all full-size secret images from these identified local regions. Our LDH achieves an extremely high embedding rate, i.e., \(16\times24\) bpp, and exhibits superior robustness to common image distortions. We also conduct comprehensive experiments to evaluate our scheme under various system settings. We further quantitatively analyze the trade-off between the embedding rate and image quality with different image restoration algorithms.
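The embed-into-a-random-local-region pipeline can be sketched without any learned networks: place a compact code at a random position in the cover and recover it from that position. This toy uses additive embedding and assumes the receiver knows the cover and the position (the paper's locating and revealing networks remove both assumptions); all names are hypothetical.

```python
import numpy as np

def embed_local(cover, code, rng):
    """Embed a compact secret code into a random local region of the cover.

    Illustrative stand-in for the paper's learned hiding network:
    the code is added at low amplitude inside a randomly chosen patch.
    cover: (H, W) int array; code: (h, w) int array with h <= H, w <= W.
    """
    h, w = code.shape
    H, W = cover.shape
    y = int(rng.integers(0, H - h + 1))
    x = int(rng.integers(0, W - w + 1))
    stego = cover.copy()
    stego[y:y + h, x:x + w] = np.clip(stego[y:y + h, x:x + w] + code, 0, 255)
    return stego, (y, x)

def reveal_local(stego, cover, pos, shape):
    """Recover the code by differencing the known local region.

    Cover-aided toy recovery; the paper's revealing network instead
    works blindly on the stego image alone.
    """
    y, x = pos
    h, w = shape
    return stego[y:y + h, x:x + w] - cover[y:y + h, x:x + w]
```

The key design point the sketch mirrors is that only a small patch of the cover carries payload, so the per-region code can be dense while the rest of the cover stays untouched.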