Peer reviewed
  • PVASS-MDD: Predictive Visua...
    Yu, Yang; Liu, Xiaolong; Ni, Rongrong; Yang, Siyuan; Zhao, Yao; Kot, Alex C.

    IEEE transactions on circuits and systems for video technology, 2024-Aug., Volume: 34, Issue: 8
    Journal Article

    Deepfake techniques can forge the visual or audio signals in a video, which leads to inconsistencies between the visual and audio (VA) signals. Multimodal detection methods therefore expose deepfake videos by extracting VA inconsistencies. Recently, deepfake technology has moved to collaborative VA forgery to obtain more realistic deepfake videos, which poses new challenges for extracting VA inconsistencies. Recent multimodal detection methods first extract natural VA correspondences from real videos in a self-supervised manner, and then use the learned real correspondences as targets to guide the extraction of VA inconsistencies in the subsequent deepfake detection stage. However, the inherent VA relations are difficult to extract because of the modality gap, which limits the auxiliary benefit of these self-supervised methods. In this paper, we propose Predictive Visual-audio Alignment Self-supervision for Multimodal Deepfake Detection (PVASS-MDD), which consists of a PVASS auxiliary stage and an MDD stage. In the PVASS auxiliary stage, which operates on real videos, we first devise a three-stream network that associates two augmented visual views with the corresponding audio clues, in order to explore common VA correspondences through cross-view learning. Second, we introduce a novel cross-modal predictive align module that eliminates VA gaps to provide inherent VA correspondences. In the MDD stage, we propose an auxiliary loss that uses the frozen PVASS network to align the VA features of real videos, which better assists the multimodal deepfake detector in capturing subtle VA inconsistencies. We conduct extensive experiments on both widely used and recent multimodal deepfake datasets. Our method obtains a significant performance improvement over state-of-the-art methods.
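
The abstract does not give the form of the MDD-stage auxiliary loss, so the following is only a minimal sketch of the general idea: a classification loss over all videos plus an alignment term computed on real videos only. Everything in it is an assumption made for illustration, not the authors' implementation: the cosine-distance alignment term, the binary cross-entropy classifier loss, the `aux_weight` value, the function names, and the premise that the `visual` and `audio` vectors are features already passed through the frozen stage-one network.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small constant avoids division by zero


def alignment_loss(visual_feat, audio_feat):
    """Assumed alignment term: cosine distance between visual and audio features."""
    return 1.0 - cosine_similarity(visual_feat, audio_feat)


def binary_cross_entropy(p_fake, label):
    """Classification loss for one video; label is 1 for fake, 0 for real."""
    eps = 1e-12
    return -(label * math.log(p_fake + eps)
             + (1 - label) * math.log(1.0 - p_fake + eps))


def total_loss(batch, aux_weight=0.5):
    """Mean classification loss plus a weighted alignment term on real videos.

    Each sample is a dict with hypothetical keys:
      p_fake : detector's predicted probability that the video is fake
      label  : 1 for fake, 0 for real
      visual : visual feature vector (assumed output of the frozen network)
      audio  : audio feature vector (assumed output of the frozen network)
    """
    cls = sum(binary_cross_entropy(s["p_fake"], s["label"]) for s in batch) / len(batch)

    # The alignment term is applied to real videos only, as the abstract describes.
    real = [s for s in batch if s["label"] == 0]
    if real:
        aux = sum(alignment_loss(s["visual"], s["audio"]) for s in real) / len(real)
    else:
        aux = 0.0
    return cls + aux_weight * aux
```

Under these assumptions, a real video whose visual and audio features already point in the same direction contributes nothing to the alignment term, while a real video with orthogonal features contributes `aux_weight * 1.0`. Fake videos affect only the classification term.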