Akademska digitalna zbirka SLovenije - logo
E-viri
Celotno besedilo
Recenzirano Odprti dostop
  • M2FNet: Multi-modal fusion ...
    Jiang, Chenchen; Ren, Huazhong; Yang, Hong; Huo, Hongtao; Zhu, Pengfei; Yao, Zhaoyuan; Li, Jing; Sun, Min; Yang, Shihao

    International journal of applied earth observation and geoinformation, June 2024, 2024-06-00, 2024-06-01, Letnik: 130
    Journal Article

    •A multi-modal network (M2FNet) is proposed for object detection over all lighting conditions.•The M2FNet improves the accuracy of object detection by 25.6 % for the low-light condition.•Eight thresholds of illumination range are identified to determine the optimal modal. Fusing multi-modal information from visible (VIS) and thermal infrared (TIR) images is crucial for object detection in fully adapting to varied lighting conditions. However, the existing models usually treat VIS and TIR images as independent information and extract corresponding features from separate networks due to the scarcity of training data with labeled instances from both VIS and TIR registration images. To fill this gap, a novel Multi-Modal Fusion NETwork (M2FNet) based on the Transformer architecture is proposed in this paper, which contains two effective modules: the Union-Modal Attention (UMA) and the Cross-Modal Attention (CMA). The UMA module aggregates multi-spectral features from VIS and TIR images and then extracts multi-modal features via a convolutional neural network (CNN) backbone. The CMA module is designed to learn cross-attention features from VIS and TIR pairwise features by Transformer architecture. Evaluation results by the mean average precision (mAP) metric show that the M2FNet method significantly advances the baseline methods trained using only VIS or TIR images by 10.71 % and 2.97 %, respectively. The increments in mAP are observed in the M2FNet method compared with the existing multi-modal methods on two public datasets. Sensitivity analysis of eight illumination thresholds shows that the M2FNet method presents robustness performance on varied illumination conditions and achieves the maximum increase in accuracy of 25.6 %. Moreover, this method is subsequently applied to a new testing dataset, VI2DA (Visible-Infrared paired Video and Image DAtaset), observed by diverse sensors and platforms for testing the generalization ability of object detectors, which will be publicly available at https://github.com/TIR-OD/Datasets.