  • Local–Global Transformer Neural Network for Temporal Action Segmentation
    Tian, Xiaoyan; Jin, Ye; Tang, Xianglong

    Multimedia Systems, 04/2023, Volume 29, Issue 2
    Journal Article

    The temporal action segmentation task is a branch of video understanding that aims to predict what is happening in the action segments (series of consecutive frames with identical labels) of an untrimmed video. Recent works have harnessed the Transformer, which can model temporal relations in long sequences. However, Transformer-based networks face several limitations when processing video sequences: (1) dramatic changes between neighboring action segments, (2) the paradox between the loss of fine-grained information in deeper layers and inefficient learning with small receptive fields, and (3) the lack of a refinement process to improve performance. This paper proposes a novel network, the Local–Global Transformer Neural Network (LGTNN), to address these difficulties. LGTNN comprises three main modules. The first two are the Local and Global Transformer modules, which efficiently capture multiscale features and resolve the paradox of perceiving higher- and lower-level representations at different convolutional layer depths. The third, the Boundary Detection Network (BDN), performs postprocessing, fine-tuning ambiguous action boundaries and generating the final prediction. The proposed model can be embedded in existing temporal action segmentation models, such as MS-TCN, ASFormer, and ETSN. Experiments on three challenging datasets (50Salads, Georgia Tech Egocentric Activities (GTEA), and Breakfast) show that LGTNN, both on its own and embedded in existing segmentation models, outperforms state-of-the-art methods by a large margin.
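
    For a concrete picture of the local–global idea the abstract describes, below is a minimal PyTorch sketch based only on this record: windowed self-attention keeps fine-grained, small-receptive-field detail, while full-sequence self-attention models long-range temporal relations. All names here (LocalGlobalBlock, window_size, the summation fusion) are illustrative assumptions; the paper's actual multiscale design and its Boundary Detection Network are not reproduced.

    import torch
    import torch.nn as nn


    class LocalGlobalBlock(nn.Module):
        """Windowed (local) plus full-sequence (global) self-attention over
        frame-wise features, fused by summation. A sketch, not the paper's code."""

        def __init__(self, dim: int, heads: int = 4, window_size: int = 16):
            super().__init__()
            self.window_size = window_size
            self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(
                nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, frames, dim) frame features from a video backbone.
            b, t, d = x.shape
            w = self.window_size
            pad = (w - t % w) % w  # pad so the frame axis splits into windows
            xp = nn.functional.pad(x, (0, 0, 0, pad))
            # Local branch: fold each window into the batch axis so a frame
            # attends only to frames in its own window (fine-grained detail).
            xl = xp.reshape(-1, w, d)
            local, _ = self.local_attn(xl, xl, xl)
            local = local.reshape(b, -1, d)[:, :t]
            # Global branch: every frame attends to the whole sequence
            # (long-range relations across action segments).
            glob, _ = self.global_attn(x, x, x)
            h = self.norm1(x + local + glob)
            return self.norm2(h + self.ffn(h))


    if __name__ == "__main__":
        frames = torch.randn(2, 100, 64)   # 2 clips, 100 frames, 64-d features
        out = LocalGlobalBlock(dim=64)(frames)
        print(out.shape)                   # torch.Size([2, 100, 64])

    In a full segmentation pipeline, a stack of such blocks would be followed by a per-frame classifier and, per the abstract, a boundary-refinement stage (the BDN) that cleans up ambiguous transitions between segments.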