Peer-reviewed, Open access
  • Beyond visual semantics: Ex...
    Dey, Arka Ujjal; Ghosh, Suman K.; Valveny, Ernest; Harit, Gaurav

    Pattern Recognition Letters, September 2021, Volume 149
    Journal Article

    Highlights:
    • Images use visual content and scene text to convey ideas.
    • Jointly leveraging scene text and visual cues leads to robust semantic interpretation.
    • Contextual encoding captures the dynamics between co-occurring visual and text elements.
    • Text-visual semantics can be applied to retrieval and classification tasks alike.

    Abstract: Images with visual and scene text content are ubiquitous in everyday life. However, current image interpretation systems are mostly limited to visual features and neglect the scene text content. In this paper, we propose to jointly use the scene text and visual channels for robust semantic interpretation of images. We not only extract and encode visual and scene text cues but also model their interplay to generate a contextual joint embedding with richer semantics. The resulting contextual embedding is applied to retrieval and classification tasks on multimedia images with scene text content to demonstrate its effectiveness. In the retrieval framework, we augment the contextual semantic representation with scene text cues to mitigate vocabulary misses that may have occurred during the semantic embedding. To deal with irrelevant or erroneous scene text recognition, we also apply query-based attention to the text channel. We show that our multi-channel approach, involving contextual semantics and scene text, improves absolute accuracy over the current state-of-the-art methods on the Advertisement Images Dataset by 8.9% in the relevant statement retrieval task and by 5% in the topic classification task.
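    The abstract mentions query-based attention over the text channel but does not say how it is computed. The following is a minimal sketch of one common form, scaled dot-product attention, and is an assumption rather than the authors' method. The function names (`dot`, `softmax`, `query_attention`) and the toy vectors are hypothetical. It illustrates how scene text tokens that are dissimilar to the query, such as misrecognized words, could be down-weighted before pooling.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_attention(query, token_vectors):
    """Pool scene-text token vectors, weighted by similarity to the query.

    Assumption: scaled dot-product scoring; the abstract does not
    specify the scoring function. Returns (weights, pooled_vector).
    """
    if not token_vectors:
        # No recognized scene text: nothing to attend to.
        return [], [0.0] * len(query)
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in token_vectors])
    pooled = [
        sum(w * t[i] for w, t in zip(weights, token_vectors))
        for i in range(len(query))
    ]
    return weights, pooled


# Toy example: the first token aligns with the query, the last opposes it.
query = [1.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, pooled = query_attention(query, tokens)
```

    In this toy example the first token receives the largest weight and the opposing token the smallest, so the pooled text vector leans toward query-relevant tokens.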