Cross-lingual cross-modal retrieval aims to leverage human-labeled annotations in a source language to build cross-modal retrieval models for a new target language, motivated by the lack of manually annotated datasets in low-resource (target) languages. In contrast to the rapid progress in monolingual cross-modal retrieval, cross-modal retrieval in the cross-lingual scenario has received far less attention. A straightforward way to obtain target-language labeled data is to translate source-language datasets with Machine Translation (MT). However, MT is imperfect and tends to introduce noise, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we propose Noise-Robust Fine-tuning (NRF), which extracts clean textual information from a possibly noisy target-language input under the guidance of its source-language counterpart. In addition, contrastive learning involving different modalities is performed to strengthen the noise robustness of our model. Unlike traditional cross-modal retrieval methods, which employ only image/video-text paired data for fine-tuning, NRF uses selected parallel data to improve the model's noise-filtering ability. Extensive experiments are conducted on three video-text and image-text retrieval benchmarks across different target languages, and the results demonstrate that our method significantly improves overall performance without using any image/video-text paired data in the target languages.
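The abstract does not spell out NRF's contrastive objective at code level. As a generic, minimal sketch of contrastive learning between a clean view and a noisy (MT-style) view, here is a symmetric InfoNCE loss in NumPy; the toy data and all names are hypothetical, not the paper's implementation:

```python
import numpy as np

def info_nce_loss(src, tgt, temperature=0.07):
    """Symmetric InfoNCE: row i of `src` is the positive for row i of `tgt`."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = src @ tgt.T / temperature            # pairwise cosine similarities
    idx = np.arange(len(src))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()            # diagonal = matched pairs

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 16))                  # source-language embeddings
noisy = clean + 0.1 * rng.normal(size=(8, 16))    # mildly MT-noised target view
print(f"aligned-pair loss: {info_nce_loss(clean, noisy):.4f}")
```

With matched rows as positives, the loss stays low when the noisy view is a mild perturbation of the clean one and grows when the pairing is random, which is the property a noise-robust objective relies on.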
Affect sensing is a rapidly growing field with the potential to revolutionize human–computer interaction, healthcare, and many other applications. Multimodal Sentiment Analysis (MSA) is a recent research area that exploits the multimodal nature of video data for affect sensing. However, the success of a multimodal framework depends on addressing the challenges of integrating diverse modalities and selecting informative features. We propose a novel multimodal representation learning framework using multimodal autoencoders that learns a comprehensive representation of the underlying heterogeneous modalities. Affect sensing is even more challenging in low-resource languages, where annotated video datasets and language-specific models are limited. To address this, we introduce the Multimodal Sentiment Analysis Corpus in Tamil (MSAT), a small dataset for Tamil MSA, and show how a novel cross-lingual transfer learning technique in a multimodal setting leverages knowledge gained by training the model on a larger English MSA dataset to fine-tune on the much smaller Tamil MSA dataset. Our transfer learning model achieves significant gains on the Tamil dataset. Our experiments demonstrate that efficient, generalizable models for low-resource languages can be built using existing MSA datasets.
•A novel multimodal representation learning model (SPMMAE) for Sentiment Analysis is proposed.
•A Multimodal Sentiment Analysis corpus in Tamil, MSAT, is curated for low-resource language research.
•Achieved state-of-the-art results on CMU-MOSI, CMU-MOSEI, and MELD datasets using SPMMAE.
•Cross-lingual transfer learning in a multimodal setting improves MSAT’s performance by 11%.
•Established benchmark results on MSAT, a valuable resource for future research.
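SPMMAE's architecture is not detailed in this abstract. As a hedged illustration of the general multimodal-autoencoder idea (concatenated text/audio/video features compressed to a shared latent and reconstructed), here is a minimal linear sketch in NumPy; all sizes, weights, and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance features for three modalities (toy sizes).
text  = rng.normal(size=(32, 10))
audio = rng.normal(size=(32, 6))
video = rng.normal(size=(32, 8))
X = np.concatenate([text, audio, video], axis=1)   # fused input, d = 24

d, k = X.shape[1], 5                               # shared latent size k
W_enc = 0.1 * rng.normal(size=(d, k))              # encoder weights
W_dec = 0.1 * rng.normal(size=(k, d))              # decoder weights

def forward(X):
    Z = X @ W_enc                                  # joint multimodal embedding
    return Z, Z @ W_dec                            # latent, reconstruction

losses, lr = [], 0.01
for _ in range(800):                               # plain gradient descent
    Z, X_hat = forward(X)
    losses.append(np.mean((X_hat - X) ** 2))
    G = 2.0 * (X_hat - X) / len(X)                 # grad of per-sample loss w.r.t. X_hat
    g_dec = Z.T @ G                                # backprop through decoder
    g_enc = X.T @ (G @ W_dec.T)                    # backprop through encoder
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
```

The shared latent `Z` is the kind of joint representation a downstream sentiment classifier would consume; a real model would add nonlinearities and modality-specific encoders.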
•A benchmark Hindi dataset in the disaster domain for emotion detection is created.
•Transfer learning between languages can be done in a shared vector space.
•Similarity of task supersedes similarity of language for cross-lingual transfer learning.
The performance of any natural language processing (NLP) system depends greatly on the amount of resources and tools available for a particular language or domain. Therefore, when solving a problem in a low-resource setting, it is important to investigate techniques that leverage the resources and tools available in resource-rich languages. In this paper we propose an efficient technique to mitigate resource scarcity for emotion detection in Hindi by leveraging information from a resource-rich language, English. Our method follows a deep transfer learning framework that captures relevant information through the shared space of the two languages, performing significantly better than the monolingual scenario, which learns in the vector space of only one language. As base learning models, we use a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory (Bi-LSTM) network. As there is no available emotion-labeled dataset for Hindi, we create a new dataset for emotion detection in the disaster domain by annotating sentences of news documents with nine classes based on Plutchik's wheel of emotions. To improve emotion classification in Hindi, we employ transfer learning to exploit the resources available in related domains. The core of our approach lies in generating a cross-lingual word embedding representation of words in the shared embedding space. The neural networks are trained on the existing datasets, and the weights are then fine-tuned following four different transfer learning strategies for emotion classification in Hindi. Our proposed transfer learning techniques obtain a significant performance gain, achieving an F1-score of 0.53 (compared to 0.47), implying that knowledge from a resource-rich language can be transferred across languages and domains. (Code and data are available at https://github.com/zishanahmad1821cs18/crosslingual_transfer.)
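The abstract does not specify how the shared cross-lingual embedding space is built. One standard technique (not necessarily the authors') is to align two monolingual embedding spaces with an orthogonal Procrustes mapping learned from a seed translation dictionary, sketched below with synthetic data standing in for English and Hindi vectors:

```python
import numpy as np

def procrustes_map(X_src, Y_tgt):
    """Orthogonal W minimizing ||X_src @ W - Y_tgt||_F, where matching rows
    are embeddings of word pairs from a seed translation dictionary."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

rng = np.random.default_rng(0)
# Toy "English" vectors and a hidden rotation standing in for the "Hindi" space.
X = rng.normal(size=(50, 8))                    # 50 dictionary entries, dim 8
R, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # ground-truth rotation
Y = X @ R + 0.01 * rng.normal(size=(50, 8))     # near-parallel translations

W = procrustes_map(X, Y)
err = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
```

Because `W` is orthogonal, distances in the source space are preserved, so a classifier trained on mapped source vectors can be applied directly to target-language vectors in the shared space.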
The proposed framework addresses the problem of cross-lingual transfer learning by resorting to Parallel Factor Analysis 2 (PARAFAC2). To avoid the need for multilingual parallel corpora, a pairwise setting is adopted in which a PARAFAC2 model is fitted to documents written in English (the source language) and a different target language. First, an unsupervised PARAFAC2 model is fitted to pairs of parallel unlabelled corpora to learn the latent relationship between the source and target languages. The fitted model is used to create embeddings for a text classification task (document classification or authorship attribution). Subsequently, a logistic regression classifier is fitted to the source-language training embeddings and tested on the target-language training embeddings. Following the zero-shot setting, no labels are exploited for the target-language documents. The framework also incorporates a self-learning process: the predicted labels are used as pseudo-labels to train a new, pseudo-supervised PARAFAC2 model, which aims to extract latent class-specific information while fusing language-specific information. Thorough evaluation is conducted on cross-lingual document classification and cross-lingual authorship attribution. Remarkably, the proposed framework achieves competitive results compared to deep learning methods on cross-lingual transfer learning tasks.
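A full PARAFAC2 fit requires a tensor-decomposition library, but the zero-shot-then-self-learning loop around it can be sketched with any base classifier standing in for the embedding model. Below, a nearest-centroid classifier (a hypothetical stand-in, on toy "embeddings") illustrates the pseudo-labeling step:

```python
import numpy as np

rng = np.random.default_rng(0)

def centroid_fit(X, y):
    """One class centroid per label (labels assumed to be 0..C-1)."""
    return np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])

def centroid_predict(X, centroids):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

# Toy "document embeddings": two classes; target distribution slightly shifted.
n = 100
src_X = np.concatenate([rng.normal(-2, 1, (n, 4)), rng.normal(2, 1, (n, 4))])
src_y = np.array([0] * n + [1] * n)
tgt_X = np.concatenate([rng.normal(-1.5, 1, (n, 4)), rng.normal(2.5, 1, (n, 4))])
tgt_y = np.array([0] * n + [1] * n)      # held out; never used for training

# Zero-shot step: the source-trained classifier labels the target data.
cent = centroid_fit(src_X, src_y)
pseudo = centroid_predict(tgt_X, cent)

# Self-learning step: refit on source labels plus target pseudo-labels,
# analogous to training the pseudo-supervised PARAFAC2 model.
cent2 = centroid_fit(np.concatenate([src_X, tgt_X]),
                     np.concatenate([src_y, pseudo]))
acc = (centroid_predict(tgt_X, cent2) == tgt_y).mean()
```

The refit absorbs target-language statistics without ever touching target labels, which is the essence of the zero-shot self-learning setting described above.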
Recently, cross-lingual transfer learning has attracted extensive attention from both academia and industry. Previous studies usually focus on a single level of alignment (e.g., word-level or sentence-level) based on pre-trained language models. However, this leads to suboptimal performance in downstream tasks for low-resource languages, because the correlations of hierarchical semantic information (e.g., sentence-to-word, word-to-word) are missed. Therefore, in this paper we propose a novel multi-level alignment framework that hierarchically learns the semantic correlation between multiple levels through well-designed alignment training tasks. In addition, we devise an attention-based fusion mechanism (AFM) to infuse semantic information from higher levels. Extensive experiments on mainstream cross-lingual tasks (text classification, paraphrase identification, and named entity recognition) demonstrate the effectiveness of the proposed method and show that our model achieves state-of-the-art performance across various benchmarks compared to other strong baselines.
In this work, we tackle machine reading comprehension (MRC) on the Holy Qur’an to address the lack of Arabic datasets and systems for this important task. We construct QRCD, the first Qur’anic Reading Comprehension Dataset, composed of 1,337 question-passage-answer triplets for 1,093 question-passage pairs, of which 14% are multi-answer questions. We then introduce CLassical-AraBERT (CL-AraBERT for short), a new AraBERT-based pre-trained model that is further pre-trained on a Classical Arabic (CA) dataset of about 1.0B words to complement the Modern Standard Arabic (MSA) resources used in pre-training the initial model and make it a better fit for the task. Finally, we leverage cross-lingual transfer learning from MSA to CA and fine-tune CL-AraBERT as a reader using two MSA-based MRC datasets followed by our QRCD dataset, constituting (to the best of our knowledge) the first MRC system on the Holy Qur’an. To evaluate our system, we introduce Partial Average Precision (pAP), an adaptation of the traditional rank-based Average Precision measure that integrates partial matching into the evaluation of multi-answer and single-answer MSA questions. Adopting two experimental evaluation setups (hold-out and cross-validation (CV)), we empirically show that the fine-tuned CL-AraBERT reader significantly outperforms the baseline fine-tuned AraBERT reader by 6.12 and 3.75 points in pAP score in the hold-out and CV setups, respectively. To promote further research on this task and other related tasks on the Qur’an and Classical Arabic text, we make both the QRCD dataset and the pre-trained CL-AraBERT model publicly available.
•We developed the first reading comprehension system on the Qur’an.
•QRCD is introduced as the first Qur’anic Reading Comprehension Dataset.
•CL-AraBERT is developed as a pre-trained model over a Classical Arabic dataset.
•Cross-lingual transfer learning from MSA to Classical Arabic is leveraged.
•A new rank-based measure is proposed to integrate partial matching.
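The exact pAP formula is defined in the paper itself. As a generic illustration of rank-based average precision with graded (partial-match) relevance in place of binary relevance, a sketch might look like the following; the function name and the graded-relevance reading are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def partial_average_precision(rel):
    """Rank-based AP where rel[i] in [0, 1] is the partial-match score of the
    answer ranked at position i (1.0 = exact match, 0.0 = no overlap).
    Reduces to standard binary AP when every rel[i] is 0 or 1.
    NOTE: an illustrative generalization, not the paper's exact pAP."""
    rel = np.asarray(rel, dtype=float)
    cum = np.cumsum(rel)
    precision_at_k = cum / (np.arange(len(rel)) + 1)
    denom = rel.sum()
    return 0.0 if denom == 0 else float((precision_at_k * rel).sum() / denom)

exact  = partial_average_precision([1.0, 0.0, 1.0])   # two exact matches
graded = partial_average_precision([1.0, 0.0, 0.5])   # second match is partial
```

As with any AP-style measure, correct answers ranked earlier contribute more, so the metric rewards both match quality and ranking.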
Wikipedia is one of the most prominent online platforms from which people acquire knowledge; thus, the quality of its articles should be of great concern. Many scholars currently focus on quality assessment and quality-flaw detection in Wikipedia articles, but most consider only one language version, typically English. One major obstacle to conducting such research in non-English or multilanguage scenarios is insufficient labeled data. To address this, we introduce transfer learning based on a pretrained multilingual model to verify whether cross-language flaw detection is feasible. Specifically, we chose the Advert flaw (content written like an advertisement) as our research objective; French, Spanish, and Chinese as the target language scenarios; and English articles as the source scenario. Multilingual BERT combined with a sequential model was used to extract semantic features and build classifiers. Moreover, we compared three strategies (direct transfer, fine-tuning transfer, and non-transfer) to determine the best strategy for cross-language Advert flaw detection at different training sample scales. The experimental results demonstrate that the proposed model, trained on the English dataset, can identify the Advert flaw in other languages, and that fine-tuning transfer yields the best performance as the corpus grows.
•Transfer learning is introduced for cross-lingual Wikipedia advert detection.
•Models trained on English Wikipedia samples can detect adverts in non-English Wikipedia.
•Multilingual BERT is a qualified encoder for cross-lingual transfer learning.
•The proposed fine-tuning transfer performs best across different dataset scales.
Current research on cross-modal retrieval is mostly English-oriented, owing to the availability of a large number of human-labeled, English-oriented vision-language corpora. To break the limits of non-English labeled data, cross-lingual cross-modal retrieval (CCR) has attracted increasing attention. Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation (MT) to achieve cross-lingual transfer. However, the sentences translated by MT are generally imperfect descriptions of the corresponding visual content. Improperly assuming that the pseudo-parallel data are correctly correlated makes networks overfit to the noisy correspondence. We therefore propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR. In particular, we quantify the confidence of each sample-pair correlation with optimal transport theory from both the cross-lingual and cross-modal views, and design dual-view curriculum learning to dynamically model the transportation costs according to the learning stage of the two views. Extensive experiments are conducted on two multilingual image-text datasets and one video-text dataset, and the results demonstrate the effectiveness and robustness of the proposed method. Moreover, the proposed method extends well to cross-lingual image-text baselines and generalizes decently to out-of-domain data.
Neural machine translation (NMT) systems have made tangible progress in recent years, making them usable for an increasing number of domains and language pairs. The development of neural systems is based on machine learning algorithms and requires large electronic corpora of parallel texts aligned at the sentence level. Such resources, however, exist for only a small number of language pairs and domains. To overcome this problem, a recent proposal is to develop so-called “multilingual” translation systems. These developments have been driven in particular by major Internet players, who need to develop automatic language processing tools for as many languages as possible. The main characteristic of these systems is that they process multiple languages, on both the source and target sides, with a single translation engine. In this paper, we present the general principles underlying these systems and the innovations that have made them possible, before discussing their main strengths and weaknesses.