The rapid globalisation of language technology and the fast expansion of the Internet have brought nations and their cultures closer together, and the demand for inter-language interaction has risen enormously. However, for many low-resource language (LRL) pairs and regions, Machine Translation (MT) is still not viable because of a lack of parallel data, and the challenge of MT remains unsolved. Recent studies employing monolingual datasets have shown excellent outcomes in Phrase-based Statistical MT (PBSMT) and Neural MT (NMT) systems. However, earlier research has demonstrated that unsupervised Statistical MT surpasses unsupervised NMT, especially for dissimilar language pairs. This study presents a compendium of ten unsupervised SMT translation tasks that use monolingual datasets from the Dravidian and Indo-Aryan language families and from a low-resource endangered language. The systems are examined with different tokenizers and evaluated across language pairs using different evaluation metrics over several iterations. The statistical significance of the test results is computed for each evaluation metric to check the true quality of the systems on the translation tasks.
Learning machine translation from monolingual datasets alone is a complex task, as there are many possible ways to associate target sentences with source sentences. Monolingual word embeddings can be linearly mapped onto a common shared space through robust self-learning or adversarial training in an unsupervised way, but these learning techniques have fundamental limitations in translating sentences. In this paper, a simple yet effective method is proposed for fully unsupervised machine translation that is based on cross-lingual sense-to-word embeddings instead of cross-lingual word embeddings and a language model. We use word sense disambiguation to incorporate the source-language context in order to select the sense of a word more appropriately. A language model that considers target-language context in lexical choices and a denoising autoencoder that handles insertion, deletion, and reordering are integrated. The proposed approach eliminates the problem of noisy target-language context caused by erroneous word translations. This work takes into account the challenge of homonyms and polysemous words in morphologically rich languages. Experiments on English-Hindi and Hindi-English using different evaluation metrics show an improvement of +3 points in BLEU and METEOR-Hindi over the baseline system.
This letter presents a compendium of eight unsupervised Machine Translation (MT) systems built from monolingual corpora of five Indian languages from the Indo-Aryan and Dravidian language families. Recent research has demonstrated outstanding results in completely unsupervised training of Phrase-based Statistical MT (PBSMT) systems using innovative designs that rely solely on monolingual datasets. Moreover, prior research has shown that Unsupervised Statistical MT (USMT) outperforms Unsupervised Neural MT (UNMT), particularly for language pairs that are not closely related. The purpose of this work is to investigate the architecture of the USMT system using only monolingual datasets from four morphologically rich Indian languages and the low-resource endangered Kangri language. The experimental results are obtained with different natural language toolkit tokenizers and analyzed for the different language pairs using various fully automatic MT evaluation metrics over different iterations.
Unsupervised word-to-word translation without parallel corpora has attracted much research interest in recent years. Recent techniques trained with adversarial learning methods have achieved remarkably high accuracy, but they suffer from the typical drawbacks of generative adversarial models: sensitivity to hyper-parameters, long training time, and lack of interpretability. In this paper, we propose a method of cross-lingual word embedding generation for English and the morphologically rich Hindi language, aimed especially at healthcare professionals, to help remove the communication barrier with patients regardless of their language. No aligned documents, sentence-aligned corpus, or bilingual dictionary is required, because a fully unsupervised learning method is used. We follow the assumption that the intra-lingual similarity distributions of the most common terms are identical across the language pair and that the embedding spaces are isometric. The performance is analyzed using different word retrieval methods and compared for cross-lingual word embeddings of the English-Hindi language pair trained in both a fully unsupervised way and a semi-supervised way, the latter by passing a seed dictionary. We also provide a comparative analysis of the results of adversarial training and the robust self-learning method for English and Hindi.
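As an illustration of what a word retrieval step over cross-lingual embeddings can look like, the following is a minimal sketch of nearest-neighbour retrieval by cosine similarity. The abstract does not name the retrieval methods it compares, so nearest-neighbour is only one common choice, shown here as an assumption. The vectors, the romanized Hindi entries, and the function names are all hypothetical toy values, and the embeddings are assumed to be already mapped into one shared space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbour(query_vec, target_vocab):
    """Return the target word whose vector is most similar to query_vec."""
    return max(target_vocab, key=lambda w: cosine(query_vec, target_vocab[w]))


# Made-up 2-d vectors, assumed to live in a shared cross-lingual space.
english = {"water": [0.9, 0.1], "doctor": [0.1, 0.95]}
hindi = {"paani": [0.88, 0.15], "chikitsak": [0.12, 0.9]}  # romanized toy entries

# Induce a word-level dictionary by retrieving the closest Hindi word.
translations = {w: nearest_neighbour(v, hindi) for w, v in english.items()}
```

Retrieval accuracy can then be measured by comparing the induced `translations` against a held-out bilingual dictionary, which is needed only for evaluation and not for training.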
The fast advancement of machine translation models necessitates the development of accurate evaluation metrics that allow researchers to track progress across languages. The evaluation of machine translation models is crucial, since its results are used to improve the models. However, fully automatic evaluation of machine translation models is itself a huge challenge for researchers, while human evaluation is very expensive, time-consuming, and unreproducible. This paper presents a detailed classification and comprehensive survey of fully automatic evaluation metrics used to assess the quality of machine-translated output. For better understanding, the metrics are classified into five categories: lexical, character, semantic, syntactic, and combined semantic and syntactic evaluation metrics. The challenges that Statistical Machine Translation and Neural Machine Translation pose for machine translation evaluation are taken into account, and the advantages, disadvantages, and gaps of each fully automatic evaluation metric are discussed. The study will help machine translation researchers quickly identify the automatic evaluation metrics most appropriate for improving or developing their models, and will give researchers a general understanding of how automatic machine translation evaluation research has evolved.
With the fast advancement of AI technology in recent years, many excellent Data Augmentation (DA) approaches have been investigated to increase data efficiency in Natural Language Processing (NLP). NLP models rely on large amounts of data, yet tasks such as labelling enormous amounts of text require substantial time, money, and human resources; a better model nonetheless requires more data. Text DA techniques address this by extending the data, enhancing the model's accuracy and resilience. A novel lexical-based matching approach is the cornerstone of this work; it is used to improve the quality of the Machine Translation (MT) system. This study uses resource-rich Indic languages (i.e., from the Indo-Aryan and Dravidian language families) to examine the proposed techniques. Extensive experiments on a range of language pairs show that the proposed method significantly improves BLEU, METEOR, and ROUGE scores on the augmented dataset compared with the baseline system.
Back-translation is an effective method for utilizing monolingual data and enhancing the performance of neural machine translation models, and conducting it iteratively can further improve the translation model. In back-translation, pseudo sentence pairs are generated to train the translation systems with a reconstruction loss, but not all pseudo sentence pairs are of good quality, which can severely impact the performance of neural machine translation systems. This paper proposes an approach to unsupervised neural machine translation that uses weighted back-translation as part of the training process, giving more weight to good pseudo-parallel sentence pairs. The weight is calculated as the round-trip semantic similarity score of each pseudo-parallel sentence pair. This overcomes the limitation of earlier lexical metric-based approaches, especially in the case of morphologically rich languages. Experimental results show an improvement of up to around 0.7% BLEU over the baseline for morphologically rich language pairs (English–Hindi, English–Tamil, and English–Telugu) and 0.3% BLEU for the low-resource Hindi–Kangri pair.
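The weighting idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the sentence vectors are assumed to come from some hypothetical sentence encoder, cosine similarity stands in for the round-trip semantic similarity score, negative similarities are clamped to zero, and the batch loss is normalized by the sum of the weights. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def round_trip_weight(source_vec, round_trip_vec):
    """Weight for one pseudo-parallel pair.

    source_vec: embedding of the original monolingual sentence.
    round_trip_vec: embedding of the same sentence after translating it to
    the other language and back (both from an assumed sentence encoder).
    Negative similarities are clamped to 0 (an assumption of this sketch).
    """
    return max(0.0, cosine(source_vec, round_trip_vec))


def weighted_reconstruction_loss(losses, weights):
    """Weighted mean of per-pair reconstruction losses.

    Pairs with a higher round-trip similarity contribute more to the
    training signal; pairs with weight 0 are ignored.
    """
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / total


# Toy batch: the second pair drifted in meaning during the round trip.
weights = [
    round_trip_weight([1.0, 0.0], [1.0, 0.0]),   # meaning preserved
    round_trip_weight([1.0, 0.0], [-1.0, 0.0]),  # meaning lost
]
batch_loss = weighted_reconstruction_loss([2.0, 4.0], weights)
```

Because the weight depends on semantic rather than surface overlap, a round-trip sentence that differs only in inflection or word order from the source can still receive a high weight, which is the motivation the abstract gives for morphologically rich languages.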
As vehicular traffic continues to grow, traffic management and the prevention of accidents have become major concerns. The problem is magnified when travel on mountainous roads is considered. This study focuses especially on the Himalayan mountains, as they pose a greater risk because of their rugged natural setting. It investigates the crucial problems faced on hilly roads and the challenges in transferring existing driver-assistance systems to such roads. The survey examines lane detection algorithms, image processing techniques, and various assistance features for their applicability to hilly roads, discussing the pros and cons of each. Conclusions are drawn about the more suitable methods that can be improved and re-tuned to adapt them to mountainous roads.
The fast growth of communication technology has brought nations and their cultures closer together, and the demand for cross-language communication has risen tremendously. There are different learning methods for connecting the source language to the target language, among which unsupervised learning is a blessing for low-resource languages. Unsupervised machine translation remains problematic for languages that are morphologically rich and low-resource. Results are particularly poor when translating from a morphologically less complex language into a morphologically more complex one. In this paper, we improve unsupervised neural machine translation for morphologically rich languages by tackling the ambiguity problem and the quality of the pseudo-parallel sentence pairs generated through back-translation. The ambiguity problem is addressed by using cross-lingual sense embeddings at the source side instead of cross-lingual word embeddings. The quality of the pseudo-parallel sentences is improved by giving more weight to better pseudo-parallel sentence pairs in the back-translation step. Different evaluation metrics are used to check the robustness of the model, and the results are compared with different baseline models. The experiments are performed on morphologically rich language pairs (English-Hindi, English-Tamil, and English-Telugu) and on the low-resource endangered Kangri language.
A recent United Nations Educational, Scientific and Cultural Organization (UNESCO) survey states that India has 197 endangered languages. Himachal Pradesh, a state in India, tops the list with seven definitely endangered languages, Kinnauri-Pahari being one of them. Because digitized resources are scarce, compiling a corpus is difficult. This paper presents and releases the Kinnauri-Pahari (ISO-639-3:kjo) dataset, consisting of 43,362 monolingual and 20,307 parallel sentences in version_0.1. The dataset was tested on Statistical and Neural Machine Translation systems, and the results were evaluated using different evaluation metrics. The corpus is freely available for non-commercial usage and research (https://github.com/phildani7/dlnith/tree/master/Kinnauri-Pahari).