Most existing approaches to bilingual lexicon extraction (BLE) first map words in the source and target languages into a single vector space, and then measure the similarity of words across the two languages in this space. We point out that existing BLE methods suffer from the so-called hubness phenomenon; i.e., a small number of translation candidates (hub candidates) are chosen by the systems as likely translations of many source words, which consequently degrades the accuracy of the extracted translations. We show that this phenomenon can be alleviated by centering the data or by using the mutual proximity measure, two known techniques that effectively reduce hubness in standard nearest-neighbor search settings. Our empirical evaluation shows that naive nearest-neighbor search combined with these methods outperforms a recently proposed BLE method based on label propagation.
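As a rough illustration of the centering idea, the following minimal sketch runs cosine nearest-neighbor search with and without centering and counts how often each candidate is returned as a nearest neighbor. It is not the authors' implementation: the function names (`cosine_sim`, `nearest_candidates`, `n1_counts`), the choice to center on the mean of the candidate vectors, and the synthetic data are all assumptions made for the example, and mutual proximity is not shown.

```python
import numpy as np


def cosine_sim(queries, candidates):
    """Cosine similarity between every query row and every candidate row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return q @ c.T


def nearest_candidates(queries, candidates, center=False):
    """Index of the most similar candidate for each query.

    With center=True, both sets are shifted so that the candidate
    centroid becomes the origin (an assumption of this sketch).
    """
    if center:
        mean = candidates.mean(axis=0)
        queries = queries - mean
        candidates = candidates - mean
    return cosine_sim(queries, candidates).argmax(axis=1)


def n1_counts(nn_indices, n_candidates):
    """How many queries selected each candidate as their nearest neighbor."""
    return np.bincount(nn_indices, minlength=n_candidates)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic "target-language" vectors sharing a large common offset.
    candidates = rng.normal(size=(200, 50)) + 3.0
    # Synthetic "mapped source-language" vectors: noisy copies of the candidates.
    queries = candidates + rng.normal(scale=1.0, size=candidates.shape)

    for center in (False, True):
        nn = nearest_candidates(queries, candidates, center=center)
        counts = n1_counts(nn, len(candidates))
        print(f"center={center}: largest N1 count = {counts.max()}")
```

A candidate with a very large count is a hub in the sense described above; comparing the largest count with and without centering gives a quick, informal check of whether centering helps on a given data set.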
In this article, we present experiments on multi-language information extraction and access in the medical domain. For such applications, multilingual terminology plays a crucial role when working on specialized languages and specific domains.
We propose, first, a method for enriching multilingual thesauri that extracts new terms from parallel corpora and, second, a new approach to bilingual lexicon extraction from comparable corpora that uses a bilingual thesaurus as a pivot. We illustrate their use in multi-language information retrieval (English/German) in the medical domain.
Our experiments show that these automatically extracted bilingual lexicons are accurate enough (85% precision for term extraction) for semi-automatically enriching monolingual or bilingual thesauri such as the Unified Medical Language System (UMLS), and that their use in cross-language information retrieval significantly improves retrieval performance (from 22% to 40% average precision) and clearly outperforms existing bilingual lexicon resources (both general and specialized lexicons).
In this paper we show, first, that bilingual lexicon extraction from parallel corpora in the medical domain can yield accurate, specialized lexicons that help enrich existing thesauri and, second, that bilingual lexicons extracted from comparable corpora outperform general bilingual resources for cross-language information retrieval.
Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires extensive expertise in both languages involved and is a costly process. Several automatic methods have been proposed as an alternative, but they often rely on resources available in only a limited number of languages, and their performance is still far behind the quality of manual translations. Our work concerns bilingual lexicon extraction from multilingual parallel and comparable corpora, in other words, the process of finding translation pairs among the common multilingual vocabulary available in such corpora.
Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more widely available, many studies have been conducted to extract parallel sentences from them for SMT. Parallel sentence extraction relies heavily on bilingual lexicons, which are also very scarce. We propose an unsupervised parallel sentence extraction system based on bilingual lexicon extraction: it first extracts bilingual lexicons from comparable corpora and then extracts parallel sentences using those lexicons. Our bilingual lexicon extraction method combines topic-model and context-based methods in an iterative process. The proposed method does not rely on any prior knowledge, and its performance can be improved iteratively. The parallel sentence extraction method uses a binary classifier for parallel sentence identification, and the extracted bilingual lexicons are supplied to the classifier to improve its performance. Experiments conducted on Wikipedia data indicate that the proposed bilingual lexicon extraction method greatly outperforms existing methods, and that the extracted bilingual lexicons significantly improve the performance of parallel sentence extraction for SMT.
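To make concrete how an extracted lexicon might feed a parallel-sentence classifier, here is a minimal sketch of one plausible lexicon-based feature: the fraction of source tokens that have at least one lexicon translation in the candidate target sentence. The abstract does not specify the classifier's features, so the function name `lexicon_coverage`, the lexicon format (a dict mapping a source word to a set of target words), and the toy French/English data are all assumptions for illustration only.

```python
def lexicon_coverage(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with a lexicon translation in the target sentence.

    lexicon: dict mapping a source word to a set of candidate target words
    (a hypothetical format chosen for this sketch).
    """
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    # A source token counts as covered if any of its translations occurs in the target.
    covered = sum(1 for word in src_tokens if lexicon.get(word, set()) & tgt_set)
    return covered / len(src_tokens)


if __name__ == "__main__":
    toy_lexicon = {"chat": {"cat"}, "noir": {"black"}}
    score = lexicon_coverage(
        ["le", "chat", "noir"], ["the", "black", "cat"], toy_lexicon
    )
    print(f"coverage = {score:.2f}")
```

A score like this could be one input among several to a binary classifier; a richer lexicon would raise coverage on truly parallel pairs, which is consistent with the claim that better lexicons improve sentence extraction.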