ALL libraries (COBIB.SI union bibliographic/catalogue database)
  • Statistical machine translation of subtitles for highly inflected language pair
    Sepesy Maučec, Mirjam ; Kačič, Zdravko ; Verdonik, Darinka
    This paper addresses the problem of statistical machine translation between highly inflected languages. Even when dealing with closely related language pairs, statistical machine translation ... encounters problems if the parallel corpus is not large enough. To reduce the problem of data sparsity, we use the approach called factored translation, which has proven successful when translating between English and a morphologically rich language. We show that it is even more useful when translating between two highly inflected languages. The main contribution of the paper involves two extensions of the factored translation approach. First, we propose a new, more general asynchronous framework for training translation components, where lemmas in the lemma component and MSD tags in the MSD component are aligned independently of the alignment of surface word forms. The second contribution of the paper is a new technique for efficient use of a bilingual dictionary in the translation process. A dictionary is introduced into the lemma component to improve lexical translation. Dictionary use is based on entropy. We tested our enhanced translation approach on the Slovenian-Serbian language pair. The system was trained on a freely available OpenSubtitle corpus. The results show improvements in automatic scores (BLEU and TER). The approach could be used for other language pairs, especially if one or both languages are highly inflected.
    Material type - article, component part
    Year - 2014
    Language - English
    COBISS.SI-ID - 17900054