Peer reviewed Open access
  • Four Million Segments and C...
    Jaworski, Rafał; Seljan, Sanja; Dunđer, Ivan

    Information, 04/2023, Volume: 14, Issue: 4
    Journal Article

    Parallel corpora have been widely used in the fields of natural language processing and translation as they provide crucial multilingual information. They are used to train machine translation systems, compile dictionaries, or generate inter-language word embeddings. Many corpora are publicly available; however, support for some languages is still limited. In this paper, the authors present a framework for collecting, organizing, and storing corpora. The solution was originally designed to obtain data for less-resourced languages, but it also proved to work very well for collecting high-value domain-specific corpora. The scenario is based on the collective work of a group of people who are motivated by means of gamification. The rules of the game encourage the participants to submit large resources, and a peer-review process ensures quality. More than four million translated segments have been collected so far.