Peer reviewed
  • Konkani WordNet: Corpus-Bas...
    Manerkar, Sanjana; Asnani, Kavita; Khorjuvenkar, Preeti Ravindranath; Desai, Shilpa; Pawar, Jyoti D.

    ACM transactions on Asian and low-resource language information processing, 07/2022, Volume: 21, Issue: 4
    Journal Article

    Konkani is one of the languages included in the Eighth Schedule of the Indian Constitution. It is the official language of Goa and is spoken mainly in Goa and in parts of Karnataka and Kerala. Konkani WordNet, also referred to as Konkani Shabdamalem (kōṁkanī śabdamālēṁ), was developed under the Indradhanush WordNet Project Consortium from August 2010 to October 2013. The project was funded by Technology Development for Indian Languages (TDIL), Department of Electronics & Information Technology (DeitY), and Ministry of Communication and Information Technology (MCIT). Work on the Konkani WordNet has been halted since the project ended. The Konkani WordNet currently contains around 32,370 synsets. However, to make it a powerful resource for NLP applications in Konkani, it needs further enhancement through community involvement. Crowdsourcing is a technique in which the knowledge of the crowd is used to accomplish a particular task. In this article, we present the details of a crowdsourcing platform named “Konkani Shabdarth” (kōṁkanī śabdārth). Konkani Shabdarth draws on the knowledge of Konkani speakers to create new synsets and so enlarge the wordnet. It is also intended to improve the overall quality of the Konkani WordNet by validating existing synsets and adding missing words to them. A text corpus named “Konkani Shabdarth Corpus” was created from Konkani literature while implementing the Konkani Shabdarth tool. Using this corpus, 572 root words missing from the Konkani WordNet were identified and given as input to Konkani Shabdarth. So far, a total of 94 users have registered on the platform, of whom 25 have actually played the game. Currently, 71 new synsets have been obtained for 21 words.
For some words, multiple entries for the concept definition have been received. This overlap is essential for automating the validation of synsets. Because of the pandemic, it has been difficult to train players and get them to actually play the game and contribute. We also studied how adding missing words from other existing Konkani text corpora affects the coverage of the Konkani WordNet. The expected increase in the percentage coverage of the Konkani WordNet is in the range 20–27 after adding the missing words from the Konkani Shabdarth corpus, compared with the range 1–10 for the other corpora.
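
The abstract does not say how coverage is computed. As a rough illustration only, the sketch below assumes coverage means the percentage of distinct corpus root words that are present in the wordnet. The function names (`missing_words`, `coverage`, `coverage_gain`) and the placeholder tokens `w1` to `w4` are hypothetical and are not taken from the article or from the Konkani Shabdarth tool.

```python
def missing_words(corpus_words, wordnet_words):
    """Distinct corpus words that are absent from the wordnet, sorted."""
    return sorted(set(corpus_words) - set(wordnet_words))


def coverage(corpus_words, wordnet_words):
    """Percentage of distinct corpus words found in the wordnet.

    Assumption: this definition is illustrative; the article may
    define coverage differently (for example, over word tokens).
    """
    corpus = set(corpus_words)
    if not corpus:
        return 0.0
    return 100.0 * len(corpus & set(wordnet_words)) / len(corpus)


def coverage_gain(corpus_words, wordnet_words, added_words):
    """Change in coverage after adding new words to the wordnet."""
    before = coverage(corpus_words, wordnet_words)
    after = coverage(corpus_words, set(wordnet_words) | set(added_words))
    return after - before


# Toy data with placeholder tokens (not real Konkani root words).
corpus = ["w1", "w2", "w3", "w4", "w2"]
wordnet = {"w1", "w2"}

print(missing_words(corpus, wordnet))          # words to offer to contributors
print(coverage(corpus, wordnet))               # coverage before enhancement
print(coverage_gain(corpus, wordnet, {"w3"}))  # gain after adding one word
```

Working on sets of distinct words keeps the measure independent of how often a word occurs. A token-based variant would weight frequent words more heavily and could give different figures.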