Following Den Besten’s (2009) desiderata for historical linguistics of Afrikaans, this article aims to contribute some modern evidence to the debate regarding the founding dialects of Afrikaans. From an applied perspective (i.e. human language technology), we aim to determine which West Germanic language(s) and/or dialect(s) would be best suited for recycling speech resources to develop speech technologies for Afrikaans. Since Afrikaans is recognised as a West Germanic language, it is first compared to Standard Dutch, Standard Frisian and Standard German. Pronunciation distances are measured by means of Levenshtein distances. Afrikaans is found to be closest to Standard Dutch. Secondly, Afrikaans is compared to 361 Dutch dialectal varieties in the Netherlands and northern Belgium, using material from the Reeks Nederlandse Dialectatlassen, a series of dialect atlases compiled by Blancquaert and Pée in the period 1925-1982 which cover the Dutch dialect area. Afrikaans is found to be closest to the South-Holland dialectal variety of Zoetermeer; this largely agrees with the findings of Kloeke (1950). No speech resources are available for Zoetermeer, but such resources are available for Standard Dutch. Although the dialect of Zoetermeer is significantly closer to Afrikaans than Standard Dutch is, Standard Dutch speech resources might be a good substitute.
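As a rough illustration of the Levenshtein-based pronunciation distance mentioned above, the following minimal sketch computes the edit distance between two transcriptions and averages a length-normalised version over a word list. The function names, the normalisation by the longer transcription, and the unit cost for every operation are assumptions made for this sketch; the article's actual procedure may differ, and the strings in the usage note are invented placeholders rather than data from the study.

```python
from typing import Sequence, Tuple


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, seg_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i segments of a
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalised_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length (an assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_distance(pairs: Sequence[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average normalised distance over aligned word pairs of two varieties."""
    return sum(normalised_distance(a, b) for a, b in pairs) / len(pairs)
```

Calling `mean_distance([("abc", "abd"), ("abcd", "abcd")])` on these placeholder strings averages one substitution in three segments with an exact match. In practice each element would be a phonetic transcription of the same word in the two varieties being compared.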
This paper aims to map the landscape of the Human Language Technologies (HLT) sector by applying topic modeling and graph analysis to the scientific literature in the ACL Anthology, with special emphasis on Spanish participation. The analysis takes into account both structured and unstructured data to offer an overview of the HLT landscape in Spain, identifying the main underlying themes and their evolution in recent years compared to the international HLT community. The results are presented through an interactive visualization that allows exploration of the HLT landscape in the time frame 1983-2018.
This paper introduces a large-scale dual-mode stochastic system to automatically diacritize raw Arabic text. The first of these modes determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability via A* lattice search and long-horizon n-gram probability estimation. When full-form words are out of vocabulary (OOV), the system switches to the second mode, which factorizes each Arabic word into all its possible morphological constituents and then applies the same techniques as the first mode to obtain the most likely sequence of morphemes, and hence the most likely diacritization. While the second mode achieves far better coverage of the highly derivative and inflective Arabic language, the first mode is faster to learn, i.e., it yields better disambiguation results for the same size of training corpora, especially for inferring syntactical (case-ending) diacritics. Our hybrid system, which benefits from the advantages of both modes, has experimentally been found superior to the best performing reported systems of Habash and Rambow, and of Zitouni, using the same training and test corpus for the sake of fair comparison. The word error rates of (morphological diacritization, overall diacritization including the case endings) for the three systems are, respectively, as follows: (3.1%, 12.5%), (5.5%, 14.9%), and (7.9%, 18%). The hybrid architecture of language factorizing and unfactorizing components may inspire solutions to other NLP/HLT problems in analogous situations.
This introduction provides an overview of the state-of-the-art technology in Applications of Natural Language to Information Systems. Specifically, we analyze the need for such technologies to successfully address the new challenges of modern information systems, in which the exploitation of the Web as a main data source for business systems becomes a key requirement. We also discuss the reasons why Human Language Technologies themselves have shifted their focus onto new areas of interest directly linked to the development of technology for the treatment and understanding of Web 2.0. These new technologies are expected to be the future interfaces for the new information systems to come. Moreover, we review current topics of interest to this research community, and present the selection of manuscripts chosen by the program committee of the NLDB 2011 conference as representative cornerstone research works, especially highlighting their contribution to the advancement of such technologies.
An emerging project focused on processing educational texts in Spanish, with the aim of reducing the linguistic barriers that hinder reading comprehension for people with hearing impairments, or even for people learning a language other than their mother tongue. We describe the methodology applied to solve the various problems related to this goal, the working hypothesis, and the tasks and partial objectives achieved.