  • Gold-standard datasets for annotation of Slovene computer-mediated communication [Electronic resource]
    Erjavec, Tomaž, 1960- ...
    This paper presents the first publicly available, manually annotated gold-standard datasets for the annotation of Slovene Computer-Mediated Communication. In this type of language, diacritics, ... punctuation and spaces are often omitted, and phonetic spelling and slang words are frequently used, which considerably degrades the performance of text processing tools that were trained on standard Slovene. Janes-Norm, which contains 7,816 texts or 184,766 tokens, is a gold-standard dataset for tokenisation, sentence segmentation and word normalisation, whereas Janes-Tag, comprising 2,958 texts or 75,276 tokens, was created for training and evaluating morphosyntactic tagging and lemmatisation tools for non-standard Slovene.
    Material type - conference paper
    Year - 2016
    Language - English
    COBISS.SI-ID - 62994530