ALL libraries (COBIB.SI union bibliographic/catalogue database)
  • Predicting the level of text standardness in user-generated content [Electronic resource]
    Ljubešić, Nikola, 1979- ...
    Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and non-standard texts in these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results, where the mean absolute error of the best performing method on a three-point scale is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required to perform good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and provide linguists who are interested in non-standard language with a simple way of selecting only such texts for their research.
    Source: Proceedings [Electronic resource] (pp. 371-378)
    Material type - conference contribution ; adult, non-fiction
    Year - 2015
    Language - English
    COBISS.SI-ID - 58338402