(UM)
  • Statistical language modeling based on automatic classification of words
    Sepesy Maučec, Mirjam
    In statistical language modeling, the model's parameters are estimated from large amounts of text. Models of this kind can be built for any language without requiring any linguistic knowledge. Bigram ... and trigram language models will be discussed. Statistical modeling always faces the problem of sparse data. We will compare two proposed solutions: the smoothing method proposed by Katz and the automatic word clustering proposed by Ney. In the first case, some probability mass is redistributed to bigrams (trigrams) that never occurred in the text. In the second case, the words are mapped into classes in such a way that the perplexity of the model is minimized. Comparing word-based and class-based models shows that using clustered words leads to a significant improvement, as measured by perplexity.
    Type of material - conference contribution
    Publish date - 1998
    Language - English
    COBISS.SI-ID - 3943702
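
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the contribution's own implementation: it uses plain maximum-likelihood estimates (no Katz smoothing), a hand-written word-to-class map instead of automatic clustering, and measures perplexity on the training text itself. The function name `bigram_perplexity`, the toy sentence, and the class labels are all assumptions made for illustration. The class-based model assumes the usual factorisation P(w | v) = P(w | c(w)) * P(c(w) | c(v)).

```python
import math
from collections import Counter


def bigram_perplexity(tokens, word_to_class=None):
    """Perplexity of a maximum-likelihood bigram model on its own training text.

    If word_to_class is given, the class-based factorisation
    P(w | v) = P(w | c(w)) * P(c(w) | c(v)) is used.
    If it is None, every word is its own class, which reduces
    to an ordinary word-based bigram model.
    """
    if word_to_class is None:
        word_to_class = {w: w for w in tokens}

    classes = [word_to_class[w] for w in tokens]
    word_count = Counter(tokens)                      # N(w)
    class_count = Counter(classes)                    # N(c)
    history_count = Counter(classes[:-1])             # N(c) as a bigram history
    class_bigram = Counter(zip(classes, classes[1:])) # N(c_prev, c)

    log_prob = 0.0
    for i in range(1, len(tokens)):
        w, c, prev_c = tokens[i], classes[i], classes[i - 1]
        p_word_given_class = word_count[w] / class_count[c]
        p_class_given_prev = class_bigram[(prev_c, c)] / history_count[prev_c]
        log_prob += math.log(p_word_given_class * p_class_given_prev)

    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / (len(tokens) - 1))


# Hypothetical toy data, not taken from the contribution.
tokens = "the cat sat on the mat the dog sat on the rug".split()
toy_classes = {
    "the": "DET", "cat": "ANIMAL", "dog": "ANIMAL",
    "mat": "PLACE", "rug": "PLACE", "sat": "VERB", "on": "PREP",
}

word_ppl = bigram_perplexity(tokens)
class_ppl = bigram_perplexity(tokens, toy_classes)
print(f"word-based perplexity:  {word_ppl:.3f}")
print(f"class-based perplexity: {class_ppl:.3f}")
```

Measured on the training text, an unsmoothed word-based model can never score worse than a class-based one, because it is the maximum-likelihood fit. The improvement reported in the abstract would therefore have to be observed on text with unseen bigrams, which this sketch does not cover.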