E-resources
-
Lau, Jey Han; Baldwin, Timothy; Newman, David
ACM transactions on speech and language processing, 07/2013, Volume: 10, Issue: 3Journal Article
We investigate the impact of preextracting and tokenizing bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and suggest an alternate measure that penalizes model complexity. We show how the Akaike information criterion is a more appropriate measure, which suggests that using a modest number (up to 1000) of top-ranked bigrams is the optimal topic modelling configuration. Using these 1000 bigrams also results in improved topic quality over unigram tokenization. Further increases in topic quality can be achieved by using up to 10,000 bigrams, but this is at the cost of a more complex model. We also show that multiword (bigram and longer) named entities give consistent results, indicating that they should be represented as single tokens. This is the first work to explicitly study the effect of n -gram tokenization on LDA topic models, and the first work to make empirical recommendations to topic modelling practitioners, challenging the standard practice of unigram-based tokenization.
![loading ... loading ...](themes/default/img/ajax-loading.gif)
Shelf entry
Permalink
- URL:
Impact factor
Access to the JCR database is permitted only to users from Slovenia. Your current IP address is not on the list of IP addresses with access permission, and authentication with the relevant AAI accout is required.
Year | Impact factor | Edition | Category | Classification | ||||
---|---|---|---|---|---|---|---|---|
JCR | SNIP | JCR | SNIP | JCR | SNIP | JCR | SNIP |
Select the library membership card:
If the library membership card is not in the list,
add a new one.
DRS, in which the journal is indexed
Database name | Field | Year |
---|
Links to authors' personal bibliographies | Links to information on researchers in the SICRIS system |
---|
Source: Personal bibliographies
and: SICRIS
The material is available in full text. If you wish to order the material anyway, click the Continue button.