On collocations and topic models
Lau, Jey; Baldwin, Timothy; Newman, David
ACM Transactions on Speech and Language Processing, 01/07, Volume: 10, Issue: 3
Journal Article
We investigate the impact of preextracting and tokenizing bigram collocations on topic models. Using extensive experiments on four different corpora, we show that incorporating bigram collocations in the document representation creates more parsimonious models and improves topic coherence. We point out some problems in interpreting test likelihood and test perplexity to compare model fit, and suggest an alternate measure that penalizes model complexity. We show how the Akaike information criterion is a more appropriate measure, which suggests that using a modest number (up to 1000) of top-ranked bigrams is the optimal topic modelling configuration. Using these 1000 bigrams also results in improved topic quality over unigram tokenization. Further increases in topic quality can be achieved by using up to 10,000 bigrams, but this is at the cost of a more complex model. We also show that multiword (bigram and longer) named entities give consistent results, indicating that they should be represented as single tokens. This is the first work to explicitly study the effect of n-gram tokenization on LDA topic models, and the first work to make empirical recommendations to topic modelling practitioners, challenging the standard practice of unigram-based tokenization.
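The preprocessing step described in this abstract can be illustrated with a minimal sketch. It assumes bigrams are ranked by raw frequency, which is a simplification chosen here for brevity; the abstract does not say how the "top-ranked" bigrams are scored. The function names `top_bigrams`, `merge_bigrams` and `aic` are hypothetical. The `aic` helper is the standard Akaike information criterion, shown only to make the complexity penalty concrete.

```python
from collections import Counter


def top_bigrams(docs, n):
    """Return the n most frequent adjacent word pairs across tokenized docs.

    Assumption: ranking by raw count stands in for whatever association
    measure a real pipeline would use.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, _ in counts.most_common(n)}


def merge_bigrams(tokens, bigrams, sep="_"):
    """Greedy left-to-right pass that replaces each listed bigram with one token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def aic(log_likelihood, num_params):
    """Akaike information criterion, 2k - 2 ln L; lower values are preferred."""
    return 2 * num_params - 2 * log_likelihood


# Made-up toy corpus, for illustration only.
docs = [
    ["topic", "model", "inference"],
    ["topic", "model", "evaluation"],
    ["language", "model"],
]
bigrams = top_bigrams(docs, 1)
merged = [merge_bigrams(d, bigrams) for d in docs]
```

Merging bigrams enlarges the vocabulary, so a model fitted on `merged` has more parameters than a unigram model; a criterion such as `aic` charges for that growth, whereas raw test likelihood does not.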
We develop an IGA collocation method modified by collocating at points other than the standard Greville abscissae. The method is related to orthogonal collocation used for solving differential equations and to the superconvergence theory, therefore we refer to this method as “super-collocation” (IGA-SC). By carefully choosing the collocation points, it can be seen that the IGA-SC converges in the first derivative (energy) norms at rates similar to that of the Galerkin solution. This is different from the collocation at Greville abscissae (IGA-C), where the convergence in energy norm for odd polynomial degrees is typically suboptimal. The method is tested on 1D, 2D and 3D numerical examples, in which it is compared to IGA-C and Galerkin’s method (IGA-G). The comparison includes a detailed cost vs. accuracy analysis, which shows an improved efficiency of the proposed method in particular for odd polynomial degrees.
• We propose an isogeometric collocation method with improved approximation properties.
• The locations of the collocation points are derived from the superconvergence theory.
• The proposed method achieves optimal convergence rates in the first derivative norms.
• We include a detailed comparison with standard collocation and the Galerkin method.
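For readers unfamiliar with the baseline that this record modifies, the standard Greville abscissae can be computed directly from a knot vector: each point is the average of `degree` consecutive knots. The sketch below shows only that standard choice (the IGA-C points), not the superconvergent points of IGA-SC, whose locations are not given in the abstract. The function name `greville_abscissae` is hypothetical and the knot vector is a made-up example.

```python
def greville_abscissae(knots, degree):
    """Return one Greville point per B-spline basis function.

    Point i is the mean of knots[i+1 .. i+degree]; requires degree >= 1.
    """
    n_basis = len(knots) - degree - 1
    return [sum(knots[i + 1 : i + degree + 1]) / degree for i in range(n_basis)]


# Open (clamped) quadratic knot vector on [0, 1] with one interior knot.
points = greville_abscissae([0, 0, 0, 0.5, 1, 1, 1], degree=2)
print(points)  # [0.0, 0.25, 0.75, 1.0]
```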
• A comprehensive evaluation of global gridded transpiration products based on SAPFLUXNET for the first time and collocation analysis validation as a promising proxy for ungauged regions.
• GLEAM and the weighted average based on collocation analysis perform well across diverse vegetation types but show precision limitations at lower and higher value ranges.
• Accurately capturing how vegetation responds to changes in VPD, root-zone soil moisture, and radiation is crucial for precise transpiration estimates in these products.
Ecosystem transpiration estimation presents significant uncertainties, prompting the need for direct assessment through sap flow measurements to validate T products and their associated modeling mechanisms. Additionally, the scarcity of global site data urges us to seek reliable error assessment and integration methods that do not rely on observations. This pioneering study conducts a comprehensive global-scale analysis, evaluating uncertainties in four prominent products using SAPFLUXNET data and delving into multi-source data assessment and fusion using collocation analysis. Results highlight GLEAM’s superior performance across diverse vegetation types, closely trailed by weighted average and PMLv2, with ERA5L displaying the poorest performance. While discrepancies exist among product performances, their estimations across low, median, and high percentiles demonstrate moderate differences. Further sensitivity analysis underscores the critical role of accurately representing vegetation responses to VPD, root-zone soil moisture, and radiation for improved transpiration estimates. Additionally, collocation analysis emerges as a reliable tool for error analysis, with collocation-based fusion results effectively reducing input errors and demonstrating the best estimation of multi-year mean and trend. This study strongly advocates for heightened precision in these products and continual assessments to refine ecological models.
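The abstract does not state how its collocation-based weighted average is constructed. One common choice, sketched here purely as an assumption, is inverse-variance weighting: given an error variance for each product (for example, one estimated by collocation analysis) and assuming independent, zero-mean errors, each product is weighted by the reciprocal of its error variance. The function names and all numbers below are hypothetical.

```python
def inverse_variance_weights(error_variances):
    """Weights proportional to 1/variance, normalized to sum to 1.

    Assumption: product errors are independent and zero-mean.
    """
    inverse = [1.0 / v for v in error_variances]
    total = sum(inverse)
    return [w / total for w in inverse]


def fuse(estimates, weights):
    """Weighted average of simultaneous estimates from several products."""
    return sum(w * x for w, x in zip(weights, estimates))


# Made-up error variances for three products and one made-up time step.
weights = inverse_variance_weights([1.0, 2.0, 4.0])
merged = fuse([2.0, 2.6, 1.5], weights)
```

Under these assumptions the merged estimate has error variance `1 / sum(1 / v)`, which is never larger than the smallest input variance.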
This paper addresses the fact that, despite its far-reaching importance for language proficiency, collocation competence is one of the most neglected areas of study in L2 pedagogy. The last few decades have witnessed a growing interest in vocabulary items consisting of more than a single word in the field of language pedagogy. The study adopted a descriptive survey design, with the senior secondary school three (SS III) students of Nigerian Turkish International College (NTIC), Abuja, as its population. Sixty SS III students (30 males and 30 females) were randomly selected and responded to a 15-item researcher-designed collocation questionnaire (RDCQ). After the instrument had been validated, its reliability was determined by the test-retest method, which produced a reliability index of 0.61 at the 0.05 alpha level of significance. The findings indicate that the majority of the respondents, ranging from 91.67% to 100%, responded correctly to the RDCQ. It was therefore concluded that the respondents are, to a great extent, good at vocabulary, which has assisted them in collocation study and is also a good pointer to better performance in their senior school certificate examination (SSCE). It was recommended, among other things, that students should be asked to read different types of textbooks, newspapers and periodicals.
Global-scale surface soil moisture products are currently available from multiple remote sensing platforms. Footprint-scale assessments of these products are generally restricted to a limited number of densely-instrumented validation sites. However, by taking active and passive soil moisture products together with a third independent soil moisture estimate from land surface modeling, triple collocation (TC) can be applied to estimate the correlation metric of satellite soil moisture products (versus an unknown ground truth) over a quasi-global domain. Here, an assessment of Soil Moisture Active Passive (SMAP), Soil Moisture Ocean Salinity (SMOS) and Advanced SCATterometer (ASCAT) surface soil moisture retrievals via TC is presented. Considering the potential violation of TC error assumptions, the impact of active-passive and satellite-model error cross correlations on the TC-derived inter-comparison results is examined at in situ sites using quadruple collocation analysis. In addition, confidence intervals for the TC-estimated correlation metric are constructed from moving-block bootstrap sampling designed to preserve the temporal persistence of the original (unevenly-sampled) soil moisture time-series. This study is the first to apply TC to obtain a robust global-scale cross-assessment of SMAP, SMOS and ASCAT soil moisture retrieval accuracy in terms of anomaly temporal correlation. Our results confirm the overall advantage of SMAP (with a global average anomaly correlation of 0.76) over SMOS (0.66) and ASCAT (0.63) that has been established in several recent regional, ground-based studies. SMAP is also the best-performing product over the majority of applicable land pixels (52%), although SMOS and ASCAT each show an advantage in distinct geographic regions.
• SMAP, SMOS, ASCAT soil moisture compared globally using triple collocation.
• Active-passive error cross-correlation has small impact on comparison results.
• SMAP performs best in more than half of the pixels where retrievals are available.
• SMAP, SMOS and ASCAT demonstrate superiority in different global land regions.
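The idea of estimating a product's correlation with an unknown truth from three collocated products can be sketched as follows. This uses the covariance-ratio form often called extended triple collocation, which is an assumption here: the record does not spell out its estimator. It also assumes that each product is linearly related to the truth and that the three error series are mutually uncorrelated and uncorrelated with the truth. The function names and the synthetic series are hypothetical; the series are constructed so that the errors are exactly orthogonal.

```python
from math import sqrt


def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def tc_correlations(x, y, z):
    """Estimate each product's correlation with the unobserved truth.

    Assumption: errors of x, y and z are mutually uncorrelated.
    """
    qxx, qyy, qzz = covariance(x, x), covariance(y, y), covariance(z, z)
    qxy, qxz, qyz = covariance(x, y), covariance(x, z), covariance(y, z)
    rx = sqrt(qxy * qxz / (qxx * qyz))
    ry = sqrt(qxy * qyz / (qyy * qxz))
    rz = sqrt(qxz * qyz / (qzz * qxy))
    return rx, ry, rz


# Synthetic truth and three mutually orthogonal, zero-mean error patterns.
truth = [1, 1, 1, 1, -1, -1, -1, -1]
e1 = [1, 1, -1, -1, 1, 1, -1, -1]
e2 = [1, -1, 1, -1, 1, -1, 1, -1]
e3 = [1, -1, -1, 1, 1, -1, -1, 1]
x = [t + 0.5 * e for t, e in zip(truth, e1)]  # low noise
y = [2 * t + e for t, e in zip(truth, e2)]    # different gain, same signal-to-noise
z = [t + e for t, e in zip(truth, e3)]        # high noise
rx, ry, rz = tc_correlations(x, y, z)
```

With real data, sampling noise can make a covariance ratio negative, in which case the square root is undefined; a practical implementation would need to guard against that and, as the abstract notes, quantify uncertainty, for instance by bootstrap resampling.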
• We compare isogeometric collocation (IGA-C) with isogeometric Galerkin and FEA.
• Of particular interest are quadrature cost and accuracy vs. computing time.
• IGA-C has the potential to offer a more efficient alternative to existing technology.
• Motivated by the two-scale relation of B-splines we introduce the concept of weighted collocation.
• Its combination with hierarchical refinement of NURBS leads to efficient and robust adaptive IGA-C.
We compare isogeometric collocation with isogeometric Galerkin and standard C0 finite element methods with respect to the cost of forming the matrix and residual vector, the cost of direct and iterative solvers, the accuracy versus degrees of freedom and the accuracy versus computing time. On this basis, we show that isogeometric collocation has the potential to increase the computational efficiency of isogeometric analysis and to outperform both isogeometric Galerkin and standard C0 finite element methods, when a specified level of accuracy is to be achieved with minimum computational cost. We then explore an adaptive isogeometric collocation method that is based on local hierarchical refinement of NURBS basis functions and collocation points derived from the corresponding multi-level Greville abscissae. We introduce the concept of weighted collocation that can be consistently developed from the weighted residual form and the two-scale relation of B-splines. Using weighted collocation in the transition regions between hierarchical levels, we are able to reliably handle coincident collocation points that naturally occur for multi-level Greville abscissae. The resulting method combines the favorable properties of isogeometric collocation and hierarchical refinement in terms of computational efficiency, local adaptivity, robustness and straightforward implementation, which we illustrate by numerical examples in one, two and three dimensions.
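The two-scale relation of B-splines mentioned above states that a uniform B-spline of degree p equals a binomially weighted sum of copies of itself on a grid twice as fine. The sketch below checks this numerically for cardinal (uniform, integer-knot) B-splines. It illustrates only the two-scale relation, not the weighted collocation scheme that the paper builds on it; the function names are hypothetical.

```python
from math import comb


def cardinal_bspline(p, x):
    """Cardinal B-spline of degree p with support [0, p + 1]."""
    if p == 0:
        return 1.0 if 0.0 <= x < 1.0 else 0.0
    left = x * cardinal_bspline(p - 1, x)
    right = (p + 1 - x) * cardinal_bspline(p - 1, x - 1)
    return (left + right) / p


def two_scale(p, x):
    """Same function rebuilt from p + 2 half-width copies of itself."""
    total = sum(comb(p + 1, j) * cardinal_bspline(p, 2 * x - j) for j in range(p + 2))
    return total / 2 ** p


# Largest mismatch between the two evaluations over a few sample points.
max_gap = max(
    abs(cardinal_bspline(p, x) - two_scale(p, x))
    for p in (1, 2, 3)
    for x in (0.1, 0.5, 1.0, 1.7, 2.5)
)
```

The coefficients `comb(p + 1, j) / 2 ** p` are the weights with which fine-level basis functions combine into a coarse-level one, which is what links neighbouring hierarchical levels.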
The meshfree direct collocation method (DCM), based on the strong form, suffers from low accuracy and instability compared with Galerkin-based meshfree methods. The meshfree weighted least squares collocation method (WLSCM) can improve accuracy and stability by minimizing the least squares functional corresponding to the governing equation and boundary conditions. However, using more discrete points in the least squares solution increases the computational cost, which notably reduces efficiency. In this paper, we propose a new meshfree stabilized collocation method (SCM) and introduce the reproducing kernel function as the approximation. Auxiliary collocation points located in the local subdomains are employed for the stabilization. Their proper positions and corresponding weighting factors for integration using constant and cubic B-spline weighting functions are derived to satisfy the consistency conditions. Therefore, in this method, the approximation can satisfy the consistency conditions not only on the points, but also in the subdomains. The suggested method has accuracy and stability comparable to WLSCM, and its computational efficiency can match that of DCM. Numerical simulations demonstrate that SCM can surpass DCM in accuracy and stability, and outperform WLSCM in efficiency and in the conditioning of the matrix of the discrete equations, where the condition number of the matrix is related to the stability.
• A new meshfree stabilized collocation method (SCM) is proposed.
• Auxiliary collocation points located in the subdomains are used for stabilization.
• Proper positions and corresponding weighting factors for integration are derived.
• The approximation meets the consistency conditions on points and in subdomains.
• SCM can surpass DCM in accuracy and stability, and outperform WLSCM in efficiency.
• Localized Trefftz collocation scheme is proposed for heat conduction analysis.
• Generalized reciprocity method is proposed to deal with the nonhomogeneous terms.
• The proposed method inherits semi-analytical property and avoids dense matrix.
• A large-scale simulation with almost 100,000 discretization nodes is performed.
This paper presents a novel localized collocation Trefftz method (LCTM) for heat conduction analysis in two kinds of heterogeneous materials (functionally graded materials and multi-medium materials) under temperature loading. In contrast to the conventional collocation Trefftz method (CTM), the proposed LCTM divides the whole domain into many stencil support domains consisting of several discretization nodes. An efficient technique, the generalized reciprocity method (GRM), is proposed to derive the problem-dependent T-complete functions for approximating the particular solution of the nonhomogeneous equations in the stencil support domains. Based on the moving least square (MLS) technique and T-complete functions, the LCTM numerical differentiation formulation at a certain node can be derived by using a linear combination of the T-complete functions at its adjacent discretization nodes in the related stencil support domain. It inherits the semi-analytical property from the conventional CTM and avoids the ill-conditioned dense matrix problem. Besides, the quadrant criterion is applied to guarantee the stability of the proposed LCTM, and the domain decomposition method (DDM) is introduced to divide the multi-medium domains into several single-medium sub-domains. Numerical results demonstrate the accuracy and efficiency of the proposed LCTM in comparison with the known analytical solutions and the finite element method (FEM) results.