This article introduces structural equation modeling, in particular measured variable path models, and discusses their great potential for corpus linguists. Compared to other techniques commonly employed in the field, such as multiple regression, path models are highly flexible and enable testing hypotheses about causal relations between multiple independent and dependent variables. In addition to increased methodological versatility, this technique encourages big-picture, model-based reasoning, thus allowing corpus linguists to move away from the at times overly simplified mindset brought about by the narrower null-hypothesis significance testing paradigm. The article also includes commentary on corpus linguistics and its trajectory, arguing in favor of increased cumulative knowledge building.
In the first volume of Corpus Linguistics and Linguistic Theory, Gries (2005: 285; Null-hypothesis significance testing of word frequencies: A follow-up on Kilgarriff. 1(2). doi:10.1515/cllt.2005.1.2.277) asked whether corpus linguists should abandon null-hypothesis significance testing. In this paper, I want to revive this discussion by defending the argument that the assumptions that allow inferences about a given population (here, the languages studied) based on results observed in a sample (here, a collection of naturally occurring language data) are not fulfilled. As a consequence, corpus linguists should indeed abandon null-hypothesis significance testing.
Keyword extraction involves the application of Natural Language Processing (NLP) algorithms or models developed in the realm of text mining. Keyword extraction is a common technique used to explore linguistic patterns in the corpus linguistic field, and Dunning’s Log-Likelihood Test (LLT) has long been integrated into corpus software as a statistic-based NLP model. While prior research has confirmed the widespread applicability of keyword extraction in corpus-based research, LLT has certain limitations that may impact the accuracy of keyword extraction in such research. This paper summarized the limitations of LLT, which include benchmark corpus interference, elimination of grammatical and generic words, consideration of sub-corpus relevance, flexibility in feature selection, and adaptability to different research goals. To address these limitations, this paper proposed an extended Term Frequency-Inverse Document Frequency (TF-IDF) method. To verify the applicability of the proposed method, 20 highly cited research articles on climate change from the Web of Science (WOS) database were used as the target corpus, and a comparison was conducted with the traditional method. The experimental results indicated that the proposed method could effectively overcome the limitations of the traditional method and demonstrated the feasibility and practicality of incorporating the TF-IDF algorithm into relevant corpus-based research.
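The contrast between the two scoring approaches named in this abstract can be illustrated with a minimal sketch. It assumes the two-term log-likelihood form commonly used for keyness in corpus tools and a plain, textbook TF-IDF; it is not the extended TF-IDF method the paper proposes, whose details are not given here. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Two-term log-likelihood keyness score (assumed form, common in corpus tools).

    freq_* are the word's counts in the target and reference corpora,
    size_* are the total token counts of those corpora.
    """
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    score = 0.0
    # A zero observed count contributes nothing (limit of x * ln(x) as x -> 0).
    if freq_target > 0:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * score


def tf_idf(documents):
    """Plain TF-IDF per document; `documents` is a list of token lists.

    Needs no reference corpus: words are weighted against the other
    documents of the same collection.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    docs = [["climate", "change", "policy"], ["climate", "model", "data"]]
    print(log_likelihood(20, 5, 1000, 1000))
    print(tf_idf(docs))
```

In this sketch a word that occurs in every document receives a TF-IDF weight of zero, which is how generic words drop out without a benchmark corpus.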
This study focuses on connectives, linguistic devices that signal relations between ideas. Connective knowledge varies widely across grades and is known to support reading comprehension and school writing. Despite the important role that connective knowledge plays in reading and writing, few resources exist to support teachers in understanding which connectives to prioritize in Spanish literacy instruction. In this study we combined a secondary analysis of midadolescents’ connective assessment data (810 students in grades 4–8) with a corpus linguistics frequency analysis of connectives in Chilean national science and social studies textbooks (1,079,593 total words), with the goal of generating a pedagogically relevant tool to support the instruction of connectives in Spanish literacy across content areas. To generate this tool, we first examined the role of grade and connective type in midadolescents’ connective knowledge. Second, we examined associations between students’ knowledge and textual frequencies across content areas for each connective included in the assessment. Finally, informed by midadolescents’ connective knowledge and by the corpus linguistics analysis of science and social studies textbooks from grades 1–8, we generated the tool, which lists connective frequencies by grade, connective type, and content area.
In this paper we document the developmental trajectory of the complementizer system (CP-system) in Italian by looking at the earliest spontaneous production of eleven young children, whose transcriptions are available on CHILDES. We conducted a novel corpus analysis, tracking down a number of constructions in which the clausal left-periphery is activated. First, we considered the appearance of the different complementizer particles in the CP-system, which overtly realize the three distinct functional projections ForceP, IntP, and FinP. The analysis revealed that children acquiring Italian correctly use these complementizer particles already in the third year of life. Second, we looked for the simultaneous activation of different functional projections within the CP-system. We went through our corpus searching for complex sentences in which more than one constituent was moved to the left periphery. This option is allowed by the adult grammar of Italian and, as our search revealed, it is also attested in the grammar of young children. Soon after their second birthday, sequences in which a left-dislocated Topic and a Wh- element co-occur are attested, directly supporting the existence of a (high) Topic position above FocusP. Moreover, movement in general conforms to the constraints of the adult grammar, with no attested violation of obligatory inversion (a consequence of the Q-Criterion). Importantly, "why-questions" did not require inversion, much as in the adult grammar of Italian. Taken together, children's use of complementizer particles and their activation of multiple landing sites for movement show that 2-year-olds already possess a richly articulated functional structure of the CP-system, aligned to the layered adult structure. In concluding the paper, we also discuss some temporal differences between constructions activating high and low portions of the CP-system. In particular, we detect a temporal precedence for wh-questions over why-questions. Since the former activate a lower projection, this is consistent with a recently proposed hypothesis according to which the development of the CP-system proceeds stepwise.
In this paper, we present corpus data that questions the concept of native speaker homogeneity as it is presumed in many studies using native speakers (L1) as a control group for learner data (L2), especially in corpus contexts. Usage-based research on second and foreign language acquisition often investigates quantitative differences between learners, and usually a group of native speakers serves as a control group, but often without elaborating on differences within this group to the same extent. We examine inter-personal differences using data from two well-controlled German native speaker corpora collected as control groups in the context of second and foreign language research. Our results suggest that certain linguistic aspects vary to an extent in the native speaker data that undermines general statements about quantitative expectations in L1. However, we also find differences between phenomena: while morphological and syntactic sub-classes of verbs and nouns show great variability in their distribution in native speaker writing, other, coarser categories, like parts of speech, or types of syntactic dependencies, behave more predictably and homogeneously. Our results highlight the necessity of accounting for inter-individual variance in native speakers where L1 is used as a target ideal for L2. They also raise theoretical questions concerning a) explanations for the divergence between phenomena, b) the role of frequency distributions of morphosyntactic phenomena in usage-based linguistic frameworks, and c) the notion of the individual adult native speaker as a general representative of the target language in language acquisition studies or language in general.
This paper presents word clusters used to comment on results in the Discussion section of quantitative research articles in the field of applied linguistics. A corpus linguistic approach was adopted to identify clusters in 124 Discussion texts from leading applied linguistics journals. The identified clusters were then comprehensively analysed in context for their discourse functions. Next, the present study mapped the clusters onto an analytical framework termed the ‘four-Step model’, based on Yang and Allison's (2003) genre-based description of the Commenting on results Move. The study provided a detailed corpus linguistic account of how the clusters were used in specific Steps described in the model. A detailed description of the linguistic features, the internal structure (Move/Step cycles and embedding) and communicative functions of specific Steps in the Commenting on results Move was also presented based on the concordance analysis of the clusters. The findings further suggest that the use of specific clusters strongly manifests, and is conditioned by, the research article genre. The study has pedagogical implications for academic writing courses for students, especially for those from non-English language backgrounds.
• Word clusters in Comment on results Move in discussions examined.
• Keywords established, and associated clusters then identified.
• Clusters analysed for discourse functions and mapped onto Move framework.
• Concordance analyses of clusters showed linguistic features, cycles/embedding.
• Points to relationship between clusters and genre.
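The cluster identification step described in this abstract can be sketched as a simple n-gram count. This is a minimal illustration that assumes clusters are contiguous word sequences of a fixed length; the function name, the cluster length, and the sample sentence are hypothetical and are not taken from the study.

```python
from collections import Counter


def word_clusters(tokens, n=3):
    """Count contiguous n-word clusters (n-grams) in a list of tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    # Hypothetical sample text, for illustration only.
    text = "the results of the study suggest that the results of the test"
    clusters = word_clusters(text.split(), n=3)
    for cluster, count in clusters.most_common(3):
        print(" ".join(cluster), count)
```

Frequent clusters found this way would still need to be read in their concordance lines before any discourse function is assigned to them.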
Government and market are the two main factors that drive the practices of the Chinese media system and influence the news construction process. A dramatic, socially disruptive event like the 2014 Kunming terrorist attack has the potential both to damage the government image and to attract readers. Analyzing how different types of media, more specifically the state-sponsored and the market-oriented press, construct a terrorist attack may therefore reveal essential characteristics of the Chinese media system and its relationship with both government and market. In doing so, the present study makes a contribution in terms of methodology, resources, and empirical description. From a methodological perspective, drawing on a dataset of 275 news articles about the Kunming attack that was collected from 16 mainstream Chinese newspapers, we explore the possibilities of combining computer-assisted techniques (i.e. part-of-speech tagging, sentiment analysis, collocation, and concordance) and Discursive News Values Analysis (DNVA), based on which we identified 699 Chinese lexical indicators distributed across ten news values. The open-source wordlist produced by this procedure not only facilitates future quantitative DNVA but also fills a resource gap in non-English news values studies. On a more empirical level, after calculating the mean normalized frequency of indicators under each news value, we found that the state-sponsored and the market-oriented press converge in foregrounding the news values of Eliteness and Personalization, in line with public expectations, while at the same time diverging in their use of the news values of Positivity, Negativity, and Superlativeness, which we can relate to the different aims and responsibilities of these two types of newspapers.
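The mean normalized frequency mentioned in this abstract can be sketched as follows. The sketch assumes that each article's indicator count is normalized per 1,000 tokens and then averaged over the articles of one press type; the normalization base, the data layout, and the figures are hypothetical and are not taken from the study.

```python
from statistics import mean


def normalized_frequency(count, n_tokens, per=1000):
    """Indicator occurrences per `per` tokens (assumed base: 1,000)."""
    return count * per / n_tokens


def mean_normalized_frequency(articles, news_value, per=1000):
    """Average the normalized frequency of one news value over a group of articles.

    Each article is a dict with a token total and per-news-value indicator counts.
    """
    return mean(
        normalized_frequency(article["counts"].get(news_value, 0),
                             article["tokens"], per)
        for article in articles
    )


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    state_press = [
        {"tokens": 500, "counts": {"Negativity": 5, "Eliteness": 12}},
        {"tokens": 1000, "counts": {"Negativity": 20, "Eliteness": 30}},
    ]
    print(mean_normalized_frequency(state_press, "Negativity"))
```

Normalizing before averaging keeps long articles from dominating the comparison between the two press types.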
Metadiscourse refers to the linguistic elements that writers use to communicate meanings to imagined readers and to express a viewpoint as members of a particular academic community. Accordingly, this study reported the distributions of interactive and interactional metadiscourse markers in a corpus of 99 research articles representing the English Language, Computer Sciences, and Education disciplines. To observe the writers’ use of metadiscourse devices in their discourse community, Hyland’s (Metadiscourse: exploring interaction in writing. Continuum, New York, 2005) metadiscourse taxonomy was employed. The data were analysed through descriptive statistics, the chi-square test, the Kruskal–Wallis test, and content analysis. The data revealed that although articles in all disciplines employed both interactive and interactional metadiscourse markers, English Language articles contained the most metadiscourse devices compared with Education and Computer Sciences articles. It was also observed that the book review writers used many more interactive markers, such as transition and evidential devices, than interactional markers. However, among interactional markers, self-mention markers were extensively used. The data also indicated that there was a statistically significant difference across disciplines in the use of interactive and interactional metadiscourse devices. These findings implied that academic writing teachers should focus on discipline-oriented metadiscourse devices while teaching academic writing skills.
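The chi-square comparison across disciplines named in this abstract can be illustrated with a minimal sketch of the Pearson chi-square statistic for a contingency table. The layout (disciplines as rows, marker types as columns) and all counts are hypothetical; the sketch returns only the statistic and degrees of freedom and does not compute a p-value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed counts, e.g. disciplines by marker type.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


if __name__ == "__main__":
    # Hypothetical counts: rows = three disciplines,
    # columns = interactive vs. interactional markers.
    observed = [[120, 80], [90, 60], [70, 75]]
    print(chi_square_statistic(observed))
```

The statistic would then be compared against the chi-square distribution with the returned degrees of freedom to judge significance.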