Abstract
This article presents Verbel: The Electronic Dictionary of Paradigms of Polish Verbal Multiword Expressions (MWEs) and discusses theoretical problems connected with compiling such a dictionary for inflectionally complex languages such as Polish. The dictionary includes over 5,000 Polish verbal MWEs and explicitly presents their forms and constraints in inflection. It also provides grammatical, semantic, pragmatic and prescriptive commentaries. The first part of the article covers the theoretical and methodological basis used in the compilation of the dictionary. Generally, a verbal MWE is inflected according to the paradigm of the verb which is its main component. However, MWEs may have some specific inflectional constraints connected with other factors (e.g. semantic, pragmatic), which result in different paradigms for verbal MWEs and for the verbs that are their main components. In the second part, the conception and content of the dictionary are discussed. Finally, the natural language processing tools that underlie the work on the dictionary are described.
Concreteness describes the degree to which a word’s meaning is understood through perception and action. Many studies use the Brysbaert et al. (2014) concreteness ratings to investigate language processing and text analysis. However, these ratings are limited to English single words and a few two-word expressions. Increasingly, attention is focused on the importance of multiword expressions, given their centrality in everyday language use and language acquisition. We present concreteness ratings for 62,889 multiword expressions and examine their relationship to the existing concreteness ratings for single words and two-word expressions. These new ratings represent the first big dataset of multiword expressions, and will be useful for researchers interested in language acquisition and language processing, as well as natural language processing and text analysis.
A growing number of studies have probed the effectiveness of certain exercise formats in the learning of multi-word expressions (MWEs) in classroom settings. However, a number of important variables, such as MWE retention over an extended period of time and the role of repetition, have so far not been considered. Furthermore, studies have focused primarily on university level learners, with young L2 learners being almost entirely disregarded. The present study sought to address these gaps with 148 high school students who were randomly assigned one of three fill-in-the-gap exercises: (1) word-format, where participants selected the appropriate verb from a list provided; (2) letter-format, where the first letter of the missing verb was provided as a clue; and (3) phrase-format, where participants chose an appropriate intact phrase from a list. Participants did the exercise once, twice or three times. The study investigated the effects of exercise format and repetition on the learning of 20 verb–noun collocations, one and eight weeks following the treatment. Results from generalized linear mixed-effects modeling showed that both exercise format and repetition had a significant impact on the learnability of the target MWEs, but the format had a smaller effect size than repetition. Rasch analysis was also used to examine the potential difference in the difficulty of MWEs, and how this difficulty may interact with the exercise format. The findings largely support previous research, but also underline the importance of repetition and suggest that exercise format does not uniformly interact with the learnability of MWEs.
Abstract
Despite a long history of research into phraseology and its practical applications, the representation of multiword expressions (henceforth MWEs) in dictionaries still remains unsettled. The paper singles out and investigates the lexicographic treatment of one thematically related group of MWEs, those that contain nominal forms of body part names. Body part MWEs have been chosen for the present study because they constitute a fairly homogeneous group, as many of them are related by means of metonymic motivation, and are therefore expected to be treated in a fairly consistent way. The study examines and evaluates the representation of body part MWEs in online monolingual English learners’ dictionaries from the perspective of cognitive linguistics. It focuses primarily on the arrangement of semantically related MWEs in the dictionary microstructure, and attention is also paid to navigation devices used to nest metonymically motivated body part MWEs. The paper proposes solutions that could be adopted in lexicographic practice so that semantic links between related body part MWEs become more apparent in the entry structure.
This paper presents the results of annotating a learner translation corpus consisting of German source texts and student translations into Basque. The analysis was carried out with the purpose of identifying trainee translators’ strengths and weaknesses when translating multiword expressions, such as compounds, collocations, and idioms. The data comprised eight German source texts and sixty-eight Basque translations from undergraduate students enrolled at the University of the Basque Country. From the total number of annotations (1214), which include not only errors but also cases of interference from the source language and positive outcomes, around 27% are related to multiword expressions. The results of the translation analysis show that there are variables (such as the use of machine translation systems, the level of specialisation of the source text, the type of multiword expression to be translated or the absence of a literal counterpart in the target language) that may affect the translation of such units and lead to erroneous solutions and/or interference in the outputs produced by the trainee translators. From a pedagogical point of view, these findings will have a direct impact on the translation classes and will be very valuable for designing corpus-based in-class activities.
Corpus-based models of lexical strength have called into question the role of word frequency as an organizing principle of the lexicon, revealing that contextual and semantic diversity measures provide a closer fit to lexical behavior data (Adelman et al., 2006; Jones et al., 2012). Contextual diversity measures modify word frequency by ignoring word repetition in context, while semantic diversity measures consider the semantic consistency of contextual word occurrence. Recent research has shown that a better account of lexical organization data is provided by socially based measures of semantic diversity, which encode the communication patterns of individuals across discourses (Johns, 2021b). While most research on contextual diversity has focused on single words, recent corpus-based and experimental evidence suggests that an integral part of language use involves recurrent and more structurally complex units, such as multiword phrases and idioms. The aim of the present work was to determine if contextual and semantic diversity drive lexical organization at the level of multiword units (here, operationalized as idiomatic expressions), in addition to single words. To this end, we analyzed normative ratings of familiarity for 210 English idioms (Libben & Titone, 2008) using a set of contextual, semantic, and socially based diversity measures that were computed from a 55-billion word corpus of Reddit comments. The results confirm the superiority of diversity measures over frequency for multiword expressions, suggesting that multiword units, such as idiomatic phrases, show lexical organization dynamics similar to those of single words.
Corpus-based models of lexical strength call into question the role of word frequency as an organizing principle of the lexicon. What emerges is that contextual and semantic diversity measures correspond more closely to lexical behavior data (Adelman et al., 2006; Jones et al., 2012). Contextual diversity measures modify word frequency by ignoring the repetition of words within a given context, whereas semantic diversity measures take into account the semantic consistency of word occurrence across contexts. Recent research has shown that socially based measures of semantic diversity, which encode the communication patterns of individuals across discourses, provide a better account of lexical organization data (Johns, 2021b). Although most research on contextual diversity has concentrated on single words, recent corpus-based and experimental evidence suggests that language use also relies in part on recurrent and more structurally complex units, such as phrases and idiomatic expressions. This work aimed to determine whether contextual and semantic diversity drive lexical organization at the level of multiword units (here referred to as idiomatic expressions) as well as single words. To this end, we analyzed normative familiarity ratings for 210 English idioms (Libben & Titone, 2008) using a set of contextual, semantic, and social diversity measures computed from a corpus of Reddit comments comprising some 55 billion words.
The results confirm the superiority of diversity measures over frequency for multiword expressions, suggesting that multiword units, such as idiomatic expressions, show lexical organization dynamics similar to those of single words.
Public Significance Statement
Corpus-based evidence indicates that the ease with which we access single words in the lexicon depends on their contextual and social diversity, rather than their frequency. However, an integral part of our language environment consists also of conventional multiword units like idioms. We demonstrate that contextual and social diversity shape the lexicon also at the level of multiword units and that idioms thus exhibit the same lexical organizational dynamics as single words.
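The distinction drawn above between word frequency and contextual diversity can be illustrated with a minimal sketch. It assumes that a "context" is simply a list of tokens (for example, one comment) and that contextual diversity is the number of contexts in which an item appears at least once. The function name and the toy data are hypothetical and are not taken from the study, whose semantic and socially based measures are more elaborate than this.

```python
from collections import Counter

def frequency_and_contextual_diversity(contexts):
    """Return (word_frequency, contextual_diversity) for a list of contexts.

    Each context is a list of tokens. This is an illustrative assumption
    about the data layout, not the procedure used in the study.
    """
    word_frequency = Counter()
    contextual_diversity = Counter()
    for tokens in contexts:
        # Word frequency counts every occurrence, including repetitions.
        word_frequency.update(tokens)
        # Contextual diversity counts each item at most once per context.
        contextual_diversity.update(set(tokens))
    return word_frequency, contextual_diversity

# Toy contexts (hypothetical data).
contexts = [
    ["kick", "the", "bucket", "the", "bucket"],
    ["the", "bucket", "list"],
    ["spill", "the", "beans"],
]
wf, cd = frequency_and_contextual_diversity(contexts)
print(wf["bucket"], cd["bucket"])
```

In the toy data, "bucket" occurs three times but in only two contexts, so its frequency and its contextual diversity diverge. The same counting could be applied to whole idioms rather than single tokens by treating each idiom as one item.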
Multiword expressions are combinations of words that exhibit peculiar semantic properties, such as different degrees of non-compositionality, decomposability, transparency and figuration. Long-standing linguistic debates suggest that such semantic idiosyncrasy can condition the morpho-syntactic configurations in which a given multiword expression can occur. Here, we extend this argumentation to a particular semantic and pragmatic phenomenon: nominal coreference. We hypothesise that the internal components of a multiword expression are unlikely to occur in coreference chains. While previous work has identified the rareness of coreference-related phenomena in the presence of multiword expressions, this observation has never been quantified, to the best of our knowledge. We bridge this gap by performing an automated corpus-based study of the intersections between verbal multiword expressions and nominal coreference in French. The results largely corroborate our hypothesis but also display various tendencies depending on the type of multiword expression and the corpus genre. The analysis of the corpus examples highlights interesting properties of coreference, notably in speech.
The paper consists of two main parts: (a) In the first part, a typology of multiword expressions (MWE) in Czech is described in detail. This typology is part of the description of MWE database entries in the lexical database LEMUR containing more than 10,500 MWE entries as of June 2020. MWE properties reflected in this typology are accounted for by categories and their values. Each MWE is identified by a unique lemma; a group of related MWEs is assigned a “superlemma”. An MWE is described by the following properties: an MWE definition, characteristic examples, lemmas and morphological features of MWE components (words), as well as the following key categories: MWE style/register, type of usage, syntactic structure (including its representation by a dependency and a phrase-structure tree), aspects of flexibility (variants and fragments, internal modifiability of individual MWE components, possibilities of syntactic transformations of the main MWE components and morphological constraints) and types of idiomaticity on the lexical, morphological, syntactic, semantic and pragmatic level. (b) In the second part of the paper, the authors focus on the frequency of the main features of the adopted typology in the real language material represented by the genre-balanced SYN2015 corpus, containing 100 million word forms (excluding punctuation): a type of usage correlated with a syntactic type and frequency of various kinds of idiomaticity. Our paper seems to be the first attempt at approaching the MWE properties from the point of view of MWE frequencies as types rather than tokens (i.e. frequencies of occurrences of a given MWE).
Languages have formulaic multiword sequences (MWSs) which occur repeatedly in speech and writing (e.g., Nattinger & DeCarrico, 1992; Siyanova-Chanturia & Pellicer-Sánchez, 2018). For learners, then, the production of MWSs is an important element in developing spoken language that is complex, accurate, and fluent. Though the use of MWSs is important for achieving spoken proficiency, it is unclear whether the production of MWSs supports or hinders another aspect of proficiency, lexical variety. This paper is an exploration of the production of MWSs (recurrent trigrams) and the development of lexical variety, found in 2-min speeches (n = 294) from English L2 learners (n = 66) over time in an intensive English program (IEP). Using hierarchical linear modeling and correlation analysis, we found different patterns of development for the two measures. The use of MWSs increased and then decreased while the lexical variety scores slightly decreased and then sharply increased over time in the IEP. Although the impact of MWSs on oral fluency has been studied, this seems to be the first study to consider how MWSs influence lexical variety across development.
• Use of multi-word sequences increased and then decreased over time.
• Lexical variety scores slightly decreased and then sharply increased over time.
• A negative relationship was found generally, but not for all learners.
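The two measures discussed above, recurrent trigrams and lexical variety, can be sketched roughly as follows. This is only an illustration under stated assumptions: a trigram is treated as "recurrent" if it appears in at least two different speeches, and lexical variety is approximated by a plain type-token ratio. Neither the threshold nor the ratio is stated in the abstract, and the function names and sample speeches are hypothetical.

```python
from collections import Counter

def trigrams(tokens):
    """Return all consecutive three-word sequences in a token list."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def recurrent_trigrams(speeches, min_speeches=2):
    """Trigrams found in at least `min_speeches` different speeches.

    The threshold is an assumption made for illustration only.
    """
    speech_counts = Counter()
    for tokens in speeches:
        # Count each trigram once per speech.
        speech_counts.update(set(trigrams(tokens)))
    return {t for t, n in speech_counts.items() if n >= min_speeches}

def type_token_ratio(tokens):
    """Distinct words divided by total words (a simple stand-in for lexical variety)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Toy transcripts (hypothetical data).
speeches = [
    "i think that it is important".split(),
    "i think that we should go".split(),
    "we should go home now".split(),
]
recurrent = recurrent_trigrams(speeches)
print(sorted(recurrent))
print([round(type_token_ratio(s), 2) for s in speeches])
```

A speaker who relies heavily on the same trigrams repeats words and so tends to lower a type-token ratio, which is one intuitive reason the two measures might move in opposite directions.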
This article examines the possibilities and limitations that frequency-based studies of the relevance of multi-word expressions offer for applied purposes. To this end, the Ref10 corpus of the Wortschatzwissen.de project was examined exploratively. After a category system for multi-word expressions had been developed, a sample of the corpus was analyzed and the multi-word expressions found in it were assigned to the different categories. Subsequently, the identified multi-word expressions were compared with a phrase list of Hallsteinsdóttir, Šajánková & Quasthoff (2006). Findings suggest that the proportion of collocations is particularly high in all subcorpora and that, in addition, idioms and light verb constructions are predominant. Moreover, a large proportion of the idioms identified in the Ref10 corpus sample do not occur at all, or occur only partially (i.e. in an unlisted variant), in the phraseological optimum of Hallsteinsdóttir, Šajánková & Quasthoff (2006). This raises, above all, the question of how phrase variance should be evaluated in corpus analyses and to what extent corpus linguists should rely only on basic vocabulary from the perspective of applied linguistics.