The Sketch Engine: ten years on
Kilgarriff, Adam; Baisa, Vít; Bušta, Jan ...
Lexicography, 2014/7, Volume: 1, Issue: 1
Journal Article
Peer reviewed
Open access
The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, and tools for users ...to build, upload and install their own corpora. The paper describes the core functions (word sketches, concordancing, thesaurus). It outlines the different kinds of users, and the approach taken to working with many different languages. It then reviews the kinds of corpora available in the Sketch Engine, gives a brief tour of some of the innovations from the last few years, and surveys other corpus tools and websites.
Towards the Description of a Multi-sided Prototype Concept in Multilingual Automatic Language Generation: From Corpus via Word Embeddings to the Automatic Dictionary. The multilingual dictionary of ...noun valency Portlex is considered to be the trigger for the creation of the automatic language generators Xera and Combinatoria, whose development and use are presented in this paper. Both prototypes are used for the automatic generation of nominal phrases with their mono- and bi-argumental valence slots, which could be used, among other things, as dictionary examples or as integrated components of future autonomous e-learning tools. We consider the language generators as we know them today to be samples of new types of automatic valency dictionaries that include user interaction. In the specific methodological procedure for developing the language generators, the syntactic-semantic description of the noun slots turns out to be the main focus from both a syntagmatic and a paradigmatic point of view. Along with factors such as representativeness, grammatical correctness, semantic coherence, frequency and the variety of lexical candidates, as well as semantic classes and argument structures, which are fixed components of both resources, a concept of a multi-sided prototype stands out. The combined application of this prototype concept and of word embeddings, together with techniques from the field of automatic natural language processing and generation (NLP and NLG), opens up a new way for the future development of automatically generated plurilingual valency dictionaries. All things considered, the paper depicts the language generators both from the point of view of their development and from that of the users. The focus lies on the role of the prototype concept within the development of the resources.
In cooperation between the Institute of the Estonian Language and the software company Lexical Computing Ltd., a series of national corpora has been compiled, which now comprises four versions: the Estonian National Corpus 2013, 2017, 2019 and 2021. ...The national corpora are the largest corpora of Estonian, and their potential applications are wide-ranging, from lexicographic research to the creation of language models for machine learning. In this article we focus on the most recent one, the Estonian National Corpus 2021, which consists largely of texts collected from the web. We describe the principles of collecting, post-processing and cleaning web texts and the subcorpora of the national corpus, and give an overview of the classification of the source texts. In addition, using the corpus query system Sketch Engine as an example, we introduce new options for analysing corpus data and outline further prospects and needs for corpus development work.
Abstract
In this paper, we aim to evaluate the 12-million-word Helsinki Corpus of Swahili as a source of dictionary data used, among other things, for the creation of the lemma list for a new ...Swahili-Polish dictionary. We analyse the dictionary log-files in order to answer a question already asked by De Schryver et al. (2006), Koplenig et al. (2014) and Trap-Jensen (2014): do dictionary users actually look up frequent words? However, the issue of utmost importance to us is whether a ten-thousand-item frequency list derived from a 12-million-word corpus meets the needs of a Swahili-Polish dictionary user.
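One way to picture the log-file question is as a coverage calculation: what share of recorded lookups falls on items in the frequency list? The sketch below is only an illustration under that assumption; the function name `lookup_coverage`, the input shapes and the sample Swahili words are hypothetical and are not taken from the paper.

```python
from collections import Counter


def lookup_coverage(lemma_list, lookups):
    """Return (token_coverage, type_coverage) of logged lookups.

    lemma_list: iterable of lemmas in the dictionary's frequency list.
    lookups:    iterable of looked-up words, one entry per log record.
    Both coverages are fractions between 0.0 and 1.0.
    """
    lemmas = set(lemma_list)
    counts = Counter(lookups)
    total = sum(counts.values())
    if total == 0:
        # No log records: nothing to cover.
        return 0.0, 0.0
    # Share of all lookup events that hit a listed lemma.
    covered_tokens = sum(c for word, c in counts.items() if word in lemmas)
    # Share of distinct looked-up words that are listed lemmas.
    covered_types = sum(1 for word in counts if word in lemmas)
    return covered_tokens / total, covered_types / len(counts)


if __name__ == "__main__":
    tokens, types = lookup_coverage(
        ["kitabu", "nyumba"],
        ["kitabu", "kitabu", "simba", "nyumba"],
    )
    print(f"token coverage: {tokens:.2f}, type coverage: {types:.2f}")
```

Separating token coverage from type coverage matters here because a few very frequent lookups can mask many rare words that the list misses.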
The study compares three methods for detecting prominent text units (prominent in relation to the content of the text). The methods are: 1) keyword analysis based on ...comparison of source and reference corpora, 2) thematic concentration and h-point, and 3) the TF*IDF method. We discuss their pros and cons and, using the results of the analyses carried out, propose the optimal method for extracting thematic words from spoken texts, whose frequency structure differs distinctly from that of written texts.
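As a rough illustration of the third method, the sketch below computes TF*IDF with one common weighting: relative term frequency multiplied by the natural logarithm of the inverse document frequency. The abstract does not say which variant the authors used, so this weighting, the function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF*IDF scores for a list of tokenised documents.

    documents: list of token lists.
    Returns one dict per document mapping token -> score, where
    tf = count / document length and idf = ln(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    docs = [["a", "b", "a"], ["a", "c"]]
    for doc_scores in tf_idf(docs):
        print(doc_scores)
```

With this weighting, a token that occurs in every document scores zero however frequent it is, which is the property that lets the method push function words down the ranking.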
The aim of the study was to develop new Estonian GDEX configurations for A-, B- and C-language proficiency levels. GDEX (Good Dictionary Example) (Kilgarriff et al. 2008) is a software module of the ...corpus query system Sketch Engine (Kilgarriff et al. 2004), which helps to identify good dictionary example candidates from large corpora. In order to identify which specific parameters characterise sentences at each proficiency level, full sentences from the Estonian Coursebook Corpus 2018 were analysed using a program called Analyser of Sentence Parameters, developed at the Institute of the Estonian Language. The analyser makes it possible to find out how long the sentences and tokens are, what kind of verb forms are used, what syntactic properties the sentences have, etc. The analysis showed that, compared to the latest Estonian GDEX configuration 1.4, parameters such as sentence and token length and the occurrence of certain verb forms and parts of speech needed to be adjusted. Accordingly, the sentence length was set to 3–14 tokens for A-level (optimal interval 4–7 tokens), 3–18 tokens for B-level (optimal interval 4–12 tokens) and 4–23 tokens for C-level (optimal interval 6–14 tokens). A new classifier was introduced that penalises tokens longer than 9 characters at A-level and tokens longer than 11 characters at B-level. At A- and B-levels, certain verb forms were penalised or banned from appearing in the sentence. etSkELL – a corpus tool for Estonian language learning – and the dictionary portal Sõnaveeb (Wordweb) are introduced as possible ways to apply the output of the new GDEX configurations. The results of this paper can be applied in compiling corpora and teaching materials for different language proficiency levels.
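The length parameters reported above can be sketched as a small scoring function. The interval values and token-length thresholds come from the abstract; everything else is an assumption made for illustration, including the linear fall-off outside the optimal interval, the fixed penalty of 0.1 per long token, and all names (`LevelConfig`, `length_score`, `score_sentence`). This is not the actual GDEX configuration syntax or scoring formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LevelConfig:
    min_len: int                    # shortest allowed sentence (tokens)
    max_len: int                    # longest allowed sentence (tokens)
    opt_min: int                    # start of optimal interval
    opt_max: int                    # end of optimal interval
    max_token_chars: Optional[int]  # longer tokens are penalised; None = no limit


# Values taken from the abstract.
LEVELS = {
    "A": LevelConfig(3, 14, 4, 7, 9),
    "B": LevelConfig(3, 18, 4, 12, 11),
    "C": LevelConfig(4, 23, 6, 14, None),
}

# Hypothetical penalty per overlong token (not from the paper).
LONG_TOKEN_PENALTY = 0.1


def length_score(n_tokens: int, cfg: LevelConfig) -> float:
    """1.0 inside the optimal interval, 0.0 outside the allowed range,
    and an assumed linear fall-off in between."""
    if n_tokens < cfg.min_len or n_tokens > cfg.max_len:
        return 0.0
    if cfg.opt_min <= n_tokens <= cfg.opt_max:
        return 1.0
    if n_tokens < cfg.opt_min:
        return (n_tokens - cfg.min_len + 1) / (cfg.opt_min - cfg.min_len + 1)
    return (cfg.max_len - n_tokens + 1) / (cfg.max_len - cfg.opt_max + 1)


def score_sentence(tokens: list, level: str) -> float:
    """Score a tokenised sentence for one proficiency level."""
    cfg = LEVELS[level]
    base = length_score(len(tokens), cfg)
    if base == 0.0:
        return 0.0
    limit = cfg.max_token_chars
    long_tokens = 0 if limit is None else sum(1 for t in tokens if len(t) > limit)
    return max(0.0, base - LONG_TOKEN_PENALTY * long_tokens)


if __name__ == "__main__":
    sentence = ["Ma", "lähen", "täna", "raamatukogu"]
    for level in ("A", "B", "C"):
        print(level, score_sentence(sentence, level))
```

Keeping the per-level values in a single table makes it easy to see how the same sentence is treated differently at A-, B- and C-level.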
The aim of this contribution is to offer an overview of the areas of greatest innovation in Afrikaans lexicography over the past two decades, with specific reference to standard and ...pedagogical dictionaries, since these are the typological classes in which the most progress can be observed. However, other dictionary types and contexts will also be referred to where relevant to the discussion. First, the current state of affairs in Afrikaans lexicography will be examined briefly, in terms of existing lexicographic products. Second, some remarks are offered on the professionalisation of Afrikaans lexicography, a topic that has not yet really received attention in metalexicographic research. The role of the Bureau of the Woordeboek van die Afrikaanse Taal and of the commercial dictionary publishers is also highlighted. Thereafter, comments are made on secondary and specific lexicographic processes, with reference to corpus lexicography and lemma selection, macrostructural innovation, and microstructural innovation. The discussion of microstructural innovation covers integrated microstructures in translation dictionaries, lexicographic definitions in explanatory dictionaries, cotext entries, and aspects of grammatical data. The state of affairs of, and a look ahead at, Afrikaans e-lexicography are also touched on. Finally, some remarks will be made about special dictionaries. Keywords: