This thematic issue of the journal Slovenščina 2.0 is devoted to digital linguistics, a rapidly growing interdisciplinary field of research at the intersection of traditional linguistics, information technology, and the social sciences. Digital linguistics research centres on the preservation, analysis, and use of language data, that is, digital artefacts in which language serves as the medium of interpersonal communication. Both in Slovenia and worldwide, digital linguistics is gaining importance not only in academic and educational circles but also in the public and private sectors, which increasingly need experts skilled in managing digital language data in order to operate successfully in contemporary society and the economy.
Parliamentary proceedings are a rich source of data that can be used by scholars in various humanities and social sciences disciplines. Unlike the sources of most other language corpora, parliamentary proceedings are not subject to copyright or personal privacy protections, and are typically available online, which makes them ideal for compilation into corpora and for open distribution. For these reasons many countries have already produced corpora of parliamentary proceedings, but each typically in its own encoding, which limits their comparability and their use in a multilingual setting. In this paper we propose an encoding schema which could serve as an interchange format for parliamentary corpora compiled for the purposes of scholarly investigations. The schema, called Parla-CLARIN, was developed within the CLARIN research infrastructure, and is written as a TEI ODD which includes a TEI customization and prose guidelines with examples of use. We discuss the coverage and choices made in designing the recommendations, and give an overview of the guidelines. We also discuss two other standard schemas for encoding parliamentary data, Akoma Ntoso and RDF, and their relation to Parla-CLARIN. We conclude by presenting corpora already encoded in Parla-CLARIN and discussing further work, especially the provision of a set of example documents and of transformation scripts that would make the proposed encoding more usable.
The aim of this contribution is to reflect on the process of building the multilingual European Literary Text Collection (ELTeC) that is being created in the framework of the networking project Distant Reading for European Literary History funded by COST (European Cooperation in Science and Technology). To provide some background, we briefly introduce the basic idea of ELTeC with a focus on the overall goals and intended usage scenarios. We then describe the collection composition principles that we have derived from the usage scenarios. In our discussion of the corpus-building process, we focus on collections of novels from four different literary traditions as components of ELTeC: French, Portuguese, Romanian, and Slovenian, selected from the more than twenty collections that are currently in preparation. For each collection, we describe some of the challenges we have encountered and the solutions developed while building ELTeC. In each case, the literary tradition, the history of the language, the current state of digitization of cultural heritage, the resources available locally, and the scholars’ training level with regard to digitization and corpus building have been vastly different. How can we, in this context, hope to build comparable collections of novels that can usefully be integrated into a multilingual resource such as ELTeC and used in Distant Reading research? Based on our individual and collective experience with contributing to ELTeC, we end this contribution with some lessons learned regarding collaborative, multilingual corpus building.
This paper presents the reference, specialised, and parallel corpora that can be accessed through the concordancers on the nl.ijs.si server. Most of the corpora contain texts in Slovene, while a few are in other languages. Many of the corpora have existed for some time but have now been newly annotated; some have been extended with new texts, and some are entirely new. The texts in the corpora are accompanied by metadata, and the word tokens are manually or automatically annotated with at least lemmas and morphosyntactic descriptions. In most cases the corpora are freely accessible through two web-based concordancers, which support searching large annotated corpora and offer a rich set of analytical tools, filtering by metadata, and saving results to the user's own computer. In addition to the corpora and the two concordancers, the paper discusses some of the questions that arose in providing this kind of infrastructure for corpus linguistics, and concludes with directions for further work.
Open science is based on freely and openly accessible scientific publications and data. The latter make it possible to verify and build on the results of previous research and, in the context of language technologies and manually annotated language resources, also to train new text processing tools. However, just as for scientific publications, it is important that data are cited correctly, since only this makes research reproducible; citations are also the most important indicator of how interesting and useful researchers' work is, and they strongly affect researchers' recognition and thus their chances of obtaining projects and employment. In this paper we first present the so-called "Austin principles" of linguistic data citation and describe related activities within the CLARIN.SI infrastructure. We then analyse the state of citation of language data, primarily corpora, in six leading Slovenian linguistics journals (Jezik in slovstvo, Slavistična revija, Slovenščina 2.0, Linguistica, Slovene Linguistic Studies, and Jezikoslovni zapiski) and in the proceedings of two scholarly conferences on linguistic topics (Jezikovne tehnologije in digitalna humanistika, and Obdobja) over the last seven years, i.e. 2013–2019. We reviewed 1,074 scientific publications and analysed the results quantitatively and qualitatively. From the quantitative perspective, we show that over the whole period only a good quarter of the reviewed papers involve the use of resources, and that in the later period (2018–2019) the use of resources in publications is more than twice as frequent as in the earlier period (2013–2017). We classify the ways of citing resources into five categories (e.g. giving a hyperlink to the resource in the text, or citing a key publication about the resource) and show that the use of a particular method depends largely on the author guidelines of the individual publication.
From the qualitative perspective, we focus primarily on resources with an entry in the repository of the CLARIN.SI research infrastructure, and show that, with rare exceptions, they are cited inadequately. We summarise the findings and, following the so-called "Austin principles", show what has already been done within the CLARIN.SI infrastructure, and propose guidelines for citing linguistic data along with ways of implementing them.
The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus, the novel "1984" by George Orwell, which is sentence aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. This dataset, unique in terms of the languages covered and the wealth of encoding, is extensively documented and freely available for research purposes. The paper gives an overview of the MULTEXT-East resources by type and language, and offers some conclusions and directions for further work.
The paper presents the Korpus šolskih besedil slovenskega jezika (Corpus of Slovene School Texts), a specialised written corpus of Slovene comprising approximately 1.8 million tokens. The corpus was designed within the project Franček, Jezikovna svetovalnica za učitelje slovenščine in Šolski slovar slovenskega jezika as the source material for compiling the Šolski slovar slovenskega jezika (School Dictionary of the Slovene Language), the first scientifically grounded pedagogical dictionary of Slovene. The paper discusses the text-type composition and size of the corpus, explains the technical procedures for text preprocessing and linguistic annotation, and presents the set of corpus metadata; it also explains in which formats and under which licences the corpus is available. Finally, the article draws attention to the legal aspects of text acquisition.
The paper presents the results of the Janes project, which aimed to develop language resources and tools for Slovene user-generated content. The paper first describes the 200 million word Janes corpus, containing tweets, forum posts, news comments, user and talk pages from Wikipedia, and blogs and blog comments, where each text is accompanied by rich metadata. Next, it presents the processing tools developed for Slovene user-generated content, which include a tokeniser, word-normaliser, part-of-speech tagger and lemmatiser, and a named entity recogniser. A set of manually annotated datasets was also produced, both for tool training and for linguistic research. The developed resources and tools are made publicly available under Creative Commons licences in the repository of the CLARIN.SI research infrastructure and on GitHub, while the corpora are also available through the CLARIN.SI concordancers.
The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1,600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science portal. We discuss the compilation, metadata, annotation, and distribution of the corpus, which is made freely available via online concordancers and is openly available for research through the CLARIN.SI research infrastructure. We also present the tools for mono- and bilingual term extraction and for thesis structure annotation that were developed in the scope of the project, including the manually annotated datasets used to train these tools. This specialised corpus, large by any standards, represents a substantial and highly useful language resource for the study of Slovenian academic writing and for terminology extraction.