The study compares two samples: 3,661 first names from the 1995 Slovenian telephone directory (city of Ljubljana) and 51,448 nouns from Slovar slovenskega knjižnega jezika (Dictionary of Literary Slovene; 1994 edition: 21,823 masculine, 21,427 feminine, and 8,198 neuter). The average length is 5.88 letters for first names and 8.49 letters for common nouns, and first names turn out to be shorter than common nouns at a high level of statistical significance. A comparison of noun lengths by gender is also given (averages: masculine 7.63, feminine 8.95, neuter 9.59 letters); here too the differences are statistically significant for all pairs (masculine-feminine, masculine-neuter, and feminine-neuter).
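As a quick consistency check on the figures quoted above, the per-gender counts should add up to the total number of nouns, and the count-weighted mean of the per-gender averages should reproduce the overall average of 8.49 letters. The sketch below uses only numbers from the abstract; the variable names are illustrative, and the statistical test used in the study is not stated, so none is reproduced here.

```python
# Noun counts and mean lengths (in letters) by gender, as quoted in the abstract.
counts = {"masculine": 21823, "feminine": 21427, "neuter": 8198}
means = {"masculine": 7.63, "feminine": 8.95, "neuter": 9.59}

# The three gender groups should account for every noun in the sample.
total = sum(counts.values())

# Count-weighted mean of the per-gender averages.
pooled_mean = sum(counts[g] * means[g] for g in counts) / total

print(total)                  # 51448
print(round(pooled_mean, 2))  # 8.49
```

Both values agree with the totals given in the abstract.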
Since its beginnings in the mid-1990s, the availability of electronic text corpora in Slovenian, all with an Internet user interface, has grown to a level comparable to that of many European languages with a long history of quantitative linguistic research. There are two established corpora of 100 million running words: an academic one, which is freely accessible, and a commercial one, prepared by industrial and academic partners. The two are complemented by a sizeable collection of works of fiction, available for reading in a free virtual library, and by several specialized corpora compiled for the needs of particular institutions. The majority of Slovenian newspapers are also accessible online, at least in the form of selected articles.
In this paper, a language model based on the probabilities of text n-grams is used as a measure of distance between Slovenian & 15 other European languages. During construction of the model, a Huffman tree is generated from all n-grams (n = 1 to 32, frequency 2 or more) in the training corpus of Slovenian literary texts (2.7 million words), & the corresponding Huffman code is computed for every leaf of the tree. To apply the model to a new text sample, the sample is cut into n-grams (n = 1 to 32) in such a way that the sum of the model's Huffman code lengths over all resulting n-grams is minimal. Applied to all 16 translations of Plato's Republic from the TELRI CD-ROM, the model produced the following language order (average coding length in bits per character): Slovenian (2.37), Serbo-Croatian (3.77), Croatian (3.84), Bulgarian (3.96), Czech (4.10), Polish (4.32), Russian (4.46), Slovak (4.46), Latvian (4.74), Lithuanian (4.94), English (5.40), French (5.67), German (5.69), Romanian (5.76), Finnish (6.11), & Hungarian (6.47). 9 Tables, 1 Figure, 4 References. Adapted from the source document
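The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy training string, and the choice of dynamic programming for the minimal-length segmentation are assumptions. The abstract also does not say how characters missing from the model are handled, so this sketch simply returns infinity in that case.

```python
import heapq
from collections import Counter
from itertools import count


def ngram_counts(text, max_n=32, min_freq=2):
    """Count all character n-grams (1..max_n); keep those seen min_freq+ times."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return {g: c for g, c in counts.items() if c >= min_freq}


def huffman_code_lengths(freqs):
    """Return {symbol: Huffman code length in bits} for the given frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [s]) for s, w in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # each merge pushes these leaves one level deeper
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths


def min_code_length(text, lengths, max_n=32):
    """Minimal total bits over all segmentations of text into known n-grams."""
    best = [0] + [float("inf")] * len(text)  # best[i]: cheapest coding of text[:i]
    for end in range(1, len(text) + 1):
        for n in range(1, min(max_n, end) + 1):
            bits = lengths.get(text[end - n:end])
            if bits is not None and best[end - n] + bits < best[end]:
                best[end] = best[end - n] + bits
    return best[-1]


# Toy usage: train on a tiny string, then score a sample in bits per character.
train = "abracadabra abracadabra"
model = huffman_code_lengths(ngram_counts(train, max_n=4))
sample = "abracadabra"
bits = min_code_length(sample, model, max_n=4)
print(bits / len(sample))
```

Merging symbol lists at every step keeps the Huffman routine short but is slow for large n-gram inventories; a tree-based version would be preferable for a corpus of the size mentioned in the abstract.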
A corpus of 3,661 first names collected from the Ljubljana, Slovenia, phone directory & 51,448 common nouns from a Slovene dictionary are compared by number of letters; first names are found to be shorter than common nouns. The most frequent first names & common nouns are listed, & statistical values for the distribution of lengths of common nouns by gender are tabulated. Masculine nouns are found to be the shortest. 7 Figures, 4 References. Z. Dubiel
A host of quantitative characteristics obtained during the electronic conversion of Slovar slovenskega knjižnega jezika (Dictionary of Literary Slovene, Ljubljana, 1970-1991) to CD-ROM format are quoted: word & letter statistics, word length in bytes, character n-tuple entropy, frequencies of words, letters, authors, & descriptive labels, & various graphs illustrating numerical relations in the dictionary's contents. 26 Tables, 16 Figures, 11 References. Adapted from the source document
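One of the quantities listed above, character n-tuple entropy, can be illustrated with a short sketch. It assumes the usual Shannon entropy over the relative frequencies of overlapping character n-tuples; the abstract does not specify the exact definition used for the dictionary, and the function name and sample string are hypothetical.

```python
import math
from collections import Counter


def ntuple_entropy(text, n):
    """Shannon entropy, in bits per n-tuple, of overlapping character n-tuples."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Toy usage: entropy of single characters and of character pairs.
sample = "slovar slovenskega knjižnega jezika"
for n in (1, 2):
    print(n, ntuple_entropy(sample, n))
```

Dividing the result by n gives a per-character figure, which makes values for different n-tuple sizes easier to compare.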