Authorship attribution is an important problem in text classification, with many applications and a substantial body of research activity. Among the research findings are that many different methods will work, including a number of methods that are superficially language‐independent (such as an analysis of the most common “words” or “character n‐grams” in a document). Since all languages have words (and all written languages have characters), such methods could (in theory) work on any language. However, it is not clear that the methods that work best on, for example, English would also work best on other languages. It is not even clear that the same level of performance is achievable in different languages, even under identical conditions. Unfortunately, it is very difficult to achieve “identical conditions” in practice. A new corpus, developed by George Mikros, provides very tight controls not only for author but also for topic, thus enabling a direct comparison of performance levels between the two languages Greek and English. We compare a number of different methods head‐to‐head on this corpus, and show that, overall, performance on English is higher than performance on Greek, often highly significantly so.
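As a rough illustration of the "character n‐gram" idea mentioned in this abstract, the following minimal sketch builds a frequency profile of the most common character n‐grams and assigns an unknown text to the closest known profile. The function names, the choice of n = 3, the cutoff of 50 n‐grams, and the Manhattan distance are all assumptions made for illustration; the abstract does not say which features or distances the study actually compared.

```python
from collections import Counter


def char_ngram_profile(text, n=3, top_k=50):
    """Relative frequencies of the top_k most common character n-grams.

    Works on any script, since it only slices the string into characters.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in counts.most_common(top_k)}


def profile_distance(p, q):
    """Manhattan (L1) distance between two sparse frequency profiles."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def attribute(unknown_text, known_texts, n=3, top_k=50):
    """Return the candidate author whose profile is closest to the unknown text.

    known_texts maps a (hypothetical) author label to a sample of that
    author's writing.
    """
    target = char_ngram_profile(unknown_text, n, top_k)
    return min(
        known_texts,
        key=lambda author: profile_distance(
            target, char_ngram_profile(known_texts[author], n, top_k)
        ),
    )
```

Because the profile is computed over raw characters, the same code can be run unchanged on Greek and English text, which is the sense in which such features are "superficially language‐independent".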
• Behavioral biometrics: keystroke dynamics, mouse movement, stylometry.
• A parallel binary decision fusion architecture with 11 sensors.
• A dataset collected from 67 users each working in an office environment for a week.
• Achieve below 1% error rates (FAR, FRR) after only 30s of activity.
• Characterize robustness of system to adversarial attacks.
Active authentication is the process of continuously verifying a user based on their on-going interaction with a computer. In this study, we consider a representative collection of behavioral biometrics: two low-level modalities of keystroke dynamics and mouse movement, and a high-level modality of stylometry. We develop a sensor for each modality and organize the sensors as a parallel binary decision fusion architecture. We consider several applications for this authentication system, with a particular focus on secure distributed communication. We test our approach on a dataset collected from 67 users, each working individually in an office environment for a period of approximately one week. We are able to characterize the performance of the system with respect to intruder detection time and robustness to adversarial attacks, and to quantify the contribution of each modality to the overall performance.
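To make the idea of a parallel binary decision fusion architecture more concrete, here is a minimal sketch in which each sensor emits an accept/reject decision and a fusion center combines them by summing log-likelihood ratios derived from each sensor's false acceptance rate (FAR) and false rejection rate (FRR). The abstract does not state which fusion rule the authors used, so this rule, the function name `fuse_decisions`, and the `prior_genuine` parameter are assumptions for illustration only.

```python
import math


def fuse_decisions(decisions, far, frr, prior_genuine=0.5):
    """Fuse binary sensor decisions with a log-likelihood-ratio sum.

    decisions[i] is True if sensor i accepts the user as genuine.
    far[i] = P(sensor i accepts | intruder)
    frr[i] = P(sensor i rejects | genuine user)
    All rates and the prior must lie strictly between 0 and 1.
    Returns True if the combined evidence favors the genuine user.
    """
    # Start from the prior log-odds that the user is genuine.
    score = math.log(prior_genuine / (1.0 - prior_genuine))
    for accepted, fa, fr in zip(decisions, far, frr):
        if accepted:
            # An "accept" is more convincing when the sensor rarely accepts intruders.
            score += math.log((1.0 - fr) / fa)
        else:
            # A "reject" is more convincing when the sensor rarely rejects genuine users.
            score += math.log(fr / (1.0 - fa))
    return score > 0.0
```

Under this rule a single accurate sensor can outweigh several unreliable ones, which is one way to quantify the contribution of each modality to the fused decision.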
In this paper, we present preliminary results on how an individual's writing style persists even across languages. In other words, what aspects of an individual's writing will persist irrespective of the language in which he or she writes? We argue that cognitive and social traits are likely to persist and demonstrate this by two separate analyses of bilingual corpora using the same individuals. We show that for various measures of linguistic complexity (which we consider to be a cognitive variable) and participation in specific social conventions (a social one), the correlation between scores on the two languages studied is significantly higher than would be expected by chance. We argue that this type of correlation may permit cross-linguistic authorship attribution.
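One way to check whether a cross-language correlation is "higher than would be expected by chance" is a permutation test on the Pearson correlation of per-author scores. The sketch below assumes one score per author in each language; the abstract does not name the statistical test the authors used, so the test itself, the function names, and the default of 2000 trials are illustrative assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """One-sided p-value: how often a random re-pairing of authors
    correlates at least as strongly as the observed pairing."""
    rng = random.Random(seed)
    observed = pearson(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the author-to-author pairing
        if pearson(xs, shuffled) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

Here `xs[i]` and `ys[i]` would hold the same author's score (for example, a complexity measure) in each of the two languages.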
The present study considers the role of adjectives and adverbs in stylometric analysis and authorship attribution. Adjectives and adverbs allow both for variations in placement and order (adverbs) and variations in type (adjectives). This preliminary study examines a collection of 25 English-language blogs taken from the Schler Blog corpus, and the Project Gutenberg corpus with specific emphasis on 3 works. Within the blog corpora, the first and last 100 lines were extracted for the purpose of analysis. Project Gutenberg corpora were used in full. All texts were processed and part-of-speech tagged using the Python NLTK package. All adverbs were classified as sentence-initial, preverbal, interverbal, postverbal, sentence-final, or none-of-the-above. The adjectives were classified into types according to the universal English type hierarchy (Cambridge Dictionary Online, 2021; Annear, 1964) manually by one of the authors. Ambiguous adjectives were classified according to their context. For the adverbs, the initial samples were paired and used as training data to attribute the final samples. This resulted in 600 trials under each of five experimental conditions. We were able to attribute authorship with an average accuracy of 9.7% greater than chance across all five conditions. Confirmatory experiments are ongoing with a larger sample of English-language blogs. This strongly suggests that adverbial placement is a useful and novel idiolectal variable for authorship attribution (Juola et al., 2021). For the adjectives, differences were found in the type of adjective used by each author. Percent use of each type varied based upon individual preference and subject matter (e.g. Moby Dick had a large number of adjectives related to size and color). While adverbial order and placement are highly variable, adjectives are subject to rigid restrictions that are not violated across texts and authors. Stylometric differences in adjective use generally involve the type and category of adjectives preferred by the author. Future investigation will focus, likewise, on whether adverbial variation is similarly analyzable by type and category of adverb.
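The adverb placement categories named in this abstract can be illustrated with a small sketch that labels each adverb in a part-of-speech-tagged sentence. The input is assumed to be a list of (token, tag) pairs with Penn Treebank tags, such as NLTK's tagger produces. The abstract does not define the categories operationally, so the positional rules below (for example, treating "interverbal" as directly between two verb-tagged tokens) and the helper names are assumptions, not the authors' procedure.

```python
ADVERB_TAGS = {"RB", "RBR", "RBS"}  # Penn Treebank adverb tags


def is_verb(tag):
    """Treat Penn Treebank verb tags (VB*) and modals (MD) as verbal."""
    return tag.startswith("VB") or tag == "MD"


def classify_adverbs(tagged_sentence):
    """Label each adverb in one tagged sentence by its position.

    tagged_sentence: list of (token, tag) pairs for a single sentence.
    Returns a list of (adverb, label) pairs in sentence order.
    """
    # Ignore pure punctuation tokens so a final period does not hide
    # a sentence-final adverb.
    tokens = [(w, t) for w, t in tagged_sentence if any(ch.isalnum() for ch in w)]
    last = len(tokens) - 1
    results = []
    for i, (word, tag) in enumerate(tokens):
        if tag not in ADVERB_TAGS:
            continue
        prev_verb = i > 0 and is_verb(tokens[i - 1][1])
        next_verb = i < last and is_verb(tokens[i + 1][1])
        if i == 0:
            label = "sentence-initial"
        elif i == last:
            label = "sentence-final"
        elif prev_verb and next_verb:
            label = "interverbal"
        elif next_verb:
            label = "preverbal"
        elif prev_verb:
            label = "postverbal"
        else:
            label = "none-of-the-above"
        results.append((word, label))
    return results
```

Counting these labels per author gives a six-way placement distribution that could then be compared between training and test samples.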
15. Becoming Jack London. Juola, Patrick. Journal of quantitative linguistics, 08/2007, Volume 14, Issue 2-3. Journal Article. Peer reviewed.
Differences in entropy are applied to a computation of stylistic variation across 18 novels written by Jack London to trace the author's stylistic development in the 1912-1916 period. Figures, References. J. Hitchcock
Authorship attribution is an important emerging security tool. However, just as criminals may wear gloves to hide their fingerprints, so authors may choose to mask their style to escape detection. Most authorship studies have focused on cooperative and/or unaware authors who do not take such precautions. Using a newly published corpus (the Brennan-Greenstadt Obfuscation corpus), we use the JGAAP system (www.jgaap.com) to test different methods of authorship attribution against essays written in a deliberate attempt to mask style. We confirm that this is an issue.
The acquisition of English noun and verb morphology is modeled using a single‐system connectionist network. The network is trained to produce the plurals and past tense forms of a large corpus of monosyllabic English nouns and verbs. The developmental trajectory of network performance is analyzed in detail and is shown to mimic a number of important features of the acquisition of English noun and verb morphology in young children. These include an initial error‐free period of performance on both nouns and verbs followed by a period of intermittent over‐regularization of irregular nouns and verbs. Errors in the model show evidence of phonological conditioning and frequency effects. Furthermore, the network demonstrates a strong tendency to regularize denominal verbs and deverbal nouns and masters the principles of voicing assimilation. Despite their incorporation into a single‐system network, nouns and verbs exhibit some important differences in their profiles of acquisition. Most importantly, noun inflections are acquired earlier than verb inflections. The simulations generate several empirical predictions that can be used to evaluate further the suitability of this type of cognitive architecture in the domain of inflectional morphology.
The authors apply a decision fusion architecture on a collection of behavioral biometric sensors using keystroke dynamics, mouse movement, stylometry, and Web browsing behavior. They test this active authentication approach on a dataset collected from 19 individuals in an office environment.
Empirical studies of broad-ranging aspects of culture, such as 'cultural complexities', are often extremely difficult. Following the model of Michel et al. (Michel, J.-B., Shen, Y. K., Aiden, A. P. et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014): 176-82), and using a set of techniques originally developed to measure the complexity of language, we propose a text-based analysis of a large corpus of topic-uncontrolled text to determine how cultural complexity varies over time within a single culture. Using the Google Books American 2Gram corpus, we are able to show that (as predicted from the cumulative nature of culture), US culture has been steadily increasing in complexity, even when (for economic reasons) the amount of actual discourse as measured by publication volume decreases. We discuss several implications of this novel analysis technique as well as its implications for discussion of the meaning of 'culture.' Adapted from the source document
A key step forward in the professionalization of forensic science is the development of standards of practice and protocols. Based on his analysis of the Rowling case, Juola (2015) proposed a systematic protocol for authorship verification. We present both a theoretical and an empirical analysis of the accuracy of this protocol. We further present a demonstration of this analysis in terms of a high-profile case of political activism. We show that this protocol produces accurate and understandable analyses of the likelihood of common authorship.