The computational analysis of documents to learn about their authorship (also known as authorship attribution or authorship profiling) is an increasingly important area of research and application of technology. This paper discusses the technology, focusing on its application to social media across a variety of disciplines. It includes a brief survey of the field's history, three tutorial case studies, and a discussion of several significant applications and societal benefits that authorship analysis has brought about. It further argues, however, that while the benefits of this technology have been great, it has created serious risks to society that have not been sufficiently considered, addressed, or mitigated.
The vast amount of unstructured data produced daily raises the need for effective information retrieval and extraction methods. Named Entity Recognition is a challenging classification task for structuring data into pre-defined labels, and it is even more complicated when applied to Arabic because of the language's special traits and complex nature. This article presents a novel deep learning approach to Standard Arabic Named Entity Recognition that outperforms previous work. The main aim of building a new model is to provide better fine-grained results for use in Natural Language Processing applications. In our proposed methodology, we use transfer learning with deep neural networks to build a Pooled-GRU model combined with the Multilingual Universal Sentence Encoder. The proposed model achieves roughly a 17% improvement over previous work.
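As an illustration of the task framing only (not the paper's Pooled-GRU model), the sketch below shows what Named Entity Recognition means operationally: assigning pre-defined labels to tokens. The gazetteer lookup table here is a hypothetical stand-in for a learned model.

```python
# Illustrative only: a toy gazetteer-based tagger showing the NER task framing
# (assigning pre-defined entity labels to tokens). The article's actual model
# is a Pooled-GRU over Multilingual Universal Sentence Encoder embeddings.

GAZETTEER = {            # hypothetical lookup table, not from the article
    "القاهرة": "LOC",    # Cairo
    "مصر": "LOC",        # Egypt
    "محمد": "PER",       # Muhammad
}

def tag_tokens(tokens):
    """Label each token with an entity type, or 'O' for 'outside any entity'."""
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

tags = tag_tokens(["محمد", "يعيش", "في", "القاهرة"])  # "Muhammad lives in Cairo"
```

A real system replaces the dictionary lookup with a classifier over contextual embeddings, which is what lets it label entities it has never seen before.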
Being able to identify the author of an anonymous or disputed document is an important task in forensic science. This can be treated as a form of pattern evidence based on writing style, but the subjective analysis of writing style may have all the well-known problems of other forms of subjective pattern evidence. In this paper, we demonstrate a computer program that addresses these issues. The program analyzes a pair of documents (a known document and a questioned document) to determine whether they were written by the same author. More importantly, this paper also validates the accuracy of the program through a large-scale series of controlled experiments involving English-language blogs. Across more than 32,000 document pairs, the system achieved a measured accuracy of 77%. This paper concludes that the system not only addresses a key problem in forensic linguistics, but also provides the repeatability, reproducibility, and measured accuracy levels that are key to the advancement of forensic science.
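The abstract does not specify the program's features or decision rule, but the general shape of such a known-vs-questioned comparison can be sketched as a similarity test between stylistic profiles. Everything below (word-frequency profiles, cosine similarity, the 0.7 threshold) is an illustrative assumption, not the validated system.

```python
import math
from collections import Counter

# Hypothetical sketch of pairwise authorship verification: compare two
# documents by the cosine similarity of their relative word-frequency
# profiles. The features and threshold are illustrative stand-ins.

def profile(text):
    """Relative frequency of each word in the text."""
    words = text.lower().split()
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def same_author(known, questioned, threshold=0.7):
    # threshold is an illustrative value, not a calibrated one
    return cosine(profile(known), profile(questioned)) >= threshold
```

A production system would calibrate the threshold on labeled document pairs, which is exactly the kind of measured-accuracy validation the paper reports.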
MapLemon is a corpus, now in its second iteration, created to provide a baseline corpus of linguistic variation among English-speaking North Americans. The MapLemon corpus currently houses upwards of 21,000 words across 185 participants, more than 10 linguistic backgrounds, and more than 40 US states and Canadian provinces. MapLemon also houses writing from 91 transgender and non-binary individuals. MapLemon presents a unique method for data collection in the virtual written medium and a corpus that has proven useful for identifying demographic information via writing style, otherwise known as stylometry.
The interconnected Web technologies and the advances accompanying the growth of semantic web content have raised many challenges in the retrieval of results, especially for the Arabic language. This research targets an important but insufficiently explored area: using Linked Open Data (LOD) for automatic question answering systems in Arabic. The significance of the work presented comes from its ability to overcome many challenges in querying Arabic content, among them: (a) bridging the gap between natural language and linked data by mapping users' queries to a standard semantic web query language such as SPARQL, (b) facilitating multilingual access to semantic data, and (c) maintaining the quality of data. Another challenge was the lack of related work and publicly available resources for Arabic question answering systems over linked data, despite the rapidly growing Arabic corpus on the web. This paper presents a novel approach to automatic Arabic question answering that bypasses many of the field's characteristic challenges. A hybrid approach is developed that evaluates the effectiveness of using LOD to automatically answer Arabic questions. The approach maps users' questions in Modern Standard Arabic to a standard query language for LOD (i.e., SPARQL) by: (i) extracting entities from questions and linking them over the web using Named-Entity Recognition and Disambiguation (NER/NED), and (ii) extracting properties among the extracted named entities using a dependency-parsing approach integrated with the Wikidata ontology. To evaluate the proposed system, an Arabic questions dataset was created that includes: (a) the question body in Arabic, (b) the question type, (c) the SPARQL query formulation, and (d) the question answer. Evaluation results are promising, with a precision of 84%, a recall of 81.3%, and an F-measure of 82.8%.
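Steps (i) and (ii) of the pipeline above, entity linking followed by property extraction, end in SPARQL generation. A minimal sketch of that final mapping step is shown below; the `ENTITY_LINKS` and `PROPERTY_LINKS` tables are hypothetical stand-ins for the paper's NER/NED and dependency-parsing components, though the Wikidata identifiers themselves (wd:Q79 for Egypt, wdt:P36 for "capital") are real.

```python
# Hypothetical sketch: once NER/NED has linked the question's entity and the
# dependency parse has identified the queried property, building the SPARQL
# query is a template-filling step.

ENTITY_LINKS = {"مصر": "wd:Q79"}        # Egypt   (NER/NED output, stubbed)
PROPERTY_LINKS = {"عاصمة": "wdt:P36"}   # capital (parser output, stubbed)

def build_sparql(entity, prop):
    """Map a linked entity and property to a SPARQL SELECT query."""
    eid = ENTITY_LINKS[entity]
    pid = PROPERTY_LINKS[prop]
    return f"SELECT ?answer WHERE {{ {eid} {pid} ?answer . }}"

# e.g. for "ما هي عاصمة مصر؟" ("What is the capital of Egypt?")
query = build_sparql("مصر", "عاصمة")
```

The hard part of the pipeline is of course producing those two lookup results from free-form Arabic text; the query template itself stays simple.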
Abstract
‘Authorship attribution’, the problem of determining the author (or the author's attributes, such as gender, age, native language, or other characteristics) by examining the writing style of an unknown work, is an important problem in applied linguistics. The theory of authorship attribution is relatively straightforward: language is an underspecified system, and people can pick and choose among several different ways to describe the same thing. These choices, in turn, become habituated and can be identified as persistent patterns of an individual or group of writers.
One important psycholinguistic underpinning of this solution is the universal existence (in natural languages) of so-called ‘marker words’ or ‘function words’: little, closed-class words that do not carry much semantics but instead denote relationships between content words. Because these words are so lightly processed, writers and speakers can choose among many near-synonymous forms, and implicitly express their identity in doing so.
Do constructed languages have this same degree of near-synonymity? We present the results of a study of authorship attribution using an ad-hoc corpus of fan-written documents in various constructed languages, and show that even artificial languages constructed for artistic purposes, such as Klingon, Na'vi, and Elvish, permit this type of analysis. This indicates that even constructed languages tend to be psycholinguistically plausible.
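The function-word idea described in this abstract can be sketched in a few lines: profile a writer by the relative frequencies of closed-class words. The word list below is a small hypothetical English sample, not the feature set used in the study (which targets constructed languages).

```python
from collections import Counter

# Illustrative sketch of function-word profiling: represent a text as a
# vector of relative frequencies over a fixed list of closed-class words.
# The list here is a tiny hypothetical sample, not the study's feature set.

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "with"]

def function_word_profile(text):
    """Relative frequency of each function word, as a fixed-length vector."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]
```

For a constructed language, the same code applies once a closed-class word list for that language is substituted; the study's finding is that such lists exist and carry authorial signal even in Klingon or Na'vi.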
The recent growth in digital scholarship has made literally millions of books available to readers. But the implication of this, paradoxically, is that reading becomes more difficult. No human can possibly read and understand a million books. This is particularly problematic in literary scholarship, where “reading” a text requires much more than simple content extraction: it may require identifying and explaining patterns of thought and expression across many different works.
Despite the importance (and recent research attention) of authorship attribution as a scholarly problem, the proposed solutions are at best a collection of ad-hoc and mutually incompatible methods, and at worst simply a muddle. While most methods are better than chance, there is little understanding of which ones are substantially better or, more importantly, of why some methods outperform others. This paper describes large-scale experiments directly comparing hundreds, and in some cases millions, of different methods to determine whether there are significant and reliable performance differences. The results of these experiments are presented in the hope of identifying useful best practices for further experimentation.