Assessing Linguistic Complexity. Juola, Patrick
Language Complexity: Typology, Contact, Change, Miestamo, Matti, Sinnemäki, Kaius, & Karlsson, Fred (Eds.), Amsterdam: John Benjamins, 2008, pp. 89-108.
01/2008
Book Chapter
The question of "linguistic complexity" is interesting & fruitful. Unfortunately, the intuitive meaning of "complexity" is not amenable to formal analysis. This paper discusses some proposed definitions & shows how complexity can be assessed in various frameworks. The results show that, as expected, languages are all about equally "complex," but further that languages can & do differ reliably in their morphological & syntactic complexities along an intuitive continuum. I focus not only on the mathematical aspects of complexity, but on the psychological ones as well. Any claim about "complexity" is inherently about process, including an implicit description of the underlying cognitive machinery. By comparing different measures, one may better understand human language processing &, similarly, understanding psycholinguistics may drive better measures.
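To make "assessing complexity in a framework" concrete, here is a minimal sketch of one common information-theoretic proxy, not necessarily the exact procedure of the chapter: overall complexity is approximated by how well a corpus compresses, and morphological complexity by how much compressibility changes when word-internal structure is destroyed. The compressor choice and the scrambling step are illustrative assumptions.

    # Illustrative sketch: compression ratio as a crude complexity proxy.
    # bz2 stands in for any off-the-shelf compressor; this is not the
    # chapter's exact measure, only an assumed example of the genre.
    import bz2
    import random

    def compression_ratio(text: str) -> float:
        """Compressed size divided by raw size; lower means more redundancy."""
        raw = text.encode("utf-8")
        return len(bz2.compress(raw)) / len(raw)

    def scramble_word_internals(text: str, seed: int = 0) -> str:
        """Destroy word-internal (morphological) structure by shuffling the
        characters within each word while leaving word order intact."""
        rng = random.Random(seed)
        words = []
        for w in text.split():
            chars = list(w)
            rng.shuffle(chars)
            words.append("".join(chars))
        return " ".join(words)

    def morphological_signal(text: str) -> float:
        """Change in compressibility after scrambling: a rough indicator of
        how much structure lives inside words rather than between them."""
        return compression_ratio(scramble_word_internals(text)) - compression_ratio(text)

Comparing these two numbers across corpora in different languages is one way to place languages on the kind of intuitive continuum described above.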
The interaction between humans and most desktop and laptop computers is often performed through two input devices: the keyboard and the mouse. Continuous tracking of these devices provides an opportunity to verify the identity of a user, based on a profile of behavioral biometrics from the user's previous interaction with these devices. We propose a bank of sensors, each feeding a binary detector (trying to distinguish the authentic user from all others). In this study the detectors use features derived from the keyboard and the mouse, and their decisions are fused to develop a global authentication decision. The binary classification of the individual features is developed using Naive Bayes classifiers, which play the role of local detectors in a parallel binary decision fusion architecture. The conclusion of each classifier ('authentic user' or 'other') is sent to a Decision Fusion Center (DFC), where we use the Neyman-Pearson criterion to maximize the probability of detection under an upper bound on the probability of false alarms. We compute the receiver operating characteristic (ROC) of the resulting detection scheme, and use the ROC to assess the contribution of each individual sensor to the quality of the global decision on user authenticity. In this manner we identify the characteristics (and local detectors) that are most significant to the development of correct user authentication. While the false accept rate (FAR) and false reject rate (FRR) are fixed for the local sensors, the fusion center provides a trade-off between the two global error rates, and allows the designer to fix an operating point based on his/her tolerance for false alarms. We test our approach on a real-world dataset collected from 10 office workers, who worked for a week in an office environment as we tracked their keyboard dynamics and mouse usage.
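A minimal sketch of this style of parallel decision fusion, using the standard Chair-Varshney log-likelihood-ratio weighting of local binary decisions; the sensor list, FAR/FRR values, and threshold below are made-up illustrations, not the parameters of the study.

    # Sketch of a Decision Fusion Center combining local binary detectors.
    # Each local detector has known false-accept (FAR) and false-reject (FRR)
    # rates; the fused statistic is a weighted sum of its votes.
    import math

    def fuse(decisions, far, frr, threshold):
        """decisions[i] = 1 if detector i says 'authentic user', else 0.
        Returns True (accept) when the fused log-likelihood ratio exceeds
        threshold, which would be chosen per Neyman-Pearson to cap the
        global false-accept probability."""
        llr = 0.0
        for u, pfa, pfr in zip(decisions, far, frr):
            if u == 1:
                llr += math.log((1 - pfr) / pfa)   # detector voted 'authentic'
            else:
                llr += math.log(pfr / (1 - pfa))   # detector voted 'other'
        return llr >= threshold

    # Example with three assumed sensors (keystroke timing, mouse speed, clicks).
    print(fuse([1, 1, 0], far=[0.05, 0.10, 0.20], frr=[0.10, 0.15, 0.25], threshold=0.0))

Sweeping the threshold traces out the global ROC curve, and dropping a sensor from the sum shows how much that sensor contributes to the fused decision.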
Forensic analysis of questioned electronic documents is difficult because the nature of the documents eliminates many kinds of informative differences. Recent work in authorship attribution demonstrates the practicality of analyzing documents based on authorial style, but the state of the art is confusing. Analyses are difficult to apply, little is known about error types and rates, and no best practices are available. This paper discusses efforts to address these issues, partly through the development of a systematic testbed for multilingual, multigenre authorship attribution accuracy, and partly through the development and concurrent analysis of a uniform and portable software tool that applies multiple methods to analyze electronic documents for authorship based on authorial style.
Human language is one of the most intricate and complex systems with which scientists have tried to work. Many projects have foundered on the complexity of natural language and the difficulties of describing, in a principled way, its regularities and its idiosyncrasies. Mathematical formalisms capable of describing its generality have proven infeasible to learn. Machine translation, translating automatically from one (natural) language to another, is in an even deeper hole, because of the difficulties of dealing with two sets of idiosyncrasies simultaneously. In theory, all the information that a translator needs can be obtained from a set of already-translated text, but it has proven very difficult and time-consuming to write programs that are capable of working with this sort of information, and the most successful systems use naive and linguistically implausible formulations that are almost impossible to understand, modify, or use. Linguistic typologists and psycholinguists have identified many constraints on the form and processing of human languages. By incorporating these constraints into a language learning system, it is possible to build a system that learns to translate (infers functions and grammars for machine translation) from an aligned bilingual corpus of sentences using understandable, symbolic linguistic principles and representations. This work focuses on one particular constraint, the Marker Hypothesis, which is shown to be powerful, understandable, and computationally accessible. This hypothesis has been incorporated into a family of systems that infer such transfer functions using standard multivariate optimization techniques. These systems have been tested on a variety of language pairs and corpora, demonstrating the language and corpus independence of this approach. Furthermore, the design principles are in theory independent of any particular inference technique or grammatical representation and reflect only the constraints of the Marker Hypothesis and similar psycholinguistic principles. Because of the symbolic nature of this approach, the transfer functions learned are easy for non-mathematicians to use and modify. It is equally easy to apply other sources of linguistic information to help speed and direct the learning task. This can make the task of developing machine translation systems much simpler and represents a significant improvement over the current state of the art.
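As an illustration of how the Marker Hypothesis can be made computational, here is a simplified sketch in which closed-class "marker" words open new phrase-level chunks, so a sentence is cut into marker-headed segments that could then be aligned across a bilingual corpus. The marker word list and the segmentation rule are toy assumptions, not the actual inventory or algorithm of the systems described.

    # Simplified marker-based segmentation: split a sentence into chunks that
    # begin at closed-class "marker" words. The marker list is a toy subset.
    ENGLISH_MARKERS = {"the", "a", "an", "in", "on", "of", "to", "and", "that", "with"}

    def marker_segment(sentence, markers=ENGLISH_MARKERS):
        """Return a list of chunks, each starting at a marker word
        (the first chunk may lack one)."""
        chunks, current = [], []
        for word in sentence.lower().split():
            if word in markers and current:
                chunks.append(current)
                current = []
            current.append(word)
        if current:
            chunks.append(current)
        return chunks

    # marker_segment("The dog slept in the garden of the old house")
    # -> [['the', 'dog', 'slept'], ['in'], ['the', 'garden'], ['of'], ['the', 'old', 'house']]

Because the chunk boundaries come only from surface marker words, the same segmentation routine applies to any language for which a marker list is available, which is what makes the constraint attractive for language-independent transfer-function learning.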
A set of formalisms is developed, based on the marker hypothesis, i.e., that natural languages are marked for complex syntactic structure at surface form. A comparison of the expressivity & restrictedness of these formalisms shows that (1) not all constraints are restrictive & (2) the marker hypothesis & its implicit function/content word distinction provide strong restrictions on the form of allowable grammars. These restrictions may in turn provide evidence about its actual psychological reality & salience. In particular, the class of strongly marked languages can be demonstrated not to admit all finite languages & thus not to be subject to the hangman's noose of E. M. Gold's (1967) learnability proofs. It is conjectured that these languages may provide a computable method of inferring human-like languages.
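For orientation, the learnability result alluded to is Gold's theorem on identification in the limit from positive data; a standard textbook formulation, supplied here for context rather than quoted from the source, is:

    % Gold (1967): no "superfinite" class is learnable from positive data alone.
    \left(\forall L \text{ finite: } L \in \mathcal{L}\right) \;\wedge\; \left(\exists L_{\infty} \in \mathcal{L}: |L_{\infty}| = \infty\right)
    \;\Longrightarrow\; \mathcal{L} \text{ is not identifiable in the limit from positive presentation.}

Because the class of strongly marked languages demonstrably fails to contain every finite language, the theorem's antecedent does not hold for it, which is what leaves room for the conjectured computable inference of human-like languages.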
This article describes an authorship, & more generally document classification, experiment on a pre-existing Dutch corpus of university writings. By measuring linguistic distances using a cross-entropy technique, a technique sensitive not only to the distributions of language features but also to their relative intersequencing, classification judgments can be made with great sensitivity, significance, confidence, & accuracy. In particular, despite the designed difficulty of the Dutch corpus used, the technique was still able to reliably detect not only authorship, but also subtle features of register, topic, & even the educational attainments of the author. We present evidence suggesting that this technique outperforms better-known techniques such as function word principal components analysis or linear discriminant analysis, & suggest ways in which performance can be improved.
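A hedged sketch of the general idea behind such a cross-entropy distance, using character bigrams and add-one smoothing for brevity (the article's technique works over longer sequence context and different estimation details): build a model of one author's text and measure how many bits per symbol it takes to encode a questioned document under that model; lower cross-entropy suggests the same source.

    # Sketch: character-bigram cross-entropy as a stylistic distance.
    # Bigrams and add-one smoothing are simplifying assumptions for brevity.
    import math
    from collections import Counter

    def bigram_model(text):
        pairs = Counter(zip(text, text[1:]))   # bigram counts
        context = Counter(text[:-1])           # unigram (context) counts
        return pairs, context, len(set(text))

    def cross_entropy(model, text):
        """Bits per character needed to encode `text` under `model`."""
        pairs, context, v = model
        total = 0.0
        for a, b in zip(text, text[1:]):
            p = (pairs[(a, b)] + 1) / (context[a] + v)   # add-one smoothing
            total += -math.log2(p)
        return total / max(len(text) - 1, 1)

    # Attribute a questioned document to whichever candidate author's model
    # encodes it most cheaply (lowest cross_entropy).

Because the bigram probabilities depend on the order in which symbols follow one another, the measure captures intersequencing as well as raw feature frequencies, which is the property emphasized above.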
CSNLP-96, Sept. 2-4, Dublin, Ireland
Phonetic ambiguity and confusability are bugbears for any form of bottom-up or data-driven approach to language processing. The question of when an input is "close enough" to a target word pervades the entire problem spaces of speech recognition, synthesis, language acquisition, speech compression, and language representation, but the variety of representations that have been applied is demonstrably inadequate for at least some aspects of the problem. This paper reviews this inadequacy by examining several touchstone models of phonetic ambiguity and relating them to the problems they were designed to solve. A good solution would be, among other things, efficient, accurate, precise, and universally applicable to the representation of words, ideally usable as a "phonetic distance" metric for direct measurement of the "distance" between word or utterance pairs. None of the proposed models can provide a complete solution to the problem; in general, there is no algorithmic theory of phonetic distance. It is unclear whether this is a weakness of our representational technology or a more fundamental difficulty with the problem statement.
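To make the target of such a metric concrete, here is an illustrative baseline of the kind the paper finds inadequate, not a model it endorses: a feature-weighted edit distance over phoneme strings, where substitution cost grows with the number of distinctive features on which two phonemes disagree. The tiny feature table is an assumption for the example only.

    # Illustrative baseline: edit distance over phoneme strings where the
    # substitution cost is the fraction of distinctive features that differ.
    FEATURES = {                      # phoneme: (voiced, nasal, place)
        "p": (0, 0, "labial"),   "b": (1, 0, "labial"),   "m": (1, 1, "labial"),
        "t": (0, 0, "alveolar"), "d": (1, 0, "alveolar"), "n": (1, 1, "alveolar"),
    }

    def sub_cost(a, b):
        fa, fb = FEATURES[a], FEATURES[b]
        return sum(x != y for x, y in zip(fa, fb)) / len(fa)

    def phonetic_distance(s, t, indel=1.0):
        """Weighted Levenshtein distance between two phoneme sequences."""
        m, n = len(s), len(t)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = i * indel
        for j in range(1, n + 1):
            d[0][j] = j * indel
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + indel,
                              d[i][j - 1] + indel,
                              d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
        return d[m][n]

    # Example: phonetic_distance(["p", "t"], ["b", "d"]) == 2/3
    # (two substitutions, each differing in one of three features).

Metrics of this family are efficient and precise, but, as argued above, they fall short of the accuracy and universality a full theory of phonetic distance would require.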
NeMLaP-96, Sept. 16-18, 1996, Ankara, Turkey
Although the confusion of individual phonemes and features has been studied and analyzed since Miller and Nicely (1955), there has been little work on extending this to a predictive theory of word-level confusions. The PGPfone alphabet is a good touchstone problem for developing such word-level confusion metrics. This paper presents some of the difficulties encountered, along with their proposed solutions, in extending phonetic confusion results to a theoretical whole-word phonetic distance metric. The proposed solutions have been used, in conjunction with a set of selection filters, in a genetic algorithm to automatically generate appropriate word lists for a radio alphabet. This work illustrates some principles and pitfalls that should be addressed in any numeric theory of isolated word perception.
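A hedged sketch of the kind of genetic-algorithm selection described above: evolve a word list whose minimum pairwise distance is as large as possible. The fitness function, population sizes, mutation-only scheme, and the difflib stand-in for a real whole-word phonetic distance are all toy assumptions; the actual system also applies separate selection filters before scoring.

    # Toy evolutionary search for a maximally distinct word list.
    # difflib similarity stands in for a whole-word phonetic distance;
    # parameters are illustrative, and mutation-only is used for brevity.
    import random
    from difflib import SequenceMatcher
    from itertools import combinations

    def distance(w1, w2):
        """Stand-in for a phonetic distance: 1 minus string similarity."""
        return 1.0 - SequenceMatcher(None, w1, w2).ratio()

    def fitness(word_list):
        """Worst-case (minimum) pairwise distance: higher is better."""
        return min(distance(a, b) for a, b in combinations(word_list, 2))

    def evolve(candidate_pool, list_size=26, pop_size=40, generations=200, seed=0):
        """candidate_pool must contain at least list_size distinct words."""
        rng = random.Random(seed)
        population = [rng.sample(candidate_pool, list_size) for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=fitness, reverse=True)
            survivors = population[: pop_size // 2]
            children = []
            for parent in survivors:
                child = parent[:]
                child[rng.randrange(list_size)] = rng.choice(candidate_pool)  # mutate one slot
                children.append(child)
            population = survivors + children
        return max(population, key=fitness)

Swapping the stand-in distance for a genuine whole-word phonetic metric is exactly the step that requires the theoretical extensions discussed in the paper.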