Drawing upon the framework of linguistic citizenship, the chapters in this book link questions of language to sociopolitical discourses of justice, rights and equity, as well as to issues of power and access. They present powerful evidence of how marginalized speakers reclaim their voices and challenge power relations.
Most widely used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, models that operate directly on raw text (bytes or characters) have many benefits: They can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Because byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
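To make the idea of operating directly on bytes concrete, the following is a minimal sketch of how text might be mapped to byte-level input ids and back. It is an illustration only, not the code released with the paper: the function names and the number of reserved special ids (`NUM_SPECIAL_IDS = 3`) are assumptions made for this example.

```python
from typing import List

# Assumption for illustration: a few low ids are reserved for special
# symbols (e.g. padding, end-of-sequence, unknown), so byte values are shifted.
NUM_SPECIAL_IDS = 3


def encode_bytes(text: str) -> List[int]:
    """Map a string to one id per UTF-8 byte; no vocabulary file is needed."""
    return [b + NUM_SPECIAL_IDS for b in text.encode("utf-8")]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping the reserved special ids."""
    raw = bytes(i - NUM_SPECIAL_IDS for i in ids if i >= NUM_SPECIAL_IDS)
    # Invalid byte sequences are dropped rather than raising an error.
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # A misspelling changes only a couple of ids, and a non-Latin script
    # needs no special handling, though it uses more bytes per character.
    for sample in ["language", "lanugage", "язык"]:
        ids = encode_bytes(sample)
        print(sample, len(sample), len(ids), ids)
```

The sketch also shows the cost the abstract mentions: the id sequence is at least as long as the character sequence, and longer for scripts whose characters take several bytes in UTF-8.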
This textbook serves a dual purpose. It is, first, a comprehensive introduction to historical linguistics, intended for both undergraduate and graduate students who have taken, at the least, an introductory course in linguistics.
This book provides a linguist with a statistical toolkit for exploration and analysis of linguistic data. It employs R, a free software environment for statistical computing, which is increasingly popular among linguists. How to do Linguistics with R: Data exploration and statistical analysis is unique in its scope, as it covers a wide range of classical and cutting-edge statistical methods, including different flavours of regression analysis and ANOVA, random forests and conditional inference trees, as well as specific linguistic approaches, among which are Behavioural Profiles, Vector Space Models and various measures of association between words and constructions. The statistical topics are presented comprehensively, but without too much technical detail, and illustrated with linguistic case studies that answer non-trivial research questions. The book also demonstrates how to visualize linguistic data with the help of attractive informative graphs, including the popular ggplot2 system and Google visualization tools. This book has a companion website: http://doi.org/10.1075/z.195.website
Recently, there has been an increased interest in demographically grounded bias in natural language processing (NLP) applications. Much of the recent work has focused on describing bias and providing an overview of bias in a larger context. Here, we provide a simple, actionable summary of this recent work. We outline five sources where bias can occur in NLP systems: (1) the data, (2) the annotation process, (3) the input representations, (4) the models, and finally (5) the research design (or how we conceptualize our research). We explore each of the bias sources in detail in this article, including examples and links to related work, as well as potential counter-measures.
What is ethnicity? Is there a 'white' way of speaking? Why do people sometimes borrow features of another ethnic group's language? Why do we sometimes hear an accent that isn't there? This lively overview, first published in 2006, reveals the fascinating relationship between language and ethnic identity, exploring the crucial role it plays in both revealing a speaker's ethnicity and helping to construct it. Drawing on research from a range of ethnic groups around the world, it shows how language contributes to the social and psychological processes involved in the formation of ethnic identity, exploring both the linguistic features of ethnic language varieties and also the ways in which language is used by different ethnic groups. Complete with discussion questions and a glossary, Language and Ethnicity will be welcomed by students and researchers in sociolinguistics, as well as anybody interested in ethnic issues, language and education, inter-ethnic communication, and the relationship between language and identity.