Automatic speech recognition: a survey
Malik, Mishaim; Malik, Muhammad Kamran; Mehmood, Khawar ...
Multimedia tools and applications, 03/2021, Volume 80, Issue 6
Journal Article
Peer reviewed
Recently, great strides have been made in the field of automatic speech recognition (ASR) by using various deep learning techniques. In this study, we present a thorough comparison between the cutting-edge techniques currently being used in this area, with a special focus on the various deep learning methods. This study explores different feature extraction methods and state-of-the-art classification models, and their impact on ASR performance. As deep learning techniques are very data-dependent, different speech datasets that are available online are also discussed in detail. In the end, the various online toolkits, resources, and language models that can be helpful in the formulation of an ASR system are also presented. In this study, we cover every aspect that can impact the performance of an ASR system. Hence, we expect this work to be a good starting point for academics interested in ASR research.
Minds
Backus, Ad; Cohen, Michael; Cohn, Neil ...
Linguistics in the Netherlands, 11/2023, Volume 40, Issue 1
Journal Article
Peer reviewed
The advent of large language models (LLMs) like GPT-4 has raised fundamental questions about language and its nature, such as whether artificial systems are able to "use" language in a way similar to humans. The role of linguistics in the development of these technologies has been surprisingly limited, but linguists could take on a much larger role in these discussions, clarifying how LLMs could be adapted to become more similar to human language use. This paper contends that linguistic models and representations should centralize MINDS: Multimodality, Interoperability, Nonopacity, Diversity, and Sociality. The authors argue that these aspects of human language constitute the main challenges to linguistics as a social science, and that elucidating them would require a concerted effort from the field itself, as well as from affiliated domains such as philosophy, anthropology, sociology, and psychology.
Abstract Large language models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the computational social science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers' gold references. We conclude that the performance of today's LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are poised to meaningfully participate in social science analysis in partnership with humans.
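As a rough illustration of the zero-shot taxonomic labeling setup this abstract describes, the sketch below builds a classification prompt with an explicitly stated label set and a constrained answer format. The function name, template wording, and task framing are illustrative assumptions, not the paper's actual prompt templates.

```python
def zero_shot_prompt(text, labels, task="stance"):
    """Build a zero-shot classification prompt: state the task,
    enumerate the allowed labels, and constrain the answer to one label.
    (Hypothetical template, not reproduced from the paper.)"""
    options = ", ".join(labels)
    return (
        f"Classify the following text by {task}.\n"
        f"Possible labels: {options}.\n"
        f"Text: {text}\n"
        f"Answer with exactly one label."
    )
```

The prompt string would then be sent to whichever LLM serves as the zero-shot annotator; constraining the answer to the label set makes the output easy to parse and score against human annotations.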
While usage-based approaches to language development enjoy considerable support from computational studies, there have been few attempts to answer a key computational challenge posed by usage-based theory: the successful modeling of language learning as language use. We present a usage-based computational model of language acquisition which learns in a purely incremental fashion, through online processing based on chunking, and which offers broad, cross-linguistic coverage while uniting key aspects of comprehension and production within a single framework. The model's design reflects memory constraints imposed by the real-time nature of language processing, and is inspired by psycholinguistic evidence for children's sensitivity to the distributional properties of multiword sequences and for shallow language comprehension based on local information. It learns from corpora of child-directed speech, chunking incoming words together to incrementally build an item-based "shallow parse." When the model encounters an utterance made by the target child, it attempts to generate an identical utterance using the same chunks and statistics involved during comprehension. High performance is achieved on both comprehension- and production-related tasks: the model's shallow parsing is evaluated across 79 single-child corpora spanning English, French, and German, while its production performance is evaluated across over 200 single-child corpora representing 29 languages from the CHILDES database. The model also succeeds in capturing findings from children's production of complex sentence types. Together, our modeling results suggest that much of children's early linguistic behavior may be supported by item-based learning through online processing of simple distributional cues, consistent with the notion that acquisition can be understood as learning to process language.
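The incremental, chunk-based online processing this abstract describes can be caricatured in a few lines. The class below is a toy sketch, not the authors' model: the class name and the boundary heuristic (placing a chunk boundary wherever an online transition probability falls below its running average) are assumptions made for illustration.

```python
from collections import defaultdict

class ChunkLearner:
    """Toy incremental chunker: builds a shallow parse word by word,
    using only online co-occurrence statistics, loosely in the spirit
    of chunk-based models of acquisition."""

    def __init__(self):
        self.pair_counts = defaultdict(int)   # counts of adjacent word pairs
        self.word_counts = defaultdict(int)   # counts of single words
        self.running_avg = 0.0                # running mean transition prob
        self.n_updates = 0

    def transition_prob(self, w1, w2):
        # online estimate of P(w2 | w1)
        if self.word_counts[w1] == 0:
            return 0.0
        return self.pair_counts[(w1, w2)] / self.word_counts[w1]

    def process(self, utterance):
        """Process one utterance incrementally; return a shallow parse
        as a list of chunks (each chunk a list of words)."""
        chunks, current = [], []
        for w in utterance.split():
            self.word_counts[w] += 1
            if current:
                prev = current[-1]
                p = self.transition_prob(prev, w)
                # boundary heuristic: split where the transition is
                # weaker than the average transition seen so far
                if p >= self.running_avg:
                    current.append(w)
                else:
                    chunks.append(current)
                    current = [w]
                self.pair_counts[(prev, w)] += 1
                self.n_updates += 1
                self.running_avg += (p - self.running_avg) / self.n_updates
            else:
                current = [w]
        if current:
            chunks.append(current)
        return chunks
```

Production in the actual model reuses the same chunk inventory and statistics to regenerate the child's utterance; the sketch covers only the comprehension-side chunking.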
Adaptive Semiparametric Language Models
Yogatama, Dani; de Masson d'Autume, Cyprien; Kong, Lingpeng
Transactions of the Association for Computational Linguistics, 01/2021, Volume 9
Journal Article
Peer reviewed
Open access
We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states, similar to Transformer-XL, and global long-term memory by retrieving a set of nearest-neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use local context, short-term memory, or long-term memory (or any combination of them) on an ad hoc basis depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.
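The gating mechanism described above amounts to a convex combination of three next-token distributions. The function below is a simplified stand-in: in the paper the gate is learned from the current hidden state, whereas here the gate logits are supplied directly, and the function name is illustrative.

```python
import numpy as np

def gated_mixture(p_local, p_short, p_long, gate_logits):
    """Combine three next-token distributions (local context,
    short-term cache, long-term retrieved memory) with softmax
    gate weights. Toy sketch of the adaptive combination step."""
    g = np.exp(gate_logits - np.max(gate_logits))
    g = g / g.sum()                     # gate weights sum to 1
    mix = g[0] * p_local + g[1] * p_short + g[2] * p_long
    return mix / mix.sum()              # renormalize against rounding
```

Because the gate weights form a probability simplex, the model can softly select any one source (a near-one-hot gate) or blend all three, which is what lets it fall back on long-term memory only when the local context is uninformative.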
Transformer-based language models have taken many fields in NLP by storm. BERT and its derivatives dominate most of the existing evaluation benchmarks, including those for Word Sense Disambiguation (WSD), thanks to their ability to capture context-sensitive semantic nuances. However, there is still little knowledge about their capabilities and potential limitations in encoding and recovering word senses. In this article, we provide an in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity. One of the main conclusions of our analysis is that BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense. Our analysis also reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources. However, this scenario rarely occurs in real-world settings and, hence, many practical challenges remain even in the coarse-grained setting. We also perform an in-depth comparison of the two main language model-based WSD strategies, namely fine-tuning and feature extraction, finding that the latter approach is more robust with respect to sense bias and can better exploit limited available training data. In fact, the simple feature extraction strategy of averaging contextualized embeddings proves robust even using only three training sentences per word sense, with minimal improvements obtained by increasing the size of this training data.
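The feature-extraction strategy the article favors, averaging contextualized embeddings per sense and assigning the nearest sense prototype, can be sketched as follows. Plain NumPy vectors stand in for BERT's contextualized embeddings, and the function names are illustrative, not taken from the article.

```python
import numpy as np

def build_sense_prototypes(examples):
    """examples: dict mapping sense label -> list of context-embedding
    vectors (stand-ins for contextualized embeddings of the target word
    in annotated sentences). Returns one averaged prototype per sense."""
    return {sense: np.mean(np.stack(vecs), axis=0)
            for sense, vecs in examples.items()}

def disambiguate(embedding, prototypes):
    """Assign the sense whose averaged prototype is most similar
    (by cosine) to the target occurrence's embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(prototypes, key=lambda s: cos(embedding, prototypes[s]))
```

The appeal of this approach, consistent with the article's findings, is that the prototype is usable with as few as a handful of annotated sentences per sense and requires no gradient updates to the underlying model.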
Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever "understand" raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of "assertions": textual contexts that provide indirect clues about the underlying semantics. We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence. We find that assertions enable semantic emulation of languages that satisfy a strong notion of semantic transparency. However, for classes of languages where the same expression can take different values in different contexts, we show that emulation can become uncomputable. Finally, we discuss differences between our formal model and natural language, exploring how our results generalize to a modal setting and other semantic relations. Together, our results suggest that assertions in code or language do not provide sufficient signal to fully emulate semantic representations. We formalize ways in which ungrounded language models appear to be fundamentally limited in their ability to "understand".
•The two-stage training scheme works better.
•A new off-the-shelf language model is proposed.
•Linguistic semantics matter in text recognition.
•Our method achieves new state-of-the-art performance on six public benchmarks.
The Scene Text Recognition (STR) task needs large amounts of data to develop a powerful recognizer, including visual data like images and linguistic data like texts. However, existing methods mainly use a one-stage training scheme that trains the entire framework end-to-end, which relies heavily on well-annotated images and does not effectively use the data of the two modalities mentioned above. To solve this, we propose a pre-trained multi-modal network (PMMN) that uses visual and linguistic data to pre-train the vision model and language model respectively, so that each learns modality-specific knowledge for accurate scene text recognition. In detail, we first pre-train the proposed off-the-shelf vision model and language model to convergence. We then combine the pre-trained models in a unified framework for end-to-end fine-tuning, where the learned multi-modal information from each branch interacts with the other to generate robust features for character prediction. Extensive experiments demonstrate the effectiveness of PMMN. Evaluation results on six benchmarks show that our proposed method exceeds most existing methods, achieving state-of-the-art performance.
Continuous hand-sign-spelled Bangla sign language (BdSL) recognition is challenging for traditional hand-sign segmentation and classification algorithms because of the many varieties in written Bangla, including joint letters and dependent vowels, and because its 51 written characters are represented by only 36 hand-signs. This paper presents a Bangla language modeling algorithm for automatic recognition of hand-sign-spelled Bangla sign language, which consists of two phases. The first phase performs hand-sign classification, and the second phase applies the Bangla language modeling algorithm (BLMA) for automatic recognition of hand-sign-spelled Bangla sign language. In the first phase, we propose a two-step classifier for hand-sign classification using a normalized outer boundary vector (NOBV) and a window-grid vector (WGV), calculating the maximum inter-correlation coefficient (ICC) between the test feature vector and pre-trained feature vectors. The system first classifies hand-signs using the NOBV. If the classification score does not satisfy a specific threshold, another classifier based on the WGV is used. The system is trained using 5,200 images and tested using another (5,200 × 6) images of 52 hand-signs from 10 signers in 6 different challenging environments, achieving a mean classification accuracy of 95.83% at a computational cost of 39.972 milliseconds per frame. In the second phase, we propose the Bangla language modeling algorithm (BLMA), which discovers all "hidden characters" based on the "recognized characters" from the 52 hand-signs of BdSL to form any Bangla words, composite numerals, and sentences in BdSL with no additional training, based only on the result of the first phase. To the best of our knowledge, the proposed system is the first designed for automatic recognition of hand-sign-spelled BdSL over a large lexicon.
The BLMA is tested using 500 hand-sign-spelled words, 100 composite numerals, and 80 sentences in BdSL, achieving mean accuracies of 93.50%, 95.50%, and 90.50%, respectively.
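The two-step classification scheme with a threshold fallback can be sketched as below. Only the decision logic is shown: the NOBV and WGV feature extraction is not reproduced, correlation stands in for the abstract's inter-correlation coefficient, and the threshold value is an assumption for illustration.

```python
import numpy as np

def correlation(a, b):
    """Pearson correlation between two feature vectors, used as a
    stand-in for the inter-correlation coefficient (ICC)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.corrcoef(a, b)[0, 1])

def classify_two_step(test_nobv, test_wgv, nobv_templates, wgv_templates,
                      threshold=0.9):
    """Two-step scheme: match the NOBV features against trained
    templates first; if the best correlation falls below the threshold,
    fall back to the WGV features. Returns (label, step_used)."""
    scores = {label: correlation(test_nobv, t)
              for label, t in nobv_templates.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, "NOBV"
    scores = {label: correlation(test_wgv, t)
              for label, t in wgv_templates.items()}
    return max(scores, key=scores.get), "WGV"
```

The fallback design lets the cheaper boundary-based feature handle confident cases, reserving the second feature for hand-signs whose outer boundaries are ambiguous.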