The purpose of this work is to verify the effectiveness of a voice-command transcription module in a multilingual environment by determining the parameters of the voice signal during the module's operation. Approaches to speech recognition in a stream of audio data are analyzed, with an emphasis on recognition of the speaker's voice signal. The study uses voices from Google's database in various languages to improve the implementation of voice commands for automatically turning equipment on and off. Voice commands are issued in 9 randomly chosen languages, depending on their availability in the Google Voice database, and processed by the recognition module. The influence of volume and distance on the performance of the voice recognition module is analyzed: the effectiveness of the chosen command language is evaluated at microphone-to-speaker distances of approximately 5 cm, 10 cm and 15 cm, and at Google Voice volume levels of 30%, 50% and 100%. Ref. 9, fig. 4.
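An evaluation over distance and volume conditions like the one described can be tallied programmatically. The sketch below uses invented trial data, not the study's measurements, and computes recognition accuracy per (distance, volume) condition:

```python
from collections import defaultdict

# Hypothetical trial log: (distance_cm, volume_pct, recognized).
# Values are illustrative only, not the study's data.
trials = [
    (5, 100, True), (5, 100, True), (5, 30, False),
    (10, 50, True), (10, 30, False), (15, 30, False),
    (15, 100, True), (5, 50, True), (10, 100, True),
]

def accuracy_by_condition(trials):
    """Fraction of successfully recognized commands per (distance, volume)."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [hits, total]
    for dist, vol, ok in trials:
        counts[(dist, vol)][1] += 1
        if ok:
            counts[(dist, vol)][0] += 1
    return {cond: hits / total for cond, (hits, total) in counts.items()}
```

A full experiment would sweep all nine (distance, volume) pairs with repeated commands per language.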
In recent years, neural models learned through self-supervised pretraining on large-scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has potential for facilitating tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility of adapting an existing multilingual wav2vec 2.0 model for a new language, focusing on actual fieldwork data from a critically endangered language: Ainu. Specifically, we (i) examine the feasibility of leveraging data from similar languages also in fine-tuning; (ii) verify whether the model's performance can be improved by further pretraining on target language data. Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model for a new language and leads to considerable reduction in error rates. Furthermore, we find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have a positive impact on speech recognition performance when there is very little labeled data in the target language.
•Downstream performance of a multilingual speech representation model on a new, underresourced language can be improved through multilingual fine-tuning and additional pretraining.
•Continued pretraining on target language data leads to substantially lower error rates in automatic speech transcription.
•Multilingual fine-tuning with additional data from a related or similar language helps when labeled target language data is scarce.
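The error-rate reductions reported for adaptation experiments like this are typically measured as character error rate (CER). A minimal, toolkit-independent sketch computes CER as Levenshtein edit distance normalized by reference length:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # One-row dynamic-programming edit distance.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (rc != hc)))  # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)
```

Comparing CER before and after continued pretraining on target-language audio is how the adaptation gain would be quantified.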
Crowd-sourcing prosodic annotation. Cole, Jennifer; Mahrt, Timothy; Roy, Joseph
Computer Speech & Language, September 2017, Volume 45
Journal Article
Peer reviewed
Open access
•Untrained annotators performed rapid prosodic annotation for conversational speech.
•Inter-annotator reliability was similar for crowd-sourced and lab-based annotators.
•The same acoustic and contextual cues predict expert and non-expert prosodic annotation.
•Annotators experienced with the dialect of the speech materials yield higher reliability.
•Rapid prosodic annotation is optimized with annotator cohorts of 10–12.
Much of what is known about prosody is based on native speaker intuitions of idealized speech, or on prosodic annotations from trained annotators whose auditory impressions are augmented by visual evidence from speech waveforms, spectrograms and pitch tracks. Expanding the prosodic data currently available to cover more languages, and to cover a broader range of unscripted speech styles, is prohibitive due to the time, money and human expertise needed for prosodic annotation. We describe an alternative approach to prosodic data collection, with coarse-grained annotations from a cohort of untrained annotators performing rapid prosody transcription (RPT) using LMEDS, an open-source software tool we developed to enable large-scale, crowd-sourced data collection with RPT. Results from three RPT experiments are reported. The reliability of RPT is analysed comparing kappa statistics for lab-based and crowd-sourced annotations for American English, comparing annotators from the same (US) versus different (Indian) dialect groups, and comparing each RPT annotator with a ToBI annotation. Results show better reliability for same-dialect annotators (US), and the best overall reliability from crowd-sourced US annotators, though lab-based annotations are the most similar to ToBI annotations. A generalized additive mixed model is used to test differences among annotator groups in the factors that predict prosodic annotation. Results show that a common set of acoustic and contextual factors predict prosodic labels for all annotator groups, with only small differences among the RPT groups, but with larger effects on prosodic marking for ToBI annotators. The findings suggest methods for optimizing the efficiency of RPT annotations. Overall, crowd-sourced prosodic annotation is shown to be efficient, and to rely on established cues to prosody, supporting its use for prosody research across languages, dialects, speaker populations, and speech genres.
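Inter-annotator reliability of the kind analyzed here is commonly quantified with kappa statistics. A minimal sketch of Cohen's kappa for two annotators' binary prominence/boundary marks, assuming equal-length label lists over the same word sequence, might look like:

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    labels_a, labels_b: equal-length lists of labels (e.g. 1 = word
    marked as prominent/boundary, 0 = unmarked).
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    cats = set(labels_a) | set(labels_b)
    # Expected agreement under independent marginal label rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats
    )
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; 0.0 means agreement no better than chance, which is why it is preferred over raw percent agreement for sparse prosodic marks.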
•A newly developed system was implemented to automatically predict illness based on speech.
•A new hybrid feature selection method was created that combined the output of a genetic algorithm (GA) with the inputs of a neural network (NN) algorithm, selecting the most discriminative features.
•The medical speech, transcription, and intent dataset served as the baseline for this study.
•Classification was performed using a support vector machine (SVM), a neural network, and a Gaussian mixture model; the best results were obtained with the SVM.
Due to the COVID-19 epidemic and the curfew caused by it, many people have sought an ADPS on the internet in the last few years. This points to a new age of medical treatment, all the more so if the number of internet users continues to expand. As a result, automatic illness prediction online applications have attracted the interest of a large number of researchers worldwide. This work aims to develop and implement an automated illness prediction system based on speech. The system was designed to forecast the sort of ailment a patient is suffering from based on the voice alone; since this was not feasible during the trial, the diseases were divided into three categories (painful, light pain and psychological pain), and the diagnosis process was then implemented accordingly. The medical dataset named "speech, transcription, and intent" served as the baseline for this study. The smoothness, MFCC, and SCV properties were used in this work, and proved highly representative of human medical conditions. A noise-reducing forward-backward filter was used to eliminate noise from wave files captured online, in order to account for the high level of noise present in the deployed dataset. For this study, a hybrid feature selection method was created that combined the output of a genetic algorithm (GA) with the inputs of a NN algorithm. Classification was performed using an SVM, a neural network, and a GMM. The best result obtained was 94.55% illness classification accuracy with the SVM. The results showed that diagnosing illness through speech is a difficult process, especially when diagnosing each type of illness separately; when the different illness types were grouped according to the amount of pain and the psychological situation of the patient, the results were much higher.
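GA-based feature selection as described above can be sketched generically. The code below is an illustration, not the authors' implementation: it evolves bitmasks over a feature set, and the `fitness` callable (in the paper's setting, something like cross-validated SVM accuracy on the selected speech features) is left pluggable:

```python
import random

def ga_select(num_features, fitness, pop_size=20, generations=30, seed=0):
    """Evolve a population of feature-subset bitmasks.

    `fitness` scores a tuple of 0/1 flags; higher is better.
    """
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(num_features))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]              # keep the best half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)           # single-point crossover
            cut = rng.randrange(1, num_features)
            child = list(a[:cut] + b[cut:])
            child[rng.randrange(num_features)] ^= 1   # point mutation
            children.append(tuple(child))
        pop = elite + children
    return max(pop, key=fitness)
```

With a toy fitness that rewards the first three features and penalizes the rest, the GA converges on the mask selecting exactly those three.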
Speech sound errors are common in people with a variety of communication disorders and can result in impaired message transmission to listeners. Valid and reliable metrics exist to quantify this problem, but they are rarely used in clinical settings due to the time-intensive nature of speech transcription by humans. Automated speech recognition (ASR) technologies have advanced substantially in recent years, enabling them to serve as realistic proxies for human listeners. This study aimed to determine how closely transcription scores from human listeners correspond to scores from an ASR system.
Sentence recordings from 10 stroke survivors with aphasia and apraxia of speech were transcribed orthographically by 3 listeners and a web-based ASR service. Adjusted transcription scores were calculated for all samples based on accuracy of transcribed content words.
As expected, transcription scores were significantly higher for the humans than for ASR. However, intraclass correlations revealed excellent agreement among the humans and ASR systems, and the systematically lower scores for computer speech recognition were effectively equalized simply by adding the regression intercept.
The results suggest the clinical feasibility of supplementing or substituting human transcriptions with computer-generated scores, though extension to other speech disorders requires further research.
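The intercept adjustment mentioned in the results can be illustrated with a small ordinary-least-squares sketch. The scores below are invented for illustration; fitting human scores against ASR scores and adding the intercept back removes a systematic offset:

```python
def fit_line(x, y):
    """Ordinary least squares: y ≈ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical adjusted transcription scores (not the study's data):
human = [82.0, 75.0, 90.0, 68.0]
asr = [h - 12.0 for h in human]   # ASR systematically 12 points lower
slope, intercept = fit_line(asr, human)
adjusted = [intercept + slope * a for a in asr]
```

When the offset is purely additive (slope near 1), adding the regression intercept alone equalizes the ASR scores with the human ones, which is the pattern the study reports.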
Transcribing against time. Sperber, Matthias; Neubig, Graham; Niehues, Jan ...
Speech Communication, October 2017, Volume 93
Journal Article
Peer reviewed
Open access
We investigate the problem of manually correcting errors from an automatic speech transcript in a cost-sensitive fashion. This is done by specifying a fixed time budget, and then automatically choosing location and size of segments for correction such that the number of corrected errors is maximized. The core components, as suggested by previous research (Sperber, 2014c), are a utility model that estimates the number of errors in a particular segment, and a cost model that estimates annotation effort for the segment. In this work we propose a dynamic updating framework that allows for the training of cost models during the ongoing transcription process. This removes the need for transcriber enrollment prior to the actual transcription, and improves correction efficiency by allowing highly transcriber-adaptive cost modeling. We first confirm and analyze the improvements afforded by this method in a simulated study. We then conduct a realistic user study, observing efficiency improvements of 15% relative on average, and 42% for the participants who deviated most strongly from our initial, transcriber-agnostic cost model. Moreover, we find that our updating framework can capture dynamically changing factors, such as transcriber fatigue and topic familiarity, which we observe to have a large influence on the transcriber’s working behavior.
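Choosing segments under a fixed time budget is, at its core, a knapsack-style problem. The greedy sketch below is an illustration, not the paper's algorithm: it ranks segments by estimated errors per second of annotation effort (utility model over cost model) and fills the budget in that order:

```python
def choose_segments(segments, budget):
    """Greedily pick segments maximizing expected corrected errors
    within a fixed annotation-time budget.

    segments: list of (seg_id, est_errors, est_cost_seconds),
              with est_cost_seconds > 0.
    """
    # Highest expected errors fixed per second of effort first.
    ranked = sorted(segments, key=lambda s: s[1] / s[2], reverse=True)
    chosen, spent = [], 0.0
    for seg_id, errors, cost in ranked:
        if spent + cost <= budget:
            chosen.append(seg_id)
            spent += cost
    return chosen
```

The paper's dynamic updating would correspond to re-estimating each segment's `est_cost_seconds` from the transcriber's observed working speed as correction proceeds.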
Research on speech technologies necessitates spoken data, which is usually obtained through recorded read speech specifically adapted to the research needs. When the aim is to deal with the prosody involved in speech, the available data must reflect natural and conversational speech, which is usually costly and difficult to get. This paper presents a machine learning-oriented toolkit for collecting, handling, and visualizing speech data, using prosodic heuristics. We present two corpora resulting from these methodologies: the PANTED corpus, containing 250 h of English speech from TED Talks, and the Heroes corpus, containing 8 h of parallel English and Spanish movie speech. We demonstrate their use in two deep learning-based applications: punctuation restoration and machine translation. The presented corpora are freely available to the research community.
We investigate whether Amazon's Mechanical Turk (MTurk) service can be used as a reliable method for transcription of spoken language data. Utterances with varying speaker demographics (native and non-native English, male and female) were posted on the MTurk marketplace together with standard transcription guidelines. Transcriptions were compared against transcriptions carefully prepared in-house through conventional (manual) means. We found that transcriptions from MTurk workers were generally quite accurate. Further, when transcripts for the same utterance produced by multiple workers were combined using the ROVER voting scheme, the accuracy of the combined transcript rivaled that observed for conventional transcription methods. We also found that accuracy is not particularly sensitive to payment amount, implying that high quality results can be obtained at a fraction of the cost and turnaround time of conventional methods.
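ROVER combines multiple hypotheses by aligning them into a word transition network and voting at each slot. The simplified sketch below skips the alignment step, assuming the transcripts are already aligned token-for-token, and performs only the position-wise majority vote:

```python
from collections import Counter

def majority_vote(transcripts):
    """Position-wise majority vote over pre-aligned transcripts.

    Real ROVER first aligns hypotheses into a word transition
    network; here we assume equal-length, aligned token lists.
    """
    words = [t.split() for t in transcripts]
    assert len({len(w) for w in words}) == 1, "transcripts must be aligned"
    combined = []
    for slot in zip(*words):
        # Most frequent word at this position wins.
        combined.append(Counter(slot).most_common(1)[0][0])
    return " ".join(combined)
```

With three workers' transcripts, an error made by any single worker at a position is outvoted by the other two, which is why the combined transcript approaches conventional quality.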
The challenge of prosodic annotation is reflected in commonly reported variability among trained annotators in the assignment of prosodic labels. The present study examines individual differences in the perception of prosody through the lens of prosodic annotation. First, Generalized Additive Mixed Models (GAMMs) reveal the non-linear pattern of some acoustic cues on the perception of prosodic features. Second, these same models reveal that while some of the untrained annotators are using these cues to determine prosodic features, the magnitude of effect differs quite dramatically across the annotators. Finally, the trained annotators follow the same cues as subsets of the untrained annotators, but show a much stronger effect for many of the cues. The findings show that while prosody perception is systematically related to acoustic and contextual cues, there are also individual differences in the selection and magnitude of the factors that influence prosodic rating, and in the relative weighting among those factors.