The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. To achieve this, we develop tools tailored to the needs of documentary linguists by building upon technology and expertise from the area of natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed for this we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100 h per language) at a reasonable cost. For this we use standard mobile devices and dedicated software, Lig-Aikuma. After initial recording, the data are re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at the phoneme level and of the French translations at the word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that support the linguists in their work, taking into account the linguists' needs and the technology's capabilities.
This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data. We describe our development work in moving from laboratory read-speech data to real-world or `found' speech data in preparation for the DARPA evaluations on this task from 1996 to 1999. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and the different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers).
The problem of partitioning the continuous stream of data is addressed using an iterative segmentation and clustering algorithm with Gaussian mixtures. The speech recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and 4-gram statistics estimated on large text corpora. Word recognition is performed in multiple passes, where current hypotheses are used for cluster-based acoustic model adaptation prior to the next decoding pass. The overall word transcription error rates of the LIMSI evaluation systems were 27.1% (Nov96, partitioned test data), 18.3% (Nov97, unpartitioned data), 13.6% (Nov98, unpartitioned data) and 17.1% (Fall99, unpartitioned data with computation time under 10× real-time).
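The split/merge decisions behind such segmentation and clustering systems are often driven by a Bayesian-information-criterion-style test on Gaussian models. The sketch below is a minimal, self-contained illustration of that idea (single full-covariance Gaussians on synthetic features, not the LIMSI implementation; the penalty weight `lam` and all data are illustrative):

```python
import numpy as np

def gauss_loglik(x):
    """Log-likelihood of frames x under a single full-covariance Gaussian (MLE fit)."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(d)  # small regularizer
    sign, logdet = np.linalg.slogdet(cov)
    # With the MLE covariance, the data log-likelihood collapses to this closed form.
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def delta_bic(x, t, lam=1.0):
    """BIC gain of modeling x as two Gaussians split at frame t instead of one.
    Positive values favor placing a segment boundary at t."""
    n, d = x.shape
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gauss_loglik(x[:t]) + gauss_loglik(x[t:]) - gauss_loglik(x) - penalty

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(200, 4))   # frames from one acoustic condition
b = rng.normal(3.0, 1.0, size=(200, 4))   # frames from a different condition
x = np.vstack([a, b])
assert delta_bic(x, 200) > 0   # true boundary: splitting pays off
assert delta_bic(a, 100) < 0   # homogeneous data: the penalty wins, no split
```

An iterative segmenter would scan candidate boundaries with this criterion and then merge clusters whose pairwise delta-BIC is negative.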
This article reports on the activities at LIMSI over the last few years aimed at the speech recognition of broadcast news. We describe our research work on porting from speech read under laboratory conditions to natural, unconstrained speech in preparation for the ARPA Nov96, Nov97, Nov98 and Nov99 evaluations. Two fundamental problems had to be solved to process the continuous stream of inhomogeneous data.
These concern, on the one hand, the irregular acoustic nature of the signal (signal quality, background and transmission noise, music, etc.) and, on the other hand, the different linguistic styles (prepared or spontaneous speech, a wide variety of topics and many different speakers).
The continuous audio stream is partitioned by means of an iterative segmentation and clustering algorithm based on Gaussian mixtures. The speech recognition system uses HMMs with continuous Gaussian mixture densities for acoustic modeling and 4-gram statistics estimated on large text collections. Word recognition is carried out in several passes, in which the word graphs are successively improved by means of acoustic model adaptation. The word error rates of LIMSI's recognition systems were 27.1% (Nov96, partitioned test data), 18.3% (Nov97, unpartitioned test data), 13.6% (Nov98, unpartitioned test data) and 17.1% (Fall 99, unpartitioned test data with computation times under 10× real-time).
This article presents the work carried out at LIMSI on the development of a system for the automatic processing of radio and television broadcast news. Starting from a read-speech transcription system, we describe the adaptations that were necessary to process a continuous audio stream of so-called "found" data. These developments were validated in the ARPA BN evaluations (Nov96, Nov97, Nov98 and Dec99). The main difficulties posed by this type of data stem from its heterogeneous nature, whether the changes are acoustic (environment, channel, music) or linguistic (speaking styles, diversity of topics and speakers).
The continuous stream is partitioned iteratively by a segmentation and clustering algorithm based on Gaussian mixtures. The recognition system uses continuous-density hidden Markov models for acoustic modeling and, as the language model, word 4-gram statistics estimated on a large corpus of texts and transcribed speech. The word transcription is obtained in several decoding passes, where intermediate hypotheses are used to adapt the acoustic models. The error rates obtained with successive versions of this system in the ARPA evaluations were 27.1% (Nov96, with manual partitioning), 18.3% (Nov97), 13.6% (Nov98) and 17.1% (Dec99, under 10 times real-time).
What is commonly considered an epenthetic vowel can actually refer to at least two different realities: phonological epenthesis or phonetic excrescence. French schwa, noted ə, is a vowel alternating with zero and limited to unstressed syllables that can appear word-internally or word-finally. This paper presents an extensive description of the distribution of word-final schwa in Standard French in order to shed light on its nature: is it an intrusive vowel or a full epenthetic vowel? To that end, three large corpora of French containing more than 110 hours of speech were used to establish the presence of word-final schwa as a function of sociolinguistics, orthography, phonotactics and phonetics. Our conclusions are that word-final schwa is impacted by speech style, gender, orthography, phonotactics (i.e., the number of adjacent consonants and their sonority profile), and the phonological properties of the codas. However, speech rate does not impact word-final schwa realization. These results lead us to suggest that word-final schwa in Standard French shares similarities with intrusive vowels but ultimately behaves like a genuine epenthetic vowel.
Lightly supervised and unsupervised acoustic model training
Lamel, Lori; Gauvain, Jean-Luc; Adda, Gilles
Computer Speech & Language, January 2002, Volume 16, Issue 1
Journal Article, Conference Proceeding; peer-reviewed
The last decade has witnessed substantial progress in speech recognition technology, with today's state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error rate of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision. This paper describes some recent experiments using lightly supervised and unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated broadcast news data from the DARPA TDT-2 corpus. The hypothesized transcription is optionally aligned with closed captions or transcripts to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for the success of the approach, and that the acoustic models can be initialized with as little as 10 min of manually annotated data. These experiments demonstrate that light or no supervision can dramatically reduce the cost of building acoustic models.
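The alignment-and-filtering step can be sketched in a few lines: keep only the stretches where the recognizer's hypothesis agrees with the (approximate) closed captions, and discard the rest rather than train on possibly wrong labels. This is a simplified stand-in using word-level matching via `difflib`, not the paper's actual alignment procedure; the example sentences and the `min_run` threshold are invented:

```python
import difflib

def filter_by_captions(hyp_words, caption_words, min_run=3):
    """Keep hypothesis regions that agree with the closed captions for at
    least `min_run` consecutive words; everything else is dropped from the
    acoustic training data."""
    sm = difflib.SequenceMatcher(a=hyp_words, b=caption_words, autojunk=False)
    kept = []
    for block in sm.get_matching_blocks():
        if block.size >= min_run:
            kept.append(hyp_words[block.a:block.a + block.size])
    return kept

# Hypothetical recognizer output (with hesitations) vs. imperfect captions.
hyp = "the president said on tuesday that um the the economy is growing".split()
cap = "the president said on monday that the economy is growing".split()
segments = filter_by_captions(hyp, cap)
# Disagreements ("tuesday" vs "monday") and short runs are filtered out.
assert ["the", "economy", "is", "growing"] in segments
assert all("tuesday" not in s for s in segments)
```

In the lightly supervised setting, the surviving segments and their hypothesized words become the labels for the next round of acoustic model training.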
This paper investigates a data-driven word decompounding algorithm for use in automatic speech recognition. An existing algorithm, called "Morfessor," has been enhanced in order to address the problem of increased phonetic confusability arising from word decompounding, by incorporating phonetic properties and some constraints on recognition units derived from forced alignment experiments. Speech recognition experiments have been carried out on a broadcast news task for the Amharic language to validate the approach. The out-of-vocabulary (OOV) word rates were reduced by 35% to 50%, and a small reduction in word error rate (WER) was achieved. The algorithm is relatively language-independent and requires minimal adaptation to be applied to other languages.
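To make the decompounding idea concrete, here is a deliberately simple greedy sketch: split an out-of-vocabulary word into known subwords, subject to a minimum unit length (a crude stand-in for the paper's constraints against phonetically confusable short units). This is not the Morfessor algorithm; the vocabulary and parameters are invented for illustration:

```python
def decompound(word, vocab, min_len=3):
    """Greedy longest-prefix decompounding: split `word` into vocabulary
    subwords, each at least `min_len` characters long (short units tend to be
    phonetically confusable). Returns None if no full segmentation exists."""
    if word in vocab:
        return [word]
    # Try the longest admissible prefix first, leaving room for the rest.
    for cut in range(len(word) - min_len, min_len - 1, -1):
        prefix, rest = word[:cut], word[cut:]
        if prefix in vocab:
            tail = decompound(rest, vocab, min_len)
            if tail is not None:
                return [prefix] + tail
    return None

vocab = {"speech", "recognition", "error", "rate", "rates"}
assert decompound("speechrecognition", vocab) == ["speech", "recognition"]
assert decompound("errorrates", vocab) == ["error", "rates"]
assert decompound("foobar", vocab) is None   # no segmentation into known units
```

A real system would score candidate segmentations probabilistically rather than greedily, but the OOV-reduction effect comes from the same mechanism: covering unseen compounds with a fixed subword inventory.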
Le schwa est une voyelle faible ou réduite notée ә alternant avec zéro et restreinte aux syllabes non-accentuées. En français standard, il peut faire surface à l’intérieur ou en fin de mot. Nous ...proposons ici une étude du schwa final de mot exclusivement, en particulier par le prisme de la question du schwa final en tant que « lubrifiant phonétique » (Purse 2019). Le schwa final est-il réellement un lubrifiant ? Joue-t-il seulement un rôle sur le plan exclusivement phonétique ? Pour répondre à ces questions, nous avons utilisé trois très grands corpus du français (plus de 110 heures de discours) pour établir la présence du schwa final selon les contraintes phonotactiques (la loi des trois consonnes, Grammont 1894) et le style de parole, mais aussi son rôle sur les phénomènes d’adjacence de bas niveau que sont le dévoisement final et l’assimilation régressive de voisement en français standard. Nous concluons que le schwa final est en effet corrélé au nombre de consonnes dans la séquence, et au style de parole ; de surcroît, sa présence est significativement corrélée à beaucoup moins d’effets d’adjacence – comme s’il jouait le rôle de bouclier, facilitant l’adéquation entre forme de surface et forme sous-jacente.
• This study aims to establish a link between speech technology and linguistic research by studying the durations of Mandarin lexical tones in large speech corpora using tools from automatic speech recognition.
• About 1000 hours of continuous Mandarin were used in this study. To our knowledge, this is the first time that a linguistic study of Mandarin tones has been carried out on such a large corpus.
• To date, only a few studies have addressed the regional variation of standard Mandarin, let alone any large corpus-based study of the subject.
This study aims to increase our knowledge of Mandarin lexical tone duration in continuous Mandarin speech. Related variation factors such as the number of syllables in the word, the position of the syllable in the word, its prosodic position and speech style were also explored. Large corpora of casual and journalistic speech (totalling ∼1000 hours) were used. More than 90% of the words (tokens) used in spoken Mandarin are monosyllabic and disyllabic words. In casual speech, 67% of the word-tokens are monosyllabic and 30% are disyllabic. In journalistic speech, however, disyllabic words (49%) are more frequently used than monosyllabic words (45%). Tone 4 is the most frequently used of the four lexical tones in both casual (34%) and journalistic (36%) speech. Tone 1, Tone 2 and Tone 3 have similar occurrence frequencies in casual speech. Tone 3 appears to be the least frequently used tone in journalistic speech. With regard to tone duration, the results show that Tone 2 tends to have the shortest duration in casual speech and Tone 3 appears to have the longest duration in journalistic speech. Moreover, the studied variation factors (number of syllables in the word, position of the syllable in the word and prosodic position) are all found to influence the duration of Mandarin lexical tones, for both casual and journalistic speech. Tone durations in monosyllabic words appear to be closer to those of word-final syllables than to other syllable positions in multi-syllabic words. In terms of prosodic position, tone duration tends to increase with higher prosodic level in both casual and journalistic speech. Regardless of tone nature and speech style, the longest tone duration is in phrase-final position, followed by word-final and then word-medial position. Regional variation in tone duration is explored using casual speech productions from speakers of five major cities of North and South-East China, namely Beijing, Shanghai, Wuxi, Suzhou and Nanjing.
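As a toy illustration of the aggregation behind such duration figures, the sketch below computes mean tone duration per (tone, prosodic position) cell from a handful of entirely hypothetical records (all numbers invented; they merely mimic the reported phrase-final > word-final > word-medial ordering):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-syllable records: (tone, prosodic_position, duration_ms).
records = [
    (4, "phrase-final", 260), (4, "word-final", 210), (4, "word-medial", 170),
    (4, "phrase-final", 280), (4, "word-final", 200), (4, "word-medial", 160),
    (2, "phrase-final", 240), (2, "word-final", 190), (2, "word-medial", 150),
]

# Group durations by (tone, position) and average each cell.
by_key = defaultdict(list)
for tone, pos, dur in records:
    by_key[(tone, pos)].append(dur)
mean_dur = {k: mean(v) for k, v in by_key.items()}

# In these invented data, duration lengthens with higher prosodic position.
assert (mean_dur[(4, "phrase-final")]
        > mean_dur[(4, "word-final")]
        > mean_dur[(4, "word-medial")])
```

A corpus-scale study would draw the records from forced alignments rather than a hand-written list, but the per-cell averaging is the same.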
This contribution presents a study on the detection of emotions and emotion mixtures in a corpus collected in an emergency call center in Paris (CEMO). Our corpus, recorded "in the wild", is rich in vocal diversity (age, accent, number of speakers) and is annotated with an original scheme that represents up to two emotions per segment. Tests with systems using dedicated audio Transformers fine-tuned on CEMO, on a subset of non-mixed emotions, achieved a detection accuracy of 56.7% for 4 classes (fear, neutral, positive, sadness), surpassing results obtained with more classical approaches based on expert prosodic features. Additional tests were carried out on a subset of CEMO with mixed emotions, highlighting some of the remaining challenges, in particular taking the context of the interaction into account.
Emotion detection technology to enhance human decision-making is an important research issue for real-world applications, but real-life emotion datasets are relatively rare and small. The experiments conducted in this paper use CEMO, a corpus collected in a French emergency call center. Two pre-trained models based on speech and text were fine-tuned for speech emotion recognition. Using pre-trained Transformer encoders mitigates the limited and sparse nature of our data. This paper explores different fusion strategies for these modality-specific models. In particular, fusions with and without cross-attention mechanisms were tested to gather the most relevant information from both the speech and text encoders. We show that multimodal fusion brings an absolute gain of 4-9% with respect to either single modality and that the symmetric multi-headed cross-attention mechanism performed better than late classical fusion approaches. Our experiments also suggest that for the real-life CEMO corpus, the audio component encodes more emotive information than the textual one.
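A minimal, single-head sketch of the cross-attention fusion idea follows. It is not the paper's architecture (which is multi-headed and learned end to end); the dimensions, weights and pooling choices here are illustrative. Each modality queries the other, and the two attended summaries are concatenated into one joint representation:

```python
import numpy as np

def cross_attention(q_in, kv_in, wq, wk, wv):
    """Single-head cross-attention: queries come from one modality,
    keys/values from the other."""
    q, k, v = q_in @ wq, kv_in @ wk, kv_in @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))     # 5 token embeddings (hypothetical)
audio = rng.normal(size=(12, d))   # 12 frame embeddings (hypothetical)
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# Symmetric fusion: attend text->audio and audio->text, mean-pool, concatenate.
t2a = cross_attention(text, audio, wq, wk, wv).mean(axis=0)
a2t = cross_attention(audio, text, wq, wk, wv).mean(axis=0)
fused = np.concatenate([t2a, a2t])   # joint feature fed to the emotion classifier
assert fused.shape == (2 * d,)
```

The symmetric variant lets each encoder condition on the other, in contrast to late fusion, which only combines the two classifiers' independent outputs.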
This paper presents the LIMSI speaker diarization system for lecture data, in the framework of the Rich Transcription 2006 Spring (RT-06S) meeting recognition evaluation. This system builds upon the baseline diarization system designed for broadcast news data. The baseline system combines agglomerative clustering based on the Bayesian information criterion with a second clustering using state-of-the-art speaker identification techniques. In the RT-04F evaluation, the baseline system provided an overall diarization error of 8.5% on broadcast news data. However, since it has a high missed speech error rate on lecture data, a different speech activity detection approach based on the log-likelihood ratio between speech and non-speech models trained on the seminar data was explored. The new speaker diarization system integrating this module provides an overall diarization error of 20.2% on the RT-06S Multiple Distant Microphone (MDM) data.
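The log-likelihood-ratio speech activity detection can be sketched as follows. Single diagonal Gaussians stand in for the GMMs trained on the seminar data, and all parameters and features are synthetic placeholders, not the evaluated system:

```python
import numpy as np

def loglik(x, mean, var):
    """Per-frame log-likelihood of feature vectors x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def detect_speech(frames, speech_model, nonspeech_model, threshold=0.0):
    """Frame-wise speech/non-speech decision via the log-likelihood ratio
    between the two models (True = speech)."""
    llr = loglik(frames, *speech_model) - loglik(frames, *nonspeech_model)
    return llr > threshold

rng = np.random.default_rng(1)
speech_model = (np.full(4, 2.0), np.full(4, 1.0))      # (mean, var), hypothetical
nonspeech_model = (np.full(4, -2.0), np.full(4, 1.0))  # (mean, var), hypothetical
speech_frames = rng.normal(2.0, 1.0, size=(50, 4))
noise_frames = rng.normal(-2.0, 1.0, size=(50, 4))
decisions = detect_speech(np.vstack([speech_frames, noise_frames]),
                          speech_model, nonspeech_model)
assert decisions[:50].mean() > 0.9    # speech frames accepted
assert decisions[50:].mean() < 0.1    # noise frames rejected
```

Lowering the threshold trades missed speech for false alarms, which is exactly the knob that mattered when porting the broadcast-news detector to lecture data.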