This paper presents a new framework for integrating untranscribed spoken content into the acoustic training of an automatic speech recognition system. Untranscribed spoken content plays a very important role for under-resourced languages, because the production of manually transcribed speech databases remains an expensive and time-consuming task. We propose two new methods as part of the training framework. The first method combines initial acoustic models using a data-driven metric. The second method is an improved acoustic training procedure based on unsupervised transcriptions, in which word endings are modified using broad phonetic classes. The training framework was applied to baseline acoustic models using untranscribed spoken content from parliamentary debates. Three types of acoustic models are included in the evaluation: baseline, reference content, and framework content models. The best overall result, an 18.02% word error rate, was achieved with the third type. This result demonstrates a statistically significant improvement over the baseline and reference acoustic models.
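One reading of the second method above can be sketched as replacing each word-final phoneme in an unsupervised transcription with a broad phonetic class label. The phoneme-to-class mapping and word examples below are invented for illustration; they are not the paper's actual class definitions.

```python
# Hypothetical sketch: map each word's final phoneme to a broad phonetic
# class label, one possible reading of the modified-word-endings idea.
# The mapping below is illustrative only.
BROAD_CLASS = {
    "p": "PLOSIVE", "t": "PLOSIVE", "k": "PLOSIVE",
    "m": "NASAL", "n": "NASAL",
    "a": "VOWEL", "e": "VOWEL", "i": "VOWEL", "o": "VOWEL", "u": "VOWEL",
    "s": "FRICATIVE", "z": "FRICATIVE",
}

def relabel_word_endings(transcription):
    """transcription: list of words, each word a list of phoneme symbols."""
    out = []
    for word in transcription:
        ending = word[-1]
        out.append(word[:-1] + [BROAD_CLASS.get(ending, ending)])
    return out

hyp = [["d", "a", "n"], ["m", "i", "z", "a"]]
print(relabel_word_endings(hyp))
# -> [['d', 'a', 'NASAL'], ['m', 'i', 'z', 'VOWEL']]
```

The intuition is that an automatic transcription is least reliable at word boundaries, so coarse class labels there reduce the damage of misrecognized phonemes during retraining.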
Andrej Zgank (phone: +386 2 220 7206, email: andrej.zgank@uni‐mb.si) is with the Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia.
This paper addresses the problem of multilingual acoustic modelling for the design of multilingual speech recognisers. An agglomerative clustering algorithm for the definition of a multilingual set of triphones is proposed. The clustering algorithm is based on an indirect distance measure for triphones, defined as a weighted sum of explicit estimates of context similarity at the monophone level. The monophone similarity estimation method is based on the algorithm of Houtgast. The new clustering algorithm was tested in a multilingual speech recognition experiment for three languages, applied to the monolingual triphone sets of the language-specific recognisers for all languages. To evaluate the clustering algorithm, the performance of the multilingual set of triphones was compared to that of a reference system composed of all three language-specific recognisers operating in parallel, and to that of a multilingual set of triphones produced by a tree-based clustering algorithm. All experiments were based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish and German). The experiments showed that the clustering algorithm yields a significant reduction in the number of triphones with only minor degradation of the recognition rate.
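The indirect distance measure described above can be sketched as a weighted sum over the left context, centre phone, and right context of two triphones. The similarity matrix and weights below are invented for the example; in the paper, the monophone similarities come from Houtgast's algorithm, not from hand-picked values.

```python
import numpy as np

# Illustrative sketch of an indirect triphone distance: a weighted sum
# of monophone similarities over (left context, centre, right context).
# SIM and the weights are made up for this example.
PHONES = ["a", "e", "t", "d"]
IDX = {p: i for i, p in enumerate(PHONES)}

# Symmetric monophone similarity matrix (1.0 = identical), invented here.
SIM = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
])

def triphone_distance(t1, t2, weights=(0.25, 0.5, 0.25)):
    """t1, t2: (left, centre, right) phone triples; small = good merge."""
    sim = sum(w * SIM[IDX[a], IDX[b]]
              for w, a, b in zip(weights, t1, t2))
    return 1.0 - sim

print(triphone_distance(("a", "t", "e"), ("e", "d", "a")))  # -> 0.25
```

An agglomerative pass would then repeatedly merge the triphone pair with the smallest distance until a target inventory size is reached.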
This paper addresses the problem of multilingual acoustic modelling for automatic speech recognition. The use of an agglomerative clustering algorithm for defining a set of multilingual context-dependent phonetic units (triphones) is introduced. The algorithm is based on an indirect distance measure for triphones, defined as a weighted sum of the estimated similarities of the monophones belonging to the triphones. The monophone similarity estimation is based on the algorithm of Houtgast. The new clustering algorithm was applied in multilingual speech recognition experiments for three different languages, using language-specific recognition systems with monolingual triphones for all three languages. To evaluate the clustering algorithm, the performance of the system based on the multilingual triphones was compared with two reference systems. In the first reference system the language-specific models were used simultaneously (in parallel), while the second reference system used multilingual models produced with a decision-tree-based clustering algorithm. All experiments were based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish and German). The investigations showed that the proposed agglomerative clustering algorithm leads to a considerable reduction in the number of triphone parameters with only a slight decrease in recognition rate.
The development of big data, machine learning, and the Internet of Things has led to rapid advances in the research field of Active and Assisted Living (AAL). A human is placed at the center of such an environment, interacting with different modalities while using the system. Although video still plays a dominant role in AAL technologies, audio, as the most natural means of interaction, is also commonly used, either as a single source of information or in combination with other modalities. Despite rapidly increasing research efforts over the last decade, there is a lack of a systematic overview of audio-based technologies and applications in AAL. This review tries to fill this gap, and identifies five major topics where audio is an essential AAL building block: physiological monitoring, emotion recognition in the context of AAL, human activity recognition, fall detection, and food intake monitoring. We address the data workflow and standard sensing technologies for capturing audio in the AAL environment, provide a comprehensive overview of audio-based AAL applications, and identify datasets available to the research community. Finally, we address the main challenges that should be handled in the upcoming years, and try to identify potential future trends in audio-based AAL.
• Comprehensive review of sensing technologies, datasets and applications of audio in AAL.
• Audio can be a standalone source of information, or combined with other modalities.
• Challenges in deploying audio technologies in AAL and future trends are identified.
• Creating benchmark platforms to facilitate model evaluation is desirable.
This paper presents an application of "LentInfo," a system that provides information about the programme of the Festival Lent in Slovenia. The Festival Lent consists of various open-air theatre and music performances and draws more than 400,000 visitors per year. The application is based on a hidden Markov model (HMM) speech recognizer, and dialogue construction and management are done using the CSDP (Common Spoken Dialogue Platform) dialogue management system. The dialogue is represented as a finite-state structure and can be specified in a script using a simple syntax description. The dialogue manager is multi-application oriented, so it can easily be upgraded for new applications; if new concepts are needed, only new actions need to be added to the existing ones. Currently, prompt messages are prerecorded, but a speech synthesis system can also be included, depending on the needs of the application. Error recovery during the dialogue is done through user confirmation of the recognized input speech. Results are presented for tests performed in 2001 and are analyzed according to phone type (fixed/mobile), signal-to-noise ratio, dialogue path, etc. Although some calls were made with mobile phones from noisy festival locations, the performance of the system decreased only slightly under these conditions. 7 Tables, 5 Figures, 24 References.
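The finite-state dialogue structure described above can be sketched as a table of states, prompts, and transitions. The state names, prompts, and transitions below are invented for illustration and are not the CSDP script syntax.

```python
# Minimal finite-state dialogue sketch in the spirit of a script-based
# specification. All state names and prompts are hypothetical.
DIALOGUE = {
    "start":   {"prompt": "Which venue interests you?",
                "next": {"main stage": "ask_day", "riverside": "ask_day"}},
    "ask_day": {"prompt": "Which day?",
                "next": {"friday": "confirm", "saturday": "confirm"}},
    "confirm": {"prompt": "Shall I read the programme?", "next": {}},
}

def run_turn(state, recognized):
    """Advance one dialogue turn; unrecognized input stays in the state,
    which mirrors error recovery via re-prompting and confirmation."""
    return DIALOGUE[state]["next"].get(recognized, state)

state = "start"
state = run_turn(state, "main stage")  # -> "ask_day"
state = run_turn(state, "mumble")      # unrecognized: stay in "ask_day"
print(state)  # -> ask_day
```

Because new applications only add states and actions to such a table, the same engine can be reused, which matches the multi-application orientation mentioned in the abstract.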
This paper addresses the definition of the phonetic broad classes needed during acoustic modeling for speech recognition in decision-tree-based clustering. The usual approach is to use phonetic broad classes defined by an expert. This method has some disadvantages, especially in the case of multilingual speech recognition. A new data-driven method is proposed for the generation of phonetic broad classes based on a phoneme confusion matrix. The similarity measure is defined by the number of confusions between the master phoneme and all other phonemes included in the set. The proposed method is compared to the standard approach based on expert knowledge and to randomly generated broad classes. The data-driven method is evaluated implicitly within a speech recognition experiment. The first evaluation stage tests the generated acoustic models in a monolingual environment (Slovenian), to show that the proposed method does not contain a multilingual influence. In the second evaluation stage, the generated acoustic models are tested in a multilingual environment (Slovenian, German and Spanish). All experiments were based on SpeechDat(II) speech databases. The proposed data-driven method for the generation of phonetic broad classes, based on a phoneme confusion matrix, improved speech recognition results compared to the method based on expert knowledge.
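The confusion-matrix idea above can be sketched as grouping each phoneme with a master phoneme when their mutual confusion counts are high. The counts, threshold, and greedy grouping rule below are invented for the example; the paper's exact similarity measure may differ.

```python
import numpy as np

# Illustrative sketch: derive broad classes from a phoneme confusion
# matrix. Confusion counts and the threshold are made up here.
PHONES = ["p", "b", "s", "z"]
# CONF[i, j]: how often phone i was recognized as phone j
CONF = np.array([
    [50, 30,  2,  1],
    [28, 55,  1,  3],
    [ 1,  2, 60, 25],
    [ 2,  1, 27, 58],
])

def broad_classes(conf, phones, threshold=20):
    """Greedily group each unassigned phoneme with the current master
    phoneme when their symmetric confusion count reaches the threshold."""
    classes, assigned = [], set()
    for i, master in enumerate(phones):
        if master in assigned:
            continue
        group = [master]
        assigned.add(master)
        for j, other in enumerate(phones):
            if other not in assigned and conf[i, j] + conf[j, i] >= threshold:
                group.append(other)
                assigned.add(other)
        classes.append(group)
    return classes

print(broad_classes(CONF, PHONES))  # -> [['p', 'b'], ['s', 'z']]
```

Here /p/ and /b/ end up in one class and /s/ and /z/ in another, because acoustically close phonemes are confused far more often than distant ones, which is exactly the signal the data-driven method exploits.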
This paper proposes an approach to speeding up the acoustic classification of bee swarm activity. The proposed system could be used as a daily monitoring solution for beehives, especially if they are located remotely. The recorded audio signal was used for acoustic classification with Mel-frequency cepstral coefficients and hidden Markov acoustic models. The research objective was to analyze the influence of a reduced number of feature extraction coefficients on classification accuracy and real-time factor. Experiments were carried out with the Open Source Beehives Project audio recordings. The baseline system achieved 86.00% classification accuracy. The optimal acoustic classification system, with 6 Mel-frequency cepstral coefficients, achieved 85.38% accuracy and a 22.1% speed improvement over the baseline system.
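The coefficient-reduction idea above can be sketched as simply truncating each feature frame to its first k cepstral coefficients before classification. The frame values below are random stand-ins; real MFCCs would come from a feature extractor.

```python
import numpy as np

# Sketch of the coefficient-reduction idea: keep only the first k MFCCs
# of each frame. Frame contents here are random placeholders.
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 13))  # 100 frames, 13 MFCCs each

def reduce_coefficients(features, k):
    """Keep the first k cepstral coefficients of every frame."""
    return features[:, :k]

reduced = reduce_coefficients(frames, 6)
print(reduced.shape)  # -> (100, 6)

def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; < 1.0 is faster than real time."""
    return processing_seconds / audio_seconds
```

Lower-order MFCCs carry the broad spectral envelope, so dropping the higher-order coefficients shrinks the model and speeds up decoding, at the cost of the small accuracy drop reported in the abstract.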
This paper presents a proposed system for acoustic monitoring and classification of bee activity, based on the audio modality. Beekeeping is one of the agricultural sectors where ICT solutions provide great benefit for production and animal welfare. The presented system is a general classifier that can distinguish normal from swarm bee activity. The captured audio signal is transformed into mel-frequency cepstral coefficients, which are used to train one-state hidden Markov models. The activity classification is carried out with a recognition architecture based on principles applied to the speech modality. The acoustic models were trained in a two-stage approach using the open audio data provided by the Open Source Beehives Project. The highest classification accuracy of 80.89% was achieved with the proposed general classification approach.
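A hedged sketch of classification with one-state models: each class is modelled as a single diagonal Gaussian over MFCC-like frames, and a recording is assigned to the class with the higher total log-likelihood. The data are synthetic and only the class names ("normal", "swarm") mirror the abstract; the actual system uses HMM training, not this closed-form fit.

```python
import numpy as np

def fit_gaussian(frames):
    """Closed-form diagonal Gaussian fit (stand-in for HMM training)."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def log_likelihood(frames, mean, var):
    """Total diagonal-Gaussian log-likelihood over all frames."""
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var))

def classify(frames, models):
    """Pick the class whose model gives the highest log-likelihood."""
    return max(models, key=lambda c: log_likelihood(frames, *models[c]))

# Synthetic 6-dimensional "MFCC" frames for two well-separated classes.
rng = np.random.default_rng(1)
models = {
    "normal": fit_gaussian(rng.normal(0.0, 1.0, (200, 6))),
    "swarm":  fit_gaussian(rng.normal(2.0, 1.0, (200, 6))),
}
test_clip = rng.normal(2.0, 1.0, (50, 6))
print(classify(test_clip, models))  # -> swarm
```

A one-state HMM per class reduces to exactly this kind of per-frame likelihood accumulation, which is why the recognition architecture from speech carries over so directly to bee activity classification.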