This paper addresses unsupervised speaker segmentation for automatic speech recognition in a complex real-life environment such as the broadcast news domain. A statistical approach in which a universal background model (UBM) is applied for online speaker segmentation was compared with the widely used Bayesian information criterion (BIC) approach. An analysis of the influence of different window selection strategies on the performance of both methods was carried out. Experiments and test evaluation were performed on the Slovenian BNSI Broadcast News speech database.
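As an illustration of the BIC approach mentioned above, the following is a minimal sketch of a delta-BIC change-point test. It is not the implementation evaluated in the paper: it assumes one-dimensional features modelled by a single Gaussian (real systems typically use multi-dimensional cepstral features with full covariance matrices), and the names `delta_bic`, `best_change_point`, `margin` and `penalty_weight` are hypothetical.

```python
import math


def variance(xs):
    """Maximum-likelihood variance (divide by N), floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-10)


def delta_bic(window, split, penalty_weight=1.0):
    """Delta-BIC for splitting `window` at index `split` (1-D Gaussian models).

    Positive values favour two separate models, i.e. a speaker change.
    """
    left, right = window[:split], window[split:]
    n, n1, n2 = len(window), len(left), len(right)
    gain = 0.5 * (n * math.log(variance(window))
                  - n1 * math.log(variance(left))
                  - n2 * math.log(variance(right)))
    # Extra free parameters of the second model: d + d(d+1)/2 = 2 for d = 1.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


def best_change_point(window, margin=5, penalty_weight=1.0):
    """Return (index, delta_bic) of the best split, or None if none is positive."""
    candidates = range(margin, len(window) - margin + 1)
    score, index = max((delta_bic(window, i, penalty_weight), i)
                       for i in candidates)
    return (index, score) if score > 0 else None
```

The window passed to `best_change_point` corresponds to the analysis window whose selection strategy the abstract refers to; how that window grows or shifts over the signal is left out of this sketch.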
This paper analyzes how emotional speech influences the quality assessment process. Objective metrics for speech quality assessment are usually evaluated with neutral speech recordings, but human interaction frequently involves emotions. Emotions can substantially modify the acoustic and phonetic characteristics of speech signals, which are essential from the perspective of objective speech quality assessment. The analysis was carried out with the English and Slovenian Interface emotional speech database, which includes two neutral and six emotional speaking styles. The Speex codec was used to simulate the degradations, which were then assessed with the ITU-T P.862 PESQ objective metric. The analysis of the results showed that surprise in both languages, fear in English, and disgust in Slovenian influence the speech quality assessment significantly.
The paper presents a concept of multi-modal quality evaluation for measuring high-definition multimedia degraded by network or other impairments. Sophisticated audio and video metric algorithms enhanced with a structure detector are implemented. The detector reveals the Regions Of Interest known to be important and to impact the overall quality score in case of degradation. This evaluator is a useful tool for investigating the interaction between the different modal streams perceived by the end-user.
Emotions are an important part of human communication, but they can present harsh conditions for an automatic continuous speech recognition system. This paper analyzes the extent to which emotional speech degrades speech recognition accuracy for Slovenian, a highly inflected language. Language characteristics also influence speech recognition performance, and inflection is one of the most challenging of them. Moreover, Slovenian, like other Slavic languages, belongs to the group of under-resourced languages. The speech recognition system was developed with the Slovenian BNSI Broadcast News speech database. The Interface speech database was used for the experiments with emotional speech. The analysis was carried out with HMM and DNN acoustic models, combined with a 3-gram statistical language model. The results show that emotional speech degrades speech recognition accuracy by between 5% and 7% absolute.
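To make the distinction between absolute and relative degradation concrete, here is a minimal sketch of how word accuracy is commonly computed and compared. It assumes the usual definition, accuracy = 100 * (N - S - D - I) / N, obtained from a word-level edit distance; the abstract does not state the exact scoring tool used. The function names and the figures in the usage note are hypothetical and are not results from the paper.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy in percent: 100 * (N - S - D - I) / N via edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return 100.0 * (len(ref) - prev[-1]) / len(ref)


def degradation(neutral_acc, emotional_acc):
    """Return (absolute, relative) accuracy degradation in percent."""
    absolute = neutral_acc - emotional_acc
    relative = 100.0 * absolute / neutral_acc
    return absolute, relative
```

With illustrative accuracies of 80% on neutral speech and 74% on emotional speech, `degradation(80.0, 74.0)` gives an absolute degradation of 6 percentage points, which corresponds to a relative degradation of 7.5%.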
The paper addresses the problem of multilingual acoustic modelling for the design of multilingual speech recognisers. Two different approaches to defining a multilingual set of triphones (bottom-up and top-down) are investigated. A new clustering algorithm for the definition of a multilingual set of triphones is proposed. The agglomerative clustering algorithm (bottom-up) is based on a distance measure for triphones defined as a weighted sum of explicit estimates of the context similarity on a monophone level. The monophone similarity estimation method is based on the algorithm of Houtgast. The second type of system uses tree-based clustering (top-down) with a common decision tree. The experiments were based on the SpeechDat II databases (Slovenian, Spanish and German 1000 FDB SpeechDat II). Experiments have shown that the use of the agglomerative clustering algorithm results in a significant reduction of the number of triphones with minor degradation of word accuracy.
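The bottom-up idea can be sketched as follows. This is only an illustration under strong assumptions, not the algorithm proposed in the paper: the monophone similarities are a hypothetical hand-written table rather than estimates from the Houtgast method, the position weights and the complete-linkage merging rule are assumed, and all names (`MONO_SIM`, `triphone_distance`, `agglomerate`) are invented for the example.

```python
from itertools import combinations

# Hypothetical monophone similarities in [0, 1]; identical phones score 1.0.
MONO_SIM = {
    frozenset(("a", "e")): 0.6,
    frozenset(("s", "z")): 0.7,
    frozenset(("t", "d")): 0.7,
}


def mono_sim(p, q):
    """Similarity of two monophones; unknown pairs are treated as dissimilar."""
    return 1.0 if p == q else MONO_SIM.get(frozenset((p, q)), 0.0)


def triphone_distance(t1, t2, weights=(0.25, 0.5, 0.25)):
    """Distance between (left, centre, right) triphones.

    One minus the weighted sum of monophone similarities; weights are assumed.
    """
    return 1.0 - sum(w * mono_sim(p, q) for w, p, q in zip(weights, t1, t2))


def agglomerate(triphones, threshold):
    """Merge the closest clusters (complete linkage) until none is within threshold."""
    clusters = [[t] for t in triphones]

    def linkage(c1, c2):
        return max(triphone_distance(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if dist > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters
```

Raising `threshold` merges more triphones into shared models, which reflects the trade-off between the number of triphones and word accuracy described in the abstract.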
Over the last decade or so, significant research has focused on defining Quality of Experience (QoE) of Multimedia Systems and identifying the key factors that collectively determine it. Some consensus thus exists as to the role of System Factors, Human Factors and Context Factors. In this paper, the notion of context is broadened to include information gleaned from simultaneous out-of-band channels, such as social network trend analytics, that can be used, if interpreted in a timely manner, to help further optimise QoE. A case study involving simulation of HTTP adaptive streaming (HAS) and load balancing in a content distribution network (CDN) in a flash crowd scenario is presented with encouraging results.
The report illustrates the state of the art of the most successful AAL applications and functions based on audio and video data, namely (i) lifelogging and self-monitoring, (ii) remote monitoring of vital signs, (iii) emotional state recognition, (iv) food intake monitoring, activity and behaviour recognition, (v) activity and personal assistance, (vi) gesture recognition, (vii) fall detection and prevention, (viii) mobility assessment and frailty recognition, and (ix) cognitive and motor rehabilitation. For these application scenarios, the report illustrates the state of play in terms of scientific advances, available products and research projects. The open challenges are also highlighted.