Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet, which use spectral features as input, and SSL-MOS, which relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme-level F0 and duration features as prosodic inputs, as well as Tacotron encoder outputs, POS tags and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features are beneficial in the MOS prediction task, improving the correlation of the predicted MOS scores with ground truth at both the utterance and system level.
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration, and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario, and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
• Prosodic clustering for fine-grained phoneme-level prosody control.
• Controllable end-to-end text-to-speech synthesis using intuitive discrete labels.
• Multispeaker prosody control, with application to unseen speaker adaptation.
We present a new online psycholinguistic resource for Greek based on analyses of written corpora combined with text processing technologies developed at the Institute for Language & Speech Processing (ILSP), Greece. The "ILSP PsychoLinguistic Resource" (IPLR) is a freely accessible service via a dedicated web page, at http://speech.ilsp.gr/iplr. IPLR provides analyses of user-submitted letter strings (words and nonwords) as well as frequency tables for important units and conditions such as syllables, bigrams, and neighbors, calculated over two word lists based on printed text corpora and their phonetic transcription. Online tools allow retrieval of words matching user-specified orthographic or phonetic patterns. All results and processing code (in the Python programming language) are freely available for noncommercial educational or research use.
This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
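The discretization step described above, turning continuous phoneme-level F0 (or duration) values into a sequence of discrete prosodic labels, can be sketched with a plain 1-D k-means in numpy (an illustrative stand-in: the function name, k-means variant, and iteration count are assumptions, not the papers' exact clustering setup):

```python
import numpy as np

def kmeans_1d(values, k, iters=50, seed=0):
    """Cluster scalar prosodic values (e.g. per-phoneme F0) into k clusters.

    Returns the centroids sorted ascending and, for each input value, a
    discrete label 0..k-1 whose order follows the sorted centroids, so
    label 0 is the lowest-F0 cluster and k-1 the highest.
    """
    rng = np.random.default_rng(seed)
    v = np.asarray(values, float)
    # Initialize centroids from k distinct data points.
    centroids = rng.choice(v, size=k, replace=False)
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        labels = np.argmin(np.abs(v[:, None] - centroids[None, :]), axis=1)
        # Move each centroid to the mean of its assigned values.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = v[labels == j].mean()
    # Final assignment with the converged centroids.
    labels = np.argmin(np.abs(v[:, None] - centroids[None, :]), axis=1)
    # Relabel so cluster ids increase with centroid value.
    order = np.argsort(centroids)
    remap = np.empty(k, dtype=int)
    remap[order] = np.arange(k)
    return centroids[order], remap[labels]
```

Because the labels are ordered by centroid value, they are intuitive to manipulate at synthesis time: incrementing a phoneme's label asks the model for a higher F0 (or longer duration) on that phoneme.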
Currently, unit-selection text-to-speech technology is the common approach for near-natural speech synthesis systems. Such systems provide an important aid for blind or partially-sighted people when combined with screen reading software. However, although the overall quality of the synthetic speech achieved by such systems can be quite high, this fact alone does not guarantee a high level of user satisfaction. Many issues must be addressed to fulfill users' expectations when integrating such systems with screen reading tools that assist blind users. This work describes the design and implementation approaches for the efficient integration of this technology into screen reading environments. In particular, this paper addresses the issues of natural language processing, speed optimization, multilingual design and overall quality optimization. In order to evaluate the resulting system, we carried out subjective assessment tests in which expert users provided feedback about performance, quality and overall experience.
Emotion-aware computing presents one of the key challenges in contemporary natural human interaction research, in which emotional speech is an essential modality in multimodal user interfaces. The speech modality relates mainly to speech emotion and affect recognition as well as near-natural expressive speech synthesis, the latter being considered one of the next significant milestones in speech synthesis technology. A common problem in both recognizing and generating affective and emotional speech content is the methodology adopted for emotion analysis and modeling. This work proposes a generalized framework for annotating, analyzing and modeling expressive speech in a data-driven machine learning approach, towards building expressive text-to-speech synthesis systems. To this end, the framework and the data-driven methodology are described, comprising the techniques and approaches for acoustic analysis and expression clustering. In addition, the deployment of online experimental tools for speech perception and annotation is presented, along with a description of the utilized speech data and initial experimental results, depicting the potential of the proposed framework and providing encouraging indications for further research.
This letter introduces one-class classification as a framework for the spectral join cost calculation in unit selection speech synthesis. Instead of quantifying the spectral cost by a single distance measure, a data-driven approach is adopted which exploits the natural similarity of consecutive speech frames in the speech database. A pair of consecutive frames is jointly represented as a vector of spectral distance measures which provide training data for the one-class classifier. At synthesis runtime, speech units are selected based on the scores derived from the classifier. Experimental results provide evidence of the effectiveness of the proposed method, which clearly outperforms the conventional approaches currently employed.
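The scoring idea, training a one-class model on distance vectors of naturally consecutive frame pairs and ranking candidate joins by how "natural" they look to it, can be sketched with a minimal Gaussian scorer in numpy (an illustrative stand-in for the letter's one-class classifier; the class name, the Gaussian density choice, and the ridge term are assumptions):

```python
import numpy as np

class GaussianOneClass:
    """Toy one-class scorer for join costs.

    Fit a single Gaussian to distance vectors computed from naturally
    consecutive frame pairs; score a candidate join by its negative
    squared Mahalanobis distance, so higher scores mean the candidate
    looks more like a natural frame transition.
    """

    def fit(self, X):
        X = np.asarray(X, float)
        self.mu = X.mean(axis=0)
        # Small ridge keeps the covariance invertible on small samples.
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        self.prec = np.linalg.inv(cov)
        return self

    def score(self, x):
        d = np.asarray(x, float) - self.mu
        return -float(d @ self.prec @ d)
```

At runtime, the score replaces a hand-picked single distance measure: candidate unit joins whose distance vectors fall inside the region occupied by natural consecutive pairs receive high scores and are preferred by the selection search.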
Nowadays, unit selection based text-to-speech technology is the mainstream approach for near natural speech synthesis systems. However, this is achieved at the expense of raised requirements in terms of computational resources. This work describes design and implementation approaches for the efficient integration of this technology in computational environments with limited resources, such as mobile devices, with no considerable speech quality degradation. In particular, the issues of database reduction, acoustic inventory compression and runtime computational load minimization are mainly addressed in this paper. Both objective and subjective assessments confirm the effectiveness of these approaches in terms of constructing a general purpose embedded unit selection TTS system and reducing the computational requirements while maintaining high speech quality.