•We developed an Arabic text-to-speech system, including a diacritization system.
•The speech synthesis system is based on statistical parametric synthesis.
•We address the accuracy of the diacritic and acoustic models.
•We propose a diacritization system based on the position of the current letter.
•A per-unit-type neural network synthesis system generates high speech quality.
Text-to-speech (TTS) synthesis has been widely studied for many languages. However, speech synthesis for Arabic has not made sufficient progress and is still in its early stages. Statistical parametric synthesis based on hidden Markov models has been the most commonly applied approach for Arabic. Recently, speech synthesized with deep neural networks was found to be as intelligible as the human voice. This paper describes a TTS synthesis system for Modern Standard Arabic based on the statistical parametric approach and Mel-cepstral coefficients. Deep neural networks have achieved state-of-the-art performance in a wide range of tasks, including speech synthesis. Our TTS system includes a diacritization system, which is very important for Arabic TTS applications; the diacritization system is also based on deep neural networks. In addition to the use of deep learning techniques, different methods were proposed to model the acoustic parameters in order to address the problem of acoustic model accuracy. They are based on linguistic and acoustic characteristics (e.g. a letter-position-based diacritization system, a unit-type-based synthesis system, a diacritic-mark-based synthesis system) and on deep learning techniques (stacked generalization). Experimental results show that our diacritization system can generate diacritized text with high accuracy. As regards the speech synthesis system, the experimental results and subjective evaluation show that our proposed method can generate intelligible and natural speech.
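To make the diacritization step concrete, here is a minimal sketch of a DNN classifier that predicts one diacritic per letter from a window of surrounding characters. The architecture, vocabulary sizes, and window length are illustrative assumptions, not the exact system described above.

```python
# Minimal sketch: a feed-forward DNN predicts one diacritic per letter from
# a context window of character ids (all hyperparameters are illustrative).
import torch
import torch.nn as nn

class DiacriticClassifier(nn.Module):
    def __init__(self, n_chars, n_diacritics, window=5, emb_dim=32, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(window * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_diacritics),
        )

    def forward(self, window_ids):           # (batch, window) letter ids
        e = self.emb(window_ids).flatten(1)  # concatenate context embeddings
        return self.mlp(e)                   # (batch, n_diacritics) logits

# The letter's position in the word (as in the letter-position-based variant)
# could be appended as an extra input feature.
model = DiacriticClassifier(n_chars=40, n_diacritics=15)
logits = model(torch.randint(0, 40, (8, 5)))
```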
Arabic text-to-speech synthesis from non-diacritized text is still a major challenge, because of the unique rules and characteristics of the Arabic language. Indeed, the diacritic and gemination signs, which are special characters representing short vowels and consonant doubling respectively, have a major effect on the accurate pronunciation of Arabic. However, these signs are often not written, since most Arab readers are used to inferring them from context. To tackle this issue, this paper presents a grapheme-to-phoneme conversion system for Arabic, which constitutes the text processing module of a deep neural network (DNN)-based Arabic TTS system. In the case of Arabic text, this step starts with predicting the diacritic and gemination signs; in this work, that prediction was realized entirely with DNNs. Finally, the grapheme-to-phoneme conversion of the diacritized text was achieved using the Buckwalter code. In comparison to state-of-the-art approaches, the proposed system gives a higher accuracy rate both over all phonemes and for each class, and high precision, recall, and F1 score for each class of diacritic signs.
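The Buckwalter code is a one-to-one ASCII transliteration of Arabic script. The following small excerpt illustrates the final conversion step only; the full table and the subsequent phonemization rules (e.g. expanding the shadda into a doubled consonant) are not shown.

```python
# Tiny excerpt of the Buckwalter transliteration table, mapping diacritized
# Arabic graphemes to ASCII symbols before phonemization.
BUCKWALTER = {
    '\u0628': 'b',   # ba
    '\u062A': 't',   # ta
    '\u0633': 's',   # seen
    '\u0644': 'l',   # lam
    '\u0645': 'm',   # meem
    '\u064E': 'a',   # fatha  (short vowel /a/)
    '\u064F': 'u',   # damma  (short vowel /u/)
    '\u0650': 'i',   # kasra  (short vowel /i/)
    '\u0651': '~',   # shadda (gemination: consonant doubling)
    '\u0652': 'o',   # sukun  (no vowel)
}

def to_buckwalter(diacritized_text: str) -> str:
    """Transliterate a diacritized string; unknown characters pass through."""
    return ''.join(BUCKWALTER.get(ch, ch) for ch in diacritized_text)

print(to_buckwalter('\u0633\u064E\u0644\u0650\u0645\u064E'))  # -> 'salima'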
This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
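The discretization step can be sketched with k-means, a standard choice of unsupervised clustering; the cluster count and the data below are illustrative placeholders, not the paper's exact settings.

```python
# Sketch: phoneme-level F0 values are clustered with k-means and replaced by
# their cluster index, yielding one prosodic label per phoneme.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
f0_per_phoneme = rng.uniform(80.0, 300.0, size=(10000, 1))  # Hz, placeholder

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(f0_per_phoneme)

utterance_f0 = np.array([[110.0], [140.0], [190.0], [250.0]])
prosodic_labels = kmeans.predict(utterance_f0)  # one label per phoneme

# These labels condition the decoder in parallel with the phoneme sequence;
# swapping the centroids for musical-note frequencies gives note-level control.
```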
We propose PeriodNet, a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling the periodic and aperiodic components in speech waveforms. Non-AR waveform generation models can generate speech waveforms in parallel and can be used as a speech vocoder by conditioning on acoustic features. Since a speech waveform contains periodic and aperiodic components, both should be appropriately modeled to generate a high-quality speech waveform. However, it is difficult to decompose these components from a natural speech waveform in advance. To address this issue, we propose a parallel model and a series model structure that separate the periodic and aperiodic components. The features of our proposed models are that explicit periodic and aperiodic signals are taken as input, and that no external periodic/aperiodic decomposition is needed in training. Experiments using a singing voice corpus show that our proposed structure improves the naturalness of the generated waveform. We also show that speech waveforms with a pitch outside the training data range can be generated with higher naturalness.
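The explicit input signals can be constructed directly from frame-level F0, as sketched below; the sample rate, hop size, and the choice of a plain sine plus Gaussian noise are assumptions for illustration.

```python
# Sketch: a sine wave driven by frame-level F0 serves as the explicit periodic
# input signal, and Gaussian noise as the aperiodic one.
import numpy as np

def make_inputs(f0, sr=24000, hop=120):
    """f0: frame-level F0 in Hz (0 for unvoiced frames)."""
    f0_up = np.repeat(f0, hop)                      # upsample to sample rate
    phase = 2 * np.pi * np.cumsum(f0_up) / sr       # running phase
    periodic = np.where(f0_up > 0, np.sin(phase), 0.0)
    aperiodic = np.random.randn(len(f0_up))
    return periodic, aperiodic

periodic, aperiodic = make_inputs(np.array([220.0] * 50 + [0.0] * 10))
```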
Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems which can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all.
In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities on the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is possible due to the use of a large Romanian parallel spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress markings.
We evaluate the results of the MSPK system using the objective measures of equal error rate (EER) and word error rate (WER), and also look into the distances between the t-SNE projections of the natural and synthesised embeddings computed by an accurate speaker verification network. The results show that there is indeed a large correlation between the recording conditions and the speaker's synthetic voice quality. The speaker's gender does not influence the output, and extending the input text representation with syllable boundaries and lexical stress information does not equally enhance the generated audio across all speaker identities. The visualisation of the t-SNE projections of the natural and synthesised speaker embeddings shows that the acoustic model shifts some of the speakers' neural representations, but not all of them. As a result, these speakers yield lower-quality output speech.
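The embedding analysis can be sketched as follows: natural and synthesised speaker embeddings are jointly projected with t-SNE and the per-set shift is measured. The embeddings below are random placeholders standing in for a verification network's output.

```python
# Sketch: project natural and synthesised speaker embeddings with t-SNE and
# measure how far the synthesised set has shifted from the natural one.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
natural = rng.normal(size=(50, 256))      # one embedding per utterance
synthesised = natural + rng.normal(scale=0.3, size=(50, 256))

proj = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([natural, synthesised]))
nat_2d, syn_2d = proj[:50], proj[50:]
shift = np.linalg.norm(nat_2d.mean(0) - syn_2d.mean(0))
print(f"centroid shift in t-SNE space: {shift:.2f}")
```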
While text-to-speech synthesis with non-autoregressive Transformers has achieved state-of-the-art quality for many languages, the methodology of Estonian text-to-speech synthesis has not been revised for neural methods. This paper evaluates the quality of Estonian text-to-speech with Transformer-based models using different language-specific data processing steps. Additionally, we conduct a human evaluation to show how well these models can learn the patterns of Estonian pronunciation, given different amounts of training data and varying degrees of phonetic information. Our error analysis shows that using a simple multi-speaker approach can significantly decrease the number of pronunciation errors, while phonetic information can also help to a smaller extent.
We propose a physiologically based intonation model using perceptual relevance. Motivated by speech synthesis from a speech-to-speech translation (S2ST) point of view, we aim at a language-independent way of modelling intonation. The model presented in this paper can be seen as a generalisation of the command response (CR) model, albeit with the same modelling power. It is an additive model which decomposes intonation contours into a sum of critically damped system impulse responses. To decompose the intonation contour, we use a weighted-correlation-based atom decomposition algorithm (WCAD) built around a matching pursuit framework. The algorithm allows an arbitrary precision to be reached using an iterative procedure that adds more elementary atoms to the model. Experiments are presented demonstrating that this generalised CR (GCR) model is able to model intonation as would be expected. Experiments also show that the model produces a similar number of parameters or elements as the CR model. We conclude that the GCR model is appropriate as an engineering solution for modelling prosody, and hope that it is a contribution to a deeper scientific understanding of the neurobiological process of intonation.
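The decomposition idea can be sketched as greedy matching pursuit over a dictionary of critically damped (gamma-kernel) atoms. The atom orders, time constants, and shifts below are illustrative assumptions, not the exact WCAD settings (in particular, the perceptual weighting is omitted).

```python
# Sketch: greedily pick the critically damped atom most correlated with the
# residual of the intonation (F0) contour, one atom per iteration.
import numpy as np

def gamma_atom(t, k, theta):
    """Impulse response of a k-th order critically damped system (gamma kernel)."""
    g = t ** (k - 1) * np.exp(-t / theta)
    return g

def shifted(atom, s):
    """Delay an atom by s samples, zero-padding at the start."""
    out = np.zeros_like(atom)
    out[s:] = atom[: len(atom) - s]
    return out

t = np.linspace(0.01, 2.0, 200)  # seconds
raw = [shifted(gamma_atom(t, k, th), s)
       for k in (2, 4, 6) for th in (0.1, 0.2, 0.4) for s in range(0, 180, 20)]
atoms = [a / np.linalg.norm(a) for a in raw]      # unit-norm dictionary

def matching_pursuit(contour, atoms, n_atoms=10):
    residual = contour.copy()
    model = np.zeros_like(contour)
    for _ in range(n_atoms):
        gains = np.array([a @ residual for a in atoms])
        best = int(np.argmax(np.abs(gains)))
        model += gains[best] * atoms[best]
        residual -= gains[best] * atoms[best]
    return model  # sum of damped impulse responses approximating the contour
```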
It is very popular today to integrate voice interfaces into IoT devices. The pronunciation and proper prosody of speech play a major role in the intelligibility and naturalness of synthesized voices. Each language has its own prosodic characteristics. In this paper, we present the results of a study aimed at testing the applicability of methods for modelling and predicting the prosodic features of the Croatian language. We investigated the extent to which their performance can be improved by incorporating linguistic features and peculiarities specific to the Croatian language. In the model learning process, decision-tree classification was used to predict the position and the type of lexical stress in a word, and a lexicon of 1,011,785 word forms was used as the learning set. Separate models were created for predicting the position and the type of lexical stress. The results improved significantly after the rules for atonic words (clitics) were applied. A hybrid approach combining a rule-based approach and a modelling approach was also proposed. The final accuracy of lexical stress assignment using the hybrid approach was 95.3%.
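The modelling step can be sketched with a decision tree classifier; the features and toy data below are purely illustrative stand-ins for the lexicon-derived features.

```python
# Sketch: a decision tree predicts the stressed syllable index from simple
# word features (features and data are illustrative placeholders).
from sklearn.tree import DecisionTreeClassifier

# features: [number of syllables, index of first long vowel, word is a clitic]
X = [[2, 0, 0], [3, 1, 0], [1, 0, 1], [4, 2, 0], [2, 1, 0]]
y = [0, 1, -1, 2, 1]   # stressed syllable index; -1 = atonic (clitic)

tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
print(tree.predict([[3, 1, 0]]))

# A rule layer (e.g. "clitics carry no stress") can then override the model's
# output, giving the hybrid rule-plus-model approach described above.
```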
This work presents a lifelong learning approach to training a multilingual Text-To-Speech (TTS) system, where each language is seen as an individual task and is learned sequentially and continually. It does not require pooling data from all languages, and thus alleviates the storage and computation burden. One of the challenges of lifelong learning methods is "catastrophic forgetting": in the TTS scenario, it means that model performance quickly degrades on previous languages when the model is adapted to a new language. We approach this problem via a data-replay-based lifelong learning method. We formulate the replay process as a supervised learning problem, and propose a simple yet effective dual-sampler framework to tackle the heavily language-imbalanced training samples. Through objective and subjective evaluations, we show that this supervised learning formulation outperforms other gradient-based and regularization-based lifelong learning methods, achieving a 43% Mel-Cepstral Distortion reduction compared to a fine-tuning baseline.
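A dual-sampler replay batch can be sketched as below: one sampler draws from a small replay buffer of earlier languages, the other from the new language's data, keeping each batch balanced despite the language imbalance. The function name and batch composition are illustrative assumptions.

```python
# Sketch: compose each training batch from a replay sampler (old languages)
# and a new-language sampler, in a fixed ratio.
import random

def dual_sample_batch(replay_buffer, new_language_data,
                      batch_size=16, replay_ratio=0.5):
    n_replay = int(batch_size * replay_ratio)
    batch = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    batch += random.sample(new_language_data, batch_size - len(batch))
    random.shuffle(batch)
    return batch  # trained with the ordinary supervised TTS loss

batch = dual_sample_batch(list(range(100)), list(range(1000, 1100)))
```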
Compared with traditional text-to-speech (TTS) systems, conversational TTS systems are required to synthesize speech with a speaking style that conforms to the conversational context. However, state-of-the-art context modeling methods in conversational TTS only model the textual information in the context with a recurrent neural network (RNN). Such methods have limited ability to model the inter-speaker influence in conversations, and they also neglect speaking styles and the intra-speaker inertia of each speaker. Inspired by DialogueGCN and its superiority over RNN-based approaches in modeling such conversational influences, we propose a graph-based multi-modal context modeling method and adopt it in conversational TTS to enhance the speaking styles of synthesized speech. Both the textual and the speaking style information in the context are extracted and processed by DialogueGCN to model the inter- and intra-speaker influences in conversations. The outputs of DialogueGCN are then summarized by an attention mechanism and converted into an enhanced speaking style for the current utterance. An English conversation corpus was collected and annotated for our research and released to the public. Experimental results on this corpus demonstrate the effectiveness of the proposed approach, which outperforms the state-of-the-art context modeling method in conversational TTS in both MOS and ABX preference rate.
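The summarization step can be sketched as dot-product attention in which the current utterance's representation queries the DialogueGCN outputs of the context turns; the dimensions and the plain dot-product scoring are illustrative assumptions.

```python
# Sketch: attention-summarize the DialogueGCN node outputs into a single
# enhanced speaking-style vector for the current utterance.
import torch
import torch.nn.functional as F

def summarize_context(gcn_outputs, current_repr):
    """gcn_outputs: (n_turns, d) context node outputs; current_repr: (d,)."""
    scores = gcn_outputs @ current_repr          # (n_turns,) attention scores
    weights = F.softmax(scores, dim=0)
    return weights @ gcn_outputs                 # (d,) enhanced style vector

style = summarize_context(torch.randn(6, 128), torch.randn(128))
```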