This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of the output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality: it is challenging to produce high-quality speech in a low-latency setup that makes little use of the unobserved future part of the sentence (hereafter, "lookahead"). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model to take future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading, and it uses pretrained GPT-2, which embodies large-scale linguistic knowledge, for lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than a method that takes only observed information into account and 2) achieves speech quality equivalent to that obtained by waiting for the future context to be observed.
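As a rough illustration of the pseudo-lookahead idea, the following Python sketch (assuming the HuggingFace transformers package; the function name and generation settings are ours, not the paper's) lets pretrained GPT-2 continue the observed prefix, which the synthesizer could then treat as future context:

```python
# Minimal sketch: generate a pseudo lookahead by continuing the observed
# prefix with pretrained GPT-2 (HuggingFace transformers assumed).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def pseudo_lookahead(observed_text: str, max_new_tokens: int = 10) -> str:
    """Continue the observed prefix to approximate the unseen future context."""
    inputs = tokenizer(observed_text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,                      # greedy continuation, for a deterministic sketch
        pad_token_id=tokenizer.eos_token_id,
    )
    # keep only the newly generated tokens (the pseudo lookahead)
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])

# The incremental synthesizer would then condition on
# observed_text + pseudo_lookahead(observed_text) instead of waiting
# for the real future words to arrive.
```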
To leverage crowd-sourced data to train multi-speaker text-to-speech (TTS) models that can synthesize clean speech for all speakers, it is essential to learn disentangled representations which can independently control the speaker identity and background noise in generated signals. However, learning such representations can be challenging, due to the lack of labels describing the recording conditions of each training example, and the fact that speakers and recording conditions are often correlated, e.g., since users often make many recordings using the same equipment. This paper proposes three components to address this problem by: (1) formulating a conditional generative model with factorized latent variables, (2) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (3) using adversarial factorization to improve disentanglement. Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers. Ablation studies verify the importance of each proposed component.
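One common way to realize the adversarial-factorization component is a gradient reversal layer that penalizes any speaker information remaining in the noise latent. The PyTorch sketch below is a minimal illustration under that assumption; the module and variable names are ours, not the paper's actual implementation:

```python
# Sketch: gradient reversal discourages the noise latent from carrying
# speaker identity, improving disentanglement between the two factors.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                 # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # reversed gradient on the backward pass

class NoiseLatentSpeakerCritic(nn.Module):
    """Predicts speaker ID from the noise latent; trained adversarially."""
    def __init__(self, latent_dim: int, n_speakers: int):
        super().__init__()
        self.clf = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_speakers))

    def forward(self, z_noise, lam: float = 1.0):
        return self.clf(GradReverse.apply(z_noise, lam))

# Training: the critic's cross-entropy is added to the generative loss, so
# the encoder is pushed to strip speaker information out of z_noise:
#   loss = recon_loss + ce(critic(z_noise), speaker_id)
```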
This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated the field over the last decades. Finally, advanced techniques for future developments are described.
Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron is able to generate speech in a variety of styles, ranging from read speech to expressive speech, from slow drawls to rap, and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data without alignments between text and audio. We evaluate our models using the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers, and styles not seen during training, procedural manipulation of rhythm and pitch, and choir synthesis.
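For context, a continuous pitch contour of the kind Mellotron conditions on can be extracted with an off-the-shelf pitch tracker; the sketch below assumes librosa with illustrative frame settings, not Mellotron's actual front end, and "reference.wav" is a placeholder:

```python
# Sketch: per-frame F0 contour extraction with the YIN algorithm (librosa).
import librosa

y, sr = librosa.load("reference.wav", sr=22050)   # placeholder reference audio
f0 = librosa.yin(y,
                 fmin=librosa.note_to_hz("C2"),
                 fmax=librosa.note_to_hz("C6"),
                 sr=sr, frame_length=1024, hop_length=256)
# f0 holds one pitch value per frame; together with per-token durations
# (rhythm), it forms the explicit conditioning signal described above.
```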
• We compare sequence-to-sequence text-to-speech synthesis systems with pipeline text-to-speech synthesis systems.
• We investigate sequence-to-sequence text-to-speech from three aspects: a) model architecture, b) model parameter size, and c) language (Japanese/English).
• We describe sequence-to-sequence and pipeline text-to-speech synthesis in detail and configure fairer experimental conditions.
• Our Japanese experiment shows that increasing the model parameter size makes it possible to compensate for a lack of rich linguistic features.
• Our English experiment shows that a suitable encoder structure makes it possible to disambiguate pronunciations.
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or from simple linguistic features such as phonemes. Unlike traditional pipeline TTS, neural sequence-to-sequence TTS does not require manually annotated, complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from the input features.
In this paper, we investigate under what conditions neural sequence-to-sequence TTS works well in Japanese and English, comparing it with deep neural network (DNN)-based pipeline TTS systems. Unlike past comparative studies, the pipeline systems here also use neural autoregressive (AR) probabilistic modeling and a neural vocoder, in the same way as the sequence-to-sequence systems do, which enables a fair and in-depth analysis. We investigate the systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt the modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we investigate two model parameter sizes. For the language aspect, we conduct listening tests in both Japanese and English to see whether our findings generalize across languages.
Our experiments on Japanese demonstrated that the Tacotron TTS systems with an increased parameter size and with phonemes and accentual-type labels as input outperformed the DNN-based pipeline systems using the complicated linguistic features, and that their encoder could learn to compensate for a lack of rich linguistic features. Our experiments on English demonstrated that, when using a suitable encoder, the Tacotron TTS system with characters as input can disambiguate pronunciations and produce speech as natural as that of the systems using phonemes. However, we also found that the encoder could not learn English stressed syllables from characters perfectly, which resulted in flatter fundamental frequency contours. In summary, these experimental results suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should use a powerful encoder when it takes characters as input, and c) the encoder still has room for improvement and needs an improved architecture to learn suprasegmental features more appropriately.
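To make the "powerful encoder" point concrete, the sketch below shows a Tacotron2-style character encoder (a convolution stack followed by a bidirectional LSTM) in PyTorch; the layer sizes are the commonly cited Tacotron2 defaults, not necessarily those of the systems compared here:

```python
# Sketch: Tacotron2-style character encoder, the kind of architecture the
# study finds necessary to disambiguate pronunciations from raw characters.
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, n_chars: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim)
        # three conv layers capture local character context (e.g., digraphs)
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                          nn.BatchNorm1d(dim), nn.ReLU())
            for _ in range(3)
        ])
        # the BiLSTM models longer-range, supra-segmental structure
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                     # (batch, seq)
        x = self.embed(char_ids).transpose(1, 2)     # (batch, dim, seq)
        for conv in self.convs:
            x = conv(x)
        x, _ = self.lstm(x.transpose(1, 2))          # (batch, seq, dim)
        return x
```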
In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration, and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario, and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
• Prosodic clustering for fine-grained phoneme-level prosody control.
• Controllable end-to-end text-to-speech synthesis using intuitive discrete labels.
• Multispeaker prosody control, with application to unseen speaker adaptation.
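A minimal sketch of the unsupervised prosodic clustering step described above, assuming scikit-learn; the balanced and speaker-independent clustering refinements are simplified here to plain k-means on per-speaker-normalized features, and the feature arrays are placeholders:

```python
# Sketch: discretize phoneme-level F0 and duration into intuitive labels
# by clustering each feature separately over the whole multispeaker corpus.
import numpy as np
from sklearn.cluster import KMeans

# Placeholder corpus-level features: per-phoneme mean F0 (normalized per
# speaker) and per-phoneme duration, one row per phoneme instance.
phoneme_f0 = np.random.randn(10000, 1)
phoneme_dur = np.abs(np.random.randn(10000, 1))

f0_labels = KMeans(n_clusters=5, n_init=10).fit_predict(phoneme_f0)
dur_labels = KMeans(n_clusters=5, n_init=10).fit_predict(phoneme_dur)
# The resulting discrete label sequences are what the prosody encoder
# consumes to augment the attention-based TTS model.
```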
A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network techniques can be applied to artificially synthesize speech waveforms, the quality of the synthetic speech is low compared with that of natural speech. One of the issues causing this quality degradation is an oversmoothing effect often observed in the generated speech parameters. The GAN introduced in this paper consists of two neural networks: a discriminator that distinguishes natural from generated samples, and a generator that deceives the discriminator. In the proposed framework incorporating GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator. Since the objective of the GAN is to minimize the divergence (i.e., the distribution difference) between the natural and generated speech parameters, the proposed method effectively alleviates the oversmoothing effect on the generated speech parameters. We evaluated the effectiveness of the method for text-to-speech synthesis and voice conversion, and found that it generates more natural spectral parameters and F0 than the conventional minimum generation error training algorithm, regardless of its hyperparameter settings. Furthermore, we investigated the effect of the divergences of various GANs, and found that a Wasserstein GAN minimizing the Earth Mover's distance works best in terms of improving synthetic speech quality.
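The training objective, i.e., the weighted sum of the minimum generation loss and the adversarial loss, can be written compactly; the PyTorch sketch below is illustrative, with w_adv standing in for the paper's weighting hyperparameter:

```python
# Sketch: acoustic-model loss = minimum generation loss + weighted
# adversarial loss for deceiving the natural/generated discriminator.
import torch
import torch.nn.functional as F

def generator_loss(y_gen, y_nat, disc, w_adv: float = 1.0):
    mge = F.mse_loss(y_gen, y_nat)          # conventional minimum generation loss
    logits = disc(y_gen)                    # discriminator's natural-vs-generated score
    # adversarial term: push generated parameters toward being judged "natural"
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return mge + w_adv * adv
```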
Non-autoregressive architectures for neural text-to-speech (TTS) allow for parallel implementation and thus reduce inference time compared to their autoregressive counterparts. However, such a system architecture does not explicitly model the temporal dependency of the acoustic signal, as it generates individual acoustic frames independently. The lack of temporal modeling often adversely impacts speech continuity, and hence voice quality. In this paper, we propose a novel neural TTS model, denoted FastTalker. We study two strategies for high-quality speech synthesis at low computational cost. First, we add a shallow autoregressive acoustic decoder on top of the non-autoregressive context decoder to recover the temporal information of the acoustic signal. Second, we further implement group autoregression to accelerate the inference of the autoregressive acoustic decoder: the group-based autoregressive acoustic decoder generates acoustic features as a sequence of groups instead of frames, each group containing multiple consecutive frames, and within a group the acoustic features are generated in parallel. With shallow and group autoregression, FastTalker recovers the temporal information of the acoustic signal while keeping the fast-decoding property. The proposed FastTalker achieves a good balance between speech quality and inference speed. Experiments show that, in terms of voice quality and naturalness, FastTalker outperforms the non-autoregressive FastSpeech baseline significantly and is on par with the autoregressive baselines. It also shows a considerable inference speedup over Tacotron2 and Transformer TTS.
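The group autoregression idea can be sketched as follows; decoder_step and its interface are hypothetical stand-ins for FastTalker's actual modules, shown only to make the group-by-group generation order explicit:

```python
# Sketch: group-autoregressive decoding. Each step emits a whole group of
# consecutive frames in parallel, conditioned on the previous group, so the
# number of sequential steps is n_frames / group_size instead of n_frames.
import torch

def group_autoregressive_decode(decoder_step, context, n_frames, group_size=4):
    frames = []
    # zero "go" group to seed the autoregression: (batch, group_size, feat_dim)
    prev_group = torch.zeros(context.size(0), group_size, context.size(-1))
    for t in range(0, n_frames, group_size):
        # one call emits group_size frames at once, conditioned on the
        # encoder context for this span and the previously generated group
        prev_group = decoder_step(context[:, t:t + group_size], prev_group)
        frames.append(prev_group)
    return torch.cat(frames, dim=1)[:, :n_frames]
```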
In recent years, the quality of text-to-speech (TTS) synthesis has vastly improved due to deep-learning techniques, with parallel architectures in particular providing excellent synthesis quality with fast inference. Training these models usually requires speech recordings, corresponding phoneme-level transcripts, and the temporal alignment of each phoneme to the utterances. Since manually creating such fine-grained alignments requires expert knowledge and is time-consuming, it is common practice to estimate them using automatic speech-phoneme alignment methods. In the literature, either the estimation methods' accuracy or their impact on the TTS system's synthesis quality is evaluated. In this study, we perform experiments with five state-of-the-art speech-phoneme aligners and evaluate their output with objective and subjective measures. As our main result, we show that small alignment errors (below 75 ms) do not decrease the synthesis quality, which implies that the alignment error may not be the crucial factor when choosing an aligner for TTS training.
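As a simple example of the objective side of such an evaluation, the sketch below computes the mean absolute phoneme-boundary error, in milliseconds, between a predicted and a reference alignment; the boundary lists are placeholders, and real ones would come from the aligners' output:

```python
# Sketch: mean absolute boundary error between two alignments, the kind of
# objective measure used to compare aligners against the 75 ms threshold.
import numpy as np

def mean_boundary_error_ms(pred_boundaries, ref_boundaries):
    """Boundaries are phoneme start/end times in seconds; returns error in ms."""
    pred = np.asarray(pred_boundaries, dtype=float)
    ref = np.asarray(ref_boundaries, dtype=float)
    return float(np.mean(np.abs(pred - ref)) * 1000.0)

# Placeholder example: three phoneme boundaries, ~20 ms average error,
# well below the 75 ms level reported not to hurt synthesis quality.
print(mean_boundary_error_ms([0.10, 0.32, 0.55], [0.08, 0.30, 0.57]))
```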