In this paper, we present a novel architecture to realize fine-grained style control on transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention block provides a good inductive bias to gradually infuse the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating LST during training and using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.
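As a rough illustration of the fusion block described above, here is a minimal PyTorch sketch of cross-attention over local style tokens with a skip connection, plus the random LST truncation; the module names, dimensions, and truncation ratio are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StyleCrossAttentionBlock(nn.Module):
    """Hypothetical sketch: fuse the phoneme (content) sequence with local
    style tokens (LST) via cross-attention and a skip connection."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, content, style_tokens):
        # content: (B, T_phoneme, dim); style_tokens: (B, T_style, dim)
        fused, _ = self.attn(query=content, key=style_tokens, value=style_tokens)
        # Skip connection: the block only adds style information on top of content.
        return self.norm(content + fused)

def random_truncate_lst(style_tokens, min_keep=0.5):
    """Randomly truncate the LST sequence during training so that it cannot
    carry the full linguistic content of the reference speech (assumed ratio)."""
    T = style_tokens.size(1)
    keep = torch.randint(max(1, int(T * min_keep)), T + 1, (1,)).item()
    return style_tokens[:, :keep, :]
```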
• Efficient compression of the huge corpus-based TTS unit selection acoustic space.
• A novel look-up of acoustic units' concatenation costs as a seq-2-seq problem.
• Efficient compression of concatenation costs by using an LSTM model.
• Reduction of the memory footprint by over 90% and of the look-up time by over 70%.
Large acoustic inventories must be used to produce speech close to natural quality. However, the concatenation cost space grows exponentially with the number of acoustic units in the acoustic inventory, increasing the latency of the unit selection algorithm and making such algorithms unusable in real-time end-to-end systems. Even when data compression techniques are introduced, the model size is still high, representing a challenge for end-to-end systems. Thus, in this paper, we propose representing the concatenation cost space using an LSTM (Long Short-Term Memory) network. The results show a 90% reduction in the size of the data space compared to all our previous techniques and an over 70% decrease in the look-up time. The proposed LSTM-based compression significantly increases the responsiveness of corpus-based text-to-speech systems while keeping the overall speech quality at the same level.
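To make the look-up idea concrete, the following is a minimal PyTorch sketch of an LSTM that predicts concatenation costs for a sequence of candidate joins; the inputs, embedding sizes, and training target are illustrative assumptions rather than the exact formulation in the paper.

```python
import torch
import torch.nn as nn

class ConcatCostLSTM(nn.Module):
    """Hypothetical sketch: predict concatenation (join) costs for a sequence
    of candidate unit pairs, replacing a large precomputed cost table."""
    def __init__(self, n_units=100_000, emb_dim=64, hidden=128):
        super().__init__()
        self.left = nn.Embedding(n_units, emb_dim)   # left unit of each join
        self.right = nn.Embedding(n_units, emb_dim)  # right unit of each join
        self.lstm = nn.LSTM(2 * emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, left_ids, right_ids):
        # left_ids, right_ids: (B, T) unit indices of candidate joins along a path
        x = torch.cat([self.left(left_ids), self.right(right_ids)], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)  # (B, T) predicted join costs

# Training would regress these predictions against the true concatenation costs,
# so only the model weights, not the full cost table, need to be stored.
```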
This paper proposes deep Gaussian process (DGP)-based frameworks for multi-speaker speech synthesis and speaker representation learning. A DGP has a deep architecture of Bayesian kernel regression, and it has been reported that DGP-based single speaker speech synthesis outperforms deep neural network (DNN)-based ones in the framework of statistical parametric speech synthesis. By extending this method to multiple speakers, it is expected that higher speech quality can be achieved with a smaller number of training utterances from each speaker. To apply DGPs to multi-speaker speech synthesis, we propose two methods: one using DGP with one-hot speaker codes, and the other using a deep Gaussian process latent variable model (DGPLVM). The DGP with one-hot speaker codes uses additional GP layers to transform speaker codes into latent speaker representations. The DGPLVM directly models the distribution of latent speaker representations and learns it jointly with acoustic model parameters. In this method, acoustic speaker similarity is expressed in terms of the similarity of the speaker representations, and thus, the voices of similar speakers are efficiently modeled. We experimentally evaluated the performance of the proposed methods in comparison with those of conventional DNN and variational autoencoder (VAE)-based frameworks, in terms of acoustic feature distortion and subjective speech quality. The experimental results demonstrate that (1) the proposed DGP-based and DGPLVM-based methods improve subjective speech quality compared with a feed-forward DNN-based method, and (2) even when the amount of training data for target speakers is limited, the DGPLVM-based method outperforms other methods, including the VAE-based one. Additionally, (3) by using a speaker representation randomly sampled from the learned speaker space, the DGPLVM-based method can generate voices of non-existent speakers.
• Deep Gaussian processes are effective in multi-speaker text-to-speech synthesis.
• A deep Gaussian process with one-hot speaker codes outperforms a deep neural network.
• Learning latent speaker representations improves speech quality with scarce data.
• The learned speaker space can be utilized to generate voices of non-existent speakers.
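The following is a minimal sketch of the latent-speaker-space idea only (an ordinary neural acoustic model, not an actual deep Gaussian process): each training speaker owns a latent vector learned jointly with the acoustic model, and a non-existent speaker can be obtained by sampling from the prior over that space. All names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LatentSpeakerAcousticModel(nn.Module):
    """Hypothetical sketch of the latent-speaker-space idea (not a DGP): each
    training speaker owns a latent vector that is optimized jointly with the
    acoustic model, so acoustically similar speakers end up close in the space."""
    def __init__(self, n_speakers, z_dim=16, ling_dim=300, out_dim=80):
        super().__init__()
        self.speaker_z = nn.Embedding(n_speakers, z_dim)  # latent representations
        self.net = nn.Sequential(
            nn.Linear(ling_dim + z_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, linguistic_feats, speaker_ids):
        # linguistic_feats: (B, T, ling_dim); speaker_ids: (B,)
        z = self.speaker_z(speaker_ids)
        z = z.unsqueeze(1).expand(-1, linguistic_feats.size(1), -1)
        return self.net(torch.cat([linguistic_feats, z], dim=-1))

# A "non-existent" speaker can be synthesized by replacing self.speaker_z(speaker_ids)
# with a representation sampled from the prior, e.g. z = torch.randn(1, 16).
```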
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information, consisting of the preceding acoustic context and the bilateral textual context, to improve the prosody of synthetic speech. Previous work uses either unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, which enables the model to predict context-dependent prosody. We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset. Experimental results demonstrate that the proposed method significantly outperforms two previous works. Additionally, we present insights about the different choices of context (modalities, lateral information, and length) for audiobook TTS that have not been discussed in the literature before.
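A minimal sketch of how the two context encoders might be combined, assuming a recurrent acoustic context encoder and pre-computed sentence embeddings for the bilateral textual context; the dimensions and fusion scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultimodalContextEncoder(nn.Module):
    """Hypothetical sketch: aggregate the preceding acoustic context and the
    bilateral (previous and next sentence) textual context into one vector
    that conditions the TTS model."""
    def __init__(self, mel_dim=80, text_dim=768, ctx_dim=128):
        super().__init__()
        self.acoustic = nn.GRU(mel_dim, ctx_dim, batch_first=True)
        self.textual = nn.Linear(2 * text_dim, ctx_dim)
        self.merge = nn.Linear(2 * ctx_dim, ctx_dim)

    def forward(self, prev_mel, prev_text_emb, next_text_emb):
        # prev_mel: (B, T, mel_dim); *_text_emb: (B, text_dim) sentence embeddings
        _, h = self.acoustic(prev_mel)            # h: (1, B, ctx_dim)
        acoustic_ctx = h[-1]
        textual_ctx = self.textual(torch.cat([prev_text_emb, next_text_emb], dim=-1))
        return self.merge(torch.cat([acoustic_ctx, textual_ctx], dim=-1))
```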
We propose a speech analysis-synthesis and deep neural network (DNN)-based text-to-speech (TTS) synthesis framework using Gaussian mixture model (GMM)-based approximation of full-band spectral envelopes. GMMs have excellent properties as acoustic features in statistical parametric speech synthesis. Each Gaussian function of a GMM fits a local resonance of the spectrum. The GMM retains the fine spectral envelope and achieves high controllability of the structure. However, since conventional speech analysis methods (i.e., GMM parameter estimation) have been formulated for narrow-band speech, they degrade the quality of synthetic speech. Moreover, a DNN-based TTS synthesis method using GMM-based approximation has not been formulated in spite of its excellent expressive ability. Therefore, we employ peak-picking-based initialization for full-band speech analysis to provide better initialization for iterative estimation of the GMM parameters. We introduce not only the prediction error of GMM parameters but also the reconstruction error of the spectral envelopes as objective criteria for training the DNN. Furthermore, we propose a method for multi-task learning based on minimizing these errors simultaneously. We also propose a post-filter based on variance scaling of the GMM for our framework to enhance synthetic speech. Experimental results from evaluating our framework indicated that 1) the initialization method of our framework outperformed the conventional one in the quality of analysis-synthesized speech; 2) introducing the reconstruction error in DNN training significantly improved the synthetic speech; and 3) our variance-scaling-based post-filter further improved the synthetic speech.
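A small numerical sketch of the GMM-based envelope approximation and the variance-scaling post-filter, under the assumption that the (log) spectral envelope is modeled as a weighted sum of Gaussian functions over frequency; the exact parameterization in the paper may differ.

```python
import numpy as np

def gmm_spectral_envelope(freqs, weights, means, variances):
    """Hypothetical sketch: approximate a (log) spectral envelope as a weighted
    sum of Gaussian functions, one per local resonance."""
    diff = freqs[None, :] - means[:, None]                          # (K, F)
    comps = weights[:, None] * np.exp(-0.5 * diff ** 2 / variances[:, None])
    return comps.sum(axis=0)                                        # (F,)

def variance_scaling_postfilter(variances, alpha=0.8):
    """Assumed form of a variance-scaling post-filter: shrinking the Gaussian
    variances sharpens the peaks of the synthesized envelope."""
    return alpha * variances

# Example: three Gaussians approximating an envelope on a 0-8 kHz axis.
freqs = np.linspace(0.0, 8000.0, 512)
envelope = gmm_spectral_envelope(
    freqs,
    weights=np.array([1.0, 0.6, 0.4]),
    means=np.array([500.0, 1500.0, 2500.0]),
    variances=np.array([4e4, 9e4, 1.6e5]),
)
```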
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained with Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available.
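Purely as an illustration of mixing supervised and unsupervised training schemes (not the actual Virtuoso/Maestro objectives), one could route heterogeneous batches to different losses; all names below are hypothetical placeholders.

```python
def joint_training_step(batch, tts_loss, asr_loss, speech_loss, text_loss):
    """Hypothetical sketch: route heterogeneous batches to different objectives.
    The loss callables and batch layout are placeholders, not Virtuoso's API."""
    if batch["kind"] == "paired":        # transcribed speech (TTS/ASR pairs)
        return tts_loss(batch) + asr_loss(batch)
    if batch["kind"] == "speech_only":   # untranscribed speech
        return speech_loss(batch)
    return text_loss(batch)              # unspoken text
```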
Given the similarity between music and speech synthesis from symbolic input and the rapid development of text-to-speech (TTS) techniques, it is worthwhile to explore ways to improve MIDI-to-audio performance by borrowing from TTS techniques. In this study, we analyze the shortcomings of a TTS-based MIDI-to-audio system and improve it in terms of feature computation, model selection, and training strategy, aiming to synthesize highly natural-sounding audio. Moreover, we conduct an extensive model evaluation through listening tests, pitch measurement, and spectrogram analysis. This work not only demonstrates the synthesis of highly natural music but also offers a thorough analytical approach and useful outcomes for the community. Our code, pre-trained models, supplementary materials, and audio samples are open-sourced at https://github.com/nii-yamagishilab/midi-to-audio.
• We propose unsupervised text-to-speech synthesis using subword tokenization and prosodic-context extraction.
• The subword tokenization can determine language units suitable for prosody generation.
• The context extraction can retrieve contexts from pairs of subwords and prosody.
• Experimental evaluation demonstrates that the proposed methods outperform conventional methods in terms of synthetic speech quality.
This paper presents text tokenization and context extraction without using language knowledge for text-to-speech (TTS) synthesis. To generate prosody, statistical parametric TTS synthesis typically requires professional knowledge of the target language. Therefore, languages suitable for TTS synthesis are limited to rich-resource languages. To achieve TTS synthesis without using language knowledge, we propose acoustic model-based subword tokenization and unsupervised extraction of prosodic contexts. The subword tokenization can determine language units suitable for prosody generation. The context extraction can retrieve contexts from pairs of subwords and prosody. The proposed methods function without language knowledge and can improve F0 prediction accuracy. Experimental evaluation demonstrates that 1) the training of the proposed subword tokenization, which uses the expectation-maximization algorithm and deep neural networks, is empirically stable, 2) the proposed subword tokenization tokenizes text into subwords that are close to language-specific units, and 3) the proposed methods outperform the conventional methods using language model-based tokenization in terms of synthetic speech quality.
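As an illustration of the unsupervised prosodic-context extraction step, here is a minimal sketch that summarizes the F0 contour aligned to each subword and quantizes the summaries into discrete context classes; the features, clustering method, and number of contexts are assumptions, not the paper's exact procedure (which couples the contexts with the learned subword tokenization).

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_prosodic_contexts(f0_segments, n_contexts=16):
    """Hypothetical sketch: summarize the F0 contour aligned to each subword
    and quantize the summaries into discrete prosodic context classes, which
    are then used together with the subword identity."""
    feats = []
    for f0 in f0_segments:                      # one F0 array per subword
        voiced = f0[f0 > 0]
        mean = float(voiced.mean()) if voiced.size else 0.0
        slope = float(voiced[-1] - voiced[0]) if voiced.size > 1 else 0.0
        feats.append([mean, slope])
    return KMeans(n_clusters=n_contexts, n_init=10).fit_predict(np.array(feats))
```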
Automatic dubbing, which generates a corresponding version of the input speech in another language, can be widely utilized in many real-world scenarios, such as video and game localization. In addition to synthesizing the translated scripts, automatic dubbing further transfers the speaking style of the original language to the dubbed speech to give audiences the impression that the characters are speaking in their native tongue. However, state-of-the-art automatic dubbing systems only model the transfer of duration and speaking rate, disregarding other aspects of speaking style, such as emotion, intonation, and emphasis, which are also crucial to fully understanding the characters and speech. In this paper, we propose a joint multiscale cross-lingual speaking style transfer framework to simultaneously model the bidirectional speaking style transfer between two languages at both the global scale (i.e., utterance level) and the local scale (i.e., word level). The global and local speaking styles in each language are extracted and utilized to predict the global and local speaking styles in the other language, with an encoder-decoder framework for each direction and a shared bidirectional attention mechanism for both directions. A multiscale speaking style-enhanced FastSpeech 2 is then utilized to synthesize the desired speech with the predicted global and local speaking styles for each language. The experimental results demonstrate the effectiveness of our proposed framework, which outperforms a baseline with only duration transfer in both objective and subjective evaluations.
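A minimal PyTorch sketch of one transfer direction of such a style predictor, assuming word-level (local) style vectors and an utterance-level (global) style vector as inputs; the shared bidirectional attention and the FastSpeech 2 integration are omitted, and all module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CrossLingualStylePredictor(nn.Module):
    """Hypothetical sketch of one transfer direction: predict word-level (local)
    styles in the target language from the source-language local styles,
    conditioned on the utterance-level (global) style."""
    def __init__(self, style_dim=128, n_heads=4):
        super().__init__()
        self.encoder = nn.GRU(style_dim, style_dim, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(style_dim, n_heads, batch_first=True)
        self.decoder = nn.GRU(2 * style_dim, style_dim, batch_first=True)

    def forward(self, src_local, src_global, tgt_word_queries):
        # src_local: (B, N_src, D) source word-level styles
        # src_global: (B, D) source utterance-level style
        # tgt_word_queries: (B, N_tgt, D) queries for the target-language words
        enc, _ = self.encoder(src_local)
        attended, _ = self.cross_attn(tgt_word_queries, enc, enc)
        g = src_global.unsqueeze(1).expand(-1, attended.size(1), -1)
        out, _ = self.decoder(torch.cat([attended, g], dim=-1))
        return out  # predicted target local styles, to condition FastSpeech 2
```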
Text-to-speech synthesis (TTS) is the task of converting text into speech. Two of the factors that have been driving TTS are the advancements of probabilistic models and latent representation learning. We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and a variational autoencoder (VAE). In our TTS method, we use a waveform model based on a VAE, a diffusion model that predicts the distribution of latent variables in the waveform model from text, and an alignment model that learns alignments between the text and speech latent sequences. Our method integrates diffusion with the VAE by modeling both the mean and variance parameters with diffusion, where the target distribution is determined by approximation from the VAE. This latent variable conversion framework potentially enables us to flexibly incorporate various latent feature extractors. Our experiments show that our method is robust to linguistic labels with ambiguous orthography and alignment errors.
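A minimal sketch of the latent-conversion idea, where a text-conditioned predictor outputs the mean and variance of the waveform VAE's latent variables and is trained toward the VAE posterior; the paper realizes this predictor with a diffusion model, which is omitted here, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TextToLatentConverter(nn.Module):
    """Hypothetical sketch: a text-conditioned predictor of the mean and
    variance of the waveform VAE's latent variables (the paper realizes this
    predictor as a diffusion model, omitted here)."""
    def __init__(self, text_dim=256, z_dim=64):
        super().__init__()
        self.rnn = nn.GRU(text_dim, 256, batch_first=True)
        self.to_mean = nn.Linear(256, z_dim)
        self.to_logvar = nn.Linear(256, z_dim)

    def forward(self, text_feats):
        # text_feats: (B, T, text_dim), assumed already aligned to the latents
        h, _ = self.rnn(text_feats)
        return self.to_mean(h), self.to_logvar(h)

def latent_kl_loss(pred_mean, pred_logvar, vae_mean, vae_logvar):
    """KL from the predicted Gaussian to the VAE posterior, a simplified
    stand-in for the diffusion objective used in the paper."""
    kl = 0.5 * (vae_logvar - pred_logvar
                + (pred_logvar.exp() + (pred_mean - vae_mean) ** 2) / vae_logvar.exp()
                - 1.0)
    return kl.mean()
```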