We present progress towards bilingual Text-to-Speech that can transform a monolingual voice to speak a second language while preserving speaker voice quality. We demonstrate that a bilingual speaker embedding space contains a separate distribution for each language, and that a simple transform in this speaker embedding space can be used to control the degree of accent of a synthetic voice in a given language. The same transform can be applied even to monolingual speakers. In our experiments, speaker data from an English-Spanish (Mexican) bilingual speaker was used, and the goal was to enable English speakers to speak Spanish and Spanish speakers to speak English. We found that the simple transform was sufficient to convert a voice from one language to the other with a high degree of naturalness; in one case the transformed voice outperformed a native-language voice in listening tests. Experiments further indicated that the transform preserved many of the characteristics of the original voice. The degree of accent can be controlled, and naturalness remains relatively consistent across a range of accent values.
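The abstract does not spell out the transform itself; the following is a minimal sketch of one plausible realization, assuming the transform is a linear shift along the vector between the per-language means of the speaker embedding distributions (all names, dimensions, and the random stand-in embeddings are illustrative):

```python
import numpy as np

def accent_transform(spk_emb, mean_src, mean_tgt, alpha=1.0):
    """Shift a speaker embedding from the source-language region of the
    embedding space toward the target-language region.

    alpha = 0.0 leaves the voice unchanged; alpha = 1.0 applies the full
    cross-language shift; intermediate values give intermediate accents.
    """
    return spk_emb + alpha * (mean_tgt - mean_src)

# Hypothetical usage: per-language means estimated from learned embeddings.
english_embs = np.random.randn(100, 256)   # stand-ins for real embeddings
spanish_embs = np.random.randn(120, 256)
mean_en = english_embs.mean(axis=0)
mean_es = spanish_embs.mean(axis=0)

voice = english_embs[0]
for alpha in (0.0, 0.5, 1.0):              # sweep the degree of accent
    shifted = accent_transform(voice, mean_en, mean_es, alpha)
```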
The performance of text-to-speech (TTS) systems heavily depends on spectrogram-to-waveform generation, also known as the speech reconstruction phase. The time required for this phase is known as synthesis delay. In this paper, an approach to reduce speech synthesis delay is proposed, aiming to make TTS systems more suitable for real-time applications such as digital assistants, mobile phones, and embedded devices. The proposed approach applies the Fast Griffin-Lim Algorithm (FGLA) instead of the Griffin-Lim Algorithm (GLA) as the vocoder in the speech synthesis phase. Both GLA and FGLA are iterative, but FGLA converges faster than GLA. The proposed approach is tested on the LJSpeech, Blizzard, and Tatoeba datasets, and the results for FGLA are compared against GLA and a neural Generative Adversarial Network (GAN)-based vocoder. Performance is evaluated in terms of synthesis delay and speech quality. A 36.58% reduction in speech synthesis delay is observed, and output speech quality improves, as indicated by higher Mean Opinion Scores (MOS) and faster convergence with FGLA compared to GLA.
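For a concrete picture of the GLA/FGLA distinction, librosa's griffinlim implements both: its momentum parameter adds the acceleration term of the fast variant, and momentum=0 recovers the classic algorithm. A minimal sketch (not the paper's implementation):

```python
import numpy as np
import librosa

# Load a bundled example clip and compute a magnitude spectrogram to invert.
y, sr = librosa.load(librosa.ex('trumpet'))
S = np.abs(librosa.stft(y))

# Classic Griffin-Lim: momentum = 0 disables the acceleration term.
y_gla = librosa.griffinlim(S, n_iter=32, momentum=0.0)

# Fast Griffin-Lim (FGLA): the momentum term accelerates convergence,
# so comparable quality is reached in fewer iterations.
y_fgla = librosa.griffinlim(S, n_iter=32, momentum=0.99)
```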
Voice-driven devices (VDDs) like Google Home and Amazon Alexa, well-known connected devices in consumer IoT, have applications in various domains, e.g., home appliance automation, next-generation vehicles, and voice banking. However, these VDDs, which are based on automatic speaker verification (ASV) systems, are vulnerable to voice-based logical access (LA) attacks such as Text-to-Speech (TTS) synthesis and converted voice signals. Intruders can exploit these attacks to bypass the security of such systems and gain access to a victim's bank account or home controls. Thus, there is a need for an effective voice spoofing countermeasure that can reliably protect these VDDs against such malicious attacks. This work presents a novel audio feature descriptor named extended local ternary pattern (ELTP) to capture the dynamically induced vocal-tract attributes of bonafide speech and the algorithmic artifacts in synthetic and converted speech. We fuse the novel ELTP features with linear frequency cepstral coefficients (LFCC) to further strengthen their ability to capture the traits of bonafide and spoofed signals. We employ the proposed ELTP-LFCC features to train a deep bidirectional Long Short-Term Memory (DBiLSTM) network to classify bonafide and spoofed signals (i.e., TTS-synthesized and converted speech). The performance of our spoofing countermeasure is measured on the large-scale and diverse ASVspoof 2019 logical access dataset. Experimental results demonstrate that the proposed countermeasure can reliably detect LA spoofing attacks.
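The extended descriptor (ELTP) and its fusion with LFCC are specific to this work; as a rough illustration of the underlying idea, the sketch below computes basic 1-D local ternary pattern codes, in which each neighbor is coded +1/0/-1 against the center sample with a tolerance and the result is split into upper and lower binary patterns (threshold and window size here are arbitrary):

```python
import numpy as np

def ltp_codes(x, t=0.02, radius=4):
    """Basic 1-D local ternary pattern codes for a signal.

    For each sample, the 2*radius neighbors are coded +1 / 0 / -1 against
    the center value with tolerance t, then split into 'upper' (where the
    code is +1) and 'lower' (where it is -1) binary patterns -- the
    standard LTP decomposition.
    """
    upper, lower = [], []
    for i in range(radius, len(x) - radius):
        nbrs = np.concatenate([x[i - radius:i], x[i + 1:i + 1 + radius]])
        s = np.where(nbrs > x[i] + t, 1, np.where(nbrs < x[i] - t, -1, 0))
        weights = 2 ** np.arange(s.size)          # pack codes into integers
        upper.append(int(((s == 1) * weights).sum()))
        lower.append(int(((s == -1) * weights).sum()))
    return np.array(upper), np.array(lower)

# Illustrative usage on a random signal standing in for a speech frame.
frame = np.random.randn(64)
up, lo = ltp_codes(frame)
```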
This study concentrates on the investigation, development, and evaluation of Text-to-Speech synthesis systems based on Deep Learning models for the Azerbaijani language. We selected and compared two state-of-the-art models, Tacotron and Deep Convolutional Text-to-Speech (DC TTS), to determine the more suitable system. Both systems were trained on a 24-hour speech dataset of the Azerbaijani language collected and processed from a news website. To analyze the quality and intelligibility of the speech signals produced by the two systems, 34 listeners participated in an online survey containing subjective evaluation tests. The results indicated that, according to the Mean Opinion Score, Tacotron performed better on In-Vocabulary words, whereas DC TTS performed better on Out-Of-Vocabulary word synthesis.
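As a side note on the evaluation metric, a Mean Opinion Score is simply the average of listeners' 1-5 ratings, usually reported with a confidence interval; a minimal sketch with made-up ratings (not the study's data):

```python
import numpy as np
from scipy import stats

ratings = np.array([4, 5, 3, 4, 4, 5, 2, 4])   # hypothetical 1-5 scores
mos = ratings.mean()
# 95% confidence interval via the t-distribution over listener ratings.
ci_low, ci_high = stats.t.interval(0.95, len(ratings) - 1,
                                   loc=mos, scale=stats.sem(ratings))
print(f"MOS = {mos:.2f} (95% CI: {ci_low:.2f}-{ci_high:.2f})")
```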
We propose a method for obtaining disentangled speaker and language representations via mutual information minimization and domain adaptation for cross-lingual text-to-speech (TTS) synthesis. The proposed method extracts speaker and language embeddings from acoustic features with a speaker encoder and a language encoder, then applies domain adaptation to the two embeddings to obtain a language-invariant speaker embedding and a speaker-invariant language embedding. To further disentangle the representations, the method minimizes the mutual information between the two embeddings to remove entangled information within each. Disentangled speaker and language representations are critical for cross-lingual TTS, since entangled representations make it difficult to maintain speaker identity when the language representation changes, and consequently cause performance degradation. We evaluate the proposed method using English and Japanese multi-speaker datasets with a total of 207 speakers. Experimental results demonstrate that the proposed method significantly improves the naturalness and speaker similarity of both intra-lingual and cross-lingual TTS synthesis, and that it maintains speaker identity well across languages.
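The abstract does not detail the adversarial setup or the mutual-information estimator; the sketch below shows the gradient-reversal layer commonly used for this kind of domain adaptation, here attached to a hypothetical language classifier so that training pushes the speaker encoder toward language-invariant embeddings (dimensions and modules are illustrative):

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient's sign in the
    backward pass (the standard gradient-reversal trick)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical adversarial language classifier on the speaker embedding:
# minimizing its loss through the reversal layer maximizes the loss with
# respect to the encoder, removing language cues from the embedding.
lang_clf = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
spk_emb = torch.randn(8, 256, requires_grad=True)   # stand-in embeddings
lang_logits = lang_clf(grad_reverse(spk_emb))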
This paper proposes novel training algorithms for vocoder-free statistical parametric speech synthesis (SPSS) using short-term Fourier transform (STFT) spectra. Text-to-speech synthesis using STFT spectra has recently been investigated because it avoids the quality degradation caused by vocoder-based parameterization in conventional SPSS. For conventional SPSS with a vocoder, we previously proposed a training algorithm integrating generative adversarial network (GAN)-based distribution compensation. To extend that algorithm to vocoder-free SPSS, we propose low- and multi-resolution GAN-based training algorithms. In the algorithm using the low-resolution GAN, acoustic models are trained to minimize the weighted sum of the mean squared error between natural and generated spectra at the original resolution and an adversarial loss for deceiving discriminative models at a lower resolution. Since low-resolution spectra are close to filter banks and their distribution is simpler, GAN-based distribution compensation works well. Furthermore, we propose an algorithm using multi-resolution GANs, which combines the low-resolution GAN with the original-resolution GAN. Experimental results demonstrate that 1) the low-resolution GAN is robust to the settings of its frequency resolution and hyperparameter, and 2) among the low-, original-, and multi-resolution GANs, the low-resolution GAN improves synthetic speech quality the most.
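A minimal sketch of the generator-side objective described here, assuming the low-resolution spectra are obtained by average-pooling along the frequency axis; the pooling factor, loss weight, and stand-in discriminator are illustrative, not the paper's architecture:

```python
import torch
import torch.nn.functional as F

def low_res_gan_loss(gen_spec, nat_spec, disc_low, pool=4, w_adv=1.0):
    """Weighted sum of (a) MSE between natural and generated STFT spectra
    at the original resolution and (b) an adversarial term computed on
    frequency-downsampled (low-resolution) spectra.

    gen_spec, nat_spec: (batch, frames, freq_bins) magnitude spectra.
    disc_low:           discriminator operating on the pooled spectra.
    """
    mse = F.mse_loss(gen_spec, nat_spec)
    # Average-pool along the frequency axis: filter-bank-like spectra
    # with a simpler distribution, as motivated in the abstract.
    gen_low = F.avg_pool1d(gen_spec, kernel_size=pool)
    logits = disc_low(gen_low)
    # The acoustic model tries to make the low-resolution discriminator
    # label its output as natural.
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return mse + w_adv * adv

# Stand-in discriminator and data, purely for illustration.
B, T, Fbins, pool = 2, 50, 512, 4
disc = torch.nn.Sequential(torch.nn.Flatten(),
                           torch.nn.Linear(T * (Fbins // pool), 1))
loss = low_res_gan_loss(torch.rand(B, T, Fbins), torch.rand(B, T, Fbins), disc)
```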
This paper aims to design and validate a phonetically balanced speech corpus for the Arabic language. Designing a rich and phonetically balanced corpus in an optimal context is one of the key issues in building high-quality text-to-speech synthesis systems. Richness means the corpus must contain all possible phonemes in both right and left contexts, while balance means it respects the phonetic distribution of the language. We propose a new methodology for designing and implementing such a corpus for speech synthesis purposes. The paper explains the whole creation process, beginning with the design stage, then corpus creation, the recording phases, and finally the segmentation of the speech corpus. The resulting corpus contains 202 sentences with 6174 phonemes. To validate the corpus, an Arabic speech synthesis system based on Hidden Markov Models has been developed.
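The abstract does not detail the selection methodology; a common approach to phonetic balancing is greedy sentence selection that minimizes the gap between the corpus's phoneme distribution and the language's target distribution. A minimal, illustrative sketch:

```python
from collections import Counter

def greedy_select(candidates, target_dist, n_sentences):
    """Greedily pick sentences whose phonemes best reduce the gap between
    the corpus's phoneme distribution and the language's target one.

    candidates:  list of (sentence, [phonemes]) pairs.
    target_dist: dict phoneme -> desired relative frequency.
    """
    chosen, counts = [], Counter()

    def gap(cnts):
        # L1 distance between normalized counts and the target frequencies.
        total = sum(cnts.values()) or 1
        return sum(abs(cnts[p] / total - f) for p, f in target_dist.items())

    for _ in range(n_sentences):
        best = min(candidates, key=lambda c: gap(counts + Counter(c[1])))
        candidates.remove(best)
        counts += Counter(best[1])
        chosen.append(best[0])
    return chosen

# Toy usage with made-up sentences, phonemes, and target distribution.
cands = [("bab", ["b", "a", "b"]),
         ("dad", ["d", "a", "d"]),
         ("aba", ["a", "b", "a"])]
target = {"a": 0.5, "b": 0.25, "d": 0.25}
print(greedy_select(cands, target, 2))
```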