Sequence-to-sequence models have shown success in end-to-end speech recognition. However, these models have only used shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections and convolutional LSTMs to build very deep recurrent and convolutional structures. Our models exploit the spectral structure in the feature space and add computational depth without overfitting issues. We experiment with the WSJ ASR task and achieve 10.5% word error rate without any dictionary or language model using a 15-layer deep network.
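The word error rate quoted above is conventionally the word-level edit distance between hypothesis and reference, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the scoring tool used by the authors.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.split()  # assumption: whitespace tokenization
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.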
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models, making it not only an end-to-end trained system but an end-to-end model. In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence. On a Google voice search task, LAS achieves a WER of 14.1% without a dictionary or an external language model and 10.3% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.
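The "pyramidal" listener can be pictured as repeatedly shortening the acoustic sequence before the speller attends over it. The sketch below assumes each pyramid level halves the time resolution by concatenating pairs of consecutive frames; the helper name `pyramid_reduce` and the handling of a trailing odd frame are assumptions for illustration, and the recurrent layers themselves are omitted.

```python
from typing import List

Frame = List[float]


def pyramid_reduce(frames: List[Frame]) -> List[Frame]:
    """Concatenate each pair of consecutive frames: T frames of size d become T // 2 frames of size 2 * d."""
    usable = len(frames) - (len(frames) % 2)  # assumption: drop a trailing odd frame
    return [frames[t] + frames[t + 1] for t in range(0, usable, 2)]


def stack_pyramid(frames: List[Frame], levels: int) -> List[Frame]:
    """Apply the reduction `levels` times, shortening the sequence by 2 ** levels."""
    for _ in range(levels):
        frames = pyramid_reduce(frames)
    return frames
```

Under this assumption, three levels shorten an 8-frame input to a single frame, which is the kind of reduction that makes attention over long utterances cheaper.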
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F_{0} features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
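To make "mel-scale" concrete, the sketch below converts between hertz and mels using one widely used closed-form mapping, m = 2595 log10(1 + f / 700). The abstract does not say which mel variant the system uses, so the formula choice and the helper names `hz_to_mel`, `mel_to_hz` and `mel_spaced_frequencies` are assumptions for illustration.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Map a frequency in Hz to mels (assumed 2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(f_min: float, f_max: float, count: int) -> List[float]:
    """Return `count` frequencies in Hz that are evenly spaced on the mel scale."""
    if count < 2:
        raise ValueError("count must be at least 2")
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (count - 1)
    return [mel_to_hz(lo + k * step) for k in range(count)]
```

Frequencies spaced this way crowd together at the low end and spread out at the high end, which is why a mel spectrogram is a more compact representation than a linear-frequency one.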
The presence of a massless spin-2 field in an effective field theory results in a t-channel pole in the scattering amplitudes that precludes the application of standard positivity bounds. Despite this, recent arguments based on compactification to three dimensions have suggested that positivity bounds may be applied to the t-channel pole subtracted amplitude. If correct, this would have deep implications for UV physics and the weak gravity conjecture. Within the context of a simple renormalizable field theory coupled to gravity we find that applying these arguments would constrain the low-energy coupling constants in a way that is incompatible with their actual values. This contradiction persists on deforming the theory. Further, enforcing the t-channel pole subtracted positivity bounds on such generic renormalizable effective theories coupled to gravity would imply new physics at a scale parametrically smaller than expected, with far-reaching implications. This suggests that generically the standard positivity bounds are inapplicable with gravity, and we highlight a number of issues that impinge on the formulation of a three-dimensional amplitude which simultaneously satisfies the required properties of analyticity, positivity, and crossing symmetry. We conjecture instead a modified bound that ought to be satisfied independently of the precise details of the high energy completion.
We apply positivity bounds directly to a U(1) gauge theory with charged scalars and charged fermions, i.e., QED, minimally coupled to gravity. Assuming that the massless t-channel pole may be discarded, we show that the improved positivity bounds are violated unless new physics is introduced at the parametrically low scale $\Lambda_{\rm new} \sim (e\, m\, M_{\rm Pl})^{1/2}$, consistent with similar results for scalar field theories and far lower than the scale implied by the weak gravity conjecture. This is sharply contrasted with previous treatments which focus on the application of positivity bounds to the low energy gravitational Euler-Heisenberg effective theory only. We emphasize that the low cutoff is a consequence of applying the positivity bounds under the assumption that the pole may be discarded. We conjecture an alternative resolution that a small amount of negativity, consistent with decoupling limits, is allowed and is not in conflict with standard UV completions, including weakly coupled ones.
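An order-of-magnitude evaluation shows how low this scale is. The sketch below plugs electron values into the parametric estimate (e m M_Pl)^{1/2} and compares it with a scale of order e M_Pl, which is taken here as a stand-in for the weak gravity conjecture scale. The numerical inputs, the use of the reduced Planck mass, the neglect of all O(1) factors, and the name `new_physics_scale` are assumptions for illustration and are not taken from the abstract.

```python
import math

# Assumed inputs (natural units, energies in GeV).
ALPHA = 1.0 / 137.035999            # fine-structure constant at low energy
ELECTRON_MASS_GEV = 0.51099895e-3   # electron mass
REDUCED_PLANCK_MASS_GEV = 2.435e18  # assumption: reduced Planck mass convention


def new_physics_scale(charge: float, mass_gev: float, planck_mass_gev: float) -> float:
    """Parametric estimate (e * m * M_Pl) ** 0.5, ignoring O(1) factors."""
    return math.sqrt(charge * mass_gev * planck_mass_gev)


# Gauge coupling from alpha = e^2 / (4 pi).
e_charge = math.sqrt(4.0 * math.pi * ALPHA)

lambda_new = new_physics_scale(e_charge, ELECTRON_MASS_GEV, REDUCED_PLANCK_MASS_GEV)
lambda_wgc = e_charge * REDUCED_PLANCK_MASS_GEV  # assumed parametric form e * M_Pl

print(f"Lambda_new ~ {lambda_new:.2e} GeV")
print(f"e * M_Pl   ~ {lambda_wgc:.2e} GeV")
print(f"ratio      ~ {lambda_new / lambda_wgc:.1e}")
```

With these inputs the estimate lands between 10^7 and 10^8 GeV, roughly ten orders of magnitude below e M_Pl, which illustrates what "parametrically low" means for the electron.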