Deep learning has been widely applied and has brought breakthroughs in speech recognition, computer vision, and many other domains. Deep neural network architectures and computational issues have been well studied in machine learning. But there is still no theoretical foundation for understanding the approximation or generalization ability of deep learning methods generated by network architectures such as deep convolutional neural networks. Here we show that a deep convolutional neural network (CNN) is universal, meaning that it can be used to approximate any continuous function to an arbitrary accuracy when the depth of the network is large enough. This answers an open question in learning theory. Our quantitative estimate, given tightly in terms of the number of free parameters to be computed, verifies the efficiency of deep CNNs in dealing with high-dimensional data. Our study also demonstrates the role of convolutions in deep CNNs.
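For orientation, the universality statement can be written in the following standard form (the notation here is illustrative and not taken from the paper): for every continuous function $f$ on a compact subset $\Omega \subset \mathbb{R}^d$ and every $\varepsilon > 0$, there exist a depth $J$ and a deep CNN whose output function $f_J$ satisfies
\[
\|f - f_J\|_{C(\Omega)} = \sup_{x \in \Omega} |f(x) - f_J(x)| \le \varepsilon,
\]
with the required number of free parameters quantified explicitly in terms of $\varepsilon$ and $d$.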
Establishing a solid theoretical foundation for structured deep neural networks is greatly desired due to the successful applications of deep learning in various practical domains. This paper aims at an approximation theory of deep convolutional neural networks whose structures are induced by convolutions. To overcome the difficulty in the theoretical analysis of networks with linearly increasing widths arising from convolutions, we introduce a downsampling operator to reduce the widths. We prove that the downsampled deep convolutional neural networks approximate ridge functions well, which hints at some advantages of these structured networks in terms of approximation or modeling. We also prove that the output of any multi-layer fully-connected neural network can be realized by that of a downsampled deep convolutional neural network with free parameters of the same order, which shows that in general the approximation ability of deep convolutional neural networks is at least as good as that of fully-connected networks. Finally, a theorem for approximating functions on Riemannian manifolds is presented, which demonstrates that deep convolutional neural networks can be used to learn manifold features of data.
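For reference, a ridge function is a multivariate function of the standard form (notation illustrative)
\[
f(x) = g(\xi \cdot x), \qquad x \in \mathbb{R}^d,
\]
with a univariate function $g$ and a fixed direction vector $\xi \in \mathbb{R}^d$; the result above asserts that downsampled deep convolutional neural networks approximate functions of this form efficiently.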
In this paper, we study data-dependent generalization error bounds that exhibit a mild dependency on the number of classes, making them suitable for multi-class learning with a large number of label classes. The bounds hold generally for empirical multi-class risk minimization algorithms using an arbitrary norm as the regularizer. Key to our analysis are new structural results for multi-class Gaussian complexities and empirical $\ell_\infty$-norm covering numbers, which exploit the Lipschitz continuity of the loss function with respect to the $\ell_2$- and $\ell_\infty$-norms, respectively. We establish data-dependent error bounds in terms of the complexities of a linear function class defined on a finite set induced by training examples, for which we show tight lower and upper bounds. We apply the results to several prominent multi-class learning machines and show a tighter dependency on the number of classes than the state of the art. For instance, for the multi-class support vector machine of Crammer and Singer (2002), we obtain a data-dependent bound with a logarithmic dependency, which is a significant improvement over the previous square-root dependency. Experimental results are reported to verify the effectiveness of our theoretical findings.
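As a point of reference, the multi-class support vector machine of Crammer and Singer is built on the multi-class hinge loss, which in one standard formulation (not quoted from the paper) reads
\[
\ell(\mathbf{w}; x, y) = \max_{y' \neq y} \bigl(1 - \langle w_y, x \rangle + \langle w_{y'}, x \rangle\bigr)_+,
\]
where $w_y$ is the weight vector of class $y$; the bound discussed above improves the dependency of its generalization error on the number of classes from square-root to logarithmic.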
Deep learning based on deep convolutional neural networks (CNNs) is extremely efficient in solving classification problems in speech recognition, computer vision, and many other fields. But there is not enough theoretical understanding of this topic, especially of the generalization ability of the induced CNN algorithms. In this article, we develop a generalization analysis of a deep CNN algorithm for binary classification with data on spheres. An essential property of the classification problem is the lack of continuity or high smoothness of the target function associated with a convex loss function such as the hinge loss. This motivates us to consider the approximation of functions in the $L_p$ space with $1 \leq p \leq \infty$. We provide rates of $L_p$-approximation when the approximated function lies in a Sobolev space, and then present generalization bounds and learning rates for the excess misclassification error of the deep CNN classification algorithm. Our novel analysis is based on efficient cubature formulae on spheres and other tools from spherical analysis and approximation theory.
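A standard fact behind the choice of $L_p$ approximation (stated here for context, with notation not taken from the article): the minimizer of the risk associated with the hinge loss $\phi(t) = \max\{0, 1-t\}$ is the Bayes rule
\[
f_c(x) = \mathrm{sgn}\bigl(2\eta(x) - 1\bigr), \qquad \eta(x) = \mathrm{Prob}(Y = 1 \mid X = x),
\]
which is in general discontinuous, so the target function cannot be approximated uniformly and approximation in $L_p$ spaces becomes the natural setting.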
High piezoelectricity of (K,Na)NbO3 (KNN) lead‐free materials benefits from a polymorphic phase transition (PPT) around room temperature, but its temperature sensitivity has been a bottleneck impeding their applications. It is found that good thermal stability can be achieved in CaZrO3‐modified KNN lead‐free piezoceramics, in which the normalized strain d33* remains almost constant from room temperature up to 140 °C. In situ synchrotron X‐ray diffraction experiments combined with permittivity measurements disclose the occurrence of a new phase transformation under an electric field, which extends the transition range between the tetragonal and orthorhombic phases. It is revealed that such an electrically enhanced diffused PPT contributes to the boosted thermal stability of KNN‐based lead‐free piezoceramics with high piezoelectricity. The present approach based on phase engineering should also be effective in endowing other lead‐free piezoelectrics with both high piezoelectricity and good temperature stability.
A material concept of electrically enhanced diffused polymorphic phase transition (EED‐PPT) is developed to resolve the long‐standing issue of temperature sensitivity in lead‐free (K,Na)NbO3 piezoelectrics. Experimental and theoretical studies reveal that the EED‐PPT can remarkably boost the temperature stability of (K,Na)NbO3, with the normalized strain d33* remaining almost constant from room temperature up to 140 °C.
We consider a family of deep neural networks consisting of two groups of convolutional layers, a downsampling operator, and a fully connected layer. The network structure depends on two structural parameters which determine the numbers of convolutional layers and the width of the fully connected layer. We establish an approximation theory with explicit approximation rates when the approximated function takes a composite form f∘Q with a feature polynomial Q and a univariate function f. In particular, we prove that such a network can outperform fully connected shallow networks in approximating radial functions with $Q(x)=|x|^2$, when the dimension d of the data from $\mathbb{R}^d$ is large. This gives the first rigorous proof of the superiority of deep convolutional neural networks in approximating functions with special structures. Then we carry out a generalization analysis for empirical risk minimization with such a deep network in a regression framework with a regression function of the form f∘Q. Our network structure, which does not use any composite information or the functions Q and f, can automatically extract features and make use of the composite nature of the regression function by tuning the structural parameters. Our analysis provides an error bound which decreases with the network depth to a minimum and then increases, verifying theoretically a trade-off phenomenon for network depths observed in many practical applications.
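To make the composite structure concrete (illustrative notation): with the feature polynomial $Q(x) = |x|^2$, the target function takes the form
\[
(f \circ Q)(x) = f\bigl(|x|^2\bigr) = f\Bigl(\sum_{i=1}^{d} x_i^2\Bigr), \qquad x \in \mathbb{R}^d,
\]
which is exactly a radial function; the claim above is that the deep convolutional network with downsampling approximates such functions at rates that fully connected shallow networks cannot match when $d$ is large.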
In this paper, we consider the problem of approximating functions from a Korobov space on $[-1,1]^d$ by ReLU shallow neural networks and present a rate $O\bigl(m^{-\frac{2}{5}(1+\frac{2}{d})}\log m\bigr)$ of uniform approximation by networks of m hidden neurons. This is achieved by combining a novel Fourier analysis approach and a probability argument. We apply our approximation theory to a learning algorithm for regression based on ReLU shallow neural networks and derive learning rates of order $O\bigl(N^{-\frac{4(d+2)}{9d+8}}\log N\bigr)$ for the excess generalization error with the sample size N when the regression function lies in the Korobov space.
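As a heuristic consistency check (not an argument made in the paper), the learning rate exponent agrees with the standard bias–variance trade-off exponent $2\alpha/(2\alpha+1)$ applied to the approximation order $\alpha = \frac{2}{5}\bigl(1 + \frac{2}{d}\bigr)$:
\[
\frac{2\alpha}{2\alpha + 1} = \frac{4(d+2)/(5d)}{4(d+2)/(5d) + 1} = \frac{4(d+2)}{9d+8}.
\]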
Distributed learning based on the divide-and-conquer approach is a powerful tool for big data processing. We introduce a distributed kernel gradient descent algorithm for the minimum error entropy principle and analyze its convergence. We show that the $L^2$ error decays at a minimax optimal rate under some mild conditions. As a tool, we establish some concentration inequalities for U-statistics which play pivotal roles in our error analysis.
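For context, the divide-and-conquer scheme referred to here has the generic form (a sketch, not necessarily the paper's exact algorithm): the sample $D$ is partitioned into disjoint subsets $D_1, \ldots, D_m$, kernel gradient descent under the minimum error entropy criterion is run on each subset to produce a local estimator $f_{D_j}$, and the global estimator is the average
\[
\overline{f}_D = \frac{1}{m} \sum_{j=1}^{m} f_{D_j}.
\]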
Pairwise learning is widely employed in ranking, similarity and metric learning, area under the ROC curve (AUC) maximization, and many other learning tasks involving sample pairs. Pairwise learning with deep neural networks has been considered for ranking, but theoretical understanding of this topic is still lacking. In this letter, we apply symmetric deep neural networks to pairwise learning for ranking with a hinge loss and carry out a generalization analysis for this algorithm. A key step in our analysis is to characterize a function that minimizes the risk. This motivates us to first find the minimizer of the hinge-loss risk and then design our two-part deep neural networks with shared weights, which induces the antisymmetric property of the networks. We present convergence rates of the approximation error in terms of function smoothness and a noise condition, and give an excess generalization error bound by means of properties of the hypothesis space generated by deep neural networks. Our analysis is based on tools from U-statistics and approximation theory.
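One illustrative way to realize the antisymmetry induced by weight sharing (a sketch in our own notation, not necessarily the exact construction of the letter) is to feed the two inputs of a pair through the same deep network $f$ and take the difference,
\[
h(x, x') = f(x) - f(x'),
\]
so that $h(x, x') = -h(x', x)$ for all pairs, which is the antisymmetric property required of a ranking rule.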
Online learning algorithms in a reproducing kernel Hilbert space associated with convex loss functions are studied. We show that, in terms of the expected excess generalization error, they can converge comparably fast to the corresponding kernel-based batch learning algorithms. Under mild conditions on loss functions and approximation errors, fast learning rates and finite sample upper bounds are established using polynomially decreasing step-size sequences. For some commonly used loss functions for classification, such as the logistic and the $p$-norm hinge loss functions with $p \in \{1,2\}$, the learning rates are the same as those for Tikhonov regularization and can be of order $O(T^{-1/2}\log T)$, which is nearly optimal up to a logarithmic factor. Our novelty lies in a sharp estimate for the expected values of norms of the learning sequence (or an inductive argument to uniformly bound the expected risks of the learning sequence) and a refined error decomposition for online learning algorithms.
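For concreteness, the online scheme analyzed here is of the classical stochastic-gradient type in a reproducing kernel Hilbert space $\mathcal{H}_K$ (the update below is the standard one, written with illustrative notation): upon receiving the sample $(x_t, y_t)$ at step $t$,
\[
f_{t+1} = f_t - \eta_t \, \ell'\bigl(f_t(x_t), y_t\bigr) K_{x_t}, \qquad \eta_t = \eta_1 t^{-\theta},
\]
where $\ell'$ is a (sub)gradient of the loss with respect to its first argument, $K_{x_t} = K(x_t, \cdot)$, and $\{\eta_t\}$ is a polynomially decreasing step-size sequence.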