Along with the rapid development of deep learning in practice, theoretical explanations for its success have become urgent. Generalization and expressivity are two widely used measurements for quantifying the theoretical behavior of deep nets. Expressivity focuses on finding functions that are expressible by deep nets but cannot be approximated by shallow nets with a similar number of neurons; it usually implies large capacity. Generalization aims at deriving fast learning rates for deep nets; it usually requires small capacity to reduce the variance. Different from previous studies on deep nets, which pursue either expressivity or generalization, we consider both factors to explore the theoretical advantages of deep nets. For this purpose, we construct a deep net with two hidden layers possessing excellent expressivity in terms of localized and sparse approximation. Then, using the well-known covering number to measure capacity, we find that deep nets possess excellent expressive power (measured by localized and sparse approximation) without essentially enlarging the capacity of shallow nets. As a consequence, we derive near-optimal learning rates for implementing empirical risk minimization on deep nets. These results theoretically exhibit the advantages of deep nets from the learning theory viewpoint.
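To make the notion of localized approximation concrete, here is a minimal sketch of how a two-hidden-layer ReLU net can realize a "bump" supported on a small cube, the kind of localized unit that a single-hidden-layer net cannot provide. The construction (four ReLUs per coordinate in the first layer, one ReLU AND-gate in the second) and all function names are our illustration, not necessarily the exact construction used in the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def trapezoid(x, a, b, eps):
    """1-D plateau from four ReLUs: equals 1 on [a+eps, b-eps], 0 outside [a, b]."""
    return (relu(x - a) - relu(x - a - eps)
            - relu(x - b + eps) + relu(x - b)) / eps

def cube_bump(X, lo, hi, eps=0.05):
    """Two-hidden-layer ReLU 'bump': ~1 inside the cube prod_i [lo_i, hi_i], 0 away from it.
    Layer 1: one trapezoid per coordinate.
    Layer 2: relu(sum - (d-1)) acts as a soft AND over the d trapezoids."""
    X = np.atleast_2d(X)
    d = X.shape[1]
    t = sum(trapezoid(X[:, i], lo[i], hi[i], eps) for i in range(d))
    return relu(t - (d - 1))

# e.g. cube_bump(np.array([[0.5, 0.5], [0.9, 0.1]]), lo=[0.3, 0.3], hi=[0.7, 0.7])
# returns approximately [1, 0]: the unit fires only near the cube.
```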
Training an interpretable deep net that embodies its theoretical advantages is difficult but extremely important in the machine learning community. In this article, noticing the importance of spatial sparseness in signal and image processing, we develop a constructive approach that generates a deep net to capture the spatial sparseness feature. We conduct both theoretical analysis and numerical verification to show the power of the constructive approach. Theoretically, we prove that the constructive approach can yield a deep net estimate that achieves the optimal generalization error bounds in the framework of learning theory. Numerically, we show that the constructive approach is essentially better than shallow learning in the sense that it provides better prediction accuracy with less training time.
In this paper, we analyze the approximation abilities of shallow networks in reproducing kernel Hilbert spaces (RKHSs). We prove that there is a probability measure such that the achievable lower bound for approximation by shallow nets is realized for all functions in balls of the reproducing kernel Hilbert space with high probability, which differs from the classical minimax approximation error estimates. This result, together with existing approximation results for deep nets, shows the limitations of shallow nets and provides a theoretical explanation of why deep nets perform better than shallow nets.
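To make the contrast with minimax estimates explicit, one plausible formalization reads as follows; the notation is ours, and the constants c, alpha, the measure mu*, and the confidence level delta are placeholders rather than the paper's:

```latex
% Notation ours: B_R = ball of radius R in the RKHS, \Phi_n = shallow nets
% with n neurons, dist = L^2 distance to the set \Phi_n.
\begin{align*}
  &\text{classical minimax bound (worst case only):} &
    \sup_{f \in B_R} \mathrm{dist}(f, \Phi_n) &\ge c\, n^{-\alpha}, \\
  &\text{here (most functions, w.r.t.\ some } \mu^* \text{ on } B_R\text{):} &
    \mu^*\bigl(\{\, f \in B_R : \mathrm{dist}(f, \Phi_n) \ge c\, n^{-\alpha} \,\}\bigr) &\ge 1 - \delta.
\end{align*}
```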
4.
Random Sketching for Neural Networks With ReLU
Wang, Di; Zeng, Jinshan; Lin, Shao-Bo
IEEE Transactions on Neural Networks and Learning Systems, Feb. 2021, Volume 32, Issue 2
Journal Article
Training neural networks has recently become a hot topic in machine learning due to its great success in many applications. Since neural network training usually involves a highly nonconvex optimization problem, it is difficult to design optimization algorithms with perfect convergence guarantees that derive a neural network estimator of high quality. In this article, we borrow the well-known random sketching strategy from kernel methods to transform the training of shallow rectified linear unit (ReLU) nets into a linear least-squares problem. Using the localized approximation property of shallow ReLU nets and a recently developed dimensionality-leveraging scheme, we succeed in equipping shallow ReLU nets with a specific random sketching scheme. The efficiency of the suggested random sketching strategy is guaranteed by theoretical analysis and also verified via a series of numerical experiments. Theoretically, we show that the proposed random sketching is almost optimal in terms of both approximation capability and learning performance. This implies that random sketching does not degrade the performance of shallow ReLU nets. Numerically, we show that random sketching can significantly reduce the computational burden of numerous backpropagation (BP) algorithms while maintaining their learning performance.
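The core idea, reduced to a minimal sketch: freeze randomly drawn inner weights and biases (the "sketch"), so that only the linear outer layer remains to be fitted, which is a single least-squares solve. This resembles a random-feature construction; the paper's specific sketching and dimensionality-leveraging scheme may differ, and all names and parameter choices below are ours.

```python
import numpy as np

def fit_sketched_relu_net(X, y, n_neurons=200, reg=1e-6, seed=0):
    """Freeze randomly sketched inner weights and biases, so training the
    shallow ReLU net reduces to one linear (ridge) least-squares solve
    for the outer weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((d, n_neurons))   # sketched inner weights (never trained)
    b = rng.uniform(-1.0, 1.0, n_neurons)     # sketched biases
    H = np.maximum(X @ W + b, 0.0)            # hidden-layer ReLU features
    c = np.linalg.solve(H.T @ H + reg * np.eye(n_neurons), H.T @ y)
    return lambda Xnew: np.maximum(Xnew @ W + b, 0.0) @ c
```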
This paper focuses on the learning rate analysis of distributed kernel ridge regression (DKRR) for strong mixing sequences. Using a recently developed integral operator approach and a classical covariance inequality for Banach-valued strong mixing sequences, we succeed in deriving optimal learning rates for DKRR. As a byproduct, we deduce a sufficient condition on the mixing property to guarantee the optimal learning rates for kernel ridge regression, which fills the gap in learning rates between i.i.d. samples and strong mixing sequences. A series of numerical experiments is conducted to verify our theoretical assertions by showing the excellent learning performance of DKRR in learning both toy and real-world time series data. All these results extend the applicable range of distributed learning from i.i.d. samples to non-i.i.d. sequences.
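The DKRR estimator itself is simple to state: partition the sample into blocks (contiguous blocks are the natural choice for a mixing time series), solve kernel ridge regression on each block, and average the local predictors. A minimal sketch, with a Gaussian kernel and parameter choices that are ours rather than the paper's:

```python
import numpy as np

def gauss_kernel(A, B, width=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def dkrr_fit(X, y, n_machines=4, lam=1e-3):
    """Divide-and-conquer KRR: solve KRR on each (contiguous) block,
    then average the local predictors into the global estimator."""
    local = []
    for idx in np.array_split(np.arange(len(X)), n_machines):
        Xj, yj = X[idx], y[idx]
        Kj = gauss_kernel(Xj, Xj)
        # local KRR solution: (K_j + |D_j| * lam * I) alpha = y_j
        alpha = np.linalg.solve(Kj + lam * len(idx) * np.eye(len(idx)), yj)
        local.append((Xj, alpha))
    def predict(Xnew):
        return np.mean([gauss_kernel(Xnew, Xj) @ a for Xj, a in local], axis=0)
    return predict
```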
We study the generalization ability of distributed learning equipped with a divide-and-conquer approach and gradient descent algorithm in a reproducing kernel Hilbert space (RKHS). Using special spectral features of the gradient descent algorithms and a novel integral operator approach, we provide optimal learning rates of distributed gradient descent algorithms in probability and partly conquer the saturation phenomenon in the literature, in the sense that the maximum number of local machines guaranteeing the optimal learning rates does not vary if the regularity of the regression function goes beyond a certain quantity. We also find that additional unlabeled data can help relax the restriction on the number of local machines in distributed learning.
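For concreteness, the local algorithm is plain gradient descent on the least-squares objective in the RKHS; in coefficient form each machine iterates on its own kernel matrix, and the global estimator averages the local predictors exactly as in the DKRR sketch above. A minimal sketch (step size and iteration count are our illustrative choices):

```python
import numpy as np

def local_kernel_gd(K, y, eta=1.0, n_steps=200):
    """Gradient descent in the RKHS, written in coefficients:
    f_t = sum_i alpha_t[i] * K(., x_i), and each step is
    alpha <- alpha - (eta / n) * (K @ alpha - y)."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(n_steps):
        alpha -= (eta / n) * (K @ alpha - y)   # functional gradient step
    return alpha

# Distributed version: run local_kernel_gd on each machine's kernel matrix,
# then average the resulting local predictors over machines.
```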
Synaptic weight modification depends not only on the interval of the pre-/postspike pairs according to spike-timing-dependent plasticity (classical pair-STDP) but also on the timing of the preceding spike (triplet-STDP). Triplet-STDP reflects the unavoidable interaction of spike pairs in natural spike trains through the short-term suppression effect of preceding spikes. Second-order memristors with one state variable possessing short-term dynamics work in a way similar to the biological system. In this work, the suppression triplet-STDP learning rule is faithfully demonstrated by experiments and simulations using second-order memristors. Furthermore, a leaky-integrate-and-fire (LIF) neuron is simulated using a circuit constructed with second-order memristors. Taking advantage of the LIF neuron, various neuromimetic dynamic processes, including local graded potential leaking out, postsynaptic impulse generation and backpropagation, and synaptic weight modification according to the suppression triplet-STDP rule, are realized. The realized weight-dependent pair- and triplet-STDP rules are clearly in line with findings in biology. The physically realized triplet-STDP rule is powerful in developing direction and speed selectivity for complex pattern recognition and tracking tasks. These scalable artificial synapses and neurons realized in second-order memristors can intrinsically capture neuromimetic dynamic processes; they are promising building blocks for constructing brain-inspired computation systems.
Compared with classical pair-spike-timing-dependent plasticity (STDP), triplet-STDP is an advanced synaptic plasticity that induces improved learning capability. Here, triplet-STDP is physically demonstrated and a leaky-integrate-and-fire (LIF) neuron is simulated using second-order memristors. The biorealistic implementation of triplet-STDP and the LIF neuron offers an efficient approach to artificial intelligence through a simple artificial neural network.
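As a software illustration of the learning rule (the paper realizes it physically in memristor dynamics, not in code), here is a minimal sketch of pair-STDP plus a suppression-style triplet correction; the exponential forms, time constants, and the exact suppression factor are our illustrative assumptions:

```python
import numpy as np

def pair_stdp(dt, A_plus=1.0, A_minus=0.5, tau_plus=20.0, tau_minus=20.0):
    """Classical pair-STDP weight update. dt = t_post - t_pre in ms:
    potentiation for dt >= 0, depression for dt < 0."""
    if dt >= 0:
        return A_plus * np.exp(-dt / tau_plus)
    return -A_minus * np.exp(dt / tau_minus)

def suppression_triplet_stdp(dt, dt_prev, tau_s=30.0, **kw):
    """Suppression-style triplet-STDP: the pair update is scaled by a spike
    efficacy that is reduced when a preceding spike fired dt_prev ms earlier."""
    efficacy = 1.0 - np.exp(-dt_prev / tau_s)   # recovers toward 1 as dt_prev grows
    return efficacy * pair_stdp(dt, **kw)

# e.g. suppression_triplet_stdp(10.0, dt_prev=5.0) is much weaker than
# pair_stdp(10.0): a recent preceding spike suppresses the pair interaction.
```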
Deep learning is recognized to be capable of discovering deep features for representation learning and pattern recognition without requiring elegant feature-engineering techniques that take advantage of human ingenuity and prior knowledge. It has thus triggered enormous research activity in machine learning and pattern recognition. One of the most important challenges for deep learning is to figure out the relation between a feature and the depth of deep neural networks (deep nets for short) so as to reflect the necessity of depth. Our purpose is to quantify this feature-depth correspondence in feature extraction and generalization. We present the adaptivity of features to depths, and vice versa, by showing a depth-parameter trade-off in extracting both single features and composite features. Based on these results, we prove that implementing the classical empirical risk minimization on deep nets can achieve the optimal generalization performance for numerous learning tasks. Our theoretical results are verified by a series of numerical experiments, including toy simulations and a real application to earthquake seismic intensity prediction.
In this paper, we propose an efficient boosting method with theoretical guarantees for binary classification. There are three key ingredients of the proposed boosting method: a fully corrective greedy (FCG) update, a differentiable squared hinge (also called truncated quadratic) loss function, and an efficient alternating direction method of multipliers (ADMM) solver. Compared with traditional boosting methods, on the one hand, the FCG update accelerates the numerical convergence rate, and on the other hand, the squared hinge loss inherits the robustness of the hinge loss for classification and maintains the theoretical benefits of the square loss in regression. The ADMM solver with guaranteed fast convergence then provides an efficient implementation of the proposed boosting method. We conduct both theoretical analysis and numerical verification to show the superiority of the proposed method. Theoretically, a fast learning rate of order O((m/log m)^{-1/2}) is proved under certain standard assumptions, where m is the sample size. Numerically, a series of toy simulations and real-data experiments is carried out to verify the developed theory.
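To illustrate the two algorithmic ingredients, a minimal sketch of FCG boosting with the squared hinge loss: each round greedily adds the weak learner most correlated with the negative functional gradient, then refits all coefficients jointly (the "fully corrective" step). Decision stumps as weak learners and a generic L-BFGS solver in place of the paper's ADMM solver are our simplifications:

```python
import numpy as np
from scipy.optimize import minimize

def stump(X, j, s):
    """Decision stump weak learner: +1 if feature j <= threshold s, else -1."""
    return np.where(X[:, j] <= s, 1.0, -1.0)

def sq_hinge(F, y):
    """Squared hinge loss: sum_i max(0, 1 - y_i F_i)^2 (differentiable)."""
    m = np.maximum(0.0, 1.0 - y * F)
    return (m ** 2).sum()

def fcg_boost(X, y, n_rounds=10):
    n, d = X.shape
    params, coefs = [], np.zeros(0)
    F = np.zeros(n)
    for _ in range(n_rounds):
        # negative functional gradient of the squared hinge loss at F
        r = 2.0 * y * np.maximum(0.0, 1.0 - y * F)
        # greedy step: pick the stump most correlated with the residual
        best = max(((j, s) for j in range(d) for s in np.unique(X[:, j])),
                   key=lambda js: abs(r @ stump(X, *js)))
        params.append(best)
        A = np.column_stack([stump(X, j, s) for j, s in params])
        # fully corrective step: reoptimize ALL coefficients jointly
        res = minimize(lambda c: sq_hinge(A @ c, y),
                       np.append(coefs, 0.0), method="L-BFGS-B")
        coefs = res.x
        F = A @ coefs
    def predict(Xnew):
        An = np.column_stack([stump(Xnew, j, s) for j, s in params])
        return np.sign(An @ coefs)
    return predict
```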
Spectral algorithms have been widely used and studied in learning theory and inverse problems. This paper is concerned with distributed spectral algorithms, based on a divide-and-conquer approach, for handling big data. We present a learning theory for these distributed kernel-based learning algorithms in a regression framework, including error bounds and minimax optimal learning rates achieved by means of a novel integral operator approach and a second-order decomposition of inverse operators. Our quantitative estimates are given in terms of the regularity of the regression function, the effective dimension of the reproducing kernel Hilbert space, and the qualification of the filter function of the spectral algorithm. They do not need any eigenfunction or noise conditions and are better than the existing results, even for the classical family of spectral algorithms.
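A spectral algorithm replaces the KRR inverse by a general filter g_lambda applied to the eigenvalues of the empirical kernel operator; KRR and spectral cut-off are two instances, and the distributed version applies the same filter on each data block and averages, as in the DKRR sketch above. A minimal sketch (the filter forms are standard; names and parameter choices are ours):

```python
import numpy as np

def spectral_fit(K, y, g_lam):
    """Spectral algorithm on one machine: eigendecompose the empirical
    operator K/n and apply the filter g_lam to its eigenvalues.
    The fitted function is f(x) = sum_i alpha[i] * k(x, x_i)."""
    n = len(y)
    sig, U = np.linalg.eigh(K / n)
    alpha = U @ (g_lam(np.maximum(sig, 0.0)) * (U.T @ y)) / n
    return alpha

# two classical filter functions, with regularization parameter lam:
krr_filter    = lambda lam: (lambda s: 1.0 / (s + lam))          # kernel ridge regression
cutoff_filter = lambda lam: (lambda s: np.where(s >= lam, 1.0 / np.maximum(s, lam), 0.0))

# distributed variant: run spectral_fit on each local block's kernel matrix
# and average the resulting local predictors.
```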