In this letter, we focus on the relationship among channel capacity, signal-to-noise ratio (SNR), water type, wind speed, and characteristics of the transmitter/receiver array, such as inter-spacing and link range, for downlink underwater wireless optical communication (UWOC) multiple-input multiple-output (MIMO) systems. Numerical results suggest that more turbid water, a larger link range, and larger inter-spacing may reduce the channel capacity; meanwhile, more turbid water and a larger link range can weaken the effects of random slopes and inter-spacing, at the expense of a larger SNR required to maintain the channel capacity.
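The capacity figures discussed here rest on the standard MIMO log-det capacity formula; below is a minimal sketch, assuming equal power allocation across the transmit apertures and a toy channel gain matrix H, neither of which is specified in the abstract:

```python
import numpy as np

def mimo_capacity(H, snr):
    """MIMO channel capacity (bits/s/Hz) for a fixed gain matrix H,
    assuming equal power allocation over the Nt transmit apertures:
    C = log2 det(I + (snr / Nt) * H @ H^H)."""
    nr, nt = H.shape
    gram = H @ H.conj().T
    return float(np.log2(np.linalg.det(np.eye(nr) + (snr / nt) * gram).real))

# Toy 2x2 array: more turbid water or a larger link range would show up as
# weaker or more strongly correlated entries of H, lowering the capacity.
H = np.array([[1.0, 0.3],
              [0.3, 0.9]])
print(mimo_capacity(H, snr=10.0))
```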
In underwater wireless optical communications (UWOC), each emitted photon is scattered with a random deviation angle when propagating through the underwater channel. The light beam observed at the receiver plane therefore suffers a spatial angular spread that follows a certain angle-of-arrival (AOA) distribution. In this letter, we present a closed-form expression for the AOA distribution that characterizes how the received intensity of the ballistic and single-scattering components is distributed over AOA with respect to unit transmit power. Numerical results validate the proposed AOA distribution against the Monte Carlo approach in clear water as well as turbid coastal and harbor water over relatively short link ranges.
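The Monte Carlo validation mentioned above amounts to tracking photons with randomly sampled scattering angles. The toy sketch below uses the Henyey-Greenstein phase function with an illustrative asymmetry parameter; these are common modeling choices but assumptions here, not the letter's actual simulation settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hg_cos_theta(g, n):
    """Sample cos(theta) from the Henyey-Greenstein phase function."""
    u = rng.random(n)
    if g == 0:
        return 2 * u - 1
    frac = (1 - g**2) / (1 - g + 2 * g * u)
    return (1 + g**2 - frac**2) / (2 * g)

# Single-scattering toy model: each photon deviates once by a random angle,
# and we histogram the resulting angle of arrival at the receiver plane.
cos_theta = sample_hg_cos_theta(g=0.924, n=100_000)
aoa = np.degrees(np.arccos(np.clip(cos_theta, -1, 1)))
hist, edges = np.histogram(aoa, bins=90, range=(0, 90), density=True)
print("fraction within 5 degrees:", hist[:5].sum() * np.diff(edges)[0])
```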
Rectified linear unit (ReLU), as a non-linear activation function, is well known to improve the expressivity of neural networks, so that any continuous function can be approximated to arbitrary precision by a sufficiently wide neural network. In this work, we present another interesting and important feature of the ReLU activation function. We show that ReLU leads to {\it better separation} of similar data and {\it better conditioning} of the neural tangent kernel (NTK), two properties that are closely related. Compared with a linear neural network, we show that a ReLU-activated wide neural network at random initialization has a larger angle separation for similar data in the feature space of the model gradient, and has a smaller condition number for the NTK. Note that, for a linear neural network, the data separation and the NTK condition number always remain the same as in the case of a linear model. Furthermore, we show that a deeper ReLU network (i.e., one with more ReLU activation operations) has a smaller NTK condition number than a shallower one. Our results imply that ReLU activation, as well as the depth of a ReLU network, helps improve the gradient descent convergence rate, which is closely related to the NTK condition number.
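The NTK-conditioning claim can be checked numerically. Below is a minimal sketch (toy width and toy "similar" data, not the paper's setup) comparing the empirical NTK Gram condition number of a randomly initialized one-hidden-layer ReLU network against its linear (identity-activation) counterpart:

```python
import numpy as np

rng = np.random.default_rng(0)

def ntk_condition_number(X, width=2048, relu=True):
    """Empirical NTK Gram condition number for a one-hidden-layer net
    f(x) = v^T phi(Wx) / sqrt(width) at random initialization."""
    n, d = X.shape
    W = rng.standard_normal((width, d))
    v = rng.standard_normal(width)
    pre = X @ W.T                                        # (n, width) pre-activations
    act = np.maximum(pre, 0) if relu else pre
    dact = (pre > 0).astype(float) if relu else np.ones_like(pre)
    # Model-gradient features w.r.t. (v, W), stacked per example.
    grad_v = act / np.sqrt(width)                                        # (n, width)
    grad_W = (dact * v)[:, :, None] * X[:, None, :] / np.sqrt(width)     # (n, width, d)
    feats = np.concatenate([grad_v, grad_W.reshape(n, -1)], axis=1)
    K = feats @ feats.T                                  # empirical NTK Gram matrix
    return np.linalg.cond(K)

# Similar data: nearly collinear unit vectors (small pairwise angles).
base = rng.standard_normal(10)
X = np.stack([base + 0.05 * rng.standard_normal(10) for _ in range(8)])
X /= np.linalg.norm(X, axis=1, keepdims=True)
print("ReLU   NTK cond:", ntk_condition_number(X, relu=True))
print("Linear NTK cond:", ntk_condition_number(X, relu=False))
```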
Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is widely believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR), and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the large majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. Furthermore, training with the square loss appears to be less sensitive to the randomness in initialization. We posit that training with the square loss for classification should be part of the best practices of modern deep learning, on equal footing with cross-entropy.
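Concretely, "training with the square loss" here means regressing the network outputs onto one-hot label vectors. A minimal PyTorch sketch follows, where the model, data, and hyperparameters are placeholders rather than the paper's configurations:

```python
import torch
import torch.nn.functional as F

def square_loss(logits, targets, num_classes):
    """Mean squared error between the logits and one-hot encoded labels."""
    one_hot = F.one_hot(targets, num_classes).float()
    return F.mse_loss(logits, one_hot)

# Drop-in replacement for F.cross_entropy in a standard training step.
model = torch.nn.Linear(20, 5)                 # placeholder classifier
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
loss = square_loss(model(x), y, num_classes=5)
loss.backward()
```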
Over the past decade, the field of machine learning has witnessed significant advances in artificial intelligence, driven primarily by empirical research. Within this context, we present several surprising empirical phenomena observed in deep learning and kernel machines.

Among the crucial components of a learning system, the training objective holds immense importance. In the realm of classification tasks, the cross-entropy loss has emerged as the dominant choice for training modern neural architectures and is widely believed to offer empirical superiority over the square loss. However, there is little compelling empirical or theoretical evidence to firmly establish a clear-cut advantage of the cross-entropy loss. In fact, our findings demonstrate that training with the square loss achieves comparable or even better results than the cross-entropy loss, even when computational resources are equalized.

Training with the (rescaled) square loss, however, introduces a rescaling hyperparameter R, and it remains unclear how R needs to vary with the number of classes. We provide an exact analysis for a 1-layer ReLU network in the proportional asymptotic regime for isotropic Gaussian data. Specifically, we focus on the optimal choice of R as a function of (i) the number of classes, (ii) the degree of overparameterization, and (iii) the level of label noise. Finally, we provide empirical results on real data that support our theoretical predictions.

To avoid the extra parameters brought by rescaling the square loss (when the number of classes is large), we subsequently propose the “squentropy” loss, which is the sum of the cross-entropy loss and the average square loss over the incorrect classes. We show that the squentropy loss outperforms both the pure cross-entropy and rescaled square losses in terms of classification accuracy and model calibration. Moreover, the squentropy loss is a simple “plug-and-play” replacement for cross-entropy, as it requires no extra hyperparameters and no extra tuning of optimization parameters.

We also apply theoretically well-understood kernel machines to a challenging practical task, speech enhancement, and find that kernel machines actually outperform fully-connected networks while requiring fewer computational resources.

In another line of work, we investigate the connection between the Neural Collapse phenomenon proposed by Papyan, Han, & Donoho (2020) and generalization in deep learning. We give precise definitions that clarify the neural collapse concepts and examine their implications for generalization. Moreover, our empirical evidence supports our claim that neural collapse is primarily an optimization phenomenon.
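For illustration of the rescaling hyperparameter R mentioned in this abstract: one common way it enters the square loss is by rescaling and reweighting the correct-class term. The sketch below shows one such form; the exact placement of R is an assumption for illustration, not the thesis's definition:

```python
import torch
import torch.nn.functional as F

def rescaled_square_loss(logits, targets, num_classes, R=1.0):
    """Square loss with the correct-class term rescaled by R.

    Illustrative form: the correct-class target is set to R and that term is
    weighted by R, while incorrect-class logits are pushed toward 0.
    """
    one_hot = F.one_hot(targets, num_classes).float()
    per_entry = (logits - R * one_hot) ** 2
    weights = 1.0 + (R - 1.0) * one_hot        # weight R on the true class only
    return (weights * per_entry).mean()

# Toy usage; R typically grows with the number of classes.
logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
rescaled_square_loss(logits, targets, num_classes=10, R=5.0).backward()
```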
Nearly all practical neural models for classification are trained using the cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross-entropy and rescaled square losses in terms of classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, exhibits less variance with respect to the random initialization. Additionally, in contrast to the square loss, the squentropy loss can typically be trained using exactly the same optimization parameters, including the learning rate, as the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.
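Following the definition above, squentropy adds the average squared logit over the incorrect classes to the standard cross-entropy. A minimal PyTorch sketch (the toy logits and labels are placeholders):

```python
import torch
import torch.nn.functional as F

def squentropy_loss(logits, targets):
    """Squentropy: cross-entropy plus the average square of the logits
    over the incorrect classes (per the definition in the abstract)."""
    num_classes = logits.shape[1]
    ce = F.cross_entropy(logits, targets)
    one_hot = F.one_hot(targets, num_classes).bool()
    sq_incorrect = (logits.masked_fill(one_hot, 0.0) ** 2).sum(dim=1) / (num_classes - 1)
    return ce + sq_incorrect.mean()

# Plug-and-play: use in place of F.cross_entropy with the same optimizer settings.
logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
squentropy_loss(logits, targets).backward()
```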
The recent work of Papyan, Han, & Donoho (2020) presented an intriguing "Neural Collapse" phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This opened a rich area of exploration studying the phenomenon. Our motivation is to study the upper limits of this research program: how far will understanding Neural Collapse take us in understanding deep learning? First, we investigate its role in generalization. We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property). We find that while Neural Collapse often occurs on the train set, it does not occur on the test set. We thus conclude that Neural Collapse is primarily an optimization phenomenon, with as-yet-unclear connections to generalization. Second, we investigate the role of Neural Collapse in feature learning. We show simple, realistic experiments where training longer leads to worse last-layer features, as measured by transfer performance on a downstream task. This suggests that neural collapse is not always desirable for representation learning, as previously claimed. Finally, we give preliminary evidence of a "cascading collapse" phenomenon, wherein some form of Neural Collapse occurs not only in the last layer but in earlier layers as well. We hope our work encourages the community to continue the rich line of Neural Collapse research, while also considering its inherent limitations.
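Collapse on the train set is typically quantified by the within-class variability of last-layer features relative to the between-class variability. The sketch below uses a common trace-ratio statistic; this is one standard choice and not necessarily the exact metric used in the paper:

```python
import numpy as np

def within_class_variability(features, labels):
    """Tr(Sigma_W @ pinv(Sigma_B)) / K: values near 0 indicate that the
    features of each class have collapsed to their class mean (NC1)."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    dim, n = features.shape[1], len(features)
    sw, sb = np.zeros((dim, dim)), np.zeros((dim, dim))
    for c in classes:
        fc = features[labels == c]
        mu = fc.mean(axis=0)
        sw += (fc - mu).T @ (fc - mu) / n
        sb += len(fc) * np.outer(mu - global_mean, mu - global_mean) / n
    return np.trace(sw @ np.linalg.pinv(sb)) / len(classes)

# Example: near-collapsed synthetic features (tight clusters around class means).
rng = np.random.default_rng(0)
means = rng.standard_normal((3, 16))
labels = np.repeat(np.arange(3), 100)
features = means[labels] + 0.01 * rng.standard_normal((300, 16))
print(within_class_variability(features, labels))
```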
In this paper, we present a joint training framework between the multi-channel beamformer and the acoustic model for noise-robust automatic speech recognition (ASR). The complex ratio mask (CRM), which has been demonstrated to be more effective than the ideal ratio mask (IRM), is proposed for estimating the covariance matrices of the beamformer. Both the Minimum Variance Distortionless Response (MVDR) beamformer and the Generalized Eigenvalue (GEV) beamformer are investigated under the CRM-based joint training architecture. We also propose a robust mask pooling strategy across multiple channels. A long short-term memory (LSTM) based language model is used to rescore hypotheses, which further improves the overall performance. We evaluate the proposed methods on the CHiME-4 challenge dataset. The CRM-based system achieves a relative 10% reduction in word error rate (WER) compared with the IRM-based system. Without sequence discriminative training, our best single system already achieves an average WER of 2.72% on the test set, which is comparable to the state of the art.
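At a high level, mask-based beamforming estimates speech and noise spatial covariance matrices from the masks and derives per-frequency beamformer weights from them. The numpy sketch below uses a Souden-style MVDR formulation and a toy magnitude mask; these are illustrative assumptions rather than the paper's CRM-based joint-training pipeline:

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, ref_ch=0):
    """Per-frequency MVDR from mask-weighted spatial covariances.

    Y: (channels, frames) complex STFT values of one frequency bin.
    speech_mask / noise_mask: (frames,) mask values in [0, 1].
    """
    phi_xx = (speech_mask * Y) @ Y.conj().T / speech_mask.sum()   # speech covariance
    phi_nn = (noise_mask * Y) @ Y.conj().T / noise_mask.sum()     # noise covariance
    numerator = np.linalg.solve(phi_nn, phi_xx)
    w = numerator[:, ref_ch] / np.trace(numerator)                # Souden-style MVDR weights
    return w.conj() @ Y                                           # enhanced single channel

# Toy usage: 4 channels, 50 frames at one frequency bin.
rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 50)) + 1j * rng.standard_normal((4, 50))
mask = rng.random(50)
enhanced = mask_based_mvdr(Y, speech_mask=mask, noise_mask=1 - mask)
```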
We apply a fast kernel method to mask-based single-channel speech enhancement. Specifically, our method solves a kernel regression problem associated with a non-smooth kernel function (the exponential power kernel) using a highly efficient iterative method (EigenPro). Due to the simplicity of this method, its hyper-parameters, such as the kernel bandwidth, can be automatically and efficiently selected using line search on subsamples of the training data. We observe an empirical correlation between the regression loss (mean square error) and standard metrics for speech enhancement. This observation justifies our training target and motivates us to achieve a lower regression loss by training a separate kernel model per frequency subband. We compare our method with state-of-the-art deep neural networks on mask-based speech enhancement using the HINT and TIMIT corpora. Experimental results show that our kernel method consistently outperforms deep neural networks while requiring less training time.
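At its core, the method solves a kernel least-squares problem with an exponential power kernel. The sketch below replaces EigenPro's iterative solver with an exact solve on a toy problem, and the bandwidth and power values are placeholders rather than the automatically selected ones:

```python
import numpy as np

def exp_power_kernel(X, Z, bandwidth=5.0, power=1.0):
    """Exponential power kernel k(x, z) = exp(-(||x - z|| / bandwidth)^power).
    power=1 gives the (non-smooth) Laplacian kernel."""
    d = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=-1)
    return np.exp(-(d / bandwidth) ** power)

def kernel_regression_fit(X, Y, reg=1e-6, **kw):
    K = exp_power_kernel(X, X, **kw)
    return np.linalg.solve(K + reg * np.eye(len(X)), Y)       # alpha coefficients

def kernel_regression_predict(X_train, alpha, X_test, **kw):
    return exp_power_kernel(X_test, X_train, **kw) @ alpha

# Toy mask regression: noisy spectral features -> ideal ratio mask targets.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((200, 40)), rng.random((200, 64))
alpha = kernel_regression_fit(X, Y)
mask_pred = kernel_regression_predict(X, alpha, rng.standard_normal((10, 40)))
```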
Speech separation based on deep neural networks (DNNs) has been widely studied recently and has achieved considerable success. However, previous studies are mostly based on fully-connected neural networks. In order to capture the local information of speech signals, we propose to use convolutional maxout neural networks (CMNNs) to separate speech and noise by estimating the ideal ratio mask of time-frequency units. In our work, the proposed CMNN is applied in the frequency domain. By using local filtering and max-pooling, convolutional neural networks can model the local structure of speech signals. Instead of the sigmoid function, maxout is selected to address the saturation problem. In addition, dropout is integrated into the network to improve generalization. The proposed system outperforms a traditional DNN-based system in both objective speech quality and intelligibility.
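A maxout unit takes the maximum over a small group of linear feature maps, which avoids the saturation of sigmoid activations. Below is a minimal sketch of a convolutional maxout layer in PyTorch, with placeholder channel counts and group size rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class ConvMaxout(nn.Module):
    """2-D convolution followed by a maxout over groups of feature maps."""
    def __init__(self, in_ch, out_ch, pieces=2, kernel_size=3):
        super().__init__()
        self.pieces = pieces
        self.conv = nn.Conv2d(in_ch, out_ch * pieces, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        y = self.conv(x)                                   # (B, out_ch * pieces, F, T)
        b, _, f, t = y.shape
        return y.view(b, -1, self.pieces, f, t).max(dim=2).values

# Toy time-frequency input: a batch of 8 single-channel spectrogram patches.
x = torch.randn(8, 1, 64, 32)
print(ConvMaxout(1, 16)(x).shape)                          # torch.Size([8, 16, 64, 32])
```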