'In-memory computing' is being widely explored as a novel computing paradigm to mitigate the well-known memory bottleneck. This emerging paradigm aims at embedding some aspects of computation inside the memory array, thereby avoiding frequent and expensive movement of data between the compute unit and the storage memory. In-memory computing with silicon memories has been widely explored across various memory bit-cells. Embedding computation inside the six-transistor (6T) SRAM array is of special interest since it is the most widely used on-chip memory. In this paper, we present a novel in-memory multiplication followed by accumulation operation capable of performing parallel dot products within the 6T SRAM array without any changes to the standard bitcell. We further study the effect of circuit non-idealities and process variations on the accuracy of the LeNet-5 and VGG neural network architectures on the MNIST and CIFAR-10 datasets, respectively. The proposed in-memory dot-product mechanism achieves 88.8% and 99% accuracy on CIFAR-10 and MNIST, respectively. Compared to a standard von Neumann system, the proposed system is \(6.24\times\) better in energy consumption and \(9.42\times\) better in delay.
Machine Learning (ML) workloads, being memory- and compute-intensive, consume large amounts of power when run on conventional computing systems, restricting their implementations to large-scale data centers. Transferring large amounts of data from edge devices to data centers is not only energy-expensive but sometimes undesirable in security-critical applications. Thus, there is a need for domain-specific hardware primitives for energy-efficient ML processing at the edge. One such approach, in-memory computing, eliminates frequent and unnecessary data transfers between the memory and the compute units by directly computing on the data where it is stored. However, the analog nature of the computations introduces non-idealities, which degrade the overall accuracy of neural networks. In this paper, we propose an in-memory computing primitive for accelerating dot-products within standard 8T-SRAM caches using charge sharing. The inherent parasitic capacitance of the bitlines and sourcelines is used for accumulating analog voltages, which can be sensed to obtain an approximate dot product. The charge-sharing approach involves a self-compensation technique that reduces the effects of non-idealities, thereby reducing the errors. Our results for ternary-weight neural networks show that, using the proposed compensation approaches, the accuracy degradation is within 1% and 5% of the baseline accuracy for the MNIST and CIFAR-10 datasets, respectively, with an energy-delay product improvement of \(38\times\) over a standard von Neumann computing system. We believe that this work can be used in conjunction with existing mitigation techniques, such as re-training approaches, to further enhance system performance.
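The charge-sharing dot product above is a circuit technique, but its effect on a network can be approximated in software. Below is a minimal behavioral sketch (ours, not the paper's circuit or code) of a ternary-weight dot product perturbed by additive noise standing in for the analog non-idealities; the noise model, sigma value, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_dot(x, w_ternary, noise_sigma=0.05):
    # Ideal multiply-and-accumulate result.
    ideal = float(np.dot(x, w_ternary))
    # Additive Gaussian term standing in for charge-sharing non-idealities
    # (parasitic mismatch, sense-amp offset); this noise model is an
    # illustrative assumption, not a measured circuit parameter.
    return ideal + rng.normal(0.0, noise_sigma)

x = rng.random(64)                  # activations in [0, 1)
w = rng.integers(-1, 2, size=64)    # ternary weights {-1, 0, +1}
print(analog_dot(x, w), float(np.dot(x, w)))
```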
Deep neural networks (DNNs) have found widespread adoption in solving image recognition and natural language processing tasks. However, they make confident mispredictions when presented with data that does not belong to the training distribution, i.e., out-of-distribution (OoD) samples. Research has shown that angular representations can be useful for addressing the curse of dimensionality and improving OoD detection performance. However, when evaluating angular separability using Fisher's criterion, we find that DNNs trained with empirical risk minimization or inter-class mixup have low angular separability between in-distribution and OoD data. To improve angular separability, we propose intra-class mixup. We provide mathematical reasoning showing that intra-class mixup results in reduced angular spread because of reduced variance at the input during training. Further, to take full advantage of the improved angular separability from intra-class mixup, we propose supplementing the separation metric with the cosine of the angular margin to improve OoD detection. The angular margin is the angle between the final-layer weight vector and the sample representation. The proposed intra-class mixup, when applied to various existing OoD detection techniques, shows an improvement of 4.21% and 6.21% in AUROC performance over empirical risk minimization and inter-class mixup, respectively. Further, intra-class mixup aided with the cosine of the angular margin improves AUROC performance by 6.71% and 8.75% over empirical risk minimization and inter-class mixup, respectively.
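Intra-class mixup is straightforward to express in code: samples are mixed only with other samples of the same class, so no label mixing is needed. The following PyTorch sketch is illustrative; the function name, Beta parameter, and batch handling are our assumptions, not the paper's implementation.

```python
import torch

def intra_class_mixup(x, y, alpha=0.2):
    """Mix each sample with another sample of the same class; labels are unchanged."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.arange(x.size(0))
    for c in y.unique():
        idx = (y == c).nonzero(as_tuple=True)[0]
        # Shuffle indices within each class only.
        perm[idx] = idx[torch.randperm(idx.numel())]
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    return x_mixed, y

# Usage: x_mix, y = intra_class_mixup(images, labels); loss = criterion(model(x_mix), y)
```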
Decentralized distributed learning is the key to enabling large-scale machine learning (training) on edge devices utilizing private user-generated local data, without relying on the cloud. However, the practical realization of such on-device training is limited by the communication and compute bottleneck. In this paper, we propose and show the convergence of low-precision decentralized training that aims to reduce the computational complexity and communication cost of decentralized training. Many feedback-based compression techniques have been proposed in the literature to reduce communication costs. To the best of our knowledge, there is no work that applies and evaluates compute-efficient training techniques such as quantization and pruning for peer-to-peer decentralized learning setups. Since real-world applications have a significant skew in the data distribution, we design "Range-EvoNorm" as the normalization-activation layer, which is better suited for low-precision training over non-IID data. Moreover, we show that the proposed low-precision training can be used in synergy with other communication compression methods, decreasing the communication cost further. Our experiments indicate that 8-bit decentralized training has minimal accuracy loss compared to its full-precision counterpart, even with non-IID data. However, when low-precision training is accompanied by communication compression through sparsification, we observe a 1-2% drop in accuracy. The proposed low-precision decentralized training decreases computational complexity, memory usage, and communication cost by \(\sim 4\times\) and compute energy by a factor of \(\sim 20\times\), while trading off less than 1% accuracy for both IID and non-IID data. In particular, for higher skew values, we observe an increase in accuracy (by \(\sim 0.5\%\)) with low-precision training, indicating the regularization effect of quantization.
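As a rough illustration of the 8-bit arithmetic underlying such low-precision training and communication, the sketch below shows per-tensor symmetric int8 quantization of a tensor (for example, a model update exchanged with peers) and its reconstruction. The scale rule and names are illustrative assumptions, not the exact scheme used in the paper.

```python
import torch

def quantize_int8(t):
    """Per-tensor symmetric quantization to int8; returns (q, scale)."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

g = torch.randn(1024)               # e.g., a flattened model update
q, s = quantize_int8(g)             # 8-bit payload sent to neighbors
g_hat = dequantize(q, s)            # what a neighbor reconstructs
print((g - g_hat).abs().max().item())
```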
Machine unlearning is a prominent and challenging field, driven by regulatory demands for user data deletion and heightened privacy awareness. Existing approaches involve retraining the model or multiple finetuning steps for each deletion request, often constrained by computational limits and restricted data access. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate specific classes from the learned model. Our algorithm first estimates the Retain and the Forget Spaces using Singular Value Decomposition on the layerwise activations for a small subset of samples from the retain and unlearn classes, respectively. We then compute the shared information between these spaces and remove it from the forget space to isolate the class-discriminatory feature space. Finally, we obtain the unlearned model by updating the weights to suppress the class-discriminatory features from the activation spaces. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer, with only a \(\sim 1.5\%\) drop in retain accuracy compared to the original model while maintaining under \(1\%\) accuracy on the unlearned class samples. Furthermore, our algorithm exhibits competitive unlearning performance and resilience against Membership Inference Attacks (MIA). Compared to baselines, it achieves an average accuracy improvement of \(1.38\%\) on the ImageNet dataset while requiring up to \(10\times\) fewer samples for unlearning. Additionally, under stronger MIA attacks on the CIFAR-100 dataset using a ResNet18 architecture, our approach outperforms the best baseline by \(1.8\%\). Our code is available at https://github.com/sangamesh-kodge/class_forgetting.
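The core projection step can be sketched for a single linear layer: estimate retain/forget activation bases with SVD, subtract the shared component from the forget basis, and project the weights away from the remaining forget-only directions. This is an illustrative sketch under our own assumptions (rank choices, QR orthonormalization, names); the paper's actual procedure may differ in details.

```python
import torch

def top_basis(acts, k):
    # acts: (num_samples, features); columns of U span the activation space.
    U, S, Vh = torch.linalg.svd(acts.T, full_matrices=False)
    return U[:, :k]                              # (features, k)

def unlearn_layer(W, retain_acts, forget_acts, k=16):
    Ur = top_basis(retain_acts, k)               # Retain Space
    Uf = top_basis(forget_acts, k)               # Forget Space
    # Remove information shared with the retain space from the forget space.
    Uf_excl = Uf - Ur @ (Ur.T @ Uf)
    Q, _ = torch.linalg.qr(Uf_excl)              # orthonormalize the residual
    P = torch.eye(W.shape[1]) - Q @ Q.T          # project away from forget-only directions
    return W @ P                                 # suppress class-discriminatory features

W = torch.randn(128, 64)
retain = torch.randn(500, 64)
forget = torch.randn(50, 64)
W_unlearned = unlearn_layer(W, retain, forget)
```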
Label corruption, where training samples have incorrect labels, can significantly degrade the performance of machine learning models. This corruption often arises from non-expert labeling or adversarial attacks. Acquiring large, perfectly labeled datasets is costly, and retraining large models from scratch when a clean dataset becomes available is computationally expensive. To address this challenge, we propose Post-Training Correction, a new paradigm that adjusts model parameters after initial training to mitigate label noise, eliminating the need for retraining. We introduce Verifix, a novel Singular Value Decomposition (SVD) based algorithm that leverages a small, verified dataset to correct the model weights using a single update. Verifix uses SVD to estimate a Clean Activation Space and then projects the model's weights onto this space to suppress activations corresponding to corrupted data. We demonstrate Verifix's effectiveness on both synthetic and real-world label noise. Experiments on the CIFAR dataset with 25% synthetic corruption show generalization improvements of 7.36% on average. Additionally, we observe generalization improvements of up to 2.63% on naturally corrupted datasets like WebVision1.0 and Clothing1M.
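A minimal sketch of the single-update idea for one linear layer, assuming a per-layer SVD of verified-data activations and a projection of the weights onto the resulting Clean Activation Space; the rank selection and names here are our illustrative choices, not the paper's exact algorithm.

```python
import torch

def clean_space_projector(clean_acts, k=32):
    # clean_acts: (num_verified_samples, features)
    U, S, Vh = torch.linalg.svd(clean_acts.T, full_matrices=False)
    Uk = U[:, :k]                     # basis of the Clean Activation Space
    return Uk @ Uk.T                  # (features, features) projector

def verifix_update(W, clean_acts, k=32):
    P = clean_space_projector(clean_acts, k)
    # Keep only the weight components that act on clean-space directions.
    return W @ P

W = torch.randn(256, 64)
verified = torch.randn(200, 64)       # activations from the small verified set
W_corrected = verifix_update(W, verified)
```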
We propose BERMo, an architectural modification to BERT, which makes predictions based on a hierarchy of surface, syntactic, and semantic language features. We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths. Our approach has two-fold benefits: (1) improved gradient flow for the downstream task, as every layer has a direct connection to the gradients of the loss function, and (2) increased representative power, as the model no longer needs to copy the features learned in the shallower layers that are necessary for the downstream task. Further, our model has negligible parameter overhead, as there is a single scalar parameter associated with each layer in the network. Experiments on the probing tasks from the SentEval dataset show that our model performs up to \(4.65\%\) better in accuracy than the baseline, with an average improvement of \(2.67\%\) on the semantic tasks. When subjected to compression techniques, we find that our model enables stable pruning on small datasets like SST-2, where the BERT model commonly diverges. We observe that our approach converges \(1.67\times\) and \(1.15\times\) faster than the baseline on the MNLI and QQP tasks from the GLUE dataset. Moreover, our results show that our approach can obtain better parameter efficiency for penalty-based pruning approaches on the QQP task.
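The ELMo-style combination can be written as a small module: one learnable scalar per layer (softmax-normalized) plus a global scale, applied to the stacked hidden states of the encoder. This is a generic sketch of that scheme; module and variable names are ours, and the integration details with BERT are assumed.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))                    # global scale

    def forward(self, hidden_states):
        # hidden_states: list of (batch, seq, dim) tensors, one per layer.
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = sum(w_i * h for w_i, h in zip(w, hidden_states))
        return self.gamma * mixed

# With a HuggingFace BERT run with output_hidden_states=True, one could feed
# outputs.hidden_states[1:] to ScalarMix and pass the result to the task head.
```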
Decentralized learning over distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed (IID). This paper focuses on improving decentralized learning over non-IID data. We propose \textit{Neighborhood Gradient Clustering (NGC)}, a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of one agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the neighbors' parameters with respect to the local dataset), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints. Further, we present \textit{CompNGC}, a compressed version of \textit{NGC} that reduces the communication overhead by \(32\times\). We theoretically analyze the convergence rate of the proposed algorithm and demonstrate its efficiency over non-IID data sampled from various vision and language datasets. Our experiments demonstrate that \textit{NGC} and \textit{CompNGC} outperform (by \(0-6\%\)) the existing SoTA decentralized learning algorithms over non-IID data with significantly lower compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by \(1-35\%\) without additional communication cost.
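A minimal sketch of the gradient-mixing step described above for one agent, assuming the model-variant and data-variant cross-gradients have already been obtained from its neighbors; the mixing weights and function names are illustrative assumptions, not the paper's exact update rule.

```python
import torch

def ngc_mix(self_grad, model_variant_grads, data_variant_grads,
            alpha=0.5, beta=0.25):
    """All inputs are flattened gradients of equal shape; the lists hold one entry per neighbor."""
    mv = torch.stack(model_variant_grads).mean(dim=0) if model_variant_grads else torch.zeros_like(self_grad)
    dv = torch.stack(data_variant_grads).mean(dim=0) if data_variant_grads else torch.zeros_like(self_grad)
    # Weighted mean of self-, model-variant, and data-variant cross-gradients.
    return alpha * self_grad + beta * mv + (1.0 - alpha - beta) * dv

g_self = torch.randn(1000)
g_mv = [torch.randn(1000), torch.randn(1000)]   # from two neighbors
g_dv = [torch.randn(1000), torch.randn(1000)]
g_new = ngc_mix(g_self, g_mv, g_dv)
```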