Realizing increasingly complex artificial intelligence (AI) functionalities directly on edge devices calls for unprecedented energy efficiency of edge hardware. Compute-in-memory (CIM) based on resistive random-access memory (RRAM) promises to meet this demand by storing AI model weights in dense, analogue and non-volatile RRAM devices, and by performing AI computation directly within RRAM, thus eliminating power-hungry data movement between separate compute and memory. Although recent studies have demonstrated in-memory matrix-vector multiplication on fully integrated RRAM-CIM hardware, it remains a goal for an RRAM-CIM chip to simultaneously deliver high energy efficiency, versatility to support diverse models and software-comparable accuracy. Although efficiency, versatility and accuracy are all indispensable for broad adoption of the technology, the inter-related trade-offs among them cannot be addressed by isolated improvements on any single abstraction level of the design. Here, by co-optimizing across all hierarchies of the design, from algorithms and architecture to circuits and devices, we present NeuRRAM, an RRAM-based CIM chip that simultaneously delivers versatility in reconfiguring CIM cores for diverse model architectures, energy efficiency twice that of previous state-of-the-art RRAM-CIM chips across various computational bit-precisions, and inference accuracy comparable to software models quantized to four-bit weights across various AI tasks, including 99.0% accuracy on MNIST and 85.7% on CIFAR-10 image classification, 84.7% accuracy on Google speech command recognition, and a 70% reduction in image-reconstruction error on a Bayesian image-recovery task.
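The in-memory matrix-vector multiplication at the heart of RRAM-CIM can be illustrated in a few lines. This is a minimal sketch, not the NeuRRAM implementation: weights are stored as cell conductances, row voltages encode the input vector, and each column wire sums the per-cell currents (Ohm's law plus Kirchhoff's current law), so the multiply-accumulate happens in the array itself.

```python
# Illustrative sketch of analog compute-in-memory matrix-vector multiplication.
# All names and values are hypothetical, chosen only to show the principle.

def analog_mvm(conductances, voltages):
    """conductances[i][j]: conductance of the RRAM cell at row i, column j (S).
    voltages[i]: voltage driven onto row i (V).
    Returns the per-column output currents (A)."""
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            # Each cell contributes I = G * V; currents sum on the column wire.
            currents[j] += conductances[i][j] * voltages[i]
    return currents

# A 2x2 example: each column accumulates a weighted sum of the row voltages.
G = [[1.0, 2.0],
     [3.0, 4.0]]
V = [0.5, 1.0]
print(analog_mvm(G, V))  # -> [3.5, 5.0]
```

In hardware this whole double loop collapses into a single analog read step, which is where the energy advantage over digital data movement comes from.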
While gene length is expected to be associated with factors such as intron number and evolutionary conservation, the connections between gene length and function in the human genome are not yet understood. In this study, we show that, as expected, there is a strong positive correlation between gene length, transcript length, and protein size, as well as a correlation with the number of genetic variants and introns. Among tissue-specific genes, we find that the longest transcripts tend to be expressed in the blood vessels, nerves, thyroid, cervix uteri, and the brain, while the shortest transcripts tend to be expressed in the pancreas, skin, stomach, vagina, and testis. We report, as shown previously, that natural selection suppresses changes in genes with longer transcripts and promotes changes in genes with shorter transcripts. We also observe that genes with longer transcripts tend to have a higher number of co-expressed genes and protein–protein interactions, as well as more associated publications. In the functional analysis, we show that longer transcripts are often associated with neuronal development, while shorter transcripts tend to play roles in skin development and in the immune system. Furthermore, pathways related to cancer, neurons, and heart diseases tend to have genes with longer transcripts, with shorter transcripts being present in pathways related to immune responses and neurodegenerative diseases. Based on our results, we hypothesize that longer genes tend to be associated with functions that are important in the early development stages, while shorter genes tend to play a role in functions that are important throughout the whole life, like the immune system, which requires fast responses.
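The length–function relationships described above rest on standard correlation analysis. A minimal sketch of the Pearson correlation used for such comparisons, with hypothetical transcript lengths and interaction counts (not the study's data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: transcript length (bp) vs. number of protein-protein
# interaction partners for four genes.
lengths = [2_000, 15_000, 40_000, 90_000]
partners = [3, 8, 14, 21]
print(pearson(lengths, partners))  # strong positive correlation (close to 1)
```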
Type 2 diabetes (T2D) and its secondary complications result from the complex interplay of genetic and environmental factors. To understand the role of these factors in disease susceptibility, the present study was conducted to assess the association of eNOS and MCP-1 variants with T2D and diabetic nephropathy (DN) in two ethnically and geographically different cohorts from North India. A total of 1313 subjects from two cohorts were genotyped for eNOS (rs2070744, rs869109213 and rs1799983) and MCP-1 (rs1024611 and rs3917887) variants. Cohort-I (Punjab) comprised 461 T2D cases (204 T2D with DN and 257 T2D without DN) and 315 healthy controls. Cohort-II (Jammu and Kashmir) included 337 T2D cases (150 T2D with DN and 187 T2D without DN) and 200 controls. Allele, genotype and haplotype frequencies were compared among the studied participants, and phenotype–genotype interactions were determined. A meta-analysis was performed to investigate the association between the selected variants and disease susceptibility. All three eNOS variants were associated with a 1.5–4.0-fold risk of DN in both cohorts. MCP-1 rs1024611 conferred a twofold risk of DN progression in Cohort-II, while rs3917887 conferred a twofold risk of both T2D and DN in both cohorts. eNOS and MCP-1 haplotypes conferred risk for T2D and DN susceptibility. Phenotype–genotype interactions showed significant associations between the studied variants and anthropometric and biochemical parameters. In the meta-analysis, all eNOS variants conferred risk of DN progression, whereas no significant association was observed for MCP-1 rs1024611. We provide evidence for an association of eNOS and MCP-1 variants with T2D and DN susceptibility.
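The meta-analysis step can be illustrated with a standard fixed-effect, inverse-variance pooling of log odds ratios. This is a generic sketch, not the study's code, and the per-cohort numbers below are hypothetical:

```python
import math

def pooled_odds_ratio(studies):
    """Fixed-effect inverse-variance meta-analysis on the log-OR scale.
    studies: list of (odds_ratio, ci_lower, ci_upper) at 95% confidence.
    Returns the pooled OR and its 95% CI."""
    num = den = 0.0
    for or_, lo, hi in studies:
        log_or = math.log(or_)
        # Recover the standard error from the width of the 95% CI.
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
        w = 1.0 / se ** 2          # inverse-variance weight
        num += w * log_or
        den += w
    pooled = num / den
    se_pooled = math.sqrt(1.0 / den)
    ci = (math.exp(pooled - 1.96 * se_pooled),
          math.exp(pooled + 1.96 * se_pooled))
    return math.exp(pooled), ci

# Two hypothetical cohort results: (OR, 95% CI lower, 95% CI upper).
or_pooled, ci = pooled_odds_ratio([(1.8, 1.2, 2.7), (2.2, 1.4, 3.5)])
print(round(or_pooled, 2), tuple(round(v, 2) for v in ci))
```

Studies with tighter confidence intervals get larger weights, so a precise cohort dominates the pooled estimate.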
Osteoarthritis, the most common joint disorder, is characterised by deterioration of the articular cartilage. Many studies have identified potential therapeutic targets, yet no effective treatment has been established. The aim of this study was to identify and rank osteoarthritis-associated genes and micro-RNAs to prioritise those most integral to the disease. A systematic meta-analysis of differentially expressed mRNAs and micro-RNAs in human osteoarthritic cartilage was conducted. Ingenuity pathway analysis identified cellular senescence as an enriched pathway, confirmed by a significant overlap (p < 0.01) with cellular senescence drivers (CellAge database). A co-expression network was built using genes from the meta-analysis as seed nodes and combined with micro-RNA targets and SNP datasets to construct a multi-source information network. This accumulated and connected 1689 genes, which were ranked based on aggregated node and edge scores. These bioinformatic analyses were confirmed at the protein level by mass spectrometry of the different zones of human osteoarthritic cartilage (superficial, middle, and deep) compared to normal controls. This analysis, and subsequent experimental confirmation, revealed five novel osteoarthritis-associated proteins (PPIB, ASS1, LDHB, TPI1, and ARPC4-TTLL3). Focusing future studies on these novel targets may lead to new therapies for osteoarthritis.
Timeloop: A Systematic Approach to DNN Accelerator Evaluation
Parashar, Angshuman; Raina, Priyanka; Shao, Yakun Sophia; et al.
2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
03/2019
Conference Proceeding
This paper presents Timeloop, an infrastructure for evaluating and exploring the architecture design space of deep neural network (DNN) accelerators. Timeloop uses a concise and unified representation of the key architecture and implementation attributes of DNN accelerators to describe a broad space of hardware topologies. It can then emulate those topologies to generate an accurate projection of performance and energy efficiency for a DNN workload through a mapper that finds the best way to schedule operations and stage data on the specified architecture. This enables fair comparisons across different architectures and makes DNN accelerator design more systematic. This paper describes Timeloop's underlying models and algorithms in detail and shows results from case studies enabled by Timeloop, which provide interesting insights into the current state of DNN architecture design. In particular, they reveal that dataflow and memory hierarchy co-design plays a critical role in optimizing energy efficiency. Also, there is still no single architecture that achieves the best performance and energy efficiency across a diverse set of workloads, owing to flexibility and efficiency trade-offs. These results provide inspiration for possible directions in DNN accelerator research.
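The idea behind an analytical model like Timeloop's can be reduced to a few lines: for a candidate mapping, count the accesses to each memory level and weight them by a per-access energy cost. This is a toy sketch with hypothetical energy numbers, not Timeloop's actual model:

```python
# Hypothetical per-access energy costs (pJ) for a three-level hierarchy.
ENERGY_PJ = {"DRAM": 200.0, "global_buffer": 6.0, "register": 0.1}

def mapping_energy(accesses):
    """accesses: {level: number of accesses implied by a given loop schedule}.
    Returns total data-movement energy in pJ."""
    return sum(accesses[level] * ENERGY_PJ[level] for level in accesses)

# Two hypothetical schedules of the same workload: the tiled one stages more
# data in the buffer, trading buffer traffic for far fewer DRAM accesses.
naive = {"DRAM": 1_000_000, "global_buffer": 2_000_000, "register": 8_000_000}
tiled = {"DRAM": 50_000, "global_buffer": 6_000_000, "register": 8_000_000}
print(mapping_energy(naive) > mapping_energy(tiled))  # -> True: tiling wins
```

A mapper in the Timeloop sense searches over many such schedules and picks the one minimizing a cost like this, which is why dataflow and memory hierarchy must be co-designed.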
Camera shake is a common cause of blur in cell-phone camera images. Removing blur requires deconvolving the blurred image with a kernel, which is typically unknown and needs to be estimated from the blurred image. This kernel estimation is computationally intensive and takes several minutes on a CPU, which makes it unsuitable for mobile devices. This paper presents the first hardware accelerator for kernel estimation for image deblurring applications. Our approach, using a multi-resolution iteratively reweighted least squares deconvolution engine with DFT-based matrix multiplication, a high-throughput image correlator, and a high-speed selective-update-based gradient projection solver, achieves a 78× reduction in kernel estimation runtime, and a 56× reduction in total deblurring time for a 1920 × 1080 image, enabling quick feedback to the user. Configurability in kernel size and number of iterations gives up to ten times energy scalability, allowing the system to trade off runtime with image quality. The test chip, fabricated in TSMC 40-nm CMOS technology, consumes 105 mJ for kernel estimation running at 83 MHz and 0.9 V, making it suitable for integration into mobile devices.
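The iteratively reweighted least squares (IRLS) principle behind the deconvolution engine can be shown in its simplest form: each pass solves an easy weighted least-squares problem whose weights come from the previous residuals, so an L1 (robust, outlier-resistant) objective is minimized using only L2 machinery. This sketch estimates a robust scalar location rather than a full deconvolution kernel:

```python
def irls_l1_mean(data, iters=50, eps=1e-8):
    """Minimize sum |x - d_i| by iteratively reweighted least squares.
    Each iteration solves a weighted L2 problem with weights 1/|residual|,
    which converges to the L1 (median-like) solution."""
    x = sum(data) / len(data)  # start from the L2 solution (the mean)
    for _ in range(iters):
        w = [1.0 / max(abs(x - d), eps) for d in data]  # eps avoids div-by-zero
        x = sum(wi * di for wi, di in zip(w, data)) / sum(w)
    return x

# One large outlier drags the mean to 26.5, but IRLS converges near the
# median (between 2 and 3), ignoring the outlier.
print(irls_l1_mean([1.0, 2.0, 3.0, 100.0]))
```

The full deblurring problem replaces the scalar with an unknown kernel and the absolute values with a sparsity prior on image gradients, but the reweight-then-solve loop is the same, which is why it maps well to dedicated hardware.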
While coarse-grained reconfigurable arrays (CGRAs) have emerged as promising programmable accelerator architectures, they require automatic pipelining of applications during their compilation flow to achieve high performance. Current CGRA compilers either lack pipelining altogether, resulting in low application performance, or perform exhaustive pipelining, resulting in high power and resource consumption. We address these challenges by proposing Cascade, an end-to-end open-source application compiler for CGRAs that achieves both state-of-the-art performance and fast compilation times. The contributions of this work are: (1) a novel post place-and-route (PnR) application pipelining technique for CGRAs that accounts for interconnect hop delays during pipelining, but in a unique way that avoids cyclic scheduling and place-and-route; (2) a register resource usage optimization technique that leverages the scheduling logic in CGRA memory tiles to minimize the number of register resources used during pipelining; and (3) an automated CGRA timing model generator, an application timing analysis tool, and a large set of existing and novel application pipelining techniques integrated into an end-to-end compilation flow. Cascade achieves 8-34× lower critical path delay and 7-190× lower energy-delay product (EDP) across a variety of dense image processing and machine learning workloads, and 3-5.2× lower critical path delay and 2.5-5.2× lower EDP on sparse workloads, compared to a compiler without pipelining. Cascade mitigates the performance and energy-efficiency drawbacks of existing CGRA compilers, and enables further research into CGRAs as flexible, yet competitive accelerator architectures.
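The post-PnR pipelining idea (cutting a routed path with registers so that no segment of accumulated hop delay exceeds the clock period) can be sketched greedily. This is a simplified illustration with hypothetical delays, not Cascade's algorithm:

```python
def insert_pipeline_registers(hop_delays, target_period):
    """hop_delays: combinational delay (ns) of each interconnect hop along a
    routed path, in order. Greedily insert a pipeline register whenever the
    accumulated delay would exceed the target clock period.
    Returns the hop indices before which a register is placed."""
    registers = []
    acc = 0.0
    for i, d in enumerate(hop_delays):
        if acc + d > target_period:
            registers.append(i)  # cut the path before this hop
            acc = 0.0
        acc += d
    # Note: a single hop longer than target_period cannot be cut further and
    # sets the floor on the achievable clock period.
    return registers

# Hop delays along a routed path; with a 2 ns target, three registers cut the
# 5.5 ns combinational path so every segment fits within one clock period.
print(insert_pipeline_registers([1.0, 0.8, 0.9, 1.5, 1.3], 2.0))  # -> [2, 3, 4]
```

Working on the placed-and-routed design means the hop delays are known exactly, which is what lets this style of pipelining avoid re-running scheduling and place-and-route.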
Simba
Shao, Yakun Sophia; Clemons, Jason; Venkatesan, Rangharajan; et al.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
10/2019
Conference Proceeding
Package-level integration using multi-chip modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically contain only a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
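Why tiling choice matters for inter-chiplet traffic can be shown with a toy communication model. The model and all byte counts below are hypothetical, not Simba's tiling optimizations:

```python
# Toy model: when one DNN layer is split across chiplets, the partitioning
# axis determines how much data must cross the package-level interconnect.

def inter_chiplet_traffic(n_chiplets, input_bytes, output_bytes, split):
    """split='output_channels': every chiplet needs the full input (broadcast)
    and produces a private slice of the output.
    split='input_channels': the input is partitioned, but each chiplet holds
    only partial sums, which must be gathered and reduced for the full output."""
    if split == "output_channels":
        return n_chiplets * input_bytes            # broadcast the input
    if split == "input_channels":
        return (n_chiplets - 1) * output_bytes     # gather partial sums
    raise ValueError(split)

# For a layer with a small input and a large output, splitting by output
# channels moves far less data across the 36-chiplet package.
print(inter_chiplet_traffic(36, 64_000, 512_000, "output_channels"))  # 2304000
print(inter_chiplet_traffic(36, 64_000, 512_000, "input_channels"))   # 17920000
```

Locality-aware tiling in this spirit (choosing the split so the bulk data stays chiplet-local) is the kind of optimization that recovers performance lost to inter-chiplet overheads.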