We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) ...accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect ofenvironmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W ) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.
On-chip memory (usually based on Static RAMs-SRAMs) are crucial components for various computing devices including heterogeneous devices, e.g., GPUs, FPGAs, and ASICs, to achieve high performance. ...Modern workloads such as deep neural networks (DNNs) running on these heterogeneous fabrics are highly dependent on the on-chip memory architecture for efficient acceleration. Hence, improving the energy efficiency of such memories directly leads to an efficient system. One of the common methods to save energy is undervolting, i.e., supply voltage underscaling below the nominal level. Such systems can be safely undervolted without incurring faults down to a certain voltage limit. This safe range is also called voltage guardband. However, reducing voltage below the guardband level without decreasing frequency causes timing-based faults. In this article, we propose MoRS, a framework that generates the first approximate undervolting fault model using real faults extracted from experimental undervolting studies on SRAMs to build the model. We inject the faults generated by MoRS into the on-chip memory of the DNN accelerator to evaluate the resilience of the system under the test. MoRS has the advantage of simplicity without any need for high-time overhead experiments while being accurate enough in comparison to a fully randomly generated fault injection approach. We evaluate our experiment in popular DNN workloads by mapping weights to SRAMs and measure the accuracy difference between the output of the MoRS and the real data. Our results show that the maximum difference between real fault data and the output fault model of MoRS is 6.21%, whereas the maximum difference between real data and random fault injection model is 23.2%. In terms of average proximity to the real data, the output of MoRS outperforms the random fault injection approach by <inline-formula> <tex-math notation="LaTeX">3.21\times </tex-math></inline-formula>.
Vector architectures lack tools for research. Consider the gem5 simulator, which is possibly the leading platform for computer-system architecture research. Unfortunately, gem5 does not have an ...available distribution that includes a flexible and customizable vector architecture model. In consequence, researchers have to develop their own simulation platform to test their ideas, which consume much research time. However, once the base simulator platform is developed, another question is the following: Which applications should be tested to perform the experiments? The lack of Vectorized Benchmark Suites is another limitation. To face these problems, this work presents a set of tools for designing and evaluating vector architectures. First, the gem5 simulator was extended to support the execution of RISC-V Vector instructions by adding a parameterizable Vector Architecture model for designers to evaluate different approaches according to the target they pursue. Second, a novel Vectorized Benchmark Suite is presented: a collection composed of seven data-parallel applications from different domains that can be classified according to the modules that are stressed in the vector architecture. Finally, a study of the Vectorized Benchmark Suite executing on the gem5-based Vector Architecture model is highlighted. This suite is the first in its category that covers the different possible usage scenarios that may occur within different vector architecture designs such as embedded systems, mainly focused on short vectors, or High-Performance-Computing (HPC), usually designed for large vectors.
With the continued scaling of NAND flash and multi-level cell technology, flash-based storage has gained widespread use in systems ranging from mobile platforms to enterprise servers. However, the ...robustness of NAND flash cells is an increasing concern, especially at nanometer-regime process geometries. NAND flash memory bit error rate increases exponentially with the number of program/erase cycles. Stronger error correcting codes (ECC) can be used to tolerate higher error rates, but these have diminishing returns with increasing P/E cycles and can have prohibitively high power, area, and latency overheads. The goal of this paper is to develop new techniques that can tolerate high bit error rates without requiring prohibitively strong ECC. Our techniques, called Flash Correct-and-Refresh (FCR) exploit the observation that the dominant error source in NAND flash memory is retention errors, caused by flash cells losing charge over time. The key idea is to periodically read, correct, and reprogram (in-place) or remap the stored data before it accumulates more retention errors than can be corrected by simple ECC. Detailed simulations of a solid-state drive (SSD) storage system driven by measured experimental data from error characterization on real flash memory chips show that our techniques provide 46× average lifetime improvement on a variety of workloads at no additional hardware cost. We also find that our techniques achieve lifetime improvements that cannot feasibly be achieved with stronger ECC.
Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach ...50% for modern workloads that access increasingly vast memory with stagnating TLB sizes.
To reduce the overhead of virtual memory, this paper proposes Redundant Memory Mappings (RMM), which leverage ranges of pages and provides an efficient, alternative representation of many virtual-to-physical mappings. We define a range be a subset of process's pages that are virtually and physically contiguous. RMM translates each range with a single range table entry, enabling a modest number of entries to translate most of the process's address space. RMM operates in parallel with standard paging and uses a software range table and hardware range TLB with arbitrarily large reach. We modify the operating system to automatically detect ranges and to increase their likelihood with eager page allocation. RMM is thus transparent to applications.
We prototype RMM software in Linux and emulate the hardware. RMM performs substantially better than paging alone and huge pages, and improves a wider variety of workloads than direct segments (one range per program), reducing the overhead of virtual memory to less than 1% on average.
Range Translations for Fast Virtual Memory Gandhi, Jayneel; Karakostas, Vasileios; Ayar, Furkan ...
IEEE MICRO,
2016-May-June, 2016-5-00, 20160501, Letnik:
36, Številka:
3
Journal Article
Recenzirano
Modern workloads suffer high execution-time overhead due to page-based virtual memory. The authors introduce range translations that map arbitrary-sized virtual memory ranges to contiguous physical ...memory pages while retaining the flexibility of paging. A range translation reduces address translation to a range lookup that delivers near zero virtual memory overhead.
Multi-core processors are ubiquitous in all market segments from embedded to high performance computing, but only few applications can efficiently utilize them. Existing parallel frameworks aim to ...support thread-level parallelism in applications, but the imposed overhead prevents their usage for small problem instances. This work presents Micro-threads (Mth) a hardware-software proposal focused on a shared thread management model enabling the use of parallel resources in applications that have small chunks of parallel code or small problem inputs by a combination of software and hardware: delegation of the resource control to the application, an improved mechanism to store and fill processor's context, and an efficient synchronization system. Four sample applications are used to test our proposal: HSL filter (trivially parallel), FFT Radix2 (recursive algorithm), LU decomposition (barrier every cycle) and Dantzig algorithm (graph based, matrix manipulation). The results encourage the use of Mth and could smooth the use of multiple cores for applications that currently can not take advantage of the proliferation of the available parallel resources in each chip.
Writing applications that benefit from the massive computational power of future multicore chip multiprocessors will not be an easy task for mainstream programmers accustomed to sequential algorithms ...rather than parallel ones. This article presents a survey of transactional memory, a mechanism that promises to enable scalable performance while freeing programmers from some of the burden of modifying their parallel code.
Architectural simulator platforms are particularly complex and error-prone programs that aim to simulate all hardware details of a given target architecture. Development of a stable cycle-accurate ...architectural simulator can easily take several man-years. Discovering and fixing all visible errors in a simulator often requires significant effort, much higher than for writing the simulator code in the first place. In addition, there are no guarantees that all programming errors will be eliminated, no matter how much effort is put into testing and debugging. This paper presents dynamic runtime testing, a methodology for rapid development and accurate detection of functional bugs in architectural cycle-accurate simulators. Dynamic runtime testing consists of comparing an execution of a cycle-accurate simulator with an execution of a simple and functionally equivalent emulator. Dynamic runtime testing detects a possible functional error if there is a mismatch between the execution in the simulator and the emulator. Dynamic runtime testing provides a reliable and accurate verification of a simulator, during its entire development cycle, with very acceptable performance impact, and without requiring complex setup for the simulator execution. Based on our experience, dynamic testing reduced the simulator modification time from 12–18 person-months to only 3–4 person-months, while it only modestly reduced the simulator performance (in our case under 20 %).