Jigsaw: Scalable software-defined caches
Beckmann, Nathan; Sanchez, Daniel
Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques,
2013-Sept.
Conference Proceeding
Shared last-level caches, widely used in chip multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques provide isolation but do not reduce access latency.
Chip multiprocessors (CMPs) with more cores generate more traffic to the last-level cache (LLC). Without a corresponding increase in LLC bandwidth, such traffic cannot be sustained, resulting in performance degradation. Previous research focused on data placement techniques to improve access latency in Non-Uniform Cache Architectures (NUCA): placing data closer to the requesting core reduces traffic in the cache interconnect. However, earlier data placement work did not account for the frequency with which specific memory references are accessed, largely because of the difficulty of tracking access frequency for all memory references. In this research, we present a hardware-assisted solution called ACTION (Adaptive Cache Block Migration) to track the access frequency of individual memory references and prioritize the placement of frequently referenced data closer to the affine core. The ACTION mechanism migrates cache blocks when there is a detectable change in access frequencies due to a shift in program phase. ACTION counts access references in the LLC stream using a simple, approximate method and uses a straightforward placement and migration policy to keep the hardware overhead low. We evaluate ACTION on a 4-core CMP with a 5x5-mesh LLC network implementing a partitioned D-NUCA, against workloads exhibiting distinct asymmetry in cache block access frequency. Our simulation results indicate that ACTION can improve CMP performance by up to 7.5% over state-of-the-art (SOTA) D-NUCA solutions.
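The abstract gives no implementation details, but the "simple and approximate" access counting it mentions can be illustrated with a minimal sketch. All names, table sizes, and thresholds below are assumptions for illustration, not the ACTION design: a small table of saturating counters, indexed by a hash of the block address (so distinct blocks may alias), with periodic decay to adapt to program phase changes.

```python
# Hypothetical sketch of approximate access-frequency tracking for
# NUCA block placement, in the spirit of the ACTION mechanism above.
# Table size, counter width, and threshold are illustrative assumptions.

class FrequencyTracker:
    """Track per-block LLC access counts with small saturating counters."""

    def __init__(self, num_counters=1024, max_count=15):
        self.num_counters = num_counters
        self.max_count = max_count          # e.g., 4-bit saturating counters
        self.counters = [0] * num_counters

    def _index(self, block_addr):
        # Hash the block address into the counter table (approximate:
        # distinct blocks may alias to the same counter).
        return hash(block_addr) % self.num_counters

    def access(self, block_addr):
        i = self._index(block_addr)
        if self.counters[i] < self.max_count:
            self.counters[i] += 1
        return self.counters[i]

    def is_hot(self, block_addr, threshold=8):
        # Blocks above the threshold are candidates for migration
        # toward the bank nearest the requesting (affine) core.
        return self.counters[self._index(block_addr)] >= threshold

    def decay(self):
        # Periodic halving lets the tracker follow program phase changes:
        # formerly hot blocks cool off unless they keep being accessed.
        self.counters = [c // 2 for c in self.counters]
```

A placement engine would consult `is_hot` on each LLC access and trigger migration only when a block's hotness classification changes, keeping migration traffic low.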
A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor performance, power dissipation, and thermal management strategies. There are a number of interconnect design considerations that influence the power/performance/area characteristics of large caches, such as wire models (width/spacing/repeaters), signaling strategy (RC/differential/transmission), router design, etc. Yet, to date, there exists no analytical tool that takes all of these parameters into account to carry out a design space exploration for large caches and estimate an optimal organization. In this work, we implement two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache. First, we add the ability to model different types of wires, such as RC-based wires with different power/delay characteristics and differential low-swing buses. Second, we add the ability to model non-uniform cache access (NUCA). We not only adopt state-of-the-art design space exploration strategies for NUCA, but also enhance this exploration by considering on-chip network contention and a wider spectrum of wiring and routing choices. We present a validation analysis of the new tool (to be released as CACTI 6.0) and a case study to showcase how the tool can improve architecture research methodologies.
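The kind of wire trade-off a CACTI-style tool explores can be sketched with back-of-the-envelope Elmore-delay arithmetic. The resistance/capacitance constants and repeater delay below are illustrative placeholders, not CACTI's calibrated technology parameters: an unrepeated RC wire has delay quadratic in length, while inserting repeaters makes it roughly linear.

```python
# Back-of-the-envelope sketch of one interconnect trade-off a
# CACTI-style tool evaluates: distributed-RC (Elmore) delay of an
# unrepeated wire vs. the same wire split into repeated segments.
# All constants are illustrative, not calibrated process values.

def elmore_wire_delay(length_mm, r_per_mm=2000.0, c_per_mm=0.2e-12):
    """Delay (seconds) of an unrepeated distributed-RC wire.
    Grows quadratically with length: R and C both scale with length."""
    return 0.38 * (r_per_mm * length_mm) * (c_per_mm * length_mm)

def repeated_wire_delay(length_mm, segments, t_repeater=20e-12,
                        r_per_mm=2000.0, c_per_mm=0.2e-12):
    """Repeaters make delay roughly linear in length: each segment pays
    its own (much shorter) quadratic wire delay plus a fixed buffer delay."""
    seg = length_mm / segments
    return segments * (elmore_wire_delay(seg, r_per_mm, c_per_mm) + t_repeater)
```

With these toy constants, a 10 mm unrepeated wire is an order of magnitude slower than the same wire with ten repeaters, which is why a design-space exploration must co-optimize wire type, repeater spacing, and the resulting power cost.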
The datacenter introduces new challenges for computer systems around tail latency and security. This paper argues that dynamic NUCA techniques are a better solution to these challenges than prior cache designs. We show that dynamic NUCA designs can meet tail-latency deadlines with much less cache space than prior work, and that they also provide a natural defense against cache attacks. Unfortunately, prior dynamic NUCAs have missed these opportunities because they focus exclusively on reducing data movement. We present Jumanji, a dynamic NUCA technique designed for tail latency and security. We show that prior last-level cache designs are vulnerable to new attacks and offer imperfect performance isolation. Jumanji solves these problems while significantly improving performance of co-running batch applications. Moreover, Jumanji only requires lightweight hardware and a few simple changes to system software, similar to prior D-NUCAs. At 20 cores, Jumanji improves batch weighted speedup by 14% on average, vs. just 2% for a non-NUCA design with weaker security, and is within 2% of an idealized design.
The non-uniform distribution of memory accesses among cache sets results in some sets being used heavily while others remain underutilized. Dynamic associativity management (DAM) is a technique that allows heavily used sets to distribute their load among lightly used sets, improving the overall utilization of the cache. CMP-SVR is a previously proposed DAM-based technique in which each set is divided into two sections: normal storage (NT) and reserve storage (RT). Some of the ways (25 to 50 percent) of each set are reserved for RT and the remaining ways belong to NT. The sets are divided into groups called fellow-groups, and a set can use the reserve ways of its fellow sets to increase its associativity during execution. Though CMP-SVR improves performance, the formation of its fellow-groups is static: once created, they never change. It has been observed that some fellow-groups contain more heavily used sets than others, so cache load is not uniformly distributed among the fellow-groups. Also, the behavior of sets changes dynamically: a lightly used set may become heavily used after a number of execution cycles. This paper studies the behavior of each set in detail and proposes a DAM-based technique that improves performance compared to other DAM-based techniques. The proposed technique, called FS-DAM, dynamically creates fellow-groups based on the current set loads, ensuring that heavily used sets are evenly distributed among all the fellow-groups. Such distribution increases the utilization of the cache and hence improves performance. Full-system simulation shows average improvements of 6.62 and 16.74 percent for FS-DAM over CMP-SVR in terms of CPI (Cycles Per Instruction) and MPKI (Misses Per Kilo Instructions), respectively. Compared with Z-Cache, the improvements are 6.21 percent (CPI) and 14.65 percent (MPKI). The proposed policy also shows better performance than V-Way and SBC.
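The core idea of forming fellow-groups so that heavily used sets are spread evenly can be illustrated with a small sketch. The abstract does not specify the grouping algorithm, so the round-robin dealing below is an assumption for illustration: rank sets by observed load, then deal them across groups like cards, so no group collects more than its share of hot sets.

```python
# Illustrative sketch of FS-DAM-style dynamic fellow-group formation.
# The ranking-and-dealing scheme is an assumption, not the paper's
# algorithm: sets are sorted by measured load and distributed
# round-robin so heavy sets end up spread evenly across groups.

def form_fellow_groups(set_loads, num_groups):
    """set_loads: list of per-set access counts, indexed by set number.
    Returns a list of fellow-groups, each a list of set indices."""
    # Rank set indices by load, heaviest first.
    ranked = sorted(range(len(set_loads)), key=lambda s: -set_loads[s])
    groups = [[] for _ in range(num_groups)]
    # Deal sets like cards: every group receives one heavy set before
    # any group receives a second, balancing hot sets across groups.
    for i, s in enumerate(ranked):
        groups[i % num_groups].append(s)
    return groups
```

Recomputing this grouping periodically (rather than once, as in CMP-SVR) lets the cache track sets whose behavior changes over time.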
NUCA caches have traditionally been proposed as a solution for mitigating wire delays and delays introduced by complex networks on chip. Traditional approaches have reported significant performance gains with intelligent block placement, location, replication, and migration schemes. In this paper, we propose a novel approach in this space, called FP-NUCA. It differs from conventional approaches and relies on a novel method of co-designing the last-level cache and the network on chip. We artificially constrain the communication pattern in the NUCA cache such that all messages travel along a few predefined paths (fast paths) for each set of banks. We leverage this communication pattern by designing a new type of NoC router called the Freeze router, which augments a regular router with a layer of circuitry that gates the clock of the regular router when there is a fast path message waiting to be transmitted. Messages along the fast path do not require buffering, switching, or routing. We incorporate a bank predictor with our novel NoC to reduce the number of messages and the resultant energy consumption. We compare our performance with state-of-the-art protocols and report speedups of up to 31 percent (mean: 6.3 percent), and ED2 reduction of up to 46 percent (mean: 10.4 percent) for a suite of Splash and Parsec benchmarks. We implement the Freeze router in VHDL and show that the additional fast path logic has minimal area and timing overheads.
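The Freeze router's arbitration policy, as described above, can be captured in a toy behavioral model. This is purely illustrative (the real design is per-cycle hardware in VHDL, and the class and queue names here are assumptions): whenever a fast-path message is waiting, it is forwarded immediately and the regular router pipeline is effectively frozen.

```python
# Toy behavioral model of Freeze-router arbitration: a waiting
# fast-path message gates ("freezes") the regular router and bypasses
# its buffering/switching/routing stages. Names are illustrative.

from collections import deque

class FreezeRouter:
    def __init__(self):
        self.regular_queue = deque()    # messages using the full router
        self.fast_path_queue = deque()  # messages on a predefined fast path

    def cycle(self):
        """Forward at most one message per cycle. Fast-path traffic has
        strict priority, modeling the clock-gated regular pipeline."""
        if self.fast_path_queue:
            return ("fast", self.fast_path_queue.popleft())
        if self.regular_queue:
            return ("regular", self.regular_queue.popleft())
        return None
```

In this model, a bank predictor would decide which messages are eligible for the fast path in the first place, reducing total message count.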
Process variations in integrated circuits have significant impact on their performance, leakage, and stability. This is particularly evident in large, regular, and dense structures such as DRAMs. DRAMs are built using minimized transistors, with presumably uniform speed, in an organized array structure. Process variation can introduce latency disparity among different memory arrays. With the proliferation of 3D stacking technology, DRAMs become a favorable choice for stacking on top of a multicore processor as a last-level cache for large capacity, high bandwidth, and low power. Hence, variations in bank speed create a unique problem of nonuniform cache accesses in 3D space. In this paper, we investigate cache management techniques for tolerating process variation in a 3D DRAM stacked onto a multicore processor. We modeled the process variation in a four-layer DRAM memory, including cell transistor, capacitor trench, and peripheral circuit, to characterize the latency and retention time variations among different banks. As a result, the notion of fast and slow banks from a core's standpoint is no longer associated with the core's physical distance to the banks; it is determined by the different bank latencies due to process variation. We develop cache migration schemes that utilize fast banks while limiting the cost due to migration. Our experiments show that there is a great performance benefit in exploiting fast memory banks through migration. On average, variation-aware management can improve the performance of a workload over the baseline (where the slowest bank's speed is assumed for all banks) by 16.5 percent. We are also only 0.8 percent away in performance from an ideal memory where no process variation is present.
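The placement half of such a variation-aware scheme can be sketched simply. The greedy policy and all names below are illustrative assumptions, not the paper's mechanism: given per-bank latencies measured after fabrication and per-block access counts, steer the hottest blocks into the fastest banks until they fill up.

```python
# Sketch of a variation-aware placement policy in the spirit of the
# abstract above: most frequently accessed blocks go to the fastest
# banks. The greedy assignment and capacity model are assumptions.

def variation_aware_placement(block_hotness, bank_latency, bank_capacity):
    """block_hotness: {block: access_count}; bank_latency: {bank: cycles};
    bank_capacity: blocks per bank. Returns {block: bank}."""
    banks = sorted(bank_latency, key=bank_latency.get)       # fastest first
    blocks = sorted(block_hotness, key=block_hotness.get, reverse=True)
    placement, used = {}, {b: 0 for b in banks}
    for blk in blocks:
        for bank in banks:            # first fastest bank with room left
            if used[bank] < bank_capacity:
                placement[blk] = bank
                used[bank] += 1
                break
    return placement
```

A migration scheme would then move blocks between banks only when their hotness ranking changes enough to outweigh the migration cost.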
In chip multiprocessors (CMPs), nonuniform cache architecture (NUCA) is often employed to organize last-level cache (LLC) banks through a network-on-chip (NoC). Because of shrinking feature sizes and unstable operating environments, severe reliability problems unavoidably emerge and cause frequent failures of on-chip components (e.g., cores, cache banks, routers). Typical fault-tolerant CMPs should degrade gracefully and function normally with deactivated tiles. However, for CMPs adopting static NUCA, certain physical address areas become inaccessible when cache banks in a CMP node are isolated from the system. To protect the cache from such threats, induced by either online or offline faults, we survey several potential solutions and propose a utility-driven node remapping technique that reuses the resources in the NoC. In our NoC-assisted remapping scheme, cache accesses to isolated banks are redirected so that cache space contention in the shared LLC is balanced and relieved, ensuring the least performance penalty caused by fault isolation. Our experimental results show significant performance improvement over conventional resizing approaches such as set reduction.
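The redirection idea can be illustrated with a minimal sketch. The load metric and greedy target selection below are assumptions for illustration, not the paper's utility function: traffic destined for each deactivated bank is remapped to the surviving bank that is currently least loaded, so contention stays balanced instead of piling onto one neighbor.

```python
# Illustrative sketch of utility-driven node remapping: addresses that
# mapped to a deactivated bank are redirected to a surviving bank
# chosen to balance load, rather than shrinking the cache via set
# reduction. The load metric and greedy choice are assumptions.

def build_remap(bank_load, failed_banks):
    """bank_load: {bank: observed access load}; failed_banks: set of
    deactivated banks. Returns {failed_bank: surviving target bank}."""
    live = {b: load for b, load in bank_load.items() if b not in failed_banks}
    remap = {}
    # Redirect the most-loaded failed bank first, each time to the
    # currently least-loaded live bank, keeping contention balanced.
    for fb in sorted(failed_banks, key=lambda b: -bank_load[b]):
        target = min(live, key=live.get)
        remap[fb] = target
        live[target] += bank_load[fb]   # account for the redirected traffic
    return remap
```

In a real system the resulting table would sit in the address-to-bank translation path, so redirected accesses simply take a different NoC route.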