Energy costs for data centers continue to rise, already exceeding $15 billion yearly. Sadly, much of this power is wasted. Servers are only busy 10--30% of the time on average, but they are often left on while idle, utilizing 60% or more of peak power in the idle state.
We introduce a dynamic capacity management policy, AutoScale, that greatly reduces the number of servers needed in data centers driven by unpredictable, time-varying load, while meeting response time SLAs. AutoScale scales the data center capacity, adding or removing servers as needed. AutoScale has two key features: (i) it autonomically maintains just the right amount of spare capacity to handle bursts in the request rate; and (ii) it is robust not just to changes in the request rate of real-world traces, but also to changes in request size and server efficiency.
We evaluate our dynamic capacity management approach via implementation on a 38-server multi-tier data center, serving a web site of the type seen in Facebook or Amazon, with a key-value store workload. We demonstrate that AutoScale vastly improves upon existing dynamic capacity management policies with respect to meeting SLAs and robustness.
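To make the flavor of such a policy concrete, here is a minimal sketch in Python of a reactive scaling loop. The class name, the fixed spare-capacity fraction, and the scale-down delay are illustrative assumptions, not the published AutoScale algorithm; the delayed scale-down reflects the general idea of keeping servers on briefly to absorb bursts.

    import math
    import time

    class CapacityController:
        """Illustrative reactive scaling loop (a sketch, not the published
        AutoScale algorithm): provision for the current request rate plus
        a spare-capacity margin, and delay scale-down to absorb bursts."""

        def __init__(self, reqs_per_server, spare_fraction=0.2,
                     scale_down_delay=120.0):
            self.reqs_per_server = reqs_per_server    # one server's capacity (req/s)
            self.spare_fraction = spare_fraction      # headroom kept for bursts
            self.scale_down_delay = scale_down_delay  # seconds before removing
            self._surplus_since = None                # when surplus first appeared

        def target(self, request_rate, current_servers, now=None):
            now = time.monotonic() if now is None else now
            needed = math.ceil(request_rate * (1 + self.spare_fraction)
                               / self.reqs_per_server)
            if needed >= current_servers:
                self._surplus_since = None
                return needed                         # scale up immediately
            # Scale down only after demand has stayed low for a while,
            # since large setup costs make premature shutdowns expensive.
            if self._surplus_since is None:
                self._surplus_since = now
            if now - self._surplus_since >= self.scale_down_delay:
                return needed
            return current_servers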
A central question in designing server farms today is how to efficiently provision the number of servers to extract the best performance under unpredictable demand patterns while not wasting energy. While one would like to turn servers off when they become idle to save energy, the large setup cost (both in terms of setup time and energy penalty) needed to switch the server back on can adversely affect performance. The problem is made more complex by the fact that today's servers provide multiple sleep or standby states which trade off the setup cost with the power consumed while the server is 'sleeping'. With so many controls, finding the optimal server farm management policy is an almost intractable problem: how many servers should be on at any given time, how many should be off, and how many should be in some sleep state?
In this paper, we employ the popular metric of Energy-Response time Product (ERP) to capture the energy-performance trade-off, and present the first theoretical results on the optimality of server farm management policies. For a stationary demand pattern, we prove that there exists a very small, natural class of policies that always contains the optimal policy for a single server, and conjecture it to contain a near-optimal policy for multi-server systems. For time-varying demand patterns, we propose a simple, traffic-oblivious policy and provide analytical and empirical evidence for its near-optimality.
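For reference, the ERP metric as commonly defined in this line of work is the product of mean power draw and mean response time, so lower is better on both the energy and the performance axis:

    \mathrm{ERP} \;=\; \mathbb{E}[P] \cdot \mathbb{E}[T]

where \mathbb{E}[P] is the time-average power consumed by the server farm and \mathbb{E}[T] is the mean response time experienced by requests.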
The Internet suspend/resume (ISR) model of mobile computing cuts the tight binding between PC state and PC hardware. By layering a virtual machine on distributed storage, ISR lets the VM encapsulate execution and user customization state; distributed storage then transports that state across space and time. This article explores the implications of ISR for an infrastructure-based approach to mobile computing. It reports on experiences with three versions of ISR and describes work in progress toward the OpenISR version.
Many modern high-performance processors prefetch blocks into the on-chip cache. Prefetched blocks can potentially pollute the cache by evicting more useful blocks. In this work, we observe that both accurate and inaccurate prefetches lead to cache pollution, and propose a comprehensive mechanism to mitigate prefetcher-caused cache pollution. First, we observe that over 95% of useful prefetches in a wide variety of applications are not reused after the first demand hit (in secondary caches). Based on this observation, our first mechanism simply demotes a prefetched block to the lowest priority on a demand hit. Second, to address pollution caused by inaccurate prefetches, we propose a self-tuning prefetch accuracy predictor to predict if a prefetch is accurate or inaccurate. Only predicted-accurate prefetches are inserted into the cache with a high priority. Evaluations show that our final mechanism, which combines these two ideas, significantly improves performance compared to both the baseline LRU policy and two state-of-the-art approaches to mitigating prefetcher-caused cache pollution (up to 49%, and 6% on average for 157 two-core multiprogrammed workloads). The performance improvement is consistent across a wide variety of system configurations.
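The two mechanisms can be summarized in a few lines of Python. The class, the eight-position recency stack, and the predictor interface below are illustrative assumptions rather than the paper's exact hardware design; eviction (not shown) would pick a lowest-priority block.

    LOW, HIGH = 0, 7  # priority positions in an 8-way recency stack

    class PollutionAwareCache:
        """Sketch of the two ideas: demote prefetched blocks on their first
        demand hit, and insert predicted-inaccurate prefetches at low
        priority."""

        def __init__(self, accuracy_predictor):
            self.predict_accurate = accuracy_predictor  # pc -> True/False
            self.blocks = {}  # addr -> {"priority": int, "prefetched": bool}

        def insert_prefetch(self, addr, pc):
            # Idea 2: only predicted-accurate prefetches get high priority.
            prio = HIGH if self.predict_accurate(pc) else LOW
            self.blocks[addr] = {"priority": prio, "prefetched": True}

        def demand_hit(self, addr):
            blk = self.blocks[addr]
            if blk["prefetched"]:
                # Idea 1: most useful prefetches are dead after the first
                # demand hit, so demote instead of promoting to MRU.
                blk["priority"] = LOW
                blk["prefetched"] = False
            else:
                blk["priority"] = HIGH  # normal promotion on a demand hit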
Open Cirrus: A Global Cloud Computing Testbed. Avetisyan, Arutyun I.; Campbell, Roy; Gupta, Indranil; et al.
Computer (Long Beach, Calif.), 04/2010, Volume 43, Issue 4. Journal article, peer-reviewed.
Open Cirrus is a cloud computing testbed that, unlike existing alternatives, federates distributed data centers. It aims to spur innovation in systems and applications research and catalyze development of an open source service stack for the cloud.
We introduce a set of new Compression-Aware Management Policies (CAMP) for on-chip caches that employ data compression. Our management policies are based on two key ideas. First, we show that it is possible to build a more efficient management policy for compressed caches if the compressed block size is directly used in calculating the value (importance) of a block to the cache. This leads to Minimal-Value Eviction (MVE), a policy that evicts the cache blocks with the least value, based on both the size and the expected future reuse. Second, we show that, in some cases, compressed block size can be used as an efficient indicator of the future reuse of a cache block. We use this idea to build a new insertion policy called Size-based Insertion Policy (SIP) that dynamically prioritizes cache blocks using their compressed size as an indicator. We compare CAMP (and its global variant G-CAMP) to prior on-chip cache management policies (both size-oblivious and size-aware) and find that our mechanisms are more effective in using compressed block size as an extra dimension in cache management decisions. Our results show that the proposed management policies (i) decrease off-chip bandwidth consumption (by 8.7% in single-core), (ii) decrease memory subsystem energy consumption (by 7.2% in single-core) for memory intensive workloads compared to the best prior mechanism, and (iii) improve performance (by 4.9%/9.0%/10.2% on average in single-/two-/four-core workload evaluations and up to 20.1%). CAMP is effective for a variety of compression algorithms and different cache designs with local and global replacement strategies.
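A toy model of the Minimal-Value Eviction idea might look as follows. The value formula (expected reuse per compressed byte) and the function names are illustrative assumptions, not CAMP's exact hardware mechanism.

    # Sketch: value a block by expected reuse per byte of compressed cache
    # space, and evict the block with the lowest value.

    def block_value(expected_reuse, compressed_size_bytes):
        # Larger blocks cost more cache space, so their reuse must justify it.
        return expected_reuse / compressed_size_bytes

    def pick_victim(blocks):
        """blocks: iterable of (addr, expected_reuse, compressed_size_bytes)."""
        return min(blocks, key=lambda b: block_value(b[1], b[2]))[0]

    # Example: an 8-byte block with modest reuse outranks a 64-byte block
    # with slightly higher reuse.
    victim = pick_victim([("A", 3, 64), ("B", 2, 8)])
    assert victim == "A"  # 3/64 < 2/8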
NVOverlay. Wang, Ziqi; Choo, Chul-Hwan; Kozuch, Michael A.; et al.
2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 06/2021. Conference proceeding.
The ability to capture frequent (per millisecond) persistent snapshots to NVM would enable a number of compelling use cases. Unfortunately, existing NVM snapshotting techniques suffer from a combination of persistence barrier stalls, write amplification to NVM, and/or lack of scalability beyond a single socket. In this paper, we present NVOverlay, which is a scalable and efficient technique for capturing frequent persistent snapshots to NVM such that they can be randomly accessed later. NVOverlay uses Coherent Snapshot Tracking to efficiently track changes to memory (since the previous snapshot) across multi-socket parallel systems, and it uses Multi-snapshot NVM Mapping to store these snapshots to NVM while avoiding excessive write amplification. Our experiments demonstrate that NVOverlay successfully hides the overhead of capturing these snapshots while reducing write amplification by 29%-47% compared with state-of-the-art logging-based snapshotting techniques.
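As a rough software analogue of the tracking half of this design (hypothetical names; the actual Coherent Snapshot Tracking operates in the cache-coherence layer, not in software), one can record the set of cache lines modified since the previous snapshot and persist only those lines, bounding write amplification to the modified working set.

    LINE = 64  # cache line size in bytes

    class SnapshotTracker:
        """Simplified epoch-based change tracking for frequent snapshots."""

        def __init__(self):
            self.epoch = 0
            self.dirty = set()  # line-aligned addresses modified this epoch

        def on_write(self, addr):
            self.dirty.add(addr & ~(LINE - 1))

        def take_snapshot(self, persist):
            # Persist only lines written since the last snapshot; `persist`
            # is a callable taking (epoch, line_address).
            for line in sorted(self.dirty):
                persist(self.epoch, line)
            self.dirty.clear()
            self.epoch += 1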
Dataflow analysis-based dynamic parallel monitoring (DADPM) is a recent approach for identifying bugs in parallel software as it executes, based on the key insight of explicitly modeling a sliding window of uncertainty across parallel threads. While this makes the approach practical and scalable, it also introduces the possibility of false positives in the analysis. In this paper, we improve upon the DADPM framework through two observations. First, by explicitly tracking new "uncertain" states in the metadata lattice, we can distinguish potential false positives from true positives. Second, as the analysis tool runs dynamically, it can use the existence (or absence) of observed uncertain states to adjust the tradeoff between precision and performance on-the-fly. For example, we demonstrate how the epoch size parameter can be adjusted dynamically in response to uncertainty in order to achieve better performance and precision than when the tool is statically configured. This paper shows how to adapt a canonical dataflow analysis problem (reaching definitions) and a popular security monitoring tool (TAINTCHECK) to our new uncertainty-tracking framework, and provides new provable guarantees that reported true errors are now precise.
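The on-the-fly tradeoff can be sketched as a simple feedback rule. The halving/doubling policy and the bounds below are illustrative assumptions, not the paper's exact controller.

    def adjust_epoch(epoch_size, uncertain_events,
                     min_size=1_000, max_size=1_000_000):
        """Shrink the uncertainty window when uncertain states are observed
        (more precision, more overhead); grow it when there are none
        (cheaper monitoring)."""
        if uncertain_events > 0:
            return max(min_size, epoch_size // 2)
        return min(max_size, epoch_size * 2)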
Many recent works propose mechanisms demonstrating the potential advantages of managing memory at a fine (e.g., cache line) granularity, such as fine-grained deduplication and fine-grained memory protection. Unfortunately, existing virtual memory systems track memory at a larger granularity (e.g., 4 KB pages), inhibiting efficient implementation of such techniques. Simply reducing the page size results in an unacceptable increase in page table overhead and TLB pressure.
We propose a new virtual memory framework that enables efficient implementation of a variety of fine-grained memory management techniques. In our framework, each virtual page can be mapped to a structure called a page overlay, in addition to a regular physical page. An overlay contains a subset of cache lines from the virtual page. Cache lines that are present in the overlay are accessed from there and all other cache lines are accessed from the regular physical page. Our page-overlay framework enables cache-line-granularity memory management without significantly altering the existing virtual memory framework or introducing high overheads.
We show that our framework can enable simple and efficient implementations of seven memory management techniques, each of which has a wide variety of applications. We quantitatively evaluate the potential benefits of two of these techniques: overlay-on-write and sparse-data-structure computation. Our evaluations show that overlay-on-write, when applied to fork, can improve performance by 15% and reduce memory capacity requirements by 53% on average compared to traditional copy-on-write. For sparse data computation, our framework can outperform a state-of-the-art software-based sparse representation on a number of real-world sparse matrices. Our framework is general, powerful, and effective in enabling fine-grained memory management at low cost.
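A minimal software model of the overlay lookup path described above (illustrative data structures, not the paper's hardware design) might look like this:

    PAGE, LINE = 4096, 64  # page and cache line sizes in bytes

    class OverlayPage:
        """Each virtual page may have an overlay holding a subset of its
        cache lines; lines present in the overlay are served from it, all
        others from the regular physical page."""

        def __init__(self, physical_page):
            self.physical = physical_page  # bytearray(PAGE), possibly shared
            self.overlay = {}              # line index -> bytearray(LINE)

        def read_line(self, vaddr):
            idx = (vaddr % PAGE) // LINE
            if idx in self.overlay:              # line present in overlay
                return self.overlay[idx]
            off = idx * LINE
            return self.physical[off:off + LINE]  # fall back to regular page

        def write_line(self, vaddr, data):
            # Overlay-on-write: the first write to a line creates its overlay
            # copy, leaving the shared physical page untouched.
            idx = (vaddr % PAGE) // LINE
            self.overlay[idx] = bytearray(data)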