Optimistic scheduling with service guarantees
Kambatla, Karthik; Yarlagadda, Vamsee; Goiri, Íñigo; et al.
Journal of Parallel and Distributed Computing, Volume 135, January 2020
Journal Article, peer reviewed
Data centers form the core of current information and commercial enterprise. At current scale, any improvement in data center resource utilization leads to substantial savings. We focus on the problem of scheduling jobs in distributed execution environments to improve resource utilization. Cluster schedulers like YARN and Mesos base their scheduling decisions on resource requirements provided by end users. It is hard for end users to predict the exact amount of resources required for a task or job, especially since resource utilization can vary significantly over time and across tasks. In practice, users make highly conservative estimates of peak utilization across all tasks of a job to ensure job completion, leading to resource fragmentation and severe under-utilization in production clusters. We present UBIS, a utilization-aware approach to cluster scheduling, to address resource fragmentation and to improve cluster utilization and job throughput. UBIS considers the actual usage of running tasks and schedules opportunistic work on under-utilized nodes. UBIS monitors resource usage on these nodes and preempts opportunistic containers in the event this over-subscription becomes untenable. In doing so, UBIS effectively utilizes wasted resources while minimizing adverse effects on regularly scheduled tasks. Our implementation of UBIS on YARN demonstrates improvements of up to 30% in makespan for a representative workload and 25% in individual job durations.
•The paper presents a scheduling technique, called UBIS, for cloud environments with mixed jobs: some with guaranteed service-level agreements (SLAs) and others with flexible requirements.
•UBIS utilizes slack in the runtime resource utilization of tasks with guaranteed service levels to schedule opportunistic tasks around them.
•These opportunistic tasks are terminated if SLAs are likely to be violated, as inferred from the dynamic state of the nodes.
•By utilizing slack, UBIS significantly increases resource utilization and throughput, while ensuring fairness in highly dynamic environments.
•Comprehensive benchmarking of UBIS in cloud deployments with realistic workloads demonstrates a significant increase in throughput with associated operational cost reduction, low monitoring and scheduling overhead, and excellent scalability.
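The monitor–schedule–preempt loop described in the abstract can be sketched as follows. This is a minimal illustration of utilization-aware opportunistic scheduling, not UBIS's actual implementation; the names `Node`, `schedule_opportunistic`, and `preempt_if_untenable` and the fixed safety headroom are assumptions made for the example.

```python
class Node:
    def __init__(self, capacity):
        self.capacity = capacity        # total resources on the node
        self.used = 0.0                 # measured utilization of running tasks
        self.opportunistic = []         # opportunistic containers on this node

    def slack(self):
        # Users over-estimate; actual usage is often far below allocations.
        return max(0.0, self.capacity - self.used)

def schedule_opportunistic(node, demand, headroom=0.1):
    """Place an opportunistic container if the measured slack covers its
    demand plus a safety headroom; otherwise refuse."""
    if node.slack() >= demand + headroom * node.capacity:
        node.opportunistic.append(demand)
        node.used += demand
        return True
    return False

def preempt_if_untenable(node):
    """Preempt opportunistic containers (newest first) until the node's
    regular tasks fit within its capacity again."""
    preempted = 0
    while node.used > node.capacity and node.opportunistic:
        node.used -= node.opportunistic.pop()
        preempted += 1
    return preempted
```

When regular tasks ramp back up and over-subscription becomes untenable, only the opportunistic containers are sacrificed, which mirrors the paper's goal of minimizing adverse effects on regularly scheduled work.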
In this paper, we propose GreenSlot, a scheduler for parallel batch jobs in a datacenter powered by a photovoltaic solar array and the electrical grid (as a backup). GreenSlot predicts the amount of solar energy that will be available in the near future, and schedules the workload to maximize the green energy consumption while meeting the jobs’ deadlines. If grid energy must be used to avoid deadline violations, the scheduler selects times when it is cheap. Evaluation results show that GreenSlot can increase solar energy consumption by up to 117% and decrease energy cost by up to 39%, compared to conventional schedulers, when scheduling three scientific workloads and a data processing workload. Based on these positive results, we conclude that green datacenters and green-energy-aware scheduling can have a significant role in building a more sustainable IT ecosystem.
Resource provisioning in Cloud providers is a challenge because of the high variability of load over time. On the one hand, a provider can serve most requests owning only a restricted amount of resources, but this forces it to reject customers during peak hours. On the other hand, valley hours lead to under-utilization of the resources, which forces providers to increase their prices to remain profitable. Federation overcomes these limitations and allows providers to dynamically outsource resources to others in response to demand variations. Furthermore, it allows providers with underused resources to rent them to other providers. Both techniques increase the provider's profit when used adequately. Federation of Cloud providers requires a clear understanding of the consequences of each decision. In this paper, we present a characterization of providers operating in a federated Cloud which helps to choose the most convenient decision depending on the environment conditions. These decisions include when to outsource to other providers, rent free resources to other providers (i.e., insourcing), or turn off unused nodes to save power. We characterize these decisions as a function of several parameters and implement a federated provider that uses this characterization to exploit federation. Finally, we evaluate the profitability of these techniques using data from a real provider.
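The outsource / insource / turn-off decision space the paper characterizes can be summarized as a simple profit comparison for the marginal node. The thresholds and the flat cost model below are illustrative assumptions, not the paper's characterization.

```python
def federation_decision(local_demand, local_capacity, revenue_per_node,
                        outsource_cost, insource_price, power_cost):
    """Return the most profitable action for one node in a federated Cloud.
    All prices are per node per unit time (an assumed simplification)."""
    if local_demand > local_capacity:
        # Peak hours: outsource only if a rented remote node still earns money.
        return "outsource" if revenue_per_node > outsource_cost else "reject"
    # Valley hours: a free local node can be rented out (insourcing for the
    # other provider's workload) or powered down to save energy.
    if insource_price > power_cost:
        return "insource"
    return "turn_off"
```

The real characterization is a function of several parameters (demand variability, prices, energy costs); this sketch only shows the shape of the decision.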
Full text
Available for:
CEKLJ, EMUNI, FIS, FZAB, GEOZS, GIS, IJS, IMTLJ, KILJ, KISLJ, MFDPS, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, SBMB, SBNM, UKNU, UL, UM, UPUK, VKSCE, ZAGLJ
On-site renewable energy has the potential to reduce datacenters' carbon footprint and power and energy costs. The authors built Parasol, a solar-powered datacenter, and GreenSwitch, a system for scheduling workloads, to explore this potential in a controlled research setting.
Energy efficiency is a major concern to datacenter operators, because the large amounts of energy used by parallel computing infrastructures increase costs and affect the electricity grid. Datacenter power consumption can be reduced by applying intelligent control techniques to dynamically adjust power demand, but this is hampered by conflicting objectives. For instance, the workload can be controlled to adjust power, but at the expense of service quality. Or, the cooling infrastructure demand can be manipulated without affecting workloads, but at the risk of shifting the datacenter temperature outside the acceptable limits. This paper proposes a multiobjective, evolutionary approach to solving the problem of energy-aware task scheduling in datacenters. Our approach takes into account three problem objectives (power consumption, temperature, and quality of service) when both computing and cooling infrastructures are holistically controlled. We report the two best solutions to each of these problem objectives, as well as the selected trade-off solutions between them.
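At the heart of any multiobjective evolutionary scheduler is the Pareto-dominance test used to rank candidate schedules. A minimal sketch follows, assuming objective vectors of the form (power, temperature, QoS degradation), all minimized; the function names are illustrative, not the paper's.

```python
def dominates(a, b):
    """True if solution a is no worse than b on every objective and strictly
    better on at least one (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(solutions):
    """Keep only the non-dominated objective vectors: the trade-off solutions
    the paper reports are drawn from such a front."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```

An evolutionary algorithm (e.g., NSGA-II-style selection) would evolve schedules toward this front; only the dominance bookkeeping is shown here.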
Cloud providers typically use air-based solutions for cooling servers in datacenters. However, increasing transistor counts and the end of Dennard scaling will result in chips with thermal design power that exceeds the capabilities of air cooling in the near future. Consequently, providers have started to explore liquid cooling solutions (e.g., cold plates, immersion cooling) for the most power-hungry workloads. By keeping the servers cooler, these new solutions enable providers to operate server components beyond the normal frequency range (i.e., overclocking them) all the time. Still, providers must trade off the increase in performance via overclocking against its higher power draw and any component reliability implications.
In this paper, we argue that two-phase immersion cooling (2PIC) is the most promising technology, and build three prototype 2PIC tanks. Given the benefits of 2PIC, we characterize the impact of overclocking on performance, power, and reliability. Moreover, we propose several new scenarios for taking advantage of overclocking in cloud platforms, including oversubscribing servers and virtual machine (VM) auto-scaling. For the auto-scaling scenario, we build a system that leverages overclocking for either hiding the latency of VM creation or postponing VM creation in the hope of not needing it. Using realistic cloud workloads running on a tank prototype, we show that overclocking can improve performance by 20%, increase VM packing density by 20%, and improve tail latency in auto-scaling scenarios by 54%. The combination of 2PIC and overclocking can reduce platform cost by up to 13% compared to air cooling.
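The overclocking-assisted auto-scaling idea can be sketched as one scaling step: when load exceeds current capacity, overclock first to cover the gap (possibly avoiding new VMs entirely), and only create VMs for the remaining deficit. The function name, the fixed +20% boost, and the capacity model are assumptions for illustration, not the paper's system.

```python
import math

def autoscale_step(load, vms, per_vm_capacity, oc_boost=0.20):
    """Return (vms_to_create, overclock) for one scaling decision.
    `oc_boost` is the extra throughput overclocking buys (e.g. +20%,
    matching the abstract's reported performance gain)."""
    base = vms * per_vm_capacity
    if load <= base:
        return 0, False                  # normal operation, no overclocking
    if load <= base * (1 + oc_boost):
        return 0, True                   # overclock instead of scaling out
    # Overclocking also hides VM-creation latency while new VMs boot.
    deficit = load - base * (1 + oc_boost)
    return math.ceil(deficit / per_vm_capacity), True
```

In the middle branch the platform postpones VM creation entirely, in the hope that the load spike subsides before new capacity is actually needed.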
Since virtualization was introduced in data centers, it has opened new opportunities for resource management. Nowadays, it is not just used as a tool for consolidating underused nodes and saving power; it also enables new solutions to well-known challenges, such as heterogeneity management. Virtualization helps to encapsulate Web-based applications or HPC jobs in virtual machines (VMs) and treat them as a single entity which can be managed in an easier and more efficient way.
We propose a new scheduling policy that models and manages a virtualized data center. It focuses on the allocation of VMs in data center nodes according to multiple facets to optimize the provider’s profit. In particular, it considers energy efficiency, virtualization overheads, and SLA violation penalties, and supports the outsourcing to external providers.
The proposed approach is compared to other common scheduling policies, demonstrating that a provider can improve its benefit by 30% and save power while handling other challenges, such as resource outsourcing, in a better and more intuitive way than other typical approaches do.
► Scheduling policy to optimize the allocation of VMs in a virtualized data center.
► Multiple facets modeled in a unified way to maximize the provider’s profit.
► Facets include power efficiency, virtualization overhead, and SLA penalties.
► Supports resource heterogeneity and outsourcing resources to external providers.
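The facets listed above can be combined into a single per-placement profit score, which is the shape of decision such a policy makes. The linear combination and all names below are assumptions for illustration; the paper's actual model is richer.

```python
def placement_profit(revenue, node_power_cost, virt_overhead_cost,
                     sla_violation_prob, sla_penalty, outsource_cost=None):
    """Expected profit of running a VM on a node, netting out power,
    virtualization overhead, and the expected SLA penalty."""
    local = (revenue - node_power_cost - virt_overhead_cost
             - sla_violation_prob * sla_penalty)
    if outsource_cost is not None:
        # Outsource to an external provider when it beats local placement.
        return max(local, revenue - outsource_cost)
    return local

def best_node(vm_revenue, candidates):
    """Pick the candidate node (a dict of cost facets) with maximum profit."""
    return max(candidates, key=lambda c: placement_profit(vm_revenue, **c))
```

A cheap node with a high SLA-violation risk can thus lose to a slightly more expensive but safer node, which is exactly the unified trade-off the highlights describe.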
Success of Cloud computing requires that both customers and providers can be confident that signed Service Level Agreements (SLAs) support their respective business activities to the best extent. Currently used SLAs fail to provide such confidence, especially when providers outsource resources to other providers. These resource providers typically support very simple metrics, like availability, or metrics that hinder an efficient exploitation of their resources.
In this paper, we propose a resource-level metric for specifying fine-grain guarantees on CPU performance. This metric allows resource providers to dynamically allocate their resources among running services depending on their demand. This is accomplished by incorporating the customer’s CPU usage in the metric definition, while avoiding fake SLA violations when the customer’s task does not use all its allocated resources.
We have conducted the evaluation in a virtualized provider where we have implemented the needed infrastructure for using our metric. As demonstrated in our evaluation, our solution presents fewer SLA violations than other CPU-related metrics while maintaining the Quality of Service.
► Resource-level metric for specifying fine-grain guarantees on CPU performance.
► Flexible CPU allocation by incorporating the application’s CPU usage in the metric.
► Metric based on Amazon’s ECUs, making it suitable for heterogeneous environments.
► Use case demonstrating metric usability in a real virtualized provider.
► Only SLA violations that are a real responsibility of the provider are accounted.
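The key usage-aware twist — counting a violation only when the provider under-delivered and the customer actually wanted the CPU — can be sketched in a few lines. This is an illustrative check, assuming comparable CPU units (e.g., ECUs); the name `sla_violated` and the tolerance are not from the paper.

```python
def sla_violated(agreed_cpu, allocated_cpu, used_cpu, tol=1e-9):
    """Violation only when the provider allocated less CPU than agreed AND
    the task consumed everything it was given (i.e., it wanted more)."""
    if allocated_cpu + tol >= agreed_cpu:
        return False            # provider delivered the agreed share
    if used_cpu + tol < allocated_cpu:
        # The task did not even consume the reduced allocation: the shortfall
        # reflects low demand, not the provider ("fake" violation avoided).
        return False
    return True
```

This is what lets a provider safely reclaim unused CPU from idle services without being charged with violations it is not responsible for.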
Popular Internet services are hosted by multiple geographically distributed data centers. The location of the data centers has a direct impact on the services' response times, capital and operational costs, and (indirect) carbon dioxide emissions. Selecting a location involves many important considerations, including its proximity to population centers, power plants, and network backbones, the source of the electricity in the region, the electricity, land, and water prices at the location, and the average temperatures at the location. As there can be many potential locations and many issues to consider for each of them, the selection process can be extremely involved and time-consuming. In this paper, we focus on the selection process and its automation. Specifically, we propose a framework that formalizes the process as a non-linear cost optimization problem, and approaches for solving the problem. Based on the framework, we characterize areas across the United States as potential locations for data centers, and delve deeper into seven interesting locations. Using the framework and our solution approaches, we illustrate the selection trade-offs by quantifying the minimum cost of (1) achieving different response times, availability levels, and consistency times, and (2) restricting services to green energy and chiller-less data centers. Among other interesting results, we demonstrate that the intelligent placement of data centers can save millions of dollars under a variety of conditions. We also demonstrate that the selection process is most efficient and accurate when it uses a novel combination of linear programming and simulated annealing.
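The simulated-annealing half of that combination can be illustrated on a toy version of the problem: search over subsets of candidate locations while a cost function (standing in for the inner linear program) scores each subset. Everything below — the function name, cooling schedule, and cost model — is an assumption for illustration.

```python
import math
import random

def anneal_locations(candidates, cost, k, steps=2000, t0=1.0, seed=0):
    """Pick k locations (indices into `candidates`) minimizing `cost`,
    which scores the list of chosen candidates."""
    rng = random.Random(seed)
    current = rng.sample(range(len(candidates)), k)
    score = lambda sel: cost([candidates[i] for i in sel])
    best, best_cost = list(current), score(current)
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-6      # linear cooling
        neighbor = list(current)
        neighbor[rng.randrange(k)] = rng.randrange(len(candidates))
        if len(set(neighbor)) < k:                 # skip duplicate picks
            continue
        delta = score(neighbor) - score(current)
        # Accept improvements always, worsenings with temperature-dependent
        # probability, so the search can escape local minima.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current = neighbor
            if score(current) < best_cost:
                best, best_cost = list(current), score(current)
    return best, best_cost
```

In the paper's setting the cost of a candidate set is itself the result of a linear program over request routing; here a trivial additive cost suffices to show the search mechanics.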
The reduction of energy consumption in large-scale datacenters is being accomplished through an extensive use of virtualization, which enables the consolidation of multiple workloads in a smaller number of machines. Nevertheless, virtualization also incurs some additional overheads (e.g., virtual machine creation and migration) that can influence which consolidated configuration is best, and thus they must be taken into account. In this paper, we present a dynamic job scheduling policy for power-aware resource allocation in a virtualized datacenter. Our policy tries to consolidate workloads from separate machines into a smaller number of nodes, while fulfilling the amount of hardware resources needed to preserve the quality of service of each job. This allows turning off the spare servers, thus reducing the overall datacenter power consumption. As a novelty, this policy incorporates all the virtualization overheads in the decision process. In addition, our policy is prepared to consider other important parameters for a datacenter, such as reliability or dynamic SLA enforcement, in a synergistic way with power consumption. The proposed policy is evaluated against common policies in a simulated environment that accurately models HPC job execution in a virtualized datacenter, including power consumption modeling, and obtains a power consumption reduction of 15% with respect to typical policies.
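The overhead-aware consolidation step can be sketched as a first-fit packing that only empties a node when the idle power saved exceeds the migration overhead. The cost model, names, and first-fit heuristic below are assumptions for illustration, not the paper's policy.

```python
def consolidate(nodes, capacity, idle_power, migration_cost):
    """nodes: list of per-node VM load lists. Repeatedly try to migrate the
    least-loaded node's VMs onto the others (first fit) and turn it off,
    but only when the saved idle power beats the migration overhead.
    Returns (remaining_nodes, net_power_saved)."""
    nodes = sorted((list(n) for n in nodes), key=sum, reverse=True)
    saved = 0.0
    while len(nodes) > 1:
        victim = nodes[-1]
        if idle_power <= migration_cost * len(victim):
            break                      # migrations would cost more than we save
        plan, loads, ok = [], [sum(n) for n in nodes[:-1]], True
        for vm in sorted(victim, reverse=True):
            for i, load in enumerate(loads):
                if load + vm <= capacity:          # QoS: respect node capacity
                    loads[i] += vm
                    plan.append((vm, i))
                    break
            else:
                ok = False                         # some VM does not fit
                break
        if not ok:
            break
        for vm, i in plan:
            nodes[i].append(vm)
        nodes.pop()                                # spare server turned off
        saved += idle_power - migration_cost * len(plan)
    return nodes, saved
```

Accounting for the migration overhead in `saved` is the abstract's key point: an overhead-oblivious packer would sometimes consolidate at a net power loss.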