Large-Scale Systems

Motivation: The distributed information-technology platform of the future offers amazing opportunities in terms of data gathering, processing and dissemination. Ubiquitous, high-speed networks offer the promise of access to unlimited computation, communication and storage, and potentially could help to address burning problems such as energy shortage, health care and national security. A totally unexpected result of this evolution is a return to large compute/storage/communicate centers, which were once thought to be an endangered species after the advent of the personal computer. Unfortunately, the "infrastructural core" of our distributed IT platform also turns out to be energy-intensive. Data centers are the fastest growing segment of the IT platform in terms of carbon footprint, and the appetite for their services is increasing exponentially with no slow-down in sight. Huge improvements in energy efficiency are essential if we want to maintain this growth while staying economically and environmentally healthy.

Similar concerns apply to the defense community. From the very beginning of computing, large-scale systems have been critical to the nation's security. They are used for a variety of applications, including stewardship of nuclear weapons, intelligence gathering and analysis, the design of weapons and armor, and support for fundamental defense research in fields such as materials science. The demand for computing cycles to support these missions has been insatiable, and has led to a long history of Defense R&D.

In a future where data will come not only from classical computing systems, as it does today, but also from millions of sensors and mobile devices that already permeate significant parts of our lives, the requirement for energy-efficient large-scale data computation will explode, both for defense and commercial use. Moore's Law-driven growth in computing power offers to address this demand. However, the end of Dennard scaling, and hence of the scaling of energy per operation, makes improving the energy efficiency of computation a critical problem. Novel solutions that exploit heterogeneous processors in combination with innovative memory and interconnect architectures are necessary to make inroads. MuSyC proposes to investigate these in collaboration with the GSRC Platform Center. Our contention, however, is that substantial additional benefits can be obtained by eliminating the built-in inefficiency that exists in large-scale systems. For example, even under very light load conditions, the power draw of a typical data center remains at least 50% of its maximum (data centers typically run at approximately 20-40% load).
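To make the cost of this built-in inefficiency concrete, the following minimal sketch works through the figures quoted above. It assumes a simple linear power model in which power rises linearly from an idle floor to the maximum as load increases; that model, the function names, and the default idle fraction of 0.5 are illustrative assumptions, not measurements or methods from this theme.

```python
def power_fraction(utilization, idle_fraction=0.5):
    """Power drawn as a fraction of peak power.

    Assumed linear model (illustrative only): an idle floor of
    `idle_fraction` plus a part that grows linearly with load.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(utilization, idle_fraction=0.5):
    """Useful work per unit of energy, normalized to efficiency at peak load.

    A perfectly energy-proportional system (idle_fraction = 0) scores 1.0
    at every non-zero load.
    """
    if utilization == 0.0:
        return 0.0  # no work is done, so efficiency is zero by convention
    return utilization / power_fraction(utilization, idle_fraction)


if __name__ == "__main__":
    for load in (0.2, 0.3, 0.4):
        print(
            f"load {load:.0%}: power {power_fraction(load):.0%} of peak, "
            f"relative efficiency {relative_efficiency(load):.2f}"
        )
```

Under these assumptions, a system running at 20% load draws 60% of its peak power and therefore delivers only about one third of the work per unit of energy that it delivers at full load; at 40% load the figure is about 0.57.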

Vision: The need for an effective and adaptive power management strategy that dynamically balances supply and demand in light of performance, cost and reliability constraints is a common theme emerging in virtually all large-scale distributed systems envisioned for the coming decades. The MuSyC perspective is that the only viable solution is a holistic distributed "measure-predict-and-control" energy management solution that includes all components of the system at all hierarchy levels. Our research vision is to develop a multi-scale cross-layer distributed and hierarchical management scheme that ensures that energy is only consumed if, when and where needed. This must be accomplished while serving dynamically changing workloads with stringent latency and reliability constraints, and requires exploiting all the knobs available at all levels of the design hierarchy.

While the problems and solutions addressed in this theme are applicable to most large-scale systems (such as green buildings or complex airborne, seaborne or automotive platforms), we have selected the data center of the future as a compelling driver for our efforts. Two truly compelling goals to pursue are "energy-proportional" computing (as first envisioned by L. Barroso) and the ability to "do nothing well", that is, to minimize energy consumption when nothing (or little) is being done.

We therefore envision future large-scale computing systems capable of achieving orders-of-magnitude increases in performance while at the same time reducing the carbon footprint by one or two orders of magnitude. To achieve such an unprecedented improvement in energy efficiency while meeting the demands of future workloads, we have to consider all the major system components (computation, networking, storage, cooling and power distribution) and their interactions at multiple hierarchy levels (hardware, virtualization layer, operating systems, protocols, applications). This will lead to balanced systems that, by adaptively examining performance needs at all levels of scale, run far more efficiently, even in the face of the demanding and diverse workloads of the future. The results of our effort will drive the design and operation of heterogeneous computing systems of the future, from SOC processors to multi-megawatt distributed datacenters.

The overall structure of the theme in terms of global organization, research topics, individual contributions and their interrelationships is detailed in this figure. The team's contributions touch on critical elements of future datacenter systems across multiple levels of scale. These include energy management strategies at the software, system, and infrastructure levels, so that power budgets are dramatically reduced while performance, latency and reliability constraints are still met.

To quantify the opportunities for energy management, we will extend our performance prediction methodology to understand the nature of emerging workloads and applications and their needs. We will contrast these software requirements with emerging technology trends to predict many of the critical barriers to realizing performance and energy efficiency in future large-scale systems, and thus direct our research. We will develop the tools and methodologies needed to analyze the interaction between future workloads and architectures. This will lead to significant improvements in the energy efficiency of the software infrastructure as a whole. To ease the burden on application developers, we will develop energy-efficient techniques within runtimes (.NET and Java) and operating systems (OS) and virtual machines (VMs) capable of effectively utilizing novel and heterogeneous hardware and software architectures. This work will also allow us to explore a three-fold strategy for system energy management. First, workload allocation and scheduling will be performed so as to ensure energy-proportional operation in the active states. Next, duty cycling of the servers will be complemented with proxy processing to ensure availability while delivering large energy savings. We will also develop strategies for trading off energy and resilience, so that the energy requirements of applications and hardware can be scaled within an application's error tolerance. Lastly, the infrastructure management strategies will provide the smart links to sensors and controls needed to efficiently connect the components, store the data, and provide the electricity to power and cool the overall system.
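As a rough illustration of how workload allocation and duty cycling can interact, the sketch below packs a set of workloads onto as few servers as possible so that the remaining servers can be put to sleep. The first-fit-decreasing heuristic, the linear power model, the assumption that a sleeping server draws no power, and all names and numbers are stand-ins chosen for illustration; they are not the allocation or scheduling methods proposed in this theme.

```python
def consolidate(loads, capacity=1.0):
    """Pack loads onto few servers using first-fit decreasing.

    `loads` are fractions of a single server's capacity. Returns a list
    of servers, each a list of the loads assigned to it.
    """
    servers = []
    for load in sorted(loads, reverse=True):
        if load > capacity:
            raise ValueError("a single load exceeds server capacity")
        for assigned in servers:
            # Small tolerance guards against floating-point round-off.
            if sum(assigned) + load <= capacity + 1e-9:
                assigned.append(load)
                break
        else:
            servers.append([load])  # no existing server had room
    return servers


def active_power(servers, idle_fraction=0.5):
    """Total power of the active servers, in units of one server's peak.

    Assumes a linear power model and that servers not listed are asleep
    and draw nothing (an idealization).
    """
    return sum(
        idle_fraction + (1.0 - idle_fraction) * sum(assigned)
        for assigned in servers
    )


if __name__ == "__main__":
    loads = [0.3, 0.2, 0.4, 0.1, 0.25, 0.15]  # hypothetical workloads
    spread = [[load] for load in loads]       # one workload per server
    packed = consolidate(loads)
    print(f"spread: {len(spread)} servers, power {active_power(spread):.2f}")
    print(f"packed: {len(packed)} servers, power {active_power(packed):.2f}")
```

In practice, packing servers this tightly conflicts with latency and availability constraints, which is why the strategy described above pairs duty cycling with proxy processing rather than relying on consolidation alone.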

Measures of Success: The goal of this theme is to design large-scale computing systems capable of achieving orders-of-magnitude increases in performance while at the same time reducing the carbon footprint by one or two orders of magnitude through a holistic multi-scale approach. Progress towards that goal will be measured through prototyping on existing or experimental testbeds (including data center containers at UCSD and USB) combined with simulation and emulation.