Paul Brett's Published Papers

QuickIA: Exploring Heterogeneous Architectures on Real Prototypes

Over the last decade, homogeneous multi-core processors emerged and became the de facto approach for offering high parallelism, high performance and scalability for a wide range of platforms. We are now at an interesting juncture where several critical factors (smaller form factor devices, power challenges, need for specialization, etc.) are guiding architects to consider heterogeneous chips and platforms for the next decade and beyond. Exploring heterogeneous architectures is challenging since it involves re-evaluating architecture options, OS implications and application development. In this paper, we describe these research challenges and then introduce a heterogeneous prototype platform called QuickIA that enables rapid exploration of heterogeneous architectures employing multiple generations of Intel processors for evaluating the implications of asymmetry and FPGAs to experiment with specialized processors or accelerators. We also show example case studies using the QuickIA research prototype to highlight its value in conducting heterogeneous architecture, OS and applications research.

Nagabhushan Chitlur, Ganapati Srinivasa, Scott Hahn, P K Gupta, Dheeraj Reddy, David Koufaty, Paul Brett, Abirami Prabhakaran, Li Zhao, Nelson Ijih, Suchit Subhaschandra, Sabina Grover, Xiaowei Jiang, Ravi Iyer
18th International Symposium on High-Performance Computer Architecture

Access: Smart Scheduling for Asymmetric Cache CMPs

In current chip-multiprocessors (CMPs), a significant portion of the die is consumed by the last-level cache. Until recently, the balance of cache and core space has been primarily guided by the needs of single applications. However, as multiple applications or virtual machines (VMs) are consolidated on such a platform, researchers have observed that not all VMs or applications require a significant amount of cache space. In order to take advantage of this phenomenon, we explore the use of asymmetric last-level caches in a CMP platform. While asymmetric cache CMPs provide the benefit of reduced power and area, it is important to build in hardware/software support to appropriately schedule applications onto cores with suitable cache capacity. In this paper, we address this problem with our ACCESS architecture comprising: (a) asymmetric caches across a group of cores, (b) hardware support that enables prediction of cache performance on the different sized caches and (c) OS scheduler support to make use of the prediction capability and appropriately schedule applications onto cores with suitable cache capacity. Measurements on a working prototype using SPEC2006 benchmarks show that our ACCESS architecture can effectively schedule jobs in an asymmetric cache CMP and provide 23% performance improvement compared to a naive scheduler, and is 97% close to an oracle scheduler in making schedules.

Xiaowei Jiang, Asit Mishra, Li Zhao, Ravishankar Iyer, Zhen Fang, Sadagopan Srinivasan, Srihari Makineni, Paul Brett, Chita R. Das
17th International Symposium on High-Performance Computer Architecture [PDF]
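
The scheduling idea in the ACCESS abstract above (use a prediction of cache performance to place applications on cores with suitable cache capacity) can be illustrated with a minimal sketch. This is not the paper's algorithm: the greedy policy, the `App` fields, the miss-rate figures, and the core names are all hypothetical, chosen only to show how a predictor's output might drive placement.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    miss_rate_small: float  # hypothetical predicted miss rate on a small-cache core
    miss_rate_large: float  # hypothetical predicted miss rate on a large-cache core

    @property
    def benefit(self) -> float:
        # How much the predicted miss rate drops when given the larger cache.
        return self.miss_rate_small - self.miss_rate_large


def schedule(apps, large_cores, small_cores):
    """Greedy placement (an assumption, not the ACCESS policy):
    apps with the largest predicted benefit get large-cache cores first."""
    if len(apps) > len(large_cores) + len(small_cores):
        raise ValueError("more applications than cores")
    free_large = list(large_cores)
    free_small = list(small_cores)
    placement = {}
    for app in sorted(apps, key=lambda a: a.benefit, reverse=True):
        pool = free_large if free_large else free_small
        placement[app.name] = pool.pop(0)
    return placement


if __name__ == "__main__":
    apps = [
        App("a", 0.30, 0.05),
        App("b", 0.10, 0.09),
        App("c", 0.20, 0.10),
    ]
    print(schedule(apps, ["L0"], ["S0", "S1"]))
```

In this sketch the cache-sensitive application `a` receives the single large-cache core, while the cache-insensitive application `b` is placed last on a small-cache core.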

Bridging functional heterogeneity in multicore architectures

Heterogeneous processors that mix big high performance cores with small low power cores promise excellent single-threaded performance coupled with high multi-threaded throughput and higher performance-per-watt. A significant portion of the commercial multicore heterogeneous processors are likely to have a common instruction set architecture (ISA). However, due to limited design resources and goals, each core is likely to contain ISA extensions not yet implemented in the other core. Therefore, such heterogeneous processors will have inherent functional asymmetry at the ISA level and face significant software challenges. This paper analyzes the software challenges to the operating system and the application layer software on a heterogeneous system with functional asymmetry, where the ISA of the small and big cores overlaps. We look at the widely deployed Intel Architecture and propose solutions to the software challenges that arise when a heterogeneous processor is designed around it. We broadly categorize functional asymmetries into those that can be exposed to application software and those that should be handled by system software. While one can argue that new software written should be heterogeneity-aware, it is important that we find ways in which legacy software can extract the best performance from heterogeneous multicore systems.

Dheeraj Reddy, David Koufaty, Paul Brett, Scott Hahn
ACM SIGOPS Operating Systems Review [PDF]

Hardware Support for Cross-Layer PMU Arbitration

Intel processors offer PerfMon, a set of hardware events and counters that may be programmed in a number of ways for a variety of uses. Although these counters have traditionally been used for application optimization, we are seeing novel nascent uses throughout the software stack: in operating systems, virtualization hypervisors, and even BIOS firmware. Conflict for these counters has already been observed, and is likely to worsen. We posit the need for hardware features to allow "reservation" of and exclusive access to hardware counters, and describe a prototype system to solve the problem.

Rob Knauerhase, Paul Brett, Peggy Irelan
FHMP2010 [PDF]

The 48-core SCC processor: the programmer's view

The number of cores integrated onto a single die is expected to climb steadily in the foreseeable future. This move to many-core chips is driven by a need to optimize performance per watt. How best to connect these cores and how to program the resulting many-core processor, however, is an open research question. Designs vary from GPUs to cache-coherent shared memory multiprocessors to pure distributed memory chips. The 48-core SCC processor reported in this paper is an intermediate case, sharing traits of message passing and shared memory architectures. The hardware has been described elsewhere. In this paper, we describe the programmer's view of this chip. In particular we describe RCCE: the native message passing model created for the SCC processor.

Timothy G. Mattson, Rob F. Van der Wijngaart, Michael Riepen, Thomas Lehnig, Paul Brett, Werner Haas, Patrick Kennedy, Jason Howard, Sriram Vangal, Nitin Borkar, Greg Ruhl, Saurabh Dighe
SC10 [PDF]

Operating System Support for Overlapping-ISA Heterogeneous Multi-core Architectures

A heterogeneous processor consists of cores that are asymmetric in performance and functionality. Such a design provides a cost-effective solution for processor manufacturers to continuously improve both single-thread performance and multi-thread throughput. This design, however, faces significant challenges in the operating system, which traditionally assumes only homogeneous hardware. This paper presents a comprehensive study of OS support for heterogeneous architectures in which cores have asymmetric performance and overlapping, but non-identical instruction sets. Our algorithms allow applications to transparently execute and fairly share different types of cores. We have implemented these algorithms in the Linux 2.6.24 kernel and evaluated them on an actual heterogeneous platform. Evaluation results demonstrate that our designs efficiently manage heterogeneous hardware and enable significant performance improvements for a range of applications.

Tong Li, Paul Brett, Rob Knauerhase, David Koufaty, Dheeraj Reddy and Scott Hahn
HPCA 2010 [PDF]

Operating System Support for Shared-ISA Asymmetric Multi-core Architectures

Current trends in multi-core processor implementation scale by duplicating a single core design many times in a package; however, this approach can cause inefficient utilization of resources, such as die space and power. Recent research has proposed asymmetric cores as an alternative solution. This paper explores the design space for asymmetric multi-core architectures, and presents a case study and prototype of one design in which cores implement overlapping, but nonidentical instruction sets.

We propose fault-and-migrate, which enables the OS to manage hardware asymmetries transparently to applications. Our mechanism traps the fault when a core executes an unsupported instruction, migrates the faulting thread to a core that supports the instruction, and allows the OS to migrate it back when load balancing is necessary. We have also developed three approaches to emulate future asymmetric processors using current hardware. Preliminary evaluation shows that fault-and-migrate enables applications to execute transparently and incurs less than 4% overhead for a SPEC CPU2006 benchmark.

Tong Li, Paul Brett, Barbara Hohlt, Rob Knauerhase, Sean D. McElderry, and Scott Hahn
WIOSCA 2008 [PDF]
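
The fault-and-migrate mechanism described in the abstract above (trap the fault on an unsupported instruction, then move the thread to a core that supports it) can be sketched as a small user-space simulation. This is an illustrative toy, not the kernel implementation: `Core`, `run_thread`, and the instruction names are hypothetical, and the migrate-back step for load balancing is omitted.

```python
class UnsupportedInstruction(Exception):
    """Stands in for the hardware fault raised on an unsupported instruction."""


class Core:
    def __init__(self, name, supported):
        self.name = name
        self.supported = frozenset(supported)

    def execute(self, instr):
        if instr not in self.supported:
            raise UnsupportedInstruction(instr)


def run_thread(instructions, cores, start):
    """Run a thread's instruction stream, migrating on each fault.

    Returns a trace of (instruction, core name) pairs showing where each
    instruction finally executed.
    """
    current = start
    trace = []
    for instr in instructions:
        try:
            current.execute(instr)
        except UnsupportedInstruction:
            # Trap: pick the first core that supports the instruction.
            target = next((c for c in cores if instr in c.supported), None)
            if target is None:
                raise  # no core can run it; propagate the fault
            current = target
            current.execute(instr)  # re-execute on the new core
        trace.append((instr, current.name))
    return trace


if __name__ == "__main__":
    small = Core("small", {"add", "load"})
    big = Core("big", {"add", "load", "simd"})
    print(run_thread(["add", "simd", "add"], [small, big], small))
```

In this toy the thread stays on the larger core after the fault; the abstract notes that the OS can migrate it back when load balancing requires it.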

Using OS Observations to Improve Performance in Multicore Systems

Today's operating systems don't adequately handle the complexities of multicore processors. Architectural features confound existing OS techniques for task scheduling, load balancing, and power management. This article shows that the OS can use data obtained from dynamic runtime observation of task behavior to ameliorate performance variability and more effectively exploit multicore processor resources. The authors' research prototypes demonstrate the utility of observation-based policy.

Rob Knauerhase, Paul Brett, Barbara Hohlt, Tong Li, Scott Hahn
IEEE Micro May/June 2008 [PDF]

An Analysis of Performance Interference Effects in Virtual Environments

Virtualization is an essential technology in modern datacenters. Despite advantages such as security isolation, fault isolation, and environment isolation, current virtualization techniques do not provide effective performance isolation between virtual machines (VMs). Specifically, hidden contention for physical resources impacts performance differently in different workload configurations, causing significant variance in observed system throughput. To this end, characterizing workloads that generate performance interference is important in order to maximize overall utility.

In this paper, we study the effects of performance interference by looking at system-level workload characteristics. In a physical host, we allocate two VMs, each of which runs a sample application chosen from a wide range of benchmark and real-world workloads. For each combination, we collect performance metrics and runtime characteristics using an instrumented Xen hypervisor. Through subsequent analysis of collected data, we identify clusters of applications that generate certain types of performance interference. Furthermore, we develop mathematical models to predict the performance of a new application from its workload characteristics. Our evaluation shows our techniques were able to predict performance with an average error of approximately 5%.

Younggyun Koh, Rob C. Knauerhase, Paul Brett, Mic Bowman, Zhihua Wen, Calton Pu
ISPASS 2007 [PDF]

Virtualization In The Enterprise

We present how an enterprise IT organization sees virtualization in the enterprise and how it can be applied. We look at key enterprise services and applications used within Intel's IT department and examine the issues associated with virtualizing servers within the context of those services. We demonstrate that virtual machine (VM) isolation does not extend to performance isolation as we show how applications running in separate VMs can significantly interfere with each other. Enterprise services depend on host characteristics like available cycles, platform configurations, and on proximity to other services. We define a taxonomy of these dependencies derived from our study. Next, we describe uses of Intel virtualization technology (Intel VT) that we are investigating. The ability to run multiple operating systems (OS's) is of great interest in our design environment where highly specialized tools are tied closely to OS versions. The ability to checkpoint, suspend, resume, and migrate VMs is very useful when we run long simulations. The ability to allocate VMs at the location of choice opens up other possible use cases, such as network monitoring, security monitoring, and content distribution. We see this capability also enabling safe yet realistic experimentation, as a way to extend virtualization into clients. Finally, we present a real case study applying virtualization to enterprise IT problems. This virtualization program achieved higher server utilization, made it easier to manage datacenter assets, and reduced the consumption of datacenter resources (floor space, power, etc.), as well as simplified server releases through standardization.

Jeff Sedayao, Cheng-Chee Koh, Mic Bowman, Robert Knauerhase, Sanjay Rungta, John Vicente, Julia Palmer, Patrick Fabian, Paul Brett, Justin Richardson
Intel Technical Journal August 2006 [PDF]

Monitoring Internet Connectivity using PlanetLab

This paper explores one company's use of PlanetLab for a real application. Intel Corporation is a global enterprise with many Internet "DMZs" and thousands of customers around the world who use them. Intel needs to monitor the quality of service received through these Internet connections from many parts of the world. Doing this with available commercial services or by implementing monitoring systems in rented data center space across the globe would be expensive as well as being relatively inflexible. PlanetLab presents a relatively inexpensive and flexible platform for global scale monitoring but poses significant challenges in developing, deploying, and managing such a widely distributed application in an environment where node availability and connectivity can change rapidly. We implemented the global DMZ monitor using PlanetLab nodes and the Distributed Service Management Toolkit (DSMT). DSMT provides a way to distribute code for an application and manage it despite node outages, moving the application to geographically appropriate nodes when nodes become unavailable. We position graphs to allow us to correlate data to either geographically local events or Internet-wide events. Connectivity events are propagated using the PSEPR eventing system. Our experience with this implementation has shown that it can detect Internet connectivity problems. Future work includes using different protocols such as HTTP for monitoring and extending DSMT services to monitor other conditions.

Sanjay Rungta, Alex Rentzis, Jeff Sedayao, Robert Adams, Paul Brett
NOMS 2006 [PDF]

Scalable Management

Modern computing environments, such as enterprise data centers, Grids, and PlanetLab, introduce distributed services to address scalability, locality, and reliability. Web Services (WS), in particular, improve decoupling, decentralization, and autonomicity within distributed systems. Unfortunately, scale and decentralization introduce additional problems in distributed services management, such as deployment, monitoring, and lifecycle maintenance.

In this paper, we propose a new approach to management of large scale distributed services, based on three artifacts: scalable publish-subscribe eventing, scalable WS-based deployment, and model-based management. We demonstrate that these techniques improve the manageability of services. In this way we enable service developers to focus on the development of service functionality rather than on management features.

Robert Adams, Paul Brett, Subu Iyer, Dejan S. Milojicic, Sandro Rafaeli, Vanish Talwar
ICAC 2005 [PDF]

A Shared Global Event Propagation And Storage System To Enable Next-Generation Distributed Services

The construction of highly reliable planetary-scale distributed services in the unreliable Internet environment entails significant challenges. Our research focuses on the use of loose binding among service components as a means to deploy distributed services at scale. An event-based publish/subscribe messaging infrastructure is the principal means through which we implement loose binding. A unique property of the messaging infrastructure is that it is built on a collection of off-the-shelf instant messaging servers running on PlanetLab. Using this infrastructure we have successfully constructed long-running services (such as a PlanetLab node status service) with more than 2000 components.

Paul Brett, Rob Knauerhase, Mic Bowman, Robert Adams, Aroon Nataraj, Jeff Sedayao, Michael Spindel
Worlds 2004 [PDF]
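
The loose binding described in the abstract above, where service components interact only through published events, can be illustrated with a minimal in-process publish/subscribe sketch. This is only an analogy: the actual system is built on instant messaging servers running on PlanetLab, whereas the `EventBus` class, the channel name, and the event fields below are hypothetical and involve no networking.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process publish/subscribe hub (hypothetical, not the real system)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        # Subscribers register interest in a channel, not in a publisher.
        self._subscribers[channel].append(callback)

    def publish(self, channel, event):
        # Publishers do not know who, if anyone, is listening.
        delivered = 0
        for callback in list(self._subscribers.get(channel, [])):
            callback(event)
            delivered += 1
        return delivered


if __name__ == "__main__":
    bus = EventBus()
    received = []
    bus.subscribe("node-status", received.append)
    bus.publish("node-status", {"node": "n1", "up": True})
    print(received)
```

Because publisher and subscriber share only a channel name, either side can be added, removed, or restarted without changing the other, which is the property the abstract relies on for deploying services at scale.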

Securing the PlanetLab Distributed Testbed: How to Manage Security in an Environment with No Firewalls, with All Users Having Root, and No Direct Physical Control of Any System

PlanetLab is a globally distributed network of hosts designed to support the deployment and evaluation of planetary scale applications. Support for planetary applications development poses several security challenges to the team maintaining PlanetLab. The planetary nature of PlanetLab mandates nodes distributed across the globe, far from the physical control of the team. The application development requirements force every user to have access to the equivalent of root on each machine, and use of firewalls is discouraged. If an account is compromised, PlanetLab administrators need a way to track the actions of users on the nodes. If an entire node is compromised, then the administrators need a way to regain control despite the lack of physical access. Encryption was built into PlanetLab to ensure confidentiality and integrity of system downloads. A special reset packet, combined with keeping a boot CD in the machine, enables PlanetLab system administrators to remotely regain control of machines if they are compromised and return the nodes to a known safe state. The Linux VServer implementation is used to provide root access to PlanetLab users for development purposes while isolating users from each other. A network abstraction layer provides accounting of traffic and allows safe access to raw sockets. These mechanisms have proven very useful in managing PlanetLab. After a compromise of large numbers of PlanetLab hosts, control of the PlanetLab network was regained in 10 minutes. The compromise spawned a review of PlanetLab security, which pointed out a number of flaws. The need for a central site for maintaining PlanetLab was cited as a key weakness. Future work includes distributing the functions of PlanetLab's central administrative database and improving integrity checks.

Paul Brett, Mic Bowman, Jeff Sedayao, Robert Adams, Rob C. Knauerhase, Aaron Klingaman
LISA 2004 [PDF]