Ahmad Faraj, Florida State University

STAR-MPI: Self Tuned Adaptive Routines for MPI Collective Operations

Message Passing Interface (MPI) collective communication routines are widely used in parallel applications. In order for a collective communication routine to achieve high performance for different applications on different platforms, it must be adaptable to both the system architecture and the application workload. Current MPI implementations do not support such software adaptability and are not able to achieve high performance on many platforms. In this paper, we present STAR-MPI (Self Tuned Adaptive Routines for MPI collective operations), a set of MPI collective communication routines that are capable of adapting to system architecture and application workload. For each operation, STAR-MPI maintains a set of communication algorithms that can potentially be efficient at different situations. As an application executes, a STAR-MPI routine applies the Automatic Empirical Optimization of Software (AEOS) technique at run time to dynamically select the best performing algorithm for the application on the platform. We describe the techniques used in STAR-MPI, analyze STAR-MPI overheads, and evaluate the performance of STAR-MPI with applications and benchmarks. The results of our study indicate that STAR-MPI is robust and efficient. It is able to find efficient algorithms with reasonable overheads, and it out-performs traditional MPI implementations to a large degree in many cases.

Rajesh Vivekanandham, Indian Institute of Science

A Scalable Low Power Issue Queue for Large Instruction Window Out of Order Processors

Large instruction windows and issue queues are key to exploiting greater instruction level parallelism in out-of-order superscalar processors. However, the cycle time and energy consumption of conventional monolithic issue queues are high. Previous efforts to reduce cycle time segment the issue queue and pipeline wakeup. Unfortunately, this results in significant IPC loss. Other proposals address energy efficiency by avoiding only the unnecessary tag comparisons.

To address both these issues more efficiently, we propose the Scalable Low power Issue Queue (SLIQ). SLIQ augments a pipelined issue queue with direct indexing to mitigate the problem of delayed wakeups while reducing the cycle time. Also, the SLIQ design naturally leads to significant energy savings by reducing both the number of tag broadcasts and comparisons required.

A 2 segment SLIQ incurs an average IPC loss of 0.2% over the entire SPEC CPU2000 suite, while achieving a 25.2% reduction in issue latency when compared to an ideal monolithic 128-entry issue queue for an 8-wide superscalar processor. An 8 segment SLIQ improves scalability by reducing the issue latency by 38.3% while incurring an IPC loss of only 2.3%. Further, the 8 segment SLIQ reduces the energy consumption and energy-delay product by 48.3% and 67.4% respectively

Greg Bronevetsky, Cornell University

Experimental Evaluation of Application-level Checkpointing for OpenMP Programs

It is becoming important for long-running scientific applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR) - the state of the computation is saved periodically to disk, and when a failure occurs, the computation is restarted from the last saved state. One common way of doing this, called System-level Checkpointing (SLC), requires modifying the Operating System and the communication libraries to permit the saving of the state of the entire parallel application. Unfortunately, this approach has poor portability since a checkpointer for one system rarely works on a different system. The only portable alternative is Application-level Checkpointing (ALC), where the programmer manually modifies their program to enable CPR, a very labor-intensive task.

We are investigating the use of compiler technology to instrument codes to embed the ability to tolerate faults into applications themselves, making them self-checkpointing and self-restarting on any platform. In~cite{bronevetsky+:asplos04} we described a general approach for checkpointing shared memory APIs at the application level. Since~cite{bronevetsky+:asplos04} applied to only a toy feature set common to most shared memory APIs, this paper shows the practicality of this approach by extending it to a specific popular shared memory API: OpenMP. We describe the challenges involved in providing automated ALC for OpenMP applications and experimentally validate this approach by showing detailed performance results for our implementation of this technique. Our experiments with the NAS OpenMP benchmarks~cite{npb-openmp} and the EPCC microbenchmarks~cite{epcc-microbenchmarks} show generally low overhead on three different architectures: Linux/IA64, Tru64/Alpha and Solaris/Sparc and highlight important lessons about the performance characteristics of this approach.

Jonathan Weinberg, University of California, San Diego

User-Guided Symbiotic Space-Sharing of Real Workloads

Symbiotic space-sharing is a technique that improves system throughput by executing parallel applications in combinations and configurations that alleviate pressure on shared resources. We have shown prototype schedulers that leverage such techniques to improve throughput by 20% over conventional space-sharing schedulers when resource bottlenecks are known. Such evaluations have utilized benchmark workloads and proposed that schedulers be informed of resource bottlenecks by users at job submission time; in this work, we investigate the accuracy with which users can actually identify resource bottlenecks in real applications and the implications of these predictions for symbiotically space-sharing production workloads. Using a large HPC platform, a representative application workload, and a sampling of expert users, we show that user inputs are of value and that for our chosen workload, user-guided symbiotic scheduling can improve throughput over conventional space-sharing by 15-22%.

Dean Hildebrand, University of Michigan

Large Files, Small Writes, and pNFS

Workload characterization studies highlight the prevalence of small and sequential data requests in scientific applications. Parallel file systems excel at large data transfers but sometimes at the expense of small I/O performance. pNFS is an NFSv4.1 high-performance enhancement that provides direct storage access to parallel file systems while preserving NFSv4 operating system and hardware platform independence. This paper demonstrates that distributed file systems can increase write throughput to parallel data storesÑregardless of file sizeÑby overcoming parallel file system inefficiencies. We also show how pNFS can improve the overall write performance of parallel file systems by using direct, parallel I/O for large write requests and a distributed file system for small write requests. We describe our pNFS prototype and present experiments demonstrating the performance improvements.

Shinji Sumimoto, Fujitsu Laboratories

Scalable Communication Layer for Multi-Dimensional Hyper Crossbar Network Using Multiple Gigabit Ethernet

This paper proposes the scalable communication layer for a multi-dimensional hyper crossbar network using multiple Gigabit Ethernet for the PACS-CS system which consists of 2560 single-processor nodes and a 16 x 16 x 10 three dimensional hyper-crossbar network (3D-HXB). To realize a high performance communication layer using multiple existing Ethernet networks, the host processor usage for the communication processing must be minimized. To overcome this problem, we have developed the PM/Ethernet-HXB communication facility. PM/Ethernet-HXB realizes communication protocol processing without mutual exclusion even for Zero-copy communication between the communication buffers of nodes. We have implemented the PM/Ethernet-HXB on SCore cluster system software, and evaluated its communication and application performance. PM/Ethernet-HXB achieves a uni-directional communication bandwidth of 1065 MB/s using nine Gigabit Ethernet links, a unidirectional communication bandwidth of 741 MB/s (98.8% of the theoretical performance), a bidirectional bandwidth of 1401 MB/s (93.4% of the theoretical performance) on the 3D-HXB connections (a total of six Ethernet links). The results of MPI communication bandwidth are a unidirectional bandwidth of 960 MB/s and a bidirectional bandwidth of 1008 MB/s using eight links. These results show that PM/Ethernet-HXB realizes a comparative performance using multiple Gigabit Ethernet networks to dedicated cluster networks such as InfiniBand. The speedups of IS and CG Class C NAS parallel benchmarks are scalable up to using four links on eight node cluster, and performance degradation between 3D-HXB (2 x 2 x 2) and 1-dimensional network (8 x 1) are small.

Kyle Rupnow, University of Wisconsin Madison

Scientific Applications vs. SPEC-FP: A Comparison of Program Behavior

Many modern scientific applications execute on massively parallel collections of microprocessors. Supercomputers such as the Cray XT3 (Red Storm) and Blue Gene/L support thousands to tens of thousands of processors per parallel job. However, individual microprocessor performance remains a critical component of overall performance. Traditional approaches to improve scientific application performance concentrate on floating-point (FP) instructions; however, our studies show that in the scientific applications used at Sandia National Labs, integer instructions constitute a large and critical part of the instruction mix. Although the SPEC-FP benchmark suite is considered representative of FP workloads, it has a much smaller proportion of integer computation instructions than the Sandia scientific applications, with 22.9% as compared to 36.9%. Integer instructions in Sandia applications also behave differently than in SPEC-FP. Integer instruction outputs are reused 8.8x to 13.1x more often in SPEC-FP benchmarks, and integer dataflow in Sandia applications is more complex than in the SPEC-FP suite. In this work, we examine common dataflow and usage patterns of integer instructionsÑinformation essential to develop hardware techniques to accelerate critical scientific applications. We present statistics for SPEC-FP and Sandia applications, summarizing integer computation usage and the size, shape and interface (number of inputs/outputs) of dataflow graphs.

Eduardo Qui–ones, UPC

Selective Predicate Prediction for Out-of-Order Processors

If-conversion transforms control dependencies to data dependencies by using a predication mechanism. It is useful to eliminate hard-to-predict branches and to reduce the severe performance impact of branch mispredictions. However, the use of predicated execution in out-of-order processors has to deal with two problems: there can be multiple definitions for a single destination register at rename time, and instructions with a false predicated consume unnecessary resources. Predicting predicates is an effective approach to address both problems. However, predicting predicates that come from hard-to-predict branches is not beneficial in general, because this approach reverses the if-conversion transformation, loosing its potential benefits. In this paper we propose a new scheme that dynamically selects which predicates are worthy to be predicted, and which one are more effective in its if-converted form. We show that our approach significantly outperforms previous proposed schemes. Moreover it performs within 5% of an ideal scheme with perfect predicate prediction.

Hakan Zeffer, Uppsala University

TMA: A Trap-Based Memory Architecture

The advances in semiconductor technology have set the shared-memory server trend towards processors with multiple cores per die and multiple threads per core. We believe that this technology shift forces a re-evaluation of how to interconnect multiple such chips to form larger systems.

This paper argues that by adding support for coherence traps in future chip multiprocessors, large-scale server systems can be formed at a much lower cost. This is due to shorter design time, verification and time to market when compared to its traditional all-hardware counter part. In the proposed trap-based memory architecture (TMA), software trap handlers are responsible for obtaining read/write permission, whereas the coherence trap hardware is responsible for the actual permission check.

In this paper we evaluate a TMA implementation (called TMA Lite) with a minimal amount of hardware extensions, all contained within the processor. The proposed mechanisms for coherence trap processing should not affect the critical path and have a negligible cost in terms of area and power for most processor designs.

Our evaluation is based on detailed full system simulation using out-of-order processors with one or two dual-threaded cores per die as processing nodes. The results show that a TMA based distributed shared memory system can on average perform within 1 percent of a highly optimized hardware based design.

Jaume Abella, Intel & UPC

Heterogeneous Way-Size Cache

Set-associative cache architectures are commonly used. These caches consist of a number of ways, each of the same size. We have observed that the different ways have very different utilization, which motivates the design of caches with heterogeneous way sizes. This can potentially result in higher performance for the same area, better capabilities to implement dynamically adaptive schemes, and more flexibility for choosing the size of the cache. This paper proposes a novel cache architecture, Heterogeneous Way-Size cache (HWS cache), in which the different cache ways may have different sizes. HWS caches are shown to outperform conventional L1 and L2 caches. For instance, a HWS cache can achieve up to 20% dynamic and leakage energy savings with respect to its conventional cache counterpart, while the hit ratio is practically the same. We also present a Dynamically Adaptive version of the HWS cache (DAHWS cache). DAHWS caches are shown to be more adaptive than conventional architectures. DAHWS caches achieve higher energy savings and lower miss rates than conventional caches due to their higher flexibility. For instance, DAHWS caches reduce the active ratio by 66%, 55% and 41% for L1 instruction, L1 data and L2 caches respectively.

Frank Mueller, North Carolina State University

Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems

Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer.

This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times in the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over BlueGene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.

Martin Rinard, MIT

Probabilistic Accuracy Bounds for Fault-Tolerant Computations that Discard Tasks

We present a new technique for enabling computations to survive software errors and hardware failures while providing a bound on any resulting output distortion. A developer using the technique first partitions the computation into tasks. The execution platform then simply discards any task that encounters a fault and completes the computation by executing any remaining tasks. This technique can substantially improve the robustness of the computation in the face of errors and failures. A potential concern is that discarding tasks may change the result that the computation produces.

Our technique randomly samples executions of the program at varying task failure rates to obtain a quantitative, probabilistic model that characterizes the distortion of the output as a function of the task failure rates. By providing probabilistic bounds on the distortion, the model allows users to confidently accept results produced by executions with failures as long as the distortion falls within acceptable bounds. This approach may prove to be especially useful for enabling computations to successfully survive hardware failures in distributed computing environments.

Our technique also produces a timing model that characterizes the execution time as a function of the task failure rates. The combination of the distortion and timing models quantifies an accuracy/execution time trade-off. It therefore enables the development of techniques that purposefully fail tasks to reduce the execution time while keeping the distortion within acceptable bounds.

Juan Carlos Moure, University Aut—noma of Barcelona

Wide and Efficient Prediction using the Local Trace Predictor

High prediction bandwidth enables performance improvements and power reduction techniques. This paper explores a mechanism to increase prediction width (instructions per prediction) by predicting instruction traces. A thorough analysis shows that predicting traces including multiple branches is not significantly less accurate than predicting single branches. A novel Local Trace Predictor organization is proposed, which increases prediction width without reducing the ratio of prediction accuracy versus memory resources with respect to a Basic Block Predictor. Compared to the previously proposed Next-Trace Predictor, the Local Trace Predictor reduces memory requirements by codifying trace predictions, and by limiting the number of traces starting at the same instruction to 2 or 4. The limit lessens prediction width only slightly, and does not affect prediction accuracy. The overall result is that the Local Trace Predictor outperforms the Next-Trace Predictor for sizes higher than 12 Kbytes.

Hu Chen, Intel China Research Center Ltd.

MPIPP: An Automatic Profile-guided Parallel Processes Placement Toolset in SMP Clusters and Multiclusters

SMPs clusters and multiclusters are widely used to execute message-passing parallel applications. The ways to map parallel processes to processors (or cores) could affect the application performance significantly due to the non-uniform communicate cost in such systems. It is desired to have a tool to map parallel processes to processors (or cores) automatically. Although there have been various efforts to address this issue, the existing solutions either require intensive user intervention, or do not able to handle the situation of multiclusters well. In this paper, we propose a profile-guided approach to find the optimized mapping automatically to minimize the cost of point-to-point communications for arbitrary message passing applications. The implemented toolset is called MPIPP( MPI Process Placement toolset), which includes several components: 1) A tool to get communication profile of MPI applications; 2) A tool to get the network topology of target clusters and 3) An algorithm to find optimized mapping, which is especially more effective than existing graph partition algorithms for multiclusters. We evaluated the performance of our tool with the NPB benchmarks and three other applications in several clusters. Experimental results show that the optimized process placement generated by our tools can achieve significant speedup.

Montse Farreras, Universitat Politecnica de Catalunya

Scaling MPI to short-memory MPPs such as BG/L

Scalability to large number of processes is one of the weaknesses of current MPI implementations. Standard implementations are able to scale to hundreds of nodes, but not beyond. The main problem in these implementations is that they assume some resources (for both data and control-data) will always be available to receive/process unexpected messages. As we will show, this is not always true, especially in short-memory machines like the BG/L that has 64K nodes but each node only has 512Mbytes of memory.

The objective of this paper is to present one algorithm that improves the robustness of MPI implementations for short-memory MPPs, taking care of data and control-data reception, the system will scale up to any number of nodes . The proposed solution achieves this goal without any observable overhead when there are no memory problems. Furthermore, in the worst case, when memory resources are extremely scarce, the overhead will never double the execution time (and we should never forget that in this extreme situation, traditional MPI implementations would fail to execute).

Apan Qasem, Rice University

Profitable Loop Fusion and Tiling Using Model-driven Empirical Search

Loop fusion and tiling are both recognized as effective transformations for improving memory performance of scientific applications. However, because of their sensitivity to the underlying cache architecture and their interaction with each other it is difficult to determine a good heuristic for applying these transformations profitably across architectures. In this paper, we present a model-guided empirical tuning strategy for profitable application of loop fusion and tiling. Our strategy consists of a detailed cost model that characterizes the interaction between the two transformations at different levels of the memory hierarchy. The novelty of our approach is in exposing key architectural parameters within the model for automatic tuning through empirical search. Preliminary experiments with a set of applications on four different platforms show that our strategy achieves significant performance improvement over fully optimized code generated by state-of-the-art commercial compilers. The time spent in searching for the best parameters is considerably less than with other search strategies.

Nicolas Vasilache, INRIA Futurs - France

Violated Dependence Amalysis

The polyhedral model is a powerful framework to reason about high level loop transformations. Yet the lack of scalable algorithms and tools has deterred actors from both academia and industry to put this model to practical use. Indeed, for fundamental complexity reasons, its applicability has long been limited to simple kernels. Recent developments broke some generally accepted ideas about these limitations. In particular, new algorithms made it possible to compute the target code for full SPEC benchmarks while this code generation step was expected not to be scalable.

Instancewise array dependence analysis computes a finite, intentional representation of the (statically unbounded) set of all dynamic dependences. This problem has always been considered non-scalable and/or an overkill with respect to less expressive and faster dependence tests. On the contrary, this article presents experimental evidence of its applicability to full SPEC CPU2000 benchmarks. To make this possible, we revisit the characterization of data dependences, considering relations between time dimensions of the transformed space. Beyond algorithmic benefits, this naturally leads to a novel way of reasoning about violated dependences across arbitrary transformation sequences. Reasoning about violated dependences relieves the compiler designer from the cumbersome task of implementing specific legality checks for each single transformation. It also allows, in the case of invalid transformations, to precisely determine the violated dependences that need to be corrected. Identifying these violations can in turn enable automatic correction schemes to fix an illegal transformation sequence with minimal changes.

Arun Kejariwal, University of California at Irvine

On the Dissection of Performance Potential of Types of Speculative Thread-Level Parallelism

Recent research in thread-level speculation (TLS) has proposed several mechanisms for optimistic execution of difficult-to-analyze serial codes in parallel. Though it has been shown that TLS does help achieve higher levels of parallelism, i.e., beyond what is achievable with techniques proposed for instruction-level parallelism (ILP), the ÔtrueÕ performance potential of TLS over non-speculative thread-level parallelism (TLP). In this paper, we evaluate the same. Further, we dissect the performance potential of TLS into what is achievable via each types --- control speculation, data dependence speculation and data value speculation --- of speculation. Assuming an ideal support for speculative execution, i.e., misspeculation does not incur any overhead, our study shows that, at the loop-level, TLS has a modest overall performance potential of 7.34% and 11.16% for SPEC CFP2006 and SPEC CINT2006 respectively.

Arun Kejariwal, University of California at Irvine

Lightweight Lock-Free Synchronization Methods for Multithreading

Emergence of chip multiprocessors has created a need for exploitation of beyond {DOALL-type thread-level parallelism (TLP). This calls for development of efficient thread synchronization techniques to exploit TLP in general parallel programs with dependences.

For this, several thread synchronization techniques have been proposed in the past. However, these limit the exploitation of fine-grain TLP due to large run-time overhead. Furthermore, the existing approaches can potentially result in (i) deadlocks between the different threads and (ii) non-deterministic run-time execution behavior as these techniques are oblivious of the underlying memory model. In this paper, we propose lightweight lock-free thread synchronization methods to exploit TLP in general parallel programs with dependences. Each synchronization method intrinsically guarantees the following in a multithreaded program:

(a) sequential consistency,

(b) atomicity of writes to the shared synchronization construct and

(c) absence of deadlocks.

This reduces the programming effort considerably, thereby easing the development of software for multithreaded systems.

For each method we formally prove that there cannot occur a deadlock between the different threads. This obviates the cumbersome and time-consuming process of detecting and eliminating deadlocks from the programmer. Experiments show that our synchronization methods incur a minimal overhead of 7.16% on an average. Further, we achieve performance speedups up to 3.39x on kernels extracted from the industry standard SPEC OMPM 2001 benchmarks, on a dedicated Intel Xeon 2.78 GHz 4-way multiprocessor.

Jeremy Buhler, Washington University

Accelerator Design for Protein Sequence HMM Search

Profile hidden Markov models (HMMs) are a powerful approach to describing biologically significant functional units, or motifs, in protein sequences. Databases of such models are regularly compared to large collections of proteins to recognize motifs in them. Exponentially increasing rates of genome sequencing have caused both protein and model databases to explode in size, placing an ever-increasing computational burden on users of these systems.

Here, we describe an accelerated search system that exploits parallelism in a number of ways. First, the application is functionally decomposed into a pipeline, with distinct compute resources executing each pipeline stage. Second, the first pipeline stage is deployed on an FPGA-based systolic array, which yields significant fine-grained parallelism. Third, for some instantiations of the design, parallel copies of the first pipeline stage are used, further increasing the level of coarse-grained parallelism.

A naive parallelization of the first stage computation has serious repercussions for the sensitivity of the search. We present two remedies to this dilemma and quantify when each is most effective. Analytic performance models are used to assess the speedup that can be attained relative to a single-processor software solution. Performance improvements of 1 to 2 orders of magnitude are predicted.

Adam Oliner, Stanford University

Cooperative Checkpointing: A Robust Approach to Large-Scale Systems Reliability

Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing. As high-performance computing systems continue to grow in size and complexity, the robustness conferred by cooperative checkpointing will be crucial for reliably running long jobs on inherently unreliable hardware.

Yogish Sabharwal, IBM India Research Lab

Scalable Algorithms for Global Snapshots in Distributed Systems

Existing algorithms for global snapshots in distributed systems are not scalable when the underlying topology is complete. In a network with $N$ processors, these algorithms require O(N) space and O(N) messages per processor. As a result, these algorithms are not efficient in large systems when the logical topology of the communication layer such as MPI is complete.

In this paper, we propose three algorithms for global snapshot: a grid-based, a tree-based and a centralized algorithm. The grid-based algorithm uses O(N) space but only O(\sqrt N) messages per processor. The tree- based algorithm requires only O(1) space and O(\log N \log w) messages per processor where $w$ is the average number of messages in transit per processor. The centralized algorithm requires only O(1) space and O(\log w) messages per processor. We also show a matching lower bound for this problem.

Our algorithms have applications in checkpointing, detecting stable predicates and implementing synchronizers. We have implemented our algorithms on top of the MPI library on the BlueGene/L supercomputer. Our experiments confirm that the proposed algorithms significantly reduce the message and space complexity of a global snapshot.

Dan Wallin, Uppsala university, Sweden

Multigrid and Gauss-Seidel Smoothers Revisited: Parallelization on Chip Multiprocessors

Efficient solutions require a match between the algorithm and the underlying architecture. The new chip-multiprocessors, CMPs (a.k.a. multicore), feature low intra-chip communication cost and smaller per-thread caches compared to earlier systems. From an algorithmic point of view this means that data locality issues become more important than communication overheads. This may require re-evaluation of many existing algorithms.

We have investigated parallel implementations of multigrid methods using a new temporally blocked, naturally ordered, smoother implementation. Compared with the standard multigrid solution based on the two-color red-black algorithm, we improve the data locality often as much as ten times while our use of a fine-grained locking scheme keeps the parallel efficiency high.

While our algorithm initially was inspired by CMPs, it was surprising to see our OpenMP multigrid implementation run up to 40 percent faster than the standard red-black algorithm on an 8-way SMP system. Studying the smoother part of the algorithm in isolation often shows it performing two iterations at the same time as a single iteration with an ordinary red-black smoother. Running our smoother on a 32-thread UltraSPARC T1 (Niagara) CMP demonstrates the communication cost of our algorithm to be low for such architectures.

Matteo Monchiero, Politecnico di Milano/UPC

Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View

Multicore architectures are ruling the recent microprocessor design trend. This is due to different reasons: better performance, thread-level parallelism bounds in modern applications, ILP diminishing returns, better thermal/power scaling (many small cores dissipate less than a large and complex one); and, ease and reuse of design.

This paper presents a thorough evaluation of multicore architectures. The architecture we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1 and L2, and shared bus interconnect. We consider parallel shared memory applications. We explore the design space related to the number of cores, L2 cache size and processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption and temperature. Design tradeoffs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floor plans. Their power/thermal characteristics are analyzed and they show the importance of considering thermal effects at the architectural level to achieve the best design choice.

Paul Stodghill, Cornell University

A Distributed System Based on Web Services for Computational Science Simulations

In this paper, we describe the ASP system, a test bed based on Web Services for coupled multi-physics simulations. The system is organized as a collection of geographically-distributed software components in which each component provides a Web Service, and uses standard SOAP-based Web Service protocols to interact with other components. There are a number of advantages to organizing a system in this way, which we discuss. We have analyzed the performance of our system for several applications and a number of problem sizes and have found that the overhead for using SOAP-based Web Services is small and tends to decrease as the problem size increases. Our results suggest that potential performance bottlenecks identified in the literature may not be major issues in practice, and that a standards-compliant implementation like ours can delivery excellent scalable performance even on coupled problems, provided Web Services are used judiciously.

Manolis Marazakis, FORTH-ICS

Efficient Remote Block-level I/O over a RDMA-capable NIC

We present a performance evaluation of remote block-level I/O over an RDMA-capable network interface card (NIC), currently under development at FORTH-ICS for use in SAN environments. We focus on offering application programs transparent and cost-effective access to a data storage utility. We find that the NIC's latency and throughput characteristics make it attractive as the underlying interconnect for a storage system.

This paper outlines the motivation for this work, and then proceeds to describe the architecture and state-of-development of our prototype, along with performance measurements. The measurements presented in this paper show that for increasing I/O request sizes the throughput approaches that of directly-attached storage; however important areas of overhead remain to be addressed.

James Balfour, Stanford University

Design Tradeoffs for Tiled CMP On-Chip Networks

We present detailed area and energy models for on-chip interconnection networks and describe tradeoffs in the design of networks for tiled CMPs. We investigate how aspects of the network architecture such as topology, channel width, and buffer size affect the network's performance and contribute to the interconnect overhead. We simulate the performance of a variety of on-chip networks designed for a tiled chip multiprocessor implemented in an advanced VLSI process and report area and energy efficiencies estimated using our models. Our results demonstrate that increased channel widths significantly improve area and energy efficiency in on-chip networks. We describe how introducing a second network to the system can increase performance while improving area and energy efficiency, and evaluate the tradeoffs of strategies for distributing traffic over the subnetworks. Drawing on insights from our analysis, we present a concentrated mesh topology with replicated subnetworks and express channels which provides a 28% improvement in area-delay and a 50% improvement in energy-delay over other networks evaluated in this study.

Mark Hampton, MIT CSAIL

Implementing Virtual Memory in a Vector Processor with Software Restart Markers

Vector processing provides many benefits in the domain of high-performance, energy-efficient computing. However, implementing a precise exception model can be costly in a vector processor due to the need to commit instructions in program order. As a result, vector machines typically do not support virtual memory, as this usually relies on precise exception handling. Lack of virtual memory support is one of the key factors that has hindered vector processing from being more widely used in general-purpose computing.

We remove the in-order commit requirements of precise exceptions by using software restart markers, which divide the program into idempotent regions of code. When executing instructions from a single region, the processor can commit results to architectural state in any order. If an exception occurs, the processor restarts execution at the beginning of the region. Since the values within a region do not need to be buffered, we are able to support virtual memory in a vector processor without incurring significant hardware overhead. Our scheme also removes the requirement of preserving vector register file contents in the event of a context switch. We show that using our approach causes an average performance reduction of less than 4% across a variety of benchmarks.

Joshua Yi, Freescale Semiconductor

The Exigency of Benchmark and Compiler Drift: Designing TomorrowÕs Processors with YesterdayÕs Tools

Due to the amount of time required to design a new processor, one set of benchmark programs may be used during the design phase while another may be the standard when the design is finally delivered. Using one benchmark suite to design a processor while using a different, presumably more current, suite to evaluate its ultimate performance may lead to sub-optimal design decisions if there are large differences between the characteristics of the two suites and their respective compilers. We call this change across time ÒdriftÓ. To evaluate the impact of using yesterdayÕs benchmark and compiler technology to design tomorrowÕs processors, we compare common benchmarks from the SPEC 95 and SPEC 2000 benchmark suites. Our results yield three key conclusions. First, we show that the amount of drift, for common programs in successive SPEC benchmark suites, is significant. In SPEC 2000, the main memory access time is a far more significant performance bottleneck than in SPEC 95, while less significant SPEC 2000 performance bottlenecks include the L2 cache latency, the L1 I-cache size, and the number of reorder buffer entries. Second, using two different statistical techniques, we show that compiler drift is not as significant as benchmark drift. Third, we show that benchmark and compiler drift can have a significant impact on the final design decisions. Specifically, we use a one-parameter-at-a-time optimization algorithm to design two different year-2000 processors, one optimized for SPEC 95 and the other optimized for SPEC 2000, using the energy-delay product (EDP) as the optimization criterion. The results show that using SPEC 95 to design a year-2000 processor results in an 18.5% larger EDP and a 20.8% higher CPI than using the SPEC 2000 benchmarks to design the corresponding processor. Finally, we make a few recommendations to help computer architects minimize the effects of benchmark and compiler drift.

Steve Carr, Michigan Technological University

Feedback-directed Memory Disambiguation Through Store Distance Analysis

Feedback-directed optimization has developed into an important tool in designing and building optimizing compilers. Based upon profiling, memory distance analysis has shown much promise in predicting data locality and memory dependences, and has seen use in locality based optimizations and memory disambiguation. In this paper, we introduce a new form of memory distance, called store distance, which is defined as the number of store references between a load and the previous store accessing the same memory location. We apply store distance analysis to the problem of memory disambiguation in out-of-order issue processors.

By generating a representative store distance for each load, we can apply a compiler/micro-architecture cooperative scheme to direct run-time load speculation. Using store distance, the processor can, in most cases, accurately determine on which specific store instruction a load depends according to its store distance annotation. Our experiments show that the proposed method performs much better than the previous distance-based memory disambiguation scheme, and yields performance very close to perfect memory disambiguation. The store distance based scheme also outperforms the store set technique with a relatively small predicator space and achieves performance comparable to that of a 16K-entry store set implementation for both floating point and integer programs.

Daniel Vanderster, University of Victoria

Sensitivity Analysis of Knapsack-based Task Scheduling on the Grid

The knapsack-based task scheduler has been previously shown to provide Quality of Service to malleable tasks on computational grids. In this study, we measure the sensitivity of the knapsack-derived schedules to variations in the prescribed allocation policies and their corresponding utility functions. In particular, we explore the effects of varying the strengths of an external user-specified monetary metric, an intrinsic estimated response time mediated by nearness to completion time metric, and of varying the shape of a sigmoidal normalizing utility function. The results of our analyses show that the knapsack strategy results in schedules that are consistent with the defined allocation policies. We conclude by indicating the recommended metric weights, that is, those that produce desirable schedule characteristics.

Sudharshan Vazhkudai, Oak Ridge National Lab

Coupling Prefix Caching and Collective Downloads for Remote Data Access

Scientific computing user communities construct complex distributed workflows. Data from these operations is typically archived at mass storage systems or data centers close to supercomputers or instruments. End-users of these datasets, however, usually carry out parts of the workflow at their local computers. In such cases, client-side caching can offer significant gains by hiding wide-area latency and improving performance.

Andreas Moshovos, Univ. of Toronto

BranchTap: Improving Performance with Very Few Checkpoints Through Adaptive Speculation Control

Checkpoint prediction and intelligent management have been recently proposed for reducing the number of coarse-grain checkpoints needed to support high performance through speculative execution. In this work, we take a closer look at various checkpoint prediction and management alternatives comparing their performance and requirements as the scheduler window increases. We also study a few additional design choices. The key contribution of this work is BranchTap, a novel checkpoint allocation strategy that temporarily throttles speculation to reduce recovery cost while allowing speculation to proceed when it is likely to boost performance. BranchTap dynamically adapts to application behavior. We demonstrate that for a 512-entry window processor with a FIFO of just four checkpoints our adaptive speculation control mechanism leads to an average performance degradation of just 1.03% compared to a processor that has an infinite number of checkpoints. This represents an improvement of 23.7% over the underlying prediction-based-only policy which results in an average performance deterioration of 1.35%. For the same configuration, BranchTap improves worst case deterioration drops from 4.82% to 3.50%.

Andrew Lumsdaine, Indiana University

Accelerating Sparse Matrix Computations via Data Compression

Sparse matrix computations are important for many scientific computations, with matrix-vector multiplication being a fundamental operation for modern iterative algorithms. For large sparse matrices, the primary performance limitation on matrix-vector product is memory bandwidth, rather than algorithm performance. In fact, the wide disparity between memory bandwidth and CPU performance suggests that one could trade cycles for bandwidth and still improve the time to compute a matrix-vector product. Accordingly, this paper presents an approach to improving the performance of matrix-vector product based on lossless compression of the index information commonly stored in sparse matrix representations. Two compressed formats, and their multiplication algorithms, are given, along with experimental results demonstrating their effectiveness. For an assortment of large sparse matrices, compression ratios and corresponding speedups of up to 30% are achieved. The efficiency of the compression algorithm allows its cost to be easily amortized across repeated matrix-vector products.

Wei Huang, The Ohio State University

A Case for High Performance Computing with Virtual Machines

Virtual machine (VM) technologies are experiencing a resurgence in both industry and research communities. VMs offer many desirable features such as security, ease of management, OS customization, performance isolation, check-pointing, and migration, which can be very beneficial to the performance and the manageability of high performance computing (HPC) applications. However, very few HPC applications are currently running in a virtualized environment due to the performance overhead of virtualization. Further, using VMs for HPC also introduces additional challenges such as management and distribution of OS images.

In this paper we present a case for HPC with VMs by introducing a framework which addresses the performance and management overhead associated with VM-based computing. Two key ideas in our design are: Virtual Machine Monitor (VMM) bypass I/O and scalable VM image management. VMM-bypass I/O achieves high communication performance for VMs by exploiting the OS-bypass feature of modern high speed interconnects such as InfiniBand. Scalable VM image management significantly reduces the overhead of distributing and managing VMs in large scale clusters. Our current implementation is based on the Xen VM environment and InfiniBand, however, many of our ideas are readily applicable to other VM environments and high speed interconnects.

We carry out detailed analysis on the performance and management overhead of our VM-based HPC framework. Our evaluation shows that HPC applications can achieve almost the same performance as those running in a native, non-virtualized environment. Therefore, our approach holds promise to bring the benefits of VMs to HPC applications with very little degradation in performance.

Matthew Curtis-Maury, College of William and Mary

Online Power-Performance Adaptation of Multithreaded Programs using Hardware Event-Based Prediction

With high-end systems featuring multicore/multithreaded processors and high component density, power-aware high-performance multithreading libraries become a critical element of the system software stack. Online power and performance adaptation of multithreaded code from within user-level runtime libraries is a relatively new and unexplored area of research. We present a user-level library framework for nearly optimal online adaptation of multithreaded codes for low-power, high-performance execution. Our framework operates by regulating concurrency and changing the processors/threads configuration as the program executes. It is innovative in that it uses fast, runtime performance prediction derived from hardware event-driven profiling, to select thread granularities that achieve nearly optimal energy-efficiency points. The use of predictors substantially reduces the runtime cost of granularity control and program adaptation. Furthermore, our prediction model significantly improves prediction accuracy compared to other hardware profile-driven performance prediction models proposed earlier. Our overall framework achieves performance and $ED^{2}$ (energy-delay-squared) levels which are: i) comparable to or better than those of oracle-derived offline predictors; ii) significantly better than those of online predictors using exhaustive or localized linear search. The complete prediction and adaptation framework is implemented on a real multi-SMT system with Intel Hyperthreaded processors and embeds adaptation capabilities in OpenMP programs.

Lieven Eeckhout, Ghent University

Accurate Memory Data Flow Modeling in Statistical Simulation

Microprocessor design is a very complex and time-consuming activity. One of the primary reasons is the huge design space that needs to be explored in order to identify the optimal design given a number of constraints. Simulations are usually used to explore these huge design spaces, however, they are fairly slow. Several hundreds of billions of instructions need to be simulated per benchmark; and this needs to be done for every design point of interest.

Recently, statistical simulation was proposed to efficiently cull a huge design space. The basic idea of statistical simulation is to collect a number of important program characteristics and to generate a synthetic trace from it. Simulating this synthetic trace is extremely fast as it contains a million instructions only.

This paper improves the statistical simulation methodology by proposing accurate memory data flow models. We model (i) load forwarding, (ii) delayed cache hits, and (iii) correlation between cache misses based on path info. Our experiments using the SPEC CPU2000 benchmarks show a substantial improvement upon current state-of-the-art statistical simulation methods. For example, for our baseline configuration we reduce the average IPC prediction error from 10.7% to 2.3%. In addition, we show that performance trends are predicted very accurately, making statistical simulation enhanced with accurate data flow models a useful tool for efficient and accurate microprocessor design space explorations.