Advance Program for ICS'02

All sessions are in the Davis Auditorium in the Schapiro Center for Engineering and Physical Sciences Research (Schapiro CEPSR), except for the second of the two parallel sessions which are held in room 833 SW Mudd.


8:30-9:00 Breakfast
8:45-9:00 Conference opening: Dr. Kemal Ebcioglu (IBM), ICS’02 General Chair, "Welcome to ICS’02"
9:00-10:00 Keynote address:  Tetsuya Sato (ESC Japan),
"Can the Earth Simulator Change the Way Humans Think?"
10:00-10:15 Break
10:15-11:45 Architecture I - John Gurd, chair
11:45-1:00 Lunch1
1:00-2:30 Low Power - Sid Chatterjee, chair Networks - Guang Gao, chair
2:30-2:45 Break
2:45-4:15 Panel Discussion:
"Programming Models for Future High Performance Computing Systems". 
Hans Zima (NASA JPL & U. Vienna), chair

Evening social event: Circle Line Harbor Lights tour of New York Harbor

5:00 Pickup box dinners at the Schapiro Center for Engineering and Physical Sciences Research (Schapiro CEPSR). Dinners can be eaten on the bus, but food is not allowed on the boat.
5:20 Board shuttle at Shapiro Center for Engineering and Physical Sciences Research (Schapiro CEPSR)
5:30 Shuttle to harbor leaves Schapiro CEPSR.
6:30-7:00 Boarding the Circle Line at Pier 83. Arrive early!
(boarding at 6:30)
Circle Line Harbor Lights tour of New York Harbor. A list of some of the sites seen includes,
  • Statue of Liberty
  • Ellis Island
  • Trinity Church
  • Brooklyn Bridge
  • South Street Seaport
  • United Nations buildings
  • Chelsea Piers and Sports Complex


9:00-10:00 Shuttle from harbor, making stops at,
  • Schapiro CEPSR
  • Carman Hall
  • The Lucerne Hotel
  • Astor on the Park



8:30-9:00 Breakfast
9:00-10:00 Keynote address: Alfred Spector (IBM)
"Challenges and Opportunities in Autonomic Computing"
10:00-10:15 Break
10:15-11:45 Compilers I - Rob Schreiber, chair
11:45-1:00 Lunch1
1:00-3:00 Operating Systems - Jason Nieh, chair Memory Wall - David Padua, chair
3:00-3:15 Break
3:15-4:45 Panel Discussion:
"Autonomous Space Systems and Supercomputing". 
Richard Doyle (NASA JPL), chair

Evening social event: Banquet at Becco.

6:50 Board shuttle at Shapiro CEPSR
7:00 Shuttle to banquet leaves Schapiro CEPSR
8:00-10:00 Conference banquet at Becco, 355 W. 46th St. (bet. 8th & 9th Aves.)  (212) 397-7597. Mapquest directions are here.
10:00-11:00 Shuttle from banquet, making stops at,
  • Schapiro CEPSR
  • Carman Hall
  • The Lucerne Hotel
  • Astor on the Park



8:30-9:00 Breakfast
9:00-10:00 Keynote address: David Kuck (Intel),
"Clustered Approaches to HPC via Commodity HW + Highly Evolved SW"
10:00-10:15 Break
10:15-11:45 Architecture II - Alex Nicolau, chair
11:45-1:00 Lunch1
1:00-2:30 Panel: Discussion:
"Dynamic Data Driven Application Systems".
Frederica Darema (NSF/CISE), chair
2:30-2:45 Break
2:45-4:45 Compilers II - Kemal Ebcioglu, chair Applications - Keshav Pingali, chair
4:45-5:15 Closing and Awards


  1. Lunch will not be served during the conference.  A number of eating establishments are within walking distance from the Columbia campus. A restaurant map is available here.

Technical Sessions

Architecture 1 (Monday, 10:15-11:45) - John Gurd, chair

Abstract: Active memory systems help processors overcome the memory wall when applications exhibit poor cache behavior. They consist of either active memory elements that perform data parallel computations in the memory system itself, or an active memory controller that supports address re-mapping techniques that improve data locality. Both active memory approaches create coherence problems -even on uniprocessor systems- since there are either additional processors operating on the data directly, or the processor is allowed to refer to the same data via more than one address. While most active memory implementations require cache flushes, we propose a new technique to solve the coherence problem by extending the coherence protocol. Our active memory controller leverages and extends the coherence mechanism, so that re-mapping techniques work transparently on both uniprocessor and multiprocessor systems.

We present a microarchitecture for an active memory controller with a programmable core and specialized hardware that accelerates cache line assembly and disassembly. We present detailed simulation results that show uniprocessor speedup from 1.3 to 7.6 on a range of applications and microbenchmarks. In addition to uniprocessor speedup, we show single-node multiprocessor speedup for parallel active memory applications and discuss how the same controller architecture supports coherent multi-node systems called active memory clusters.

Abstract: The DIVA (Data IntensiVe Architecture) system incorporates a collection of Processing-In-Memory (PIM) chips as smart-memory co-processors to a conventional microprocessor. We have recently fabricated prototype DIVA PIMs. These chips represent the first smart-memory devices designed to support virtual addressing and capable of executing multiple threads of control. In this paper, we describe the prototype PIM architecture. We emphasize three unique features of DIVA PIMs, namely, the memory interface to the host processor, the 256-bit wide datapaths for exploiting on-chip bandwidth, and the address translation unit. We present detailed simulation results on eight benchmark applications. When just a single PIM chip is used, we achieve an average speedup of 3.3X over host-only execution, due to lower memory stall times and increased fine-grain parallelism. These 1-PIM results suggest that a PIM-based architecture with many such chips yields significantly higher performance than a multi- processor of a similar scale and at a much reduced hardware cost.


HMCS (Heterogeneous Multi-Computer System) is a new parallel processing platform combining massively parallel processors for continuum simulation and particle simulation to realize multi-scale computational physics simulations. We are constructing a prototype system of HMCS with a general purpose scientific parallel processor CP-PACS and a gravity calculation parallel processor GRAPE-6 connecting them via commodity-base parallel network.

On the prototype of HMCS, a microscopic gravity calculation on GRAPE-6 and a macroscopic radiation hydrodynamic calculation on CP-PACS are performed simultaneously to realize very detailed simulation on computational astrophysics. Both systems are connected via parallel network controlling system named PIO (Parallel I/O System).

In this paper, we report the overall concept, design and implementation of HMCS, and our application result for radiative transfer smoothed particle hydrodynamics with self gravity for analysis of galaxy generation.

Low-power (Monday, 1:00-2:30) - Sid Chatterjee, chair

Abstract: Energy efficiency is becoming an increasingly important feature for both mobile and high-performance server systems. Most processors designed today include power management features that provide processor operating points which can be used in power management algorithms. However, existing power management algorithms implicitly assume that lower performance points are more energy efficient than higher performance points. Our empirical observations indicate that for many systems, this assumption is not valid.

We introduce a new concept called critical power slope to explain and capture the power-performance characteristics of systems with power management features. We evaluate three systems - a clock throttled Pentium laptop, a frequency scaled PowerPC platform, and a voltage scaled system to demonstrate the benefits of our approach. Our evaluation is based on empirical measurements of the first two systems, and publicly available data for the third. Using critical power slope, we explain why on the Pentium-based system, it is energy efficient to run only at the highest frequency, while on the PowerPC-based system, it is energy efficient to run at the lowest frequency point. We confirm our results by measuring the behavior of a web serving benchmark. Furthermore, we extend the critical power slope concept to understand the benefits of voltage scaling when combined with frequency scaling. We show that in some cases, it may be energy efficient not to reduce voltage below a certain point.

Abstract: This work addresses the issues of access latency and energy consumption in value predictor design for high-frequency, wide-issue microprocessors. Previous value prediction research allows for generous assumptions regarding table configurations and access conditions, while ignoring prediction latencies and energy issues . However, the latency of a high-performance value predictor cannot always be completely hidden by the early stages of the instruction pipeline as previously assumed, and it causes noticeable performance degradation versus unconstrained value prediction. This paper describes and compares several variations of basic value prediction methods: at-fetch, post-decode, and decoupled.

The performance of at-fetch and post-decode value predictors is limited by the high access latency of accurate predictor configurations. Decoupled value prediction excels at overcoming the high-frequency table access constraints by placing completion-time predictions into a separate and easily accessible storage. However, it has high energy requirements. We study a value prediction approach that combines the latency-friendly approach of decoupled value prediction with a more energy-efficient implementation. The traditional PC-indexed prediction tables are removed and replaced by a queue of prediction traces. This latency and energy aware method of maintaining and distributing speculated values leads to a 58%-95% reduction in value predictor energy consumption versus known value prediction techniques while still maintaining high performance.

Project URL:

Abstract: In some of today's superscalar processors (e.g. Pentium III), the result repositories are implemented as the Reorder Buffer (ROB) slots. In such designs, the ROB is a complex multi-ported structure that occupies a significant portion of the die area and dissipates a non-trivial fraction of the total chip power, as much as 27% according to some estimates. In addition, an access to such ROB typically takes more than one cycle, impacting the IPCs adversely.

We propose a low-complexity and low-power ROB design that exploits the fact that the bulk of the source operand values is obtained through data forwarding to the issue queue or through direct reads of the committed register values. Our ROB design uses an organization that completely eliminates the read ports needed to read out operand values for instruction issue. Any consequential performance degradation is countered by using a small number of associatively-addressed retention latches to hold the most recent set of values written into the ROB. The contents of the retention latches are used to satisfy the operand reads for issue that would otherwise have to be read from the ROB slots. Significant savings of the ROB real estate as well as power savings in the range of 20% to 30% for the ROB are achieved using the proposed technique. At the same time, the fact that results are accessible in a single cycle from the retention latches actually leads to an overall improvement in the IPC of up to 3% on the average for SPEC 2000 benchmarks.

Project URL:

Networks (Monday, 1:00-2:30) - Guang Gao, chair

Abstract: We propose a deterministic fault-tolerant and deadlock-free routing protocol in 2-dimensional (2-D) meshes based on dimension-order routing and the recently proposed odd-even turn model. The proposed protocol, called extended X-Y routing, does not use any virtual channels by prohibiting certain locations of faults and destinations. Faults are contained in a set of disjointed rectangular regions called faulty blocks. The number of faults to be tolerated is unbounded as long as nodes outside faulty blocks are connected in the mesh network. The extended X-Y routing can also be used under a special convex fault region called orthogonal faulty block, which can be derived from a given faulty block by activating some nonfaulty nodes in the block. Extensions to partial adaptive routing, traffic- and adaptivity-balanced using virtual networks, and routing without constraints using virtual channels and virtual networks are also discussed.

Abstract: The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.

Project URL:

Abstract: Decentralized and unstructured peer-to-peer networks such as Gnutella are attractive for certain applications because they require no centralized directories and no precise control over network topology or data placement. However, the flooding-based query algorithm used in Gnutella does not scale; each query generates a large amount of traffic and large systems quickly become overwhelmed by the query-induced load. This paper explores, through simulation, various alternatives to Gnutella's query algorithm, data replication strategy, and network topology. We propose a query algorithm based on multiple random walks that resolves queries almost as quickly as Gnutella's flooding method while reducing the network traffic by two orders of magnitude in many cases. We also present simulation results on a distributed replication strategy proposed in [8]. Finally, we find that among the various network topologies we consider, uniform random graphs yield the best performance.

Compilers 1 (Tuesday, 10:15-11:45) - Rob Schreiber, chair

Abstract: Modulo Scheduling is an instruction scheduling technique that is used by many current compilers. Different approaches have been proposed in the past but there is not a quantitative comparison among them, using the same compiling platform, benchmarks  and architectures.

This paper presents a performance comparison of the most relevant Modulo Scheduling techniques, based on a detailed quantitative evaluation of them. The results point out which are the most effective techniques for different architectures, which is useful for compiler designers when choosing the most appropriate technique for a particular processor architecture.

Abstract: To compete performance-wise, modern VLIW processors must have fast clock rates and high instruction-level parallelism (ILP). Partitioning resources (functional units and registers) into clusters allows the processor to be clocked faster, but operand transfers across clusters can easily become a bottleneck. Increasing the number of functional units increases the potential ILP, but only helps if the functional units can be kept busy.

To support these features, optimizations such as loop unrolling must be applied to expose ILP, and instructions must be explicitly assigned to clusters to minimize cross-cluster transfers. In an architecture with homogeneous clusters, the number of functional units of a given type is typically a multiple of the number of clusters. Thus, it is common to unroll a loop so that the number of copies of a loop body is a multiple of the number of clusters. The result is that there is a natural mapping of instructions to clusters, which is often the best mapping. While this mapping can be obvious by inspection, we have found that existing cluster assignment algorithms often miss this natural split. The consequence is an excessive number of inter-cluster transfers, which slows down the loop.

Because we were unable to find an existing cluster-assignment algorithm that performs well for unrolled loops, we designed our own. This algorithm has been implemented in a production compiler for the TMS320C6000, a two-cluster VLIW architecture. It is tailored for exploiting the patterns that result from either manual or compiler-based unrolling. As demonstrated experimentally, it performs well, even when other optimizations such as SIMD partially obscure the natural split.

Abstract: Software pipelining is widely used as a compiler optimization technique to achieve high performance in machines that exploit instruction-level parallelism. However, surprisingly, there have been few theoretical or empirical results on optimal software pipelining of loops with control flows. In this paper, we present three new contributions for this under-investigated problem. First, we propose a necessary and sufficient condition for a loop with control flows to have an optimally software-pipelined program. We also present a decision procedure to compute the condition. Second, we present two software pipelining algorithms. The first algorithm computes an optimal solution for every loop satisfying the condition, but may run in exponential time. The second algorithm computes optimal solutions efficiently for most (but not all) loops satisfying the condition. Third, we present experimental results which strongly indicate that achieving the optimality in the software-pipelined programs is a viable goal in practice with realistic hardware support.

Operating Systems (Tuesday, 1:00-3:00) - Jason Nieh, chair

Abstract: The Beowulf Distributed Process Space (BProc) is a set of Linux kernel modifications which provides a single system image and process migration facilities for processes running in a Beowulf style cluster. With BProc, all the processes running in a cluster are visible on the cluster front end machine and are controllable via existing UNIX process control mechanisms. Process creation is done on the front end machine and the processes are placed on the nodes where they will run with BProc's process migration mechanism.

These two features combined greatly simplify creating and cleaning up parallel jobs as well as removing the necessity of a user login to remote nodes in the cluster. Removing the need for user logins drastically reduces the mount of software required on cluster nodes.

Job startup with BProc's process migration mechanism is faster than the traditional method of logging into a node and starting the process with rsh. BProc does not affect file or network I/O of processes running on remote nodes so the vast majority of MPI applications will experience no performance loss as a result of being managed by BProc.

Project URLs:,

Abstract: In this paper we introduce DualFS, a new high performance journaling file system that puts data and meta-data on different devices (usually, two partitions on the same disk or on different disks), and manages them in very different ways. Unlike other journaling file systems, DualFS has only one copy of every meta-data block. This copy is in the "meta-data device", a log which is used by DualFS both to read and to write meta-data blocks. By avoiding a time-expensive extra copy of meta-data blocks, DualFS can achieve a good performance as compared to other journaling file systems. Indeed, we have implemented a DualFS prototype, which has been evaluated with microbenchmarks and macrobenchmarks, and we have found that DualFS greatly reduces the total I/O time taken by the file system in most cases (up to 97%), whereas it slightly increases the total I/O time only in a few and limited cases.

Abstract:  Given the increasing performance disparity between processors and storage devices, exploiting knowledge of spatial and temporal I/O requests is critical to achieving high performance, particularly on parallel systems. Although perfect foreknowledge of I/O requests is rarely possible, even estimates of request patterns can potentially yield large performance gains. This paper evaluates Markov models to represent the spatial patterns of I/O requests in scientific codes. The paper also proposes three algorithms for I/O prefetching. Evaluation using I/O traces from scientific codes shows that highly accurate prediction of spatial access patterns, resulting in reduced execution times, is possible.

Project URL:

Abstract: Scientific simulations running on parallel platforms output intermediate data periodically, typically moving the output to a remote machine for visualization. Due to the large data size and slow network, improvements in output and migration performance can significantly reduce simulation turnaround time. In this paper, we propose a novel simulation execution environment that integrates active buffering and compressed migration to hide and reduce the cost of output and migration. Our implementation removes previous shortcomings of active buffering and of compressed migration, and our experiments with real world data sets show that integrating the two techniques brings comparable application-visible I/O performance and much more flexibility compared to compressed migration without active buffering. The performance advantage will increase with slower-to-write file formats such as HDF.

Project URL:

Memory-wall (Tuesday, 1:00-3:00) - David Padua, chair

Abstract: Data prefetching is an effective approach to addressing the memory latency problem. While a few processors have implemented hardware-based data prefetching, the majority of modern processors support data-prefetch instructions and rely on compilers to automatically insert prefetches. However, most prefetching schemes in commercial compilers suffer from two limitations: (1) the source code must be available before prefetching can be applied, and (2) these prefetching schemes target only loops with statically-known strided accesses. In this study, we broaden the scope of software-controlled prefetching by addressing the above two limitations. We use profiling to discover strided accesses that frequently occur during program execution but are not determinable by the compiler. We then use the strides discovered to insert prefetches into the executable directly, without the need for re-compilation. Performance evaluation was done on an Alpha 21264-based system with a 64KB data cache and an 8MB secondary cache. We find that even with such large caches, our technique offers speedups ranging from 3% to 56% in 11 out of the 26 SPEC2000 benchmarks. Our technique has been incorporated into Pixie and Spike, two products in Compaq's Tru64 Unix.

Abstract: Trace caches are used to help dynamic branch prediction make multiple predictions in a cycle by embedding some of the predictions in the trace. In this work, we evaluate a trace cache that is capable of delivering a trace consisting of a variable number of instructions via a linked list mechanism. We evaluate several schemes in the context of an x86 processor model that stores decoded instructions. By developing a new classification for trace cache accesses, we are able to target those misses that cause the largest performance loss. We have proposed a hardware speculation technique, called NonHead Miss Speculation, which removes much of the penalty associated with nonhead misses in the eight applications we studied. Performance improvements ranged from 2% to 20%, with an average speedup of around 10% across our application suite.

Abstract:  A processor must know a load instruction's latency to schedule the load's dependent instructions at the correct time. Unfortunately, modern processors do not know this latency until well after the dependent instructions should have been scheduled to avoid pipeline bubbles between themselves and the load. One solution to this problem is to predict the load's latency, by predicting whether the load will hit or miss in the data cache. Existing cache hit/miss predictors, however, can only correctly predict about 50% of cache misses.

This paper introduces a new hit/miss predictor that uses a Bloom Filter to identify cache misses early in the pipeline. This early identification of cache misses allows the processor to more accurately schedule instructions that are dependent on loads and to more precisely prefetch data into the cache. Simulations using a modified SimpleScalar model show that the proposed Bloom Filter is nearly perfect, with a prediction accuracy greater than 99% for the SPECint2000 benchmarks. IPC (Instructions Per Cycle) performance improved by 19% over a processor that delayed the scheduling of instructions dependent on a load until the load latency was known, and by 6% and 7% over a processor that always predicted a load would hit the cache and with a counter-based hit/miss predictor respectively. This IPC reaches 99.7% of the IPC of a processor with perfect scheduling.

Abstract: The increasing gap in performance between processors and main memory has made effective instructions prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to I-cache. A recent study by [19] shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First we present a method that does not require an extra port to I-cache. Second, the performance improvement for our method is greater than the best competing method BHGP [23] even disregarding the improvement from not having an extra port.

The three key features of our method that prevent the above deficiencies are as follows. First, too-late prefetching is prevented by correlating misses to dynamically preceding instructions. For example, if the I-cache miss latency is 12 cycles, then the instruction that was fetched 12 cycles prior to the miss is used as the prefetch trigger. Second, the miss history table is kept to a reasonable size by grouping contiguous cache misses together and associated them with one preceding instruction, and therefore, one table entry. Third, the extra I-cache port is avoided through efficient prefetch filtering methods. Experiments show that for our benchmarks, chosen for their poor I-cache performance, an average improvement of 9.2% in runtime is achieved versus the BHGP methods [23], while the hardware cost is also reduced. The improvement will be greater if the runtime impact of avoiding an extra port is considered.

Architecture 2 (Wednesday, 10:15-11:45) - Alex Nicolau, chair

Abstract: Clustered microarchitectures are becoming a common organization due to their potential to reduce the penalties caused by wire delays and power consumption. Fully-distributed architectures are particularly effective to deal with these constraints, and besides they are very scalable. However, the distribution of the data cache memory poses a significant challenge and may be critical for performance. In this work, a distributed data cache VLIW architecture based on an interleaved cache organization along with cyclic scheduling techniques are proposed. Moreover, the use of Attraction Buffers for such an architecture is introduced. Attraction Buffers are a novel hardware mechanism to increase the percentage of local accesses. The idea is to allow the movement of some data towards the clusters that need it.

Performance results for 9 Mediabench benchmarks show that our scheduling techniques are able to hide the increased memory latency when accessing data mapped in a remote cluster. In addition, the local hit ratio is increased by 15% and stall time is reduced by 30% when using the same scheduling techniques with an interleaved cache clustered processor with Attraction Buffers. Finally, the proposed architecture is compared with a state-of-the-art distributed architecture such as the multiVLIW. Results show that the performance of an interleaved cache clustered VLIW processor with Attraction Buffers is similar to that of the multiVLIW architecture, whereas the former has a lower hardware complexity.

Project URL :

Abstract: The reasons for performance losses due to conditional branch mispredictions are first studied. Branch misprediction penalties are broken into three categories: pipeline-fill penalty, window-fill penalty, and serialization penalty. The first and third of these produce most of the performance loss, but the second is also significant. Previously proposed dual (or multi) path execution methods attempt to reduce all three penalties, but these methods are also quite complex. Most of the complexity is caused by simultaneously executing instructions from multiple paths.

A good engineering compromise is to avoid the complexity of multiple path execution by focusing on methods that reduce only the pipeline and window re-fill penalties. Dual Path Instruction Processing (DPIP) is proposed as a simple mechanism that fetches, decodes, and renames, but does not execute, instructions from the alternative path for low confidence predicted branches at the same time as the predicted path is being executed. All the stages of the pipeline front-end are hidden once the misprediction is detected. This method thus targets the pipeline-fill penalty and is  shown to achieve a good trade-off between performance and complexity. To reduce the window-fill penalty, we further propose the addition of a pre-scheduling engine that schedules instructions from the alternative path in an estimated execution order. Thus, after a misprediction, a high number of instructions from the alternate path can be immediately issued to execution, achieving an effect similar to very fast re-filling of the window. Performance evaluation of DPIP in a 14-stage superscalar processor (like IBM Power 4) shows an average IPC improvement of up to 10% for the bzip2 benchmark, and an average of 8% for ten benchmarks from the SPECint95 and SPECint2000 suites.

Abstract: Predicated Execution has been put forth as a method for improving processor performance by removing hard-to-predict branches. As part of the process of turning a set of basic blocks into a predicated region, both paths of a branch are combined into a single path. There can be multiple definitions from disjoint paths that reach a use. Waiting to find out the correct definition that actually reaches the use can cause pipeline stalls.

In this paper we examine a hardware optimization that dynamically collects and analyzes path information to determine valid dependences for predicated regions of code. We then use this information for an in-order VLIW predicated processor, so that instructions can continue towards execution without having to wait on operands from false dependences. Our results show that using our Disjoint Path Analysis System provides speedups over 6% and elimination of false RAW dependences of up to 14% due to the detection of erroneous dependences in if-converted regions of code.

Project URL:

Compilers 2 (Wednesday, 2:45-4:45) - Kemal Ebcioglu, chair

Abstract: Processing and analyzing large volumes of data plays an increasingly important role in many domains of scientific research. The complexity and irregularity of datasets in many domains make the task of developing such processing applications tedious and error-prone.

We propose use of high-level abstractions for hiding the irregularities in these datasets and enabling rapid development of correct data processing applications. We present two execution strategies and a set of compiler analysis techniques for obtaining high performance from applications written using our proposed high-level abstractions. Our execution strategies achieve high locality in disk accesses. Once a disk block is read from the disk, all iterations that access any of the elements from this disk block are performed. To support our execution strategies and improve the performance, we have developed static analysis techniques for: 1) computing the set of iterations that access a particular right-hand-side element, 2) generating a function that can be applied to the meta-data associated with each disk block, for determining if that disk block needs to be read, and 3) performing code hoisting of conditionals.

We present experimental results from a prototype compiler implementing our techniques to demonstrate the effectiveness of our approach.

Abstract: Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it may be reasonable to trade increased computation costs for reduced memory costs. The contributions of this paper are three-fold: we provide a detailed analysis of the memory performance of a set of seven, memory-intensive benchmarks; we describe Computation Regrouping, a general, source-level approach to improving the overall performance of these applications by improving temporal locality to reduce cache and TLB miss ratios (and thus memory stall times); and we demonstrate significant performance improvements from applying Computation Regrouping to our suite of seven benchmarks. With Computation Regrouping, we observe an average speedup of 1.97, with individual speedups ranging from 1.26 to 3.03. Most of this improvement comes from eliminating memory stall time.

Project URL:

Abstract: We present a points-to analysis that aims at enabling loop-based dependence analysis in the presence of Java references. The analysis is based on an abstraction called element-wise points-to (ewpt) mapping. An ewpt mapping summarizes, in a compact representation, the relation between a pointer and the heap object it points to, for every instance of the pointer inside a loop and for every array element directly accessible through this pointer. Such instance-wise and element-wise information is especially important for loop-based dependence analyses and for a language where multi-dimensional arrays are implemented as arrays of pointers. We describe an iterative algorithm to compute ewpt mappings. We also present techniques to remove objects from ewpt mappings for destructive updates.

The points-to algorithm was implemented and evaluated on a set of benchmark programs. We demonstrate that ewpt information can significantly improve the precision of dependence analysis. In many cases, the dependence analysis reports no false dependences due to array accesses.

Abstract: We present a novel Hybrid Analysis technology which can efficiently and seamlessly integrate all static and run-time analysis of memory references into a single framework that is capable of performing all data dependence analysis and can generate necessary information for most associated memory related optimizations. We use HA to perform automatic parallelization by extracting run-time assertions from any loop and generating appropriate run-time tests that range from a low cost scalar comparison to a full, reference by reference run-time analysis. Moreover we can order the run-time tests in increasing order of complexity (overhead) and thus risk the minimum necessary overhead. We accomplish this by both extending compile time IP analysis techniques and by incorporating speculative run-time techniques when necessary. Our solution is to bridge 'free' compile time techniques with exhaustive run-time techniques through a continuum of simple to complex solutions. We have implemented our framework in the Polaris compiler by introducing an innovative intermediate representation called RT_LMAD and a run-time library that can operate on it. Based on the experimental results obtained to date we hope to to automatically parallelize most and possibly all PERFECT codes, a significant accomplishment.

Applications (Wednesday, 2:45-4:45) - Keshav Pingali, chair

Abstract: Two physical objects cannot occupy the same space at the same time. Simulated physical objects do not naturally obey this constraint. Instead, we must detect when two objects have collided-- we must perform collision detection. This work presents a simple voxel-based collision detection algorithm, an efficient parallel implementation of the algorithm, and performance results.

Project URL:

Abstract:  Recently there has been a lot of interest in improving the infrastructure used in medical applications. In particular, there is renewed interest on non-invasive, high-resolution diagnostic methods. One such method is digital, 3D ultrasound medical imaging. Current state-of-the-art ultrasound systems use specialized hardware for performing advanced processing of input data to improve  the quality of the generated images. Such systems are limited in their capabilities by the underlying computing architecture and they tend to be expensive due to the specialized nature of the solutions they employ.

Our goal in this work is twofold: (i) To understand the behavior of this class of emerging medical applications in order to provide an efficient parallel implementation and (ii) to introduce a new benchmark for parallel computer architectures from a novel and important class of applications. We address the limitations faced by modern ultrasound systems by investigating how all processing required by advanced beamforming algorithms can be performed on modern clusters of high-end PCs connected with low-latency, high-bandwidth system area networks. We investigate the computational characteristics of a state-of-the-art algorithm and demonstrate that today's commodity architectures are capable of providing almost-real-time performance without compromising image quality significantly.

Project URL:

Abstract: LLNL's hypre library is an object-oriented library for the solution of sparse linear systems on parallel computers. While hypre facilitates rapid-prototyping of complex parallel applications, our experience is that without careful attention to temporal data locality, node performance of applications developed using hypre will fall significantly short of peak performance on architectures based on modern microprocessors. In this paper, we describe our experiences analyzing and tuning the performance of smg98, a benchmark that exercises hypre's semicoarsening multigrid solver. In the original code, the lack of temporal data reuse in the registers and caches significantly hurts performance. We describe a variety of techniques we applied to hand-tune the performance of hypre's semicoarsening multigrid solver. We expect that similar strategies will be applicable to other solvers and codes based on hypre as well. We present performance measurements of smg98 on both SGI Origin and Compaq Alpha platforms. Overall, our optimizations improve the node performance of smg98 by nearly a factor of two on large problems.

Abstract: This paper develops a performance model that is used to control the adaptive execution the ATR code for solving large stochastic optimization problems on computational grids. A detailed analysis of the execution characteristics of ATR is used to construct the performance model that is then used to specify (a) near-optimal dynamic values of parameters that govern the distribution of work, and (b) a new task scheduling algorithm. Together, these new features minimize ATR execution time on any collection of compute nodes, including a varying collection of heterogeneous nodes. The new adaptive code runs up to eight-fold faster than the previously optimized code, and requires no input parameters from the user to guide the distribution of work. Furthermore, the modeling process led to several changes in the Condor runtime environment, including the new task scheduling algorithm, that produce significant performance improvements for master-worker computations as well as possibly other types of grid applications.

Project URL: