20th ACM International Conference on Supercomputing


Saturday July 1




High Productivity Programming Languages and Models

Chair: Professor Hans P. Zima
JPL, California Institute of Technology
Pasadena, California, USA, and
University of Vienna, Austria

High performance computing has become the third pillar of science and technology, providing the superior computational capability required for dramatic advances in fields such as DNA analysis, drug design and astrophysics. However, during the past decade, progress has been impeded by a growing lack of adequate language and tool support. In today's dominating programming paradigm, users are forced to adopt a low-level programming style similar to assembly language if they want to fully exploit the capabilities of parallel machines. This leads to high-cost software production and error-prone programs that are difficult to write, reuse, and maintain. Emerging peta-scale architectures with tens of thousands of processors and applications of growing size and complexity will further aggravate this problem.

This workshop will outline the state of the art in programming paradigms for high performance computing, identify the challenges posed by future architectures and their applications, and discuss the requirements for high-productivity programming languages that represent a viable compromise between the dual goals of high-level abstraction and target code efficiency. The overarching goal is to make scientists and engineers more productive by increasing programming language usability and reducing time-to-solution, without sacrificing performance.

The three major approaches towards solving this problem will be presented by leading experts in their respective fields:

The emerging class of Partitioned Global Address Space (PGAS) languages provides a memory model in which a global address space is logically partitioned and distributed across processing units. Programs are formulated according to the Single-Program-Multiple-Data (SPMD) model. Robert W. Numrich from the University of Minnesota will discuss Co-Array Fortran, a PGAS language that will be included in the upcoming Fortran 2008 standard.

The high-productivity languages developed in the DARPA HPCS program are characterized by object-oriented approaches based on a global address space, support for multi-threading and locality-awareness at a high level of abstraction, and an extended set of features enhancing programming safety. Two of these languages will be presented in the workshop: Vijay Saraswat from the IBM T.J. Watson Research Center and Penn State University will discuss X10, developed in the PERCS project, while Bradford Chamberlain from Cray Inc. and Hans Zima from the University of Vienna and JPL will describe the major features of the Chapel language developed in the Cascade project.

The third part of the workshop focuses on very high-level domain-specific approaches. Markus Schordan from the Vienna University of Technology will present new compiler technology supporting the translation of domain-specific abstractions into efficient target code. Finally, P. Sadayappan from Ohio State University will describe a domain-specific language for quantum-chemistry applications and a parallel MATLAB effort.

Presentations will be 30 minutes each. Abstracts for the presentations follow:

High Productivity Languages for Parallel Architectures: Introduction and Overview
Hans P. Zima

This talk outlines the state-of-the-art in programming paradigms for high performance computing and describes the challenges posed by future architectures and advanced applications. We focus on the requirements for high-productivity programming languages that represent a viable compromise between the dual goals of high-level abstraction and target code efficiency. The final part of the talk provides a short overview of the topics discussed in the workshop.

A Co-Array Fortran Tutorial
Robert W. Numrich
Minnesota Supercomputing Institute
Minneapolis, Minnesota, USA

Co-Array Fortran is a simple extension to Fortran 95 that allows the programmer to write parallel applications using a familiar Fortran syntax. The International Fortran Standards Committee is in the process of adding co-arrays to Fortran 2008 as a standard feature. Co-Array Fortran assumes an SPMD model with explicit data decomposition and explicit synchronization supplied by the programmer. It is one of a triad of explicit SPMD language extensions that share a similar underlying programming model. The tutorial includes a brief overview of the other two, Unified Parallel C (UPC) and Titanium, an extension to Java, but the main emphasis will be on the details of the co-array model, with examples of how to apply it to some typical problems in parallel programming. In addition, it will demonstrate how to combine the co-array extension with the object-oriented features of Fortran 95 to define distributed objects with associated parallel methods for numerical linear algebra and PDE solvers. With luck, we will have remote access to a Cray X1 for a live demonstration of the power of Co-Array Fortran.
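To give a taste of the co-array model described above, here is a minimal sketch (variable names are illustrative; syntax follows the co-array extension proposed for Fortran 2008). Each image holds its own copy of a co-array; square brackets select a remote image's copy, and `sync all` supplies the explicit synchronization the model requires:

```
program caf_sketch
  implicit none
  real    :: x(10)[*]        ! one copy of x(10) on every image
  integer :: me, np

  me = this_image()          ! index of this image (1..num_images())
  np = num_images()

  x(:) = real(me)            ! each image initializes its local copy
  sync all                   ! barrier: all images finish before any reads

  if (me == 1) then
     ! image 1 reads the last image's copy via co-array bracket syntax
     print *, 'x(1) on last image =', x(1)[np]
  end if
end program caf_sketch
```

Without the `sync all`, image 1 could read the remote copy before image `np` had initialized it, which is exactly the kind of explicit synchronization responsibility the SPMD model places on the programmer.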

X10: An Object-Oriented Approach to Scalable Non-Uniform Cluster Computing
Vivek Sarkar
IBM T.J. Watson Research Center

The dominant emerging multiprocessor structure for the future is a Non-Uniform Cluster Computing (NUCC) system with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and coupled with interconnects ranging in performance from those found in commodity clusters such as blade servers to high-end supercomputers such as the Blue Gene. Unlike previous generations of hardware evolution, this shift towards multi-core SMP chips will have a major impact on mainstream software. Current OO language facilities for concurrent and distributed programming are inadequate for addressing the needs of NUCC systems because they do not support the notions of non-uniform data access within a node, or of tight coupling of distributed nodes.

We have designed a modern object-oriented programming language, X10, for high performance, high productivity programming of NUCC systems, as part of the IBM PERCS project in the DARPA HPCS program. A member of the partitioned global address space family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs that go beyond the SPMD model; a construct for termination detection (finish); the use of lock-free synchronization (atomic blocks); and the manipulation of cluster-wide global data structures. In this talk, we will give an overview of the X10 programming model and language. If time permits, a demo will also be presented of the X10 reference implementation and Eclipse-based development environment.
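The interplay of the constructs named above can be suggested by a small hedged sketch (the exact syntax has varied across X10 versions, and the names here are illustrative, not taken from the talk):

```
// Sketch only: finish waits for termination of all (transitively)
// spawned asyncs; async (p) runs an activity at place p; the atomic
// block provides lock-free-style synchronization.
finish {
    for (place p : place.places) {
        async (p) {              // lightweight activity at place p
            atomic { hits++; }   // hypothetical shared counter
        }
    }
}
// termination detection: all activities are complete here
```

The point of the sketch is the structure: places reify locality, asyncs go beyond SPMD by allowing an arbitrary number of activities per place, and `finish` provides the termination guarantee.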

This is joint work with other members of the X10 core team --- Raj Barik, Philippe Charles, Christopher Donawa, Robert Fuhrer, Allan Kielstra, Igor Peshansky, Christoph von Praun, and Vijay Saraswat.

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.

Parallel Programming in Chapel: An Example-Oriented Introduction
Bradford L. Chamberlain
Cray Inc.
Seattle, Washington, USA

This tutorial will introduce the audience to the Chapel parallel programming language. Chapel is being developed at Cray Inc. as part of the DARPA HPCS program to improve productivity on high-performance computing systems by 2010. The tutorial will start with an exploration of Chapel's parallel programming model. This model
extends and generalizes the global-view model of HPF and other high-level data-parallel languages. The advantages to global-view parallel programming will be discussed.

The tutorial will then discuss Chapel's support for generalized arrays and domains. Derived from ZPL's regions, domains in Chapel are a first-class abstraction used to represent index sets that are optionally distributed between multiple compute locales. In addition to supporting traditional multidimensional arrays, Chapel's domains also support sparse arrays, associative arrays (sometimes called dictionaries), and a form of graphs called opaque arrays. In addition, arrays and domains are extensible in Chapel, allowing users to specify their implementation and distribution between locales.
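A brief hedged sketch of these ideas (early-style Chapel syntax; identifiers are illustrative): a domain is a first-class index set, arrays are declared over domains, and a sparse subdomain captures a subset of indices.

```
config const n = 8;                  // overridable on the command line

const D: domain(2) = [1..n, 1..n];   // a 2D index set (domain)
var A: [D] real;                     // a dense array over the domain

var SpsD: sparse subdomain(D);       // a sparse subset of D's indices
var S: [SpsD] real;                  // a sparse array over those indices

forall (i,j) in D do                 // data-parallel iteration over D
  A(i,j) = i + j;
```

Because the iteration and the array are both expressed in terms of the domain, distributing D between locales changes where the data and computation live without rewriting this code.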

The rest of the tutorial will then examine the NAS MG benchmark. We will see how the global-view parallel programming model simplifies the code that the programmer needs to write. Further, we will show that generalized arrays, specifically sparse arrays, can easily be applied to the Chapel version of the MG benchmark to make it more efficient without requiring significant changes to the code, whereas a similar optimization made to the Fortran and MPI version of the benchmark would require significant changes to the source.

Time permitting, we will also examine the NAS FT benchmark and show similar advantages of Chapel's parallel programming model and generalized array support. We will argue that given the right abstractions, it becomes easier for a programmer not only to experiment with different parallel implementations but also to create
new parallel algorithms.


An Approach to User-Defined Data Distributions in Chapel
Hans P. Zima

This talk will describe an approach to data distributions in Chapel that has been developed at Caltech/JPL in the framework of the DARPA HPCS project Cascade, in cooperation with Cray Inc. The main challenge for the specification of user-defined data distributions is to expose enough detail about parallel execution to allow effective communication of problem-specific knowledge, while concealing the unproductive details of low-level parallel programming such as communication, synchronization, and the explicit distinction between local and remote memory accesses. Rather than offering a fixed set of built-in distributions, Chapel provides a distribution class interface which allows the explicit specification of the mapping of elements in a collection to units of uniform memory access, the control of the arrangement of elements within such units, the definition of sequential and parallel iteration over collections, and the specification of allocation policies. The result is a concise high-productivity programming model that separates algorithms from data representation and enables reuse of distributions, allocation policies, and specialized data structures.

ROSE: An Infrastructure for Abstraction-Aware High-Level Optimization of Scientific Applications
Markus Schordan
Vienna University of Technology, Vienna, Austria

Making software development more productive can be considered as a process of defining and using higher levels of abstraction. Abstractions can be domain-specific and most often are user-defined. Domain-specific abstractions have long been supported using libraries, but where they implement critical features, performance can be problematic. When the semantics of a user-defined abstraction is defined only by its implementation, lacking a high-level specification, the ability of the compiler to perform optimizations is often compromised. Ultimately, program analysis is critical, but its results are often not readily obtained. The development of compile-time support for abstractions opens a new avenue to the support of high-level abstractions, and a technique to easily define customized domain-specific languages. Tailoring an existing language to include compile-time support for domain-specific abstractions defines a new approach to the efficient development of new languages. The addition of language restrictions further refines the definition of domain-specific support for increased productivity in software development.

We shall present an infrastructure, ROSE, that supports research on this type of compiler technology. ROSE permits the optimization of user-defined abstractions. We shall demonstrate this approach with examples from our own work in optimizing high-level abstractions in high-performance scientific applications. We shall also discuss the architecture of ROSE and the integration of tools, such as the Program Analysis Generator (PAG) from AbsInt, for performing abstract interpretation. To support various kinds of transformations, ROSE offers multiple levels of rewrite interfaces, which we demonstrate with a number of transformations. The optimizations are performed as source-to-source transformations, and the optimized application codes are eventually passed to a vendor compiler to generate machine-specific code.

Our approach also feeds back into the better design of domain-specific abstractions by addressing performance as a separate aspect. The approach could well be a significant mechanism for addressing the poor economic characteristics of supporting computer languages within scientific computing. By enabling domain-specific optimizations for a general-purpose language, we define an alternative to developing a whole new language from scratch. It is well suited to smaller application domains, and by addressing both higher levels of abstraction and performance, it supports better software development productivity.

High Level and Domain-Specific Languages: Parallel Global Address Space Support
P. Sadayappan
Ohio State University
Columbus, Ohio, USA

This talk will describe two ongoing efforts at developing high-productivity parallel programming language environments: 1) a domain-specific high-level language targeted at a class of compute-intensive quantum chemistry computations, and 2) a parallel MATLAB effort. These projects are built on top of the ARMCI (Aggregate Remote Memory Copy Interface) and GA (Global Arrays) libraries for parallel global address
space support.


International Workshop on Advanced Low Power Systems

Chair: Professor Hironori Nakajo
Tokyo University of Agriculture and Technology

ALPS focuses on the current technological challenges in developing power-aware computing systems, ranging from servers to portable and embedded devices. The goal of this workshop is to bring together people from industry and academia who are interested in all aspects of power-aware computing. The workshop will provide a relaxed forum to present and discuss new ideas and research directions and to review current trends in this area. The topics of the workshop include any issue related to power-aware computing.

08:25 Opening, Hironori Nakajo

08:30-09:15 Session 1: Invited Talk, Chair: Hironori Nakajo

  • Energy-Efficient Embedded System Design at 90nm and Below -- A System-Level Perspective
    Tohru Ishihara (Kyushu University, Japan)

09:30-11:10 Session 2: Power-aware Compilation, Chair: Kenji Kise

  • Dynamic Voltage and Frequency Scaling Method based on Statistical Analysis
    H. Sasaki, Y. Ikeda, M. Kondo, and H. Nakamura (University of Tokyo, Japan)
  • Empirical Study for Optimization of Power-Performance with On-Chip Memory
    C. Takahashi, M. Sato, D. Takahashi, T. Boku, H. Nakamura, M. Kondo, and M. Fujita (University of Tsukuba, Japan)
  • Performance Evaluation of Compiler Controlled Power Saving Scheme
    J. Shirako, M. Yoshida, N. Oshiyama, Y. Wada, H. Nakano, H. Shikano, K. Kimura, H. Kasahara (Waseda University, Japan)
  • Optimizing the Profile-Guided Real-Time Voltage Scheduling Considering the System Maximum Frequency
    H. Yi, X. Yang, and J. Chen (National University of Defense Technology, P.R.China)

11:25-12:45 Session 3: Low-power Design, Chair: Toshinori Sato

  • Program Phase Detection Based Dynamic Control Mechanisms for Pipeline Stage Unification Adoption
    J. Yao, H. Shimada, S. Tomita, Y. Nakashima, S. Mori (Kyoto University, Japan)
  • Reducing Energy in Instruction Caches by Using Multiple Line Buffers with Prediction
    K. Ali, M. Aboelaze, and S. Datta (York University, Canada)
  • Power and Performance Advantages of the Highly Clustered Microarchitecture [S]
    Y. Sato, K. Suzuki, and T. Nakamura (Tohoku University, Japan)
  • Low Power FSM Synthesis with Testability [S]
    S. Chaudhury, J. S. Rao, and S. Chattopadhyay (Indian Institute of Technology Kharagpur, India)

12:45 Closing, Hironori Nakajo

[S] short presentation


Performance Tuning Techniques for HPC Applications

David Klepacki
IBM T.J. Watson Research Center

Simone Sbaraglia
University of Cagliari, Italy
IBM T.J. Watson Research Center


This tutorial is for scientists, engineers and programmers who wish to study the performance behavior of their applications and develop rapid techniques to tune them on modern computing systems.


The physical limitations that device scaling imposes, combined with the demand for ever-increasing computing power, result in computing systems that are inherently more complex in design. Examples include the introduction of more sophisticated processor cores capable of vector instructions with boundary and size requirements, multiple floating-point operations in the same cycle, hyperthreading, prefetching, more complex memory hierarchies involving multiple levels of cache with various associativity classes, and even network fabrics that may introduce non-uniform interconnect topologies. And, as intelligent and capable as compiler technology has become, compilers are not able to resolve all of the performance issues that surround this complexity. The compiler's responsibility is to guarantee data integrity and correctness of function; the performance behavior is still ultimately decided by the knowledge and skill of the programmer.

We understand that people use computers to achieve scientific progress, and should not be burdened with spending inordinate amounts of time tuning applications along the way. On the other hand, at some point time translates into money, so economic constraints impose the necessity of executing applications as efficiently as possible. What we present here, therefore, are the most efficient methods we have found to help such researchers optimize their productivity. Specifically, this tutorial will cover the application performance tuning skills that net the largest gains with the minimum of effort. In particular, we will cover:

  • Quick review of system architecture components relevant to an application programmer and how to exploit them in your code.
  • Essential instruction cycle analysis and associated hand-tuning skills that every programmer should know.
  • Complexity analysis to ensure scalability of parallel application development.
  • Identifying load imbalance and other communication bottlenecks when using MPI.
  • Understanding false sharing and its drain on performance of shared memory parallel applications (e.g., threaded applications, OpenMP).
  • How to study the detailed memory movement and data access patterns occurring in your application.
  • Some useful techniques to improve I/O performance.

We focus on time-tested practical techniques based on our years of experience in application performance modeling and tuning. We also introduce you to the software tools of the trade to maximize efficiency and productivity.

Presenters' Profiles

David Klepacki manages the Advanced Computing Technology for Exploratory Server Systems at the IBM T.J. Watson Research Center in Yorktown Heights, NY. He is a senior staff member at IBM Research with more than twenty years of experience in high performance computing. He has worked in industry in a variety of areas relating to computational science including high performance processor design, numerically intensive computation, computational physics, parallel computing, application benchmarking, and cluster computing. He holds advanced degrees in physics, electrical and computer engineering, including a Ph.D. in theoretical nuclear physics from Purdue University.

Simone Sbaraglia earned a Ph.D. in mathematics at the University of Rome, with thesis work on optimization theory. He currently holds a research faculty position at the University of Cagliari in Italy, and collaborates with IBM Research on the design and development of application performance modeling technologies. Simone is also the architect of the "pSigma" infrastructure and Principal Investigator for NSF-funded research on "Simulation of Deep Memory Hierarchy Systems". Prior to his current position, Simone Sbaraglia was a Research Staff Member at IBM Research and a member of the National Research Council, Institute of Applied Computing in Italy.

