Saturday July 1
Workshops
Tutorial

High Productivity Programming Languages and Models
Chair: Professor Hans P. Zima
JPL, California Institute of Technology
Pasadena, California, USA, and
University of Vienna, Austria
High performance computing has become the third pillar of science and
technology, providing the superior computational capability required
for dramatic advances in fields such as DNA analysis, drug design and
astrophysics. However, during the past decade, progress has been impeded
by a growing lack of adequate language and tool support. In today's
dominant programming paradigm, users are forced to adopt a low-level
programming style similar to assembly language if they want to fully
exploit the capabilities of parallel machines. This leads to high-cost
software production and error-prone programs that are difficult to write,
reuse, and maintain. Emerging peta-scale architectures with tens of
thousands of processors and applications of growing size and complexity
will further aggravate this problem.
This workshop will outline the state-of-the-art in programming paradigms
for high performance computing, identify the challenges posed by future
architectures and their applications, and discuss the requirements for
high-productivity programming languages that represent a viable compromise
between the dual goals of high-level abstraction and target code efficiency.
The overarching goal is to make scientists and engineers more productive
by improving programming language usability and reducing time-to-solution,
without sacrificing performance.
The three major approaches towards solving this problem will be presented
by leading experts in their respective fields:
The emerging class of Partitioned Global Address Space (PGAS) languages
provides a memory model in which a global address space is logically
partitioned and distributed across processing units. Programs are
formulated according to the Single-Program-Multiple-Data (SPMD) model.
Robert W. Numrich from the Minnesota Supercomputing Institute will discuss
Co-Array Fortran, a PGAS language that will be included in the upcoming
Fortran 2008 standard.
The high-productivity languages developed in the DARPA HPCS program
are characterized by object-oriented approaches based on a global
address space, support for multi-threading and locality-awareness
at a high level of abstraction, and an extended set of features enhancing
programming safety. Two languages will be presented in the workshop:
Vijay Saraswat from the IBM T.J. Watson Research Center and Penn
State University will discuss X10, developed in the PERCS project,
while Bradford Chamberlain from Cray Inc. and Hans Zima from the University
of Vienna and JPL will describe the major features of the
Chapel language developed in the Cascade project.
The third part of the workshop focuses on very high-level domain-
specific approaches. Markus Schordan from the Vienna University of
Technology will present new compiler technology supporting the translation
of domain-specific abstractions into efficient target code. Finally,
P. Sadayappan from Ohio State University will describe a domain-specific
language for quantum-chemistry applications and a parallel MATLAB
effort.
Presentations will be 30 minutes each. Abstracts for the presentations
follow:
High Productivity Languages for Parallel Architectures: Introduction
and Overview
Hans P. Zima
This talk outlines the state-of-the-art in programming paradigms for high
performance computing and describes the challenges posed by future architectures
and advanced applications. We focus on the requirements for high-productivity
programming languages that represent a viable compromise between the dual
goals of high-level abstraction and target code efficiency. The final
part of the talk provides a short overview of the topics discussed in
the workshop.
A Co-Array Fortran Tutorial
Robert W. Numrich
Minnesota Supercomputing Institute
Minneapolis, Minnesota, USA
Co-Array Fortran is a simple extension to Fortran 95 that allows the
programmer to write parallel applications using a familiar Fortran syntax.
The International Fortran Standards Committee is in the process of adding
co-arrays to Fortran 2008 as a standard feature. Co-Array Fortran assumes
an SPMD model with explicit data decomposition and explicit synchronization
supplied by the programmer. It is one of a triad of explicit SPMD models
that share a similar underlying programming model. The tutorial includes
a brief overview of the other two language extensions, Unified Parallel
C (UPC) and Titanium, an extension to Java, but the main emphasis will
be on the details of the co-array model with examples of how to apply
it to some typical problems in parallel programming. In addition, it
will demonstrate how to combine the co-array extension with the object-oriented
features of Fortran 95 to define distributed objects with associated
parallel methods for numerical linear algebra and PDE solvers. With
luck, we will have remote access to a Cray X1 for a live demonstration
of the power of Co-Array Fortran.
X10: An Object-Oriented Approach to Scalable Non-Uniform Cluster Computing
Vivek Sarkar
IBM T.J. Watson Research Center
The dominant emerging multiprocessor structure for the future is a
Non-Uniform Cluster Computing (NUCC) system with nodes that are built
out of multi-core SMP chips with non-uniform memory hierarchies, and
coupled with interconnects ranging in performance from those found in
commodity clusters such as blade servers to high-end supercomputers
such as the Blue Gene. Unlike previous generations of hardware evolution,
this shift towards multi-core SMP chips will have a major impact on
mainstream software. Current OO language facilities for concurrent and
distributed programming are inadequate for addressing the needs of NUCC
systems because they do not support the notions of non-uniform data
access within a node, or of tight coupling of distributed nodes.
We have designed a modern object-oriented programming language, X10,
for high-performance, high-productivity programming of NUCC systems,
as part of the IBM PERCS project in the DARPA HPCS program. A member
of the partitioned global address space family of languages, X10 highlights
the explicit reification of locality in the form of places; lightweight
activities embodied in async, future, foreach, and ateach constructs
that go beyond the SPMD model; a construct for termination detection
(finish); the use of lock-free synchronization (atomic blocks); and
the manipulation of cluster-wide global data structures. In this talk,
we will give an overview of the X10 programming model and language.
If time permits, a demo will also be presented of the X10 reference
implementation and Eclipse-based development environment.
This is joint work with other members of the X10 core team --- Raj
Barik, Philippe Charles, Christopher Donawa, Robert Fuhrer, Allan Kielstra,
Igor Peshansky, Christoph von Praun, and Vijay Saraswat.
This work has been supported in part by the Defense Advanced Research
Projects Agency (DARPA) under contract No. NBCH30390004.
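The finish, async, and atomic constructs named in the abstract have rough counterparts in mainstream threading libraries. The sketch below is a Python analogy, not X10: a thread pool's shutdown-on-exit plays the role of finish's termination detection, submitted tasks play the role of asyncs, and a plain lock stands in for X10's atomic blocks (which are isolated without an explicit lock):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counter = 0
lock = threading.Lock()  # stand-in for X10's "atomic" blocks

def activity(i):
    """One lightweight activity, like the body of an X10 async."""
    global counter
    # X10: atomic { counter += i; } -- isolation is guaranteed by the
    # language; here we approximate it with a lock.
    with lock:
        counter += i

# X10: finish { for (i in 1..10) async activity(i); }
# Exiting the executor's context manager waits for every submitted
# task, giving similar termination detection to "finish".
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(1, 11):
        pool.submit(activity, i)

print(counter)  # 55
```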
Parallel Programming in Chapel: An Example-Oriented Introduction
Bradford L. Chamberlain
Cray Inc.
Seattle, Washington, USA
This tutorial will introduce the audience to the Chapel parallel programming
language. Chapel is being developed at Cray Inc. as part of the DARPA
HPCS program to improve productivity on high-performance computing systems
by 2010. The tutorial will start with an exploration of Chapel's parallel
programming model. This model
extends and generalizes the global-view model of HPF and other high-level
data-parallel languages. The advantages of global-view parallel programming
will be discussed.
The tutorial will then discuss Chapel's support for generalized arrays
and domains. Derived from ZPL's regions, domains in Chapel are a first-class
abstraction used to represent index sets that are optionally distributed
between multiple compute locales. In addition to supporting traditional
multidimensional arrays, Chapel's domains also support sparse arrays,
associative arrays (sometimes called dictionaries), and a form of graphs
called opaque arrays. In addition, arrays and domains are extensible
in Chapel, allowing users to specify their implementation and distribution
between locales.
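Chapel code itself is beyond the scope of this program, but the core idea, a domain as a first-class index set over which arrays are declared, can be sketched in Python. All class names here are hypothetical illustrations, not Chapel's API:

```python
class Domain:
    """A first-class index set, loosely analogous to a Chapel domain."""
    def __init__(self, indices):
        self.indices = list(indices)
    def __iter__(self):
        return iter(self.indices)

class Array:
    """An array declared over a domain; its shape follows the domain."""
    def __init__(self, dom, init=0):
        self.dom = dom
        self.data = {idx: init for idx in dom}
    def __getitem__(self, idx):
        return self.data[idx]
    def __setitem__(self, idx, val):
        self.data[idx] = val

# A dense 2-D domain and an array over it.
dense = Domain((i, j) for i in range(3) for j in range(3))
a = Array(dense)

# A sparse subdomain: only the diagonal indices are stored.
sparse = Domain((i, i) for i in range(3))
diag = Array(sparse, init=1)

# An associative domain: a dictionary-like index set.
assoc = Domain(["red", "green", "blue"])
count = Array(assoc)
count["red"] = 7

print(len(a.data), len(diag.data), count["red"])  # 9 3 7
```

Because the index set is an object in its own right, the same array code works over dense, sparse, and associative index spaces, which is the property the tutorial exploits for the NAS MG benchmark.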
The rest of the tutorial will then examine the NAS MG benchmark. We
will see how the global-view parallel programming model simplifies the
code that the programmer needs to write. Further, it will be shown that
the generalized arrays, specifically sparse arrays, can easily be applied
to the Chapel version of the MG benchmark to make it more efficient
without requiring significant changes to the code. It will be seen that
a similar optimization made to the Fortran and MPI version of the benchmark
would require significant changes to the
code.
Time permitting, we will also examine the NAS FT benchmark and show
similar advantages of Chapel's parallel programming model and generalized
array support. We will argue that given the right abstractions, it becomes
easier for a programmer not only to experiment with different parallel
implementations but also to create
new parallel algorithms.
An Approach to User-Defined Data Distributions in Chapel
Hans P. Zima
This talk will describe an approach to data distributions in Chapel,
which has been developed at Caltech/JPL in the framework of the DARPA
HPCS project Cascade, in cooperation with Cray Inc. The main challenge
for the specification of user-defined data distributions is to expose
enough detail about parallel execution to the user to allow effective
communication of problem-specific knowledge, while concealing the
unproductive details of low-level parallel programming, such as
communication, synchronization, and the explicit distinction between
local and remote memory accesses. Rather than offering
a fixed set of built-in distributions, Chapel provides a distribution
class interface which allows the explicit specification of the mapping
of elements in a collection to units of uniform memory access, the control
of the arrangement of elements within such units, the definition of
sequential and parallel iteration over collections, and the specification
of allocation policies. The result is a concise high-productivity programming
model that separates algorithms from data representation and enables
reuse of distributions, allocation policies, and specialized data structures.
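The distribution class interface described above can be caricatured in Python: a distribution object maps each index in a collection to a "locale" (a unit of uniform memory access), and algorithms are written against that mapping rather than against a fixed built-in layout. The class and method names below are hypothetical, not Chapel's actual interface:

```python
class BlockDist:
    """Maps a 1-D index space onto locales in contiguous blocks."""
    def __init__(self, n, num_locales):
        self.block = -(-n // num_locales)  # ceiling division
    def locale_of(self, i):
        return i // self.block

class CyclicDist:
    """Maps indices onto locales round-robin."""
    def __init__(self, n, num_locales):
        self.num_locales = num_locales
    def locale_of(self, i):
        return i % self.num_locales

def owned_indices(dist, n, locale):
    """Sequential iteration over the elements a given locale owns."""
    return [i for i in range(n) if dist.locale_of(i) == locale]

block = BlockDist(8, 4)
cyclic = CyclicDist(8, 4)
print(owned_indices(block, 8, 0))   # [0, 1]
print(owned_indices(cyclic, 8, 0))  # [0, 4]
```

Swapping `BlockDist` for `CyclicDist` changes the data layout without touching the algorithm, which is the separation of algorithms from data representation that the abstract describes.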
ROSE: An Infrastructure for Abstraction-Aware High-Level Optimization
of Scientific Applications
Markus Schordan
Vienna University of Technology, Vienna, Austria
Making software development more productive can be considered as a
process of defining and using higher levels of abstraction. Abstractions
can be domain-specific and most often are user-defined. Domain-specific
abstractions have long been supported using libraries, but where they
implement critical features, performance can be problematic. When the
semantics of the user-defined abstraction is defined only by its implementation,
lacking a high-level specification, the ability of the compiler to perform
optimizations is often compromised. Ultimately program analysis is critical,
but often not readily obtained. The development of compile-time support
for abstractions defines a new avenue to the support of high-level abstractions,
and a technique to easily define customized domain-specific languages.
Tailoring an existing language to include compile-time support for
domain-specific abstractions defines a new approach to the efficient
development of new languages. The addition of language restrictions
further refines the definition of domain-specific support for increased
productivity in software development.
We shall present an infrastructure, ROSE, that supports research on
this type of compiler technology. ROSE permits the optimization of
user-defined abstractions. We shall demonstrate this approach with examples
from our own work in optimizing high-level abstractions of high-performance
scientific applications. We shall also discuss the architecture of ROSE
and the integration of tools, such as the Program Analysis Generator
(PAG) from AbsInt, for performing abstract interpretation. For supporting
various kinds of transformations, ROSE
offers multiple levels of rewrite interfaces, which we demonstrate with
a number of transformations. The optimizations are performed as source-to-source
transformations and eventually the optimized application codes are passed
to a vendor compiler to generate machine-specific code.
Our approach also feeds back into the better design of domain-specific
abstractions by addressing performance as a separate aspect. The approach
could well be a significant mechanism for addressing the poor economic
characteristics related to the support of computer languages within
scientific computing. By enabling domain-specific optimizations for
a general-purpose language, we define an alternative to the development
of a whole new language from scratch. It is well suited to smaller application
domains, and by addressing both higher levels of abstraction and performance,
it supports better software development productivity.
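ROSE itself operates on C/C++/Fortran ASTs as a source-to-source translator; as a toy stand-in, the same idea can be shown with Python's own `ast` module. Here a (hypothetical) domain-specific call `square(x)` is rewritten into the cheaper `x * x`, exploiting semantic knowledge of the abstraction that a generic compiler would not have:

```python
import ast

class InlineSquare(ast.NodeTransformer):
    """Rewrite square(x) -> x * x: an abstraction-aware optimization
    that relies on knowing the semantics of the 'square' abstraction."""
    def visit_Call(self, node):
        self.generic_visit(node)
        if (isinstance(node.func, ast.Name) and node.func.id == "square"
                and len(node.args) == 1):
            arg = node.args[0]
            return ast.BinOp(left=arg, op=ast.Mult(), right=arg)
        return node

source = "result = square(a) + square(b)"
tree = ast.parse(source)
optimized = ast.fix_missing_locations(InlineSquare().visit(tree))
print(ast.unparse(optimized))  # result = a * a + b * b
```

As in ROSE, the transformation happens entirely at the source level; the rewritten program would then be handed to the ordinary compiler (here, the Python interpreter) for code generation.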
High Level and Domain-Specific Languages: Parallel Global Address
Space Support
P. Sadayappan
Ohio State University
Columbus, Ohio, USA
This talk will describe two ongoing efforts at developing high-productivity
parallel programming language environments: 1) a domain-specific high-level
language targeted at a class of compute-intensive quantum chemistry
computations, and 2) a parallel MATLAB effort. These projects are built
on top of the ARMCI (Aggregate Remote Memory Copy Interface) and GA
(Global Arrays) libraries for parallel global address
space support.

Chair: Professor Hironori Nakajo
Tokyo University of Agriculture and Technology
ALPS focuses on the current technological challenges in developing power-aware
computing systems, ranging from servers to portable and embedded devices.
The goal of this workshop is to bring together people from industry
and academia who are interested in all aspects of power-aware computing.
The workshop will provide a relaxed forum to present and discuss new
ideas, new research directions and to review current trends in this
area. The topics of the workshop include any issue related to power-aware
computing.
08:25 Opening, Hironori Nakajo
08:30-09:15 Session 1: Invited Talk, Chair: Hironori Nakajo
- Energy-Efficient Embedded System Design at 90nm and Below -- A
System-Level Perspective
Tohru Ishihara (Kyushu University, Japan)
09:30-11:10 Session 2: Power-aware Compilation, Chair: Kenji Kise
- Dynamic Voltage and Frequency Scaling Method based on Statistical
Analysis
H. Sasaki, Y. Ikeda, M. Kondo, and H. Nakamura (University of Tokyo,
Japan)
- Empirical Study for Optimization of Power-Performance with On-Chip
Memory
C. Takahashi, M. Sato, D. Takahashi, T. Boku, H. Nakamura, M. Kondo,
and M. Fujita (University of Tsukuba, Japan)
- Performance Evaluation of Compiler Controlled Power Saving Scheme
J. Shirako, M. Yoshida, N. Oshiyama, Y. Wada, H. Nakano, H. Shikano,
K. Kimura, H. Kasahara (Waseda University, Japan)
- Optimizing the Profile-Guided Real-Time Voltage Scheduling Considering
the System Maximum Frequency
H. Yi, X. Yang, and J. Chen (National University of Defense Technology,
P.R.China)
11:25-12:45 Session 3: Low-power Design, Chair: Toshinori Sato
- Program Phase Detection Based Dynamic Control Mechanisms for Pipeline
Stage Unification Adoption
J. Yao, H. Shimada, S. Tomita, Y. Nakashima, S. Mori (Kyoto University,
Japan)
- Reducing Energy in Instruction Caches by Using Multiple Line Buffers
with Prediction
K. Ali, M. Aboelaze, and S. Datta (York University, Canada)
- Power and Performance Advantages of the Highly Clustered Microarchitecture
[S]
Y. Sato, K. Suzuki, and T. Nakamura (Tohoku University, Japan)
- Low Power FSM Synthesis with Testability [S]
S. Chaudhury, J. S. Rao, and S. Chattopadhyay (Indian Institute of
Technology Kharagpur, India)
12:45 Closing, Hironori Nakajo
[S] short presentation

Performance Tuning Techniques for HPC Applications
David Klepacki
IBM T.J. Watson Research Center
Simone Sbaraglia
University of Cagliari, Italy
IBM T.J. Watson Research Center
Audience
This tutorial is for scientists, engineers and programmers who wish
to study the performance behavior of their applications and develop
rapid techniques to tune them on modern computing systems.
Outline
The physical limitations imposed by device scaling, combined with the
continuing demand for increased computing power, result in computing
systems that are inherently more complex in design. Examples
include the introduction of more sophisticated processor cores capable
of vector instructions with boundary and size requirements, multiple
floating-point operations in the same cycle, hyperthreading, prefetching,
more complex memory hierarchies involving multiple levels of cache with
various associativity classes, and even network fabric that may introduce
non-uniform interconnect topologies. And, as intelligent and capable
as compiler technology has become, it is not able to resolve all
of the performance issues that surround this complexity. The compiler's
responsibility is to guarantee data integrity and correctness of function.
The performance behavior is still ultimately decided by the knowledge
and skill of the programmer.
We understand that people use computers to achieve scientific progress,
and should not be burdened with spending inordinate amounts of time
tuning applications as part of this goal. On the other hand, time is
ultimately money, so economic constraints impose the need to execute
applications as efficiently as possible. What we present here are the
most efficient methods we have found to help such researchers in their
quest to optimize their productivity.
Specifically, this tutorial will cover the most important skills of
application performance tuning that will net the largest gains with
the minimum of effort. In particular, we will cover:
- Quick review of system architecture components relevant to an application
programmer and how to exploit them in your code.
- Essential instruction cycle analysis and associated hand-tuning
skills that every programmer should know.
- Complexity analysis to ensure scalability of parallel application
development.
- Identifying load imbalance and other communication bottlenecks
when using MPI.
- Understanding false sharing and its drain on performance of shared
memory parallel applications (e.g., threaded applications, OpenMP).
- How to study the detailed memory movement and data access patterns
occurring in your application.
- Some useful techniques to improve I/O performance.
We focus on time-tested practical techniques based on our years of
experience in application performance modeling and tuning. We also introduce
you to the software tools of the trade to maximize efficiency and productivity.
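One of the listed skills, identifying load imbalance in MPI programs, reduces to comparing per-rank busy time. A minimal sketch of the usual metric (the per-rank times below are illustrative, standing in for measurements taken with `MPI_Wtime` around each rank's compute phase):

```python
# Per-rank busy times in seconds; one straggler rank forces the
# others to idle at the next synchronization point.
busy = [1.00, 1.05, 0.98, 1.90]

t_max = max(busy)
t_avg = sum(busy) / len(busy)

# A common load-imbalance metric: max/avg. A value of 1.0 means
# perfect balance; anything above it quantifies the straggler effect.
imbalance = t_max / t_avg

# Fraction of aggregate compute capacity wasted waiting on the
# slowest rank.
waste = 1.0 - t_avg / t_max

print(f"imbalance = {imbalance:.2f}, wasted = {waste:.0%}")
```

Values of the metric well above 1.0 point at repartitioning work or data before any finer-grained tuning is worthwhile.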
Presenters' Profiles
David Klepacki manages the Advanced Computing Technology for Exploratory
Server Systems at the IBM T.J. Watson Research Center in Yorktown Heights,
NY. He is a senior staff member at IBM Research with more than twenty
years of experience in high performance computing. He has worked in
industry in a variety of areas relating to computational science including
high performance processor design, numerically intensive computation,
computational physics, parallel computing, application benchmarking,
and cluster computing. He holds advanced degrees in physics, electrical
and computer engineering, including a Ph.D. in theoretical nuclear physics
from Purdue University.
Simone Sbaraglia earned a Ph.D. in mathematics at the University of
Rome, with thesis work on optimization theory. He currently holds a
research faculty position at the University of Cagliari in Italy, and
collaborates with IBM Research on the design and development of application
performance modeling technologies. Simone is also the architect of the
"pSigma" infrastructure and Principal Investigator for NSF-funded
research on "Simulation of Deep Memory Hierarchy Systems".
Prior to his current position, Simone Sbaraglia was a Research
Staff Member at IBM Research and a member of the National Research Council,
Institute of Applied Computing, in Italy.