
\documentstyle[12pt,epsf]{article}

\setlength{\textwidth}{6.1in}
\setlength{\textheight}{8.0in}

\setlength{\oddsidemargin}{0pt}
\setlength{\evensidemargin}{0pt}

\begin{document}

%\centerline{\large\bf C-SAFE CS Task Annual Report}

%\vspace{0.3in}
%\centerline{\bf C-SAFE CS Associate Director: Tom Henderson}

%\vspace{0.2in}
%\centerline{\bf C-SAFE CS PI: Al Davis}
%\centerline{\bf C-SAFE CS PI: Chuck Hansen}
%\centerline{\bf C-SAFE CS PI: Chris Johnson}
%\centerline{\bf C-SAFE CS PI: Bob Kessler}
%\centerline{\bf C-SAFE CS PI: Gary Lindstrom}

%\vspace{0.1in}
%\centerline{\bf 22 September 1998}

%\newpage

The major areas of the Computer Science effort include:
\begin{itemize}
  \item {\bf Problem Solving Environment}: Our major work this year
  has been to map out the transition from the existing SCIRun PSE to
  the new C-SAFE PSE (called SimSAFE).  Other areas addressed include
  the development of a distributed computing model, and
  the study of specific computational modules and needs from the three
  C-SAFE steps.  Also, we have participated in the development of a
  software architecture for meshing, numerical solvers and
  visualization schemes.

  \item {\bf Visualization}: Step-centric visualization modules have
  been explored, in particular, for the Container Dynamics Material
  Point Method, as well as CFD data visualization.  We have also
  worked on Gigabyte dataset visualization and multi-pipe
  rendering in which multiple graphics adapters are exploited.

  \item {\bf Performance Analysis}: We have investigated existing
  continuous profiling systems for the SGI O2K platform, and developed
  some new tools for exploiting counters that the standard performance
  tools do not use.  We have also collected representative test codes
  from the national labs, the PSE group and the C-SAFE steps in order
  to characterize performance issues.

  \item {\bf Software and Data Management}: Much of the first year has
  been spent developing a coherent approach to Scientific Data
  Management (SDM), as well as software engineering issues.  We have a
  well-specified software development management structure.  SDM has
  two major parts: (1) a web-based repository and control facility,
  and (2) infrastructure support for managing datasets themselves.

\end{itemize}

\section{Problem Solving Environment}

During the first year of the University of Utah ASCI Alliance Center, the
problem solving environment team has worked on the infrastructure, and its
parallelization, which will be required to integrate codes from various
C-SAFE researchers and teams.  In addition, we have worked towards
integration by examining several national lab codes that the C-SAFE
researchers are planning to use.  Finally, we continue the process of
coordinating with researchers in the other steps, as well as researchers in
computer science, for the pieces that will be integrated into the final
C-SAFE software product.

First, we have implemented peer-to-peer distributed computing capabilities
in SCIRun.  This will become a necessity for the C-SAFE project (see
Figure~\ref{fig:PSE1}) as we scale beyond 128 nodes on the Origin 2000 and
need to use explicit message passing between clusters of 128-node
shared-memory environments.  Previously, our modeling, simulation and
visualization tools were all executed on the same shared-memory parallel
machine (SGI Origin 2000).  With these new capabilities, we can execute a
simulation on one machine while simultaneously using visualization
capabilities on a different machine.
%
\begin{figure}[h]
\epsfxsize = 6.5in
\hfil\epsffile{fig_remote.ps}\hfil
\label{fig:PSE1}
\end{figure}
%
As we also plan to utilize the LLNL IBM system, we are trying to
design a communication infrastructure that is largely independent of
the number of nodes within the shared-memory partition.  This is
important since the SGI Origin 2000 clusters will contain 128 nodes
per shared-memory cluster, whereas the IBM system will contain
approximately 4 nodes per shared-memory partition.  Creating such a
flexible communication infrastructure, while at the same time
affording maximal scalability, is challenging.  Furthermore, such an
infrastructure needs to be tightly integrated within the C-SAFE problem
solving environment.


The integration process will not be complete until the techniques and
computer programs from all steps are complete.  However, in order to get a
head start on the process, we have worked with preliminary versions of the
computer programs, or in some cases other non-C-SAFE computer programs which
are similar.

The Fire Spread team has chosen to use SAMRAI as the basis on which to
build their simulations.  Steve Parker, one of the PSE staff members,
traveled to Lawrence Livermore National Laboratory in order to begin
initial dialog with the SAMRAI team about how it might be integrated with
SimSAFE.  We have identified those pieces which each team will need to work
on, and the initial results of this are expected in the near future.

\begin{figure}[bt]
\epsfxsize = 6.5in
\hfil\epsffile{fig_cfdlib.ps}\hfil
\caption{An example of the integration of CFDLIB with SCIRun.  The left
side shows the CFDLIB module connected to visualization components
from SCIRun.  The visualization on the right is an example problem as
it is being solved by CFDLIB.  The researcher can visualize the
results of CFDLIB, even as the computation is in progress.}
\label{cfdlib}
\end{figure}

The Container Dynamics team has selected the Material Point Method for the
simulation of the container.  In order to gain experience with the method,
we have integrated CFDLIB, a computational fluid dynamics code from Los
Alamos National Laboratory.  An example of this integration is shown
in Figure~\ref{cfdlib}.  CFDLIB posed many challenges to the problem
solving environment, most of them due to the fact that CFDLIB was written
in Fortran.  During the process, we identified those features of Fortran
that should be avoided in the future in order to make integration easier:

\begin{enumerate}

  \item Fortran pause and stop statements interact badly with the
        multi-threaded nature of SCIRun.  These statements should ideally
        be removed and replaced with error codes, or calls to an error
        function.

  \item Fortran files must be closed at the end of each run of the
        module, or the module will not be able to read from those files
        on the next run.  Closing files has traditionally been considered
        ``good form'' for Fortran programs, but many programs do not do so.
        Fortunately, this is an easy problem to fix by inserting the proper
        close statements.

  \item Common blocks also cause a limitation.  If a Fortran program
        uses common blocks, then only one instance of that module can be
        active at a time.  If this restriction is not followed, multiple
        programs may try to use the same common block at the same time, and
        probably cause an error.  This is due to dynamic linking of
        modules, which is used in lieu of separate processes to achieve
        high performance.  This could also create a problem if a Fortran
        common block of the same name is used by another Fortran program.

\end{enumerate}

According to discussions that we have had with other C-SAFE researchers,
the proposed uses of Fortran in the project will avoid most of these
problems.

In addition to integration, we added support for particle sets in the
problem solving environment.  These facilities were used by the
visualization team to implement visualization tools which the container
dynamics team is currently using.  These tools are described further in the
Visualization section.

\section{Visualization}

During the first year of the University of Utah ASCI Alliance Center, the
scientific visualization team has worked on three main research and
development issues:

\begin{enumerate}

\item Creation of a new parallel method for performing isosurface extraction
      for very large data sets using ray tracing.

\item Development of new grid-specific visualization tools for the
      container dynamics and fire simulation applications.

\item Research into new algorithms and tools for multi-pipe rendering for
      large-scale surface and volume visualization.

\end{enumerate}


\subsection{Gigabyte Datasets}


\subsubsection{Isosurfacing through Ray Tracing}

The most common technique for generating a given isosurface is to create an
explicit polygonal representation for the surface using a technique such as
Marching Cubes.  This surface is subsequently rendered with attached
graphics hardware accelerators such as the SGI Infinite Reality Engine.
The Marching Cubes isosurface algorithm can generate an extraordinary
number of polygons, which take time to construct and to render.  For very
large surfaces (i.e., more than several million polygons), the isosurface
extraction and rendering times limit the interactivity.  In such cases, it
is often necessary to employ polygonal decimation algorithms to reduce the
total number of polygons.  This, in turn, adds to the total isosurface
processing time.

Rather than generating geometry representing the isosurface and rendering it
with a z-buffer, we have created a parallel method that uses ray tracing.
For each pixel we trace a ray through a volume and do an analytic
isosurface intersection computation.  Although this method has a high
intrinsic computational cost, its simplicity and scalability make it ideal
for large datasets on current
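
As an illustration of what such an analytic intersection can look like,
the following sketch assumes that the scalar field inside each cell is
reconstructed by trilinear interpolation of the eight corner values
$\rho_{ijk}$, $i,j,k \in \{0,1\}$ (the interpolation scheme is an assumption
made here for illustration; it is not specified above).  In local cell
coordinates $(u,v,w) \in [0,1]^3$, with $u_0 = 1-u$, $u_1 = u$ and likewise
for $v$ and $w$,
\[
  \rho(u,v,w) = \sum_{i,j,k \in \{0,1\}} u_i \, v_j \, w_k \, \rho_{ijk} .
\]
Along a ray, each local coordinate is linear in the ray parameter $t$, so we
may write $u_i(t) = u^a_i + t\,u^b_i$ (and similarly for $v_j$ and $w_k$),
where the superscripts $a$ and $b$ are hypothetical labels for the
ray-origin and ray-direction terms expressed in cell coordinates.
Substituting into $\rho = \rho_{\rm iso}$ gives a cubic in $t$,
\[
  A t^3 + B t^2 + C t + D = 0 ,
\]
with
\begin{eqnarray*}
  A & = & \sum_{i,j,k} u^b_i v^b_j w^b_k \, \rho_{ijk} , \\
  B & = & \sum_{i,j,k} \left( u^a_i v^b_j w^b_k + u^b_i v^a_j w^b_k
          + u^b_i v^b_j w^a_k \right) \rho_{ijk} , \\
  C & = & \sum_{i,j,k} \left( u^b_i v^a_j w^a_k + u^a_i v^b_j w^a_k
          + u^a_i v^a_j w^b_k \right) \rho_{ijk} , \\
  D & = & \sum_{i,j,k} u^a_i v^a_j w^a_k \, \rho_{ijk} - \rho_{\rm iso} .
\end{eqnarray*}
Under these assumptions, the visible intersection is the smallest real root
$t$ lying in the interval over which the ray is inside the cell; if no such
root exists, the ray continues to the next cell.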

\subsubsection{Interactive rendering of 500M Cell Datasets}

As an example of the success of the {\em real-time ray tracer}, we
experimented with various parameters using the Visible Woman CT data set.
The Visible Woman data set is from the National Library of Medicine and
consists of a $512 \times 512 \times 1736$ volume of 16-bit data.  Using the
Marching Cubes isosurface algorithm, the polygon extraction time on an R10K
SGI is approximately 40 seconds, and the rendering time is approximately
another 2 seconds.  If the Marching Cubes technique were entirely scalable,
then our new technique would be approximately 200 times faster than the
optimized Marching Cubes technique.  Our technique renders a particular
isosurface at approximately 10--15 frames per second.  That is equivalent to
roughly 5 GVoxels/second (500M cells $\times$ 10 frames per second).
We have also worked with LANL to render RAGE-generated datasets (see
Figure~\ref{fig:RAGE}).
\begin{figure}
\epsfxsize = 6.5in
\hfil\epsffile{fig_rage.ps}\hfil
\caption{RAGE-generated dataset, rendered using the real-time ray
tracer.  This is a $512 \times 512 \times 512$ dataset, consisting of 90
time steps.  Using the real-time ray tracer, we can render isosurfaces and
cutting planes of the time-varying dataset at several frames per second.}
\label{fig:RAGE}
\end{figure}
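
For reference, the nominal throughput quoted above for the Visible Woman
volume can be checked with a short calculation.  This is simple arithmetic
on the stated dimensions and frame rates, taking ``throughput'' to mean
dataset size times frame rate (an interpretation, not a measurement of how
many samples the ray tracer actually touches).  The number of samples is
\[
  N = 512 \times 512 \times 1736 = 455{,}081{,}984 \approx 4.55 \times 10^{8},
\]
which at 2 bytes per 16-bit sample occupies about $9.1 \times 10^{8}$ bytes.
At the reported frame rates $f$,
\[
  N f \approx 4.6 \times 10^{9} \mbox{ voxels/s at } f = 10
  \quad \mbox{and} \quad
  N f \approx 6.8 \times 10^{9} \mbox{ voxels/s at } f = 15 .
\]
The figure of 5 GVoxels/second therefore corresponds to rounding $N$ up to
500M at 10 frames per second.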

\subsection{Grid Specific Visualization: SimSAFE based Visualization Tool}

We have worked with the container dynamics (CD) group on a SCIRun-based
visualization tool targeted at their Material Point Method
(MPM) simulations.  The tool was
given to them in the first week of June, and they are currently experimenting
with it.  This will give them first-hand experience using the SCIRun
problem solving environment to post-process MPM simulations.  MPM uses
both gridless (particle) and grid-based computational data structures; we
allow them to visualize both types of data structures, exploiting spatial
coherence by displaying both in the same space, as well as temporal
coherence provided by differing time steps.  The SimSAFE grid-specific
visualization tools include several modules (see
Figure~\ref{fig:MPM}):
\begin{figure}
\epsfxsize = 6.5in
\hfil\epsffile{fig_phil.ps}\hfil
\caption{Material Point Method Simulation}
\label{fig:MPM}
\end{figure}
\begin{itemize}

  \item (Grid) scalar field: slicing, color mapping, isosurfacing.
  \item (Grid) vector field: vector display (hedgehogs).
  \item (Particle) particle vis: renders particles, color maps scalars
  onto them, and renders as either spheres or points.

\end{itemize}

\subsection{Multi-pipe Rendering}

Polygon rendering is expensive, so exploiting the multiple graphics pipes
within the SGI Origin 2000 is attractive.  The current generation of high-end
graphics adapters, such as the SGI Infinite Reality Engine, has a quoted
maximum polygon rendering rate of approximately 10M polygons/second.
However, in practice this rate is difficult to achieve, primarily due to
geometry traversal (appropriate display list) and pixel coverage of the
polygons.  In exploratory scientific visualization, isosurfaces are
typically represented by a large number of polygons.  As the isovalue is
changed, the resulting surface, and perhaps the topology, is modified,
making display lists impractical.  As a result, when viewing isosurfaces of
large data sets, rendering becomes a bottleneck.  We have recently seen the
attachment of multiple graphics adapters on high-end SGI Origin 2000
supercomputers.  The multiple graphics pipes can be used for multiple users
or they can be combined to provide a single user the power of parallel
rendering.  Until recently, there has not been much experimentation with
this configuration for polygon rendering.

At the University of Utah, we have implemented a technique for exploiting
the parallel (up to eight) Infinite Reality graphics adapters on the
University's Origin 2000.  One can divide the image and assign each
graphics adapter a subimage for rendering.  This approach uses either
preculling of visible polygons (software based frustum culling) or uses the
hardware based visibility culling in the graphics adapters themselves.  The
problem with software based visibility culling is the dependency on spatial
hierarchies for the underlying surface.  When generating an isosurface,
such pre-processing would be time-prohibitive.  The difficulty with using
the hardware based visibility culling is that each graphics adapter needs
to process the entire set of polygons for its subimage.  Using a {\em
sort-last} rendering scheme provides a different approach.  In {\em
sort-last} rendering, each graphics adapter renders only a portion of the
polygons and the resulting partial images are combined, using depth
comparison, to produce the final result.  This is the approach we have
chosen to use since it provides better interactivity without the burden of
preprocessing the isosurface once it is created.

Our new algorithm proceeds as follows: each graphics adapter renders $n/p$
polygons, where $n$ is the total number of polygons representing an
isosurface and $p$ is the number of available graphics adapters.  When this
step is complete, the partial images need to be composited.  We use the
Binary-Swap method for image composition.  This method composites in
logarithmic time and utilizes all of the graphics adapters in parallel.
Each graphics adapter swaps $\frac{1}{2}$ of its remaining subimage and
z-buffer with its partner.  A stencil buffer is then created that marks the
pixels where the new Z value is closer to the viewpoint than the old Z
value, and pixels are rendered from the new image through the stencil
buffer, which performs the correct hidden surface elimination.  This process
is repeated until all graphics adapters have exchanged subimages, which
results in a correct final image distributed among the different graphics
adapters.  The final stage collects the subimages on a single graphics
adapter.
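
The cost of the compositing step can be sketched with a simple count.  The
symbols below are introduced only for illustration, and the sketch assumes
that $p$ is a power of two, $p = 2^m$, and that the full image contains $P$
pixels.  When two partial images $(c, z)$ and $(c', z')$ are merged, each
pixel keeps the fragment nearer the viewpoint:
\[
  (c, z) \leftarrow \left\{
  \begin{array}{ll}
    (c', z') & \mbox{if } z' < z , \\
    (c, z)   & \mbox{otherwise.}
  \end{array} \right.
\]
At stage $k = 1, \ldots, m$ each adapter is responsible for $P/2^{k-1}$
pixels, sends half of them ($P/2^{k}$ pixels of color and depth) to its
partner, and composites the half it keeps.  After $m = \log_2 p$ stages each
adapter holds a fully composited region of $P/p$ pixels, and the total
number of pixels it has sent is
\[
  \sum_{k=1}^{m} \frac{P}{2^{k}} = P \left( 1 - \frac{1}{p} \right) .
\]
For $p = 8$ this gives three stages and $\frac{7}{8} P$ pixels sent per
adapter; the final gather then moves $P/p$ pixels from each of the other
$p - 1$ adapters to the one that displays the image.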

Surprisingly, the break-even point for multi-pipe rendering occurs at around
1M polygons.  Given the large number of polygons extracted using typical
isosurface methods, parallel rendering therefore provides a clear
acceleration.
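
One way to reason about where such a crossover falls is a simple cost model.
The model and its symbols are assumptions introduced here, not measurements:
let $r$ be the sustained polygon rate of one graphics adapter and $T_c$ the
time to composite and gather the partial images, which depends on the image
size but not on $n$.  Then
\[
  T_1(n) = \frac{n}{r} , \qquad
  T_p(n) = \frac{n}{p\,r} + T_c ,
\]
and rendering on $p$ adapters is faster whenever $T_p(n) < T_1(n)$, that is,
when
\[
  n > n^{*} = \frac{p}{p-1} \, r \, T_c .
\]
Read in reverse, an observed crossover $n^{*}$ implies a compositing cost of
$T_c = n^{*} (p-1) / (p\,r)$ under this model.  Since the sustained rate $r$
is in practice well below the quoted peak, no numerical value of $T_c$ is
inferred here.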

\section{Performance Analysis} 

Existing performance monitoring tools provide relatively poor support for
tuning highly parallel programs on modern machine platforms.  In particular,
they do not provide information about critical communication issues and, in
the case of shared-memory machines such as the SGI Origin 2000, they do not
provide information about locality of memory reference.  Existing tools
also provide an extremely clumsy user interface which makes them hard to
use.  More importantly, they collect huge amounts of data but provide no
support to help the user understand the aggregate meaning of the data and
to reason about the consequences of algorithmic design decisions.
Furthermore, the problem solving environment tools and the performance
tuning tools have heretofore been separate environments.  The
result is that most large parallel codes utilize a small fraction
(often less than 10\%) of the peak performance of the large
tera-scale machine platforms that they run on.  The problem is
exacerbated by the fact that most applications programmers do not
have a detailed understanding of the architecture of the host
machine or what those details might mean in terms of structuring
codes for optimal performance.

The C-SAFE performance analysis effort is attempting to create
a performance analysis tool suite which is tightly integrated
into our SCIRun problem solving environment and which uses the
same set of visualization tools to view performance data that
are used to analyze the scientific data produced by the programs
once they are optimized.  The tool development effort is based
on a detailed understanding of the underlying architecture and its
effects on the program's performance.  The hope is that we can
construct these tools in such a way that the architecture details
can be specified as parameters to the tool and thus remove the
need for applications programmers to learn yet another complex
discipline.

Given the three machine platform choices of the ASCI program at the
present time and the fact that the University of Utah has a
64-processor IBM SP-2 and a 64-processor SGI Origin 2000, the
focus of this work will initially be on the Origin platform;
as the effort matures, we will expand the focus to include
the SP platform.

The first year goals were directed at the Origin platform and include:
\begin{enumerate}

\item Use existing SGI performance analysis tools and develop an
understanding of their strengths and defects.

\item If possible, understand SGI's plans for future performance
analysis tool development in order to avoid duplication of effort,
as well as to ensure our ability to leverage new tool offerings
from SGI.

\item Collect a representative suite of parallel program codes that we
could use to determine the utility of our tool development effort.
\end{enumerate} 

All of these goals have been accomplished, although item 2
intrinsically requires continued attention as SGI's plans change and
adapt over time.
These activities have shaped our strategy for performance analysis
tool development.


\subsection{Existing SGI Tools}

Existing SGI tools include a set of stand-alone
programs (perfex, dlook, dprof), the SpeedShop suite of programs and
Performance CoPilot (PCP).   Perfex and SpeedShop were used locally,
but the PCP tool could not be used since there were no local licenses.
Hence we set up a summer internship for a graduate student (Uros
Prestor) to join the performance analysis group of John McCalpin
at SGI, where PCP could subsequently be used and analyzed.  The result
of this initial effort was an extension to the perfex tool that was
created to provide finer-grained control of the monitoring experiments
than was previously possible.  This tool is called pdb.

In general, the SGI tools roughly fall into two camps, single program and
NUMA.  The first group includes perfex and SpeedShop.  The focus of these
tools is a single user program.  It can be multi-threaded, and there are
some provisions to evaluate the impact of multithreading.  However, the
primary focus is on single CPU performance rather than aggregate parallel
system performance.  Perfex is the prime example in this regard.  It is a
tool that is widely used within SGI.  Perfex provides a simple interface to
the R10000 performance counters.  On each R10K processor, there are two
dedicated counters which can simultaneously count any two events, chosen
from the set of 30 possible events.  These include a CPU cycle counter,
issued and graduated instruction counters, primary and secondary
instruction and data cache misses and cache coherency protocol counters
(external intervention/invalidation requests/hits and upgrade requests on
clean/shared lines).  Perfex is also capable of computing a few basic
performance indexes: IPC, MFLOPS, L1 data hit rate, cache and memory
bandwidth.  The pdb tool permits an arbitrary selection of counters and
provides a slightly better although still primitive output data collection
capability.

The principal drawback of perfex is that it performs only a post-mortem
analysis of the application as a whole.  It is not possible to focus on parts of the
execution.  There is a library interface available, but it is very crude
and not particularly useful in practice. It turns out that it is much
better to use the underlying IRIX system calls.  When evaluating
multi-threaded programs, perfex is only useful in detecting false cache
line sharing.  There is no means to evaluate the use of system resources
other than processor counters.  In its scope, SpeedShop is similar to
perfex.  It lets one evaluate the performance of a single (multithreaded)
process.  It integrates a number of tests: PC sampling (used with function
call counting to reconstruct call graphs), per-process kernel usage
statistics (memory size, page fault rate, system call rate, etc.) as well
as providing the R10K counter data since it includes the perfex
functionality.  The key drawback to both perfex and SpeedShop is that
neither tool monitors critical communication performance metrics such
as the offered load on the interconnect and the level of communication
locality that the algorithm under test achieves.

On the other hand, the NUMA utilities can be used to control and display
NUMA-related information.  Dlook can be used to display the placement of
each page of memory allocated to the application.  Dplace can control
the placement of threads and memory across the machine.  Dprof is a
memory sampling tool which can be used to construct a memory access
histogram.  The downside is that it incurs a huge overhead, primarily
due to the way it is implemented.  These tools give the user control over
the thread placement policies but they do not provide any information about
how NUMA resources are used (interconnect utilization, remote accesses to
the local memory, etc.).

In order to explore the underlying IRIX interface to the R10K performance
counters, the pdb tool was written.  Pdb can be thought of as an extension of perfex:
it will run a program for you and arrange to collect R10K performance
counter data.  When the program terminates, it will print various
performance metrics.  Just like perfex it will compute CPI and MFLOPS.  It
will also compute instruction mix, instruction and data cache hit and reuse
rates, cache and memory bandwidth and branch misprediction rate.
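
The derived metrics named above follow from raw counter values in the usual
way.  The definitions below are the conventional ones and are given as a
sketch: the exact event combinations used by perfex and pdb are not stated
in this report, and the symbols are introduced here for illustration.  Let
$N_{\rm cyc}$ be the cycle count, $N_{\rm grad}$ the graduated instructions,
$N_{\rm fp}$ the graduated floating point instructions, $T$ the elapsed time
in seconds, $N_{\rm ld}$ and $N_{\rm st}$ the graduated loads and stores,
$M_{\rm D1}$ the primary data cache misses, and $N_{\rm br}$ and
$M_{\rm br}$ the decoded and mispredicted branches.  Then
\begin{eqnarray*}
  \mbox{CPI} & = & \frac{N_{\rm cyc}}{N_{\rm grad}} ,
  \qquad \mbox{IPC} = \frac{1}{\mbox{CPI}} , \\
  \mbox{MFLOPS} & = & \frac{N_{\rm fp}}{10^{6} \, T} , \\
  \mbox{L1 data hit rate} & = &
     1 - \frac{M_{\rm D1}}{N_{\rm ld} + N_{\rm st}} , \\
  \mbox{branch misprediction rate} & = & \frac{M_{\rm br}}{N_{\rm br}} .
\end{eqnarray*}
Because each R10K processor counts only two events at a time, evaluating
several of these ratios for the same run requires either multiplexing the
counters or repeating the run.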

\subsection{SGI Development Effort} 

The purpose of the summer internship was to gain access to SGI's
tool development plans as well as to system documentation which
is not publicly available at the University level.  The more
concrete effort was to
develop tools to access hardware performance counters in Origin systems
other than just the R10K's CPU counters.  The Origin family architecture consists of
two-CPU nodes, linked together with a CrayLink interconnect.  On each
node, the central chip (the Hub) maintains a set of six counters which
can count six sets of events.  On each router chip, there is also a
64-bit performance register assigned to each of the six router links.
Also, there is a set of Hub crossbar depth registers and an Xbow (I/O
module) counter but they proved to be of little use.  With the exception
of router link utilization registers, none of this data was previously
available to applications programmers.

The first task was to develop a kernel driver for the Hub performance
counters.  It turned out there already was a set of system calls which
could be used to extract this data, but it was severely limited in
functionality and full of bugs.  The system code had to be rewritten and
expanded to provide a user level program which would control and print the
counters.  This tool, mdperf, is a perfex-style command line interface for
the Hub performance counters.  It is used to control the counters on each
node and print counter values.  It is also possible to profile an
application, i.e., enable the counters before the application starts and
print the counters when it exits.

We also learned about internal tools that SGI does not provide as
part of their current IRIX release.  One such tool was called evrate which
uses the R10K performance counters.  It sets up one counter as a trigger to
generate signals to the process;  when a signal is received, the trigger
is reloaded and the other R10K register is stored in a file, together
with the PC of the interrupted instruction.  The data obtained with
evrate can be used to plot timing diagrams.  For example, setting a
trigger register to count cycles and capture to count cache misses, one
can reconstruct a cache miss time-line and therefore quickly focus on
the key area of the application which results in lost cycles due to
cache misses.

This idea was expanded and the dperf tool was written.  In essence, dperf is
like evrate except that, in addition to the R10K counters, on each trigger
interrupt it also stores the Hub performance counters and the nearest link
utilization register.  In order to facilitate high frequency (1
ms and below) polling of Hub and router utilization registers, the
IRIX operating system interface needed to be expanded.

There are several current problems with dperf.  First, it requires a
modified IRIX kernel that is not publicly available.  The modifications
to IRIX were proposed for a subsequent release but they
have not yet been approved.
Second, the raw data is simply captured in a set of files.
The problem of appropriately visualizing the data has yet to
be solved.  The PCP visualization tools were tried but they proved
inadequate.  Third,
the sampling overhead of dperf was measured and it is relatively
high (15\% for 1~ms sampling, 30\% for 500~$\mu$s and 60+\% for
250~$\mu$s).  A lot
of this overhead can be attributed to the long code path and several
context switches.  This can be improved significantly and this work
will be part of the year 2 effort.
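
The measured overheads are consistent with a roughly constant cost per
sample.  As a sketch, assume (this is an interpretation of the figures, not
a stated definition) that the overhead $o$ is the fraction of each sampling
interval $\Delta t$ spent in the sampling path, so that
$o \approx c / \Delta t$ for a per-sample cost $c$.  The three measurements
then give
\[
  c \approx 0.15 \times 1000~\mu\mbox{s} = 150~\mu\mbox{s} , \quad
  c \approx 0.30 \times 500~\mu\mbox{s} = 150~\mu\mbox{s} , \quad
  c \approx 0.60 \times 250~\mu\mbox{s} = 150~\mu\mbox{s} ,
\]
that is, a cost on the order of $150~\mu$s per sample.  Under this model the
overhead falls in direct proportion to any reduction in $c$, so shortening
the code path and removing context switches should pay off at every sampling
rate.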

The attempt to incorporate dperf inside PCP provided a detailed
opportunity to evaluate PCP.  It turns out that PCP is very useful for
overall system performance analysis.  It lets one measure node activity,
kernel statistics and even router link utilization.  However, the overhead
of PCP and its sampling granularity render it useless when one tries to use
PCP to evaluate application performance.

\subsection{Code Suite Acquisition and Analysis} 

Our evaluation focused primarily on the following issues: 1) what
information could be obtained using the hardware counters in the R10000
CPU, 2) what information was available using profiling tools that are
supplied by SGI on the Origin 2000 machine (in the SpeedShop tool suite),
and 3) the best way to obtain useful data.

Item \#1 suggested using perfex, a command line interface to the R10000
counters, to produce numbers reflecting events at the CPU level. We are
interested in numbers for cache miss rates, TLB miss rates, branch
mispredicts, and instructions graduated versus instructions issued.  Though
it is possible to multiplex events using the hardware counters, we found
that the numbers produced were not as useful as we would have liked.  One
approach to mitigating this drawback is the pdb tool previously mentioned.
The other approach was to look only at a single event per run of the
program.  This approach involved writing scripts that ran perfex for each
event of interest so that multiplexing-related skewing would not occur.
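
The skew introduced by multiplexing can be made explicit with a short
sketch.  It assumes the usual time-multiplexing scheme, in which each event
is counted during only part of the run and then extrapolated; the details of
how perfex schedules and scales events are not given in this report, and the
symbols are introduced for illustration.  If event $e$ is active for a total
time $T_e$ out of a run of length $T$ and $n_e$ occurrences are observed
while it is active, the projected count is
\[
  \hat{N}_e = n_e \, \frac{T}{T_e} .
\]
This estimate is accurate only if the event rate is the same inside and
outside the sampled windows.  For a program with distinct phases (for
example, an initialization phase followed by a solver loop), the windows may
over- or under-represent a phase, and $\hat{N}_e$ is biased accordingly.
Counting a single event per run gives $T_e = T$ and hence exact counts, at
the price of one run per event: $E$ runs for $E$ events of interest, or
$\lceil E/2 \rceil$ runs if the two hardware counters are assigned one event
each per run.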

Item \#2 involved the use of the SpeedShop tool {\em ssrun} to produce output
files which were then analyzed using the {\em prof} utility.  This allowed us
to find exactly where in a program an event of interest occurred.  For
instance, perfex was able to show that a certain program exhibited what we
consider to be high cache miss rate, but not the places in the program
where these misses are occurring -- a necessity if a programmer is to have
any hope of addressing the high miss rate.  Through the use of {\em ssrun} and
{\em prof} we are able to pinpoint where cache misses occur.  This helps guide
a programmer to where a problem exists in the code.

Item \#3 focused on the best way to set up a data-gathering environment for
the information that we are interested in obtaining.  We decided that
multiplexing the hardware counters to gather statistics for many events
simultaneously was not a good approach.  To mitigate this using the
SpeedShop tools we wrote scripts that allowed us to look at single events,
albeit at the expense of having to run the program once for each event that
was of interest.  This proved to be time-consuming in terms of runtime,
but gave much better results.  We used SpeedShop's and perfex's ability to
utilize environment variables to specify events of interest in these
scripts.  We were then able to gather data not only about the cache miss
rate in a program as a whole, for instance, but also to gather explicit
line numbers in files where the majority of misses were occurring.

In addition to using hardware counters and profiling to gather information
about how a program performs we also used various optimization flags to the
compiler to better understand what optimization at the compiler level could
or could not do for improving program performance.  The programs were
compiled using differing compiler optimization levels and then performance
analysis was again performed using the profiling scripts that we
developed.  

All programs were run and profiled without compiler optimization and then
some of interest were compiled, run and profiled again with higher levels of
compiler optimization.  The choice was to run with the highest optimization
level (-O3) when optimization was desired.  It is interesting that
-O2 optimization often outperformed -O3.  We are still investigating
the cause of this phenomenon.

These programs were not run on high numbers of processors or with
extremely large data sets for three reasons: 1) it has been difficult to
obtain access to high numbers of processors without waiting an inordinate
amount of time for resources to become available, 2) these codes have been run
primarily to investigate the methods of gathering data about parallel
codes, and 3) quick turn-around time has been a priority, thus dictating
smaller datasets and requests for smaller numbers of processors.

We are now at the point where scaling the programs and datasets has become
very important.  More parallelism in our experiments and the use of larger
data sizes is our next focus.  Also, programs that enjoy wider use, as
supplied by our colleagues at the University or by
others associated with the ASCI Alliance, are desirable and we are pursuing
their acquisition.
Using the SGI-supplied tools in the infrastructure that we have begun to
develop, together with the tools that we are developing, on these larger
problem sets is the natural direction that our investigations are taking.

The programs that we have analyzed and brief descriptions are:
\begin{itemize}
  \item  benchsolve (U of U: C and C++ conjugate gradient solver)
  \item electro (U of U: Fortran77 electron transfer code)
  \item aicmd (U of U: 3D molecular dynamics)
  \item streams (SGI: C code to stress the memory system)
  \item hydro (LANL: Fortran90 2-D Lagrangian Hydrodynamics code)
  \item heat77 (LANL: Fortran77 3-D HEAT Diffusion Solver)
  \item heat90 (LANL: Fortran90 3-D HEAT Diffusion Solver)
  \item sweep3d (LANL: Fortran77 3-D geometry neutron transport)
\end{itemize}

\section{Simulation Management / Scientific Data Management}

This task deals with overall management of C-SAFE simulation software,
simulation runs, and associated datasets (configuration, input, output, and
interpretations).  During this first year of funding, this task has resolved
into two concrete subtasks. 
The first is design and specification of a system for defining, launching,
monitoring and controlling simulation experiments of the intricacy and scale
we envision.  The second is developing local capabilities and
resources for modern
scientific data management -- particularly as practiced by
researchers at the DOE national labs.

\subsection{Simulation Management}

One realization reached early in the project was that the C-SAFE simulation
system and supporting scientific and
engineering knowledge will be highly complex and
diverse.  Simply defining what constitutes a single complete and
meaningful simulation
run across all the steps, computing resources, and scientific and engineering
models is a daunting challenge.  In order to do this, there must be a
widely available, easy to use tool for specifying, exercising, and
interpreting simulation experiments comprising the full sequence of C-SAFE
steps.

Thus far our efforts toward such a tool have been in three areas:
(a) clarifying what is required, desired and feasible in such a tool;
(b) becoming familiar with state-of-the-art
web-based data management technologies; and
(c) surveying existing systems that
address similar or related needs.

\begin{description}
\item [(A) Design:]  We are designing a
web-based (Java applet) GUI providing a query-based
interface into a repository of information about simulation experiment
designs, results, plans and interpretations.  The paradigm will be a workflow
model.  It will necessarily involve remote, even detached (from the internet),
long-running computational processes.  At first it will be very simple, just
recording what experiments have been run, along with user supplied
annotations, and access paths to programs and data.  As our experience and
needs grow, facilities will be added, including querying capabilities and
perhaps very rudimentary/pragmatic structured knowledge about the purposes and
interpretations of simulation results.  A major unresolved issue is how this
facility will relate to (integrate or interface with) SCIRun.  At present, 
their goals are seen as being complementary.  SCIRun emphasizes computational
steering, incremental algorithms and immediate visualization, whereas this
tool would emphasize distribution -- concurrent browsing at a distance,
repository data management, and (simple) semantic information about our
simulation assets.  

\item [(B) Technologies:] We have
explored Java / database technologies in considerable depth.
We have experimented with three products: the Java Persistent Storage
Engine from Object Design Inc., the Symantec Visual Cafe Java
DB Development Environment, and the Oracle8 extended relational
database management system.  The Java PSE stores Java object graphs,
while the Symantec system interfaces to traditional DBs, e.g., relational,
supporting standard interfaces such as ODBC.  Oracle8 is the flagship product
from Oracle, a leading RDBMS vendor.
Our preliminary decision is to use Visual Cafe for applet development, and
Oracle8 for the data repository.  This architecture will require server-side
database support (servlets), due to the Java applet security restrictions.
If the data representation requirements evolve to include storage of complex
object graphs, we may decide to replace (or augment) the Oracle8 repository
with the Java PSE, perhaps interfacing to ODI's full ODBMS, ObjectStore.

\item [(C) Existing systems:] Our survey
is still underway here, but so far the
most interesting candidate is SimTracker, from LLNL -- for both technical and
programmatic reasons.  
SimTracker has many of the characteristics described under (A).
It appears to be particularly good at displaying snapshots of the progress of
long-running computations.
%However, it has been
%difficult to obtain details about
%SimTracker, and thus far impossible to obtain its code, due to LLNL 
%security restrictions.
%We believe, however, that the SimTracker developers would like to
%produce a usable by collaborators such as the C-SAFE project.
We are pursuing access to SimTracker, and we believe that the
SimTracker developers would like to produce a version usable by
ASCI Alliance partners.
\end{description}
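
As an illustration of the initial, very simple facility described under (A),
the following Python sketch records experiments together with user-supplied
annotations and access paths to programs and data, and supports a rudimentary
keyword query.  It is only a sketch under stated assumptions: the class names
({\tt ExperimentRecord}, {\tt ExperimentLog}), the field names, the example
paths and the use of Python are hypothetical, chosen here for brevity, and are
not part of the planned Java applet and Oracle8 design.

\begin{verbatim}
```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentRecord:
    """One simulation run as a simple log entry (hypothetical schema)."""
    name: str
    program_path: str       # access path to the program that was run
    data_paths: List[str]   # access paths to input and output datasets
    annotations: List[str] = field(default_factory=list)  # user notes


class ExperimentLog:
    """In-memory stand-in for the repository (assumption: a real
    repository would be backed by a database, not a dictionary)."""

    def __init__(self) -> None:
        self._records: Dict[str, ExperimentRecord] = {}

    def record(self, rec: ExperimentRecord) -> None:
        """Store a run, keyed by its name."""
        self._records[rec.name] = rec

    def annotate(self, name: str, note: str) -> None:
        """Attach a user-supplied annotation to an existing run."""
        self._records[name].annotations.append(note)

    def find(self, keyword: str) -> List[ExperimentRecord]:
        """Return runs whose name or annotations contain the keyword
        (case-insensitive substring match)."""
        key = keyword.lower()
        return [r for r in self._records.values()
                if key in r.name.lower()
                or any(key in a.lower() for a in r.annotations)]


if __name__ == "__main__":
    log = ExperimentLog()
    log.record(ExperimentRecord("heat-run-1",
                                "/hypothetical/bin/heat77",
                                ["/hypothetical/data/input.dat"]))
    log.annotate("heat-run-1", "Baseline diffusion test")
    print([r.name for r in log.find("baseline")])
```
\end{verbatim}

Keeping each record this small reflects the stated intent to begin with a
plain log of runs and to add querying and structured knowledge only as
experience grows.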

\subsection{Scientific Data Management}

Gary Lindstrom has begun participating in the activities of the ASCI TriLab
Data Models and Formats group.
This group is quite active, and appears to have both dynamic leadership
and active support by client groups within the labs.
The group is striving to develop both a comprehensive
underlying mathematics for scientific
data description, based on the topological notion of ``fiber
bundles,'' and a proof-of-concept tool set, including a standardized API.
Gary Lindstrom
attended a working meeting in Livermore in October 1997, and got a good
glimpse of the quality of people involved, their motivations and their working
relationships (all of which seem excellent).  At least one large
project (Sierra, at Sandia) has committed to
use this data model, and is pushing hard for practicality and implementation
timeliness.

At this point we are uncertain whether these developments will have any near
term impact on C-SAFE, but certainly any data management facilities that we
implement should at least anticipate support of the ASCI SDM group common data
format, when our code is delivered to the labs and other users.
An important next step is to determine which specific data formats
are most important to the C-SAFE project, and lobby (if necessary) for their
support under ASCI DMF.

\section{Software Engineering}

The SAFE system is built from the SCIRun system developed by Chris
Johnson, Steve Parker and the Scientific Computing and Imaging group
at the University of Utah.  The alternatives were to start a new
problem solving environment from scratch, or to build on what already
existed in the SCIRun system.  It was decided that there were not
enough software development resources in the C-SAFE project to develop a
new system.

SimSAFE employs a blend of object-oriented (C++), imperative (C and
Fortran), scripted (Tcl) and visual (the SAFE Dataflow interface)
languages to build this interactive environment.  The basic SAFE
system provides an optimized dataflow programming environment, a
sophisticated data model library, resource management and development
features.  SAFE modules implement components for computational,
modeling and visualization tasks.
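
The dataflow idea can be illustrated with a minimal Python sketch in which
modules are connected into a network and each module executes once all of its
upstream modules have produced results.  This is an illustration only and is
not the SCIRun or SimSAFE interface: the {\tt Network} class, its method
names and the three toy modules are hypothetical.

\begin{verbatim}
```python
from collections import deque
from typing import Any, Callable, Dict, List, Sequence


class Network:
    """Toy dataflow network: named modules wired by their inputs."""

    def __init__(self) -> None:
        self.funcs: Dict[str, Callable[..., Any]] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add_module(self, name: str, func: Callable[..., Any],
                   inputs: Sequence[str] = ()) -> None:
        """Register a module and the modules whose outputs it consumes."""
        self.funcs[name] = func
        self.inputs[name] = list(inputs)

    def execute(self) -> Dict[str, Any]:
        """Run every module once, in dependency order (Kahn's algorithm)."""
        pending = {n: len(ins) for n, ins in self.inputs.items()}
        downstream: Dict[str, List[str]] = {n: [] for n in self.funcs}
        for n, ins in self.inputs.items():
            for src in ins:
                downstream[src].append(n)
        ready = deque(n for n, count in pending.items() if count == 0)
        results: Dict[str, Any] = {}
        while ready:
            n = ready.popleft()
            args = [results[src] for src in self.inputs[n]]
            results[n] = self.funcs[n](*args)
            for d in downstream[n]:
                pending[d] -= 1
                if pending[d] == 0:
                    ready.append(d)
        if len(results) != len(self.funcs):
            raise ValueError("network contains a cycle")
        return results


if __name__ == "__main__":
    net = Network()
    net.add_module("mesh", lambda: [0.0, 0.5, 1.0])
    net.add_module("solve", lambda xs: [x * x for x in xs], ["mesh"])
    net.add_module("render", lambda ys: max(ys), ["solve"])
    print(net.execute()["render"])
```
\end{verbatim}

The sketch omits the optimization, resource management and incremental
re-execution that the text attributes to the actual system.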
  
We have assigned a Software Engineer (SWE) to each of the steps (Fire
Spread, Container Dynamics, and High Energy Materials) as well as to
the PSE task.  Each SWE has two major duties: (1) oversee software
development within the step or task, and (2) help migrate step modules
into the common SimSAFE PSE.  There is a Software Development Advisory
Committee to help manage software development.  This committee reports
to the Computer Science task leader (Tom Henderson).

\end{document}
