Information/Instructions/Sandia/RedStorm

From C-SAFE Wiki

Before You Begin

WARNING: RedStorm no longer exists and can no longer be used!

  • Your account on RedStorm will not be activated until you complete the training course found here.

RedStorm @ SNL

  • Compiling on reddish can be faster than compiling on the RedStorm head nodes, because reddish has four processors while RedStorm has two (or perhaps effectively one, as the OS is in 'single-thread' mode).
  • However, reddish is often heavily loaded; in those cases compiling on RedStorm may make more sense.
  • NOTE: reddish and redstorm do NOT share a file system.

Transferring (pftp2) executable to RedStorm

If you compile on reddish, you will then need to copy the resulting sus executable to redstorm for execution, using pftp2 rather than plain ftp.

How To Get Help

Send email to:

  • Reddish: reddish-help@sandia
  • RedStorm: redstorm-help@sandia

Misc. Issues

  • Reddish
    • Your home directory has only a 1 GB quota, so check out and build in /projects/uintah/users.
      • Make your own user directory in there. You must be in the wg-uintah group to do this.
    • Note: this space is not backed up.

Logging In

  • ssh <user_name>@srngate.sandia.gov

When prompted, choose "2" (ssh), then give the machine name: redstorm (or reddish if redstorm isn't available).

  • Redstorm - The real machine's login nodes.
  • Reddish - A linux cluster with a cross-compiler.

Unix Group

  • Ask Dav to have you added to the Uintah Unix group (wg-uintah).

scp vs scp

Sandia has two versions of ssh/scp, depending on whether you want to go between Sandia ... so make sure you use the correct one.

  • Within Sandia use:
    • scp -> /usr/local/bin/scp
    • ssh -> /usr/local/bin/ssh
  • Outside of Sandia use:
    • scp -> /usr/bin/scp
    • ssh -> /usr/bin/ssh

To minimize confusion, add the following to your .cshrc file:

alias scp       'echo "Use rscp or lscp"'
alias ssh       'echo "use rssh or lssh"'
alias lscp      '/usr/local/bin/scp \!*'
alias lssh      '/usr/local/bin/ssh \!*'
alias rscp      '/usr/bin/scp \!*'
alias rssh      '/usr/bin/ssh \!*'

Or, perhaps even better, place symbolic links in your ~/bin directory:

cd ~/bin
ln -s /usr/local/bin/scp lscp
ln -s /usr/local/bin/ssh lssh
ln -s /usr/bin/scp rscp
ln -s /usr/bin/ssh rssh
ln -s /usr/bin/ssh-agent rssh-agent
ln -s /usr/bin/ssh-add rssh-add

Then make sure ~/bin is at the beginning of your path (in .cshrc, for example: set path = (~/bin $path)).

Getting the Code (SVN)

Note: svn is on the default path on reddish. On RedStorm, it is in /usr/local/unsupported/bin.

Add the following to your ~/.subversion/servers file:

[groups]
code = code.sci.utah.edu

[code]
http-proxy-host = wwwproxy.sandia.gov
http-proxy-port = 80

Then check out the code:

svn co https://code.sci.utah.edu/svn/SCIRun/trunk/src SCIRun/src
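
The servers file above works as a lookup table: the [groups] section maps a group name to a host pattern, and the section with that group's name holds the proxy settings for matching hosts. The Python sketch below illustrates that selection logic only. It is a simplified model and not Subversion's own code; the function name proxy_for is hypothetical, and the file contents are embedded as a string for the example.

```python
import configparser
import fnmatch

# Contents of ~/.subversion/servers as given on this page.
SERVERS_TEXT = """\
[groups]
code = code.sci.utah.edu

[code]
http-proxy-host = wwwproxy.sandia.gov
http-proxy-port = 80
"""


def proxy_for(host, servers_text=SERVERS_TEXT):
    """Return (proxy_host, proxy_port) for a repository host, or None.

    Simplified model: each [groups] value is treated as one glob pattern.
    """
    parser = configparser.ConfigParser()
    parser.read_string(servers_text)
    for group, pattern in parser.items("groups"):
        if fnmatch.fnmatch(host, pattern.strip()):
            section = parser[group]
            return section.get("http-proxy-host"), section.getint("http-proxy-port")
    return None  # no group matched, so no proxy would be applied


print(proxy_for("code.sci.utah.edu"))
print(proxy_for("example.org"))
```

With the settings above, only connections to code.sci.utah.edu are routed through the Sandia proxy; any other host falls through with no proxy.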

Where to put the code?

NOTE: Check out (and build) the code in /projects/uintah/users/<your user id>. You will need to create this directory yourself; to do so, you must be in the wg-uintah group.

Compiler Information

You must use the PGI (cross) compilers. These are the default on RedStorm, but on Reddish they must be enabled with this command:

> module load PrgEnv-pgi-xc

PGI Version 6.0:

  • CC
  • f77

MPI

  • Note, the MPI headers/libraries are 'built into' the compiler, so we don't need to explicitly list them on the compile/link line.

Configure

You don't run 'configure' on RedStorm. Instead, copy the redstorm64opt (or redstorm64dbg) directory from the /projects/uintah/users/common/ directory.

Optimized

  • cp -r /projects/uintah/users/common/redstorm64opt .../SCIRun/redstorm64opt

Debug

  • cp -r /projects/uintah/users/common/redstorm64dbg .../SCIRun/redstorm64dbg

Make

Then just type make -j4 sus. (SCIRun does not build on RedStorm.)

Warnings

You may see some warnings when 'sus' is linked, such as:

  • warning: system is not implemented and will always fail
    • ...UintahThirdparty/hypre-2.0.0-install/lib/libHYPRE.a...: In function `time_getWallclockSeconds':
    • warning: times is not implemented and will always fail...
  • You will also see:
    • /opt/xt-pe/1.5.59a/bin/snos64/CC: INFO: catamount target is being used
      • This just informs the user that a cross compiler is being used.

You can safely ignore these warnings, as these functions are not (or at least should not be) called when 'sus' is run on RedStorm.

  • The following used to occur; though irritating, it could be ignored, as it did not cause any problems:
cc -Minline -O3 -fastsse -fast   -Minform=severe -DREDSTORM  -Llib -lgmalloc  sus.o prereqs Packages\
/Uintah/StandAlone/sus   -o sus
/opt/xt-pe/1.5.59a/bin/snos64/cc: INFO: catamount target is being used
File with unknown suffix passed to linker: prereqs
File with unknown suffix passed to linker: Packages/Uintah/StandAlone/sus
/usr/bin/ld: sus.o: No such file: No such file or directory

File Systems

Jobs producing large amounts of data should be run from the /scratch* file systems. To get a directory on these file systems use mkdir from the /scratch* directory:

> cd /scratch*
> mkdir username

In addition, you will need to set the striping on your directory, or you may get crashes when producing large amounts of output:

> set_dir_stripe username 64

Running SUS

You cannot just type 'sus' on the redstorm command line: that would try to run 'sus' on the head node (a Linux box), but 'sus' is compiled for the micro-kernel of the back-end compute nodes. If you do try to run it, you will get this error:

Segmentation fault

To run 'sus', please use YOD or Batch Jobs (see below).

Stack Trace

The stack trace returned on Redstorm does not include the function names, just function addresses. The function names corresponding to these addresses can be found using the "StackTrace" program in SCIRun/src/Packages/Uintah/tools/StackTrace.

Here is an example of using this tool:

1) Run sus... a stack trace is generated:

rslogin01:623:redstorm64opt/Packages/Uintah/StandAlone> yod -sz 4 sus -mpi -mpm inputs/MPM/disks.ups 
Parallel: 4 processors (using MPI)
Parallel: MPI Level Required: 0, provided: 1
Date:    Wed Apr 16 13:58:56 2008
Machine: rslogin01
Simulation Component:   mpm
Load Balancer:          SimpleLoadBalancer
Scheduler:              MPIScheduler
Patch layout:           (2,2,1)

[ed: ...]

done taskgraph compile (0.003291 seconds)
Created 3248 total particles
Compiling taskgraph...
^G^G^GThread "main"(pid 3) caught signal SIGSEGV at address (nil) (segmentation violation)
RedStorm Stack Trace:

[ed: The following is the stack trace that you will need to use:]

[0x20fc6f]
[0x21290b]
[0x2164d8]
[0x1a8a104]
[(nil)]
[0x1fae1ea]
[0x1b35b52]
[0x43760b]
[0x44bb1b]
[0x4473be]
[0x42638c]
[0x42486f]
[0x209e7d]
[0x201a01]
[0x200027]

Abort signalled by pid: 3
Occured for thread: "main"

2) Compile StackTrace and generate needed files:

# Go to the src side of the tree:

> cd .../SCIRun/src/Packages/Uintah/tools/StackTrace/

# Build the executable:

> make

# Create a list of sorted symbols from the sus executable:

 > nm ../../../../../redstorm64opt/Packages/Uintah/StandAlone/sus | grep -v "      U " | grep -v "       w " | pgdecode | sort > ! sus.symbols.sorted

# Copy the stack trace into a file (eg: sus.stacktrace).

[0x20fc6f]
[0x21290b]
[0x2164d8]
[0x1a8a104]
[(nil)]
[0x1fae1ea]
[0x1b35b52]
[0x43760b]
[0x44bb1b]
[0x4473be]
[0x42638c]
[0x42486f]
[0x209e7d]
[0x201a01]
[0x200027]

# Run the StackTrace program:

rslogin01:575:Packages/Uintah/tools/StackTrace> ./StackTrace sus.symbols.sorted sus.stacktrace 
Number of symbols read: 10000
[Ed: ...]
Number of symbols read: 380000

Total Number of Symbols: 389213

stack trace (raw):

[0x20fc6f]
[0x21290b]
[0x2164d8]
[0x1a8a104]
[(nil)]
[0x1fae1ea]
[0x1b35b52]
[0x43760b]
[0x44bb1b]
[0x4473be]
[0x42638c]
[0x42486f]
[0x209e7d]
[0x201a01]
[0x200027]

stack trace (with names):

20fc6f -- SCIRun::getStackTrace(void
21290b -- SCIRun::Thread::niceAbort(void
2164d8 -- SCIRun::handle_abort_signals(int,
1a8a104 -- _sig_handler
Warning: Did not find valid address (0x...) in: '[(nil)]'
1fae1ea -- void
1b35b52 -- Uintah::Task::doit(const
43760b -- Uintah::DetailedTask::doit(const
44bb1b -- Uintah::MPIScheduler::runTask(Uintah::DetailedTask
4473be -- Uintah::MPIScheduler::execute(int,
42638c -- Uintah::AMRSimulationController::executeTimestep(double,
42486f -- Uintah::AMRSimulationController::run()
209e7d -- pgCC_compiled.
201a01 -- cstart
200027 -- _start

!!!WARNING!!! - Every time you recompile sus, you must regenerate the sorted symbol list, since symbol addresses can change between builds!
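
The name lookup in the example above amounts to finding, for each raw address, the symbol with the largest start address that does not exceed it. The Python sketch below illustrates that idea. It is not the source of the StackTrace program; it assumes nm-style input lines of the form "<hex address> <type> <name>", and the sample symbol addresses are invented for illustration.

```python
import bisect
import re

# Hypothetical nm-style lines ("<hex address> <type> <name>"); the addresses
# are invented for illustration and are not taken from a real sus binary.
SAMPLE_SYMBOLS = [
    "0000000000200000 T _start",
    "0000000000201900 T cstart",
    "000000000020fc00 T SCIRun::getStackTrace(void *)",
]


def load_symbols(lines):
    """Return (addresses, names), both ordered by ascending address."""
    entries = []
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # not an "address type name" line
        try:
            entries.append((int(parts[0], 16), parts[2].strip()))
        except ValueError:
            continue  # first field was not a hex address
    entries.sort()
    return [a for a, _ in entries], [n for _, n in entries]


def resolve(addresses, names, frame):
    """Map one '[0x...]' frame to the nearest symbol at or below it."""
    match = re.search(r"0x([0-9a-fA-F]+)", frame)
    if match is None:
        return None  # e.g. '[(nil)]' carries no usable address
    addr = int(match.group(1), 16)
    idx = bisect.bisect_right(addresses, addr) - 1
    return names[idx] if idx >= 0 else None


addresses, names = load_symbols(SAMPLE_SYMBOLS)
for frame in ["[0x20fc6f]", "[(nil)]", "[0x201a01]", "[0x200027]"]:
    print(frame, "--", resolve(addresses, names, frame))
```

Because the lookup depends only on symbol start addresses, a symbol list generated from an older build of sus can silently map frames to the wrong functions.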

Misc

  • Redstorm optimization flags: OPT_FLAGS = -Minline -O3
  • To turn off warnings: CXXFLAGS => -Minform=severe

Demangler

  • Use the standard system nm to get object information from .o files.
  • The name demangler is pgdecode. (Do not use c++filt)

Batch Jobs (on RedStorm)

  • You must set this environment variable:
.cshrc: setenv RS_ACCOUNT UUTA/0001
.bash:  export RS_ACCOUNT="UUTA/0001"
  • Useful Commands:
  1. qsub <batch script>  : submit a job
  2. qdel <job ID>  : delete a job from the queue
  3. xtshowmesh  : list information about the number of nodes available
  4. qstat -a  : show queue status
  5. The script /projects/uintah/users/common/jobs.sh is useful for monitoring your jobs.

____________________________________________________________________

  • Sample PBS script for a 512-processor job on RedStorm (size=256 nodes; -VN runs two tasks per node):
#!/bin/csh
#PBS -N S4
#PBS -eo
#PBS -l walltime=00:10:00
#PBS -l size=256
cd /scratch1/tharman/Study04/
set OUT="out.S4.000"
yod -small_pages -VN ./sus -mpi HTContainer_arches_mpmice.ups  >& $OUT

Moab Job Scheduler at Sandia

Here is a helpful document on using the Moab job scheduler at Sandia: Moab Scheduler.

Batch Queue Limits

The RedStorm help desk says to run the "mdiag" command (see below) to get information on queue limits.

DEFAULT.WCLIMIT is the default value that your job will request if you don't specify a wall time. It's set to 1 hour. MAX.WCLIMIT is the longest that a job can request, and still run in the queue. It's currently 3 days. If you ask for more than that, Moab will quietly put a hold on your job until such time as the limit is raised enough to run your job.

rslogin03:602:~> mdiag -c standard -c
Class/Queue Status

ClassID        Priority Flags        QDef              QOSList* PartitionList        Target Limits

standard             20 ---          ---                   ---  ---                   0.00     ---
  CAPACITY=1888  DEFAULT.FEATURES=[DUAL][dual]  STATE=active  DEFAULT.WCLIMIT=1:00:00  MAX.WCLIMIT=3:00:00:00
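
Since Moab quietly holds a job that asks for more than MAX.WCLIMIT, it can help to compare the requested wall time against the limit before submitting. The Python sketch below is one way to do that; the function names are hypothetical, and it assumes limits are written in the [DD:]HH:MM:SS form shown in the mdiag output above.

```python
def wclimit_to_seconds(text):
    """Convert '[DD:]HH:MM:SS' style text (e.g. '1:00:00') to seconds."""
    fields = [int(f) for f in text.split(":")]
    if not 1 <= len(fields) <= 4:
        raise ValueError("expected between one and four ':'-separated fields")
    # Fields are right-aligned: seconds, minutes, hours, days.
    multipliers = [1, 60, 3600, 86400]
    return sum(v * m for v, m in zip(reversed(fields), multipliers))


def fits_queue(requested, max_wclimit="3:00:00:00"):
    """True if the requested wall time does not exceed the queue maximum."""
    return wclimit_to_seconds(requested) <= wclimit_to_seconds(max_wclimit)


print(fits_queue("00:10:00"))  # the walltime used in the sample PBS script
print(fits_queue("96:00:00"))  # four days, beyond the 3-day maximum
```

The default for max_wclimit is the value reported on this page; rerun mdiag to get the current limit.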

Yod

  • By default yod uses 2 MB pages, which do not work well with Uintah. The -small_pages option makes yod use 4 KB pages, which gives roughly a 2x speedup.
  • Performance testing has shown that using -VN to run 512 tasks on 256 nodes is only around 1% slower than running 512 tasks on 512 nodes. Thus, whenever possible, you should run with the -VN flag.
  • To run a job (from redstorm):
    • yod -small_pages -VN -sz 4 sus -mpi <other args>
    • NOTE: you need to use "-mpi" to tell sus it is in MPI mode.
    • When running inside of a pbs script the -sz parameter can be omitted and yod will use all assigned nodes.
    • To pass an env variable through to the program, use "-setenv VAR=value" (eg: -setenv SCI_DEBUG=TaskDBG:+).
  • Debugging
    • totalviewcli yod -small_pages -np -a -np 4 a.out
      • dgo
        • yes
      • dfocus 2
      • dlist

Blas/LaPack Libs

On RedStorm, blas/lapack routines are found in libacml.a. This library is located in /opt/acml/3.6.1/pgi64/lib. (Or $ACML_DIR).

RedStorm Machine Info

RedStorm is partitioned into two sides, referred to as red and black. Red is the classified side and black is the unclassified side. Each side is made up of some combination of three sections of the machine. Section A has 3360 nodes, each with 2 GB of memory; section B has 6240 nodes with either 2 GB or 3 GB of memory; and section C has 3360 nodes with 2 GB, 3 GB, or 4 GB of memory. Typically section A is assigned to black and section C to red, while section B is moved between red and black. The current configuration of RedStorm can be viewed here: [1]
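
As a rough illustration of what those section assignments mean for the unclassified (black) side, the Python sketch below adds up node counts for a given set of sections. The node counts come from the description above; the helper name side_nodes is hypothetical.

```python
# Node counts per section, as described on this page.
SECTION_NODES = {"A": 3360, "B": 6240, "C": 3360}


def side_nodes(sections):
    """Total nodes on a side made up of the given sections."""
    return sum(SECTION_NODES[name] for name in sections)


print(side_nodes(["A"]))       # black in its typical minimum configuration
print(side_nodes(["A", "B"]))  # black when section B is also assigned to it
```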

