LAM Linux cluster configure line:
../src/configure --enable-package=Uintah
'--enable-optimize=-march=pentium4 -msse -msse2 -O3'
--enable-assertion-level=0 --with-mpi=/usr/local/lam-mpi
--with-petsc=/usr/sci/projects/Uintah/Thirdparty/1.0.0/Linux/gcc-3.2-lam-32bit/petsc-2.1.1
--with-hypre=/usr/sci/projects/Uintah/Thirdparty/1.0.0/Linux/gcc-3.2-32bit/hypre-1.7.7b
Question 3: ( Jan, 2003 -- M. Hartner )
How long can jobs run on the linux cluster? How many processors?
Answer:
Please see:
http://www.csafe.utah.edu/Information/Instructions/inferno.html#cpu_usage
Question 4: ( (Date Not Specified) -- (Author Not Specified) )
When I submit a job on the cluster (from /tmp/banerjee in inf004 in
this case) using qsub, I get the following message
qsub: Bad UID for job execution
What am I doing wrong?
Answer:
You must run 'qsub' from inf001.
Question 5: ( June 2003 -- Mark/Dav )
When I run a pbs batch job, my output files are not group/world readable.
Answer:
The umask is hard-coded as 077 in the PBS src.
I think they hard-coded it because jobs are not run through your shell,
but are started directly by PBS, so they don't have a umask from your dot
files.
I can recompile with a different umask, but then every file created from a
batch job would be world-readable.
Something like this should work, near the end of your batch file:
mpirun -O -np $NUM_PROCS /bin/tcsh -c "sus <args to sus>"
This will apply whatever umask setting you have in your .cshrc file.
** OR **
After the mpi call in your batch script, you could:
% cd top_of_data_dir
% chmod -R go+rX *
'Course, if the script doesn't finish, then this wouldn't work...
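To cover the case where the job fails partway, the chmod can be hung on a shell exit trap. This is only a sketch: it assumes the batch script is run by a Bourne-style shell (sh or bash) rather than tcsh, open_results is a made-up helper name, and a job that PBS kills outright (SIGKILL) will still skip the trap.

```sh
#!/bin/sh
# Sketch only: open up permissions on a results directory.
# open_results is a hypothetical helper, not part of PBS or Uintah.

open_results() {
    # go+rX: group/world get read access; X adds execute only on
    # directories (and on files that already have an execute bit set).
    chmod -R go+rX "$1"
}

# In a batch script one might then write:
#   trap 'open_results top_of_data_dir' EXIT
#   mpirun -O -np $NUM_PROCS sus <args to sus>
```

The trap fires when the script exits for any ordinary reason, including a failed mpirun, so the chmod no longer depends on the job finishing cleanly.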
Question 6: ( June 2003 -- Mark Hartner )
How do I see which nodes on inferno are down?
Answer:
xpbsmon is a good way to see the status of the cluster.
pbsnodes -l will show which nodes are down.
Question 7: ( June 2003 -- Mark Hartner )
I get errors such as:
> > Unable to copy file 6991.inf001.OU to inf003.sci.utah.edu:/home/sci/likai/SCIRun/linux32opt/Packages/Uintah/StandAlone/mpm-8-1/batch.job.o6991
> > >>> error from copy
> > inf003.sci.utah.edu: Connection refused
> > Unable to copy file 6991.inf001.OU to inf003.sci.utah.edu:/home/sci/likai/SCIRun/linux32opt/Packages/Uintah/StandAlone/mpm-8-1/batch.job.o6991
> > >>> error from copy
> > inf003.sci.utah.edu: Connection refused
> > yboard-interactive).
> > lost connection
> > >>> end error output
What does this mean?
Answer:
Your home directory might be group writeable. The batch system uses ssh
to copy files around, and it refuses to authenticate a user with a group
writeable home directory. That is why you are getting errors.
If you need a place to share files, I would suggest you make a
subdirectory within your home directory and set the permissions
appropriately.
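A quick way to check for this from the shell. Sketch only, assuming a Bourne-style shell; is_group_writable is a hypothetical helper name, not a system command.

```sh
#!/bin/sh
# Sketch: report whether a directory is group-writable.
# is_group_writable is a hypothetical helper name.

is_group_writable() {
    # -prune keeps find from descending into the directory;
    # -perm -020 matches only when the group-write bit is set.
    [ -n "$(find "$1" -prune -perm -020)" ]
}

# Typical use:
#   if is_group_writable "$HOME"; then chmod g-w "$HOME"; fi
```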
Question 8: ( June 2003 -- Steve )
Why are we not using sse and sse2 flags on debug builds on the cluster?
Answer:
They really are optimization options - they make it use special
instructions in the pentium 4 to make the code faster. It doesn't make
a big difference, even in optimized mode. If you want to make the debug
code faster, use --enable-debug="-O -g". G++ can mix debug and
optimization.
Question 9: ( Jul, '03 -- J. Davison de St. Germain )
How do I get system status on inferno (the linux cluster)?
Answer:
Use the script 'usage' in /usr/sci/projects/Uintah/scripts/inferno.
Or you can directly use the commands "qstat -a" or "pbsnodes -l".
Question 10: ( Aug 2003 -- J. Davison de St. Germain )
For how much time can I run jobs on inferno?
Answer:
Please see the usage policy at
../Instructions/inferno.html. Jobs that do not follow
this policy may be deleted without notice.
Question 11: ( November 03 -- Bryan )
What do I do when I have weird problems on the cluster?
Answer:
Send mail to cluster-users@sci.utah.edu. Send your job number and job
output files in the email.
Question 12: ( November 03 -- Bryan )
What does it mean when I see an error like this running MPI on the cluster:
> It seems that some error has occurred during MPI_INIT. This will
> cause your process to abort. These kinds of errors are usually
> system-related, such as running out of disk space, running out of
> memory, or something more serious such as data not being passed
> between processes properly. That is, you should not be seeing this
> error message; if you are, somethings is likely Very Wrong with your
> system. :-(
>
> Perhaps this Unix error message will help:
>
> Unix errno: 1252
> Unknown error 1252
Answer:
We have seen this before. If it happens, send your job id and job output files
to cluster-users@sci.utah.edu. In our experience the cause has been
semaphores and other shared-memory items not being cleaned up.
Question 13: ( Feb, 2004 -- Bryan, Stas, Mark )
I get the following warning when I submit a job on the cluster:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Is anything wrong with the .pbs file?
Answer:
No, you can safely ignore this message. It just means that there is
no interactive control in your job.
Question 14: ( December 2, 2004 -- Randy Jones )
How can I pass environment variables to the cluster nodes with mpirun on Inferno?
Answer:
Optional flags to mpirun:
-x varname[=value][,varname[=value],...]
This passes the environment variable varname to all the nodes used by
mpirun. If you specify the =value portion, the variable is set to that
value; otherwise the value from the current environment is used. For example:
mpirun -x SCI_SIGNALMODE=exit,MALLOC_STATS=malloc_stats
Question 15: ( May 05 -- Bryan/Dav )
How do I run 2+ serial jobs on 1 node on inferno so I can utilize all the CPUs?
Answer:
Place the following as the command section of your batch script (instead of
the mpirun ... line):
program 1 &
program 2
wait
If you need to move output after one of the jobs finished, you can try:
program 1 ; cp $SCRATCHDIR1/*.dat $WORKDIR1 &
program 2 ; cp $SCRATCHDIR2/*.dat $WORKDIR2
wait
where you can set SCRATCHDIR to the directory the job runs in and WORKDIR to the current directory.
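If your batch script is run by a Bourne-style shell (sh or bash), note that there `&` applies only to the last command of a `;`-separated list, so each program and its copy step need to be grouped explicitly. A minimal sketch under that assumption; run_pair is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: run two serial commands at the same time on one node and
# wait for both. run_pair is a hypothetical helper.

run_pair() {
    # Each ( ... ) subshell goes to the background as a unit, so any
    # follow-up step inside it runs right after its own program ends.
    ( eval "$1" ) &
    pid1=$!
    ( eval "$2" ) &
    pid2=$!
    # Collect both exit codes; fail if either command failed.
    status=0
    wait "$pid1" || status=1
    wait "$pid2" || status=1
    return "$status"
}
```

For example, run_pair 'program1; cp $SCRATCHDIR1/*.dat $WORKDIR1' 'program2; cp $SCRATCHDIR2/*.dat $WORKDIR2' starts both programs together and returns once both copies are done.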
Question 16: ( May 05 -- Todd )
Why does my scirun build (that I built from my main sus tree) crash?
Answer:
We have found that building scirun with '-O3 -msse -msse2 -march=pentium4' can cause
scirun to crash. We recommend configuring scirun with --enable-optimize
instead of --enable-optimize=<flags>.
Question 17: ( 6/17/05 -- Jim Guilkey )
I would like to put through a test case that requires 75 nodes for
approximately 3 minutes later on today. This will tie up the queue
until it runs but will make the nodes available again after 3 minutes.
Does anybody have a problem with this?
Answer:
Actually, on inferno, it won't tie up the queue. Inferno isn't
FIFO. Rather, if you put in a 75 node job, it'll sit there until
75 nodes are free, which might be a while, since if anyone puts in
a job behind yours, even if there are 74 nodes free, they'll go
first.
CATEGORY: Coding
Question 1: ( (Date Not Specified) -- (Author Not Specified) )
In looking at the code to help Jim track down a memory leak, I
did a cursory search for new in both *.h and *.cc files within
Uintah/. There are many instances where new is used instead of scinew.
Is there any reason we should prefer new over scinew? If not, then I
will go in and change the new to scinew.
Answer:
You cannot use scinew in some instances. When making an array of user-defined
objects (not any of the built-in types), you cannot use scinew unfortunately.
Do not use scinew in Array3, or in any templated function where you
are allocating an array of the templated type (e.g. new T[5]). This
is due to a "bug" in the C++ specification and various vendors'
interpretation of the spec.
Otherwise, scinew should be preferred...
It is fine to allocate a single object, just not an array.
The one in Array3.h where it allocates the data is the dangerous one.
Question 2: ( (Date Not Specified) -- S. Parker )
Why the single makefile?
Answer:
That is only one of the reasons that we went towards the single makefile
approach. The global make clean is just an artifact of the single
makefile. Adding a local make clean would be hard, unless it just
did:
find . -name "*.o" -print | xargs rm
which would work but might not always do what you want either.
Personally, I never use make clean. I just do rm *.o;gmake or
the above find statement. This leaves the .ii files which makes for
a faster link.
Anybody that wants to implement a local make clean, here is the
idea of how to do it: use gmake's pattern match rule to look for
all of the $(CLEANOBJS) at or below the subdir.
clean:
rm $(filter $(DIR)/%, $(CLEANOBJS))
where getting DIR is left as an exercise to the reader...
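If you only want to clear the objects under one subdirectory, the find statement above can be pointed at that directory. A small sketch, assuming a Bourne-style shell; clean_objects is a hypothetical name and, like the find statement above, it leaves the .ii files alone.

```sh
#!/bin/sh
# Sketch: remove object files at or below one directory.
# clean_objects is a hypothetical helper, not part of the build system.

clean_objects() {
    # -exec ... {} + batches file names the way xargs does, copes with
    # spaces in names, and runs nothing when no .o files are found.
    find "${1:-.}" -name '*.o' -type f -exec rm -f {} +
}

# Typical use, from the top of the build tree:
#   clean_objects Packages/Uintah
```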
Question 3: ( (Date Not Specified) -- Dav )
How do I use debug streams? (Environment variable?)
Answer:
setenv SCI_DEBUG TaskGraph:+[FileNameToStoreInfoIn][,VarName:+[file]]
Question 4: ( (Date Not Specified) -- Dav )
How do I use TAU?
Answer:
in configVars.mk:
TAU_MAKEFILE := /res/sci/data1/TAU/tau/sgin32/lib/Makefile.tau-sgitimers-sproc
ifneq ($(TAU_MAKEFILE),)
include $(TAU_MAKEFILE)
endif
....
type "make cleantau;make"
On Nirvana:
/usr/projects/Uintah/tau/sgi64/lib/Makefile.tau-profile-trace
% tau_merge *.trc sus.trc
% tau_convert -pv sus.trc tau.edf sus.pv
% vampir sus.pv
For subsets of the trace files (when you have a lot of trace files):
> tau_merge sus1.trc sus11.trc sus21.trc sus.trc
> tau_convert -nocomm -pv sus.trc tau.edf sus.pv
> vampir sus.pv
Question 5: ( (Date Not Specified) -- (Author Not Specified) )
How do you compile on LLNL? I've been just doing it on 1
processor. Do you submit a job and use more processors? How
do I compile at LLNL?
Answer:
I've been just running "gmake -j4" on blue. Or -j8 on frost. I have
not been submitting a job. If you are compiling interactively, I
suggest using frost as it is much faster. However, if the login node
on frost is being hammered, you can try submitting an "xterm" job and
then compiling on the node you get. Usually it doesn't take too long
to get a single node this way.
I think something like this should work:
> echo "xterm -display taurus.sci.utah.edu:0" | psub
Make sure you have xhost + set on your local machine.
Question 6: ( (Date Not Specified) -- (Author Not Specified) )
Monitor/top machine usage monitoring tool on Livermore IBM SP?
Answer:
One other little trick... if you want to see the "top" output on LLNL,
you need to use the monitor program. I have mine aliased:
frost001:22:~> which top
top: aliased to "monitor -top"
Question 7: ( Sep 2002 -- T. Harman )
How long does it take to compile SCIRun/sus?
Answer:
Here are some recompile times for both blue and rapture with optimized
builds. I used 4 processors on both machines and I touched the same
file. You might mention this at your next crt conference call.
Blue
Real 793.49
User 280.32
System 173.03
Rapture
real 188.791
user 196.366
sys 26.554
Frost
Real 216.57
User 117.10
System 68.41
This web page has some historic compile time results:
http://www.csafe.utah.edu/Information/Instructions/CompileTimes.html
Question 8: ( (Date Not Specified) -- (Author Not Specified) )
How do I do performance analysis?
Answer:
Here is something I found out: do NOT compile your program with -pg,
but DO use -g. Then do this:
setenv LD_PROFILE libPackages_Uintah_CCA_Components_MPM.so
./sus ...
sprof ../../../lib/libPackages_Uintah_CCA_Components_MPM.so
Unfortunately, it will only give you profiles for a single .so, which is
very annoying, but it is a first step.
Steve
> Steve was right regarding gprof: I made some stupid little so and linked
> it with a small program, and gprof doesn't seem to cross so's. I tested
> this about a million times with different function usage, and different
> linking styles (linking against a static library, or just linking all the
> files together). Everything seems to work except the shared libs.
Question 9: ( (Date Not Specified) -- S. Parker )
How do I get the debugger to come up automatically (under
linux)?
Answer:
Here is the magic environment variable to get gdb in a new window
whenever sus crashes on linux.
setenv SCI_DBXCOMMAND "gnome-terminal -x gdb sus %d"
or
setenv SCI_DBXCOMMAND "xterm -e gdb sus %d"
This only works if you run sus from Packages/Uintah/StandAlone;
otherwise you will need to add a path to sus (gdb
/whatever/Packages/Uintah/StandAlone/sus).
Question 10: ( Feb '03 -- Guilkey, Parker )
On my SGI with a mountain fresh build I'm picking up
ld64: ERROR 28 : GP-relative sections overflow by 0x35d1 bytes. Please recompile with a smaller -G value.
You can see gprel section layout with -m -aoutkeep
See the explanation in the gp_overflow(5) manpage.
ld64: INFO 152: Output file removed because of error.
--- lib/libCore_Datatypes.so ---
*** Error code 2 (ignored)
C++ prelinker: warning: could not locate library -lCore_Datatypes; assuming /usr/lib/libCore_Datatypes.a
C++ prelinker: warning: nm returned a nonzero error status
ld64: FATAL 9 : I/O error (-lCore_Datatypes): No such file or directory
gmake: *** [lib/libCore_Algorithms_Geometry.so] Error 2
Here's my configure line
../src/configure --enable-64bit --enable-package=Uintah --with-thirdparty=/usr/installed/Thirdparty/1.7/IRIX64/MIPSpro-7.3.1.1m-64bit --enable-optimize=-Ofast
Should I just turn optimize down to O2 or is there a magic systune
knob to turn?
Answer:
Jim writes: Magic knob, you need to set -G0, e.g.
Steve writes: Turn optimization down to O2 and compile Core/Datatypes,
then you can turn it back up. It is not a systune variable, it is a
problem with Core/Datatypes getting too big - we haven't seen it in a
while. (Not sure if this is pertinent based on Jim's response...)
Question 11: ( Feb '03 -- Dav )
How can I track down memory problems in sus/SCIRun?
Answer:
Make sure your build is configured with --enable-sci-malloc.
If you set the environment variable MALLOC_STRICT (under tcsh: setenv
MALLOC_STRICT) then the memory management system will fill "memory"
with "bogus" data that can help track down memory errors. NOTE: if
you set MALLOC_STRICT and suddenly your program starts dying, it is
very likely that there is an uninitialized variable in your code that
(luckily) defaulted to 0 and thus worked. However, the default to 0
is a coincidence and should not be relied upon.
Question 12: ( Feb '03 -- Dav )
What is LD_LIBRARY_PATH used for?
Answer:
The environment variable LD_LIBRARY_PATH tells the runtime linker
where to look for dynamic libraries that need to be loaded by your
program. If your LD_LIBRARY_PATH variable points to libraries that
were created by a different compiler than your application, you can
experience strange behavior. Usually LD_LIBRARY_PATH should not be
set (as sus/scirun build in library path information when they are
linked), however you can use this variable to dynamically use
different libraries if you know what you are doing.
Question 13: ( Feb '03 -- Parker )
Just an answer to a question I asked on Monday. I'm now running a 32
node job on inf. I think the problem I was having on Monday may have
been related to iterating outside the bounds of my arrays. It's not
clear why this didn't kill smaller jobs as well, but that's the only
thing that has been fixed that I know of.
Answer:
With an optimized build, this is not surprising. Iterating just outside
of a small array is still "close" in memory. However, iterating just
outside of a large array can be very far away, causing a crash. Trying
it on a debug build you should have gotten an assertion failure
independent of the size of the array.
The lesson: if weird things happen, try a debug build...
Question 14: ( Mar '03 -- Parker )
I did a cvs update -Pd from src, and then a gmake, and I get this error:
> gmake: *** No rule to make target `../src/Core/Util/sci_system.c',
> needed by `Core/Util/sci_system.o'. Stop.
How do I fix this?
Answer:
(See also the "repair.sh" entry in this FAQ.)
This is a typical error when you do an update after files have been
removed from SCIRun. The easiest fix is:
touch ../src/Core/Util/sci_system.c
gmake
rm ../src/Core/Util/sci_system.c
The problem occurs because the make system is trying to determine the
"age" of the dependency file in order to determine if the (.cc/.c)
file in question should be rebuilt (into a new .o). This also occurs
frequently if a .h file is removed from the tree. Other times this
occurs include when you build on one architecture (or specific
machine) and then try to build on a different architecture (and
sometimes machine.)
Question 15: ( April 2004 -- Bryan Worthen )
How do I make emacs insert spaces instead of tabs?
Answer:
To insert spaces instead of tabs, add this to your .emacs:
(setq-default indent-tabs-mode nil)
Question 16: ( Aug 30, 2004 -- J. Davison de St. Germain )
I'm getting an error message like the following when compiling (after
I did a 'cvs update'):
No rule to make target `../src/Dataflow/Modules/Render/SCIBaWGL.h',
needed by `Dataflow/Modules/Render/OpenGL.o'. Stop.
(Note, the 'no rule' target can be anything and the 'needed by' can
also be anything.)
Answer:
...you can use the "repair.sh" script located in .../SCIRun/src/scripts/ to
fix this.
eg:
> cd SCIRun/<bin>
> ../src/scripts/repair.sh SCIBaWGL.h
The repair script will search all the .d files (or depend.mk files on
the SGI) for the bad include file (in this case, SCIBaWGL.h) and
remove the corresponding .d and .o files. Then you can just type make
and it will rebuild what is necessary.
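For the curious, the cleanup described above can be sketched in a few lines of shell. This is not the real repair.sh, just an illustration of the idea: it assumes a Bourne-style shell and that every dependency file foo.d sits next to its object foo.o, and remove_stale_deps is a hypothetical name.

```sh
#!/bin/sh
# Sketch of the cleanup described above; this is NOT the real repair.sh.
# Assumption: every dependency file foo.d sits next to its object foo.o.

remove_stale_deps() {
    header=$1      # e.g. SCIBaWGL.h
    builddir=$2    # top of the build tree to search
    # List the .d files that mention the header, then delete each one
    # together with the matching .o so that make rebuilds them.
    find "$builddir" -name '*.d' -exec grep -l -F -- "$header" {} + |
    while IFS= read -r depfile; do
        rm -f "$depfile" "${depfile%.d}.o"
    done
}
```

Run from the build directory it would look like remove_stale_deps SCIBaWGL.h . followed by make. The depend.mk files used on the SGI are not handled by this sketch.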
CATEGORY: Configure
Question 1: ( Jan '03 -- Dav )
Why do I get (and how do I fix) this error during configure:
./config.status --recheck running /bin/sh ../src/configure --with-thirdparty=/export/space/scratch/SCIRun1.8.0/1.8/Linux/gcc-3.2-32bit '--enable-package=BioPSE MatlabInterface' --enable-debug --no-create --no-recursion
checking for gcc...
gcc
checking for C compiler default output...
a.out
a.out
conftest.c
checking whether the C compiler works...
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.
or (void*) problem.
Answer:
This error usually occurs when you are using a different compiler (or
compiler version) than the Thirdparty was compiled with. Another
possibility is if you have your LD_LIBRARY_PATH variable set with
stuff that does not work with the default compiler.
Also, if configure was created using the wrong version of autoconf,
this might happen.
Question 2: ( A long time ago... -- (Author Not Specified) )
Make complains that fspec.pl is not executable.
Answer:
Configure is supposed to "chmod +x" this file. It appears
not to do so the first time. Manually do the chmod if you run
into this problem. (chmod +x Packages/Uintah/tools/fspec.pl)
CATEGORY: Documentation
Question 1: ( Feb '03 -- Dav )
Where is web documentation on the Q machine (qscfe1 @ LANL)?
Answer:
Local docs:
http://www.csafe.utah.edu/Information/Instructions/qsc.html
LANL docs (you will need a Z# and a pass code):
https://icnn1.lanl.gov/ldswg/icnn/content/qsc/help
Question 2: ( Oct 2003 -- Bryan Worthen )
How do I use doxygen?
Answer:
See doxygen.html
CATEGORY: Fortran
Question 1: ( (Date Not Specified) -- S. Parker )
What to do with variable names that are too long. How do I use
this PASS3 thing that you mention?
Answer:
To pass an array into fortran, we must also pass the lower and
upper bounds. On the SGI, we do this with two integer arrays
(low and high) with 3 elements (for the x,y,z bounds). However,
GNU fortran does not allow this type of array:
double precision A(low(1):high(1), low(2):high(2), low(3):high(3))
It does however allow this:
double precision A(low_x:high_x, low_y:high_y, low_z:high_z)
So the fortran interface generates 2 different versions: the first
form on SGI because it is more efficient, and the second form on linux
because it works. For the most part, the fortran code doesn't see
this. However, if you are passing an array into a subroutine, you
need to do:
call sub(A, A_low, A_high)
on the SGI and:
call sub(A, A_low_x, A_low_y, A_low_z, A_high_x, A_high_y, A_high_z)
on linux/g77. To make this easier, I made the PASS3 macro, which is
short for passing a 3 dimensional array.
call sub(PASS3(A))
which will do the right thing in both cases. The only problem is with
very long array names. When this gets expanded:
call sub(PASS3(long_name))
to:
call sub(long_name, long_name_low_x, long_name_low_y, long_name_low_z, ...
then it will easily overflow the 72 character limit for fortran code.
Thus the need for the PASS3A/PASS3B macros:
call sub(PASS3A(long_long_name)
& PASS3B(long_long_name),
which will just split the name expansion onto two different lines of less
than 72 characters.
Question 2: ( August 2003 -- Bryan Worthen )
Can I use Fortran 90 compilers or does sus only support Fortran 77?
Answer:
We have currently made little investigation into compiling with Fortran 90.
We intend to look into this a little more in the future, but not for the
moment. However, if you know what you're doing, you may try to use Fortran
90.
Question 3: ( August 2003 -- Bryan Worthen )
Do I need to use gen-fspec with my fortran code? If so, how do I set it
up?
Answer:
Click for the answer
CATEGORY: Graphics
Question 1: ( September 3, 2003 -- Kurt/Biswajit )
How can I make mpeg movies from the raw frames ?
Answer:
You'll need a few pieces of software:
pnmflip (can be found on rapture and used by raw2ppm.csh below)
raw2ppm (on rapture also, and also used by raw2ppm.csh)
mpeg_encode (grab a copy for an SGI from ~kuzimmer/bin)
You'll need a parameter file for mpeg_encode:
look at ~kuzimmer/tools/pnm.param
And you'll need a simple cshell script:
look at ~kuzimmer/tools/raw2ppm.csh
Copy the raw2ppm.csh file into the directory where all of your
*.[moviename].raw files are.
Edit the dimensions in the script to match the frame size of your raw frames.
Then type raw2ppm.csh at the command line.
It will begin converting all of your raw files to ppm files.
mpeg_encode likes ppm files. While you are converting files you will want to
copy the pnm.param file to this same directory.
You will also want to edit the pnm.param file. Edit the OUTPUT line (line 4
in my pnm.param file) to set the file name for the movie. Then edit the INPUT
section (lines 16-18 in my file) to match the names of your *.[moviename].ppm
files, and set the beginning and end numbers. For example, if you want to make a
movie of the files 021.mymovie.ppm to 653.mymovie.ppm, the INPUT section of the
parameter file would read:
INPUT
*.mymovie.ppm [021-653]
END_INPUT
Once you have all of your .ppm files and you've edited your parameter file,
just type:
mpeg_encode pnm.param
If you have multiple directories of raw files, the easiest thing to do is
change the numbering of the files, then merge them together into one
directory, then perform the above steps.
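Renumbering by hand gets tedious with several directories. Here is a rough sketch of one way to do it, assuming a Bourne-style shell and frames named NNN.[moviename].raw as above; merge_frames and the 4-digit numbering are illustrative choices, not part of raw2ppm or mpeg_encode.

```sh
#!/bin/sh
# Sketch: copy frames from several directories into one, renumbering
# them in sequence. merge_frames and the 4-digit width are assumptions.

merge_frames() {
    movie=$1; outdir=$2; shift 2
    n=0
    mkdir -p "$outdir"
    for dir in "$@"; do
        # The glob expands in sorted order, so zero-padded frame numbers
        # keep their original sequence within each directory.
        for frame in "$dir"/*."$movie".raw; do
            [ -e "$frame" ] || continue    # directory had no frames
            n=$((n + 1))
            cp "$frame" "$outdir/$(printf '%04d' "$n").$movie.raw"
        done
    done
}
```

For example, merge_frames mymovie all run1 run2 copies the frames of run1 and then run2 into all/ as 0001.mymovie.raw, 0002.mymovie.raw, and so on. The range in the INPUT section of pnm.param then has to use the new numbers.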
Question 2: ( September 3, 2003 -- Jim/Biswajit )
How can I configure and run the Real Time Ray Tracer (rtrt) to make
movies ?
Answer:
Step 1: Go to one of the SGI parallel machines (rapture, muse etc.).
Step 2: Configure and build.
../src/configure '--enable-package=Uintah Teem rtrt' --enable-optimize
--enable-64bit --with-glut=/usr/sci/local --with-glui=/usr/sci/local
--with-teem=/usr/sci/projects/SCIRun/Thirdparty/teem/IRIX64/MIPSpro-7.3.1.3m-64bit
gmake -j2
Step 3: Set the display variable to your machine.
setenv DISPLAY [yourmachine].utah.edu:0.
Step 4: Run rtrt.
rtrt -np 16 -no_shadows -bv 0 -scene scenes/uintahparticle2 -rate 1.0
/local/csafe/raid1/[uda_file] -timesteplow 55 -timestephigh 55
-timestepinc 1 -radius 0.0008
Basic Instructions for RTRT :
Left click - sets min crop value
Middle click - color by this value
Right click - sets max crop value
Use these to twiddle with the color map range
Control+Left click - Set min value for color map
Control+Middle click - Reset color map range
Control+Right click - Set max value for color map
Use these to narrow in on a region of the histogram
Shift+Left click - Set min for histogram viewing
Shift+Middle click - Reset histogram viewing to original
Shift+Right click - Set max for histogram viewing
The only way to control the animation rate is from the command line (yech!).
You can specify the animation rate with
-rate [number of frames to display in one second -- default 3].
This can be a float. If you want to display each frame for 2 seconds,
use -rate 0.5.
As far as the movie making thing went, I used MovieMaker and then converted
the file to a mpeg. You have the raw movie file. Try the QuickTime format
too, and see how they compare with quality/size.
I try to only create movies that are less than 2 minutes. If you want, we
can get the media crew to piece together some sequences. For a presentation,
movies longer than 30 seconds to a minute get really boring.
Question 3: ( 01/26/06 -- Todd )
How do I make a montage of jpg images?
Answer:
Suppose you have 9 jpgs that you want resized to 640x480 and placed in a single image
montage -geometry "640x480" -tile 3x3 1.jpg 2.jpg 3.jpg 4.jpg 5.jpg 6.jpg 7.jpg 8.jpg 9.jpg montage.jpg
CATEGORY: LANL
Question 1: ( 1/03 -- Dav )
How do I log into the LANL machines (Theta,Q)?
Answer:
Use your crypto card to get your login password. Then "ssh
portal.lanl.gov". From portal, you can ssh to theta or qscfe1.
When going to qscfe1 from portal, you must use "ssh -1 qscfe1".
Question 2: ( 10/03 -- Randy )
Why do my submitted jobs not start on Q?
Answer:
If you are seeing this message:
prun: Error: insufficient cpus in allocated resource use -O to override
Then you might want to check that you typed "bsub < batch.job" instead
of typing "bsub batch.job" which will not work.
If your job dies on startup because of a "Caught: unknown exception",
then just try re-submitting your job. This is a known problem that
we are still chasing. It seems to only happen on large (128 procs or
greater) runs.
Question 3: ( Oct 2003 -- Bryan Worthen )
How can I send data faster from the labs (pscp)?
Answer:
See pscp.html
Question 4: ( Oct 2003 -- Bryan Worthen )
How do I use long-term storage at LANL
Answer:
1) Make sure you are registered to use this service.
Go to https://register.lanl.gov
Click main menu on the left side.
Under Authentication Accounts, click on "High Performance Computing"
If you don't see "Open HPSS Storage" under the list of granted accounts,
click on "Request New Account" on the left side.
Check the Open HPSS Storage box, and click Submit.
It could take a while to get your account, so try to do this before
you need it.
2) From Q (or somewhere else on lanl), type psi. You will be inside your
HPSS filesystem, where the usual unix file commands work on the HPSS
side. If you prepend a bang (!) to a command, it runs on the local
filesystem instead.
The command 'store' will copy a file/directory to HPSS, and the command
'get' will copy it to the current local directory.
Question 5: ( July, 2004 -- Dav )
How do I log into Q (or LANL) now that portal is gone?
OR
How do I use VPN with LANL Q?
Answer:
I just installed the Windows VPN client that lanl provides. You can
get it here (you will need your z# and password):
http://protected.lanl.gov/nst/VPNinstructions.html
There are also downloads and instructions for Linux/Solaris and Mac.
I followed their simple instructions and it went very smoothly. I
was able to connect using VPN and then "ssh dav@qscfe1.lanl.gov"
without a problem.
Once on qscfe1, I was able to ssh and scp back to muse/raid1. It
seems like this should be a viable, if not extremely convenient,
method of doing work at LANL. You won't be able to do this on any
machine that requires a local network to be maintained (ie, any
machine that mounts a necessary network drive.)
CATEGORY: LLNL
Question 1: ( April 2, 2003 -- Randy Jones )
Where is Hypre at LLNL?
Answer:
Please refer to: Building sus on Frost:  Step 5
Randy Jones: The following is no longer needed (I believe):
Ok, at LLNL this is where everything is:
HYPRE_DIR := -L/usr/apps/hypre/beta/lib
HYPRE_INC := -I/usr/apps/hypre/beta/include
HYPRE_LIB := -lHYPRE_LSI -lHYPRE_blas -lHYPRE_struct_ls
-lHYPRE_struct_mv
and on rapture, it is in my home directory and you want the 1.7.7b
version. If you are getting errors in the mli_* files, do this:
mv FEI_mv FEI_mv.hide
./configure
make
Apparently they do their development on Linux and didn't run into this
problem. They said it's fixed now but the version hasn't been released
yet.
Question 2: ( Thu, 12 Sep 2002 -- Wing )
What are some hints on running at LLNL?
Answer:
I had a meeting with Barbara while I was at LLNL. She gave me some
hints on using blue and frost.
Here are some of the questions that I asked:
pdebug vs pbatch ?
It's not always faster to submit your jobs in the debug pool. If the
debug pool is being used a lot (like on frost), submitting to the batch pool
with a short time (like 30 mins) will get your job to run earlier. A
good command to check is "spj".
Leaving a processor free on each node?
From her experience, leaving a processor free on each node doesn't
help much on blue (there are only 4 processors per node) but helps a lot
at frost. She said they can rearrange the configuration and give us the
debug node also next time for our big run.
Optimal big case during normal run?
Using the lowest maximum allowed can usually get the cases to run pretty
fast. Like on frost, 24 nodes is the max during the day and lots of
people run cases that size. If you ask for 32, then it might not run
for several days. And on blue is 128 nodes but of course there is the 2
hr. during the day factor. But I guess we will just have to do the
dependent condition. Basically by doing this, you have a better chance
to get your job to run since it can be done either during the day or at
night.
Another helpful pstat command that I'm using is:
pstat -A -o jid,name,user,status,maxtime,used,maxnodes,xct,prio
JID    NAME           USER      STATUS     MAXCPUTIME  USED  MAXNODES  XCT  PRIORITY
11867  nb_pen_nw.run  deveritt  *MULTIPLE  50:00       0:00  0         0    0.000
This will give you info about other cases, like how long they asked for,
how much time they have used, and their priority.
Question 3: ( April 2, 2003 -- Randy Jones )
Where are the thirdparty libs located on Blue/Frost?
Answer:
Please refer to: Building sus on Frost:  Step 5
Question 4: ( April 2, 2003 -- Randy Jones )
What configure line do you use on Frost (the IBM SP at LLNL)?
Answer:
Please refer to: Building sus on Frost:  Step 5
Question 5: ( Jan '03 -- Dav )
How do I use long term storage at LLNL?
Answer:
I have looked into the question of long term storage at LLNL. Turns
out that it is as simple as ftp'ing whatever you want to
storage.llnl.gov.
> ftp storage.llnl.gov
With ftp you can "mkdir", "cd", "put", and "get", etc. I have not
tried it, but it is supposed to be very easy.
Question 6: ( Feb '03 -- Wing )
How do I check our machine queue usage/time on frost/blue at LLNL?
Answer:
Here is the command. Change the dates as needed.
pcsusage -bm -b utah -u all -tb oct 01 2002 -te dec 31 2002
Question 7: ( Apr '03 -- Dav d. )
How do I request dedicated time on frost or blue (LLNL)?
Answer:
First, coordinate the request with Dav and the Homebrew team.
You will need to IPA first at this web site:
https://access.llnl.gov/ipa/login
Then go to this web site:
https://lc.llnl.gov/computing/forms/expedited_runs.html
It will ask you for your LLNL user id and password.
Question 8: ( July, 2003 -- James/Dav )
On Frost, why doesn't it let me allocate more than 256MB of memory?
Answer:
By default, AIX executables can use only 256MB. This is
determined by a bit in the header of the executable; it
is not a property of the code itself.
You can change this setting at link time by adding the link
option '-bmaxdata:0x80000000' to your link line. No recompilation
is otherwise necessary. The leading '8' indicates how many
256MB segments you want to have (for a maximum of 2GB).
You can determine an existing executable's limit using 'dump -ov a.out'.
The last two lines will be something like:
maxSTACK maxDATA SNbss magic modtype
0x00000000 0x00000000 0x0003 0x010b 1L
The number under maxDATA indicates how much memory you
can use. The default '0x00000000' is 256MB.
You can change an existing executable's limit using the
'setbmaxdata' script:
setbmaxdata 8 a.out
Then using dump -ov a.out, you will see for the last two lines:
maxSTACK maxDATA SNbss magic modtype
0x00000000 0x80000000 0x0003 0x010b 1L
To speed up debugging, etc. we often recommend using
0x70000000 unless your application really needs all
2GB. I would also recommend using 'dump -ov' on your
executable linked with -bmaxdata to make sure
you are getting what you want.
BTW, if you are using g++, you need to add -Wl, before
-bmaxdata in order to get it to work properly. Otherwise,
g++ will interpret it as -b -m -a, etc., which causes
really bad things to happen and cryptic error messages.
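The segment arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name bmaxdata_option and its via_gxx argument are made up for this example, and the 8-segment ceiling is taken from the "maximum of 2GB" statement above.

```python
SEGMENT_BYTES = 0x10000000  # one 256MB segment


def bmaxdata_option(segments, via_gxx=False):
    """Build the -bmaxdata link option for a number of 256MB segments.

    Hypothetical helper for illustration only. The 1..8 range follows
    the 2GB maximum mentioned in the answer above.
    """
    if not 1 <= segments <= 8:
        raise ValueError("segments must be between 1 and 8")
    option = "-bmaxdata:0x%08x" % (segments * SEGMENT_BYTES)
    # g++ needs the option passed through to the linker with -Wl,
    return "-Wl," + option if via_gxx else option


print(bmaxdata_option(8))
print(bmaxdata_option(7, via_gxx=True))
```

The second call builds the 0x70000000 setting recommended above, in the form g++ expects.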
Question 9: ( Aug 30, 2004 -- T. Harman )
How do I determine the number of nodes being used on ALC (at LLNL)?
Answer:
Use the "usage" script (modified from the inferno script of the same
name by Todd) to get this information. (The script is located in
/usr/gapps/uintah/bin/usage.)
Question 10: ( 06/22/2005 -- J. Davison de St. Germain )
How do I get onto LLNL's Thunder machine? What is it?
Answer:
To get access to LLNL's Thunder cluster, you need to send a
request to dav@sci.utah.edu. He will then approve the request and
forward it to LLNL (lc-support@llnl.gov). For information about
the Thunder cluster, go here.
Question 11: ( June, '05 -- David Groulx )
Why are my exceptions printing out garbage?
Answer:
If exceptions are printing out garbage for you, then you are probably
using gcc 3.3 or earlier to compile with. To force exceptions to
print out information in a compiler-independent way, configure SCIRun
with the flag '--enable-exceptions-crash' and rebuild. This should
give you more informative exceptions.
CATEGORY: MPI
Question 1: ( (Date Not Specified) -- (Author Not Specified) )
What environment variables do I use with MPI?
Answer:
This is for SGI's (perhaps the IBM SP?)
setenv MPI_MSGS_PER_HOST 2048
setenv MPI_MSGS_PER_PROC 1024
Question 2: ( (Date Not Specified) -- (Author Not Specified) )
Memory usage and MPI_TYPE_MAX
Answer:
Date: Thu, 02 May 2002 12:42:16 -0600
From: Wayne Witzel
Subject: memory usage and MPI_TYPE_MAX
FYI, this is a case study you should know about just in case this kind
of thing happens in the future.
I tracked down the highwater memory test failures of ICE and MPMICE to
the fact that I recently added:
setenv MPI_TYPE_MAX 10000
to my .cshrc on rapture.sci.
The default MPI_TYPE_MAX is 1024. So increasing it to 10000 causes
MPI to use significantly more memory (at least, relative to the memory
these ICE and MPMICE runs were using).
So the lesson here is that if you are having failures with highwater
memory tests in the regression tester, this is one culprit to look at.
One way to tell if this is the problem is to open up the "malloc_stats"
file in your results and in the gold standard, search for "MPI
initialization" and compare the number of bytes. The number will be 4
times whatever your MPI_TYPE_MAX is set to.
The way I could see people running into this in the future is if they
run tests manually on their account where they don't have MPI_TYPE_MAX
set (or set to a different value than I have it set to) and then replace
the gold standard with these results. The way to prevent this would be
for everybody to have the same values set for the MPI environment
variables in their .cshrc. I have the following in my .cshrc:
setenv MPI_MSGS_PER_HOST 32768
setenv MPI_MSGS_PER_PROC 8192
setenv MPI_TYPE_MAX 10000
Wayne
CATEGORY: Matlab Tricks
Question 1: ( 01/26/05 -- Todd )
How do I make a contour plot in matlab?
Answer:
You should be in the Standalone directory, and lineextract must be compiled.
%__________________________________
% Hard wired Variables
ts = 4; % timestep
level = 0;
uda = 'test.uda'; % name of the uda directory (must be a string)
startEnd ='-istart -1 -1 8 -iend 17 17 8';
%__________________________________
% import the data
c = sprintf('lineextract -v delP_Dilatate -l %i -timestep %i %s -o delP -m 0 -uda %s',level,ts,startEnd,uda);
[s, r] = unix(c);
delP = importdata('delP');
x = delP(:,1);
y = delP(:,2);
z = delP(:,4);
%__________________________________
% reshape and plot the data (this is the trick to contour plots)
X = reshape(x, [18 18]);
Y = reshape(y, [18 18]);
Z = reshape(z, [18 18]);
[C,h] = contourf(X, Y ,Z);
clabel(C,h);
colormap jet
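For anyone post-processing the same lineextract output outside of matlab, the reshape trick can be sketched in plain Python. Matlab's reshape fills column by column, so element (r, c) of the grid comes from position c*nrows + r of the flat column. The function name reshape_column_major is an assumption made for this sketch.

```python
def reshape_column_major(values, nrows, ncols):
    """Mimic matlab's reshape(values, [nrows ncols]) on a flat list.

    Illustrative helper only: returns a list of rows, filled column
    by column the way matlab does it.
    """
    if len(values) != nrows * ncols:
        raise ValueError("number of values does not match nrows*ncols")
    return [[values[c * nrows + r] for c in range(ncols)]
            for r in range(nrows)]


# A 2x3 example; the script above uses [18 18] for its 18x18 slice.
grid = reshape_column_major([1, 2, 3, 4, 5, 6], 2, 3)
print(grid)
```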
CATEGORY: Misc
Question 1: ( Jul 2002 -- Dav/Bryan )
How do I get passwordless entry to LANL (or anywhere else for that matter)?
Answer:
Here is a method that I believe will work to remove the need to type
in your password when you ssh from anywhere to rapture (either for cvs
or for sending data files.)
You need to follow these steps:
> ssh to the machine you want passwordless access FROM
> ssh-keygen -t dsa # this is done only once
Press return, then enter a pass phrase that you will remember as you
will need it once every log in session.
This will create files in your ~/.ssh directory - id_dsa, and id_dsa.pub.
You may also use 'ssh-keygen -t rsa' for rsa (it will create id_rsa and
id_rsa.pub), or ssh-keygen -t rsa1 if you need ssh 1 protocol.
Copy the data from "id_dsa.pub" (or id_rsa.pub) (that was generated in
your .ssh dir on the machine you logged in to) to rapture (or the machine
you want passwordless access TO) and append it to a file named
~/.ssh/authorized_keys
Now, every time you want to do the no password ssh'ing from that location to
rapture, type:
> ssh-agent # this is done only one time as you first log in
Run the commands it prints to the screen (which adds some stuff to
your environment).
> ssh-add ~/.ssh/id_dsa # this is done only one time after the ssh-agent
Enter the pass phrase that you used above.
> ssh name@rapture.sci.utah.edu (or to the machine you copied the public key)
At this point (from now on in this log in session) you can ssh freely
to rapture. You will also be able to ssh freely from any xterms you
kick off.
This sort of thing should also work to go to/from other machines.
BTW, the id_dsa file contains your private key. It should be only
readable by you. The id_dsa.pub contains your "public" key. In
theory, this is what you can give to other people so that they can
send encrypted data to you that only you can decipher.
(This part isn't necessary, it's just optional extra power)
The ssh-agent and ssh-add don't *really* need to be done every time.
In theory, whenever you run an ssh-agent, it stays in memory until the
machine reboots (or until root kills it). To take advantage of this, you
can save the commands that ssh-agent outputs to a file, and then just source
that file when you log in. And if you have already authenticated (ssh-add)
to that ssh-agent, you won't need to do it or type in your passphrase again.
Here are two aliases that facilitate this process (keep each one on one line).
Add them to your .cshrc or .aliases file.
alias agent 'rm -f "$HOME"/.ssh/`hostname`.agent ;
ssh-agent > "$HOME"/.ssh/`hostname`.agent ;
source "$HOME"/.ssh/`hostname`.agent ; ssh-add'
This saves the output of ssh-agent to a file, sources it, and does ssh-add.
You will need to type your pass-phrase here. You will only need to do this
once, or until the process gets killed.
alias sshagent 'if (-e "$HOME"/.ssh/`hostname`.agent)
source "$HOME"/.ssh/`hostname`.agent ; endif'
This one checks for the file that should be created by this computer, and if it
exists, it sets up the environment to run with that ssh-agent. If you run
sshagent at the end of your .cshrc, you may never have to type passwords again!
However, if the machine reboots or your ssh-agent gets killed, this alias won't
work, and you will need to run 'agent' again. Be very careful about security when doing
this: make sure your .ssh directories and these files can be read only by you.
So, once on the machine you want to ssh from, type
> agent
and at the bottom of your .cshrc file (after the aliases that you added above) add
sshagent.
This will set everything up. Also, before you run agent, make sure that there
aren't already any ssh-agents owned by you on that machine:
ps -fu username | grep ssh-agent
Kill them before you run the agent alias.
Question 2: ( Feb '03 -- Worthen )
How can I turn off the compilation of Arches (or MPM or ICE)?
Answer:
To turn off compilation of ARCHES (this works for turning off MPM/ICE
too) use the script
Uintah/Test/helpers/useFakeArches.pl path-to-SCIRun.
This basically edits the sub.mk files to remove references to Arches and
builds an empty Arches class. Likewise, useFakeIce.pl, useFakeMPM_ICE.pl,
and useFakeMPM.pl (this one is in the works) will turn off ICE, MPM and
ICE, or MPM, respectively.
Question 3: ( (Date Not Specified) -- (Author Not Specified) )
I am having thread problems with SCIRun. What could be wrong?
Answer:
The gcc compiler must have threads enabled. You can check this with
"gcc -v". It should say "Thread model: posix". If not, you need to
reconfigure gcc using the "--enable-threads=posix" option.
Question 4: ( Dec 2002 -- Hartner )
What MANPATH should I use?
Answer:
It should be undefined. Setting your MANPATH really messes up GNU man. As
long as the command is in your PATH, man should be able to find the man
page if one exists. (This is for linux/inferno?)
Question 5: ( Jan '03 -- Steve Parker )
How do you get generic execution time measurements from a program?
Answer:
/usr/bin/time sh -c "sus -ice inputs/whatever.ups" >& time.log
Question 6: ( Feb '03 -- Worthen )
How can I change the PETSc I am using to another without reconfiguring?
Answer:
You can edit the configVars.mk file. Specifically you will need to
modify the PETSC_LIBRARY variable and have it point to the right place.
This assumes that the build of PETSc works and uses the same version
of MPI that you are linking sus against (hence the problems we were
having last week on the cluster). Then delete all the libraries (rm
lib/*.so) and recompile.
Question 7: ( Apr '03 -- Kurt Zimmerman )
How do I make mpeg movies from raw frames?
Answer:
You'll need a few pieces of software:
pnmflip (on rapture in /usr/sci/local/bin/ used by raw2ppm.csh below)
raw2ppm (on rapture also, and also used by raw2ppm.csh)
mpeg_encode (grab a copy for an SGI from /home/sci/kuzimmer/bin)
You'll need a parameter file for mpeg_encode:
look at /home/sci/kuzimmer/tools/pnm.param
And you'll need a simple cshell script:
look at /home/sci/kuzimmer/tools/raw2ppm.csh
copy the raw2ppm.csh file into the directory where all of your
*.moviename.raw files are. Edit the dimensions in the script to match
the frame size of your raw frames. Then type raw2ppm.csh at the command
line. It will begin converting all of your raw files to ppm files.
mpeg_encode likes ppm files. While you are converting files you will
want to copy the pnm.param file to this same directory. You will also
want to edit the pnm.param file. Edit the OUTPUT line (line 4 in my
pnm.param file) to set the file name for the movie. Then edit the INPUT
(or lines 16-18 in my file) to match the names of your *.moviename.ppm
files, then set the beginning and end numbers. So for example if you want
to make a movie of the files 021.mymovie.ppm to 653.mymovie.ppm the
INPUT section of the parameter file would read:
INPUT
*.mymovie.ppm [021-653]
END_INPUT
Once you have all of your .ppm files and you've edited your parameter
file, just type:
mpeg_encode pnm.param
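If you make a lot of movies, the INPUT section can be generated instead of edited by hand. Here is a small Python sketch that only reproduces the three-line form shown above. The function name mpeg_input_section and the digits argument (the zero-padded width of the frame numbers) are assumptions for this example.

```python
def mpeg_input_section(movie_name, first, last, digits=3):
    """Return the INPUT block for frames <first>..<last> of a movie.

    Illustrative helper only: it emits the same three lines shown in
    the answer above, with frame numbers zero-padded to 'digits'.
    """
    if first > last:
        raise ValueError("first frame must not be after last frame")
    frame_range = f"[{first:0{digits}d}-{last:0{digits}d}]"
    return f"INPUT\n*.{movie_name}.ppm {frame_range}\nEND_INPUT"


print(mpeg_input_section("mymovie", 21, 653))
```

Paste the printed text over the INPUT section of your copy of pnm.param.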
Question 8: ( Apr '03 -- J. Davison de St. Germain )
I get this message when trying to use CVS:
cvs checkout: failed to create lock directory for
Some directories
Permission denied cvs checkout: failed to obtain dir lock in ...
[checkout aborted]: read lock failed - giving up
How do I fix this?
Answer:
This means you do not have permissions to access some of the CVS tree.
This is usually due to your not being in the sci unix group.
To verify, type "groups" on the command line on a SCI machine. If you
are not, please send a message to dav@sci.utah.edu asking to be added
to the sci unix group so you can access CVS.
Question 9: ( June 2003 -- Randy Jones )
How do I update CSAFE web pages?
Answer:
The following are some quick and simple instructions
on how to add content to the C-SAFE website.
Currently, content that is checked in should be automatically
updated to the web server.
Make sure you have an account on a SCI machine.
Step 1: Make sure to have the following environment
variables set:
CVS_RSH=ssh
CVSROOT=/usr/sci/projects/cvsrepository
If you are using a machine outside of sci,
then you will use:
CVS_RSH=ssh
CVSROOT=<user>@<sci-machine>:/usr/sci/projects/cvsrepository
(This is the same CVSROOT used for SCIRun)
Step 2: Checkout the C-SAFE web site tree:
cvs co csafeweb
(It will take about 110MB of disk space)
Step 3: The easiest way to create a new page and add
it to the C-SAFE web site is to copy one that
is already there, rename it, and then replace
the parts that are inside of:
<!-- START OF CONTENT -->
<!-- END OF CONTENT -->
with your own content.
Then, put a link to your new page from a
page already on the C-SAFE web-site. This way,
you will automatically get the C-SAFE title
bar and style on your new page.
(If your new web page is going to be in a new
subdirectory, you will have to fix the links to
the title bar images by replacing "../" with
"../../" on all of the image references.)
Step 4: Add your new web page to cvs:
cvs add <your-page>.html
Step 5: Commit your changes:
cvs commit -m "Added <something> to C-SAFE web site"
Question 10: ( Mar, '05 -- David Groulx )
How do I find out the kernel version and architecture I am running on?
Answer:
From the terminal type "uname -a" to print out all OS information.
CATEGORY: SUS
Question 1: ( (Date Not Specified) -- (Author Not Specified) )
What is sus? Pronunciation?
Answer:
Standalone Uintah Simulation (application). Pronounced "sus" (short
'u') rhymes with "fuss".
Question 2: ( (Date Not Specified) -- (Author Not Specified) )
How do I give sus input? What is a .ups file?
Answer:
The xml file basically specifies the input and output of a module as
well as some other basic info for module creation. If you are just
adding more operators and using the same input and outputs, then you can
safely ignore the xml file. The tcl file is where you set up the
visual code for a module, entries, buttons, sliders etc. The tcl
code and the C++ code usually "communicate" via GuiVariables, although
there are other means of passing info between the two.
Question 3: ( 19 Sep 2002 -- T. Harman )
How long does an ICE/MPMICE timestep take (on frost/raptor)?
Answer:
Raptor ICE problem 2 ice_matl
Time=0.00715542, delT=7.37676e-05, elap T = 144.165, DW: 97, Mem Use = 12435456
Time=0.00722918, delT=7.37676e-05, elap T = 145.641, DW: 98, Mem Use = 12435456
Time=0.00730295, delT=7.37676e-05, elap T = 147.098, DW: 99, Mem Use = 12435456
Frost ICE problem 2 ice_matl
Time=0.00715542, delT=7.37676e-05, elap T = 149.435, DW: 97, Mem Use = 15147616
Time=0.00722918, delT=7.37676e-05, elap T = 150.962, DW: 98, Mem Use = 15147616
Time=0.00730295, delT=7.37676e-05, elap T = 152.476, DW: 99, Mem Use = 15147616
Frost MPMICE problem, 1 ice_matl 2 mpm_matl
Time=0.0547642, delT=0.0027551, elap T = 103.666, DW: 21, Mem Use = 18752672
Time=0.0575193, delT=0.00275559, elap T = 108.634, DW: 22, Mem Use = 18752672
Time=0.0602748, delT=0.00275601, elap T = 113.47, DW: 23, Mem Use = 18752672
Time=0.0630309, delT=0.00275639, elap T = 118.401, DW: 24, Mem Use = 18752672
Raptor MPMICE problem, 1 ice_matl 2 mpm_matl
Time=0.0547642, delT=0.0027551, elap T = 111.474, DW: 21, Mem Use = 15089664
Time=0.0575193, delT=0.00275559, elap T = 115.989, DW: 22, Mem Use = 15089664
Time=0.0602748, delT=0.00275601, elap T = 120.612, DW: 23, Mem Use = 15089664
Time=0.0630309, delT=0.00275639, elap T = 125.222, DW: 24, Mem Use = 15089664
Sus: going down successfully
Raptor configure line
../src/configure --enable-64bit --enable-package=Uintah '--enable-optimize=-Ofast -G0 -OPT:Olimit=20000 -IPA:plimit=20000' --disable-sci-malloc --enable-assertion-level=0
Frost configure line
../src/configure --enable-32bit --enable-package=Uintah --with-thirdparty=/usr/apps/uintah/SCIRun_Thirdparty/1.4.2/aix/xlC-32bit --disable-sci-malloc --with-zlib=/usr/local --with-mpi=/usr/lpp/ppe.poe --enable-optimize=-O2 --enable-assertion-level=0
Question 4: ( 2 Dec 2002 -- T. Harman )
What scripts are there that can help me with VarLabels?
Answer:
Update of /csafe_noexport/cvs/cvsroot/SCIRun/src/Packages/Uintah/StandAlone/inputs
In directory csf:/tmp/cvs-serv346188
Added Files:
labelNames
Log Message:
An aid for those who can't remember all the different variable labels.
This script spits out the variable names for the different components
usage:
labelNames
Question 5: ( Apr '03 -- Bryan Worthen )
How do I get sus to output checkpoints at specified walltime intervals?
Answer:
If you have this in your .ups file:
<DataArchiver>
...
<checkpoint walltimeStart="<startnum>" walltimeInterval="<intnum>"/>
...
</DataArchiver>
where startnum and intnum are in seconds, it will do a checkpoint starting
at startnum seconds, and then every intnum seconds after that.
I.e., if I had:
<checkpoint walltimeStart="3600" walltimeInterval="7200"/>
Then it would start doing checkpoints in one hour, and then every two
hours after that, or
<checkpoint walltimeStart="10800" walltimeInterval="7200"/>
then it would start at 3 hours, and then every 2 hours.
Keep in mind, though, that it could take a while to do the checkpoints and
that it will wait for a timestep to complete before it outputs
checkpoints. So leave a margin of probably 10-20 minutes before the time you
know your run will terminate.
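To pick walltimeStart and walltimeInterval for a given batch limit, it can help to list when the checkpoints would be requested. This is a rough Python sketch, not something sus provides: the function name and the run_limit argument are made up here, and real checkpoints land somewhat later because sus waits for the current timestep to finish.

```python
def checkpoint_walltimes(walltime_start, walltime_interval, run_limit):
    """List the wall-clock seconds at which checkpoints are requested.

    Illustrative only: starts at walltime_start and repeats every
    walltime_interval seconds until run_limit (all in seconds).
    """
    if walltime_interval <= 0:
        raise ValueError("walltime_interval must be positive")
    times = []
    t = walltime_start
    while t <= run_limit:
        times.append(t)
        t += walltime_interval
    return times


# walltimeStart="3600" walltimeInterval="7200" in a 6 hour job
print(checkpoint_walltimes(3600, 7200, 6 * 3600))
```

If the last entry is closer to the job limit than the 10-20 minute margin mentioned above, move the start or shorten the interval.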
Also note that data output and checkpoints can happen after every n timesteps, i.e.,
output: <outputTimestepInterval>1</outputTimestepInterval>
checkpoint: <checkpoint cycle = "2" timestepInterval = "500"/>
or after every n simulation seconds
<outputInterval> 0.01 </outputInterval>
<checkpoint interval="0.0005" cycle="2"/>
Question 6: ( Apr '03 -- Bryan Worthen )
What environment variables do sus/scirun respond to? (Or, how do I get SCIRun/sus to exit cleanly?)
Answer:
The following environment variables can be used in either sus or scirun:
| Variable | Value | Purpose |
| SCI_DBXCOMMAND | command | run this debug command on a signal/abort (the pid will be provided) |
| SCI_SIGNALMODE | | Default - ask user what to do on abort |
| | exit | exit without prompt |
| | dbx | invoke SCI_DBXCOMMAND if it exists, or dbx (on sgi, or on others, gdb) |
| | cvd | another debugger to try |
| | resume | try to keep going |
| SCI_EXCEPTIONMODE (not currently used) | | Default - ask user what to do on exception |
| | abort | abort without prompt |
| | dbx | invoke SCI_DBXCOMMAND if it exists, or dbx (on sgi, or on others, gdb) |
| | cvd | another debugger to try |
| | throw | throw the exception |
| MALLOC_STRICT | | causes all memory to be strictly initialized (0xffff5a5a) |
| MALLOC_LAZY | | turns off memory auditing |
| MALLOC_TRACE | filename | traces memory to filename or stderr if no filename |
| MALLOC_STATS | filename | outputs memory results at exit time to filename or stderr if no filename |
| MALLOC_PERPROC | filename | outputs memory usage per timestep to filename or cout if no filename |
Question 7: ( June 2003 -- Bryan Worthen )
How do I include an xml file from my .ups file (so I don't need to have the
same things in many different files)?
Answer:
Put an include tag anywhere in your ups file where you want it to be replaced
by a larger set of tags, e.g.,
<Uintah_specification>
<DataArchiver>
<include href="saveLabels.xml"/>
<outputInterval>1.0</outputInterval>
...
<MPM>
<material>
<include href="MaterialData/MaterialConst4340Steel.xml"/>
...
</Uintah_specification>
Specifically, the include tag has the syntax:
<include href="filename"/> where filename is either absolute or relative
to the path of the file doing the including.
The included file needs to look like this (this hasn't changed):
This is inputs/MPM/MaterialData/MaterialConst4340Steel.xml:
<?xml version='1.0' encoding='ISO-8859-1' ?>
<!-- 4340 Steel -->
<Uintah_Include>
<density>7830.0</density>
<toughness>10.e6</toughness>
<thermal_conductivity>38</thermal_conductivity>
<specific_heat>477</specific_heat>
<room_temp>294.0</room_temp>
<melt_temp>1793.0</melt_temp>
</Uintah_Include>
the syntax of the file is:
<?xml version='1.0' encoding='ISO-8859-1' ?>
<Uintah_Include>
<any tag or set of tags that you want/>
</Uintah_Include>
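To preview what a ups file looks like after its includes are expanded, something like the following Python sketch can be used. This is not how sus does it internally; it is just an approximation of the behavior described above (href relative to the including file, contents wrapped in Uintah_Include) using Python's standard xml.etree.ElementTree module. The function names are made up for this example.

```python
import os
import xml.etree.ElementTree as ET


def load_with_includes(path):
    """Parse an XML file and splice in every <include href="..."/>."""
    root = ET.parse(path).getroot()
    _expand(root, os.path.dirname(os.path.abspath(path)))
    return root


def _expand(parent, base_dir):
    """Replace include children of 'parent' in place, recursively."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag != "include":
            _expand(child, base_dir)
            i += 1
            continue
        href = child.get("href")
        if href is None:
            raise ValueError("include tag without an href attribute")
        # href is absolute or relative to the file doing the including
        inc_path = href if os.path.isabs(href) else os.path.join(base_dir, href)
        inc_root = load_with_includes(inc_path)
        if inc_root.tag != "Uintah_Include":
            raise ValueError(inc_path + " is not wrapped in <Uintah_Include>")
        new_children = list(inc_root)
        parent.remove(child)
        for offset, elem in enumerate(new_children):
            parent.insert(i + offset, elem)
        # the spliced-in tags were already expanded, so skip past them
        i += len(new_children)
```

Calling ET.dump(load_with_includes("myfile.ups")) prints the expanded tree (myfile.ups is a placeholder name). Note that ElementTree drops XML comments when parsing.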
Question 8: ( Oct 2003 -- Bryan Worthen )
How do I debug mpi jobs with gdb?
Answer:
See gdb.html
Question 9: ( November 26, 2003 -- Randy Jones )
How do I track memory leaks?
Answer:
- You need to have a build of sus where you have enabled sci-malloc.
(i.e. your configure lines should have " --enable-sci-malloc"). Add
--enable-scinew-line-numbers to get files with line numbers where
scinew detects a memory leak.
- Set the environment var MALLOC_STATS to a 'filename'.
- Edit your code and add:
const char* old_tag = AllocatorSetDefaultTag("task abc");
to the top of each task (or the top of each region you want to test).
AllocatorSetDefaultTag() returns the current string, so you
want to reset it at the end of your task to avoid misleading tags
for leaking memory. To do this, just add:
AllocatorSetDefaultTag(old_tag);
at the end of each task. (old_tag was set above by the first call
of AllocatorSetDefaultTag().)
- Recompile
- Run
- Look at the 'filename' file and look for the non-freed memory. It
should be labeled with the name of the task that you entered. You
can add more AllocatorSetDefaultTag() calls in this task if you
need to narrow it down more.
- You should call AllocatorResetDefaultTag() at the highest level of
setting the default tag. This will set it back to the default tags.
The reason AllocatorSetDefaultTag(old_tag) won't do this is there are
actually three tags (malloc, new, and new[]) that are being set, and
AllocatorResetDefaultTag() resets all three back to their original value.
Question 10: ( January 27, 2004 -- James Bigler )
Is there a way to verify uda directory contents without launching scirun
and going through all of the timesteps?
Answer:
If your data is not too large, you could always do:
./puda -varsummary [uda]
This will touch all of the data and timesteps.
Question 11: ( May 2004 -- Bryan Worthen )
How do I run the dynamic load balancer?
Answer:
Add the following section to your ups file:
<LoadBalancer>
<timestepInterval>500</timestepInterval>
<cellFactor>.5</cellFactor>
<dynamicAlgorithm>particle3</dynamicAlgorithm>
<gainThreshold>0.0</gainThreshold>
<doSpaceCurve>true</doSpaceCurve>
</LoadBalancer>
The timestepInterval is how often loadbalancing will occur (you can also
use <interval>#</interval> where # is a number in terms of the simulation
time). Note that for all the cases I've done, the load balancer has done
most of its good work on the first timestep, and subsequent cases didn't help
much, but this will depend on your problem, so based on experimentation, you
might want this to be a higher or smaller number.
The cellFactor tells the load balancer how much to count each cell in terms of
particles to determine the total patch cost. I have found that values between .5
and .7 are good for MPM simulations and that values around 1.0 are good for
MPMICE simulations, but you are welcome to experiment.
So far, the load balancer only works well for MPM-based simulations; the
others, including AMR simulations, are currently a work in progress.
The dynamicAlgorithm is the algorithm to use at runtime. The
choices are
static - pretty much the same as the default load balancer
cyclic - rotates the patches in a cyclic manner among processors (pretty worthless except as a test)
random - assigns each patch to a random processor (also pretty worthless except as a test)
particle1 - decent algorithm, but worse than static
particle2 - not a good algorithm, way worse than static
particle3 - pretty good algorithm - this is what we showed at the TST.
So chances are, if you want to try anything useful with the Load Balancer,
do particle3.
gainThreshold is optional - it tells the load balancer that instead of load balancing every n timesteps,
it should first check whether it's worthwhile to do so. More specifically, it calculates the std deviation of
processor cost (where cost is based on cellcost*numcells + numParticles) with and without load balancing, and
if (oldStdDev / proposedStdDev) >= threshold, then it does the load balancing. That is, if the threshold is zero, always
load balance; if it is 1.0, load balance if the proposed solution is at least as good; and if it is 1.25, the proposed
solution should be 25% better. If this is left out, the default value is 0.
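The gainThreshold test can be written out as a short Python sketch. This is not the load balancer's actual code: the function names are made up, and using the population standard deviation (statistics.pstdev) is an assumption, since the text above does not say which form is used.

```python
from statistics import pstdev


def patch_cost(num_cells, num_particles, cell_factor):
    """Cost of a patch: cellFactor*numCells + numParticles."""
    return cell_factor * num_cells + num_particles


def should_rebalance(old_costs, proposed_costs, gain_threshold=0.0):
    """Decide whether to load balance, given per-processor costs.

    Illustrative sketch: applies oldStdDev / proposedStdDev >= threshold,
    written as a multiplication so that a perfectly balanced proposal
    (std deviation of zero) does not divide by zero.
    """
    if gain_threshold <= 0.0:
        return True  # threshold of zero: always load balance
    old_sd = pstdev(old_costs)
    proposed_sd = pstdev(proposed_costs)
    return old_sd >= gain_threshold * proposed_sd


old = [100.0, 300.0]       # std deviation 100
proposed = [180.0, 220.0]  # std deviation 20
print(should_rebalance(old, proposed, gain_threshold=1.25))
```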
doSpaceCurve is optional - it tells the load balancer to try to do a simple space-filling curve, which
should give it some optimization. Currently
the curve algorithm makes one big assumption about the domain - and that is that it can be identified by
a single <patches> section for each level. Therefore, if your domain has multiple boxes or uses AMR regridding,
it's probably a good idea to set this to false. If this is left out, the default value is false.
Then run sus as normal but add -loadbalancer PLB or -loadbalancer
ParticleLoadBalancer to the command line:
sus -mpm disks.ups -loadbalancer PLB
Note that you should have more patches than processors (at least twice
as many).
Question 12: ( July 13, 2004 -- Randy Jones )
How do I get the stack trace on a hung program which crashed on an SGI?
Answer:
In a separate xterm, you can type "dbx -p <process-id>" and then type
"where" to get the stack trace.
Question 13: ( August 12, 2004 -- Randy Jones )
How do I see how much real time it is taking to calculate one simulation second?
Answer:
To see this statistic every timestep, just type the following into your shell:
setenv SCI_DEBUG SimulationTimeStats:+
If you want to see this statistic by itself and remove the normal stats, type:
setenv SCI_DEBUG SimulationTimeStats:+,SimulationStats:-
Question 14: ( August 2004 -- Bryan )
Is there a way to match the name of the uda directory with my batch job?
Answer:
Sometimes it is desirable to have the uda match something, like the job id
of the job that ran it. To do this, you can pass -uda_suffix <name>
as an argument to sus. On inferno, you can specify -uda_suffix $PBS_JOBID
to match the job number (and output file if that's how you save output).
Question 15: ( Sep, 2004 -- Bryan )
Can I override the delt of a restart run from what was saved in the
checkpoints?
Answer:
Yes.
Do:
<override_restart_delt> .00000000000000001 </override_restart_delt>
in the time block on your input.xml file. This will override the
very next timestep, and will display a message that it is doing so.
It will affect restarts only, so placing it in a ups file won't do
anything on the original run, but it will be copied to the input.xml
file and will take effect on the restart.
Question 16: ( December, 2004 -- Bryan )
Can I make sus output the initialization timestep?
Answer:
Yes. Add <outputInitTimestep/> to the DataArchiver block in your ups file.
Question 17: ( December, 2004 -- Bryan )
Outputting data seems to take longer with an increasing number of processors.
Is there a way to make this faster?
Answer:
Maybe. If you add this section to your ups file:
<LoadBalancer>
<outputNthProc>4</outputNthProc>
</LoadBalancer>
it will tell sus to output data from every 4th processor instead of from every single processor
(i.e., procs 1-3 will ship their data to proc 4, 5-7 will ship to proc 8, etc.). So naturally
the cost of sending the data via mpi needs to be less than the gain achieved by having fewer processors hitting
the file system at the same time for this to be beneficial.
The one experimental data point I have is that 128 procs outputting on every 4th proc seems beneficial
(on inferno using raid1). I suppose that for more procs it would also be beneficial.
Question 18: ( Dec, 2004 -- Bryan )
How do I only run my simulation for x timesteps?
Answer:
2 ways. Either add:
<max_iterations>x</max_iterations>
or
<maxTimestep>x</maxTimestep>
where x is the number of timesteps. The difference is that
max_iterations will run that many timesteps from the start of
the simulation, even on restarts. maxTimestep will run to
timestep x and quit, even on restarts.
Question 19: ( Jan, 2005 -- Bryan )
How can I remove some variables from an uda?
Answer:
Run your simulation as normal. Then edit the <uda-dir>/input.xml
file and remove the "save" labels that you don't want anymore. Then run
[mpirun -np #procs] sus -reduce_uda <uda-dir>
Use MPI for big cases that won't fit on one processor.
If you want to compare the new and the old uda to make sure they are the
same, then edit the original uda/index.xml and remove the same variables
and then do
compare_uda first_uda second_uda.
Note, this probably won't work if your varLabels use BoundaryLayers, which
I think only are used in the Examples component directory.
Also note that the resulting uda will be changed slightly from the original uda,
but only in that the timesteps in the resulting will be
t00001-t<number-of-timesteps> instead of the original number, the delt's
stored in the timestep.xml represent the time difference between the output
timesteps, and the resulting uda does not have checkpoints or reduction variables.
However, if you copy the checkpoints and reductions over, you should be able to
restart just fine, and as far as scirun or any other program is concerned it should
look exactly the same.
Question 20: ( June, 2005 -- Bryan )
What does 'WARNING: Possible extra communication between patches!' mean?
Answer:
This means that a processor is sending more data to another processor
than it needs to, and normally only arises when you have more than one
patch per processor.
For example (ASCII art)
-----------
|    |    |
| 1  | 2  |
|    |    |
-----------
|    |    |
| 0  | 1  |
|    |    |
-----------
proc 0 needs to send data to proc 1. It needs to send to the patch
above it and to the patch to the right. So, in the current way of
things, we choose sending one larger message (which constitutes the
entire patch here) over sending two smaller messages. Whether this is a
good choice or not depends on the message size, network latency, and
bandwidth.
This is one of the things we are investigating for a scheduler change.
This message only occurs at taskgraph compilation time, so its frequency
is not indicative of the number of total large messages, but perhaps the
number of them in one timestep.
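The trade-off between one larger message and two smaller ones can be estimated with the usual latency-plus-size/bandwidth model. The Python sketch below is only a back-of-the-envelope aid, not anything the scheduler computes. The function names and the sample numbers are assumptions.

```python
def transfer_time(num_messages, bytes_per_message, latency_s, bandwidth_bps):
    """Estimated time to send messages: each costs latency + size/bandwidth.

    bandwidth_bps is in bytes per second. Simple model for illustration.
    """
    return num_messages * (latency_s + bytes_per_message / bandwidth_bps)


def prefer_single_message(large_bytes, small_bytes, latency_s, bandwidth_bps):
    """True if one large message is estimated to beat two small ones."""
    one_large = transfer_time(1, large_bytes, latency_s, bandwidth_bps)
    two_small = transfer_time(2, small_bytes, latency_s, bandwidth_bps)
    return one_large <= two_small


# Made-up numbers: 100 microsecond latency, 100 MB/s bandwidth.
print(prefer_single_message(8000, 1000, 1e-4, 1e8))
print(prefer_single_message(1000000, 100000, 1e-4, 1e8))
```

With small patches the latency term dominates and the single message wins. With large patches the extra bytes dominate and two smaller messages would be cheaper.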
Question 21: ( 01/26/06 -- Todd )
How do I monitor a single variable through a timestep?
Answer:
Add this to your input file
<Scheduler>
<VarTracker>
<start_time> 0 </start_time>
<end_time> 1 </end_time>
<start_index> [139, 0, -1] </start_index>
<end_index> [141,2,1] </end_index>
<var label="press_equil_CC" dw="NewDW" />
<var label="press_CC" dw="NewDW" />
</VarTracker>
</Scheduler>
If you want to limit the spew and only print a subset of tasks, you can
do that by specifying
<task name="ICE::computeDelPressAndUpdatePressCC" />
in the <VarTracker> section.
CATEGORY: Scripts/Utilities
Question 1: ( 02/01/06 -- Todd )
What is plotStats?
Answer:
plotStats is a small gnuplot script that takes the output from sus, parses
it and plots several simulation metrics as a function of wall clock time. It's
really useful in monitoring the timestep size.
Usage:
plotStats <sus output file> <dump postScript File (y/Y), default is no>
You must have gnuplot installed.
Question 2: ( 02/01/06 -- Todd )
Is there a utility to get the time step information from an uda?
Answer:
Yes, use
puda -timesteps <uda directory>
CATEGORY: Subversion
Question 1: ( April 14, 2005 -- Hartner )
What is subversion and how do I use it?
Answer:
Subversion is used to manage the source trees of SCIRun and Uintah.
Prior to April 15th 2005 we used CVS.
The SCIRun developers have a webpage to help people get up and running
on Subversion.
Please refer to:
http://internal.sci.utah.edu/developer/BioPSE/NCRRweb/DocProcess/SCIRunandSubversion.html
Question 2: ( Apr, '05 -- David Groulx )
How do I get Subversion?
Answer:
For the impatient person installing from source (for any OS):
Download subversion-1.1.4.tar.gz
tar -zxvf subversion-1.1.4.tar.gz
cd subversion-1.1.4
mkdir ~/local
./configure --with-ssl --prefix=/home/yourname/local
make
make install (can be installed as a normal non-root user)
setenv PATH /home/yourname/local/bin:$PATH
For the impatient person running Redhat Enterprise Linux 3:
Download Jim Guilkey's RPMs for RH EL3 posted at:
http://www.sci.utah.edu/~guilkey/SUBVERSION/
rpm -ivh *.rpm (must be root to install)
For the inquisitive patient person:
The subversion project homepage is located at http://subversion.tigris.org
All the files you need to get started using subversion are stored
locally on the network. For installers, go to
/usr/sci/projects/subversion/<platform> and get the installer for your
OS. The README within each folder will have platform specific
instructions and caveats. Additionally, the source tarball is located
in the src folder; it should compile on all platforms with the
standard "configure; make; make install" method. As a third option,
precompiled binaries for most platforms are located at
/usr/sci/projects/subversion/bin/<platform>. You can just add
the appropriate location to your path.
Question 3: ( 01/23/06 -- Todd )
How do I revert my changes back to revision 32928?
Answer:
cd SCIRun/src
svn merge -r32929:32928 https://code.sci.utah.edu/svn/SCIRun/trunk/src/
Question 4: ( 01/23/06 -- Todd )
How do I checkout a specific date?
Answer:
svn checkout --revision "{$year-$month-$day}" https://code.sci.utah.edu/svn/SCIRun/trunk/src SCIRun/src
Question 5: ( 01/26/06 -- Todd )
How do I make a branch?
Answer:
cd /Uintah
svn copy -m "creating impAMRICE residual Branch" . https://code.sci.utah.edu/svn/SCIRun/branches/uintah-impAMR-residual
CATEGORY: Tester
Question 1: ( Apr '03 -- Bryan Worthen )
How do I start/restart the regression tester? How do I run the regression
tester on my own SCIRun build?
Answer:
See the regression tester documentation for a lot more information.
To start the regression tester, run /local/csafe/raid1/tester/bin/startTester.
startTester needs at least one argument to run, which is normally -sendmail.
You can also run startTester -sendmail -use_tree path_to_scirun to run the
tests on any tree.
On inferno, in order to run mpi, we need to go through the batch scheduler, so
we can run:
qsub /local/csafe/raid1/tester/bin/Regress.pbs
To run specific regression tests on your SCIRun tree, see
how to run your own tests
Question 2: ( May 2003 -- Bryan Worthen )
How do I update and compile the current regression tester build?
Answer:
To update the current build,
1) cd /local/csafe/raid1/tester/{IRIX64|Linux}/SCIRun.date/src
where you pick either IRIX64 or Linux, and date is the most recent SCIRun
build.
2) cvs update
3) cd ../{dbg|opt}/build
again, choose between dbg or opt.
4) gmake -j [numprocs] sus
5) cd .. (this will take you to the dbg or opt dir)
6) run the do[whatever]tests script, where whatever is ICE, ARCHES, MPM,
etc., e.g.,
doMPMARCHEStests
7) If it stops and asks you to remove a directory, do it, and run it
again.
8) Let the owner of the regression tester know if there are any permissions
problems
Question 3: ( June 2003 -- Bryan Worthen )
How do I add my own tests to the regression tester?
Answer:
See the regression tester docs
Question 4: ( December, 2003 -- Bryan )
The restart test passes the comparisons, but the normal test fails. I've
replaced my gold standards hundreds of times, but it still fails. What is
going on?
Answer:
When your restart test passes but your original tests fail, it is a problem
with the restart, probably in the initialization. The reason the restart
test passes is because when you replace the gold standard, the uda that gets
saved is the uda from the restart. Since something is different between the
restart and the original, the original fails, even though the original is more
correct.
Question 5: ( June 2005 -- Bryan )
How can I check on the status of the Regression Tester before I get the
email? (Or how can I see if it ran?)
Answer:
Three ways:
1) Check the website: www.csafe.utah.edu/tester/$OS/SCIRun.$DATE
where $OS is Linux or IRIX64 and $DATE is in the 6-digit format, e.g.,
www.csafe.utah.edu/tester/Linux/SCIRun.060305
Linux dir
SGI dir
The last test group is the last test completed. So if IMPM-opt is the
last thing you see, then it has completed.
2) Check the RT directories: /usr/csafe/raid2/csafe-tester/$OS/SCIRun.$DATE
where OS and DATE follow the same rules as above. Check the contents of
dbg and opt for directories called *-results (where * is ARCHES, MPM, etc.)
If there are no results directories in opt, then it's either still in dbg, or
compiling the opt sus. The last results directory in alphabetical order
is the last directory it worked on.
3) Check the machine the tester is running on.
For IRIX
ssh muse
For Linux
ssh inferno
qstat -an
Look at which nodes csafe-tester is using
ssh to the first one
Then run
ps -fu csafe-tester
to see what the RT is currently doing.
CATEGORY: Thirdparty
Question 1: ( (Date Not Specified) -- Wing )
PETSc vs HYPRE?
Answer:
PETSc does a good job of giving us a suite of preconditioners and
linear solvers (it has nonlinear solvers too). However, in order to
make it more efficient on large-scale parallel linear problems, we
need multigrid. PETSc does have multigrid, but as Steve and Rajesh
know, it's a pain since our indexing scheme is different from theirs.
We would also have to take care of some other coding stuff (we know
because we tried). I found hypre, which has multigrid and some linear
solvers; its advantage is that it uses the same indexing, so the
interface is very easy. This is just another option for users to
choose from. We are still using PETSc for some of the matrix-vector
operations. You can say I'm just too lazy to code that up myself. For
more info on hypre, go to http://www.llnl.gov/CASC/hypre
CATEGORY: UCF
Question 1: ( Dec 3, 2002 -- Steve )
What does this mean?
Caught exception: TempX_FC, matl 0, patch/level 0 not found for
scrubbing.
I assume there is a problem with the computes and requires for that
variable, but could the message be a little more descriptive? I'll
change it if someone can describe what's wrong.
Answer:
It probably means that a variable was declared to be computed but never
"put". The old check for that specific problem no longer works.
Question 2: ( 12/02 -- Dav )
Is there a way to synchronize screen output from multiple
nodes/processors on the cluster? Otherwise output is unreadable.
Answer:
For shared memory (threads), you can use a mutex to separate output.
You would do something like this:
> extern Mutex cerrLock; // at the top of our .cc file.
In the code where you want output sync'd:
> cerrLock.lock();
> cerr << "Caught exception: " << e.message() << '\n';
> cerrLock.unlock();
However, this won't work with separate MPI processes. Usually MPI
itself separates the output. Sometimes there is a flag (to mpirun)
that tells MPI to put the processor number in front of output.
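As a rough sketch of the same idea without SCIRun's Mutex class, the
following uses std::mutex from the C++ standard library. The names
outputLock, printLine, and printFromThreads are made up for this example,
and the "[rank]" prefix only imitates what an mpirun flag might do; the
caller has to pass the rank in.

```cpp
#include <cassert>
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the framework's cerrLock; std::mutex is used
// here so the sketch compiles without SCIRun.
std::mutex outputLock;

// Build the whole line first, then write it while holding the lock, so
// lines from different threads cannot interleave. The "[rank] " prefix is
// an illustration only; the caller supplies the rank (or thread number).
void printLine(std::ostream& out, int rank, const std::string& message) {
  assert(rank >= 0);
  std::ostringstream line;
  line << '[' << rank << "] " << message << '\n';
  std::lock_guard<std::mutex> guard(outputLock);  // released on scope exit
  out << line.str();
}

// Start numThreads threads that each print linesPerThread lines to out.
void printFromThreads(std::ostream& out, int numThreads, int linesPerThread) {
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&out, t, linesPerThread] {
      for (int i = 0; i < linesPerThread; ++i) {
        printLine(out, t, "Caught exception: example");
      }
    });
  }
  for (std::thread& w : workers) w.join();
}
```

Calling printFromThreads(std::cerr, 4, 10) should give whole lines, each
starting with the number of the thread that wrote it. As noted above, a
lock like this only helps threads inside one process; it does nothing for
separate MPI processes.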
Question 3: ( (Date Not Specified) -- (Author Not Specified) )
How do patches/levels/grids work together? Data-storage? Data Warehouse?
Answer:
The patches basically just "know" how big they are. The levels manage
the patches and the grid manages the levels. BUT none of them know
anything about the data. The only one that has access to the data is
the data warehouse (and for that it has vectors for each variable
type).
It has vectors (or whatever data structure, depending upon the variable)
for each variable AND each patch and material. In other words, you have
to specify the variable's "label" (its name, essentially), a patch, and
a material for each chunk of data you grab from the data archiver. So
basically, the data is associated with each patch. This is essentially
equivalent, I believe, to having the patch point to the data directly
(but perhaps not quite as efficient). I'm not sure of all the reasons
for doing it this way, but I can understand it in terms of encapsulation
(a patch defines a spatial box that can have data associated
with it, not a container of the data itself -- I dunno).
Why was this kind of layout chosen? In particular, why don't the
levels or patches "know" somehow about "their" data (at least in the
form of pointers or whatever)?
The data in the UCF can be very transient. The patch (plus material
number) is used as a key to index into the data warehouse. However,
there can be more than one data warehouse associated with each
level/patch. This can include multiple time-steps, multiple
iterations over the domain, and so forth. The UCF is quite different
than other approaches, and this is one aspect of that.
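To make the keying idea concrete, here is a toy sketch, not the real UCF
classes. Patch, VarKey, and ToyDataWarehouse are hypothetical names
invented for the example, and real variables are not plain vectors of
doubles. It only shows that the data lives in the warehouse under a
(label, patch, material) key, so two warehouses can hold different data
for the same patch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical, simplified patch: it knows its own extent, not its data.
struct Patch {
  int id;
  int low[3];
  int high[3];
};

// Key used to index the warehouse: (variable label, patch id, material).
using VarKey = std::tuple<std::string, int, int>;

class ToyDataWarehouse {
 public:
  // Store a chunk of data for one variable on one patch and material.
  void put(const std::string& label, const Patch& patch, int matl,
           std::vector<double> data) {
    store_[VarKey(label, patch.id, matl)] = std::move(data);
  }

  bool exists(const std::string& label, const Patch& patch, int matl) const {
    return store_.count(VarKey(label, patch.id, matl)) > 0;
  }

  // Look up previously stored data; asserts if nothing was ever put.
  const std::vector<double>& get(const std::string& label, const Patch& patch,
                                 int matl) const {
    auto it = store_.find(VarKey(label, patch.id, matl));
    assert(it != store_.end() && "variable was never put");
    return it->second;
  }

 private:
  std::map<VarKey, std::vector<double>> store_;
};
```

With two instances, say oldDW and newDW, the same patch and label can map
to different values in each, which is the point made above about one patch
having data in more than one data warehouse.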
Question 4: ( (Date Not Specified) -- W. Witzel )
Scrubbing the Data Warehouse (Possibly out of date info)
Answer:
> > For what exactly is the bool parameter init_timestep in
> > Scheduler::(const ProcessorGroup* pc, bool init_timestep = 0)
> > good, and when should I set that to true/false?
>
> Looks like a Wayne thing. It is used to clear the scrub lists in the
> detailed task graph. You don't clear them out on the initial time
> step. Not sure why... Wayne can probably fill us in when he gets
> back.
>
> Set it to true the very first time you call compile. False after
> that.
I thought it was a Steve thing. It could have been me -- I don't
remember.
Here's the scoop. Any variable that isn't required at the beginning of
the next timestep can be scrubbed after its use in the current timestep
is finished. How do you know what is required at the beginning of the
next timestep? Well, we basically just assume for now that the next
timestep will have the same taskgraph as the current timestep (at least
as far as what is required from the previous timestep). However, the
initialization
timestep is different. It doesn't require anything from before. It just
initializes all of the variables. If we used the same strategy for the
initialization timestep, then everything would get scrubbed and you would
lose everything you initialized. So, our simple solution, it appears, is
just don't scrub anything on the initialization timestep that could
possibly be used later.
Probably more than you needed to know, but there you go.
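The rule described above can be sketched roughly as follows. This is an
illustration, not the scheduler's actual code: scrubList and its
parameters are hypothetical, variables are reduced to label strings, and
the initialization timestep is simplified to "scrub nothing".

```cpp
#include <cassert>
#include <set>
#include <string>

// Returns the labels that could be scrubbed once the current timestep is
// done with them. requiredNextStep holds what the next timestep is assumed
// to need from this one (taken to be the same as what this timestep needed
// from the previous one).
std::set<std::string> scrubList(const std::set<std::string>& computed,
                                const std::set<std::string>& requiredNextStep,
                                bool initTimestep) {
  std::set<std::string> scrub;
  if (initTimestep) {
    // The initialization timestep requires nothing from before, so the
    // usual rule would scrub everything. Keep it all instead.
    return scrub;
  }
  for (const std::string& label : computed) {
    if (requiredNextStep.count(label) == 0) {
      scrub.insert(label);  // nobody needs it at the start of the next step
    }
  }
  return scrub;
}
```

For example, with computed = {press_CC, TempX_FC} and only press_CC
required by the next timestep, TempX_FC is the one that gets scrubbed. On
the initialization timestep the required set is empty, and without the
init_timestep check both variables would be lost.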
Question 5: ( 12/02 -- Dav )
SUS aborts instead of throwing an exception. What's going on?
Answer:
When we link sus (due to our LD_LIBRARY_PATH) we link against
/usr/sci/local/lib/libgcc_s.so.1.
Looking at this file, it appears (though I'm not 100% certain) that
this is a libgcc built for gcc 3.0. If I unset my LD_LIBRARY_PATH,
and compile my test program (I haven't tried this with sus yet), it
starts catching exceptions. (I assume sus will too.)
Question 6: ( Nov 2002 -- S. Parker )
I am getting an assertion failed error message:
An