FaqItem Categories:

Arches
CVS
Cluster
Coding
Configure
Documentation
Fortran
Graphics
LANL
LLNL
MPI
Matlab Tricks
Misc
SUS
Scripts/Utilities
Subversion
Tester
Thirdparty
UCF





CATEGORY: Arches

Questions:


Q1: What/where is the chem.bin file?

Q2: What does 'Caught exception: Allocating a CCvariable that is
apparently already allocated!' mean?



CATEGORY: CVS

Questions:


Q1: What should I set my CVSROOT and CVS_RSH environment variables to?

Q2: I've asked Yarden about this, but we don't know how to remove a lock.

Whenever I try to check code into Core/Geom I get the following
message from CVS:
cvs server: [01:38:18] waiting for yarden's lock in
/csafe_noexport/cvs/cvsroot/SCIRun/src/Core/Geom




CATEGORY: Cluster

Questions:


Q1: How can I get and use a set of interactive nodes on the cluster?

Q2: Why isn't sus running correctly? Looks like an MPI problem?
Sometimes I see errors like this:

> MPI process rank 0 (n0, p10588) caught a SIGSEGV in MPI_Type_extent.
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Type_extent()
> Rank (0, MPI_COMM_WORLD): - MPI_Gather()
> Rank (0, MPI_COMM_WORLD): - MPI_Allgather()
> Rank (0, MPI_COMM_WORLD): - main()

or

> MPI_Bcast: invalid communicator (rank 0, MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Bcast()
> Rank (0, MPI_COMM_WORLD): - main()

Or, sometimes my mpi job produces X sets of output and X udas. What's going on?

Q3: How long can jobs run on the linux cluster? How many processors?

Q4: When I submit a job on the cluster (from /tmp/banerjee in inf004 in
this case) using qsub, I get the following message

qsub: Bad UID for job execution

What am I doing wrong?

Q5: When I run a pbs batch job, my output files are not group/world readable.

Q6: How do I see which nodes on inferno are down?

Q7: I get errors such as:

> > Unable to copy file 6991.inf001.OU to inf003.sci.utah.edu:/home/sci/likai/SCIRun/linux32opt/Packages/Uintah/StandAlone/mpm-8-1/batch.job.o6991
> > >>> error from copy
> > inf003.sci.utah.edu: Connection refused
> > Unable to copy file 6991.inf001.OU to inf003.sci.utah.edu:/home/sci/likai/SCIRun/linux32opt/Packages/Uintah/StandAlone/mpm-8-1/batch.job.o6991
> > >>> error from copy
> > inf003.sci.utah.edu: Connection refused
> > yboard-interactive).
> > lost connection
> > >>> end error output

What does this mean?

Q8: Why are we not using sse and sse2 flags on debug builds on the cluster?

Q9: How do I get system status on inferno (the linux cluster)?

Q10: For how much time can I run jobs on inferno?

Q11: What do I do when I have weird problems on the cluster?

Q12: What does it mean when I see an error like this running MPI on the cluster:

> It seems that some error has occurred during MPI_INIT. This will
> cause your process to abort. These kinds of errors are usually
> system-related, such as running out of disk space, running out of
> memory, or something more serious such as data not being passed
> between processes properly. That is, you should not be seeing this
> error message; if you are, somethings is likely Very Wrong with your
> system. :-(
>
> Perhaps this Unix error message will help:
>
> Unix errno: 1252
> Unknown error 1252


Q13: I get the following warning when I submit a job on the cluster:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

Is anything wrong with the .pbs file?

Q14: How can I pass environment variables to the cluster nodes with mpirun on Inferno?

Q15: How do I run 2+ serial jobs on 1 node on inferno so I can utilize all the CPUs?

Q16: Why does my scirun build (that I built from my main sus tree) crash?

Q17: I would like to put through a test case that requires 75 nodes for
approximately 3 minutes later on today. This will tie up the queue
until it runs but will make the nodes available again after 3 minutes.
Does anybody have a problem with this?



CATEGORY: Coding

Questions:


Q1: In looking at the code to help Jim track down a memory leak, I
did a cursory search for new in both *.h and *.cc files within
Uintah/. There are many instances were new is used instead of scinew.
Is there any reason we should prefer new over scinew? If not, then I
will go in and change the new to scinew.

Q2: Why the single makefile?

Q3: How do I use debug streams? (Environment variable?)

Q4: How do I use TAU?

Q5: How do you compile on LLNL? I've been just doing it on 1
processor. Do you submit a job and use more processors? How
do I compile at LLNL?

Q6: Monitor/top machine usage monitoring tool on Livermore IBM SP?

Q7: How long does it take to compile SCIRun/sus?

Q8: How do I do performance analysis?

Q9: How do I get the debugger to come up automatically (under
linux)?

Q10: On my SGI with a mountain fresh build I'm picking up


ld64: ERROR 28 : GP-relative sections overflow by 0x35d1 bytes. Please recompile with a smaller -G value.
You can see gprel section layout with -m -aoutkeep
See the explanation in the gp_overflow(5) manpage.
ld64: INFO 152: Output file removed because of error.
--- lib/libCore_Datatypes.so ---
*** Error code 2 (ignored)
C++ prelinker: warning: could not locate library -lCore_Datatypes; assuming /usr/lib/libCore_Datatypes.a
C++ prelinker: warning: nm returned a nonzero error status
ld64: FATAL 9 : I/O error (-lCore_Datatypes): No such file or directory
gmake: *** [lib/libCore_Algorithms_Geometry.so] Error 2


Here's my configure line

../src/configure --enable-64bit --enable-package=Uintah --with-thirdparty=/usr/installed/Thirdparty/1.7/IRIX64/MIPSpro-7.3.1.1m-64bit --enable-optimize=-Ofast


Should I just turn optimize down to O2 or is there a magic systune
knob to turn?

Q11: How can I track down memory problems in sus/SCIRun?

Q12: What is LD_LIBRARY_PATH used for?

Q13: Just an answer to a question I asked on Monday. I'm now running a 32
node job on inf. I think the problem I was having on Monday may have
been related to iterating outside the bounds of my arrays. It's not
clear why this didn't kill smaller jobs as well, but that's the only
thing that has been fixed that I know of.


Q14: I did a cvs update -Pd from src, and then a gmake, and I get this error:

> gmake: *** No rule to make target `../src/Core/Util/sci_system.c',
> needed by `Core/Util/sci_system.o'. Stop.

How do I fix this?

Q15: How do I make emacs insert tabs instead of spaces?

Q16: I'm getting an error message like the following when compiling (after
I did a 'cvs update'):

No rule to make target `../src/Dataflow/Modules/Render/SCIBaWGL.h',
needed by `Dataflow/Modules/Render/OpenGL.o'. Stop.

(Note, the 'no rule' target can be anything and the 'needed by' can
also be anything.)



CATEGORY: Configure

Questions:


Q1: Why do I get (and how do I fix) this error during configure:

./config.status --recheck running /bin/sh ../src/configure --with-thirdparty=/export/space/scratch/SCIRun1.8.0/1.8/Linux/gcc-3.2-32bit '--enable-package=BioPSE MatlabInterface' --enable-debug --no-create --no-recursion

checking for gcc...
gcc
checking for C compiler default output...
a.out
a.out
conftest.c
checking whether the C compiler works...
configure: error: cannot run C compiled programs.
If you meant to cross compile, use `--host'.

or (void*) problem.





Q2: Make complains that fspec.pl is not executable.



CATEGORY: Documentation

Questions:


Q1: Where is web documentation on the Q machine (qscfe1 @ LANL)?

Q2: How do I use doxygen?



CATEGORY: Fortran

Questions:


Q1: What to do with variable names that are too long. How do I use
this PASS3 thing that you mention?

Q2: Can I use Fortran 90 compilers or does sus only support Fortran 77?

Q3: Do I need to use gen-fspec with my fortran code? If so, how do I set it
up?



CATEGORY: Graphics

Questions:


Q1: How can I make mpeg movies from the raw frames ?

Q2: How can I configure and run the Real Time Ray Tracer (rtrt) to make
movies ?

Q3: How do I make a montage of jpg images?



CATEGORY: LANL

Questions:


Q1: How do I log into the LANL machines (Theta,Q)?

Q2: Why do my submitted jobs not start on Q?

Q3: How can I send data faster from the labs (pscp)?

Q4: How do I use long-term storage at LANL?

Q5: How do I log into Q (or LANL) now that portal is gone?
OR
How do I use VPN with LANL Q?



CATEGORY: LLNL

Questions:


Q1: Where is Hypre at LLNL?

Q2: What are some hints on running at LLNL?

Q3: Where are the thirdparty libs located on Blue/Frost?

Q4: What configure line do you use on Frost (the IBM SP at LLNL)?

Q5: How do I use long term storage at LLNL?

Q6: How do I check our machine queue usage/time on frost/blue at LLNL?

Q7: How do I request dedicated time on frost or blue (LLNL)?

Q8: On Frost, why doesn't it let me allocate more than 256MB of memory?

Q9: How do I determine the number of nodes being used on ALC (at LLNL)?

Q10: How do I get onto LLNL's Thunder machine? What is it?

Q11: Why are my exceptions printing out garbage?



CATEGORY: MPI

Questions:


Q1: What environment variables do I use with MPI?

Q2: Memory usage and MPI_TYPE_MAX



CATEGORY: Matlab Tricks

Questions:


Q1: How do I make a contour plot in matlab?



CATEGORY: Misc

Questions:


Q1: How do I get passwordless entry to LANL (or anywhere else for that matter)?

Q2: How can I turn off the compilation of Arches (or MPM or ICE)?

Q3: I am having thread problems with SCIRun. What could be wrong?

Q4: What MANPATH should I use?

Q5: How do you get generic execution time measurements from a program?

Q6: How can I change the PETSc I am using to another without reconfiguring?

Q7: How do I make mpeg movies from raw frames?

Q8:

I get this message when trying to use CVS:

cvs checkout: failed to create lock directory for

Some directories

Permission denied cvs checkout: failed to obtain dir lock in ...

[checkout aborted]: read lock failed - giving up

How do I fix this?


Q9: How do I update CSAFE web pages?

Q10: How do I find out the kernel version and architecture I am running on?



CATEGORY: SUS

Questions:


Q1: What is sus? Pronunciation?

Q2: How do I give sus input? What is a .ups file?

Q3: How long does an ICE/MPMICE timestep take (on frost/raptor)?

Q4: What scripts are there that can help me with VarLabels?

Q5: How do I get sus to output checkpoints at specified walltime intervals?

Q6: What environment variables do sus/scirun respond to? (Or, how do I get SCIRun/sus to exit cleanly?)

Q7: How do I include an xml file from my .ups file (so I don't need to have the
same things in many different files)?

Q8: How do I debug mpi jobs with gdb?

Q9: How do I track memory leaks?

Q10: Is there a way to verify uda directory contents without launching scirun
and going through all of the timesteps?

Q11: How do I run the dynamic load balancer?

Q12: How do I get the stack trace on a hung program which crashed on an SGI?

Q13: How do I see how much real time it is taking to calculate one simulation second?

Q14: Is there a way to match the name of the uda directory with my batch job?

Q15: Can I override the delt of a restart run from what was saved in the
checkpoints?

Q16: Can I make sus output the initialization timestep?

Q17: Outputting data seems to take longer with an increasing number of processors.
Is there a way to make this faster?

Q18: How do I only run my simulation for x timesteps?

Q19: How can I remove some variables from an uda?

Q20: What does 'WARNING: Possible extra communication between patches!' mean?

Q21: How do I monitor a single variable through a timestep?



CATEGORY: Scripts/Utilities

Questions:


Q1: What is plotStats?

Q2: Is there a utility to get the time step information from an uda?



CATEGORY: Subversion

Questions:


Q1: What is subversion and how do I use it?

Q2: How do I get Subversion?

Q3: How do I revert my changes back to revision 32928?

Q4: How do I checkout a specific date?

Q5: How do I make a branch?



CATEGORY: Tester

Questions:


Q1: How do I start/restart the regression tester? How do I run the regression
tester on my own SCIRun build?

Q2: How do I update and compile the current regression tester build?


Q3: How do I add my own tests to the regression tester?

Q4: The restart test passes the comparisons, but the normal test fails. I've
replaced my gold standards hundreds of times, but it still fails. What is
going on?

Q5: How can I check on the status of the Regression Tester before I get the
email? (Or how can I see if it ran?)



CATEGORY: Thirdparty

Questions:


Q1: PETSc vs HYPRE?



CATEGORY: UCF

Questions:


Q1: What does this mean?

Caught exception: TempX_FC, matl 0, patch/level 0 not found for
scrubbing.

I assume there is a problem with the computes and requires for that
variable but could the message be a little more descriptive. I'll
change it if someone can describe what's wrong.

Q2: Is there a way to synchronize screen output from multiple
nodes/processors on the cluster? Otherwise output is unreadable.

Q3: How do patches/levels/grids work together? Data-storage? Data Warehouse?

Q4: Scrubbing the Data Warehouse (Possibly out of date info)

Q5: SUS aborts instead of throwing an exception. What's going on?

Q6: I am getting an assertion failed error message:

An exception was thrown. Msg: from.d_window != 0 (file:
../../src/Packages/Uintah/Core/Grid/Array3.h, line: 203)
Backtrace:
An exception was thrown. Msg: from.d_window != 0 (file:
../../src/Packages/Uintah/Core/Grid/Array3.h, line: 203)
Backtrace:
0x3ffffccb54: SCIRun::AssertionFailed::AssertionFailed(const char*,const
char*,int)

Does anybody know what this means?

Q7: Has C-SAFE ever made it into the local paper?

Q8: Is there a way to print out the task graph that my input file generates?

Q9: When running timeextract on a large simulation run, my machine comes
to a screeching halt. What is going on?

Q10: What does it mean when I get the following error? I restarted a case
that gave me the following error message: :Parsing
file:///p/gf1/spinti/heptane_30cm_251.uda.005/checkpoints/index.xml
17:DataArchive::PatchHashMaps::parseOne:ERROR processor index (-1) is
out of bounds [0 , 2].

Q11: What is an example command line for running the MixedScheduler?

Q12: What arguments should I use with mpirun?

Q13: What do we know about the DBL_EPSILON being undefined under linux?

Q14: How is this different from 'could not find specific production for
Variable X'?

I ask because I'm having this error when I'm trying to get variable X
from the DW. I am requiring and getting the variable; that much is
certain, and the logic I have is telling me that I am computing it in
the previous time step...













CATEGORY: Arches




Question 1: ( Wed, 25 Sep 2002 -- S. Borodai )

  • What/where is the chem.bin file?

    Answer:

  • The chem.bin file in the directory
    /local/csafe/raid1/tester/Linux/coding.lock/dbg/
    has size 0:
    -rw-r--r-- 1 worthen csafe 0 Sep 24 19:51 chem.bin

    However, before you run the methane test for Arches, you have to
    copy chem_meth.linux.bin from
    src/Packages/Uintah/StandAlone/inputs/ARCHES/
    to chem.bin in the directory you are running the Arches tests from.
    Otherwise the tests will fail, since chem.bin is the input data file
    required to run almost any Arches problem. The tests were working
    before, so I assume the copy statement for chem.bin somehow got
    dropped in the process of making the tester more user friendly.

    Also, I don't see the input.dtd file in this directory, which is also needed.



    Question 2: ( (Date Not Specified) -- S. Borodai )

  • What does 'Caught exception: Allocating a CCvariable that is
    apparently already allocated!' mean?

    Answer:

  • > I ran the helium_1m.ups file and it ran past this:
    >
    > Time=0, delT=0.008, elap T = 1.97311, DW: 0, Mem Use = 68944368
    >
    > but then soon I got this error:
    >
    > > Caught exception: Allocating a CCvariable that is apparently already
    > > allocated!

    That error often happens when the number of processors on the command line
    is not equal to the number of patches in the ups file.





    CATEGORY: CVS




    Question 1: ( April 2, 2003 -- Randy Jones )

  • What should I set my CVSROOT and CVS_RSH environment variables to?

    Answer:

  • We are not using CVS any more. If you use CVS to checkout or update a
    source tree you will get the version of the source as of the date we
    switched over to Subversion. Please see the Subversion section of this FAQ.

    Please refer to:  Running Sus:  Checklist - Before you start



    Question 2: ( 1/03 -- S. Parker )

  • I've asked Yarden about this, but we don't know how to remove a lock.

    Whenever I try to check code into Core/Geom I get the following
    message from CVS:
    cvs server: [01:38:18] waiting for yarden's lock in
    /csafe_noexport/cvs/cvsroot/SCIRun/src/Core/Geom

    Answer:

  • Log into csf, and go to /csafe_noexport/cvs/cvsroot/SCIRun/src/Core/Geom.

    Do an "ls -a", and it should be pretty obvious which file is the lock.
    Then get Yarden or a sysadmin to remove it.
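
    To make the "pretty obvious" part concrete, here is a minimal sh sketch
    that lists the lock entries in one repository directory. It assumes the
    stock CVS lock names (#cvs.lock, #cvs.rfl.*, #cvs.wfl.*), and
    list_cvs_locks is a made-up helper name, not an existing script. It only
    lists; it removes nothing.

```shell
# List CVS lock entries in one repository directory (read-only).
# Assumption: locks use the stock CVS names "#cvs.lock", "#cvs.rfl.*", "#cvs.wfl.*".
# list_cvs_locks is a hypothetical helper name used only for this sketch.
list_cvs_locks() {
    # ls -a shows every entry in the directory; grep keeps only the lock names.
    ls -a "$1" | grep '^#cvs\.'
}
```

    For the case above you would run
    list_cvs_locks /csafe_noexport/cvs/cvsroot/SCIRun/src/Core/Geom
    and pass whatever it prints on to Yarden or a sysadmin.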





    CATEGORY: Cluster




    Question 1: ( September 3, 2003 -- Bryan/Biswajit )

  • How can I get and use a set of interactive nodes on the cluster?

    Answer:

  • For those of you who like interactive nodes (like me), and like to have
    multiple nodes in your interactive nodes (like me) to test out everything,
    I wrote a script called llogin that lives in the
    /usr/sci/projects/Uintah/scripts directory.

    Usage:

    llogin


    Question 2:

  • Why isn't sus running correctly? Looks like an MPI problem?
    Sometimes I see errors like this:

    > MPI process rank 0 (n0, p10588) caught a SIGSEGV in MPI_Type_extent.
    > Rank (0, MPI_COMM_WORLD): Call stack within LAM:
    > Rank (0, MPI_COMM_WORLD): - MPI_Type_extent()
    > Rank (0, MPI_COMM_WORLD): - MPI_Gather()
    > Rank (0, MPI_COMM_WORLD): - MPI_Allgather()
    > Rank (0, MPI_COMM_WORLD): - main()

    or

    > MPI_Bcast: invalid communicator (rank 0, MPI_COMM_WORLD)
    > Rank (0, MPI_COMM_WORLD): Call stack within LAM:
    > Rank (0, MPI_COMM_WORLD): - MPI_Bcast()
    > Rank (0, MPI_COMM_WORLD): - main()

    Or, sometimes my mpi job produces X sets of output and X udas. What's going on?

    Answer:

  • sus does not implicitly recognize every MPI implementation, so in these
    cases we need to tell it to use MPI explicitly:

    > mpirun -np X sus -mpi -component input.ups
                       ^^^^

    Use the "-mpi" flag. If you don't, MPI_Init is never called (which you
    will probably see as a warning when your run finishes), and you will
    not get most of your TAU output. Each process then presumably runs on
    its own as a separate serial sus, which would explain seeing X sets of
    output and X udas.

    The reason you must be explicit about MPI is that we have not yet
    determined how to make the code automagically figure out that it
    should use MPI on all clusters. FYI: this is the same reason that we
    have hardcoded usingMPI when running on the SP.

    One other thing to make sure of is that you are using the LAM mpirun
    (as opposed to the mpich one):

    > > which mpirun
    > /usr/local/lam-mpi/bin/mpirun




    ( 11/02 -- Dav )

  • What configure line do you use on the cluster?

    Answer:

  • LAM Linux cluster configure line:

    ../src/configure --enable-package=Uintah
    '--enable-optimize=-march=pentium4 -msse -msse2 -O3'
    --enable-assertion-level=0 --with-mpi=/usr/local/lam-mpi
    --with-petsc=/usr/sci/projects/Uintah/Thirdparty/1.0.0/Linux/gcc-3.2-lam-32bit/petsc-2.1.1
    --with-hypre=/usr/sci/projects/Uintah/Thirdparty/1.0.0/Linux/gcc-3.2-32bit/hypre-1.7.7b




    Question 3: ( Jan, 2003 -- M. Hartner )

  • How long can jobs run on the linux cluster? How many processors?

    Answer:

  • Please see:
    http://www.csafe.utah.edu/Information/Instructions/inferno.html#cpu_usage



    Question 4: ( (Date Not Specified) -- (Author Not Specified) )

  • When I submit a job on the cluster (from /tmp/banerjee in inf004 in
    this case) using qsub, I get the following message

    qsub: Bad UID for job execution

    What am I doing wrong?

    Answer:

  • You must run 'qsub' from inf001.



    Question 5: ( June 2003 -- Mark/Dav )

  • When I run a pbs batch job, my output files are not group/world readable.

    Answer:

  • The umask is hard-coded as 077 in the PBS src.

    I think they hard-coded it because jobs are not run through your shell,
    but are started directly by PBS, so they don't have a umask from your dot
    files.

    I can recompile with a different umask, but then every file created from a
    batch job would be world-readable.

    Something like this should work, near the end of your batch file:

    mpirun -O -np $NUM_PROCS /bin/tcsh -c "sus <args to sus>"


    Because sus is started through tcsh here, it picks up whatever umask
    setting you have in your .cshrc file.

    ** OR **
    After the mpi call in your batch script, you could:

    % cd top_of_data_dir
    % chmod -R go+rX *

    'Course, if the script doesn't finish, then this wouldn't work...
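
    To see what the umask actually does to new files, here is a small sh
    sketch (sh/bash, not the tcsh used elsewhere in this FAQ). It creates a
    scratch file under a given umask and prints the mode it ends up with.
    perms_under_umask is a made-up helper name, and 022 is only an example
    of a group/world-readable setting.

```shell
# Print the permission string a newly created file gets under a given umask.
# perms_under_umask is a hypothetical helper name used only for this sketch.
perms_under_umask() {
    (   # subshell, so the caller's own umask is left untouched
        umask "$1"
        tmpdir=$(mktemp -d) || exit 1
        : > "$tmpdir/out.dat"                 # empty file, standing in for job output
        ls -l "$tmpdir/out.dat" | cut -c1-10  # keep only the mode column
        rm -rf "$tmpdir"
    )
}

perms_under_umask 077   # prints -rw------- (the PBS setting described above)
perms_under_umask 022   # prints -rw-r--r-- (example group/world-readable setting)
```

    With 077 nobody but the owner can read the file, which matches the
    symptom in the question.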





    Question 6: ( June 2003 -- Mark Hartner )

  • How do I see which nodes on inferno are down?

    Answer:

  • xpbsmon is a good way to see the status of the cluster.
    pbsnodes -l will show which nodes are down.



    Question 7: ( June 2003 -- Mark Hartner )

  • I get errors such as:

    > > Unable to copy file 6991.inf001.OU to inf003.sci.utah.edu:/home/sci/likai/SCIRun/linux32opt/Packages/Uintah/StandAlone/mpm-8-1/batch.job.o6991
    > > >>> error from copy
    > > inf003.sci.utah.edu: Connection refused
    > > Unable to copy file 6991.inf001.OU to inf003.sci.utah.edu:/home/sci/likai/SCIRun/linux32opt/Packages/Uintah/StandAlone/mpm-8-1/batch.job.o6991
    > > >>> error from copy
    > > inf003.sci.utah.edu: Connection refused
    > > yboard-interactive).
    > > lost connection
    > > >>> end error output

    What does this mean?

    Answer:

  • Your home directory might be group writeable. The batch system uses ssh
    to copy files around, and it refuses to authenticate a user with a group
    writeable home directory. That is why you are getting errors.
    If you need a place to share files, I would suggest you make a
    subdirectory within your home directory and set the permissions
    appropriately.
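
    A quick way to check for this is sketched below in sh. It looks at the
    group write bit in the ls mode string; is_group_writable is a made-up
    helper name, and the chmod it suggests is the usual fix rather than
    something the batch system does for you.

```shell
# Succeed (exit status 0) if the given directory is group-writable.
# is_group_writable is a hypothetical helper name used only for this sketch.
is_group_writable() {
    # In a mode string such as drwxrwxr-x, character 6 is the group write bit.
    [ "$(ls -ld "$1" | cut -c6)" = "w" ]
}

if is_group_writable "$HOME"; then
    echo "$HOME is group-writable; 'chmod g-w \$HOME' would clear that bit"
fi
```

    If it prints nothing, the home directory is not group-writable and the
    copy errors have some other cause.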




    Question 8: ( June 2003 -- Steve )

  • Why are we not using sse and sse2 flags on debug builds on the cluster?

    Answer:

  • They really are optimization options - they make it use special
    instructions in the pentium 4 to make the code faster. It doesn't make
    a big difference, even in optimized mode. If you want to make the debug
    code faster, use --enable-debug="-O -g". G++ can mix debug and
    optimization.



    Question 9: ( Jul, '03 -- J. Davison de St. Germain )

  • How do I get system status on inferno (the linux cluster)?

    Answer:

  • Use the script 'usage' in /usr/sci/projects/Uintah/scripts/inferno.
    Or you can directly use the commands "qstat -a" or "pbsnodes -l".



    Question 10: ( Aug 2003 -- J. Davison de St. Germain )

  • For how much time can I run jobs on inferno?

    Answer:

  • Please see the usage policy at ../Instructions/inferno.html. Jobs that do
    not follow this policy may be deleted without notice.



    Question 11: ( November 03 -- Bryan )

  • What do I do when I have weird problems on the cluster?

    Answer:

  • Send mail to cluster-users@sci.utah.edu. Send your job number and job
    output files in the email.



    Question 12: ( November 03 -- Bryan )

  • What does it mean when I see an error like this running MPI on the cluster:

    > It seems that some error has occurred during MPI_INIT. This will
    > cause your process to abort. These kinds of errors are usually
    > system-related, such as running out of disk space, running out of
    > memory, or something more serious such as data not being passed
    > between processes properly. That is, you should not be seeing this
    > error message; if you are, somethings is likely Very Wrong with your
    > system. :-(
    >
    > Perhaps this Unix error message will help:
    >
    > Unix errno: 1252
    > Unknown error 1252

    Answer:

  • We have seen this before. If it happens, send your job id and job output
    files to cluster-users@sci.utah.edu. When we have had this problem, the
    cause was semaphores and other shared-memory items not being cleaned up.



    Question 13: ( Feb, 2004 -- Bryan, Stas, Mark )

  • I get the following warning when I submit a job on the cluster:

    Warning: no access to tty (Bad file descriptor).
    Thus no job control in this shell.

    Is anything wrong with the .pbs file?

    Answer:

  • No, you can safely ignore this message. It just means that there is
    no interactive control in your job.



    Question 14: ( December 2, 2004 -- Randy Jones )

  • How can I pass environment variables to the cluster nodes with mpirun on Inferno?

    Answer:

  • Optional flags to mpirun:

    -x varname[=value][,varname[=value],...]

    This passes the environment variable varname to all the nodes used in
    mpirun. If you specify the =value portion, the variable is set to that
    value on the nodes; otherwise the value of that variable in the current
    environment is used. For example:

    mpirun -x SCI_SIGNALMODE=exit,MALLOC_STATS=malloc_stats



    Question 15: ( May 05 -- Bryan/Dav )

  • How do I run 2+ serial jobs on 1 node on inferno so I can utilize all the CPUs?

    Answer:

  • Place the following as the command section of your batch script (instead of
    the mpirun ... line):

    program 1 &
    program 2
    wait

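    If you need to move output after one of the jobs finishes, you can try:
    ( program 1 ; cp $SCRATCHDIR1/*.dat $WORKDIR1 ) &
    program 2 ; cp $SCRATCHDIR2/*.dat $WORKDIR2
    wait

    The parentheses keep program 1 and its copy together as one background
    job; without them only the cp would run in the background and the two
    programs would run one after the other.
    You can set SCRATCHDIR as the dir the job runs in and WORKDIR as the current dir.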
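
    The pattern can be tried out without a real batch job. The sh sketch
    below uses a stand-in function (job, made up for this example) in place
    of the two serial programs and temporary directories in place of
    SCRATCHDIR and WORKDIR. It groups each program with its copy in
    parentheses so that the pair runs as one background job.

```shell
# Stand-in for a serial program: "computes" briefly, then writes NAME.dat
# into the given scratch directory. (Hypothetical; replace with real programs.)
job() {
    sleep 1
    echo "result of $1" > "$2/$1.dat"
}

scratch=$(mktemp -d)   # plays the role of SCRATCHDIR
work=$(mktemp -d)      # plays the role of WORKDIR

# Each parenthesized group is one background job, so a job's copy starts as
# soon as that job finishes, regardless of what the other job is doing.
( job one "$scratch" ; cp "$scratch/one.dat" "$work" ) &
( job two "$scratch" ; cp "$scratch/two.dat" "$work" ) &
wait                   # returns only after both background groups are done

files=$(ls "$work")    # both .dat files should have been copied by now
echo "$files"
rm -rf "$scratch" "$work"
```

    The wait at the end matters in a batch script: without it the script
    could exit, and the job could end, while the programs are still running.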



    Question 16: ( May 05 -- Todd )

  • Why does my scirun build (that I built from my main sus tree) crash?

    Answer:

  • We have found that building scirun with '-O3 -msse -msse2 -march=pentium4' can cause
    scirun to crash. We recommend configuring scirun with --enable-optimize
    instead of --enable-optimize=<flags>.



    Question 17: ( 6/17/05 -- Jim Guilkey )

  • I would like to put through a test case that requires 75 nodes for
    approximately 3 minutes later on today. This will tie up the queue
    until it runs but will make the nodes available again after 3 minutes.
    Does anybody have a problem with this?

    Answer:

  • Actually, on inferno, it won't tie up the queue. Inferno isn't
    FIFO. Rather, if you put in a 75 node job, it'll sit there until
    75 nodes are free, which might be a while, since if anyone puts in
    a job behind yours, even if there are 74 nodes free, they'll go
    first.





    CATEGORY: Coding




    Question 1: ( (Date Not Specified) -- (Author Not Specified) )

  • In looking at the code to help Jim track down a memory leak, I
    did a cursory search for new in both *.h and *.cc files within
    Uintah/. There are many instances where new is used instead of scinew.
    Is there any reason we should prefer new over scinew? If not, then I
    will go in and change the new to scinew.

    Answer:

  • Use scinew wherever you can, but there are cases where you cannot.
    Unfortunately, you cannot use scinew to allocate an array of user-defined
    objects (as opposed to the built-in types).

    Do not use scinew in Array3, or in any templated function where you
    are allocating an array of the templated type (e.g. new T[5]). This
    is due to a "bug" in the C++ specification and various vendors'
    interpretation of the spec. The allocation of the data in Array3.h
    is the dangerous one.

    Allocating a single object with scinew is fine; only arrays are a problem.



    Question 2: ( (Date Not Specified) -- S. Parker )

  • Why the single makefile?

    Answer:

  • That is only one of the reasons that we went towards the single makefile
    approach. The global make clean is just an artifact of the single
    makefile. Adding a local make clean would be hard, unless it just
    did:
    find . -name "*.o" -print | xargs rm

    which would work but might not always do what you want either.

    Personally, I never use make clean. I just do rm *.o;gmake or
    the above find statement. This leaves the .ii files which makes for
    a faster link.

    Anybody that wants to implement a local make clean, here is the
    idea of how to do it: use gmake's pattern match rule to look for
    all of the $(CLEANOBJS) at or below the subdir.
    clean:
            rm $(filter $(DIR)/%, $(CLEANOBJS))


    where getting DIR is left as an exercise to the reader...
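
    Until someone implements that rule, the find approach can be limited to
    one subdirectory from the shell. This is only a sketch in sh:
    local_clean is a made-up helper name, and it uses find's -exec ... +
    form in place of xargs so that it does nothing when no .o files match.

```shell
# Remove object files at or below one directory, leaving everything else
# (including the .ii files) in place.
# local_clean is a hypothetical helper name used only for this sketch.
local_clean() {
    # -exec ... {} + batches the file names much like xargs would, but it
    # copes with spaces in paths and runs nothing when there is no match.
    find "$1" -type f -name '*.o' -exec rm -f {} +
}
```

    For example, local_clean Core/Geom would remove the objects under
    Core/Geom only, so the next gmake rebuilds just that part of the tree.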



    Question 3: ( (Date Not Specified) -- Dav )

  • How do I use debug streams? (Environment variable?)

    Answer:

  • setenv SCI_DEBUG TaskGraph:+[FileNameToStoreInfoIn][,VarName:+[file]]
    




    Question 4: ( (Date Not Specified) -- Dav )

  • How do I use TAU?

    Answer:

  • in configVars.mk:


    TAU_MAKEFILE := /res/sci/data1/TAU/tau/sgin32/lib/Makefile.tau-sgitimers-sproc
    ifneq ($(TAU_MAKEFILE),)
    include $(TAU_MAKEFILE)
    endif

    ....

    type "make cleantau;make"

    On Nirvana:
    /usr/projects/Uintah/tau/sgi64/lib/Makefile.tau-profile-trace


    % tau_merge *.trc sus.trc
    % tau_convert -pv sus.trc tau.edf sus.pv
    % vampir sus.pv


    For subsets of the trace files (when you have a lot of trace files):


    > tau_merge sus1.trc sus11.trc sus21.trc sus.trc
    > tau_convert -nocomm -pv sus.trc tau.edf sus.pv
    > vampir sus.pv




    Question 5: ( (Date Not Specified) -- (Author Not Specified) )

  • How do you compile on LLNL? I've been just doing it on 1
    processor. Do you submit a job and use more processors? How
    do I compile at LLNL?

    Answer:

  • I've been just running "gmake -j4" on blue. Or -j8 on frost. I have
    not been submitting a job. If you are compiling interactively, I
    suggest using frost as it is much faster. However, if the login node
    on frost is being hammered, you can try submitting an "xterm" job and
    then compiling on the node you get. Usually it doesn't take too long
    to get a single node this way.

    I think something like this should work:


    > echo "xterm -display taurus.sci.utah.edu:0" | psub


    Make sure you have xhost + set on your local machine.




    Question 6: ( (Date Not Specified) -- (Author Not Specified) )

  • Monitor/top machine usage monitoring tool on Livermore IBM SP?

    Answer:

  • One other little trick... if you want to see the "top" output on LLNL,
    you need to use the monitor program. I have mine aliased:

    frost001:22:~> which top
    top: aliased to "monitor -top"




    Question 7: ( Sep 2002 -- T. Harman )

  • How long does it take to compile SCIRun/sus?

    Answer:

  • Here are some recompile times for both blue and rapture with optimized
    builds. I used 4 processors on both machines and I touched the same
    file. You might mention this at your next crt conference call.



    Machine    Real (s)   User (s)   System (s)
    Blue       793.49     280.32     173.03
    Rapture    188.791    196.366    26.554
    Frost      216.57     117.10     68.41


    This web page has some historic compile time results:


    http://www.csafe.utah.edu/Information/Instructions/CompileTimes.html




    Question 8: ( (Date Not Specified) -- (Author Not Specified) )

  • How do I do performance analysis?

    Answer:

  • Here is something I found out: do NOT compile your program with -pg,
    but DO use -g. Then do this:
    setenv LD_PROFILE libPackages_Uintah_CCA_Components_MPM.so
    ./sus ...
    sprof ../../../lib/libPackages_Uintah_CCA_Components_MPM.so

    Unfortunately, it will only give you profiles for a single .so, which is
    very annoying, but it is a first step.

    Steve

    > Steve was right regarding gprof: I made some stupid little so and linked
    > it with a small program, and gprof doesn't seem to cross so's. I tested
    > this about a million times with different function usage, and different
    > linking styles (linking against a static library, or just linking all the
    > files together). Everything seems to work except the shared libs.



    Question 9: ( (Date Not Specified) -- S. Parker )

  • How do I get the debugger to come up automatically (under
    linux)?

    Answer:

  • Here is the magic environment variable to get gdb in a new window
    whenever sus crashes on linux.


    setenv SCI_DBXCOMMAND "gnome-terminal -x gdb sus %d"
    or
    setenv SCI_DBXCOMMAND "xterm -e gdb sus %d"

    This only works if you run sus from Packages/Uintah/StandAlone,
    otherwise you will need to add a path to sus (gdb
    /whatever/Packages/Uintah/StandAlone/sus)



    Question 10: ( Feb '03 -- Guilkey, Parker )

  • On my SGI with a mountain fresh build I'm picking up


    ld64: ERROR 28 : GP-relative sections overflow by 0x35d1 bytes. Please recompile with a smaller -G value.
    You can see gprel section layout with -m -aoutkeep
    See the explanation in the gp_overflow(5) manpage.
    ld64: INFO 152: Output file removed because of error.
    --- lib/libCore_Datatypes.so ---
    *** Error code 2 (ignored)
    C++ prelinker: warning: could not locate library -lCore_Datatypes; assuming /usr/lib/libCore_Datatypes.a
    C++ prelinker: warning: nm returned a nonzero error status
    ld64: FATAL 9 : I/O error (-lCore_Datatypes): No such file or directory
    gmake: *** [lib/libCore_Algorithms_Geometry.so] Error 2


    Here's my configure line

    ../src/configure --enable-64bit --enable-package=Uintah --with-thirdparty=/usr/installed/Thirdparty/1.7/IRIX64/MIPSpro-7.3.1.1m-64bit --enable-optimize=-Ofast


    Should I just turn optimize down to O2 or is there a magic systune
    knob to turn?

    Answer:

  • Jim writes: Magic knob, you need to set -G0, e.g.

    Steve writes: Turn optimization down to O2 and compile Core/Datatypes,
    then you can turn it back up. It is not a systune variable, it is a
    problem with Core/Datatypes getting too big - we haven't seen it in a
    while. (Not sure if this is pertinent based on Jim's response...)



    Question 11: ( Feb '03 -- Dav )

  • How can I track down memory problems in sus/SCIRun?

    Answer:

  • Make sure your build is configured with --enable-sci-malloc.
    If you set the environment variable MALLOC_STRICT (under tcsh: setenv
    MALLOC_STRICT) then the memory management system will fill "memory"
    with "bogus" data that can help track down memory errors. NOTE: if
    you set MALLOC_STRICT and suddenly your program starts dying, it is
    very likely that there is an uninitialized variable in your code that
    (luckily) defaulted to 0 and thus worked... However the default to 0
    is a coincidence and should not be relied upon.
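
    To see why bogus fill data exposes this kind of bug, here is a toy model in Python. It is not the SCIRun allocator: the allocate function and the fill pattern are made up for illustration.

```python
# Toy model only: not the SCIRun allocator. The fill pattern is a made-up
# value chosen for illustration.
BOGUS = 0xA5A5A5A5


def allocate(count, strict):
    """Return `count` slots: bogus-filled when strict, zero-filled otherwise."""
    return [BOGUS if strict else 0] * count


def buggy_total(values, strict):
    """Sum `values` into a freshly allocated slot without initializing it."""
    slot = allocate(1, strict)  # bug: relies on slot[0] happening to start at 0
    for value in values:
        slot[0] += value
    return slot[0]


print(buggy_total([1, 2, 3], strict=False))  # correct only by luck
print(buggy_total([1, 2, 3], strict=True))   # visibly wrong, so the bug shows up
```

    The code is equally wrong in both runs; the strict fill only makes the error impossible to miss.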




    Question 12: ( Feb '03 -- Dav )

  • What is LD_LIBRARY_PATH used for?

    Answer:

  • The environment variable LD_LIBRARY_PATH tells the runtime linker
    where to look for dynamic libraries that need to be loaded by your
    program. If your LD_LIBRARY_PATH variable points to libraries that
    were created by a different compiler than your application, you can
    experience strange behavior. Usually LD_LIBRARY_PATH should not be
    set (as sus/scirun build in library path information when they are
    linked), however you can use this variable to dynamically use
    different libraries if you know what you are doing.
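
    If you suspect LD_LIBRARY_PATH is causing trouble, a quick first step is to look at what it contains. This small sketch (not part of SCIRun) lists each entry and whether the directory exists; it cannot tell which compiler built the libraries.

```python
import os


def library_path_entries(value):
    """Split a colon-separated search path, dropping empty entries."""
    return [entry for entry in value.split(":") if entry]


def report(value):
    """Pair each entry with True/False for 'directory exists'."""
    return [(entry, os.path.isdir(entry)) for entry in library_path_entries(value)]


for entry, exists in report(os.environ.get("LD_LIBRARY_PATH", "")):
    print(("ok       " if exists else "missing  ") + entry)
```

    If the variable is unset, the loop prints nothing, which is the usual recommended state described above.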



    Question 13: ( Feb '03 -- Parker )

  • Just an answer to a question I asked on Monday. I'm now running a 32
    node job on inf. I think the problem I was having on Monday may have
    been related to iterating outside the bounds of my arrays. It's not
    clear why this didn't kill smaller jobs as well, but that's the only
    thing that has been fixed that I know of.

    Answer:

  • With an optimized build, this is not surprising. Iterating just outside
    of a small array is still "close" in memory. However, iterating just
    outside of a large array can be very far away, causing a crash. Trying
    it on a debug build you should have gotten an assertion failure
    independent of the size of the array.

    The lesson: if weird things happen, try a debug build...



    Question 14: ( Mar '03 -- Parker )

  • I did a cvs update -Pd from src, and then a gmake, and I get this error:

    > gmake: *** No rule to make target `../src/Core/Util/sci_system.c',
    > needed by `Core/Util/sci_system.o'. Stop.

    How do I fix this?

    Answer:

  • (See also the "repair.sh" entry in this FAQ.)

    This is a typical error when you do an update after files have been
    removed from SCIRun. The easiest fix is:

    touch ../src/Core/Util/sci_system.c
    gmake
    rm ../src/Core/Util/sci_system.c

    The problem occurs because the make system is trying to determine the
    "age" of the dependency file in order to determine if the (.cc/.c)
    file in question should be rebuilt (into a new .o). This also occurs
    frequently if a .h file is removed from the tree. Other times this
    occurs include when you build on one architecture (or specific
    machine) and then try to build on a different architecture (and
    sometimes machine.)
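
    The "age" comparison described above can be sketched as follows. This is a simplified model of one make rule, not gmake itself: the times are made-up numbers, and None stands for a file that no longer exists and that no rule can create.

```python
def make_decision(target, target_mtime, dep_mtimes):
    """Decide what make would do for one target.

    target_mtime is None when the target does not exist yet. dep_mtimes maps
    each dependency name to its modification time, or None when it is missing.
    """
    for dep, mtime in dep_mtimes.items():
        if mtime is None:
            # Simplification: assumes no other rule can create the file.
            return f"No rule to make target '{dep}', needed by '{target}'"
    if target_mtime is None or any(m > target_mtime for m in dep_mtimes.values()):
        return "rebuild"
    return "up to date"


obj = "Core/Util/sci_system.o"
src = "../src/Core/Util/sci_system.c"
print(make_decision(obj, 100, {src: None}))  # the file was removed from the tree
print(make_decision(obj, 100, {src: 200}))   # after "touch": the file exists and is newer
```

    This is why the touch works: make only needs to be able to check the age of the file named in the stale dependency information.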



    Question 15: ( April 2004 -- Bryan Worthen )

  • How do I make emacs insert tabs instead of spaces?

    Answer:

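  • The indent-tabs-mode variable controls this. Note that the line below
    sets it to nil, which makes emacs indent with spaces instead of tabs:

    (setq-default indent-tabs-mode nil)

    To make emacs indent with tabs, set it to t instead.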



    Question 16: ( Aug 30, 2004 -- J. Davison de St. Germain )

  • I'm getting an error message like the following when compiling (after
    I did a 'cvs update'):

    No rule to make target `../src/Dataflow/Modules/Render/SCIBaWGL.h',
    needed by `Dataflow/Modules/Render/OpenGL.o'. Stop.

    (Note, the 'no rule' target can be anything and the 'needed by' can
    also be anything.)

    Answer:

  • ...you can use the "repair.sh" script located in .../SCIRun/src/scripts/ to
    fix this.

    eg:


    > cd SCIRun/<bin>
    > ../src/scripts/repair.sh SCIBaWGL.h

    The repair script will search all the .d files (or depend.mk files on
    the SGI) for the bad include file (in this case, SCIBaWGL.h) and
    remove the corresponding .d and .o files. Then you can just type make
    and it will rebuild what is necessary.
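
    For the curious, the idea behind the repair script can be sketched in Python. This is not the actual repair.sh; it assumes each .d file sits next to a .o file with the same base name, and it only lists what would be removed.

```python
import os


def stale_outputs(dep_contents, header_name):
    """dep_contents maps each .d path to its text; return the files to delete."""
    doomed = []
    for dep_path, text in sorted(dep_contents.items()):
        if header_name in text:
            # Assumption: the object file shares the .d file's base name.
            doomed += [dep_path, dep_path[:-2] + ".o"]
    return doomed


def read_dep_files(build_dir):
    """Collect the contents of every .d file under build_dir."""
    contents = {}
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            if name.endswith(".d"):
                path = os.path.join(root, name)
                with open(path, errors="replace") as handle:
                    contents[path] = handle.read()
    return contents


# List (without deleting) what mentions the bad header under the current directory.
for path in stale_outputs(read_dep_files("."), "SCIBaWGL.h"):
    print(path)
```

    Use the real script for actual repairs; the sketch is only meant to show why removing those two files lets make regenerate clean dependency information.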





    CATEGORY: Configure




    Question 1: ( Jan '03 -- Dav )

  • Why do I get (and how do I fix) this error during configure:

    ./config.status --recheck running /bin/sh ../src/configure --with-thirdparty=/export/space/scratch/SCIRun1.8.0/1.8/Linux/gcc-3.2-32bit '--enable-package=BioPSE MatlabInterface' --enable-debug --no-create --no-recursion

    checking for gcc...
    gcc
    checking for C compiler default output...
    a.out
    a.out
    conftest.c
    checking whether the C compiler works...
    configure: error: cannot run C compiled programs.
    If you meant to cross compile, use `--host'.

    or (void*) problem.




    Answer:

  • This error usually occurs when you are using a different compiler (or
    compiler version) than the Thirdparty was compiled with. Another
    possibility is if you have your LD_LIBRARY_PATH variable set with
    stuff that does not work with the default compiler.

    Also, if configure was created using the wrong version of autoconf,
    this might happen.



    Question 2: ( A long time ago... -- (Author Not Specified) )

  • Make complains that fspec.pl is not executable.

    Answer:

  • Configure is supposed to "chmod +x" this file, but it appears not
    to do so the first time. If you run into this problem, do the
    chmod manually (chmod +x Packages/Uintah/tools/fspec.pl).





    CATEGORY: Documentation




    Question 1: ( Feb '03 -- Dav )

  • Where is web documentation on the Q machine (qscfe1 @ LANL)?

    Answer:

  • Local docs:
    http://www.csafe.utah.edu/Information/Instructions/qsc.html

    LANL docs (you will need a Z# and a pass code):

    https://icnn1.lanl.gov/ldswg/icnn/content/qsc/help




    Question 2: ( Oct 2003 -- Bryan Worthen )

  • How do I use doxygen?

    Answer:

  • See doxygen.html





    CATEGORY: Fortran




    Question 1: ( (Date Not Specified) -- S. Parker )

  • What to do with variable names that are too long. How do I use
    this PASS3 thing that you mention?

    Answer:

  • To pass an array into fortran, we must also pass the lower and
    upper bounds. On the SGI, we do this with two integer arrays
    (low and high) with 3 elements (for the x,y,z bounds). However,
    GNU fortran does not allow this type of array:

    double precision A(low(1):high(1), low(2):high(2), low(3):high(3))
    


    It does however allow this:

    double precision A(low_x:high_x, low_y:high_y, low_z:high_z)
    


    So the fortran interface generates 2 different versions: the first
    form on SGI because it is more efficient, and the second form on linux
    because it works. For the most part, the fortran code doesn't see
    this. However, if you are passing an array into a subroutine, you
    need to do:

    call sub(A, A_low, A_high)


    on the SGI and:

    call sub(A, A_low_x, A_low_y, A_low_z, A_high_x, A_high_y, A_high_z)


    on linux/g77. To make this easier, I made the PASS3 macro, which is
    short for passing a 3 dimensional array.

    call sub(PASS3(A))

    which will do the right thing in both cases. The only problem is with
    very long array names. When this gets expanded:
    call sub(PASS3(long_name))

    to:
    call sub(long_name, long_name_low_x, long_name_low_y, long_name_low_z, ...
    

    then it will easily overflow the 72 character limit for fortran code.
    Thus the need for the PASS3A/PASS3B macros:
    call sub(PASS3A(long_long_name)
    
    & PASS3B(long_long_name),

    which will just split the name expansion onto two different lines of less
    than 72 characters.
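
    To get a feel for how quickly the expansion grows, the sketch below builds the expanded call for a given array name and reports its length. It only imitates the linux/g77 form shown above; the real macro is expanded by the preprocessor, and the 6-column indent and ", " separator are assumptions about layout.

```python
FIXED_FORM_LIMIT = 72  # last column read in fixed-form Fortran source


def expand_pass3(name):
    """Argument names that PASS3(name) stands for in the linux/g77 form."""
    bounds = [f"{name}_{side}_{axis}" for side in ("low", "high") for axis in "xyz"]
    return [name] + bounds


def call_line(sub, array_name, indent=6):
    """Render the expanded call as a single fixed-form source line."""
    args = ", ".join(expand_pass3(array_name))
    return " " * indent + f"call {sub}({args})"


line = call_line("sub", "long_long_name")
print(line)
print(len(line), "columns; limit is", FIXED_FORM_LIMIT)
```

    Any line longer than the limit is a candidate for the PASS3A/PASS3B split.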



    Question 2: ( August 2003 -- Bryan Worthen )

  • Can I use Fortran 90 compilers or does sus only support Fortran 77?

    Answer:

  • So far we have done little investigation into compiling with Fortran 90.
    We intend to look into this a little more in the future, but not for the
    moment. However, if you know what you're doing, you may try to use Fortran
    90.



    Question 3: ( August 2003 -- Bryan Worthen )

  • Do I need to use gen-fspec with my fortran code? If so, how do I set it
    up?

    Answer:

  • Click for the answer





    CATEGORY: Graphics




    Question 1: ( September 3, 2003 -- Kurt/Biswajit )

  • How can I make mpeg movies from the raw frames?

    Answer:

  • You'll need a few pieces of software:
    pnmflip (can be found on rapture and used by raw2ppm.csh below)
    raw2ppm (on rapture also, and also used by raw2ppm.csh)
    mpeg_encode (grab a copy for an SGI from ~kuzimmer/bin)

    You'll need a parameter file for mpeg_encode:
    look at ~kuzimmer/tools/pnm.param

    And you'll need a simple cshell script:
    look at ~kuzimmer/tools/raw2ppm.csh

    Copy the raw2ppm.csh file into the directory where all of your
    *.[moviename].raw files are.
    Edit the dimensions in the script to match the frame size of your raw frames.

    Then type raw2ppm.csh at the command line.
    It will begin converting all of your raw files to ppm files.
    mpeg_encode likes ppm files. While you are converting files you will want to
    copy the pnm.param file to this same directory.

    You will also want to edit the pnm.param file. Edit the OUTPUT line (line 4
    in my pnm.param file) to set the file name for the movie. Then edit the INPUT
    (or lines 16-18 in my file) to match the names of your *.[moviename].ppm files
    then set the beginning and end numbers. So for example if you want to make a
    movie of the files 021.mymovie.ppm to 653.mymovie.ppm the INPUT section of the
    parameter file would read:

    INPUT
    *.mymovie.ppm [021-653]
    END_INPUT

    Once you have all of your .ppm files and you've edited your parameter file,
    just type:
    mpeg_encode pnm.param

    If you have multiple directories of raw files, the easiest thing to do is
    change the numbering of the files, then merge them together into one
    directory, then perform the above steps.
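
    Two of the steps above are easy to script. The sketch below generates the INPUT section for a frame range and works out new names when renumbering frames from a second directory. The function names are made up for illustration, and frame names are assumed to follow the NNN.moviename.ext pattern used above.

```python
def input_section(movie_name, first, last, width=3):
    """Return the INPUT block for frames first..last of movie_name."""
    frame_range = f"[{first:0{width}d}-{last:0{width}d}]"
    return "\n".join(["INPUT", f"*.{movie_name}.ppm {frame_range}", "END_INPUT"])


def renumber(frame_names, start, width=3):
    """Map each frame name to a new name numbered consecutively from `start`."""
    mapping = {}
    for offset, old in enumerate(sorted(frame_names)):
        _number, rest = old.split(".", 1)
        mapping[old] = f"{start + offset:0{width}d}.{rest}"
    return mapping


print(input_section("mymovie", 21, 653))
print(renumber(["001.mymovie.raw", "002.mymovie.raw"], 654))
```

    The renumber function only returns the old-to-new mapping; applying it (for example with os.rename) is left to you so you can check the names first.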



    Question 2: ( September 3, 2003 -- Jim/Biswajit )

  • How can I configure and run the Real Time Ray Tracer (rtrt) to make
    movies?

    Answer:

  • Step 1: Go to one of the SGI parallel machines (rapture, muse etc.).

    Step 2: Configure and build.
    ../src/configure '--enable-package=Uintah Teem rtrt' --enable-optimize
    --enable-64bit --with-glut=/usr/sci/local --with-glui=/usr/sci/local
    --with-teem=/usr/sci/projects/SCIRun/Thirdparty/teem/IRIX64/MIPSpro-7.3.1.3m-64bit

    gmake -j2

    Step 3: Set the display variable to your machine.
    setenv DISPLAY [yourmachine].utah.edu:0.

    Step 4: Run rtrt.
    rtrt -np 16 -no_shadows -bv 0 -scene scenes/uintahparticle2 -rate 1.0
    /local/csafe/raid1/[uda_file] -timesteplow 55 -timestephigh 55
    -timestepinc 1 -radius 0.0008

    Basic Instructions for RTRT :
    Left click - sets min crop value
    Middle click - color by this value
    Right click - sets max crop value

    Use these to twiddle with the color map range
    Control+Left click - Set min value for color map
    Control+Middle click - Reset color map range
    Control+Right click - Set max value for color map

    Use these to narrow in on a region of the histogram
    Shift+Left click - Set min for histogram viewing
    Shift+Middle click - Reset histogram viewing to original
    Shift+Right click - Set max for histogram viewing

    The only way to control the animation rate is from the command line (yech!).
    You can specify the animation rate with
    -rate [number of frames to display in one second -- default 3].
    This can be a float. If you want to display each frame for 2 seconds,
    use -rate 0.5.

    As far as the movie making thing went, I used MovieMaker and then converted
    the file to a mpeg. You have the raw movie file. Try the QuickTime format
    too, and see how they compare with quality/size.

    I try to only create movies that are less than 2 minutes. If you want, we
    can get the media crew to piece together some sequences. For a presentation,
    movies longer than 30 seconds to a minute get really boring.



    Question 3: ( 01/26/06 -- Todd )

  • How do I make a montage of jpg images?

    Answer:

  • Suppose you have 9 jpgs that you want resized to 640x480 and placed in a single image

    montage -geometry "640x480" -tile 3x3 1.jpg 2.jpg 3.jpg 4.jpg 5.jpg 6.jpg 7.jpg 8.jpg 9.jpg montage.jpg
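
    For a different number of images, the command line can be generated instead of typed. This sketch only builds the argument list; choosing a near-square grid for -tile is an assumption of the sketch, not something montage requires.

```python
import math


def montage_command(images, size="640x480", output="montage.jpg"):
    """Build a montage command that tiles the images in a near-square grid."""
    columns = math.ceil(math.sqrt(len(images)))
    rows = math.ceil(len(images) / columns)
    return ["montage", "-geometry", size, "-tile", f"{columns}x{rows}", *images, output]


images = [f"{i}.jpg" for i in range(1, 10)]
print(" ".join(montage_command(images)))
```

    The list form can be passed directly to subprocess.run if you would rather run it from Python.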






    CATEGORY: LANL




    Question 1: ( 1/03 -- Dav )

  • How do I log into the LANL machines (Theta,Q)?

    Answer:

  • Use your crypto card to get your login password. Then "ssh
    portal.lanl.gov". From portal, you can ssh to theta or qscfe1.
    When going to qscfe1 from portal, you must use "ssh -1 qscfe1".



    Question 2: ( 10/03 -- Randy )

  • Why do my submitted jobs not start on Q?

    Answer:

  • If you are seeing this message:

    prun: Error: insufficient cpus in allocated resource use -O to override

    Then you might want to check that you typed "bsub < batch.job" instead
    of typing "bsub batch.job" which will not work.

    If your job dies on startup because of a "Caught: unknown exception",
    then just try re-submitting your job. This is a known problem that
    we are still chasing. It seems to only happen on large (128 procs or
    greater) runs.



    Question 3: ( Oct 2003 -- Bryan Worthen )

  • How can I send data faster from the labs (pscp)?

    Answer:

  • See pscp.html



    Question 4: ( Oct 2003 -- Bryan Worthen )

  • How do I use long-term storage at LANL?

    Answer:

  • 1) Make sure you are registered to use this service.
    Go to https://register.lanl.gov
    Click main menu on the left side.
    Under Authentication Accounts, click on "High Performance Computing"
    If you don't see "Open HPSS Storage" under the list of granted accounts,
    click on "Request New Account" on the left side.
    Check the Open HPSS Storage box, and click Submit.

    It could take a while to get your account, so try to do this before
    you need it.

    2) From Q (or somewhere else at LANL), type psi. This puts you inside
    your HPSS filesystem, where the usual unix file commands work on HPSS
    files. If you prepend a bang (!) to a command, it runs on the local
    filesystem instead.

    The command 'store' will copy a file/directory to HPSS, and the command
    'get' will copy it to the current local directory.



    Question 5: ( July, 2004 -- Dav )

  • How do I log into Q (or LANL) now that portal is gone?
    OR
    How do I use VPN with LANL Q?

    Answer:

  • I just installed the Windows VPN client that lanl provides. You can
    get it here (you will need your z# and password):

    http://protected.lanl.gov/nst/VPNinstructions.html

    There are also downloads and instructions for Linux/Solaris and Mac.

    I followed their simple instructions and it went very smoothly. I
    was able to connect using VPN and then "ssh dav@qscfe1.lanl.gov"
    without a problem.

    Once on qscfe1, I was able to ssh and scp back to muse/raid1. It
    seems like this should be a viable, if not extremely convenient,
    method of doing work at LANL. You won't be able to do this on any
    machine that requires a local network to be maintained (ie, any
    machine that mounts a necessary network drive.)





    CATEGORY: LLNL




    Question 1: ( April 2, 2003 -- Randy Jones )

  • Where is Hypre at LLNL?

    Answer:

  • Please refer to:  Building sus on Frost:  Step 5

    Randy Jones: The following is no longer needed (I believe):

    Ok, at LLNL this is where everything is:

    HYPRE_DIR := -L/usr/apps/hypre/beta/lib
    HYPRE_INC := -I/usr/apps/hypre/beta/include
    HYPRE_LIB := -lHYPRE_LSI -lHYPRE_blas -lHYPRE_struct_ls
    -lHYPRE_struct_mv

    and on rapture, it is in my home directory and you want the 1.7.7b
    version. If you are getting errors in the mli_* files, do this:

    mv FEI_mv FEI_mv.hide
    ./configure
    make

    Apparently they do their development on Linux and didn't run into this
    problem. They said it's fixed now but the version hasn't been released
    yet.



    Question 2: ( Thu, 12 Sep 2002 -- Wing )

  • What are some hints on running at LLNL?

    Answer:

  • I had a meeting with Barbara while I was at LLNL. She gave me some
    hints on using blue and frost.

    Here are some of the questions that I asked:

    pdebug vs pbatch ?
    It's not always faster to submit your jobs in the debug pool. If the
    debug is being used a lot (like on frost), submitting to the batch pool
    with a short time (like 30 mins) will get your job to run earlier. A
    good command to check is "spj"

    Leaving a processor free on each node?
    From her experience, leaving a processor free on each node doesn't
    help much on blue (there are only 4 processors per node) but helps a lot
    at frost. She said they can rearrange the configuration and give us the
    debug node also next time for our big run.

    Optimal big case during normal run?
    Using the lowest maximum allowed can usually get the cases to run pretty
    fast. Like on frost, 24 nodes is the max during the day and lots of
    people run cases that size. If you ask for 32, then it might not run
    for several days. On blue the maximum is 128 nodes, but of course there
    is the 2 hr. limit during the day. But I guess we will just have to do
    the dependent condition. Basically, by doing this you have a better
    chance of getting your job to run, since it can run either during the
    day or at night.

    Another helpful pstat command that I'm using is:

    pstat -A -o jid,name,user,status,maxtime,used,maxnodes,xct,prio
    JID    NAME           USER      STATUS     MAXCPUTIME  USED  MAXNODES  XCT  PRIORITY
    11867  nb_pen_nw.run  deveritt  *MULTIPLE  50:00       0:00  0         0    0.000

    This will give you info about other cases, such as how much time they
    asked for, how much they have used, and their priority.



    Question 3: ( April 2, 2003 -- Randy Jones )

  • Where are the thirdparty libs located on Blue/Frost?

    Answer:

  • Please refer to:  Building sus on Frost:  Step 5



    Question 4: ( April 2, 2003 -- Randy Jones )

  • What configure line do you use on Frost (the IBM SP at LLNL)?

    Answer:

  • Please refer to:  Building sus on Frost:  Step 5



    Question 5: ( Jan '03 -- Dav )

  • How do I use long term storage at LLNL?

    Answer:

  • I have looked into the question of long term storage at LLNL. Turns
    out that it is as simple as ftp'ing whatever you want to
    storage.llnl.gov.

    > ftp storage.llnl.gov

    With ftp you can "mkdir", "cd", "put", and "get", etc. I have not
    tried it, but it is supposed to be very easy.



    Question 6: ( Feb '03 -- Wing )

  • How do I check our machine queue usage/time on frost/blue at LLNL?

    Answer:

  • Here is the command. Change the dates as needed.

    pcsusage -bm -b utah -u all -tb oct 01 2002 -te dec 31 2002




    Question 7: ( Apr '03 -- Dav d. )

  • How do I request dedicated time on frost or blue (LLNL)?

    Answer:

  • First, coordinate the request with Dav and the Homebrew team.

    You will need to IPA first at this web site:

    https://access.llnl.gov/ipa/login


    Then go to this web site:

    https://lc.llnl.gov/computing/forms/expedited_runs.html


    It will ask you for your LLNL user id and password.



    Question 8: ( July, 2003 -- James/Dav )

  • On Frost, why doesn't it let me allocate more than 256MB of memory?

    Answer:

  • By default, AIX executables can use only 256MB. This is
    determined by a bit in the header of the executable; it
    is not a property of the code itself.

    You can change this setting at link time by adding the link
    option '-bmaxdata:0x80000000' to your link line. No recompilation
    is otherwise necessary. The leading '8' indicates how many
    256MB segments you want to have (for a maximum of 2GB).

    You can determine an existing executable's limit using 'dump -ov a.out'.
    The last two lines will be something like:

    maxSTACK maxDATA SNbss magic modtype
    0x00000000 0x00000000 0x0003 0x010b 1L

    The number under maxDATA indicates how much memory you
    can use. The default '0x00000000' is 256MB.

    You can change an existing executable's limit using the
    'setbmaxdata' script:

    setbmaxdata 8 a.out

    Then using dump -ov a.out, you will see for the last two lines:

    maxSTACK maxDATA SNbss magic modtype
    0x00000000 0x80000000 0x0003 0x010b 1L

    To speed up debugging, etc. we often recommend using
    0x70000000 unless your application really needs all
    2GB. I would also recommend using 'dump -ov' on your
    executable linked with -bmaxdata to make sure
    you are getting what you want.

    BTW, if you are using g++, you need to add -Wl, before
    -bmaxdata in order to get it to work properly. Otherwise,
    g++ will interpret it as -b -m -a, etc. which causes
    really bad things to happen and cryptic error messages.
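
    The arithmetic behind the -bmaxdata value is simple: the number of 256MB segments times 0x10000000. A small sketch (the helper names are made up for illustration) that produces the value and the matching link flag:

```python
SEGMENT_BYTES = 256 * 1024 * 1024  # one 256MB segment, i.e. 0x10000000


def maxdata_value(segments):
    """Hex value for the given number of 256MB segments (1 to 8)."""
    if not 1 <= segments <= 8:
        raise ValueError("the text above gives 8 segments (2GB) as the maximum")
    return f"0x{segments * SEGMENT_BYTES:08x}"


def link_flag(segments, via_gxx=False):
    """Link option, with the -Wl, prefix when linking through g++."""
    flag = f"-bmaxdata:{maxdata_value(segments)}"
    return f"-Wl,{flag}" if via_gxx else flag


print(link_flag(8))                # the full 2GB case
print(link_flag(7, via_gxx=True))  # the smaller setting suggested for debugging
```

    As recommended above, check the result with 'dump -ov' on the linked executable rather than trusting the flag alone.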



    Question 9: ( Aug 30, 2004 -- T. Harman )

  • How do I determine the number of nodes being used on ALC (at LLNL)?

    Answer:

  • Use the "usage" script (modified from the inferno script of the same
    name by Todd) to get this information. (The script is located in
    /usr/gapps/uintah/bin/usage.)



    Question 10: ( 06/22/2005 -- J. Davison de St. Germain )

  • How do I get onto LLNL's Thunder machine? What is it?

    Answer:

  • To get access to LLNL's Thunder cluster, you need to send a
    request to dav@sci.utah.edu. He will then approve the request and
    forward it to LLNL (lc-support@llnl.gov). For information about
    the Thunder cluster, go here.



    Question 11: ( June, '05 -- David Groulx )

  • Why are my exceptions printing out garbage?

    Answer:

  • If exceptions are printing out garbage for you, then you are probably
    using gcc 3.3 or earlier to compile with. To force exceptions to
    print out information in a compiler independent way, configure SCIRun
    with the flag '--enable-exceptions-crash' and rebuild. This should
    give you more informative exceptions.





    CATEGORY: MPI




    Question 1: ( (Date Not Specified) -- (Author Not Specified) )

  • What environment variables do I use with MPI?

    Answer:

  • This is for SGI's (perhaps the IBM SP?)

    setenv MPI_MSGS_PER_HOST 2048
    setenv MPI_MSGS_PER_PROC 1024




    Question 2: ( (Date Not Specified) -- (Author Not Specified) )

  • Memory usage and MPI_TYPE_MAX

    Answer:

  • Date: Thu, 02 May 2002 12:42:16 -0600
    From: Wayne Witzel
    Subject: memory usage and MPI_TYPE_MAX

    FYI, this is a case study you should know about just in case this kind
    of thing happens in the future.

    I tracked down the highwater memory test failures of ICE and MPMICE to
    the fact that I recently added:
    setenv MPI_TYPE_MAX 10000
    to my .cshrc on rapture.sci.

    The default MPI_TYPE_MAX is 1024. So increasing it to 10000 causes
    MPI to use significantly more memory (at least, relative to the memory
    these ICE and MPMICE runs were using).

    So the lesson here is that if you are having failures with highwater
    memory tests in the regression tester, this is one culprit to look at.
    One way to tell if this is the problem is to open up the "malloc_stats"
    file in your results and in the gold standard, search for "MPI
    initialization" and compare the number of bytes. The number will be 4
    times whatever your MPI_TYPE_MAX is set to.

    The way I could see people running into this in the future is if they
    run tests manually on their account where they don't have MPI_TYPE_MAX
    set (or set to a different value than I have it set to) and then replace
    the gold standard with these results. The way to prevent this would be
    for everybody to have the same values set for the MPI environment
    variables in their .cshrc. I have the following in my .cshrc:


    setenv MPI_MSGS_PER_HOST 32768
    setenv MPI_MSGS_PER_PROC 8192
    setenv MPI_TYPE_MAX 10000


    Wayne





    CATEGORY: Matlab Tricks




    Question 1: ( 01/26/05 -- Todd )

  • How do I make a contour plot in matlab?

    Answer:

  • You should be in the StandAlone directory, and lineextract must be compiled.

    %__________________________________
    % Hard wired Variables
    ts = 4; % timestep
    level = 0;
    uda = 'test.uda';
    startEnd ='-istart -1 -1 8 -iend 17 17 8';

    %__________________________________
    % import the data
    c = sprintf('lineextract -v delP_Dilatate -l %i -timestep %i %s -o delP -m 0 -uda %s',level,ts,startEnd,uda);
    [s, r] = unix(c);

    delP = importdata('delP');
    x = delP(:,1);
    y = delP(:,2);
    z = delP(:,4);
    %__________________________________
    % reshape and plot the data (this is the trick to contour plots)
    X = reshape(x, [18 18]);
    Y = reshape(y, [18 18]);
    Z = reshape(z, [18 18]);

    [C,h] = contourf(X, Y ,Z);
    clabel(C,h);
    colormap jet
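
    The reshape step is the part that usually trips people up: lineextract gives flat columns, and contourf needs 2D grids. Matlab's reshape fills column by column; the pure-Python sketch below mimics that on a tiny made-up list so the ordering is easy to see.

```python
def reshape_column_major(values, rows, cols):
    """Mimic Matlab's reshape(values, [rows cols]) for a flat list."""
    if len(values) != rows * cols:
        raise ValueError("length must equal rows * cols")
    # Element (r, c) of the grid is the (c * rows + r)-th value of the flat list.
    return [[values[c * rows + r] for c in range(cols)] for r in range(rows)]


flat = [1, 2, 3, 4, 5, 6]  # made-up data standing in for one extracted column
for row in reshape_column_major(flat, 2, 3):
    print(row)
```

    If the contour plot comes out transposed or scrambled, the grid dimensions passed to reshape are the first thing to check.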






    CATEGORY: Misc




    Question 1: ( Jul 2002 -- Dav/Bryan )

  • How do I get passwordless entry to LANL (or anywhere else for that matter)?

    Answer:

  • Here is a method that I believe will work to remove the need to type
    in your password when you ssh from anywhere to rapture (either for cvs
    or for sending data files.)

    You need to follow these steps:

    > ssh to the machine you want passwordless access FROM

    > ssh-keygen -t dsa # this is done only once
    Press return, then enter a pass phrase that you will remember as you
    will need it once every log in session.

    This will create files in your ~/.ssh directory - id_dsa, and id_dsa.pub.
    You may also use 'ssh-keygen -t rsa' for rsa (it will create id_rsa and
    id_rsa.pub), or ssh-keygen -t rsa1 if you need ssh 1 protocol.


    Copy the data from "id_dsa.pub" (or id_rsa.pub) (that was generated in
    your .ssh dir on the machine you logged in to) to rapture (or the machine
    you want passwordless access TO) and append it to a file named

    ~/.ssh/authorized_keys

    Now, every time you want to do the no password ssh'ing from that location to
    rapture type:

    > ssh-agent # this is done only one time as you first log in

    Run the commands it prints to the screen (which adds some stuff to
    your environment).

    > ssh-add ~/.ssh/id_dsa # this is done only one time after the ssh-agent
    enter your pass phrase that you used above.

    > ssh name@rapture.sci.utah.edu (or to the machine you copied the public key)
    At this point (from now on in this log in session) you can ssh freely
    to rapture. You will also be able to ssh freely from any xterms you
    kick off.

    This sort of thing should also work to go to/from other machines.
    BTW, the id_dsa file contains your private key. It should be only
    readable by you. The id_dsa.pub contains your "public" key. In
    theory, this is what you can give to other people so that they can
    send encrypted data to you that only you can decipher.

    (This part isn't necessary, it's just optional extra power)
    The ssh-agent and ssh-add don't *really* need to be done every time.
    In theory, whenever you run an ssh-agent, it stays in memory until the
    machine reboots (or until root kills it). To take advantage of this, you
    can save the commands that ssh-agent outputs to a file, and then just source
    that file when you log in. And if you have already authenticated (ssh-add)
    to that ssh-agent, you won't need to do it or type in your passphrase again.

    Here are two aliases that facilitate this process (keep each one on one line).
    Add them to your .cshrc or .aliases file.

    alias agent 'rm -f "$HOME"/.ssh/`hostname`.agent ;
    ssh-agent > "$HOME"/.ssh/`hostname`.agent ;
    source "$HOME"/.ssh/`hostname`.agent ; ssh-add'

    This saves the output of ssh-agent to a file, sources it, and does ssh-add.
    You will need to type your pass-phrase here. You will only need to do this
    once, or until the process gets killed.

    alias sshagent 'if (-e "$HOME"/.ssh/`hostname`.agent)
    source "$HOME"/.ssh/`hostname`.agent ; endif'

    This one checks for the file that should be created by this computer, and if it
    exists, it sets up the environment to run with that ssh-agent. If you run
    sshagent at the end of your .cshrc, you may never have to type passwords again!
    However, if the machine reboots or your ssh-agent gets killed, this alias won't
    work, and you will need to run 'agent' again. Be extremely careful when doing
    this: make sure your .ssh directories and these files can be read only by you.

    So, once on the machine you want to ssh from, type

    > agent

    and at the bottom of your .cshrc file (after the aliases that you added above) add

    sshagent

    This will set everything up. Also, before you run agent, make sure that there
    aren't already any ssh-agents owned by you on that machine:

    ps -fu username | grep ssh-agent

    Kill them before you run the agent alias.



    Question 2: ( Feb '03 -- Worthen )

  • How can I turn off the compilation of Arches (or MPM or ICE)?

    Answer:

  • To turn off compilation of ARCHES (this works for turning off MPM/ICE
    too), use the script

    Uintah/Test/helpers/useFakeArches.pl path-to-SCIRun

    This basically edits the sub.mk files to remove references to Arches and
    builds an empty Arches class. Likewise, useFakeIce.pl, useFakeMPM_ICE.pl,
    and useFakeMPM.pl (this one is in the works) will turn off ICE, MPM and
    ICE, or MPM, respectively.



    Question 3: ( (Date Not Specified) -- (Author Not Specified) )

  • I am having thread problems with SCIRun. What could be wrong?

    Answer:

  • The gcc compiler must have threads enabled. You can check this with
    "gcc -v". It should say "Thread model: posix". If not, you need to
    reconfigure gcc using the "--with-threads=posix" option.



    Question 4: ( Dec 2002 -- Hartner )

  • What MANPATH should I use?

    Answer:

  • It should be undefined. Setting your MANPATH really messes up GNU man. As
    long as the command is in your PATH, man should be able to find the man
    page if one exists. (This is for linux/inferno?)



    Question 5: ( Jan '03 -- Steve Parker )

  • How do you get generic execution time measurements from a program?

    Answer:

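  • /usr/bin/time sh -c "sus -ice inputs/whatever.ups > sus.log 2>&1" >& time.log

    The program output goes to sus.log (just an example name) and the timing
    summary from /usr/bin/time goes to time.log. The outer >& is csh syntax.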



    Question 6: ( Feb '03 -- Worthen )

  • How can I change the PETSc I am using to another without reconfiguring?

    Answer:

  • You can edit the configVars.mk file. Specifically, you will need to
    modify the PETSC_LIBRARY variable and have it point to the right place.
    This assumes that the build of PETSc works and uses the same version
    of MPI that you are linking sus against (hence the problems we were
    having last week on the cluster). Then delete all the libraries (rm
    lib/*.so) and recompile.



    Question 7: ( Apr '03 -- Kurt Zimmerman )

  • How do I make mpeg movies from raw frames?

    Answer:

  • You'll need a few pieces of software:
    pnmflip (on rapture in /usr/sci/local/bin/ used by raw2ppm.csh below)
    raw2ppm (on rapture also, and also used by raw2ppm.csh)
    mpeg_encode (grab a copy for an SGI from /home/sci/kuzimmer/bin)

    You'll need a parameter file for mpeg_encode:
    look at /home/sci/kuzimmer/tools/pnm.param

    And you'll need a simple C shell script:
    look at /home/sci/kuzimmer/tools/raw2ppm.csh

    Copy the raw2ppm.csh file into the directory where all of your
    *.moviename.raw files are. Edit the dimensions in the script to match
    the frame size of your raw frames. Then type raw2ppm.csh at the command
    line. It will begin converting all of your raw files to ppm files,
    which is the format mpeg_encode likes. While you are converting files you
    will want to copy the pnm.param file to this same directory. You will also
    want to edit the pnm.param file. Edit the OUTPUT line (line 4 in my
    pnm.param file) to set the file name for the movie. Then edit the INPUT
    section (lines 16-18 in my file) to match the names of your *.moviename.ppm
    files, and set the beginning and end numbers. So, for example, if you want
    to make a movie of the files 021.mymovie.ppm to 653.mymovie.ppm, the
    INPUT section of the parameter file would read:
    INPUT
    *.mymovie.ppm [021-653]
    END_INPUT

    Once you have all of your .ppm files and you've edited your parameter
    file, just type:
    mpeg_encode pnm.param
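
    If you would rather not work out the first and last frame numbers by hand,
    the INPUT section can be generated from a directory listing. The following
    is only a sketch: the helper name input_section is made up for this
    example, it assumes frames are named <number>.<moviename>.ppm as described
    above, and it does not check for gaps in the numbering.

```python
import re

def input_section(filenames, movie_name):
    """Build the INPUT block of the parameter file from a list of file names."""
    # Keep only names of the form <digits>.<movie_name>.ppm
    frame_re = re.compile(r"^(\d+)\." + re.escape(movie_name) + r"\.ppm$")
    numbers = [m.group(1) for m in map(frame_re.match, filenames) if m]
    if not numbers:
        raise ValueError("no %s frames found" % movie_name)
    # Sort numerically but keep the zero-padded text for the output
    numbers.sort(key=int)
    return "INPUT\n*.%s.ppm [%s-%s]\nEND_INPUT\n" % (
        movie_name, numbers[0], numbers[-1])

files = ["021.mymovie.ppm", "022.mymovie.ppm", "653.mymovie.ppm", "pnm.param"]
print(input_section(files, "mymovie"))
```

    In practice you would pass os.listdir(".") instead of the hard-coded list
    and paste the result over the INPUT section of your pnm.param file.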





    Question 8: ( Apr '03 -- J. Davison de St. Germain )


  • I get this message when trying to use CVS:

    cvs checkout: failed to create lock directory for <some directories>: Permission denied
    cvs checkout: failed to obtain dir lock in ...
    [checkout aborted]: read lock failed - giving up

    How do I fix this?

    Answer:

  • This means you do not have permission to access some of the CVS tree,
    usually because you are not in the sci unix group. To verify, type
    "groups" on the command line on a SCI machine. If sci is not listed,
    please send a message to dav@sci.utah.edu asking to be added to the
    sci unix group so you can access CVS.



    Question 9: ( June 2003 -- Randy Jones )

  • How do I update C-SAFE web pages?

    Answer:

  • The following are some quick and simple instructions
    on how to add content to the C-SAFE website.

    Currently, content that is checked in should be automatically
    updated to the web server.

    Make sure you have an account on a SCI machine.

    Step 1: Make sure to have the following environment
    variables set:
    CVS_RSH=ssh
    CVSROOT=/usr/sci/projects/cvsrepository

    If you are using a machine outside of sci,
    then you will use:

    CVS_RSH=ssh
    CVSROOT=<user>@<sci-machine>:/usr/sci/projects/cvsrepository

    (This is the same CVSROOT used for SCIRun)


    Step 2: Checkout the C-SAFE web site tree:

    cvs co csafeweb

    (It will take about 110MB of disk space)


    Step 3: The easiest way to create a new page and add
    it to the C-SAFE web site is to copy one that
    is already there, rename it, and then replace
    the parts that are inside of:

    <!-- START OF CONTENT -->


    <!-- END OF CONTENT -->

    with your own content.

    Then, put a link to your new page from a
    page already on the C-SAFE web-site. This way,
    you will automatically get the C-SAFE title
    bar and style on your new page.

    (If your new web page is going to be in a new
    subdirectory, you will have to fix the links to
    the title bar images by replacing "../" with
    "../../" on all of the image references.)


    Step 4: Add your new web page to cvs:


    cvs add <your-page>.html


    Step 5: Commit your changes:

    cvs commit -m "Added <something> to C-SAFE web site"




    Question 10: ( Mar, '05 -- David Groulx )

  • How do I find out the kernel version and architecture I am running on?

    Answer:

  • From the terminal type "uname -a" to print out all OS information.





    CATEGORY: SUS




    Question 1: ( (Date Not Specified) -- (Author Not Specified) )

  • What is sus? Pronunciation?

    Answer:

  • Standalone Uintah Simulation (application). Pronounced "sus" (short
    'u') rhymes with "fuss".



    Question 2: ( (Date Not Specified) -- (Author Not Specified) )

  • How do I give sus input? What is a .ups file?

    Answer:

  • The xml file basically specifies the input and output of a module as
    well as some other basic info for module creation. If you are just
    adding more operators and using the same input and outputs, then you can
    safely ignore the xml file. The tcl file is where you set up the
    visual code for a module, entries, buttons, sliders etc. The tcl
    code and the C++ code usually "communicate" via GuiVariables, although
    there are other means of passing info between the two.



    Question 3: ( 19 Sep 2002 -- T. Harman )

  • How long does an ICE/MPMICE timestep take (on frost/raptor)?

    Answer:

  • Raptor ICE problem 2 ice_matl
    Time=0.00715542, delT=7.37676e-05, elap T = 144.165, DW: 97, Mem Use = 12435456
    Time=0.00722918, delT=7.37676e-05, elap T = 145.641, DW: 98, Mem Use = 12435456
    Time=0.00730295, delT=7.37676e-05, elap T = 147.098, DW: 99, Mem Use = 12435456

    Frost ICE problem 2 ice_matl
    Time=0.00715542, delT=7.37676e-05, elap T = 149.435, DW: 97, Mem Use = 15147616
    Time=0.00722918, delT=7.37676e-05, elap T = 150.962, DW: 98, Mem Use = 15147616
    Time=0.00730295, delT=7.37676e-05, elap T = 152.476, DW: 99, Mem Use = 15147616


    Frost MPMICE problem, 1 ice_matl 2 mpm_matl
    Time=0.0547642, delT=0.0027551, elap T = 103.666, DW: 21, Mem Use = 18752672
    Time=0.0575193, delT=0.00275559, elap T = 108.634, DW: 22, Mem Use = 18752672
    Time=0.0602748, delT=0.00275601, elap T = 113.47, DW: 23, Mem Use = 18752672
    Time=0.0630309, delT=0.00275639, elap T = 118.401, DW: 24, Mem Use = 18752672

    Raptor MPMICE problem, 1 ice_matl 2 mpm_matl
    Time=0.0547642, delT=0.0027551, elap T = 111.474, DW: 21, Mem Use = 15089664
    Time=0.0575193, delT=0.00275559, elap T = 115.989, DW: 22, Mem Use = 15089664
    Time=0.0602748, delT=0.00275601, elap T = 120.612, DW: 23, Mem Use = 15089664
    Time=0.0630309, delT=0.00275639, elap T = 125.222, DW: 24, Mem Use = 15089664
    Sus: going down successfully

    Raptor configure line
    ../src/configure --enable-64bit --enable-package=Uintah '--enable-optimize=-Ofast -G0 -OPT:Olimit=20000 -IPA:plimit=20000' --disable-sci-malloc --enable-assertion-level=0

    Frost configure line
    ../src/configure --enable-32bit --enable-package=Uintah --with-thirdparty=/usr/apps/uintah/SCIRun_Thirdparty/1.4.2/aix/xlC-32bit --disable-sci-malloc --with-zlib=/usr/local --with-mpi=/usr/lpp/ppe.poe --enable-optimize=-O2 --enable-assertion-level=0
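
    The per-timestep cost is the difference between consecutive "elap T"
    values. A small sketch that pulls those differences out of saved sus
    output follows. The regular expression is written against the status lines
    quoted above and is an assumption about the format, not something taken
    from the sus source.

```python
import re

# Matches the status lines shown above: Time=..., delT=..., elap T = ...
STATUS_RE = re.compile(
    r"Time=(?P<time>[-+.\deE]+),\s*"
    r"delT=(?P<delt>[-+.\deE]+),\s*"
    r"elap T\s*=\s*(?P<elapsed>[-+.\deE]+)")

def seconds_per_timestep(log_text):
    """Return the wall-clock seconds between consecutive status lines."""
    elapsed = [float(m.group("elapsed")) for m in STATUS_RE.finditer(log_text)]
    return [later - earlier for earlier, later in zip(elapsed, elapsed[1:])]

# The Raptor ICE lines from the answer above
sample = """\
Time=0.00715542, delT=7.37676e-05, elap T = 144.165, DW: 97, Mem Use = 12435456
Time=0.00722918, delT=7.37676e-05, elap T = 145.641, DW: 98, Mem Use = 12435456
Time=0.00730295, delT=7.37676e-05, elap T = 147.098, DW: 99, Mem Use = 12435456
"""

steps = seconds_per_timestep(sample)
print([round(s, 3) for s in steps])
```

    For the runs listed here this gives roughly 1.5 seconds per timestep for
    the ICE problem and roughly 4.5 to 5 seconds for the MPMICE problem on
    both machines.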




    Question 4: ( 2 Dec 2002 -- T. Harman )

  • What scripts are there that can help me with VarLabels?

    Answer:

  • Update of /csafe_noexport/cvs/cvsroot/SCIRun/src/Packages/Uintah/StandAlone/inputs
    In directory csf:/tmp/cvs-serv346188

    Added Files:
    labelNames
    Log Message:
    An aid for those who can't remember all the different variable labels.

    This script spits out the variable names for the different components
    usage:
    labelNames



    Question 5: ( Apr '03 -- Bryan Worthen )

  • How do I get sus to output checkpoints at specified walltime intervals?

    Answer:

  • If you have this in your .ups file:

    <DataArchiver>
    ...
    <checkpoint walltimeStart="<startnum>" walltimeInterval="<intnum>"/>
    ...
    </DataArchiver>

    where startnum and intnum are in seconds, it will do a checkpoint starting
    at startnum seconds, and then every intnum seconds after that.

    For example, with

    <checkpoint walltimeStart="3600" walltimeInterval="7200"/>
    it would start doing checkpoints after one hour, and then every two
    hours after that; with
    <checkpoint walltimeStart="10800" walltimeInterval="7200"/>
    it would start at 3 hours, and then every 2 hours.

    Keep in mind, though, that it could take a while to do the checkpoints and
    that it will wait for a timestep to complete before it outputs
    checkpoints. So schedule the last checkpoint probably 10-20 minutes before
    you know your run will terminate.

    Also note that data output and checkpoints can happen after every n timesteps, i.e.,

    output: <outputTimestepInterval>1</outputTimestepInterval>
    checkpoint: <checkpoint cycle = "2" timestepInterval = "500"/>

    or after every n simulation seconds

    <outputInterval> 0.01 </outputInterval>
    <checkpoint interval="0.0005" cycle="2"/>
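
    To check a walltimeStart/walltimeInterval pair against a batch time limit,
    it can help to list when checkpoints become due. This is a sketch of the
    arithmetic only: run_limit is a hypothetical batch limit used to bound the
    list (it is not a sus setting), and the actual checkpoint is written
    somewhat later, once the current timestep completes.

```python
def walltime_checkpoints(walltime_start, walltime_interval, run_limit):
    """Wall-clock seconds at which a checkpoint becomes due.

    All three arguments are in seconds.
    """
    if walltime_interval <= 0:
        raise ValueError("walltime_interval must be positive")
    due = []
    t = walltime_start
    while t <= run_limit:
        due.append(t)
        t += walltime_interval
    return due

def hours(h):
    """Convert hours to whole seconds."""
    return int(h * 3600)

# walltimeStart="3600" walltimeInterval="7200" inside a 12-hour batch slot
schedule = walltime_checkpoints(hours(1), hours(2), hours(12))
print([t / 3600 for t in schedule])
```

    Here the last checkpoint falls due at hour 11, which leaves an hour of
    margin before a 12-hour limit.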




    Question 6: ( Apr '03 -- Bryan Worthen )

  • What environment variables do sus/scirun respond to? (Or, how do I get SCIRun/sus to exit cleanly?)

    Answer:

  • The following environment variables can be used in either sus or scirun:

    Variable           Value     Purpose
    -----------------  --------  ----------------------------------------------
    SCI_DBXCOMMAND     command   run this debug command on a signal/abort
                                 (the pid will be provided)
    SCI_SIGNALMODE               Default - ask user what to do on abort
                       exit      exit without prompt
                       dbx       invoke SCI_DBXCOMMAND if it exists, or dbx
                                 (on sgi, or on others, gdb)
                       cvd       another debugger to try
                       resume    try to keep going
    SCI_EXCEPTIONMODE (not currently used)
                                 Default - ask user what to do on exception
                       abort     abort without prompt
                       dbx       invoke SCI_DBXCOMMAND if it exists, or dbx
                                 (on sgi, or on others, gdb)
                       cvd       another debugger to try
                       throw     throw the exception
    MALLOC_STRICT                causes all memory to be strictly initialized
                                 (0xffff5a5a)
    MALLOC_LAZY                  turns off memory auditing
    MALLOC_TRACE       filename  traces memory to filename or stderr if no
                                 filename
    MALLOC_STATS       filename  outputs memory results at exit time to
                                 filename or stderr if no filename
    MALLOC_PERPROC     filename  outputs memory usage per timestep to filename
                                 or cout if no filename





    Question 7: ( June 2003 -- Bryan Worthen )

  • How do I include an xml file from my .ups file (so I don't need to have the
    same things in many different files)?

    Answer:

  • Put an include tag anywhere in your ups file where you want it to be
    replaced with a larger set of tags,

    e.g.,
    <Uintah_specification>

    <DataArchiver>
    <include href="saveLabels.xml"/>
    <outputInterval>1.0</outputInterval>
    ...
    <MPM>
    <material>
    <include href="MaterialData/MaterialConst4340Steel.xml"/>
    ...
    </Uintah_specification>

    Specifically, the include tag has the syntax:

    <include href="filename"/> where filename is either absolute or relative
    to the path of the file doing the including.

    The included file needs to look like this (this hasn't changed):

    This is inputs/MPM/MaterialData/MaterialConst4340Steel.xml:

    <?xml version='1.0' encoding='ISO-8859-1' ?>
    <!-- 4340 Steel -->
    <Uintah_Include>
    <density>7830.0</density>
    <toughness>10.e6</toughness>
    <thermal_conductivity>38</thermal_conductivity>
    <specific_heat>477</specific_heat>
    <room_temp>294.0</room_temp>
    <melt_temp>1793.0</melt_temp>
    </Uintah_Include>

    the syntax of the file is:

    <?xml version='1.0' encoding='ISO-8859-1' ?>
    <Uintah_Include>
    <any tag or set of tags that you want/>
    </Uintah_Include>
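
    A quick way to catch a malformed include file before running sus is to
    parse it with any XML parser. The sketch below uses Python's standard
    library; the function name check_include_file is made up for this example,
    and the only rule it enforces is the one stated above, that the root
    element is Uintah_Include.

```python
import xml.etree.ElementTree as ET

def check_include_file(xml_bytes):
    """Parse an include file and return the names of the tags it provides."""
    root = ET.fromstring(xml_bytes)  # raises ET.ParseError if not well-formed
    if root.tag != "Uintah_Include":
        raise ValueError(
            "root element is <%s>, expected <Uintah_Include>" % root.tag)
    return [child.tag for child in root]

# Shortened version of the 4340 steel file shown above
sample = b"""<?xml version='1.0' encoding='ISO-8859-1' ?>
<!-- 4340 Steel -->
<Uintah_Include>
  <density>7830.0</density>
  <melt_temp>1793.0</melt_temp>
</Uintah_Include>
"""

print(check_include_file(sample))
```

    To check a real file, read it in binary mode and pass the bytes to the
    function, e.g. check_include_file(open(path, "rb").read()).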




    Question 8: ( Oct 2003 -- Bryan Worthen )

  • How do I debug mpi jobs with gdb?

    Answer:

  • See gdb.html



    Question 9: ( November 26, 2003 -- Randy Jones )

  • How do I track memory leaks?

    Answer:

  • - You need to have a build of sus where you have enabled sci-malloc.
    (i.e. your configure lines should have " --enable-sci-malloc"). Add
    --enable-scinew-line-numbers to get files with line numbers where
    scinew detects a memory leak.

    - Set the environment var MALLOC_STATS to a 'filename'.

    - Edit your code and add:

    const char* old_tag = AllocatorSetDefaultTag("task abc");

    to the top of each task (or the top of each region you want to test).

    AllocatorSetDefaultTag() returns the current string, so you
    want to reset it at the end of your task to avoid misleading tags
    for leaking memory. To do this, just add:

    AllocatorSetDefaultTag(old_tag);

    at the end of each task (old_tag was set above by the first call
    of AllocatorSetDefaultTag()).

    - Recompile

    - Run

    - Look at the 'filename' file and look for the non-freed memory. It
    should be labeled with the name of the task that you entered. You
    can add more AllocatorSetDefaultTag() calls in this task if you
    need to narrow it down more.

    - You should call AllocatorResetDefaultTag() at the highest level of
    setting the default tag. This will set it back to the default tags.
    The reason AllocatorSetDefaultTag(old_tag) won't do this is there are
    actually three tags (malloc, new, and new[]) that are being set, and
    AllocatorResetDefaultTag() resets all three back to their original value.




    Question 10: ( January 27, 2004 -- James Bigler )

  • Is there a way to verify uda directory contents without launching scirun
    and going through all of the timesteps?

    Answer:

  • If your data is not too large, you could always do:

    ./puda -varsummary [uda]

    This will touch all of the data and timesteps.



    Question 11: ( May 2004 -- Bryan Worthen )

  • How do I run the dynamic load balancer?

    Answer:

  • Add the following section to your ups file:

    <LoadBalancer>
    <timestepInterval>500</timestepInterval>
    <cellFactor>.5</cellFactor>
    <dynamicAlgorithm>particle3</dynamicAlgorithm>
    <gainThreshold>0.0</gainThreshold>
    <doSpaceCurve>true</doSpaceCurve>
    </LoadBalancer>



    The timestepInterval is how often load balancing will occur (you can also
    use <interval>#</interval> where # is a number in terms of the simulation
    time). Note that for all the cases I've done, the load balancer has done
    most of its good work on the first timestep, and subsequent rebalancing
    didn't help much, but this will depend on your problem, so based on
    experimentation, you might want this to be a larger or smaller number.

    The cellFactor tells the load balancer how much to count each cell in terms
    of particles to determine the total patch cost. I have found that values
    between .5 and .7 are good for MPM simulations and that values around 1.0
    are good for MPMICE simulations, but you are welcome to experiment.

    So far, the load balancer only works well for MPM-based simulations; the
    others, including AMR simulations, are currently a work in progress.

    The dynamicAlgorithm is which algorithm to use at runtime. The
    choices are

    static - pretty much the same as the default load balancer
    cyclic - rotates the patches in a cyclic manner among processors (pretty worthless except as a test)
    random - assigns each patch to a random processor (also pretty worthless except as a test)
    particle1 - decent algorithm, but worse than static
    particle2 - not a good algorithm, way worse than static
    particle3 - pretty good algorithm - this is what we showed at the TST.

    So chances are, if you want to try anything useful with the Load Balancer,
    do particle3.

    gainThreshold is optional - it tells the load balancer that instead of load balancing every n timesteps,
    it should first check whether it's worthwhile to do so. More specifically, it calculates the std deviation of
    processor cost (where cost is based on the cellcost*numcells + numParticles) with and without load balancing, and
    if (oldStdDev / proposedStdDev) >= threshold, then it does the load balancing. That is, if threshold is zero, it
    always load balances; if it is 1.0, it load balances if the proposed solution is at least as good; and if it is
    1.25, the proposed solution must be 25% better. If this is left out, the default value is 0.
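
    The threshold test can be written out as a short sketch. This is not the
    load balancer's code: the function names and data layout are made up for
    illustration, cell_cost is assumed to be the cellFactor value, and the
    population standard deviation is an assumption (the text above does not
    say which form is used).

```python
from statistics import pstdev

def processor_costs(assignment, patches, cell_cost, num_procs):
    """Total cost per processor: cell_cost*numcells + numParticles per patch.

    assignment: patch id -> processor rank (hypothetical layout)
    patches:    patch id -> (num_cells, num_particles)
    """
    costs = [0.0] * num_procs
    for patch_id, proc in assignment.items():
        num_cells, num_particles = patches[patch_id]
        costs[proc] += cell_cost * num_cells + num_particles
    return costs

def should_rebalance(old_costs, proposed_costs, gain_threshold=0.0):
    """Apply the (oldStdDev / proposedStdDev) >= threshold test."""
    if gain_threshold <= 0.0:
        return True  # threshold of zero: always load balance
    old_sd = pstdev(old_costs)
    proposed_sd = pstdev(proposed_costs)
    if proposed_sd == 0.0:
        return True  # proposed layout is perfectly even; avoid dividing by zero
    return old_sd / proposed_sd >= gain_threshold

# Four patches of 1000 cells each; two of them hold most of the particles
patches = {0: (1000, 4000), 1: (1000, 0), 2: (1000, 3000), 3: (1000, 0)}
old = processor_costs({0: 0, 2: 0, 1: 1, 3: 1}, patches, 0.5, 2)
new = processor_costs({0: 0, 1: 0, 2: 1, 3: 1}, patches, 0.5, 2)
print(old, new, should_rebalance(old, new, 1.25))
```

    With these made-up numbers the old layout costs 8000 and 1000 on the two
    processors and the proposed one 5000 and 4000, so the ratio of standard
    deviations is 7 and a threshold of 1.25 is easily met.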

    doSpaceCurve is optional - it tells the load balancer to try to do a simple space-filling curve, which
    should give it some optimization. Currently the curve algorithm makes one big assumption about the
    domain: that it can be identified by a single <patches> section for each level. Therefore, if your
    domain has multiple boxes or uses AMR regridding, it's probably a good idea to set this to false.
    If this is left out, the default value is false.

    Then run sus as normal but add -loadbalancer PLB or -loadbalancer
    ParticleLoadBalancer to the command line:

    sus -mpm disks.ups -loadbalancer PLB

    Note that you should have somewhat more patches than processors (at least
    twice as many).



    Question 12: ( July 13, 2004 -- Randy Jones )

  • How do I get the stack trace on a hung program which crashed on an SGI?

    Answer:

  • In a separate xterm, you can type "dbx -p <process-id>" and then type
    "where" to get the stack trace.



    Question 13: ( August 12, 2004 -- Randy Jones )

  • How do I see how much real time it is taking to calculate one simulation second?

    Answer:

  • To see this statistic every timestep, just type the following into your shell:
    setenv SCI_DEBUG SimulationTimeStats:+

    If you want to see this statistic by itself and remove the normal stats, type:
    setenv SCI_DEBUG SimulationTimeStats:+,SimulationStats:-



    Question 14: ( August 2004 -- Bryan )

  • Is there a way to match the name of the uda directory with my batch job?

    Answer:

  • Sometimes it is desirable to have the uda match something, like the job id
    of the job that ran it. To do this, you can pass -uda_suffix <name>
    as an argument to sus. On inferno, you can specify -uda_suffix $PBS_JOBID
    to match the job number (and output file if that's how you save output).



    Question 15: ( Sep, 2004 -- Bryan )

  • Can I override the delt of a restart run from what was saved in the
    checkpoints?

    Answer:

  • Yes.

    Do:
    <override_restart_delt> .00000000000000001 </override_restart_delt>

    in the time block of your input.xml file. This will override the
    very next timestep, and will display a message that it is doing so.
    It will affect restarts only, so placing it in a ups file won't do
    anything on the original run, but it will be copied to the input.xml
    file and will take effect on the restart.



    Question 16: ( December, 2004 -- Bryan )

  • Can I make sus output the initialization timestep?

    Answer:

  • Yes. Add <outputInitTimestep/> to the DataArchiver block in your ups file.



    Question 17: ( December, 2004 -- Bryan )

  • Outputting data seems to take longer with an increasing number of processors.
    Is there a way to make this faster?

    Answer:

  • Maybe. If you add this section to your ups file:

    <LoadBalancer>
    <outputNthProc>4</outputNthProc>
    </LoadBalancer>

    it will tell sus to output data from every 4th processor instead of every single processor
    (i.e., procs 1-3 will ship their data to proc 4, 5-7 will ship to proc 8, etc.). So naturally
    the cost of sending the data via mpi needs to be less than the gain achieved by having fewer
    processors hitting the file system at the same time for this to be beneficial.

    In my experiments, 128 procs outputting on every 4th processor was beneficial
    (on inferno using raid1). I suppose that for more procs it would also be beneficial.
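
    The grouping described above (procs 1-3 ship to proc 4, 5-7 ship to
    proc 8) can be written as a one-line mapping. This sketch follows the
    1-based numbering of that description and assumes the processor count is a
    multiple of outputNthProc; it illustrates the grouping only and is not
    taken from the sus source, which may number ranks differently.

```python
def output_proc(proc, nth):
    """Processor that writes the data of `proc` (1-based, as in the text)."""
    if proc < 1 or nth < 1:
        raise ValueError("proc and nth must be positive")
    # Round proc up to the next multiple of nth
    return ((proc + nth - 1) // nth) * nth

# With <outputNthProc>4</outputNthProc> on 128 processors
writers = sorted({output_proc(p, 4) for p in range(1, 129)})
print(len(writers), writers[:3])
```

    For 128 processors and outputNthProc of 4, only 32 processors touch the
    file system, at the price of three MPI transfers per writer.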




    Question 18: ( Dec, 2004 -- Bryan )

  • How do I only run my simulation for x timesteps?

    Answer:

  • Two ways. Add either
    <max_iterations>x</max_iterations>
    or
    <maxTimestep>x</maxTimestep>

    where x is the number of timesteps. The difference is that
    max_iterations will run that many timesteps from the start of
    the current run, so a restart runs another x timesteps. maxTimestep
    will run to timestep number x and quit, even on restarts.



    Question 19: ( Jan, 2005 -- Bryan )

  • How can I remove some variables from an uda?

    Answer:

  • Run your simulation as normal. Then edit the <uda-dir>/input.xml
    file and remove the "save" labels that you don't want anymore. Then run

    [mpirun -np #procs] sus -reduce_uda <uda-dir>

    Use MPI for big cases that won't fit on one processor.

    If you want to compare the new and the old uda to make sure they are the
    same, then edit the original uda/index.xml and remove the same variables
    and then do

    compare_uda first_uda second_uda

    Note, this probably won't work if your varLabels use BoundaryLayers, which
    I think only are used in the Examples component directory.

    Also note that the resulting uda will differ slightly from the original uda:
    the timesteps in the result will be numbered t00001-t<number-of-timesteps>
    instead of keeping their original numbers, the delt's stored in the
    timestep.xml files represent the time difference between the output
    timesteps, and the resulting uda does not have checkpoints or reduction
    variables. However, if you copy the checkpoints and reductions over, you
    should be able to restart just fine, and as far as scirun or any other
    program is concerned it should look exactly the same.



    Question 20: ( June, 2005 -- Bryan )

  • What does 'WARNING: Possible extra communication between patches!' mean?

    Answer:

  • This means that a processor is sending more data to another processor
    than it needs to, and normally only arises when you have more than one
    patch per processor.

    For example (ASCII art)

    -----------
    |    |    |
    | 1  | 2  |
    |    |    |
    -----------
    |    |    |
    | 0  | 1  |
    |    |    |
    -----------

    proc 0 needs to send data to proc 1. It needs to send to the patch
    above it and to the patch to the right. So, in the current way of
    things, we choose sending one larger message (which constitutes the
    entire patch here) over sending two smaller messages. Whether this is a
    good choice or not depends on the message size, network latency, and
    bandwidth.

    This is one of the things we are investigating for a scheduler change.

    This message only occurs at taskgraph compilation time, so its frequency
    is not indicative of the number of total large messages, but perhaps of
    the number of them in one timestep.



    Question 21: ( 01/26/06 -- Todd )

  • How do I monitor a single variable through a timestep?

    Answer:

  • Add this to your input file

    <Scheduler>
    <VarTracker>
    <start_time> 0 </start_time>
    <end_time> 1 </end_time>
    <start_index> [139, 0, -1] </start_index>
    <end_index> [141,2,1] </end_index>
    <var label="press_equil_CC" dw="NewDW" />
    <var label="press_CC" dw="NewDW" />
    </VarTracker>
    </Scheduler>

    If you want to limit the spew and only print a subset of tasks, you can
    do that by specifying
    <task name="ICE::computeDelPressAndUpdatePressCC" />

    in the <VarTracker> section.





    CATEGORY: Scripts/Utilities




    Question 1: ( 02/01/06 -- Todd )

  • What is plotStats?

    Answer:

  • plotStats is a small gnuplot script that takes the output from sus, parses
    it and plots several simulation metrics as a function of wall clock time. It's
    really useful in monitoring the timestep size.

    Usage:
    plotStats <sus output file> <dump postScript File (y/Y), default is no>

    You must have gnuplot installed.



    Question 2: ( 02/01/06 -- Todd )

  • Is there a utility to get the time step information from an uda?

    Answer:

  • Yes, use
    puda -timesteps <uda directory>





    CATEGORY: Subversion




    Question 1: ( April 14, 2005 -- Hartner )

  • What is subversion and how do I use it?

    Answer:

  • Subversion is used to manage the source trees of SCIRun and Uintah.
    Prior to April 15th 2005 we used CVS.

    The SCIRun developers have a webpage to help people get up and running
    on Subversion.
    Please refer to: 

    http://internal.sci.utah.edu/developer/BioPSE/NCRRweb/DocProcess/SCIRunandSubversion.html





    Question 2: ( Apr, '05 -- David Groulx )

  • How do I get Subversion?

    Answer:


  • For the impatient person installing from source (for any OS):

    Download subversion-1.1.4.tar.gz
    tar -zxvf subversion-1.1.4.tar.gz
    cd subversion-1.1.4
    mkdir ~/local
    ./configure --with-ssl --prefix=/home/yourname/local
    make
    make install (can be installed as a normal non-root user)
    setenv PATH /home/yourname/local/bin:$PATH

    For the impatient person running Redhat Enterprise Linux 3:

    Download Jim Guilkey's RPMs for RH EL3 posted at:
    http://www.sci.utah.edu/~guilkey/SUBVERSION/
    rpm -ivh *.rpm (must be root to install)

    For the inquisitive patient person:

    The subversion project homepage is located at http://subversion.tigris.org
    All the files you need to get started using subversion are stored
    locally on the network. For installers, go to
    /usr/sci/projects/subversion/<platform> and get the installer for your
    OS. The README within each folder will have platform specific
    instructions and caveats. Additionally, the source tarball is located
    in the src folder; this should compile on all platforms with the
    standard "configure; make; make install" method. As a third option,
    precompiled binaries for most platforms are located at
    /usr/sci/projects/subversion/bin/<platform>. You can just add
    the appropriate location to your path.




    Question 3: ( 01/23/06 -- Todd )

  • How do I revert my changes back to revision 32928?

    Answer:


  • cd SCIRun/src
    svn merge -r32929:32928 https://code.sci.utah.edu/svn/SCIRun/trunk/src/




    Question 4: ( 01/23/06 -- Todd )

  • How do I checkout a specific date?

    Answer:


  • svn checkout --revision "{$year-$month-$day}" https://code.sci.utah.edu/svn/SCIRun/trunk/src SCIRun/src




    Question 5: ( 01/26/06 -- Todd )

  • How do I make a branch?

    Answer:


  • cd /Uintah
    svn copy -m "creating impAMRICE residual Branch" . https://code.sci.utah.edu/svn/SCIRun/branches/uintah-impAMR-residual






    CATEGORY: Tester




    Question 1: ( Apr '03 -- Bryan Worthen )

  • How do I start/restart the regression tester? How do I run the regression
    tester on my own SCIRun build?

    Answer:

  • See the regression tester documentation for a lot more information.

    To start the regression tester, run /local/csafe/raid1/tester/bin/startTester.

    startTester needs at least one argument to run, which is normally -sendmail.
    You can also run startTester -sendmail -use_tree path_to_scirun to run the
    tests on any tree.

    On inferno, in order to run mpi, we need to go through the batch scheduler, so
    we can run:

    qsub /local/csafe/raid1/tester/bin/Regress.pbs

    To run specific regression tests on your SCIRun tree, see
    how to run your own tests



    Question 2: ( May 2003 -- Bryan Worthen )

  • How do I update and compile the current regression tester build?

    Answer:

  • To update the current build,

    1) cd /local/csafe/raid1/tester/{IRIX64|Linux}/SCIRun.date/src

    where you pick either IRIX64 or Linux, and date is the most recent SCIRun
    build.

    2) cvs update

    3) cd ../{dbg|opt}/build

    again, choose between dbg or opt.

    4) gmake -j [numprocs] sus

    5) cd .. (this will take you to the dbg or opt dir)

    6) run the do[whatever]tests script, where whatever is ICE, ARCHES, MPM,
    etc. i.e.,

    doMPMARCHEStests

    7) If it stops and asks you to remove a directory, do it, and run it
    again.

    8) Let the owner of the regression tester know if there are any permissions
    problems



    Question 3: ( June 2003 -- Bryan Worthen )

  • How do I add my own tests to the regression tester?

    Answer:

  • See the regression tester docs



    Question 4: ( December, 2003 -- Bryan )

  • The restart test passes the comparisons, but the normal test fails. I've
    replaced my gold standards hundreds of times, but it still fails. What is
    going on?

    Answer:

  • When your restart test passes but your original tests fail, it is a problem
    with the restart, probably in the initialization. The reason the restart
    test passes is because when you replace the gold standard, the uda that gets
    saved is the uda from the restart. Since something is different between the
    restart and the original, the original fails, even though the original is more
    correct.



    Question 5: ( June 2005 -- Bryan )

  • How can I check on the status of the Regression Tester before I get the
    email? (Or how can I see if it ran?)

    Answer:

  • Three ways:
    1) Check the website: www.csafe.utah.edu/tester/$OS/SCIRun.$DATE
    where $OS is Linux or IRIX64 and date is in the 6-digit format. I.e.,
    www.csafe.utah.edu/tester/Linux/SCIRun.060305

    Linux dir
    SGI dir

    The last test group is the last test completed. So if IMPM-opt is the
    last thing you see, then it has completed.

    2) Check the RT directories: /usr/csafe/raid2/csafe-tester/$OS/SCIRun.$DATE
    where OS and DATE follow the same rules as above. Check the contents of
    dbg and opt for directories called *-results (where * is ARCHES, MPM, etc.)

    If there are no results directories in opt, then it's either still in dbg, or
    compiling the opt sus. The last results directory in alphabetical order
    is the last directory it worked on.

    3) Check the machine the tester is running on.
    For IRIX
    ssh muse
    For Linux
    ssh inferno
    qstat -an
    Look which nodes csafe-tester is using
    ssh to the first one

    Then run
    ps -fu csafe-tester
    to see what the RT is currently doing.





    CATEGORY: Thirdparty




    Question 1: ( (Date Not Specified) -- Wing )

  • PETSc vs HYPRE?

    Answer:

  • PETSc does a good job of giving us a suite of preconditioners and
    linear solvers (it has nonlinear solvers too). However, in order to
    make it more efficient on large-scale parallel linear problems, we
    need multigrid. PETSc does have multigrid, but as Steve and Rajesh know,
    it's a pain since our indexing scheme is different from theirs, and
    we also have to take care of some other coding stuff (we know because we
    tried). I found hypre, which has multigrid and some linear solvers,
    and the advantage is that it uses the same indexing, so the interface is
    very easy. This is just another option for users to choose from. We are
    still using PETSc for some of the matrix-vector operations. You can say
    I'm just too lazy to code that up myself. For more info on hypre, you
    can go to http://www.llnl.gov/CASC/hypre





    CATEGORY: UCF




    Question 1: ( Dec 3, 2002 -- Steve )

  • What does this mean?

    Caught exception: TempX_FC, matl 0, patch/level 0 not found for
    scrubbing.

    I assume there is a problem with the computes and requires for that
    variable, but could the message be a little more descriptive? I'll
    change it if someone can describe what's wrong.

    Answer:

  • It probably means that a variable was declared to be computed but never
    "put". The old check for that specific problem no longer works.

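    To make the "computed but never put" failure mode concrete, here is a
    minimal sketch of the kind of check the answer refers to. ToyTask and
    missingPuts are hypothetical names made up for illustration, not the
    actual UCF classes, and "otherVar" is an invented label.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a task: the labels it promised to compute and
// the labels it actually stored ("put") while running.
struct ToyTask {
    std::set<std::string> declaredComputes;
    std::set<std::string> actualPuts;

    void computes(const std::string& label) { declaredComputes.insert(label); }
    void put(const std::string& label) { actualPuts.insert(label); }
};

// Return every label that was declared as computed but never put.
// A later scrub of such a label would find nothing to scrub.
std::vector<std::string> missingPuts(const ToyTask& task) {
    std::vector<std::string> missing;
    for (const std::string& label : task.declaredComputes) {
        if (task.actualPuts.count(label) == 0) {
            missing.push_back(label);
        }
    }
    return missing;
}
```

    In this toy model, a task that calls computes("TempX_FC") but never
    calls put("TempX_FC") would be reported by missingPuts.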


    Question 2: ( 12/02 -- Dav )

  • Is there a way to synchronize screen output from multiple
    nodes/processors on the cluster? Otherwise output is unreadable.

    Answer:

  • For shared memory (threads), you can use a mutex to separate output.
    You would do something like this:

    > extern Mutex cerrLock; // at the top of your .cc file.

    In the code where you want output sync'd:

    > cerrLock.lock();
    > cerr << "Caught exception: " << e.message() << '\n';
    > cerrLock.unlock();

    However, this won't work with separate MPI processes. Usually MPI
    itself separates the output. Sometimes there is a flag (to mpirun)
    that tells MPI to put the processor number in front of output.
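
    The same locking idea can be sketched with the standard C++ library
    instead of the Mutex class used above. This is only an illustration for
    the shared-memory (threads) case: outputLock, lockedPrint and runWorkers
    are made-up names, and std::mutex is swapped in for Mutex.

```cpp
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// One lock shared by every thread that writes to the same stream
// (plays the role of cerrLock above).
std::mutex outputLock;

// Write one complete line while holding the lock, so that lines from
// different threads cannot interleave.
void lockedPrint(std::ostream& out, const std::string& line) {
    std::lock_guard<std::mutex> guard(outputLock);  // released on scope exit
    out << line << '\n';
}

// Start nThreads workers that each print linesPerThread tagged lines.
void runWorkers(std::ostream& out, int nThreads, int linesPerThread) {
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        workers.emplace_back([&out, t, linesPerThread]() {
            for (int i = 0; i < linesPerThread; ++i) {
                // Build the whole message first, then print it in one go.
                std::ostringstream msg;
                msg << "[thread " << t << "] step " << i;
                lockedPrint(out, msg.str());
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
}
```

    Calling runWorkers(std::cerr, 4, 3) (with <iostream> included) should
    print whole lines only, in whatever order the threads happen to run.
    Tagging each line with the thread number mimics what an MPI rank prefix
    would give you.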



    Question 3: ( (Date Not Specified) -- (Author Not Specified) )

  • How do patches/levels/grids work together? Data-storage? Data Warehouse?

    Answer:

  • The patches basically just "know" about themselves: how big they
    are. The levels manage the patches and the grid manages the
    levels. BUT none of them know anything about the data. The only one
    with access to the data is the data warehouse (and for that it has
    vectors for each variable type).


    It has vectors (or whatever data structure, depending upon the variable)
    for each variable AND each patch and material. In other words, you have
    to specify the variable's "label" (its name essentially), a patch, and
    a material for each chunk of data you grab from the data archiver. So
    basically, the data is associated with each patch. This is essentially
    equivalent, I believe, to having the patch point to the data directly
    (but not quite as efficient perhaps). I'm not sure of all the reasons
    for doing it this way, but I can understand it in terms of encapsulation
    purposes (a patch defines a spatial box that can have data associated
    with it, not a container of the data itself -- I dunno).


    Why was this kind of layout chosen? In particular, why don't the
    levels or patches "know" somehow about "their" data (at least in
    the form of pointers or whatever)?



    The data in the UCF can be very transient. The patch (plus material
    number) is used as a key to index into the data warehouse. However,
    there can be more than one data warehouse associated with each
    level/patch. This can include multiple time-steps, multiple
    iterations over the domain, and so forth. The UCF is quite different
    than other approaches, and this is one aspect of that.
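
    A rough sketch of the "patch plus material as a key" idea, assuming a
    plain map keyed by (label, patch id, material index). ToyDataWarehouse
    and VarKey are hypothetical names for illustration and are much simpler
    than the real data warehouse; patches are reduced to integer ids.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical key: (variable label, patch id, material index).
using VarKey = std::tuple<std::string, int, int>;

// Toy warehouse: it owns the data. Patches hold no data themselves and
// appear here only as integer ids used for lookup.
class ToyDataWarehouse {
public:
    // Store (or overwrite) the data for one label/patch/material.
    void put(const std::string& label, int patch, int matl,
             const std::vector<double>& values) {
        data_[VarKey(label, patch, matl)] = values;
    }

    // Look up the data; throws if nothing was stored under that key.
    const std::vector<double>& get(const std::string& label, int patch,
                                   int matl) const {
        auto it = data_.find(VarKey(label, patch, matl));
        if (it == data_.end()) {
            throw std::runtime_error(label + " not found");
        }
        return it->second;
    }

private:
    std::map<VarKey, std::vector<double>> data_;
};
```

    Because the patch is only a key, two warehouses (say, one per timestep)
    can each hold different data for the same patch and material, which
    would be awkward if the patch pointed at its data directly.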



    Question 4: ( (Date Not Specified) -- W. Witzel )

  • Scrubbing the Data Warehouse (Possibly out of date info)

    Answer:

  • > > For what exactly is the bool parameter init_timestep in
    > > Scheduler::(const ProcessorGroup* pc, bool init_timestep = 0)
    > > good, and when should I set that to true/false?
    >
    > Looks like a Wayne thing. It is used to clear the scrub lists in the
    > detailed task graph. You don't clear them out on the initial time
    > step. Not sure why... Wayne can probably fill us in when he gets
    > back.
    >
    > Set it to true the very first time you call compile. False after
    > that.

    I thought it was a Steve thing. It could have been me -- I don't
    remember.

    Here's the scoop. Any variable that isn't required at the beginning of
    the next timestep can be scrubbed after its use in the current timestep
    is finished. How do you know what is required at the beginning of the next timestep?
    Well, we basically just assume for now that the next timestep will have
    the same taskgroup as the current timestep (at least as far as what is
    required from the previous timestep). However, the initialization
    timestep is different. It doesn't require anything from before. It just
    initializes all of the variables. If we used the same strategy for the
    initialization timestep, then everything would get scrubbed and you would
    lose everything you initialized. So, our simple solution, it appears, is
    just don't scrub anything on the initialization timestep that could
    possibly be used later.

    Probably more than you needed to know, but there you go.
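
    The scrubbing rule described above can be sketched roughly like this.
    scrubCandidates and its parameters are hypothetical, and the sketch
    simplifies the initialization case to "scrub nothing at all"; the real
    scheduler works on the detailed task graph, not on sets of names.

```cpp
#include <set>
#include <string>

// Decide which variables may be scrubbed once their last use in the
// current timestep is done.
//   computedNow      - every variable produced in this timestep
//   requiredNextStep - variables the next timestep needs from this one
//                      (assumed to match what this timestep required
//                      from the previous one)
//   initTimestep     - true for the initialization timestep
std::set<std::string> scrubCandidates(
    const std::set<std::string>& computedNow,
    const std::set<std::string>& requiredNextStep,
    bool initTimestep) {
    std::set<std::string> scrub;
    if (initTimestep) {
        // Simplification: keep everything that was just initialized.
        return scrub;
    }
    for (const std::string& name : computedNow) {
        if (requiredNextStep.count(name) == 0) {
            scrub.insert(name);
        }
    }
    return scrub;
}
```

    With initTimestep set to false, anything not needed by the next
    timestep is a scrub candidate; with it set to true, the result is empty
    and the freshly initialized variables survive.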



    Question 5: ( 12/02 -- Dav )

  • SUS aborts instead of throwing an exception. What's going on?

    Answer:

  • When we link sus (due to our LD_LIBRARY_PATH) we link against
    /usr/sci/local/lib/libgcc_s.so.1.

    Looking at this file, it appears (though I'm not 100% certain) that
    this is a libgcc built for gcc 3.0. If I unset my LD_LIBRARY_PATH,
    and compile my test program (I haven't tried this with sus yet), it
    starts catching exceptions. (I assume sus will too.)
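
    A test program for this can be very small. The following is a guess at
    what such a check might look like, not the original test program;
    throwAndCatch is a made-up name.

```cpp
#include <stdexcept>
#include <string>

// Throw a standard exception and catch it again. If exception handling
// works in the current link environment, the message is returned; if the
// runtime libraries are mismatched, the process may abort instead.
std::string throwAndCatch(const std::string& message) {
    try {
        throw std::runtime_error(message);
    } catch (const std::exception& e) {
        return e.what();
    }
    return "";  // not reached; keeps compilers from warning
}
```

    Call it from main and print the result, once with LD_LIBRARY_PATH set
    as usual and once with it unset. If the first run aborts and the second
    prints the message, the libgcc being picked up is the likely culprit.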



    Question 6: ( Nov 2002 -- S. Parker )

  • I am getting an assertion failed error message:

    An