Information/Instructions/LLNL/batch

Queue information

Hera - AMD Opteron quad core, 16 procs/node

  • pbatch queue: 254 nodes, 24 hour limit
  • scratch dir: /p/lscratch(*)
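
The procs/node figure determines how many nodes to request for a given number of MPI processes. The following is a minimal Python sketch of that arithmetic using the Hera numbers listed above; the function names `nodes_needed` and `fits_pbatch` are hypothetical helpers, not LLNL tools, and the check only tells you whether a request could fit in the queue at all, not whether it will be scheduled.

```python
import math

# Figures copied from the Hera entry above.
HERA_PROCS_PER_NODE = 16
HERA_PBATCH_NODES = 254


def nodes_needed(total_procs, procs_per_node=HERA_PROCS_PER_NODE):
    """Smallest whole number of nodes that can hold total_procs processes."""
    if total_procs < 1:
        raise ValueError("total_procs must be at least 1")
    return math.ceil(total_procs / procs_per_node)


def fits_pbatch(total_procs, queue_nodes=HERA_PBATCH_NODES):
    """True if the request needs no more nodes than the pbatch queue has."""
    return nodes_needed(total_procs) <= queue_nodes


if __name__ == "__main__":
    for procs in (2, 16, 17, 2048):
        print(procs, "procs ->", nodes_needed(procs), "node(s)")
```

For the other machines, pass their procs/node value as the second argument, for example `nodes_needed(64, 8)` for Zeus.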

WARNING: all the following machines are going away very soon (Dec. 08). Use HERA.

ALC - linux x86, 2 procs/node

  • pdebug queue: 8 nodes, interactive only, no time limit
  • pbatch queue: 454 nodes, batch only, 24 hour limit
  • scratch dir: /p/ga1

Thunder - itanium, 4 procs/node

  • pdebug queue: 16 nodes, interactive only, no time limit
  • pbatch queue: 986 nodes, batch only, 12 hour (24 on weekends) limit
  • scratch dir: /p/gt1

uP - IBM power5, 8 procs/node

  • pdebug queue: max 2 nodes/job, 2 hour limit
  • pbatch queue: max 32 nodes/job, 12 hour limit
  • scratch dir: /p/gup1

Zeus - AMD Opteron 2.4GHz, 8 procs/node

  • pdebug queue: max 16 nodes/job, 0.5 hour limit
  • pbatch queue: max 260 nodes/job, 12 hour limit
  • scratch dir: /p/lscratch(*)

IMPORTANT

If your job hangs when running on multiple procs/node, add this to your .cshrc file:

setenv LIBELAN_SHM_ENABLE 0

On large (2048-processor) Atlas runs, the job would hang unless the following was also set:

setenv LIBELAN_SHM_BIGMSG 2G

Running in batch mode

To run in batch mode, you'll need to make a batch script. Here's a simple batch script that you can copy and make your modifications to (see [1] for more options):


MOAB script
______________________________
#!/bin/csh
# script to be submitted with msub
#MSUB -N 10.45                    # sets job name
#MSUB -l walltime=00:10:00        # requested wallclock time
#MSUB -l nodes=1                  # number of nodes
#MSUB -V                          # export current env var settings
#MSUB -r n                        # do not rerun job after system reboot
#MSUB -j oe                       # merge stderr into stdout in the job's output file
#MSUB -M t.likestoplay@gmail.com  # email list (not sure that this works)
#MSUB -m b                        # send mail when job starts
#MSUB -m e                        # send mail when job ends
##MSUB -A utahdat
##MSUB -l qos=expedite
#MSUB # no more msub commands

set echo
echo LCRM job id = $SLURM_JOBID

setenv SCI_DEBUG "ProgressiveWarning:-,ComponentTimings:+,BNRStats:+"
setenv LIBELAN_SHM_BIGMSG 2G

# name of output file
set OUT = "out.10.45"

cd /p/lscratchb/harman/nodeTest

srun -N1 -n2 sus_atlas -mpi advect.ups >& $OUT

echo "ALL DONE"

If you are running on uP, omit the "srun -N<nodes> -n<procs>" prefix from the sus command line. To specify the number of required processors, add the following line to the .pbs script:

setenv SLURM_NPROCS <number of processors>

Also, run-time errors have been seen on uP when a file with the same name as the output file already exists in the working directory.

If your job will generate a lot of data, add the following line to your batch script, right before the call to sus, so that the job runs in the scratch directory:

cd <scratch-dir>/<username>

To submit the job, run

msub batch.pbs
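
When running many similar jobs, it can be convenient to generate the batch script rather than edit it by hand. The following is a minimal Python sketch, not a C-SAFE or LLNL tool: `render_batch_script` and its parameter names are hypothetical, and it emits only directives and commands that already appear in the example script above.

```python
def render_batch_script(job_name, walltime, nodes, procs, workdir,
                        sus_path, ups_file, out_file):
    """Return the text of a csh batch script suitable for msub.

    All parameter names are illustrative. The directives mirror the
    hand-written example script; nothing new is introduced.
    """
    lines = [
        "#!/bin/csh",
        f"#MSUB -N {job_name}",            # job name
        f"#MSUB -l walltime={walltime}",   # requested wallclock time
        f"#MSUB -l nodes={nodes}",         # number of nodes
        "#MSUB -V",                        # export current environment
        "#MSUB -j oe",
        "",
        f"cd {workdir}",
        f"srun -N{nodes} -n{procs} {sus_path} -mpi {ups_file} >& {out_file}",
        'echo "ALL DONE"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the settings of the example script above.
    print(render_batch_script("10.45", "00:10:00", 1, 2,
                              "/p/lscratchb/harman/nodeTest",
                              "sus_atlas", "advect.ups", "out.10.45"))
```

Write the returned text to a file such as batch.pbs and submit it with msub as described above. On uP the srun prefix would need to be dropped, which this sketch does not handle.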

Job/Queue status

To see how many nodes are available, type:

ju

To see the status of your job, type:

pstat

To see the status of all jobs, type:

pstat -m

The following command is useful for seeing the status of all jobs in priority order, so you know where you stand in line and what your job's priority is:

pstat -o jid,name,user,bank,status,nodes,timeleft,maxcputime,priority -malc -s priority

To save typing, add the following alias to your ~/.cshrc.linux file:

alias spj 'pstat -o jid,name,user,bank,status,nodes,timeleft,maxcputime,priority -malc -s priority'

Then source the file:

source ~/.cshrc.linux

To remove a queued or running job, type:

prm [optional job-id]

Running an interactive/debug job

ALC, Thunder, Zeus

You can only run interactively on the pdebug pool, which has fewer nodes. To do so, call srun directly and tell it to use the pdebug pool:

srun -N<#nodes> -n<#procs> -p pdebug <path-to-sus>/sus -mpi -<algorithm> <upsfile>

uP

You can only run interactively on the pdebug pool, so set the MP_RMPOOL environment variable to pdebug and MP_NODES to the number of nodes you want:

setenv MP_RMPOOL pdebug

setenv MP_NODES <num-nodes>

To specify the number of processors, set either the total number of processors or the number of processors to run per node:

setenv MP_PROCS <total-num-procs>

or

setenv MP_TASKS_PER_NODE <procs-per-node>

You should now be able to run interactively on the debug node simply by calling sus. For example:

sus -mpi -mpm bigbar.ups
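
The uP settings above can be collected into a small helper that prints the setenv lines for a given request. This is a minimal Python sketch under the assumptions stated in this section (pdebug pool, and exactly one of the two processor settings); the function name `up_env` and its parameters are hypothetical.

```python
def up_env(num_nodes, total_procs=None, procs_per_node=None):
    """Return csh setenv lines for an interactive run on uP.

    Exactly one of total_procs or procs_per_node must be given,
    matching the either/or choice described above.
    """
    if (total_procs is None) == (procs_per_node is None):
        raise ValueError("give exactly one of total_procs or procs_per_node")
    lines = [
        "setenv MP_RMPOOL pdebug",
        f"setenv MP_NODES {num_nodes}",
    ]
    if total_procs is not None:
        lines.append(f"setenv MP_PROCS {total_procs}")
    else:
        lines.append(f"setenv MP_TASKS_PER_NODE {procs_per_node}")
    return lines


if __name__ == "__main__":
    # Two nodes, eight processes on each.
    print("\n".join(up_env(2, procs_per_node=8)))
```

Paste the printed lines into your shell, or source them from a file, before calling sus.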


Getting Job Names

With the update from PBS to Moab, it has become more difficult to get a list of jobs running in the queue. According to LLNL, the best way to do this is still the backward-compatible "pstat" command. You can also try the "squeue" tool, and the "sqlog" tool can be used to see jobs that have already completed.

FYI, "pstat" is a wrapper that parses the output of "mdiag -j --format=xml", but I don't recommend using that command directly, as it gives too much information in an unreadable format.

Barbara@LLNL: So for now, the best we have is pstat, squeue, and sqlog. Sorry there isn't a good Moab command to show the job name--I've pushed back that there should be one, but don't know if it will do any good.


