Information/Instructions/LLNL/batch
From C-SAFE Wiki
Queue information
Hera - AMD Opteron quad core, 16 procs/node
- pbatch queue: 254 nodes, 24 hour limit
- scratch dir: /p/lscratch(*)
WARNING: All of the following machines are going away very soon (Dec. 08). Use Hera instead.
ALC - linux x86, 2 procs/node
- pdebug queue: 8 nodes, interactive only, no time limit
- pbatch queue: 454 nodes, batch only, 24 hour limit
- scratch dir: /p/ga1
Thunder - itanium, 4 procs/node
- pdebug queue: 16 nodes, interactive only, no time limit
- pbatch queue: 986 nodes, batch only, 12 hour (24 on weekends) limit
- scratch dir: /p/gt1
uP - IBM power5, 8 procs/node
- pdebug queue: max 2 nodes/job, 2 hour limit
- pbatch queue: max 32 nodes/job, 12 hour limit
- scratch dir: /p/gup1
Zeus - AMD Opteron 2.4GHz, 8 procs/node
- pdebug queue: max 16 nodes/job, 0.5 hour limit
- pbatch queue: max 260 nodes/job, 12 hour limit
- scratch dir: /p/lscratch(*)
IMPORTANT
If your job hangs when running on multiple procs/node, add this to your .cshrc file:
setenv LIBELAN_SHM_ENABLE 0
On large Atlas runs (2048 processors), jobs would hang unless the following was also set:
setenv LIBELAN_SHM_BIGMSG 2G
Running in batch mode
To run in batch mode, you'll need to make a batch script. Here's a simple batch script that you can copy and make your modifications to (see [1] for more options):
MOAB script
______________________________
#!/bin/csh
# script to be submitted with msub
#MSUB -N 10.45                      # sets job name
#MSUB -l walltime=00:10:00          # requested wallclock time
#MSUB -l nodes=1                    # number of nodes
#MSUB -V                            # export current env var settings
#MSUB -r n                          # do not rerun job after system reboot
#MSUB -j oe                         # send output log directly to file
#MSUB -M t.likestoplay@gmail.com    # email list (not sure that this works)
#MSUB -m b                          # send mail when job starts
#MSUB -m e                          # send mail when job ends
##MSUB -A utahdat
##MSUB -l qos=expedite
#MSUB                               # no more msub commands

set echo
echo LCRM job id = $SLURM_JOBID
setenv SCI_DEBUG "ProgressiveWarning:-,ComponentTimings:+,BNRStats:+"
setenv LIBELAN_SHM_BIGMSG 2G

# name of output file
set OUT = "out.10.45"

cd /p/lscratchb/harman/nodeTest
srun -N1 -n2 sus_atlas -mpi advect.ups >& $OUT

echo "ALL DONE"
If you are running on uP, omit the "srun -N<nodes> -n<procs>" prefix from the sus command line. To specify the number of required processors, add the following line to the .pbs script instead:
setenv SLURM_NPROCS <number of processors>
Also, runtime errors have been seen on uP when a file with the same name as the output file already exists in the working directory.
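Putting the uP notes together, the body of a uP batch script might look like the sketch below. The processor count (16, i.e. 2 nodes at 8 procs/node), the output file name, and the plain "sus" executable name are illustrative assumptions, not values from a tested run. The #MSUB header lines are left out for brevity.

```csh
#!/bin/csh
# Sketch of the uP-specific part of a batch script (values are illustrative).

# Request the processor count through the environment instead of srun.
# 16 is an assumed value: 2 nodes x 8 procs/node on uP.
setenv SLURM_NPROCS 16

# Name of the output file (hypothetical).
set OUT = "out.advect"

# Remove a stale output file first, since an existing file with the
# same name has been seen to cause runtime errors on uP.
if ( -e $OUT ) rm -f $OUT

# Note: no "srun -N<nodes> -n<procs>" prefix on uP.
sus -mpi advect.ups >& $OUT
```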
If you are going to be generating a lot of data, then you should add the following line to your batch job file, right before the call to sus:
cd <scratch-dir>/<username>
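As a rough sketch, the scratch step can also create the directory the first time it is needed. The /p/lscratchb root is taken from the example script on this page. Whether you may create your own directory under it is an assumption, so substitute the scratch dir listed for your machine.

```csh
# Assumed scratch location; replace with the scratch dir for your machine.
set SCRATCH = /p/lscratchb/$USER

# Create the directory on first use, then run from it.
if ( ! -d $SCRATCH ) mkdir -p $SCRATCH
cd $SCRATCH
```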
To submit the job, run
msub batch.pbs
Job/Queue status
To see how many nodes are available, type:
ju
To see the status of your job, type:
pstat
To see the status of all jobs, type:
pstat -m
The following command is useful for seeing the status of all jobs in priority order, so you know where you stand in line and what your job's priority is:
pstat -o jid,name,user,bank,status,nodes,timeleft,maxcputime,priority -malc -s priority
To avoid retyping this, add the following alias to your ~/.cshrc.linux file:
alias spj 'pstat -o jid,name,user,bank,status,nodes,timeleft,maxcputime,priority -malc -s priority'
Then source the file so the alias takes effect:
source ~/.cshrc.linux
To remove a queued or running job, type:
prm [optional job-id]
Running an interactive/debug job
ALC, Thunder, Zeus
You can only run interactively on the pdebug pool, which has fewer nodes. All you have to do is call srun directly and tell it to use the pdebug pool:
srun -N<#nodes> -n<#procs> -p pdebug <path-to-sus>/sus -mpi -<algorithm> <upsfile>
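For instance, with the placeholders filled in for a small ALC run (2 nodes at 2 procs/node), the command might look like this. The ./sus path, the -mpm algorithm flag, and the bigbar.ups input are borrowed from the uP example on this page purely for illustration.

```csh
# 2 nodes x 2 procs/node = 4 MPI tasks on the pdebug pool.
# ./sus, -mpm and bigbar.ups are placeholder values.
srun -N2 -n4 -p pdebug ./sus -mpi -mpm bigbar.ups
```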
uP
You can only run interactively on the pdebug pool, so set the MP_RMPOOL environment variable to pdebug and MP_NODES to the number of nodes you want:
setenv MP_RMPOOL pdebug
setenv MP_NODES <num-nodes>
To specify the number of processors you want, you can either specify the total number of processors or the number of processors that you want to run per node.
setenv MP_PROCS <total-num-procs>
or
setenv MP_TASKS_PER_NODE <procs-per-node>
You should now be able to run interactively on the debug node simply by calling sus. For example:
sus -mpi -mpm bigbar.ups
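As a worked example of the settings above, here is a sketch of a 2-node debug session that uses all 8 procs on each uP node. The node and processor counts are assumptions chosen to fit the pdebug limit of 2 nodes/job.

```csh
# Use the interactive debug pool with 2 nodes.
setenv MP_RMPOOL pdebug
setenv MP_NODES 2

# Either give the total: 2 nodes x 8 procs/node = 16 ...
setenv MP_PROCS 16
# ... or give the per-node count instead (use one or the other):
# setenv MP_TASKS_PER_NODE 8

sus -mpi -mpm bigbar.ups
```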
Getting Job Names
With the update from PBS to Moab, it has become more difficult to get a list of the jobs running in the queue. According to LLNL, the best way to do this is to continue using the backward-compatible "pstat" command. You can also try the "squeue" tool, and there is an "sqlog" tool for viewing jobs that have already completed.
FYI, "pstat" is a wrapper that parses the output of "mdiag -j --format=xml". I don't recommend running that command directly, as it gives too much information in an unreadable format.
Barbara@LLNL: So for now, the best we have is pstat, squeue, and sqlog. Sorry there isn't a good Moab command to show the job name--I've pushed back that there should be one, but don't know if it will do any good.
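If you try squeue, a format string can put the job name in its own column. This is only a sketch: it assumes the installed squeue supports the -u and -o options, and the column widths are arbitrary.

```csh
# List your own jobs with job id, job name, state, and node count.
# %i = job id, %j = job name, %T = job state, %D = number of nodes.
squeue -u $USER -o "%.10i %.30j %.10T %.6D"
```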
