
SLURM

SLURM is used as the job scheduler in our clusters. The most important commands are briefly described here. There is a lot of information available on the Internet and in the man page of each command (press q to exit).

Example:

man sbatch

Typically jobs are started with a job script which contains the SLURM options and the command(s) that start the calculation. The script is submitted with the sbatch command.

SLURM usage

sinfo

With this command you can see the available partitions (queues) in the cluster. The partitions are typically formed based on the different hardware of the machines or their intended usage. The partitions available on each cluster are listed on that cluster's detail page; links to the detailed cluster information can be found on the front page.
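
For example, a summarized view of the partitions, or the details of a single partition (here the phase1 partition used in the template below, assuming it exists on your cluster), can be printed with:

sinfo -s
sinfo -p phase1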

squeue

With this command you can see the current job queue of the cluster. Your job will show up in the list with the name you have given it in the SLURM job script.
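
To list only your own jobs, the output can be restricted by username:

squeue -u <your username>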

sbatch

With this command you submit your job to the queue. SLURM reserves resources as requested in the job script file.

sbatch job_script_name.sh

scancel

With this command you can cancel your job. Check the JOBID of your job with squeue and cancel it with

scancel JOBID
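
If needed, all of your own jobs can be cancelled at once by username:

scancel -u <your username>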

Setting resources

CPU

Parallel calculations are considered efficient if you get at least a 1.5× speedup when doubling the number of CPU cores.

Let's consider two cases (a numerical example is given after the list):

  • The calculation code produces X results per hour with c cores and Y results per hour with 2*c cores. The scaling is efficient if Y ≥ 1.5 * X.
  • The calculation code finishes in H hours with c cores and in I hours with 2*c cores. The scaling is efficient if H / I ≥ 1.5.
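
For example (with illustrative numbers): if a code produces 100 results per hour on 8 cores and 160 results per hour on 16 cores, the speedup from doubling the cores is 160 / 100 = 1.6 ≥ 1.5, so using 16 cores is still efficient. If it only produced 130 results per hour, the speedup would be 1.3 and part of the extra cores would be wasted.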

The availability of software licenses may set another practical limit.

Memory

Check whether your software reports how much memory the calculation uses, add some margin on top of that, and use the resulting value to set the memory limit.
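
For example (illustrative numbers): if the software reports a peak memory usage of about 3500 MB per process, requesting --mem-per-cpu=4000 in the job script leaves a reasonable margin without wasting much memory.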

While the calculation is running, its memory usage can be checked with sstat.

sstat -j JOBID

If the calculation has already ended, use sacct instead. MaxRSS is the most interesting value.

sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize
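
To limit the output to a single job, add the -j option with the JOBID:

sacct -j JOBID --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize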

Template job scripts

These are general example scripts. Remember to check the page of each software in case there is a more specific template to be used with it. Lines starting with #SBATCH contain SLURM options; otherwise the job scripts are normal shell scripts.

Template scripts are provided for Serial, Parallel (MPI), Parallel (OpenMP) and Parallel (MPI + OpenMP) jobs. The example below is the Parallel (MPI) template.

#!/bin/csh
###
### job script example with 4 cores on exactly 1 node
###
 
## name of your job
#SBATCH -J <job name>
 
## system error message output file
## leave %j as it is; it is replaced by the job ID number
#SBATCH -e <job name>.%j.std.err
 
## system message output file
#SBATCH -o <job name>.%j.std.out
 
## send mail after job is finished
#SBATCH --mail-type=end
#SBATCH --mail-user=<your LUT/cluster username>@lut.fi
 
## memory limit per allocated CPU core
## try to put this limit as low as reasonably achievable
## if the limit is too low the calculation will fail, if too high resources are wasted
## limit is specified in MB
## example: 1 GB is 1000
#SBATCH --mem-per-cpu=1000
 
## how long a job takes, wallclock time d-hh:mm:ss
#SBATCH -t 1-00:00:00
 
## number of nodes (if necessary)
## -N 1 (job runs on exactly one node)
## -N <minnodes>-<maxnodes> (range of nodes)
#SBATCH -N 1
 
## number of cores
#SBATCH -n 4
 
## name of queue 
#SBATCH -p phase1
 
## load necessary environment modules 
module load greatsoftware/1.5
 
## change directory to your calculation directory
cd /home/<user name>/<calculation directory>/<case directory>
 
## run my MPI executable
srun --mpi=pmi2 <executable of your software> <software options>
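
The Serial, Parallel (OpenMP) and Parallel (MPI + OpenMP) templates are not reproduced here. As a rough sketch only (typical SLURM usage, not this cluster's official template), an OpenMP job would normally request its threads with --cpus-per-task instead of several tasks and set OMP_NUM_THREADS before starting the program, for example:

## one task with 4 OpenMP threads (sketch, adapt to your software)
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --cpus-per-task=4
 
## tell the program how many threads it may use
setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK
<executable of your software> <software options>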
 