SLURM is used as the job scheduler on our clusters. The most important commands are briefly described here. There is a lot of information available on the Internet and in the man page of each command (press q to exit the man page).
Example:
man sbatch
Typically, a job is started with a job script that contains the SLURM options and the command(s) that start the calculation. The script is submitted with the sbatch command.
With the sinfo command you can see the available partitions (queues) in the cluster. The partitions are typically formed based on the different hardware in the machines or on the intended usage. The partitions available on each cluster are listed on the detail page of that cluster; links to the detailed cluster information are on the front page.
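Example (lists all partitions and their current state):
sinfo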
With the squeue command you can see the current job queue of the cluster. Your job will show up in the list with the name you gave it in the SLURM job script.
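Example (lists only your own jobs):
squeue -u <your username>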
With the sbatch command you submit your job to the queue. SLURM reserves the resources requested in the job script file.
sbatch job_script_name.sh
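If the submission succeeds, sbatch prints the ID of the new job, for example (the number here is only an illustration):
Submitted batch job 123456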
With the scancel command you can cancel your job. Check the JOBID of your job with squeue and cancel it with
scancel JOBID
A parallel calculation is usually considered to scale efficiently if you get at least a 1.5x speedup when doubling the number of CPU cores.
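As a rough worked example of this rule: if a case takes 60 minutes on 8 cores, it should finish in about 40 minutes (60 / 1.5) on 16 cores; if it does not, the additional cores are not being used efficiently.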
Let's consider two cases:
The availability of software licenses may set another practical limit on the number of cores you can use.
Check whether your software reports how much memory the calculation uses, add some safety margin on top of that, and use this value to set the memory limit in your job script.
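For example (the numbers here are only an illustration), if the software reports using about 3500 MB per core, a reasonable limit in the job script would be:
#SBATCH --mem-per-cpu=4000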
While the calculation is running, its memory usage can be checked with sstat.
sstat -j <JOBID>
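For example, to show only the memory-related fields of a running job (depending on the SLURM version you may need to give the job step explicitly, e.g. <JOBID>.batch):
sstat -j <JOBID> --format=JobID,MaxRSS,AveRSS,MaxVMSize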
If the calculation has already ended, use the sacct command instead. MaxRSS (maximum resident memory) is the most interesting value.
sacct --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize
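To limit the output to a single finished job, give the job ID with the -j option:
sacct -j <JOBID> --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize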
These are general example scripts. Remember to check the page of each software in case there is a more specific template to be used with that software. Lines starting with #SBATCH contain SLURM options; otherwise job scripts are normal shell scripts.
#!/bin/bash -l
###
### job script example with 4 cores on exactly 1 node
###
## name of your job
#SBATCH -J <job name>
## system error message output file
## leave %j as it is; it gets replaced by the job ID number
#SBATCH -e <job name>.%j.std.err
## system message output file
#SBATCH -o <job name>.%j.std.out
## send mail after the job is finished
#SBATCH --mail-type=end
#SBATCH --mail-user=<your LUT/cluster username>@lut.fi
## memory limit per allocated CPU core
## try to keep this limit as low as reasonably achievable:
## too low and the calculation will fail, too high and resources are wasted
## the limit is specified in MB
## example: 1 GB is 1000
#SBATCH --mem-per-cpu=1000
## how long the job may take, wallclock time d-hh:mm:ss
#SBATCH -t 1-00:00:00
## number of nodes (if necessary)
## -N 1 (job runs on exactly one node)
## -N <minnodes>-<maxnodes>
#SBATCH -N 1
## number of cores
#SBATCH -n 4
## name of queue (partition)
#SBATCH -p phase1
## load necessary environment modules
module load greatsoftware/1.5
## change to your calculation directory
cd /home/<user name>/<calculation directory>/<case directory>
## run my MPI executable
srun --mpi=pmi2 <executable of your software> <software options>
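If your software does not run in parallel, a single-core job script is even simpler. The following is only a minimal sketch: the partition name, module name, time limit and paths are the same placeholders as above and must be replaced with the values valid on your cluster and software.

#!/bin/bash -l
###
### job script sketch for a serial job on 1 core
###
## name of your job
#SBATCH -J <job name>
## error and output files (%j is replaced by the job ID number)
#SBATCH -e <job name>.%j.std.err
#SBATCH -o <job name>.%j.std.out
## memory limit per allocated CPU core, in MB
#SBATCH --mem-per-cpu=1000
## wallclock time limit d-hh:mm:ss
#SBATCH -t 0-12:00:00
## one node, one core
#SBATCH -N 1
#SBATCH -n 1
## name of queue (partition), check the detail page of your cluster
#SBATCH -p phase1
## load necessary environment modules
module load greatsoftware/1.5
## change to your calculation directory
cd /home/<user name>/<calculation directory>/<case directory>
## a serial executable is started directly, without srun/MPI
<executable of your software> <software options>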