HPC Getting Started: Running Jobs
High-performance parallel computing programs should almost always be run in batch mode. When you log in to the cluster you land on one of the login nodes, and if you execute a run command there, your program runs on that login node. A program that uses a lot of resources, as most HPC programs do, will very likely bog down or even crash the login node, affecting everyone logged in to it. It will also run slowly because of the other users sharing that node. You can have your computing privileges restricted for doing this.
Batch jobs are controlled by scripts and are submitted to the batch system, which manages the computing resources and schedules the job to run. In general, batch systems work on a first-in, first-out basis, subject to some constraints.
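Since the queue commands later on this page (squeue, scancel) are SLURM commands, a typical batch submission might look like the following sketch. The job name, resource values, and program name (my_program) are placeholders, and your site may require additional directives such as a partition or account:

```shell
#!/bin/bash
#SBATCH --job-name=my_job        # job name shown in the queue
#SBATCH --nodes=2                # number of nodes to allocate
#SBATCH --ntasks-per-node=16     # tasks per node (assumes 16-core nodes)
#SBATCH --time=04:00:00          # wall-clock limit (hh:mm:ss)
#SBATCH --output=my_job_%j.out   # output file; %j expands to the job ID

cd "$SLURM_SUBMIT_DIR"           # run from the directory the job was submitted in
srun ./my_program                # launch the parallel program on the allocation
```

You would submit this script with sbatch (for example, sbatch job.sh) from your scratch directory; the batch system then queues and schedules it for you.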
See also the Batch Job FAQ list.
What not to do and what to watch:
- Do not run jobs on the login node that take more than 120 minutes; ideally, keep them much shorter than this.
- Do not run parallel jobs on the login node.
- Do not run a single serial job alone on a node, leaving 15 of its processors idle. This is a major waste of resources.
- Check your jobs by logging in to one of their nodes and running top. Look at the load balance between nodes and at memory usage. If you do not know how to fix problems you see, seek advice.
- Run jobs from your scratch directory; its read/write performance is much faster than your home directory's. This matters because a job that reads and writes many files from your home directory can significantly slow the entire system for everyone.
- If you expect a job to take longer than a day, check it regularly to make sure it is progressing and has not hung. Unidentified hung jobs can tie up resources that other users need for long periods.
- If you have a lot of serial jobs, bundle them as described below. Likewise, if you have a lot of single-node jobs, bundle them as described below.
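One common way to bundle serial jobs is to launch them as background processes inside a single job script and wait for all of them to finish, so one job fills a whole node. This is a sketch, not your site's prescribed method; serial_task here is a hypothetical stand-in for your own program:

```shell
#!/bin/bash
# Bundling sketch: run four independent serial tasks concurrently
# on the cores of one node, then wait for all of them to finish.
serial_task() {                      # hypothetical stand-in for a real serial program
    echo "result of task $1" > "task_$1.out"
}

for i in 1 2 3 4; do
    serial_task "$i" &               # each task runs in its own background process
done
wait                                 # do not let the job script exit until every task is done
```

With this pattern, one 16-core node can run 16 serial tasks at once instead of wasting 15 cores on a single task.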
Monitoring Batch Jobs
The command squeue | more will show you all of the jobs in the queue, which will include both pending and running jobs.
The command squeue -u userid will show you the jobs that user userid (usually yourself) has in the queue.
The command scancel job_num will cancel the job with the specified number.
To check how a node's resources are being used by your job, log in to one of your job's nodes by typing the ssh n001 command with the proper node number.