Computational Science Community Wiki

Sun Grid Engine — A Batch System

What is a batch system?

Any HPC (or HTC) system — usually a cluster of machines — needs a means of sharing computational resources fairly between users; without, there would be anarchy. Batch-queueing systems — usually abbreviated to simply batch systems — are intended to do this.

All batch systems have at least these features:

To use a batch system, each computational job must be defined: the program to run is specified, with any input and output files, arguments, and the required environment (e.g., working directory, PATH and perhaps LD_LIBRARY_PATH). The defined job is then submitted to the batch system, usually to a specified queue.

Traditionally, jobs must be run non-interactively, i.e., in batch mode, though the ability to run queued, interactive jobs may be provided. (Interactive jobs may be run on RCS clusters only where specified in the user documentation.)

On RCS-run HPC clusters, all computationally-intensive activity should be run under the documented batch-queueing system — any computationally-intensive process running outside of the batch system will be killed by the system administrator.

Many batch systems are available, including NQS, LSF, Sun Grid Engine &mdash SGE &mdash and PBS (and, for HTC, Condor). SGE is the batch system used on Mace01, Man2e and Redqueen.

SGE Overview

Sun Grid Engine is so-named because it is the engine of Sun Grid, an on-demand grid computing service operated by Sun Microsystems. However, SGE is otherwise (in the opinion of the author) a misnomer — SGE is a batch-queueing system; it is not grid middleware.

SGE is an open-source community effort sponsored by Sun Microsystems and hosted by CollabNet at gridengine.sunsource.net; it is free (as in beer) to download and use. Some features of SGE are introduced in this section, below.

Scheduler, Queues and Slots

SGE includes both a scheduler for allocating resources (CPUs!) to computational jobs and a queueing mechanism. Each queue is associated with a number of slots: one computational process runs in each slot; each compute node in the HPC cluster provides one or more slots.

Parallel Environments

For most parallel jobs, including those using OpenMP and MPI (e.g., OpenMPI or MPICH), and parallel programs such as Fluent and Star-CD, an SGE Parallel Environment must be specified. This PE acts as glue ensuring that SGE and the parallel, i.e., multi-process, program play nicely together.

Setting Up Your Environment

SGE user commands (such as qsub, qstat, qdel) are available by default on RedQueen and the CSF. No additional set up is needed to use the batch system.

*/

A First SGE Job

Submission of a job to SGE is done by using the qsub command, for example

  prompt> qsub hostname-date.sh

where hostname-date.sh is a simple BASH script which defines the job:

  #!/bin/bash

  #
  # -- SGE options :
  #

  #$ -S /bin/bash
  #$ -cwd
  #$ -q serial.q

  #
  # -- the commands to be executed (programs to be run) :
  #

  /bin/hostname
  /bin/date

This script is explained in detail below.

The First Line: #!/bin/bash and #$ -S /bin/bash

These two lines tell SGE to run the script using the BASH command-line interpreter/shell — see below for details.

Output and Error Messages from the Job

Output from the job, once finished, will appear in a file called hostname-date.sh.o<job-ID>, for example, hostname-date.sh.o2590, viz,

      comp007
      Mon Apr 20 16:03:11 BST 2009

Any error messages will appear in hostname-date.sh.e<job-ID> — there should be none in this case.

Comments

Any line beginning with hash symbol, #, is a comment; it will not be executed, i.e., does not form part of the job run.

SGE Options: -cwd and -q serial.q

Any line beginning with hash-dollar, i.e., #$, is a special comment which is understood by SGE to specify something about how or where the job is run. In this case we specify the directory in which the job is to be run (#$ -cwd) and the queue which the job will join (#$ -q serial.q).

Commands to be Executed

Any line which does NOT begin with a hash symbol, #, is executed just as if it were entered manually at the command line. In this case both /bin/hostname and /bin/date are executed, one after another.

Cluster Queues, Queue Instances and Slots

When submitting a job, a SGE queue should be specified. This is a cluster-queue; it will be intended for a particular type of job (e.g., serial or parallel — OpenMP or MPI; short or long; or those requiring particularly large amounts of memory. To list the cluster-queues available on a system enter qconf -sql (show queue list) at the command line, for example

  prompt> qconf -sql

  openmp.q
  parallel-16.q
  parallel.q
  serial-bf.q
  serial-test.q
  serial.q
  serialfat.q

Attributes of each cluster-queue can be found by using qconf, again, for example

  prompt> qconf -sq serial.q

  qname                 serial.q
  hostlist              @serial
  .                     .
  slots                 4
  tmpdir                /tmp
  shell                 /bin/csh
  .                     .
  shell_start_mode      unix_behavior
  .                     .
  .                     .

Each cluster-queue has instances on one or more compute-nodes in the cluster. Each instance "contains" one or more slots — usually the same number as there are CPUs (more accurately, CPU cores) on a compute node.

Example

An example should help make things clear. Suppose we have a cluster on which there are four compute nodes, cat1, cat2, cat2 and cat4. To get a complete listing of queue instances, enter qstat -f:

  prompt> qstat -f

  queuename              qtype resv/used/tot. load_avg arch          states
  ---------------------------------------------------------------------------------
  serial-fat.q@cat3      BIP   0/0/8          0.01     lx24-amd64    
  ---------------------------------------------------------------------------------
  serial.q@cat4          BIP   0/0/8          0.00     lx24-amd64    
  ---------------------------------------------------------------------------------
  parallel.q@cat1        BIP   0/8/8          8.12     lx24-amd64    
  ---------------------------------------------------------------------------------
  parallel.q@cat2        BIP   0/8/8          8.19     lx24-amd64    

The above listing shows that:

Commonly-Used SGE Qsub Options

There are many SGE options which can be specified in a qsub file (using the #$ syntax); these are all described in the manual page (man qsub). The most commonly used options are briefly described below.

-cwd

Execute the job from the current (working) directory — the directory from which the qsub command is issued. If this option is not present, the job will be executed the user's home directory.

-V

This ensures that the environment for your userid on the login node is inherited/passed to the compute node, including the settings applied by loading software modulesfiles.

-j y

Merge the standard error stream into the standard output stream, i.e., job output and error messages are sent to the same file, rather than different files.

-q

Specify the queue to which a job is sent. (Use the command qconf -sql to list all available queues.)

-pe

Specify the SGE parallel-environment to which a job is sent — see the section on parallel jobs.

-S bash

See below.

-N

Sets the job name, e.g. -N my_job_name to set job name to my_job_name.

-hold_jid

Specifies this job is conditional upon completion of a previous job or jobs, e.g. -hold_jid job1_id to submit a job which will not start until job1_id has completed. job1_id can be a job number or job name (job names are specified using the -N flag) and is a comma separated list for multiple jobs.

Job Shell

Every job submitted to SGE runs under the specified commandline-interpreter aka shell (for example BASH or CSH). The syntax of the qsub script depends on this shell.

New users should read (only) the Short Version. More experienced Linux and HPC users may wish to read the Long Version.

Short Version

There are multiple ways by which SGE determines the shell under which a job is run, i.e., the command interpreter for the script. The default shell on almost all current Linux systems is BASH, so a simple, safe policy is to ensure that #!/bin/bash appears as the first line in the script and that #$ -S bash appears below.

  #!/bin/bash

  #$ -S bash

   ...rest of script here...

This policy should ensure that all jobs run under the BASH shell on all HPC systems using SGE as the batch system.

Long Version

There are multiple ways by which SGE determines the shell under which a job is run:

  1. If the queue attribute shell_start_mode is set equal to unix_behavior, then the first line, e.g, #!/bin/bash determines the shell.

  2. If shell_start_mode is set equal to posix_compliant then the shell is determined by:

    1. the -S option, if specified, for example #$ -S /bin/bash;

    2. otherwise, the queue attribute shell is used — this is set to /bin/csh default;

    3. and if none of the above are defined, the SGE default shell is used, again /bin/csh.

  3. All queues on Mace01, Man2e and Redqueen have shell_start_mode set equal to unix_behaviour.

Monitoring Jobs

When submitting jobs to the batch system, one may wish to know:

The answers to these questions (or, at least, an indication of the answers) is provided by output from the qstat command.

With older versions of SGE (e.g., v6.0), qstat lists all jobs belonging to all users by default; on newer versions of SGE (e.g., v6.2) only a user's own jobs are listed — in this case, to list all jobs, use

  qstat -u "*"

N.B. The double-quotes around the asterisk.

Example output:

  prompt> qstat -u "*"

  job-ID  prior  name        user     state submit/start at      queue              slots
  ---------------------------------------------------------------------------------------
  1807  0.50500  selfeats_a  mcaieas2   r   03/24/2009 22:28:53  serial.q@comp007       1
  1808  0.50500  selfeats_4  mcaieas2   r   03/24/2009 22:35:08  serial.q@comp011       1
  1873  0.50500  runEpi.sh   mcbicjb2   r   03/26/2009 14:52:30  serial.q@comp008       1
  1945  0.50500  STDIN       mbexfbt2   r   03/28/2009 18:31:00  serial.q@comp004       1
  2405  0.50500  new.txt     mjkieak3   r   04/12/2009 18:51:46  serialfat.q@comp002    1
  2441  0.52500  fluent      mbgbfjz2   r   04/14/2009 17:38:18  parallel.q@comp022     4
  2456  0.52500  fluent      mbgbfjz2   r   04/15/2009 08:28:03  parallel.q@comp014     4
  2460  0.50500  runmatn1r1  mchifja2   r   04/15/2009 13:07:25  serial.q@comp011       1
  2469  0.50500  STDIN       mbexfbt2   r   04/15/2009 17:46:10  serial.q@comp013       1
  2481  0.55167  fluent      mcjiych3   r   04/15/2009 22:50:55  parallel-16.q@punch    8
  2483  0.55167  fluent      mcjiych3   r   04/15/2009 22:56:40  parallel-16.q@punch    8
  2512  0.60500  ch_flow_le  mbgnfsr2   r   04/16/2009 17:06:10  parallel.q@comp020    16
  2513  0.60500  ch_flow_le  mbgnfsr2   r   04/16/2009 17:17:10  parallel.q@comp019    16
  2536  0.50500  ge.qsub     msysssp4   r   04/17/2009 11:15:08  serial.q@comp009       1
  2514  0.60500  ch_flow_le  mbgnfsr2   qw  04/16/2009 17:18:34                        16
  2515  0.60500  ch_flow_le  mbgnfsr2   qw  04/16/2009 17:21:49                        16

This shows:

N.B. For MPI-based jobs, the host listed in the queue column is that on which the Rank 0 process is running; other processes associated with this job may be running on other hosts — to get a complete listing of activity in each slot on each host, use qstat -u "*" -f. Example output:

  prompt> qstat -u "*" -f

  queuename                      qtype resv/used/tot. load_avg arch          states
  ---------------------------------------------------------------------------------
  parallel-16.q@judy             BIP   0/8/16         6.92     lx24-amd64    
     2632 0.58278 fluent     mbgnfae2     r     04/22/2009 22:48:39     8        
  ---------------------------------------------------------------------------------
  parallel-16.q@punch            BIP   0/14/16        11.77    lx24-amd64    
     2585 0.60500 fluent     mbgnfae2     r     04/20/2009 12:52:56    10        
     2622 0.53833 fluent     mbgbfjz2     r     04/22/2009 09:46:46     4        
  ---------------------------------------------------------------------------------
  parallel.q@comp014             BIP   0/4/4          2.99     lx24-amd64    
     2595 0.58278 pk9_PoF_ke mcji2ak3     r     04/21/2009 12:06:11     4        
  ---------------------------------------------------------------------------------
  parallel.q@comp015             BIP   0/0/4          0.16     lx24-amd64    
  ---------------------------------------------------------------------------------
  parallel.q@comp016             BIP   0/0/4          0.19     lx24-amd64    
  ---------------------------------------------------------------------------------
  parallel.q@comp017             BIP   0/4/4          4.15     lx24-amd64    
     2625 0.53833 fluent     mbgbfjz2     r     04/22/2009 16:57:46     4        
  .                          .
  .                          .

Deleting Jobs

On occasion it may be necessary to remove a waiting job from the queue, or to stop a running job (and remove it from the queue). This is done using qdel <job-ID>. For example

  qdel 1807 

deletes the first job in the first "qstat" listing above.

A Typical Serial Job

The following qsub script is intended be a "real world" example. It includes commands which will give an estimate of the job length, and also the name of the host on which it ran. Further explanation is given in the script's comments.

  #!/bin/bash

  #
  # == Set SGE options:
  #

  #$ -S /bin/bash
  #$ -cwd
  #$ -q serial.q
      #
      # -- ensure BASH is used, no matter what the queue config;
      # -- run the job in the current-working-directory;
      # -- submit the job to the "serial.q" queue;
      #

  #
  # == The job itself :
  #
 
  echo " "
  echo "Job ran on and started at: "
  /bin/hostname
  /bin/date
  echo " "

  export PATH=/bin:/usr/bin:/software/matlab/bin
      #
      # ...ensure matlab is on our PATH...
      #

  matlab < myinput.m
      #
      # ...run matlab with input file "myinput.m" --- N.B. "myinput.m" must
      #    be in the current-working-directory...

  echo " "
  echo "Job finished at: "
  /bin/date

See also