Sun Grid Engine — A Batch System
What is a batch system?
Any HPC (or HTC) system, usually a cluster of machines, needs a means of sharing computational resources fairly between users; without one, there would be anarchy. Batch-queueing systems, usually abbreviated to batch systems, are intended to do this.
All batch systems have at least these features:
- a scheduler for allocating resources (CPUs!) to jobs and for prioritising jobs;
- one or more queues to which jobs are submitted; each queue might be configured for a particular type of job, for example serial or parallel jobs, long or short jobs, or jobs requiring particularly high memory.
To use a batch system, each computational job must be defined: the program to run is specified, with any input and output files, arguments, and the required environment (e.g., working directory, PATH and perhaps LD_LIBRARY_PATH). The defined job is then submitted to the batch system, usually to a specified queue.
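As a rough illustration of what "defining a job" can look like under SGE, here is a minimal sketch of a job script. It is not taken from any RCS cluster's documentation: the job name, queue name, directories and file names are all hypothetical, and sort merely stands in for a real application. Lines beginning with #$ are read by SGE's qsub as submission options; to the shell they are ordinary comments, so the script also runs on its own.

```shell
#!/bin/bash
# Hypothetical SGE job script, e.g. saved as myjob.sh.

# Interpret the job script with bash.
#$ -S /bin/bash
# Job name (hypothetical).
#$ -N example_job
# Run the job in the directory from which it was submitted.
#$ -cwd
# Files for standard output and standard error (hypothetical names).
#$ -o example_job.out
#$ -e example_job.err
# Queue to submit to (hypothetical name; queue names are site-specific).
#$ -q serial.q

# Required environment: the directories below are assumptions.
export PATH="$HOME/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/lib:${LD_LIBRARY_PATH:-}"

# Stand-in input file, created here only so that the sketch runs anywhere.
printf '3\n1\n2\n' > example_input.txt

# The program to run, with its arguments, input file and output file.
sort -n example_input.txt > example_output.txt
```

A script like this would typically be submitted with qsub myjob.sh and its progress checked with qstat; which queues exist, and which options are required, depends on the cluster.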
Traditionally, jobs must be run non-interactively, i.e., in batch mode, though the ability to run queued, interactive jobs may be provided. (Interactive jobs may be run on RCS clusters only where specified in the user documentation.)
On RCS-run HPC clusters, all computationally-intensive activity should be run under the documented batch-queueing system — any computationally-intensive process running outside of the batch system will be killed by the system administrator.
Many batch systems are available, including NQS, LSF, Sun Grid Engine (SGE) and PBS (and, for HTC, Condor). SGE is the batch system used on Mace01, Man2e and Redqueen.
Sun Grid Engine is so-named because it is the engine of Sun Grid, an on-demand grid computing service operated by Sun Microsystems. However, SGE is otherwise (in the opinion of the author) a misnomer — SGE is a batch-queueing system; it is not grid middleware.
SGE is an open-source community effort sponsored by Sun Microsystems and hosted by CollabNet at gridengine.sunsource.net; it is free (as in beer) to download and use. Some features of SGE are introduced below.
Scheduler, Queues and Slots
SGE includes both a scheduler for allocating resources (CPUs!) to computational jobs and a queueing mechanism. Each queue is associated with a number of slots: one computational process runs in each slot; each compute node in the HPC cluster provides one or more slots.
For most parallel jobs, including those using OpenMP and MPI (e.g., Open MPI or MPICH), and parallel programs such as Fluent and Star-CD, an SGE Parallel Environment (PE) must be specified. The PE acts as glue, ensuring that SGE and the parallel, i.e., multi-process, program play nicely together.
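The following sketch shows how a Parallel Environment might be requested in a job script. The PE name mpi.pe and the program name are hypothetical, since PE names are defined by each site's administrators. SGE sets the NSLOTS variable to the number of slots granted to the job; the default of 1 below is an addition so that the sketch also runs outside SGE.

```shell
#!/bin/bash
# Hypothetical SGE job script for a parallel job.

#$ -S /bin/bash
#$ -cwd
# Request a Parallel Environment and four slots.
# "mpi.pe" is a hypothetical PE name; real names are site-specific.
#$ -pe mpi.pe 4

# SGE sets NSLOTS to the number of slots granted to the job.
# Fall back to 1 so that the sketch also runs outside the batch system.
NSLOTS="${NSLOTS:-1}"

# OpenMP programs take their thread count from OMP_NUM_THREADS.
export OMP_NUM_THREADS="$NSLOTS"
echo "Running with $NSLOTS slot(s)"

# An MPI program would typically be started along these lines
# (hypothetical program name, left commented out):
# mpirun -np "$NSLOTS" ./my_mpi_program
```

Taking the process or thread count from NSLOTS, rather than hard-coding it, keeps the program in step with what the scheduler actually allocated.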