Computational Science Community Wiki

More on Sun Grid Engine: Parallel Work

Parallel Jobs and SGE Parallel Environments

When a parallel job starts running under any batch system, the batch system must be able to dictate not only the compute node on which the job starts, but also the nodes on which further processes are started and the number of processes on each node; i.e., the job must be tightly bound to the batch system (SGE calls this tight integration). Within SGE, this mechanism is part of the parallel environment (PE). The PE may also set up other parts of the environment that the parallel software requires.

Thus, any qsub script defining a parallel job must specify not only an SGE queue but also an SGE PE.
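
To see what the PE has handed to a job, the job script can inspect the variables SGE sets for parallel jobs. The following is a minimal sketch, assuming the standard SGE variables PE, NSLOTS, NHOSTS and PE_HOSTFILE are set inside the job; summarise_hostfile is a hypothetical helper name, not part of SGE.

```shell
#!/bin/bash
# Sketch only: report the hosts and slots the PE has allocated to this job.
# summarise_hostfile is a hypothetical helper, not an SGE command.

summarise_hostfile() {
    # Each line of the PE hostfile reads:
    #   <host> <slots> <queue> <processor-range>
    awk '{ printf "%s: %d slot(s)\n", $1, $2; total += $2 }
         END { printf "total: %d slot(s)\n", total }' "$1"
}

# Only run the report when we really are inside a parallel SGE job.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "PE=$PE NSLOTS=$NSLOTS NHOSTS=$NHOSTS"
    summarise_hostfile "$PE_HOSTFILE"
fi
```

Placed near the top of a qsub script, this writes the allocation to the job's output file, which can help when checking that a job received the nodes and slot counts expected.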

Example: OpenMPI under SGE

This is a qsub script suitable for running an OpenMPI job on Redqueen:

  #!/bin/bash

  #$ -q parallel.q
  #$ -pe orte.pe 16
      #
      # ...specify the SGE "parallel.q" queue and **also** the "orte.pe" PE...
      #

  #$ -cwd
  #$ -S /bin/bash

  export PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/openmpi-1.3-gcc-gfortran/bin
  export LD_LIBRARY_PATH=/usr/local/openmpi-1.3-gcc-gfortran/lib
      #
      # ...ensure the OpenMPI-related executables and libraries can be found by
      #    the job...
      #

  mpirun -np $NSLOTS ./mynameis
      #
      # ...start the job!  "$NSLOTS" is the number of cores/processes/slots the
      #    job will use --- the value is set by the orte.pe PE.  
      #
      #    *** N.B. We have not specified a host/machinefile here --- OpenMPI picks
      #             this up automagically from the PE. ****
      #

We have specified the orte.pe PE, which works with OpenMPI, and requested a 16-process job (-pe orte.pe 16). The PE sets $NSLOTS and supplies the list of hosts, which OpenMPI picks up automatically.

N.B. MPICH has no such automagic; the host/machine file must be specified.
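
As an illustration of specifying the machine file by hand, the sketch below builds one from SGE's PE_HOSTFILE by repeating each host name once per allocated slot. It assumes that the MPICH mpirun in use accepts a -machinefile option (check its man page) and that the PE has not already written a machine file for the job; hostfile_to_machinefile and the machines.$JOB_ID file name are hypothetical.

```shell
#!/bin/bash
# Sketch only: turn SGE's PE hostfile into an MPICH-style machine file.
# hostfile_to_machinefile is a hypothetical helper, not part of SGE or MPICH.

hostfile_to_machinefile() {
    # PE hostfile lines read "<host> <slots> <queue> <processor-range>";
    # print each host name once per slot granted on that host.
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Only launch when we really are inside a parallel SGE job.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    machines="${TMPDIR:-/tmp}/machines.$JOB_ID"
    hostfile_to_machinefile "$PE_HOSTFILE" > "$machines"
    # Assumption: this mpirun understands -machinefile.
    mpirun -np "$NSLOTS" -machinefile "$machines" ./mynameis
fi
```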

More About SGE PEs

Each parallel application needs its own dedicated SGE PE. For example, MPICH, OpenMPI, OpenMP, Star-CD and Fluent each have their own on Man2e, Mace01 and Redqueen.

To determine which PEs are available and to find out details of any particular PE, use qconf. qconf -spl lists PEs:

  prompt> qconf -spl

  fluent-16.pe
  fluent.pe
  openmp.pe
  orte.pe
  starcd.pe

while qconf -sp <pe-name> lists details:

  prompt> qconf -sp fluent.pe

  pe_name            fluent.pe
  .                  .
  .                  .
  start_proc_args    /software/Fluent.Inc/setup_env
  stop_proc_args     /software/Fluent.Inc/addons/sge1.0/kill-fluent
  allocation_rule    $fill_up
  .                  .
  .                  .
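
The two qconf commands can be combined to compare one setting across every PE. The following is a rough sketch that prints the allocation_rule of each PE; pe_field is a hypothetical helper, and the loop only runs where qconf is on the PATH.

```shell
#!/bin/bash
# Sketch only: show one field of every PE known to SGE.
# pe_field is a hypothetical helper, not an SGE command.

pe_field() {
    # $1 = field name; reads "qconf -sp <pe-name>" output on stdin and
    # prints the value that follows the field name.
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^ +/, ""); print }'
}

if command -v qconf >/dev/null 2>&1; then
    for pe in $(qconf -spl); do
        printf '%s\t%s\n' "$pe" "$(qconf -sp "$pe" | pe_field allocation_rule)"
    done
fi
```

Any other field name shown by qconf -sp, such as start_proc_args, can be passed to pe_field in the same way.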

Example Qsub Script: Star-CD on Man2

  #$ -S /bin/bash 
  #$ -cwd

  #$ -q parallel.q
  #$ -pe starcd.pe 8

  export LM_LICENSE_FILE=1999@130.88.124.202

  STARINI=Default; export STARINI
  . /software/starcd_402_001_lam/etc/setstar

  echo " "
  echo "Command line is:"
  echo "star -dp $PNP_JOBNODES"
  star -dp $PNP_JOBNODES
  status=$?
      # ...save star's exit status; after the echo below, "$?" would be the echo's...
  echo " "

  exit_on_error $status

or

  #$ -S /bin/bash 
  #$ -cwd

  #$ -q parallel.q
  #$ -pe starcd.pe 8

  export LM_LICENSE_FILE=1999@130.88.124.202

  STARINI=Default; export STARINI
  . /software/starcd_402_001_lam/etc/setstar

  #
  # -- use starcd.pe-generated machinefile (from SGE's PE_HOSTFILE) :
  #
  echo " "
  echo "Command line is:"
  echo "star -dp -nodefile=$MACHINEFILE"
  star -dp -nodefile=$MACHINEFILE
  status=$?
      # ...save star's exit status; after the echo below, "$?" would be the echo's...
  echo " "

  exit_on_error $status

Example Qsub Script: OpenMPI on Redqueen

    #!/bin/bash

    #$ -pe orte.pe 16
    #$ -q parallel.q
        # ...or "parallel-fat.q"...
    #$ -cwd
    #$ -S /bin/bash

    export PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/openmpi-1.3-gcc-gfortran/bin
    export LD_LIBRARY_PATH=/usr/local/openmpi-1.3-gcc-gfortran/lib

    mpirun -np $NSLOTS ./mynameis