Computational Science Community Wiki

Submitting Jobs to Condor on Mace01

This Page

This page describes how to use Condor on the Mace01 Condor pool. It is not a general introduction to Condor.

Overview

Mace01 is primarily a traditional HPC Linux cluster; users submit jobs via Sun Grid Engine (SGE). However, the system also runs Condor, which is used to "backfill" the system with (possibly large numbers of) relatively short jobs.

The pool consists of compute nodes from the cluster and also some desktop machines from the George Begg teaching cluster. However, there is only one submit node, gbcondor.mace.manchester.ac.uk. Owing to the network "topology" of this pool, it is not possible for users' desktop machines to become submit nodes. (More accurately, it requires significant network routing changes and the use of an IP tunnel.)

Getting an Account and Logging In

To get access to the Condor pool on the Mace01 cluster, please email rcs@manchester.ac.uk. To log in to gbcondor.mace.manchester.ac.uk, use SSH with the username and password you have been given:

    prompt> ssh <username>@gbcondor.mace.manchester.ac.uk
    Password:
        # ...enter the password you have been given...

Further Help

Should this documentation not prove sufficient, additional help may be sought by emailing rcs@manchester.ac.uk.

Setting Up Your Environment

To use Condor on Mace01, you must set the CONDOR_CONFIG environment variable to point to the main configuration file:

    export CONDOR_CONFIG=/opt/condor-7.4.2/etc/condor_config

and also include Condor's bin directory in your PATH, for example

    export PATH=$PATH:/opt/condor-7.4.2/bin
        # ...assuming you are using BASH...

Rather than entering these commands each time you log in, it makes sense to place them in your ~/.bash_profile.
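For example (a minimal sketch, assuming the paths above and the BASH shell):

    # Condor setup on Mace01 -- appended to ~/.bash_profile
    export CONDOR_CONFIG=/opt/condor-7.4.2/etc/condor_config
    export PATH=$PATH:/opt/condor-7.4.2/bin

After the next login (or after running source ~/.bash_profile), condor_status should list the machines in the pool, confirming the setup works.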

Example Job Submission: The Vanilla Universe

The following example works with the vanilla universe on Mace01 — here, a.out is a statically linked executable:

  requirements = ((Arch == "INTEL") || (Arch == "X86_64"))

  executable   = a.out
  universe     = vanilla

  # $(Process) expands to each job's index within the cluster (0-9 here)
  output = a.out.$(Process).out
  error  = a.out.$(Process).err

  log    = a.out.log

  # file transfer must be enabled explicitly on Mace01 (see below)
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT

  queue 10
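Assuming the description above has been saved in a file, say a.submit (an illustrative name), the cluster of ten jobs can be submitted with condor_submit and monitored with condor_q:

    prompt> condor_submit a.submit
        # ...queues the ten jobs as a single cluster...
    prompt> condor_q
        # ...shows the state of your jobs in the queue...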

Filesystems and File Transfer under Condor on Mace01

Because the pool includes hosts which are not part of the Mace01 cluster, Condor is configured not to take advantage of the shared filesystem which exists across Mace01, i.e., we have set

  USE_NFS = FALSE

By default, the vanilla universe assumes that the submit and execute nodes share a filesystem: the executable and input files are expected to exist on both the submit node and the compute node, so they need not be transferred at the start of the job, and output (and error) files need not be transferred back at the end. Since we have USE_NFS = FALSE, vanilla universe jobs on Mace01 must enable Condor's file transfer mechanism explicitly by setting should_transfer_files in the submit file, as in the example above.
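Jobs which read input files or create output files can list these explicitly in the submit description. A minimal sketch, in which params.dat and results.dat are hypothetical file names:

  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT

  transfer_input_files    = params.dat
  # by default every file created or modified in the job's scratch
  # directory is transferred back; naming files restricts this:
  transfer_output_files   = results.dat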

Job Submission: The Standard Universe

The following example works with the standard universe on Mace01 — here, loop.remote is an executable which has been linked against the Condor libraries using the condor_compile command:

  requirements = ((Arch == "INTEL") || (Arch == "X86_64"))

  universe    = standard

  log         = loop.log
  executable  = loop.remote

  output      = loop.$(Process).out
  error       = loop.$(Process).err

  arguments   = 200

  # submit two jobs with the settings above...
  queue 2

  # ...then one further job with different arguments and output files
  arguments   = 500
  output      = loop.last.out
  error       = loop.last.err
  queue
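For reference, loop.remote might be produced by relinking an ordinary program with condor_compile, which wraps the usual compile/link command; a sketch assuming a C source file named loop.c (an illustrative name):

    prompt> condor_compile gcc -o loop.remote loop.c
        # ...links against the Condor libraries so the job can
        # checkpoint and migrate...

Note that no file transfer settings are needed here: standard universe jobs perform their I/O on the submit node via remote system calls.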