Submitting Jobs to Condor on Mace01
This page describes how to use Condor on the Mace01 Condor pool. It is not a general introduction to Condor.
Mace01 is primarily a traditional HPC Linux cluster; users submit jobs via Sun Grid Engine (SGE). However, the system also runs Condor which is used to "backfill" the system with (possibly large numbers of) relatively short jobs.
The pool consists of compute nodes from the cluster and also some desktop machines from the George Begg teaching cluster. However, there is only one submit node, gbcondor.mace.manchester.ac.uk. Owing to the network "topology" of this pool, it is not possible for users' desktop machines to become submit nodes. (More accurately, it requires significant network routing changes and the use of an IP tunnel.)
Getting an Account and Logging In
To get access to the Condor pool on the Mace01 cluster, please email firstname.lastname@example.org. To login to gbcondor.mace.manchester.ac.uk, use SSH and the username and password you have been given:
prompt> ssh <username>@gbcondor.mace.manchester.ac.uk
Password:    # ...enter the password you have been given...
If this documentation does not prove sufficient, additional help may be sought via the following:
For a simple introduction to Condor usage, visit Astrophysics at the University of Victoria, Canada.
The Condor v6.8 Manual at the University of Wisconsin — but be warned, this is not always easy to follow!
Setting Up Your Environment
To use Condor on Mace01, you must set the CONDOR_CONFIG environment variable to point to the main configuration file:
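The exact location of the configuration file depends on the installation, so check with the administrators; the following is a sketch assuming the file lives under the same /opt/condor-7.4.2 prefix used for the bin directory below:

```shell
# Path is an assumption -- confirm the actual location with the
# Mace01 administrators; it is shown here matching the
# /opt/condor-7.4.2 prefix used for the bin directory below.
export CONDOR_CONFIG=/opt/condor-7.4.2/etc/condor_config
echo "$CONDOR_CONFIG"
```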
and also include Condor's bin directory in your PATH, for example
export PATH=$PATH:/opt/condor-7.4.2/bin    # ...assuming you are using BASH...
Rather than entering these commands each time you log in to the system, it makes sense to place them in your .bash_profile.
Example Job Submission: The Vanilla Universe
The following example works with the vanilla universe on Mace01 — here, a.out is a statically linked executable:
requirements = ((Arch == "INTEL") || (Arch == "X86_64"))
executable   = a.out
universe     = vanilla
output       = a.out.$(Process).out
error        = a.out.$(Process).err
log          = a.out.log
transfer_files = always
queue 10
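Assuming the submit description above is saved in a file called, say, a.out.submit (the filename here is only illustrative), it is submitted and monitored with the standard Condor commands:

```
prompt> condor_submit a.out.submit
prompt> condor_q
```

The "queue 10" line queues ten jobs, with the $(Process) macro expanding to 0 through 9, so each job writes its own output and error files.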
Filesystems and File Transfer under Condor on Mace01
Because the pool includes hosts which are not part of the Mace01 cluster, Condor is configured not to take advantage of the shared filesystem which exists across Mace01, i.e., we have set
USE_NFS = FALSE
By default, the vanilla universe assumes that submit and execute nodes share a filesystem, i.e., that the executable and input files exist on both the submit node and the compute node, so they do not need to be transferred at the start of the job, and that output (and error) files do not need to be transferred at the end of the job. Since we have USE_NFS = FALSE, vanilla universe jobs on Mace01 must set the transfer_files option in the submit file, as in the example above.
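If a job also reads input files, these must be transferred to the execute node too; they can be listed explicitly in the submit file (a sketch — input.dat is a hypothetical filename):

```
transfer_input_files = input.dat
```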
Job Submission: The Standard Universe
The following example works with the standard universe on Mace01 — here, loop.remote is an executable which has been linked against the Condor libraries using the condor_compile command:
requirements = ((Arch == "INTEL") || (Arch == "X86_64"))
log          = loop.log
executable   = loop.remote
output       = loop.$(Process).out
error        = loop.$(Process).err
arguments    = 200
queue 2

arguments    = 500
output       = loop.last.out
error        = loop.last.err
queue
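To produce an executable such as loop.remote, relink the program using condor_compile, which wraps the usual compile/link command. A sketch, assuming a C source file loop.c (a hypothetical filename) and the GNU compiler:

```
prompt> condor_compile gcc -o loop.remote loop.c
```

Note how the submit description above queues three jobs in total: two with arguments = 200, writing loop.0.out and loop.1.out, and then one more with arguments = 500, whose output and error settings have been overridden to loop.last.out and loop.last.err.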