Computational Science Community Wiki

Useful information on Condor

Please feel free to add useful information to this page.

Further Help

Information can be found in the man pages or by passing the -h flag to the relevant command; e.g., type man condor_q or condor_q -h at the prompt for information about the condor_q command.

For a simple introduction to Condor usage, visit Astrophysics at the University of Victoria.

Or contact rcs@manchester.ac.uk for help and information.

Condor manuals

The Condor manuals are located here (be warned that they are not always easy to follow!). Man2 uses Condor Version 7.3; MACE uses Version 6.8.

Condor universes

Condor has various universes; the most commonly used are the Vanilla Universe and the Standard Universe. Condor jobs are generally given low priority, so that running jobs are stopped whenever a compute node is required for non-Condor use. In the Standard Universe a stopped job is checkpointed and resumes from where it left off when the next free node becomes available, but in the Vanilla Universe a stopped job is killed and then restarted from the beginning on the next free node. However, there are limitations on what can be run in the Standard Universe, and condor_compile must be used to create the executable.

Vanilla Universe

If you simply want to run your job unchanged from elsewhere, use the vanilla universe. However, system checkpointing is not supported: incomplete jobs cannot be paused and migrated to the next available compute node.

By default, the vanilla universe assumes that the submit and execute nodes share a filesystem, i.e., that executable and input files will not need to be transferred at the start of the job. If there is no shared filesystem, the submit script must use transfer_files to specify the type of file transfer required.

To use the Vanilla Universe, add universe = vanilla to your submit script (see the examples below).

Note: it is disadvantageous to run long vanilla jobs, since there is a risk that jobs are killed and restarted many times. This suggests, whenever possible, running lots of small jobs rather than one long one. However, there may be a latency associated with each job, e.g. the time to transfer data files and start a program, so it may be inefficient to make jobs too short. Some experimentation is therefore required to find an optimal job length.

Example Vanilla Universe submit scripts

Single job

  universe = vanilla

  # replace helloworld with the name of your executable:
  executable = helloworld

  transfer_files = always

  # set stdout, stderr and log file names:
  output  = hello.out
  error   = hello.err
  log     = hello.log

  queue

Submit this to the Condor queue using

condor_submit <scriptName>

Multiple jobs

It is very easy to run a large number of related jobs under Condor, each with different arguments, as one might require for a parameter sweep. The following submission script illustrates this:

  universe = vanilla

  executable = myprog

  output = myprog.$(Process).out
  error  = myprog.$(Process).err
  # The $(Process) string will be replaced with the job's process value
  # so output and error file names will look like:
  # myprog.0.out, myprog.1.out, ..., myprog.5.out and 
  # myprog.0.err, myprog.1.err, ..., myprog.5.err

  log = /tmp/simonh/myprog.log

  arguments = 2000
  queue 2

  arguments = 3000
  queue 3

  arguments = 5000
  queue 

Here myprog is run a total of six times: twice with a command-line argument of 2000, three times with 3000, and once with 5000 (a queue statement with no number queues a single job).

This example is contrived, but it shows that the arguments can be changed between queue statements, so one script can submit several groups of related jobs. If planning to submit hundreds of jobs via one submission script in this way, it may be more sensible to write a script which in turn writes the submission script.
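One simple approach is a small shell script that writes the submission script for you. A minimal sketch follows; the program name myprog, the file name sweep.sub and the parameter values are purely illustrative:

```shell
#!/bin/sh
# make_sweep.sh: write a Condor submission script for a simple parameter
# sweep, one job per parameter value (names and values are illustrative).
{
  echo "universe   = vanilla"
  echo "executable = myprog"
  echo "output     = myprog.\$(Process).out"
  echo "error      = myprog.\$(Process).err"
  echo "log        = myprog.log"
  for p in 1000 2000 3000 4000 5000; do
    echo "arguments  = $p"
    echo "queue"
  done
} > sweep.sub
```

Run it with sh make_sweep.sh and then submit the result with condor_submit sweep.sub.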

Other useful submit script commands

See Section 2.5 Submitting a Job of the Condor Manual for more information.

Standard Universe

This is the default universe - if you don't specify a universe, this is what you get.

This universe allows Condor system checkpointing if the code is compiled using the condor_compile command. However, various requirements must be met for checkpointing to work, including the choice of compiler; see Section 2.4.1, Choosing a Condor Universe, of the relevant Condor manual for details.

To use the Standard Universe with checkpointing, relink your program using condor_compile (see below).

You cannot (as of August 2007) use Fortran 90 on Linux unless you use the PGI compilers, for which the University does not have a site licence.
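Once the executable has been relinked with condor_compile, the submit script differs from the vanilla case mainly in the universe line. A minimal sketch (the executable and file names are illustrative):

```
  universe   = standard
  executable = hello
  output     = hello.out
  error      = hello.err
  log        = hello.log
  queue
```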

Using condor_compile

Our example code is the ubiquitous Hello World programme.

Compile using condor_compile with a suitable compiler; in this case we wrap the gcc compiler (see the manual for a list of compatible compilers). Ensure that any required libraries are statically linked in.

Condor commands

condor_status

Lists information about the compute nodes available to Condor.

condor_submit

Used to submit Condor jobs which are defined in a suitable script.

condor_q

Lists all jobs in the queue; useful for checking that jobs were successfully submitted and for identifying cluster and process IDs, e.g.
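Typical invocations look like this (the user name and job ID below are purely illustrative):

```shell
condor_q              # list every job in the local queue
condor_q simonh       # list only jobs belonging to user simonh
condor_q 12.3         # show the single job with cluster ID 12, process ID 3
```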

Condor Job ID

Each job has a unique ID consisting of two numbers separated by a decimal point. The number before the point is the cluster ID, which is shared by all jobs created by a single submission to condor_submit; the number after the point is the process ID, which is unique for each job within a cluster. For example, job ID 12.3 refers to process 3 of cluster 12.

condor_rm

Removes Condor jobs from the queue, e.g.
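For example (the IDs and user name below are illustrative):

```shell
condor_rm 12.3        # remove process 3 of cluster 12
condor_rm 12          # remove every job in cluster 12
condor_rm simonh      # remove all jobs belonging to user simonh
```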

ClassAds

The operators used in ClassAd expressions are similar to those in C; more information can be found here. Condor ClassAds can be used with various Condor commands to identify specific jobs or processors, or to select processors with specific properties, e.g.
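For instance, both condor_status and condor_q accept a -constraint option taking a ClassAd expression (the particular attribute values below are illustrative):

```shell
# Machines running Linux with at least 1 GB of memory:
condor_status -constraint 'OpSys == "LINUX" && Memory >= 1024'

# Your jobs that are still waiting to run (JobStatus 1 means idle):
condor_q -constraint 'JobStatus == 1'
```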