Computational Science Community Wiki

Revision 28 as of 2012-05-28 20:11:33 · Editor: MichaelBane

University of Manchester GPU FAQ

[Image: NVIDIA CUDA Research Centre]

  • Software for GPUs, including compilers/directives, maths libraries & tools (debuggers and profilers)

Please log in and add your own questions or solutions.

1. Hardware

  • Q: how do I know if my graphics card supports OpenCL?
    1. A: http://www.streamcomputing.eu/blog/2011-12-29/opencl-hardware-support/
  • Q: how do I know if my CPU supports OpenCL?
    1. A: http://www.streamcomputing.eu/blog/2011-12-29/opencl-hardware-support/

2. Performance Issues

  • Q: how do I get maximum performance from NVIDIA cards?
    1. A: use pinned (page-locked) host memory and asynchronous transfers, so that data movement can overlap computation
  • Q: how do I profile my GPU codes?
    1. use the profiling tools supplied with the CUDA SDK
    2. add calls to clock(), but beware asynchronicity: kernel launches return before the GPU has finished, so synchronise before reading the timer
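Both answers above can be illustrated in one short CUDA sketch: pinned host memory, an asynchronous copy, and event-based timing that synchronises before the timer is read. The kernel, names and sizes are illustrative, not from the SDK:

```cuda
// Sketch only: time a pinned-memory async copy plus a kernel with CUDA events.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));  /* pinned host memory */
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    /* cudaMemcpyAsync is only truly asynchronous from pinned memory */
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, 0);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   /* launches are asynchronous: wait first */

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("copy + kernel: %f ms\n", ms);

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```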

  • Q: how do I determine the amount of memory actually used by the GPU?
    1. the CUDA Driver API has cuMemGetInfo(size_t *free, size_t *total), callable from a runtime-API program once a CUDA context has been established
    2. the CLI command cuda-memcheck (for debugging memory accesses)
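For runtime-API codes, the equivalent call is cudaMemGetInfo; a minimal sketch (the cudaFree(0) is one common way of forcing context creation first):

```cuda
/* Sketch: query free/total device memory via the runtime API. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    size_t free_b, total_b;
    cudaFree(0);                       /* force CUDA context creation */
    cudaMemGetInfo(&free_b, &total_b);
    printf("free: %zu MB, total: %zu MB\n", free_b >> 20, total_b >> 20);
    return 0;
}
```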

  • Q: how do I debug my GPU codes?
    1. CUDA SDK has cuda-gdb
  • Q: how can I tell if my NVIDIA card is running in exclusive mode or not?
    1. A: load the SDK and run deviceQuery and examine the computeMode output

  • Q: what does the computeMode output mean on NVIDIA cards?

    1. CUDA 4.0 has 4 compute modes
      • Default: Multiple host threads can use the device
      • Exclusive-process: Only one CUDA context may be created on the device across all processes in the system and that context may be current to as many threads as desired within the process that created that context.
      • Exclusive-process-and-thread: Only one CUDA context may be created on the device across all processes in the system and that context may only be current to one thread at a time.
      • Prohibited: No CUDA context can be created on the device.
    2. CUDA 3 has 3 compute modes
      • Default - same as CUDA 4.0
      • Exclusive - only one host thread to use device at any given time.
      • Prohibited - same as CUDA 4.0
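A sketch of what deviceQuery reports for computeMode, read directly through the runtime API (enum names as in CUDA 4.0):

```cuda
/* Sketch: read the compute mode of device 0 from cudaDeviceProp. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* device 0 */
    switch (prop.computeMode) {
        case cudaComputeModeDefault:          puts("Default");           break;
        case cudaComputeModeExclusive:        puts("Exclusive");         break;
        case cudaComputeModeProhibited:       puts("Prohibited");        break;
        case cudaComputeModeExclusiveProcess: puts("Exclusive Process"); break;
    }
    return 0;
}
```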
  • Q: how can I (and only me) run on a GPU card?
    1. There are two ways of setting this up on NVIDIA cards
      1. set your card to be in exclusive mode (see above); OR

      2. grab a whole node (using, e.g., qsub -pe mpich 4 if there are 4 slots per node) AND explicitly select each GPU (e.g. from your CUDA or OpenCL context) if using more than one card; otherwise the kernels all run on the same card

    2. The HECToR GPU testbed consists of 4 nodes, each with multiple GPUs (see here). It is therefore important to consider how programs running on the same node share these GPUs, which is controlled by the compute modes.

    3. For more information see section 3.6 in the NVIDIA CUDA C Programming Guide

    4. On the HECToR GPU Testbed all NVIDIA GPUs are now in Exclusive Process mode (many threads in one process are able to use cudaSetDevice() with this device) - checked on 7 July 2011
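A minimal sketch of the explicit selection in option 2 above, looping over the devices on a node with cudaSetDevice (a single host thread driving all devices in turn, purely for illustration):

```cuda
/* Sketch: explicitly select each GPU on a multi-GPU node so kernels
   don't all land on device 0. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);             /* subsequent calls target device d */
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s\n", d, prop.name);
        /* ... launch this device's share of the work here ... */
    }
    return 0;
}
```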

  • Q: What about exclusive use of AMD cards?
    1. A: they don't have different compute modes

3. Using GPUs with other programming languages

3.1. FORTRAN

  • Q: How do I call OpenCL and CUDA from Fortran code?
    1. No native Fortran OpenCL binding exists, so you can call C functions from your Fortran code which in turn call OpenCL functions. CUDA provides some Fortran support, but the same technique of using a C layer is also useful. See this skeleton code (/FortranGPU) and overview (attachment: FortranOpenCL/CallingOpenCLfromFortran.ppt)
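A sketch of what the C layer might look like on the CUDA side; the wrapper name saxpy_gpu and its interface are illustrative, not taken from the linked skeleton code. Fortran passes arguments by reference, hence the pointer parameters; bind to it from Fortran via ISO_C_BINDING (or your compiler's naming convention):

```cuda
/* Sketch: a C wrapper around a CUDA kernel, callable from Fortran. */
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

/* extern "C" prevents C++ name mangling so Fortran can find the symbol */
extern "C" void saxpy_gpu(int *n, float *a, float *x, float *y) {
    float *dx, *dy;
    size_t bytes = *n * sizeof(float);
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);
    saxpy<<<(*n + 255) / 256, 256>>>(*n, *a, dx, dy);
    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}
```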

3.2. MPI

  • Also see Implementing Sparse Linear Solvers on Multiple GPUs (attachment: ../news/ge199-report.pdf)

4. Techniques

4.1. Numerical Libraries

  • Implementing Sparse Linear Solvers on Multiple GPUs (attachment: ../news/ge199-report.pdf)

5. Accessing GPU Resources

  • Q: how do I use the GPU testbed on HECToR?
    1. A: see Accessing and Running; contact us <its-research@manchester.ac.uk> if you'd like trial access

  • Q: how do I run on a specific GPU card on a specific node (eg on HECToR)?
    1. Under SGE you can use qsub -q <queue>@<node>. So on HECToR, to use the gpu3 node with the C2070s, you would submit to -q tesla@gpu3 (the C2070s are the first two devices, if unused)

  • Q: how do I use MPI and GPUs in the same computer program?
    1. If you have an existing MPI program, this is straightforward. Adapt the computational kernel for GPU acceleration as normal. Assuming the host MPI program decomposes the problem, the accelerated kernel should work on local data only for each MPI process. A CUDA MPI+GPU example can be seen in ParaFEM program program xx3.f90

    2. If your GPU accelerated program is sequential, you will need to parallelise it using MPI. Contact us <its-research@manchester.ac.uk> for further guidance.
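A minimal sketch of point 1 above, assuming one MPI process per GPU; the rank % ndev mapping assumes ranks are packed onto nodes and should be adapted to your launcher:

```cuda
/* Sketch: each MPI rank binds to one GPU before doing any CUDA work. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);   /* bind this rank to one GPU */
    printf("rank %d using device %d\n", rank, rank % ndev);
    /* ... decompose the problem as usual; each rank's accelerated
       kernel works on its local data only ... */
    MPI_Finalize();
    return 0;
}
```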