GPU Club: 02 March 2012

  • Software for GPUs, including compilers/directives, maths libraries and tools (debuggers and profilers)

About 30 people attended. The following presentations were given and discussed:

After the presentations, the following points were discussed and actions agreed.

On the presentations:

  1. Is the order in which optimizations are applied to CUDA code (e.g. removing branching/warp divergence, using a wide block size) important? Do the optimizations interact with each other differently depending on the order? It was reported that in Mark Mawson's code the order was not important; there was no preferred order.
  2. Can the Nsight debugger be used to find a bug deep within the code? It can, but you would have to narrow the bug down to a region of code and then use breakpoints to step through that section to pinpoint it.
  3. Can performance logs/files be exported from Linux and examined in Nsight on Windows? No. The Visual Profiler runs on Linux, but its files cannot be exported for viewing in Nsight, and Nsight does not run on Linux.
  4. Does the Nsight debugger work with multiple GPUs? Yes, you get a list of contexts to select from. The debugging session then works in the selected context.
  5. Where the MATLAB GPU toolbox was used, what precision was it using? It was reported to be double precision.
  6. Is the GPU toolbox available in the latest version of MATLAB? Yes.

On community involvement, there was general interest in having a presentation from CAPS on HMPP and in both OpenCL & CUDA training. Other specific interests raised included:

  1. How GPU architectures are changing and which products are coming through (e.g. a general news update at each Club meeting).
  2. Updates on resources available to researchers: GPU Club Resources page.
  3. The difference between a low-level CUDA approach and the directives approach.
  4. CUDA codes that process large volumes of input data but reduce them to small amounts of output data.
  5. Talks covering the specifics of kernel optimization, with the Community offering example codes (with before/after timings) on the wiki, e.g.:
    1. Tips (e.g. avoiding branching) and experiences (e.g. solving the memory bandwidth problem).
    2. Advanced memory access patterns.
    3. Methods of using more memory than is on board a GPU (e.g. 6 GB on Fermi).
    4. Indirect addressing methods.
  6. Information about the Parallel Computing Toolbox (PCT) in MATLAB on GPUs.
  7. OpenGL/OpenCL integration examples.

Future GPU Club meetings will address these, and the Community is invited to contribute to the GPU FAQ. Please email if you require an account.

