Computational Science Community Wiki

amber11 on nVidia GPGPUs

I attended the morning of a "bio-workbench" workshop held at UoM on 16/7/2010. The event was advertised as talks by Ian Gould (UCL) and Ross Walker (SDSC, formerly IG's postdoc). Ross was unable to attend (he was moving house).

The morning talks focussed exclusively on the amber software package (for MD calculations) and on the CUDA port of the pmemd routines. They covered and extended the online discussion (pmemd is, in general, a fast implementation of a subset of the sander/sander.mpi functionality).

amber11 (IanG)

I took few notes about amber itself, since my interest was in the CUDA port.

Q & A:

GPGPUs (IanG)

Ian described the comparative performance of pmemd.cuda (the CUDA port of pmemd).

accuracy

Much of this is covered online. Papers are about to be published (the author list begins: Scott Le Grand, Andreas W. Goetz, ...).

Timings

I believe all the following results and timings are SPDP (mixed single/double precision) for pmemd.cuda but full DP for pmemd (compiled with gfortran, since "Intel kept breaking the code").
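As a rough, hypothetical illustration of the SPDP idea (this is not the pmemd.cuda code; the kernel name and the Lennard-Jones pair term are my own assumptions), the sketch below evaluates pairwise terms in single precision but keeps the per-atom force accumulators in double precision:

  // spdp_sketch.cu -- illustrative only, not the pmemd.cuda implementation.
  // SPDP: evaluate pairwise terms in single precision (fast on the GPU),
  // accumulate per-atom forces in double precision (to limit rounding error).
  #include <cuda_runtime.h>

  __global__ void lj_forces_spdp(const float4 *pos,   // x, y, z, charge (SP)
                                 double3 *force,      // accumulated force (DP)
                                 int n, float cut2)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;

      float4 pi = pos[i];
      double fx = 0.0, fy = 0.0, fz = 0.0;            // DP accumulators

      for (int j = 0; j < n; ++j) {
          if (j == i) continue;
          float dx = pi.x - pos[j].x;
          float dy = pi.y - pos[j].y;
          float dz = pi.z - pos[j].z;
          float r2 = dx*dx + dy*dy + dz*dz;
          if (r2 > cut2) continue;
          float inv2 = 1.0f / r2;                     // SP arithmetic
          float inv6 = inv2 * inv2 * inv2;
          float f    = (48.0f * inv6 * inv6 - 24.0f * inv6) * inv2;  // LJ 12-6
          fx += (double)(f * dx);                     // DP accumulation
          fy += (double)(f * dy);
          fz += (double)(f * dz);
      }
      force[i] = make_double3(fx, fy, fz);
  }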

Example 1

(check which type of MD sim)

Cases: small, medium, and large; the large case has 314M unique interactions (small and medium sizes not noted).

Simulation rates (ns/day) by hardware:

  hardware (cost)   | small case  | medium case | large case
  1060              | 254         | 28          | 0.52
  2050 (£2K)        | 268         | 50          | 1.04
  2x E5462 (£45K)   | 114 (pmemd) | 4           | 0.06

Example 2

PME: 1K to 1M atoms; uses an FFT for the long-range (reciprocal-space) term.

  1. atomic ops are too slow and shared memory is too small, so divide and conquer (see the tiling sketch after this list)

  2. use MPI for multiple GPUs: transferring between GPU memories is very expensive; some parts are still serial; there are severe limits on performance, and tuning is ongoing (see the MPI sketch below, after the hardware description)
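A minimal sketch of the divide-and-conquer tiling mentioned in point 1, assuming a simplified placeholder pair term and a hypothetical kernel name: each block stages a tile of atoms in shared memory and each thread accumulates its own atom's force privately, so no global atomic operations are needed.

  // tiling_sketch.cu -- illustrative divide-and-conquer tiling, not pmemd.cuda.
  // Launch with TILE threads per block. Each block stages TILE atoms in shared
  // memory; each thread owns atom i and accumulates its force privately, so no
  // global atomic operations are needed.
  #include <cuda_runtime.h>

  #define TILE 128

  __global__ void forces_tiled(const float4 *pos, float3 *force, int n, float cut2)
  {
      __shared__ float4 tile[TILE];
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
      float fx = 0.f, fy = 0.f, fz = 0.f;

      for (int base = 0; base < n; base += TILE) {     // loop over atom tiles
          int j = base + threadIdx.x;
          tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
          __syncthreads();

          int limit = min(TILE, n - base);
          for (int k = 0; k < limit; ++k) {
              if (i >= n || base + k == i) continue;
              float dx = pi.x - tile[k].x;
              float dy = pi.y - tile[k].y;
              float dz = pi.z - tile[k].z;
              float r2 = dx*dx + dy*dy + dz*dz;
              if (r2 > cut2 || r2 == 0.f) continue;
              float f = 1.0f / (r2 * r2);              // placeholder pair term
              fx += f * dx;  fy += f * dy;  fz += f * dz;
          }
          __syncthreads();                             // before next tile load
      }
      if (i < n) force[i] = make_float3(fx, fy, fz);
  }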

hardware: 1, 2, 4 or 8 2050s connected with InfiniBand; 256 cores of NICS Kraken, XT3 (up to 1024 cores, but underpopulating nodes due to memory bandwidth limits)
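For point 2, a minimal sketch of why the multi-GPU runs are communication-bound, under my own assumptions (one MPI rank per GPU, forces staged through host memory, no direct GPU-to-GPU transfers; the reduce_forces helper and system size are hypothetical): each step the partial forces are copied off the GPU, combined with MPI_Allreduce over InfiniBand, and copied back.

  // mpi_gpu_sketch.cu -- illustrative only; one MPI rank per GPU, forces staged
  // through host memory and combined with MPI_Allreduce. The device<->host
  // copies plus the network hop are why multi-GPU scaling is communication-bound.
  #include <mpi.h>
  #include <cuda_runtime.h>
  #include <stdlib.h>

  // Sum this rank's partial forces with every other rank's (hypothetical helper).
  void reduce_forces(double *d_force, double *h_force, int n3, MPI_Comm comm)
  {
      cudaMemcpy(h_force, d_force, n3 * sizeof(double), cudaMemcpyDeviceToHost);
      MPI_Allreduce(MPI_IN_PLACE, h_force, n3, MPI_DOUBLE, MPI_SUM, comm);
      cudaMemcpy(d_force, h_force, n3 * sizeof(double), cudaMemcpyHostToDevice);
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, ndev;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      cudaGetDeviceCount(&ndev);
      cudaSetDevice(rank % ndev);                     // one GPU per rank

      int n3 = 3 * 23000;                             // ~23K atoms, like the JAC case
      double *d_force, *h_force = (double *)malloc(n3 * sizeof(double));
      cudaMalloc((void **)&d_force, n3 * sizeof(double));
      cudaMemset(d_force, 0, n3 * sizeof(double));

      reduce_forces(d_force, h_force, n3, MPI_COMM_WORLD);

      cudaFree(d_force);  free(h_force);
      MPI_Finalize();
      return 0;
  }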

small case: JAC (standard benchmark), 23K atoms; medium case: 91K atoms; large case: 127K atoms

Graphs were presented, so the following figures are by-eye estimates.

Simulation rates (ns/day) by hardware:

  hardware          | small case | medium case
  256 cores Kraken  | 45         | 22
  8x 2050           | 50         | 20
  4x 2050           | 42         | 15
  2x 2050           | ??         | 8
  1x 2050           | 20         | 7
  1x 1060           | 12         | 3
  2x E5462          | 15         | 2

large case (127K atoms): no figures noted

Lessons
  1. separate calculations in time as well as space (?)
  2. InfiniBand becomes saturated and limits scaling
  3. CUDA's FFT performs best for power-of-two sizes (see the cuFFT sketch below)
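To illustrate lesson 3, a minimal cuFFT sketch (the next_pow2 helper and the grid sizes are my own assumptions, not from the talk): the PME charge-grid dimensions are rounded up to powers of two before creating the 3D real-to-complex plan.

  // pme_fft_sketch.cu -- illustrative only. Lesson: cuFFT is fastest when the
  // PME charge-grid dimensions are powers of two, so round the grid size up.
  #include <cufft.h>
  #include <cuda_runtime.h>
  #include <stdio.h>

  // Round a requested grid dimension up to the next power of two (hypothetical helper).
  static int next_pow2(int n)
  {
      int p = 1;
      while (p < n) p <<= 1;
      return p;
  }

  int main(void)
  {
      int nx = next_pow2(90), ny = next_pow2(90), nz = next_pow2(90);   // 90 -> 128

      cufftReal    *grid;    // real-space charge grid
      cufftComplex *kgrid;   // reciprocal-space grid
      cudaMalloc((void **)&grid,  sizeof(cufftReal)    * nx * ny * nz);
      cudaMalloc((void **)&kgrid, sizeof(cufftComplex) * nx * ny * (nz / 2 + 1));
      cudaMemset(grid, 0, sizeof(cufftReal) * nx * ny * nz);

      cufftHandle plan;
      cufftPlan3d(&plan, nx, ny, nz, CUFFT_R2C);   // forward real-to-complex 3D FFT
      cufftExecR2C(plan, grid, kgrid);             // long-range (reciprocal) part of PME

      printf("PME grid: %d x %d x %d\n", nx, ny, nz);
      cufftDestroy(plan);
      cudaFree(grid);  cudaFree(kgrid);
      return 0;
  }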

future

Other GPU implementations (mostly single precision only):

ACEMD (Matt Harvey), CHARMM (the Amber developers are also involved with CHARMM), Folding@home, GROMACS, HOOMD, LAMMPS, NAMD, OpenMM

Qs