A previous post, Plenty of Room at Exascale, focused on one specific commercial approach to scaling CFD to large problems on heterogeneous (CPU and GPU) clusters. Here are some more references I found to be interesting reading on this sort of approach.
Strategies
Recent progress and challenges in exploiting graphics processors in computational fluid dynamics is a review that distills some general strategies for using multiple levels of parallelism across GPUs, CPU cores, and cluster nodes:
- Global memory accesses should be arranged so that read/write requests coalesce, which can improve performance by an order of magnitude (theoretically, up to 32 times: the number of threads in a warp); see the first sketch after this list.
- Shared memory should be used for global reduction operations (e.g., summing residual values, finding maximum values) so that only one value per block needs to be returned (second sketch below).
- Use asynchronous memory transfer, as shown by Phillips et al. and DeLeon et al. when parallelizing solvers across multiple GPUs, to limit the idle time of either the CPU or GPU (third sketch below).
- Minimize slow CPU-GPU communication during a simulation by performing all possible calculations on the GPU.
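To make the coalescing point concrete, here's a minimal sketch of my own (not code from the review): two kernels doing the same trivial scaling of a cell-value array. In the first, consecutive threads in a warp read consecutive addresses, so the hardware can service the warp with one or a few memory transactions; in the second, a strided access pattern turns a warp's 32 loads into up to 32 separate transactions. The kernel names and the scaling operation are just placeholders.

```cuda
// Coalesced: thread i touches element i, so a warp's accesses are contiguous.
__global__ void scale_coalesced(const float* __restrict__ in,
                                float* __restrict__ out,
                                int n, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = alpha * in[i];
}

// Strided: thread i touches element i * stride, scattering a warp's accesses
// across many cache lines, so each load/store becomes its own transaction.
__global__ void scale_strided(const float* __restrict__ in,
                              float* __restrict__ out,
                              int n, int stride, float alpha)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i * stride;
    if (j < n)
        out[j] = alpha * in[j];
}
```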
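The shared-memory reduction idea looks roughly like the following in CUDA. This is a hedged sketch under my own naming (block_sum, residual, partial, none of which come from the review): each block sums its chunk of the residual array in shared memory and writes back a single partial sum, so only one value per block returns to global memory. The same pattern with fmaxf instead of addition gives a per-block maximum.

```cuda
// Per-block sum of a residual array. Launch with
//   block_sum<<<numBlocks, blockDim, blockDim * sizeof(float)>>>(...)
// so each block gets one float of dynamic shared memory per thread.
__global__ void block_sum(const float* __restrict__ residual,
                          float* __restrict__ partial, int n)
{
    extern __shared__ float s[];               // one float per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? residual[i] : 0.0f;     // coalesced load from global memory
    __syncthreads();

    // Tree reduction within the block: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = s[0];            // one value per block goes back out
}
```

The host (or a second, tiny kernel) then only has to combine one partial result per block rather than one per cell.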
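And here's a sketch of the asynchronous-transfer idea using CUDA streams. To be clear, the names (interior_kernel, exchange_halo, halo_bytes) are hypothetical stand-ins for whatever a multi-GPU solver actually does, not code from Phillips et al. or DeLeon et al.; the point is just that the halo copy and the interior update can proceed concurrently on different streams.

```cuda
#include <cuda_runtime.h>

// Placeholder for the real interior stencil update.
__global__ void interior_kernel(float* field, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] *= 0.99f;              // stand-in for the actual update
}

// One time step: overlap the device-to-host halo copy with the interior update.
// h_halo should be page-locked (cudaMallocHost) for the copy to truly overlap.
void step(float* d_field, float* h_halo, size_t halo_bytes,
          dim3 grid, dim3 block, int n)
{
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // Start pulling the halo region back to the host on one stream...
    cudaMemcpyAsync(h_halo, d_field, halo_bytes,
                    cudaMemcpyDeviceToHost, copy);

    // ...while the interior update, which doesn't depend on that copy,
    // runs concurrently on the other stream.
    interior_kernel<<<grid, block, 0, compute>>>(d_field, n);

    // Only wait for the copy when the halo is actually needed,
    // e.g. for an MPI exchange with a neighbouring rank.
    cudaStreamSynchronize(copy);
    // exchange_halo(h_halo);                  // hypothetical MPI exchange (not shown)

    cudaStreamSynchronize(compute);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
}
```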