Architecture-Aware Optimization on a 1600-core Graphics Processor

Daga, Mayank and Scogland, Thomas R.W. and Feng, Wu-chun (2011) Architecture-Aware Optimization on a 1600-core Graphics Processor. Technical Report TR-11-08, Computer Science, Virginia Tech.

Full text available as:
Arch-Aware-Opt.pdf (485412)

Abstract

The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, comparatively little exists for OpenCL on the AMD GPU. Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU optimizations by applying each optimization in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific GPU optimizations, the AMD Radeon HD 5870 GPU delivers 65% better performance than with the well-known NVIDIA-specific optimizations.
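
As a rough illustration of optimizations (i) and (iii), the sketch below contrasts a loop that branches on its data and updates its output through a pointer with a version that keeps the accumulator in a local scalar and replaces the branch with a 0/1 mask. It is written in plain host-side C rather than OpenCL, and it is not taken from the report or from GEM: the function names, the cutoff test, and the data layout are all hypothetical. Whether a given compiler actually emits a branch for either version depends on the compiler and target, so this shows the source-level transformation only. Vector types (ii) and image memory (iv) are not shown.

```c
#include <stddef.h>

/* Hypothetical baseline: the loop body contains a data-dependent branch,
 * and the running sum is read and written through the output pointer on
 * every iteration. */
void accumulate_branchy(const float *charge, const float *dist, size_t n,
                        float cutoff, float *out)
{
    *out = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (dist[i] < cutoff) {
            *out += charge[i];
        }
    }
}

/* Hypothetical rewrite: the running sum lives in a local scalar (a natural
 * candidate for a register), and the branch is replaced by multiplying with
 * a mask that is 1.0f when the condition holds and 0.0f otherwise.
 * Assumes all charge values are finite; a NaN or infinite charge that the
 * baseline would skip would contaminate the sum here. */
void accumulate_branchless(const float *charge, const float *dist, size_t n,
                           float cutoff, float *out)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float mask = (float)(dist[i] < cutoff);
        acc += mask * charge[i];
    }
    *out = acc; /* single store at the end */
}
```

For finite inputs both functions produce the same sum. The second form trades a conditional skip for an unconditional multiply-add, which is the kind of exchange the abstract's "removal of branches" refers to in general terms.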

Item Type:	Departmental Technical Report
Keywords:	GPU, OpenCL, CUDA, performance evaluation, performance optimization, hill climbing, kernel splitting, local staging
Subjects:	Computer Science > Software Engineering
ID Code:	1159
Deposited By:	Administrator, Eprints
Deposited On:	06 July 2011