
Accelerating Data-Serial Applications on Data-Parallel GPGPUs: A Systems Approach

Aji, Ashwin M. and Feng, Wu-chun (2008) Accelerating Data-Serial Applications on Data-Parallel GPGPUs: A Systems Approach. Technical Report TR-08-24, Computer Science, Virginia Tech.

Full text available as:
ipdps08.pdf (1341678)

Abstract

The general-purpose graphics processing unit (GPGPU) continues to make significant strides in high-end computing by delivering unprecedented performance at a commodity price. However, the many-core architecture of the GPGPU currently allows only data-parallel applications to extract the full potential of the hardware. Applications that require frequent synchronization during their execution gain little performance from the GPGPU. This is mainly due to the lack of explicit hardware or software support for inter-thread communication across the entire GPGPU chip. In this paper, we design, implement, and evaluate a highly efficient software barrier that synchronizes all the thread blocks of an offloaded kernel running on the GPGPU without having to transfer execution control back to the host processor. We show that our custom software barrier achieves a three-fold performance improvement over the existing approach, i.e., synchronization via the host processor. To illustrate this performance benefit, we parallelize a data-serial application, specifically an optimal sequence-search algorithm called Smith-Waterman (SWat), that requires frequent barrier synchronization across the many cores of the nVIDIA GeForce GTX 280 GPGPU. Our parallelization consists of a suite of optimization techniques: optimal data layout, coalesced memory accesses, and blocked data decomposition. When coupled with our custom software-barrier implementation, these techniques achieve nearly a nine-fold speed-up over the serial implementation of SWat. We also show that our solution delivers 25× faster on-chip execution than the naïve implementation.
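
To make concrete why Smith-Waterman needs frequent barrier synchronization, the following is a minimal Python sketch, not the report's GPU implementation. It assumes a linear gap penalty and illustrative scores (match=2, mismatch=-1, gap=-1); the function names sw_row_major and sw_wavefront are hypothetical. Each cell of the scoring matrix depends on its upper, left, and upper-left neighbors, so only the cells on one anti-diagonal are independent of each other, and a parallel version would need one global barrier per anti-diagonal.

```python
def sw_row_major(a, b, match=2, mismatch=-1, gap=-1):
    """Reference local-alignment score, filled row by row (serial order)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


def sw_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Same recurrence, filled one anti-diagonal (i + j = d) at a time.

    Returns (best_score, barriers), where barriers counts the points at
    which a parallel version would have to synchronize all workers.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    barriers = 0
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        # Cells on this anti-diagonal read only diagonals d-1 and d-2,
        # so this inner loop is the part that could run in parallel.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
        # Assumed synchronization point: diagonal d must be complete
        # before any cell of diagonal d+1 is computed.
        barriers += 1
    return best, barriers
```

For two non-empty sequences of lengths m and n, this sketch passes through m + n - 1 synchronization points, which suggests why the cost of each barrier matters so much for this class of application.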

Item Type: Departmental Technical Report
Subjects: Computer Science > Algorithms and Data Structure
ID Code: 1052
Deposited By: Administrator, Eprints
Deposited On: 16 December 2008