GPGPU Programming

GPU History

The first GPUs ran shader programs that were highly specialized for the graphics pipeline. Drawbacks included limited programmability: general-purpose computation had to be expressed in terms of graphics operations (vertices, textures, pixels).

GPU Architecture

One GPU has many Streaming Multiprocessors (SMs).

Architecture of an SM

Different compute capabilities have different SM architectures.

The cores in an SM share:

  1. Common L1 cache
  2. Texture memory
  3. Warp scheduler

What is CUDA

Compute Unified Device Architecture (CUDA) is a general-purpose parallel programming model.

CUDA Layers

CUDA C $\rightarrow$ textual intermediate representation (PTX) $\rightarrow$ device-specific binary code (CUBIN)

CUDA C Runtime: higher-level API (cuda* functions, e.g. cudaMalloc, cudaMemcpy, and the <<<...>>> launch syntax); context and module management are implicit.

CUDA driver API: lower-level API (cu* functions) with explicit control over devices, contexts, module loading, and kernel launches.
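
A minimal driver-API sketch (the PTX file name `saxpy.ptx` and kernel name `saxpy` are placeholders, not from the notes): it loads a PTX module produced by the compilation pipeline above and launches a kernel explicitly.

```cuda
#include <cuda.h>

int main(void) {
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);                 // explicit context management

    CUmodule mod;
    cuModuleLoad(&mod, "saxpy.ptx");           // JIT-compile PTX to device code

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "saxpy");    // look up the kernel by name

    int n = 1 << 20;
    CUdeviceptr d_x;
    cuMemAlloc(&d_x, n * sizeof(float));

    void *args[] = { &d_x, &n };               // kernel parameters, passed by address
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1,  // grid dimensions
                   256, 1, 1,                  // block dimensions
                   0, 0,                       // shared memory bytes, stream
                   args, 0);
    cuCtxSynchronize();

    cuMemFree(d_x);
    cuCtxDestroy(ctx);
    return 0;
}
```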

CUDA linkers

Definitions

Kernel: Function running on device

Host: CPU

Device: GPU

Thread: thread of execution on an SM
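A minimal sketch tying these definitions together (the kernel name `scale` and the sizes are illustrative): host code allocates device memory and launches a kernel, and each device thread processes one element.

```cuda
#include <cuda_runtime.h>

// Kernel: runs on the device; each thread handles one element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;
}

int main(void) {                                     // host code runs on the CPU
    const int n = 1 << 20;
    float *d_data;                                   // device pointer
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Host configures and launches the kernel on the device.
    int block = 256;
    int grid  = (n + block - 1) / block;
    scale<<<grid, block>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```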

Block: threads are organized into groups called blocks. One kernel is executed by a grid of thread blocks (note that this configuration is logical, not tied to physical SMs). Threads in a block cooperate via (see the sketch after this list):

  1. Shared memory
  2. Atomic operations
  3. Barrier sync
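
A sketch of block-level cooperation, assuming a block size of 256 threads: shared memory holds a per-block partial sum, __syncthreads() is the barrier, and an atomicAdd combines the per-block results.

```cuda
#include <cuda_runtime.h>

// Illustrative block-level sum: threads in a block cooperate through shared
// memory and __syncthreads(); one thread per block then combines the partial
// results with an atomic add. Assumes blockDim.x == 256.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float partial[256];                  // shared memory per block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                // barrier sync within the block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(out, partial[0]);       // atomic operation
}
```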

Programs can thus scale transparently to any number of processors. Threads in different blocks cannot cooperate.

Transparent scalability: HW is free to schedule threads to any processor at any time.

Thread mapping and architecture

SIMT (Single instruction multiple thread) execution model

SM creates/manages/schedules/executes threads in warps

Warp: Group of 32 parallel threads.
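
A sketch of warp-aware code (assumes compute capability 3.0+ and CUDA 9+ for the *_sync shuffle intrinsics): because the 32 threads of a warp execute in lockstep, they can exchange register values directly without shared memory.

```cuda
#include <cuda_runtime.h>

// Warp-level sum using shuffle intrinsics.
__device__ __forceinline__ float warpSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;                                     // lane 0 ends up with the warp total
}

// Writes one partial sum per warp; 'out' must hold gridDim.x * (blockDim.x / 32) floats.
__global__ void sumPerWarp(const float *in, float *out, int n) {
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;                    // position within the warp
    int warp = threadIdx.x / 32;                    // warp index within the block

    float v = (i < n) ? in[i] : 0.0f;
    v = warpSum(v);
    if (lane == 0) out[blockIdx.x * (blockDim.x / 32) + warp] = v;
}
```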

Memory model

On-chip device memory (on the SM): registers, shared memory, L1 cache.

Off-chip device memory: global, local, constant, and texture memory (resident in device DRAM).

Global Memory

Coalesced access to global memory: the memory accesses of the threads in a warp (a set of 32 threads) are combined (coalesced) into as few memory transactions as possible.
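
A sketch contrasting coalesced and strided (uncoalesced) access patterns; the kernels are illustrative.

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive threads of a warp touch consecutive addresses,
// so the warp's 32 loads collapse into a few memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses 'stride' elements apart,
// so each warp needs many more transactions for the same amount of data.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```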

Shared Memory

Higher bandwidth + lower latency than local/global

Shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously; accesses by threads of the same warp to the same bank are serialized (bank conflicts).
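
A sketch of the classic padded shared-memory tile (assumes a square matrix whose width is a multiple of 32, launched with dim3 block(32, 32) and grid(width/32, width/32)): the extra column spreads column accesses across different banks and avoids 32-way bank conflicts.

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Tiled matrix transpose. Without the +1 padding, the column reads from
// 'tile' would all hit the same bank and be serialized.
__global__ void transposeTile(const float *in, float *out, int width) {
    __shared__ float tile[TILE][TILE + 1];          // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;       // transposed block coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}
```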

Optimization

Memory optimization

  1. Minimize host-device data transfer
    1. Peak on-/off-chip memory bandwidth $\gg$ peak device-host (PCIe) bandwidth
    2. Use page-locked (pinned) host memory for transfers
  2. Coalesce global memory accesses
  3. Minimize global memory accesses
  4. Minimize bank conflicts
  5. Use page-locked (pinned) memory for transfers
    1. Not cached
    2. Zero-copy: mapped pinned memory that the device accesses directly in host memory
    3. In CUDA there is also __managed__ memory (unified memory model): a single pointer usable from host and device (see the sketch under Pinned memory below)

Pinned memory
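
A sketch of the three host-memory variants mentioned above, using the runtime API (allocation sizes are illustrative):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const int n = 1 << 20;

    // Page-locked (pinned) host memory: faster host<->device copies and
    // required for fully asynchronous cudaMemcpyAsync.
    float *h_pinned;
    cudaMallocHost((void **)&h_pinned, n * sizeof(float));

    // Zero-copy: mapped pinned memory that the device reads/writes directly
    // over the bus, with no explicit copy. (Older devices may additionally
    // require cudaSetDeviceFlags(cudaDeviceMapHost) before any allocation.)
    float *h_mapped, *d_mapped;
    cudaHostAlloc((void **)&h_mapped, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);

    // Unified (managed) memory: one pointer usable from host and device;
    // the runtime migrates pages on demand.
    float *managed;
    cudaMallocManaged((void **)&managed, n * sizeof(float), cudaMemAttachGlobal);

    cudaFreeHost(h_pinned);
    cudaFreeHost(h_mapped);
    cudaFree(managed);
    return 0;
}
```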

Execution config: specified at launch as kernel<<<gridDim, blockDim, dynamicSharedMemBytes, stream>>>(args); the last two arguments are optional.

Occupancy vs resource utilization: occupancy is the ratio of active warps per SM to the maximum supported, limited by per-thread register use, per-block shared memory, and block size. Higher occupancy helps hide latency, but maximum occupancy is not required for peak performance.
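
A sketch using the runtime occupancy calculator (`dummyKernel` and the block size are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dummyKernel(float *x) { x[threadIdx.x] += 1.0f; }

int main(void) {
    // Ask the runtime how many blocks of a given size fit on one SM,
    // then derive occupancy from the device's warp limits.
    int blockSize = 256;
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  dummyKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```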

Avoid multiple contexts per GPU within the same CUDA application (i.e., avoid running multiple CUDA application processes on one GPU, since each process creates its own context).

Instruction throughput

Max out fast arithmetic instructions!

Minimize warp divergence caused by control flow.
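
A sketch of divergent vs. warp-uniform branching; the arithmetic is illustrative.

```cuda
#include <cuda_runtime.h>

// Divergent: threads within the same warp take different branches, so the
// warp executes both paths one after the other.
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) data[i] *= 2.0f;
    else                      data[i] += 1.0f;
}

// Warp-uniform: the branch condition is the same for all 32 threads of a
// warp, so each warp follows a single path.
__global__ void uniformBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0) data[i] *= 2.0f;
    else                             data[i] += 1.0f;
}
```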

Optimize out sync points.