Shared Memory Systems

No need to partition data
More efficient communication than in distributed-memory systems
Requires synchronization constructs
Limited scalability due to memory contention

Examples

OpenMP:

Easy to add parallelism (just add compiler directives on top of a C/C++ program)
No need to copy memory to a separate device
Heavyweight threads
Unrestricted access to shared resources: access has to be coordinated and synchronized
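
As an illustration of the directive style, here is a minimal sketch. It assumes a compiler with OpenMP enabled (for example via -fopenmp); the function name parallel_sum is hypothetical. The reduction clause coordinates access to the shared accumulator. If OpenMP is not enabled, the pragma is ignored and the loop runs sequentially with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: sums a vector, parallelised with a single directive.
long long parallel_sum(const std::vector<int>& data) {
    long long sum = 0;
    const long long n = static_cast<long long>(data.size());
    // Each thread accumulates into a private copy of `sum`; OpenMP combines
    // the private copies at the end, so no explicit lock is needed.
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Without the reduction clause, every thread would update the same variable and the access would have to be synchronized by hand.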

CUDA:

Lightweight threads that are numerous, easy to create and destroy
Reduce memory overhead and contention through good use of shared memory (shared only among the threads of a block)
Requires code that runs efficiently in lockstep; divergent conditionals slow it down

Foster’s Methodology

Decomposition: partition data or tasks.

Task granularity: Impact on communication/thread formation

Fine grained task partition
- parallelism overhead (creation and merging of threads)
- communication overhead
Coarse grained task partition
- less parallelism
- but less overhead
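
The granularity trade-off can be sketched with an OpenMP chunk size. This is an assumption-laden illustration, not a tuned implementation: task_count and square_all are hypothetical names, and grain is assumed to be at least 1. A small grain gives many small tasks (more scheduling overhead), a large grain gives few large tasks (less parallelism).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of tasks (chunks) created for n iterations at a given grain size.
// Assumes grain >= 1.
std::size_t task_count(std::size_t n, std::size_t grain) {
    return (n + grain - 1) / grain;  // ceiling division
}

// Hypothetical kernel: squares each element, handing out `grain`
// iterations at a time to whichever thread is free.
void square_all(std::vector<double>& x, int grain) {
    const long long n = static_cast<long long>(x.size());
    #pragma omp parallel for schedule(dynamic, grain)
    for (long long i = 0; i < n; ++i) {
        x[i] = x[i] * x[i];
    }
}
```

For 1000 iterations, a grain of 1 yields 1000 tasks, while a grain of 250 yields only 4.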

Communication: local (between neighbouring tasks, can proceed in parallel) or global (involves many tasks, tends to be sequential)

Rules of thumb:

Balanced amongst tasks
Performed in parallel
Overlap with computation

Agglomeration: Combine groups of tasks for sending/receiving

improve performance
improve scalability

Mapping: Assignment of tasks to execution units
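
Two common mapping schemes, block and cyclic, can be sketched as index arithmetic. These are given as illustrations only; the notes do not prescribe a scheme, and the function names are hypothetical. Both assume units >= 1.

```cpp
#include <cassert>
#include <cstddef>

// Block mapping: contiguous ranges of tasks go to the same execution unit.
std::size_t block_owner(std::size_t task, std::size_t num_tasks,
                        std::size_t units) {
    const std::size_t per_unit = (num_tasks + units - 1) / units;  // ceiling
    return task / per_unit;
}

// Cyclic mapping: tasks are dealt out round-robin across the units.
std::size_t cyclic_owner(std::size_t task, std::size_t units) {
    return task % units;
}
```

Block mapping keeps neighbouring tasks together, which tends to help locality. Cyclic mapping spreads neighbouring tasks apart, which tends to help load balance when task cost varies with the index.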

Parallel Programming models

Task pool
Parbegin-parend
SIMD/SPMD
Master-Worker
Client-Server/MPMD
Producer Consumer
Pipelining
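
As one example from this list, a task pool can be sketched with a shared counter from which workers claim tasks. This is a minimal sketch assuming OpenMP; run_task_pool and the task body (squaring the index) are hypothetical. Without OpenMP, a single thread drains the pool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Task pool sketch: workers repeatedly claim the next unprocessed task
// index from a shared counter until the pool is empty.
std::vector<int> run_task_pool(int num_tasks) {
    std::vector<int> results(static_cast<std::size_t>(num_tasks), 0);
    int next = 0;  // shared counter: index of the next unclaimed task
    #pragma omp parallel
    {
        for (;;) {
            int task;
            // Claim one task atomically so no two workers get the same index.
            #pragma omp atomic capture
            task = next++;
            if (task >= num_tasks) break;  // pool is drained
            results[task] = task * task;   // hypothetical task body
        }
    }
    return results;
}
```

Because each worker takes a new task only when it has finished the previous one, the load balances itself even if tasks differ in cost.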

Metrics

perf list (lists the events that can be measured), e.g.:

branch instructions
page faults
cache misses
cycles
instructions
floating point operations

perf stat (counts the selected events over a run of the program)

Algorithm description

must include:

Data distribution
Parallel programming model
Key constructs (MPI, CUDA, OpenMP)
Metrics (utilization of processes, resources, idle time, cache)
Interconnection (if distributed)
Sources of inefficiency
1. Waiting/idle time
2. Overheads
3. Cache misses/thrashing/memory contention
Prevention of deadlock/data race
1. Odd-even ordering often works (odd and even participants act in alternating phases)
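
Assuming "odd even" here refers to the odd-even phase scheme, odd-even transposition sort is a small example of it. The sketch below assumes OpenMP and uses the hypothetical name odd_even_sort. In each phase the compared pairs are disjoint, so no two iterations touch the same element and no data race arises.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Odd-even transposition sort: n phases, alternating between even pairs
// (0,1),(2,3),... and odd pairs (1,2),(3,4),...
void odd_even_sort(std::vector<int>& a) {
    const long long n = static_cast<long long>(a.size());
    for (long long phase = 0; phase < n; ++phase) {
        const long long start = phase % 2;  // 0 in even phases, 1 in odd ones
        // Pairs within one phase are disjoint, so the iterations are
        // independent; the implicit barrier at the end of the loop
        // separates consecutive phases.
        #pragma omp parallel for
        for (long long i = start; i < n - 1; i += 2) {
            if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
        }
    }
}
```

The same alternation is often applied to message exchange, with one parity sending first while the other receives first, so that no pair of partners blocks on each other.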

CUDA Programming: Memory Management

Coalesce accesses to global memory (serviced in 32-byte chunks)
Shared memory usage
- Bank conflict
- Strided access
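
The effect of strided access can be estimated with host-side arithmetic. The sketch below is plain C++, not device code. It assumes a warp of 32 threads, 32 shared-memory banks of 4-byte words, 4-byte elements, and a 32-byte-aligned array base; the function names are hypothetical. Stride is assumed to be at least 1, since threads reading the same word are served by a broadcast that this simple count does not model.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>

constexpr int kWarpSize = 32;      // threads per warp
constexpr int kNumBanks = 32;      // assumed: 32 banks, each 4 bytes wide
constexpr int kSegmentBytes = 32;  // global-memory chunk size from the notes

// Worst-case number of threads in a warp that hit the same shared-memory
// bank when thread t accesses the 4-byte word at index t * stride.
int bank_conflict_degree(int stride) {
    std::map<int, int> hits;  // bank -> number of threads mapped to it
    int worst = 0;
    for (int t = 0; t < kWarpSize; ++t) {
        const int bank = (t * stride) % kNumBanks;
        worst = std::max(worst, ++hits[bank]);
    }
    return worst;
}

// Number of distinct 32-byte chunks touched when thread t accesses the
// 4-byte element at index t * stride in global memory.
int segments_touched(int stride) {
    std::set<int> segments;
    for (int t = 0; t < kWarpSize; ++t) {
        segments.insert((t * stride * 4) / kSegmentBytes);
    }
    return static_cast<int>(segments.size());
}
```

Under these assumptions, stride 1 is conflict-free and touches 4 chunks, stride 2 gives a 2-way bank conflict and touches 8 chunks, and stride 32 puts all 32 threads on one bank. An odd stride such as 33 is conflict-free again, which is why padding a shared array by one element is a common fix.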