Shared Memory Systems
- No need to partition data
- More efficient communication than in distributed-memory systems
- Requires synchronization constructs (e.g. locks, barriers)
- Lack of scalability due to memory contention
Examples
OpenMP:
- Easy to add parallelism: just add compiler directives (pragmas) to an existing C/C++ program
- No need to copy memory to a separate device
- Heavyweight threads
- Unrestricted access to shared resources, so access has to be coordinated and synchronized
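As a minimal sketch of the directive-based style described above, the following sums a vector with a single OpenMP pragma. The function name `parallel_sum` is hypothetical and chosen only for illustration; the `reduction` clause is one way to handle the coordination of shared data mentioned in the last bullet.

```cpp
#include <vector>

// Hypothetical helper: sums a vector with an OpenMP parallel loop.
// If the compiler is run without OpenMP support (e.g. no -fopenmp),
// the pragma is ignored and the loop simply runs sequentially.
double parallel_sum(const std::vector<double>& data) {
    double total = 0.0;
    // reduction(+:total) gives each thread a private partial sum and
    // combines the partial sums at the end, so no explicit lock is needed.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < static_cast<long>(data.size()); ++i) {
        total += data[i];
    }
    return total;
}
```

Without the `reduction` clause, every thread would update `total` directly, which is a data race on a shared variable.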
CUDA:
- Lightweight threads: numerous and cheap to create and destroy
- Reduces memory overhead and contention through good use of shared memory (shared only among the threads of a block)
- Requires code that runs efficiently in lockstep; divergent conditionals slow it down
Foster’s Methodology
Decomposition: partition data or tasks.
Task granularity: affects communication and thread-creation costs
- Fine-grained task partition
  - parallelism overhead (creation and merging of threads)
  - communication overhead
- Coarse-grained task partition
  - less parallelism
  - but less overhead
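The granularity trade-off can be made concrete with a small block-decomposition sketch. The helper `chunk_range` below is hypothetical (not from any library): it splits `n` items into `chunks` contiguous pieces of nearly equal size, so the number of chunks acts as the granularity knob.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: half-open index range [first, second) of chunk `i`
// when `n` items are split into `chunks` nearly equal contiguous chunks.
// More chunks means finer granularity; fewer chunks means coarser granularity.
std::pair<std::size_t, std::size_t> chunk_range(std::size_t n,
                                                std::size_t chunks,
                                                std::size_t i) {
    assert(chunks > 0 && i < chunks);
    std::size_t base = n / chunks;   // minimum items per chunk
    std::size_t extra = n % chunks;  // the first `extra` chunks get one more item
    std::size_t lo = i * base + (i < extra ? i : extra);
    std::size_t hi = lo + base + (i < extra ? 1 : 0);
    return {lo, hi};
}
```

For example, 10 items in 3 chunks gives the ranges [0, 4), [4, 7) and [7, 10). Choosing `chunks` equal to the number of threads is the coarse end; choosing `chunks == n` is the fine end, with one task per item.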
Communication: local (parallel) or global (sequential)
Rules of thumb:
- Balanced among tasks
- Performed in parallel
- Overlapped with computation
Agglomeration: combine tasks into larger groups to reduce sending/receiving between them
- improve performance
- improve scalability
Mapping: assign tasks to execution units
Parallel Programming models
- Task pool
- Parbegin-parend
- SIMD/SPMD
- Master-Worker
- Client-Server/MPMD
- Producer Consumer
- Pipelining
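To illustrate one of the models listed above, here is a minimal task pool sketch using standard C++ threads. The function `run_task_pool` and the choice of "square one element" as the task are assumptions made for illustration; a real pool would typically hold arbitrary work items.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical task pool: `workers` threads repeatedly claim the next
// unprocessed index from a shared atomic counter until none remain.
// Each "task" here squares one input element.
std::vector<long> run_task_pool(const std::vector<long>& input, unsigned workers) {
    std::vector<long> output(input.size());
    std::atomic<std::size_t> next{0};  // index of the next unclaimed task

    auto worker = [&]() {
        for (;;) {
            std::size_t i = next.fetch_add(1);  // claim one task atomically
            if (i >= input.size()) break;       // pool is empty
            output[i] = input[i] * input[i];    // each index is written by one thread only
        }
    };

    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) threads.emplace_back(worker);
    for (auto& t : threads) t.join();  // wait for all workers to finish
    return output;
}
```

Because idle workers pull the next task themselves, load balances dynamically; the price is contention on the shared counter when tasks are very small.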
Metrics
perf list (lists the events that can be measured), for example:
- branch instructions
- page faults
- cache misses
- cycles
- instructions
- floating point operations
perf stat (counts these events while running a command)
Algorithm description must include:
- Data distribution
- Parallel programming model
- Key constructs (MPI, CUDA, OpenMP)
- Metrics (utilization of processes, resources, idle time, cache)
- Interconnection (if distributed)
- Sources of inefficiency
  - Waiting/idle time
  - Overheads
  - Cache misses/thrashing/memory contention
- Prevention of deadlock/data race
  - Odd-even ordering often works (e.g. even ranks send first while odd ranks receive, then swap)
CUDA Programming: Memory Management
- Coalesce accesses to global memory, which is served in 32-byte chunks
- Shared memory usage
- Bank conflict
- Strided access
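The effect of strided access on both points above can be estimated with simple host-side arithmetic. The sketch below is plain C++, not device code, and rests on stated assumptions: a warp of 32 threads, 32 shared-memory banks that are each 4 bytes wide, 4-byte elements, an aligned base address, and a stride of at least 1. The function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <set>

constexpr int kWarpSize = 32;    // threads per warp (assumed)
constexpr int kNumBanks = 32;    // shared-memory bank count (assumed)
constexpr int kWordBytes = 4;    // element and bank width in bytes (assumed)
constexpr int kChunkBytes = 32;  // global-memory chunk size from the notes

// Worst-case number of threads in a warp that hit the same shared-memory
// bank when thread t reads word t * stride (1 means conflict-free).
int bank_conflict_degree(int stride) {
    std::array<int, kNumBanks> hits{};
    for (int t = 0; t < kWarpSize; ++t) hits[(t * stride) % kNumBanks]++;
    return *std::max_element(hits.begin(), hits.end());
}

// Number of distinct 32-byte chunks a warp touches in global memory when
// thread t reads the 4-byte element at index t * stride.
int chunks_touched(int stride) {
    std::set<int> chunks;
    for (int t = 0; t < kWarpSize; ++t)
        chunks.insert(t * stride * kWordBytes / kChunkBytes);
    return static_cast<int>(chunks.size());
}
```

Under these assumptions, stride 1 is conflict-free and touches 4 chunks, stride 2 causes a 2-way bank conflict and touches 8 chunks, and stride 32 serializes all 32 threads on one bank. An odd stride such as 33 is again conflict-free in shared memory, which is why padding an array row by one element is a common fix.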