Parallel Computing Architectures (Hardware)

Goals: how to enable parallelism in hardware, and how to understand and optimize the performance of parallel programs.

Concurrency vs Parallelism

Concurrency: tasks have overlapping time periods; interleaving is OK.
Parallelism: tasks run at exactly the same time (simultaneously); interleaving does not happen.

Due to the core’s pipeline, simultaneous execution is possible even at the level of a single core.

Forms of parallelism

Superscalar: multiple instructions from the same thread execute at the same time.
Multithreading: multiple threads execute at the same time.
Multiprocessing: multiple threads/processes execute at the same time.

Bit-level parallelism

Since we operate on multiple bits at once in a single operation, we almost always have bit-level parallelism.

We typically operate on whole words.

Word size: the number of bits the processor handles in a single operation (commonly 32 or 64 bits today).
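As a rough sketch (my own example, not from the notes): the first function below ANDs two 64-bit values one bit at a time, while the second does the same work with a single word-wide instruction, which is exactly the bit-level parallelism described above.

    #include <stdint.h>

    /* One bit at a time: 64 separate single-bit AND operations. */
    uint64_t and_bit_by_bit(uint64_t a, uint64_t b) {
        uint64_t result = 0;
        for (int i = 0; i < 64; i++) {
            uint64_t bit = ((a >> i) & 1u) & ((b >> i) & 1u);
            result |= bit << i;
        }
        return result;
    }

    /* Whole word at once: a single AND instruction processes all 64 bits. */
    uint64_t and_word(uint64_t a, uint64_t b) {
        return a & b;
    }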

Instruction-level parallelism

Across time: pipelining.
Across space: superscalar execution.

5-stage instruction execution (classic RISC stages: fetch, decode, execute, memory access, write-back)

Pipelining

Problems: hazards (structural, data, and control) force the pipeline to stall.

Superscalar

Superscalar execution duplicates hardware, e.g. the ALU.

It duplicates all or some of the stages of the pipeline. Instead of 1 instruction in the fetch stage, we can have 2 instructions in that stage at the same time.

Summary of ILP:

The speedup is still limited because all the instructions come from the same thread, so dependencies within that single instruction stream remain the bottleneck.
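A small sketch (my own illustration) of why a single instruction stream limits ILP: the first loop is a serial dependency chain, so the additions cannot overlap, while the second uses two independent accumulators that a superscalar core can execute in parallel.

    #include <stddef.h>

    /* Serial dependency chain: each add depends on the previous one,
       so instruction-level parallelism is limited. */
    double sum_serial(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Independent accumulators: the two adds in each iteration do not
       depend on each other, so a superscalar core can overlap them. */
    double sum_unrolled(const double *x, size_t n) {
        double s0 = 0.0, s1 = 0.0;
        size_t i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += x[i];
            s1 += x[i + 1];
        }
        if (i < n) s0 += x[i];   /* odd tail element */
        return s0 + s1;
    }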

Thread-level parallelism

While software can run multiple threads concurrently, the processor can run multiple threads in parallel!

Implementations

Fine-grained multithreading: switch threads every instruction.

Coarse-grained multithreading: switch threads only when the current thread stalls (on certain types of stalls, e.g. long-latency cache misses).

The processor can also have multiple cores!
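A minimal sketch, assuming POSIX threads are available: software creates two threads, and a multithreaded or multicore processor may run them in parallel.

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread runs this function with its own argument. */
    static void *worker(void *arg) {
        int id = *(int *)arg;
        printf("hello from thread %d\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;

        /* Two software threads; the OS may schedule them on
           different cores / hardware threads, so they can run in parallel. */
        pthread_create(&t1, NULL, worker, &id1);
        pthread_create(&t2, NULL, worker, &id2);

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }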

Core-level parallelism

These days we put multiple cores in the same processor.

Flynn’s Parallel Architecture Taxonomy

SISD (Single insn single data)

  1. One stream of instructions
  2. One instruction operates on one piece of data
  3. Uniprocessors

SIMD (Single insn multiple data)

Data parallelism

Only suitable for data-parallel computations, where the same operation is applied to many data elements.
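A hedged sketch (my own example, assuming an x86 CPU with SSE): a single _mm_add_ps instruction adds four pairs of floats at once, i.e. one instruction, multiple data.

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Adds n floats element-wise; n is assumed to be a multiple of 4 here. */
    void add_arrays(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
            _mm_storeu_ps(&out[i], vc);
        }
    }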

MISD (Multiple insn single data)

MIMD (Multiple insn multiple data)

Stream processor variants (SIMD + MIMD)

Multiprocessor + GPGPU

Multicore architecture

Design variants:

Hierarchical design

Pipelined design

Network-based design

BUT future trends indicate:

Memory Organization

Uniprocessor

Problems in memory management:

  1. Memory latency
    1. Time for a memory request to be serviced by memory
    2. e.g. 100 cycles or 100 ns
  2. Memory bandwidth
    1. How fast memory can supply data
    2. e.g. 20 GB/s
    3. The bottleneck nowadays

If the memory bus is fully occupied, it cannot deliver data faster than its bandwidth.
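A back-of-the-envelope sketch (the 20 GB/s figure comes from the example above; the code is my own): a copy loop moves 16 bytes per iteration, so 20 GB/s caps it at roughly 1.25 billion iterations per second no matter how fast the core runs.

    #include <stddef.h>

    /* STREAM-style copy: each iteration moves 16 bytes (8 read + 8 written).
       At 20 GB/s of memory bandwidth, that caps the loop at about
       20e9 / 16 = 1.25e9 iterations per second, however fast the core is,
       once the arrays no longer fit in the caches. */
    void stream_copy(double *dst, const double *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }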

Distributed-Memory systems

Shared Memory System

The distinction is not about the underlying hardware, but about whether the threads/processes share an address space.

Pros: no partitioning of data; no physical movement of data.
Cons: synchronization constructs required; lack of scalability (contention).
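To illustrate the "synchronization constructs required" point, a minimal sketch of my own (POSIX threads assumed): two threads incrementing a shared counter need a mutex, otherwise the updates race.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                          /* shared data, one address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);                /* synchronization construct */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, increment, NULL);
        pthread_create(&t2, NULL, increment, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);           /* 2000000 with the lock */
        return 0;
    }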

UMA: Uniform Memory Access (uniform access time)

Across DDRx generations, memory speed (latency) stays roughly constant, but bandwidth keeps improving.

Cache coherence problem

Shared distributed architecture

NUMA architectures can reduce memory contention in distributed shared-memory systems.

Two separate cores accessing the same memory unit cause contention.

NUMA implies multiple memory units, which reduces contention.
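A hedged sketch, assuming Linux with libnuma installed (link with -lnuma): placing a thread's working set on its local NUMA node keeps its accesses off the remote memory units and reduces contention.

    #include <numa.h>      /* libnuma; link with -lnuma */
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not supported on this system\n");
            return 1;
        }

        size_t bytes = 64 * 1024 * 1024;

        /* Place the buffer on NUMA node 0: cores on node 0 get local,
           lower-latency access; cores on other nodes would go remote. */
        double *buf = numa_alloc_onnode(bytes, 0);
        if (buf == NULL) return 1;

        for (size_t i = 0; i < bytes / sizeof(double); i++)
            buf[i] = 0.0;

        numa_free(buf, bytes);
        return 0;
    }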

Memory consistency problem

Proc 0: A = 1; then Flag = 1;

Proc 1: print A; (not synchronized: it may run before or after Flag = 1, possibly with a long delay, so it can print either 0 or 1)

Proc 2: while (Flag == 0); print A; (prints A once Flag is no longer 0)

Both proc 1 and proc 2 print A, but proc 2 will always print 1, because proc 0's writes are performed in program order.

There is some overhead to maintain memory consistency (e.g. to give proc 2 this guarantee).
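A hedged sketch (my own, using C11 atomics rather than whatever mechanism the lecture assumes): release/acquire ordering on Flag guarantees that once proc 2 sees Flag == 1, it also sees A == 1.

    #include <stdatomic.h>
    #include <stdio.h>

    int A = 0;
    atomic_int Flag = 0;

    /* Proc 0 */
    void producer(void) {
        A = 1;
        /* Release: all earlier writes (A = 1) become visible before Flag = 1. */
        atomic_store_explicit(&Flag, 1, memory_order_release);
    }

    /* Proc 2 */
    void consumer(void) {
        /* Acquire: once we observe Flag == 1, we also observe A == 1. */
        while (atomic_load_explicit(&Flag, memory_order_acquire) == 0)
            ;
        printf("%d\n", A);   /* always prints 1 */
    }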

Slide 42 (figure legend): BLUE: the memory system is busy fetching data and sending it to the processor. GREEN: a load instruction in the processor (short if it hits in the cache; otherwise it has to go to shared memory). In this example, every load goes to memory. GRAY: the rest of the pipeline.

memory contention: