GPU Parallelism 
---------------

.. admonition:: Overview
   :class: Overview

    * **Tutorial:** 30 min
    * **Exercises:** 10 min

        **Objectives:**
            #. Learn the general architecture of GPUs.
            #. Learn about the execution model of an NVIDIA GPU. 
            #. Learn about thread indexing in GPUs.
            #. Learn about streams in NVIDIA GPU.
            #. Learn about data movements in GPUs.

Gadi only has NVIDIA GPUs. So when we say GPUs we mean NVIDIA GPUs. Nevertheless, many concepts discussed here 
are the same across different vendors. While the CPU is optimized to do a single operation as fast as it 
can (low latency operation), the GPU is optimized to do a large number of slow operations (high throughput 
operation). GPUs are composed of multiple Streaming Multiprocessors (SMs), an on-chip L2 cache, and 
high-bandwidth DRAM. The SMs execute operations and the data and code are accessed from DRAM through the L2 cache.

.. image:: ../figs/SM.png

Each SM is organized into CUDA cores capable of doing specialized operations.

.. image:: ../figs/cuda_cores.png

GPU Execution Model
*******************

Each GPU kernels are launched with a set of threads. The threads can be organized into blocks, and the blocks 
can be organized into a grid. The maximum number of threads a block can have will depend on the GPU generation. 

.. image:: ../figs/blocks.png

A block can be executed only in one SM, but an SM can have multiple blocks simultaneously. The maximum number 
of blocks an SM can host will depend on the GPU generation. Since an SM can execute multiple thread blocks 
concurrently, it is always a good idea to launch a kernel with blocks several times higher than the number 
of SMs. 

.. image:: ../figs/wave.png

**Wave** is the number of thread blocks that run concurrently. So if we have 12 SMs and we launch a kernel 
with 8 blocks, with an occupancy of 1 block per SM, there will be two waves.


Thread Indexing
***************

Threads, blocks, and grids are organized in three dimensions: x, y, and z. For simplicity, we will use only 
two dimensions.

**Dimensions**:

1.  *gridDim.x* — blocks in the x dimension of the grid 
2.  *gridDim.y* — blocks in the y dimension of the grid 
3.  *blockDim.x* — threads in the x dimension of the block 
4.  *blockDim.y* — threads in the y dimension of the block 

**Indexing**: 

1.  *blockIdx.x* — block index in x dimension 
2.  *blockIdx.y* — block index in y dimension 
3.  *threadIdx.x* — thread index in x dimension 
4.  *threadIdx.y* — thread index in y dimension 

How do we assign a unique thread id to each thread?
***************************************************

.. image:: ../figs/thread_index.drawio.png


1. Find the blockId --> 
.. code-block:: console
    blockId  = (gridDim.y * blockIdx.x) + blockIdx.y

2. Using the blockId, find the threadId 
.. code-block:: console
    threadId = [(blockDim.x * blockDim.y) * blockId] + [(blockDim.y * threadIdx.x) + threadIdx.y]

Warps and Warp Schedulers
*************************

While we can arrange the threads in any order, the SM schedules the threads as **Warps**, and each warp 
contains 32 threads. For example, if you launch a block with 256 threads, those 256 threads are arranged as 
8 warps (256/8). All the threads in the same warp can only execute the same instruction at a given time. 
For example, if we have a program

.. code-block:: console
    a = b + c
    d = x * y

*All* the threads in the warp should finish executing the addition operation, only then can the threads 
execute the multiplication operation. Depending on the generation of the GPU, it may contain more than one 
warp scheduler. For instance, in the *Fermi GPU*, each SM features two warp schedulers and two instruction 
dispatch units. This allows two warps to be issued and executed concurrently. It is always a good idea to 
consider the warp size (32) and the maximum number of concurrent warps possible when deciding the block size.

.. image:: ../figs/warp.png

Data Movement in GPUs
*********************

.. image:: ../figs/gpu-node.png

The are two types of data movement in GPUs:

1.  Host-to-Device data movement (H2D): Move data from the host memory to the GPU memory.
2.  Device-to-Device data movement (D2D): Move data from the memory of one GPU to another.

H2D transfer happens through the PCIe switch and D2D transfer happens through NVLink. This makes D2D 
transfers more faster than H2D transfers.

Streams
*******

Streams are used to manage and optimize parallel computing tasks. It is a sequence of operations (or tasks) 
that are executed on the GPU. Streams allow for the parallel execution of these tasks, enabling more efficient 
use of the GPU's resources. Each stream can execute kernels and manage data transfers between the CPU and GPU.

.. image:: ../figs/streams.png

The main advantages of using streams are:

1. **Concurrency**: Streams allow for concurrent execution of multiple tasks. This means you can have multiple 
streams, each handling different tasks, and the GPU can process these tasks in parallel. For example, one stream 
might be handling a computation kernel, while another is performing data transfer.

2. **Asynchronous Execution**: Operations within a stream are executed in the order they are issued. 
However, different streams can operate asynchronously with respect to each other. This means that tasks 
in one stream can proceed independently of tasks in another stream, leading to potential performance improvements.

3. **Data Transfers**: Streams can be used to overlap data transfers between the CPU and GPU with kernel 
executions. For example, while one stream is using data, another stream can be used to transfer new data 
to the GPU, which helps in hiding the latency of data transfers.

Exercise
*********

1. What speedup is achieved with GPU parallelism?
    
.. code-block:: console
    :linenos:
    
    qsub 4_gpu.pbs

2. Does using a separate stream affect GPU performance?

.. code-block:: console
    :linenos:
    
    qsub 5_gpu_streams.pbs
    

.. admonition:: Key Points
   :class: hint

    #. Thread indexing play an important role in GPU performance.
    #. Warp size play an important role in GPU performance.
    #. Streams can improve asynchronous parallelism in GPUs.