CUDA notes

Introduction

  • CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. Together they give conventional C code, written for one thread of the hierarchy, a clear parallel structure.

Thread Hierarchy

  • The programmer organizes these threads into a hierarchy of grids of thread blocks. A thread block is a set of concurrent threads that can cooperate among themselves through barrier synchronization and shared access to a memory space private to the block. A grid is a set of thread blocks that may each be executed independently and thus may execute in parallel.

  • Each thread has a unique thread ID threadIdx within its thread block, numbered 0, 1, 2, …, blockDim–1, and each thread block has a unique block ID blockIdx within its grid. CUDA originally supported thread blocks of up to 512 threads; current GPUs (compute capability 2.0 and later) allow up to 1024 threads per block.

  • kernel<<<dimGrid, dimBlock>>>(... parameter list ...); where dimGrid and dimBlock are three-element vectors of type dim3 that specify the dimensions of the grid in blocks and the dimensions of each block in threads (see the launch sketch at the end of this section).

  • CUDA requires that thread blocks execute independently: it must be possible to execute blocks in any order, in parallel or in series. Different blocks have no means of direct communication, although they may coordinate their activities using atomic memory operations on the global memory visible to all threads, for example by atomically incrementing queue pointers.

  • To enable CUDA programs to run on any number of processors, communication between thread blocks within the same kernel grid is not allowed—they must execute independently.

  • Block Grid

    • Nvidia’s blog: you can think of it as a grid wrapping a bunch of blocks, a block wrapping a bunch of threads, and each thread handling one or more array elements.
    • Another blog contains figures illustrating blocks and grids.
    • Here is a simple example of an add kernel using a grid-stride loop.
    __global__
    void add(int n, float *x, float *y)
    {
        // Grid-stride loop: each thread starts at its global index and then
        // strides by the total number of threads in the grid.
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)
            y[i] = x[i] + y[i];
    }
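
    • A minimal launch sketch for this kernel, assuming the inputs already live in device memory (the element count and the 256-thread block size are illustrative choices; error checking omitted):

    int N = 1 << 20;                      // example element count
    float *x, *y;
    cudaMalloc((void **)&x, N * sizeof(float));
    cudaMalloc((void **)&y, N * sizeof(float));
    // ... copy or initialize the inputs in x and y ...

    dim3 dimBlock(256);                               // threads per block
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x);  // enough blocks to cover N elements
    add<<<dimGrid, dimBlock>>>(N, x, y);
    cudaDeviceSynchronize();              // wait for the kernel to finish

    Because add uses a grid-stride loop, a smaller fixed-size grid would also process all N elements; the grid above simply gives each thread one element per loop iteration.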

CUDA Memory

  • Each thread block has shared memory visible to all threads of the block, with the same lifetime as the block. All threads also have access to the same global memory. Programs declare variables in shared and global memory with the __shared__ and __device__ type qualifiers.

  • Shared memory is expected to be a low-latency memory near each processor, much like an L1 cache.

  • Unified Virtual Address Space (UVA): The location of any memory on the host allocated through CUDA, or on any of the devices which use the unified address space, can be determined from the value of the pointer using cudaPointerGetAttributes().

  • Page-Locked Host Memory: cudaHostAlloc() allocates page-locked (pinned) host memory, whereas malloc() allocates pageable host memory (see the sketch at the end of this section).

    • Benefits: asynchronous copies, possibly eliminating the need to copy at all (mapped memory), and higher transfer bandwidth. Because the memory is pinned it cannot page-fault, so the device can fetch it without help from the CPU.
  • Mapped Memory: A block of page-locked host memory can also be mapped into the address space of the device. Such a block could have another address in device memory that can be retrieved using cudaHostGetDevicePointer() and then used to access the block from within a kernel.

  • An atomic function performs a read-modify-write atomic operation on one word residing in global or shared memory.
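
  • A minimal sketch tying the pieces above together: each block builds a partial sum in __shared__ memory, synchronizes with __syncthreads(), and then issues a single atomicAdd() into global memory. The kernel name and the fixed 256-thread block size are illustrative choices:

    __global__
    void block_sum(int n, const float *x, float *total)
    {
        __shared__ float partial[256];        // one slot per thread; assumes blockDim.x == 256
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        partial[tid] = (i < n) ? x[i] : 0.0f;
        __syncthreads();                      // barrier: the block's shared array is now filled

        // Tree reduction within the block (assumes blockDim.x is a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                partial[tid] += partial[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            atomicAdd(total, partial[0]);     // one atomic per block into global memory
    }

  • A host-side sketch of the page-locked and mapped memory calls above (buffer size and names are illustrative; error checking omitted):

    float *h_buf, *d_alias;
    size_t bytes = 1 << 20;

    // Pinned (page-locked) host allocation, also mapped into the device address space.
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocMapped);

    // Device-side pointer to the same physical pages, usable from inside a kernel.
    cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);

    // With UVA, the runtime can report where a given pointer lives.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, h_buf);

    // ... launch kernels that read/write d_alias, or use cudaMemcpyAsync on h_buf ...

    cudaFreeHost(h_buf);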

CUDA context

  • CUDA Stream: a stream is a sequence of commands that execute in order. Different streams execute their commands out of order with respect to one another, or concurrently (see the stream sketch at the end of this section).

  • The CUDA API behaves like a stateful library: two consecutive calls are related to one another. In short, the context is that state.

  • The runtime API is a wrapper/helper around the driver API. In the driver API the context is made explicit, and you can keep a stack of contexts for convenience. One specific context is shared between the driver and runtime APIs (see primary context).

  • The context holds all the management data to control and use the device. For instance, it holds the list of allocated memory, the loaded modules that contain device code, the mapping between CPU and GPU memory for zero copy, etc.

  • Within a CUDA context, kernels are explicitly loaded as PTX or binary objects by the host code as described in Module. Kernels written in C must therefore be compiled separately into PTX or binary objects. Kernels are launched using API entry points as described in Kernel Execution.

  • Any application that wants to run on future device architectures must load PTX, not binary code. This is because binary code is architecture-specific and therefore incompatible with future architectures, whereas PTX code is compiled to binary code at load time by the device driver.
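
  • A minimal driver API sketch of this flow: create a context explicitly, load a PTX module, look up the kernel, and launch it. The file name add.ptx, the grid/block sizes, and the unmangled kernel name (which assumes the kernel was declared extern "C") are illustrative; error checking and the surrounding main() are omitted:

    #include <cuda.h>                     // driver API header

    int n = 1 << 20;
    CUdeviceptr d_x, d_y;

    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);            // the context holds the device state

    cuMemAlloc(&d_x, n * sizeof(float));
    cuMemAlloc(&d_y, n * sizeof(float));

    CUmodule mod;   cuModuleLoad(&mod, "add.ptx");         // PTX is JIT-compiled by the driver
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "add");

    void *args[] = { &n, &d_x, &d_y };                     // kernel parameters, passed by address
    cuLaunchKernel(fn, 4096, 1, 1,                         // grid dimensions
                       256, 1, 1,                          // block dimensions
                       0, NULL, args, NULL);               // shared mem bytes, stream, params, extras
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuMemFree(d_x);  cuMemFree(d_y);
    cuCtxDestroy(ctx);

  • A minimal stream sketch (referenced in the CUDA Stream bullet above). The buffers d_x, d_y, h_x, h_y and the values n, bytes, numBlocks, blockSize are assumed to exist already, and the host buffers are assumed to be pinned so the copies can actually run asynchronously:

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Work queued into one stream runs in issue order within that stream,
    // but may overlap with work issued to other streams.
    cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_y, h_y, bytes, cudaMemcpyHostToDevice, stream);
    add<<<numBlocks, blockSize, 0, stream>>>(n, d_x, d_y);
    cudaMemcpyAsync(h_y, d_y, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);        // block the host until this stream has drained
    cudaStreamDestroy(stream);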

Flow control instruction

  • Avoid different execution paths within the same warp.

  • Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
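
  • A small sketch of the difference, using two hypothetical kernels: branching on the lane index within a warp forces both paths to be serialized, while branching on a warp-uniform value lets each warp follow a single path.

    // Divergent: even and odd lanes of the same warp take different branches,
    // so the warp executes both paths one after the other.
    __global__ void divergent(int n, const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (threadIdx.x % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }

    // Uniform within a warp: the condition is identical for all 32 lanes of a warp
    // (warpSize == 32), so each warp follows exactly one path.
    __global__ void uniform(int n, const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((threadIdx.x / warpSize) % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }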

CUDA C Programming

  • __forceinline__: normally the nvcc device-code compiler makes its own decisions about when to inline a particular device function, and generally speaking you probably don’t need to override that with the __forceinline__ qualifier.
  • __device__ and __host__: Sometimes the same functionality is needed in both the host and the device portions of CUDA code. To avoid code duplication, CUDA allows such functions to carry both host and device attributes, which means the compiler places one copy of that function into the host compilation flow (to be compiled by the host compiler, e.g. gcc or MSVC), and a second copy into the device compilation flow (to be compiled with NVIDIA’s CUDA compiler).
  • __global__: a CUDA C declaration specifier which says that the function 1. executes on the device (GPU) and 2. is called from host (CPU) code. By contrast, __device__ functions can only be called from device code (see the sketch after this list).
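
  • A short sketch of these qualifiers side by side (function names are illustrative):

    // Compiled for both host and device: callable from main() and from kernels.
    __host__ __device__ __forceinline__
    float squared(float v) { return v * v; }

    // __device__: callable only from device code.
    __device__ float plus_one(float v) { return v + 1.0f; }

    // __global__: runs on the device, launched from host code.
    __global__ void square_all(int n, float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = plus_one(squared(x[i]));
    }

    // Host code:
    //   square_all<<<(n + 255) / 256, 256>>>(n, d_x);
    //   float h = squared(2.0f);   // the same function, compiled by the host compiler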

Reference