# CUDA for beginners

*Note: This is the follow-up to the previous post (second in the GPGPU series).*

If you’ve ever glanced at any assembly language, you might know how **complicated** the concept of **parallelism** is. You need to overlap multiple instructions so that more computing is done in the same amount of time, but you also need to make sure that instruction 2 does not require data updated by instruction 1 before instruction 1 actually updates it. This forces you to incorporate lags or stalls between instructions.

**CUDA** lets you implement your **parallel algorithms with ease** (trust me when I say “with ease”). It handles the distribution of threads to the different cores of the GPU on its own. The user is required to do the following tasks:

1. Coordinate the scheduling of computation on the CPU and the GPU.

2. Allocate memory for variables (on both the CPU and the GPU).

3. Transfer data between the two memories (system memory and global memory).

Programming code to be executed on many devices used to be tricky as well as not-so-user-friendly. So Nvidia decided to develop a **C-like language** and programming environment that would improve the productivity of GPU programmers (the earlier approaches proved tough for entry-level programmers). They named it CUDA, which stands for **Compute Unified Device Architecture**. In CUDA terminology, we have the system processor (the *host*) and the GPU (the *device*), which is programmed in a C/C++ dialect. **OpenCL** (Open Computing Language) is a similar platform that is vendor-independent and can be used with a variety of GPUs (Nvidia as well as AMD).

Each parallel task is done by something called a *CUDA thread*. Various styles of parallelism can be utilized by these CUDA *threads*: multithreading, MIMD (multiple instruction, multiple data), SIMD (single instruction, multiple data), as well as instruction-level parallelism. Because of the central role of the term thread in CUDA, Nvidia classifies this model as **SIMT** (single instruction, multiple thread).

So a number of *threads* come together to form a *block* of *threads*, and a number of *blocks* form a *grid* of *blocks*. Your task is divided among these *blocks*, each of which does the computation in its respective zone using the multiple *threads* in that particular *block*. The *threads* then write their results back to global memory, where the results from all the *blocks* come together to give you the overall solution.

You might not be able to understand this completely yet, so I’d like to demonstrate it with a **simple example** (the most common one when it comes to GPGPU).

**Question:** Let’s say you’ve got 2 arrays A[] and B[], both containing n elements. You want to form an output array C[], where C[i] = k*A[i] + B[i] ( k —> constant ).

**Solution-1:** When you think of such a situation, you have a pretty easy and basic C code:

```c
void calculate(int n, int k, int *A, int *B, int *C) {
    for (int i = 0; i < n; i++)
        C[i] = k * A[i] + B[i];
}

/* called as: */
calculate(n, 2, A, B, C);
```

**Solution-2:** Now comes the time to showcase a really cool and awesome block of code. Have a look at it and try to understand it; the explanation follows below.

```c
/* kernel definition (runs on the device) */
__global__ void calculate(int n, int k, int *A, int *B, int *C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = k * A[i] + B[i];
}

/* host-side launch */
int nBlocks = (n + 255) / 256;
calculate<<<nBlocks, 256>>>(n, 2, A, B, C);
```

Now the explanation. Modern GPUs can handle a **maximum of 512 CUDA threads** in a *block* of *threads*. But it’s better to go for 256 *threads* in a *block* for now (what if your GPU doesn’t support 512 *threads*? Be careful here). Now suppose we have only 1 element in our array (*n* = 1); we need only 1 *block* containing 1 *thread*. When *n* = 255 as well, we will need *nBlocks* = 1. When *n* = 256, *nBlocks* = 1; but for *n* = 257, *nBlocks* = 2. So, taking care of these cases, we declare the following relation between *n* and *nBlocks*:

`nBlocks = (n + 255)/256`

Calling a function in CUDA (***a function is known as a kernel in CUDA***) is quite similar to a function call in C. It takes in **2 additional arguments** along with the array pointers, k and n. These are, namely, the number of *blocks* (*nBlocks*) and the number of *threads* in each *block* (256 in this case). The regular syntax for passing them is as given above.

Now comes the `__global__` part (the GPU *kernel*). As I said earlier, CUDA handles calling the *kernel* multiple times across all the different cores. Whenever the *kernel* is run on a particular core, it carries along with it the **index** of the *thread* within its *block* in the x-dimension (*threadIdx.x*), the **index** of that particular *block* (*blockIdx.x*) (as I mentioned above, we only consider a 1-dimensional case for the time being), and also the size or dimension of the *block* (*blockDim.x*). Using these 3 values, we calculate the **absolute index** of that particular element:

`int i = blockIdx.x*blockDim.x + threadIdx.x`

Try deriving this result on your own; it’s pretty easy. In case you do not follow, feel free to ask.

So once you have the absolute index of an array element, you simply compute the value in the output array at the same absolute index. The parallelism in this example arises because no element depends on any other element of the same array, so all the multiplications and additions can be performed in parallel (independently of each other). You need to understand that parallelism is useful when you’ve got a **huge amount of computation** to be done on millions of elements. There is an **enormous increase in the speed-up** as we increase the number of elements in those vectors (arrays).

Just to show you the speed-up levels achieved by parallel computation: one striking application of CUDA technology is in the field of Computational Fluid Dynamics (I bet you’ll be amazed by the level of speed-up).

Here is a link to the ‘Intro to parallel programming’ course at Udacity. Happy coding!

My next post will be on the basics of GPU architecture. Until then!