
CUDA for beginners

September 17, 2013

Note: This is the follow-up to this post (second in the GPGPU series).

If you’ve ever had a glance at any assembly language, you might know how complicated the concept of parallelism is. You need to overlap multiple instructions so that more computing gets done in the same amount of time. But you also need to make sure that instruction-2 does not read data updated by instruction-1 before instruction-1 actually updates it. This forces you to incorporate lags or stalls between the instructions.
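
To make this concrete, here is a hypothetical two-statement C fragment (the variable names are purely illustrative):

b = a + 1;   /* instruction-1: writes b */
c = b * 2;   /* instruction-2: reads b, so it must wait for instruction-1 */

The second statement has a read-after-write dependency on the first, so the two cannot simply be overlapped; the hardware has to stall until b is ready.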

CUDA lets you implement your parallel algorithms with ease (trust me when I say “with ease”). It handles the distribution of threads to the different cores of the GPU on its own. The user is required to do the following tasks (a short sketch of tasks 2 and 3 follows the list):
1. Coordinate the scheduling of computation on the CPU and the GPU.
2. Allocate memory for variables (on both the CPU and the GPU).
3. Transfer data between the two memories (system memory and global memory).
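
Tasks 2 and 3 map onto a handful of CUDA runtime calls. A minimal sketch, assuming an input array A of n ints (error checking omitted):

int *d_A;                                    /* pointer into GPU global memory */
cudaMalloc((void**)&d_A, n * sizeof(int));   /* task 2: allocate on the device */
cudaMemcpy(d_A, A, n * sizeof(int),
           cudaMemcpyHostToDevice);          /* task 3: system memory -> global memory */
/* ... launch kernels here (task 1) ... */
cudaMemcpy(A, d_A, n * sizeof(int),
           cudaMemcpyDeviceToHost);          /* task 3: results back to system memory */
cudaFree(d_A);                               /* release the device allocation */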

Programming code to be executed on multiple devices used to be tricky as well as not-so-user-friendly. So Nvidia decided to develop a C-like language and programming environment that would improve the productivity of GPU programmers (the earlier approaches proved tough for entry-level programmers). They named it CUDA, which stands for Compute Unified Device Architecture. In CUDA terminology, the system processor is the host, the GPU is the device, and programs for the device are written in a C/C++ dialect. OpenCL (Open Computing Language) is a similar platform that is vendor-independent and can be used with a variety of GPUs (Nvidia as well as AMD).

Each parallel task is done by something called a CUDA Thread. These CUDA Threads can exploit various styles of parallelism: multithreading, MIMD (multiple instruction, multiple data), SIMD (single instruction, multiple data), as well as instruction-level parallelism. Due to the central role of the term Thread in CUDA, Nvidia classified this model as SIMT (single instruction, multiple thread).

So a number of threads come together to form a block of threads, and a number of blocks form a grid of blocks; your task is divided among these blocks, each of which does the computation in its respective zone using the multiple threads of that particular block. The threads then write their results back to global memory, where the partial results of all the blocks come together to give you the overall solution.
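
If you want to see this hierarchy with your own eyes, here is a tiny sketch (the kernel name whoami is made up, and printf inside a kernel needs a Fermi-class or newer GPU):

#include <stdio.h>

__global__ void whoami(){
    /* built-in variables tell each thread where it sits in the hierarchy */
    printf("block %d of %d, thread %d of %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

Launching it as whoami<<<2, 4>>>(); prints 8 lines: 2 blocks times 4 threads each (call cudaDeviceSynchronize() afterwards so the program waits for the output).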

You might not be able to grasp this completely yet, so I’d like to demonstrate it with a simple example (the most common one when it comes to GPGPU).

Question: Let’s say you’ve got 2 arrays A[] and B[], both containing n elements. You want to form an output array C[], where C[i] = k*A[i] + B[i] (k is a constant).

Solution-1: When you think of such a situation, you have a pretty easy and basic piece of C code:

void calculate(int n, int k, int *A, int *B, int *C){
    for(int i = 0; i < n; i++)
        C[i] = k*A[i] + B[i];
}

/* called as: */
calculate(n, 2, A, B, C);

Solution-2: Now comes the time to showcase a really cool and awesome block of code. Have a look at it and try to understand it; the explanation is provided below.

/* host code: compute the launch configuration and launch the kernel */
int nBlocks = (n + 255)/256;
calculate<<<nBlocks, 256>>>(n, 2, A, B, C);

/* device code: the kernel itself, run once per thread */
__global__
void calculate(int n, int k, int *A, int *B, int *C){
    int i = blockIdx.x*blockDim.x + threadIdx.x;   /* this thread's absolute index */
    if(i < n)
        C[i] = k*A[i] + B[i];
}

Now the explanation. A GPU can handle at most 512 CUDA Threads in a block on older hardware (1024 on newer devices), but it’s better to go for 256 threads in a block for now (what if your GPU doesn’t support 512 threads! Be careful here). Now suppose we have only 1 element in our array (n = 1); we need only 1 block containing 1 thread. When n = 255 as well, we will need nBlocks = 1. When n = 256, nBlocks = 1; but for n = 257, nBlocks = 2. So, taking care of these cases, we declare the following relation between n and nBlocks:
nBlocks = (n + 255)/256
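
The same trick written for a general block size (threadsPerBlock is just an illustrative name) makes it clearer that this is integer division rounded up:

int threadsPerBlock = 256;
int nBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;   /* ceiling of n/threadsPerBlock */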

Calling a function in CUDA (a function that runs on the device is known as a kernel in CUDA) is quite similar to a regular call in C. Along with the usual arguments (n, k and the array pointers), it takes two additional launch parameters: the number of blocks (nBlocks) and the number of threads in each block (256 in this case). They are passed between the <<< and >>> brackets, as shown above.
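
Putting everything together, here is a sketch of the complete host-side flow for this example (the d_ prefix for device pointers is just a naming convention, not something CUDA requires; error checking omitted):

int *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, n * sizeof(int));   /* allocate the three arrays in global memory */
cudaMalloc((void**)&d_B, n * sizeof(int));
cudaMalloc((void**)&d_C, n * sizeof(int));
cudaMemcpy(d_A, A, n * sizeof(int), cudaMemcpyHostToDevice);   /* ship the inputs over */
cudaMemcpy(d_B, B, n * sizeof(int), cudaMemcpyHostToDevice);

int nBlocks = (n + 255)/256;
calculate<<<nBlocks, 256>>>(n, 2, d_A, d_B, d_C);   /* launch the kernel */

cudaMemcpy(C, d_C, n * sizeof(int), cudaMemcpyDeviceToHost);   /* fetch the result */
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);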

Now comes the __global__ part (the GPU kernel). As I said earlier, CUDA handles calling the kernel across all the different cores on its own. Whenever the kernel runs on a particular core, it carries along with it the index of the thread within its block in the x-dimension (threadIdx.x), the index of that particular block (blockIdx.x) (as I mentioned above, we only consider one dimension for the time being), and also the size or dimension of the block (blockDim.x). Using these three values, we calculate the absolute index of that particular element:
int i = blockIdx.x*blockDim.x + threadIdx.x;

Try deriving this result on your own; it’s pretty easy. For example, with 256 threads per block, thread 5 of block 2 gets i = 2*256 + 5 = 517. In case you do not follow, please request an edit.

So once you have the absolute index of the array element, simply compute the value in the output array for that same index. The parallelism in this example comes from the fact that no element depends on any other element in the same array, so all the multiplications and additions can be performed in parallel (independently of each other). You need to understand that parallelism is useful when you’ve got a HUGE amount of computation to be done on millions of elements. The speed-up grows enormously as we increase the number of elements in those vectors (arrays).

Just to show you guys the speed-up levels achieved by parallel computation: this is an application of CUDA technology in the field of Computational Fluid Dynamics (I bet you’re :O at the level of speed-up).

Here is a link to the ‘Intro to parallel programming’ course at Udacity. Happy coding!

My next post will be on the basics of GPU architecture…. Until then!
