This is the final post to the 2-D Transient Heat Conduction series. In the previous posts (part 1 & part 2) I have demonstrated the mathematical model for this problem. We even saw a way of numerically expressing the Laplace Equation and solving it by the Gauss-Seidel iterative algorithm. After you have understood the serial code written in C, you will probably have very little problem understanding its parallel version. This was run on a Nvidia GT200 and code written in CUDA.

A brief introduction to the CUDA environment can be found in my previous posts in this category.

In CUDA, we could think of the entire grid made of sub-grids (blocks). These blocks are two dimensional blocks, square blocks in our case, but need not be square necessarily. Every node in our grid is represented by a unique thread, and a number of threads make up a block of threads (depending on the block dimension). We can even try out a rectangular block of threads just to compare computation times. So the grid would look like:

The blue squares indicate the two dimensional blocks of our grid. Each element in a particular block will have a thread index in both the directions from 0 to (n-1), where n = blockDim.x or blockDim.y, i.e. dimension of that block in the respective direction.

We explicitly define the level of parallelism we wish. This is done by defining the dimensions of grid and block. Later on, we send in the temperature array to the GPU kernel, and get back an updated array. It has to be noted that the updated array Tout has to be swapped with Tin, only then will the iterative algorithm calculate right results. This can be achieved by the following block of code:


//compute grid and block dimensions
grid_dim = dim3(ni/NI_TILE + 1, nj/NJ_TILE + 1, 1);
block_dim = dim3(NI_TILE, NJ_TILE, 1);

// launch kernel
step_kernel_gpu<<<grid_dim, block_dim>>>(ni, nj, tfac, temp1_d, temp2_d);

// swap the temperature pointers
temp_tmp = temp1_d;
temp1_d = temp2_d;
temp2_d = temp_tmp;



Note that we already define NI_TILE and NJ_TILE before this part. I tested (64,8), (32,32) and some more combinations. You can change these values and compare changes in the computation times. The temp1_d and temp2_d mean that these variables have been allotted memory in the device. tfac is the value of $\alpha_{factor}$.

Once we get this part, the next part is understanding the GPU kernel. The simplest method would be to read as well as write to Global memory of the GPU. Using its shared memory, we do enhance our timings, but for now let’s just have a look at the most basic method, i.e. Global memory:


// kernel to update temperatures - GPU version (not using shared mem)
__global__ void step_kernel_gpu(int ni, int nj, float tfac, float *temp_in, float *temp_out)
{
int i, j, ti, tj, i00, im10, ip10, i0m1, i0p1;
float d2tdx2, d2tdy2;

// find i and j indices of this thread
i = blockIdx.x*(NI_TILE) + ti;
j = blockIdx.y*(NJ_TILE) + tj;

// find indices into linear memory
i00 = I2D(ni, i, j);
im10 = I2D(ni, i-1, j);
ip10 = I2D(ni, i+1, j);
i0m1 = I2D(ni, i, j-1);
i0p1 = I2D(ni, i, j+1);

// check that thread is within domain (not on boundary or outside domain)
if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {
// evaluate derivatives
d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];
d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];

// update temperature
temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);

}
}



Note that the terminologies used for variable naming might not be completely understood.

1. i00 :- (i, j)
2. im10 :- (i-1, 0)
3. ip10 :- (i+1, 0)
4. i0m1 :- (i, j-1)
5. i0p1 :- (i, j+1)
6. I2D(n, x, y) = y*n + x

This kernel was called time_steps number of times from within a loop. The results obtained from the GPU kernel exactly matched those obtained from the CPU kernel. An update will be made to this post regarding the speed-ups with respect to grid dimensions and time steps.