When you’re in Atlanta for a month or two, there are some things you just can’t miss. It’s been a month since I got here, and I have been to some good places. Popular tourist spots in the city are the Georgia Aquarium, the World of Coca-Cola and the CNN Center.

So, a little general knowledge for you readers. The Coca-Cola drink was first invented by John Pemberton some time in the 1880s and was later bought by businessman Asa Griggs Candler. The World of Coca-Cola is an exhibition they built for tourists, featuring the history of Coca-Cola and their production and manufacturing technology.

But most importantly, time and again they emphasize their SECRET formula (the chemical composition of Coca-Cola). They have made a smart choice by building a secret, high-security vault, where they have apparently stored the formula. Seriously! Not even teenagers would fall for that. But since you pay around 20 dollars to get in, you fall prey to their strategy. Credit to the Coca-Cola team, though: they do entertain you throughout the tour, and you probably won’t be bored.

Look at me! Moments after mocking their lame attempt to excite the audience, here I am smiling at the camera!

Next, they give you a chance to meet the Coca-Cola Polar Bear! Ahh, that experience was really cool! I still don’t know the bear’s name, but wait: you get to take pictures with him, tickle him, hug him and maybe play around a bit if you’re a kid.

Now, when you’re done with all this comes the most exciting part of the tour. Coca-Cola actually lets you taste all, ALL, the drinks they have produced or bought to date. There are 63 drinks from different parts of the world; Thums Up is the one from India. The cranberry-flavored soda, some lime-flavored soda from Italy (I think), and a few others were great! My friends and I tasted all 63, and the only side effect was… burps, burps and burps. More burps, random-flavored burps!

Oh yes, and they show you a 4D movie clip in which a chemist working in his laboratory goes through a bumpy ride to finally discover Coke’s unique and secret formula, which is in fact… Uniformity, Universal availability and, most importantly, You (U). So they end by saying that Coke is Coke because of us, the people of this world. Yes, it’s true 😐 Nice trick to hold on to the audience for an hour. It’s definitely worth a visit if you happen to be in Atlanta, but I would be quite lenient if I gave it a 4 out of 5. I guess 3 out of 5 would be fair.

Next up is the Georgia Aquarium; I hope it turns out to be better than the World of Coke.

Ever wondered why research scholars, PhD students and, nowadays, interns start hating Fortran?

I experienced my share of it today, and it was amazing; I had mixed emotions. Fortran comes into the picture because the **Fast Multipole Method** (FMM) library is a Fortran library. You can find the FMM in the last century’s top 10 algorithms here. In our N-body simulation problem, we aim to replicate a biological system consisting of a cell membrane and a number of molecules/particles moving within it. The problem has to be scaled to a large number of particles, about a million! That is how it gets close to a real system.

So I have a C code, and my goal is to check whether the Fast Multipole approximation works for this case. But how do you call a Fortran subroutine from C code?

From what I knew, I needed an object file (.o extension), and linking the correct Fortran object file into my C build would let me use it. Let me begin by explaining how you call a Fortran subroutine from C. Step 1 is to create a header file that declares the subroutine you wish to call and the kinds of variables it needs.

Fortran reminder number 1 – Pass 1-dimensional arrays. If you ever pass a 2-dimensional array built as an array of pointers, don’t be scared by the junk values that come out: C does not allocate those rows in contiguous memory, and the Fortran side, which expects one contiguous block, fails to locate the elements. So the simplest and most important note: avoid 2-dimensional arrays, use 1-dimensional arrays, and get used to indexing. But remember!

Fortran reminder number 2 – Fortran interprets 2-dimensional arrays (matrices) by the column-major rule. It’s better to know this before you start writing C code that uses a Fortran library.

You create a header file and include it in your C code.

extern void your_fortran_subroutine_(argument list);

and save it as “your_header_file.h”.

Include this .h file in your C code. While compiling your code, just link the object file for the Fortran code, something like this:

$ gcc your_c_code.c /your_object_file_directory/object_file.o -lgfortran

As I later came to know, it’s always better to create a library out of all your object files, so that you can link the entire library at once.

$ ar -cvq libctest.a ctest1.o ctest2.o

Be sure to append an underscore character to your Fortran subroutine’s name when calling it from C; this is the standard way Fortran symbols are named for a C caller.

But even if the entire code looks flawless, an error may pop up saying it can’t find “your_fortran_subroutine_()”. You may wonder whether the compiler isn’t able to locate the Fortran object file, or whether you are missing some necessary include. Finally, after an hour of intense debugging, I came to know that Fortran is **CASE-INSENSITIVE**. Yes, people: when Fortran was created, everything was written in capitals. Slowly, people came to prefer lower case, and in order to incorporate this without undesirable tweaks, Fortran was made case-insensitive. In practice, the compiler lowercases every subroutine name in the object file, so the symbol you declare in C must be all lower case.

And another point to be noted: Fortran interprets matrices by the **COLUMN-MAJOR** rule, NOT row-major. So what you pass to a Fortran subroutine is effectively the transpose of your C matrix.

Remember, guys… a friend in need will never ever be Fortran indeed 🙂

But as we know, some of the most useful libraries, for example LAPACK and BLAS (linear algebra libraries), are Fortran libraries.

Blogging after a lot of time guys!

I’ve been here in the United States for a fortnight now, working on a research project at the Georgia Institute of Technology’s College of Computing. Ever wondered how a Mechanical Engineering undergrad could end up working in the Klaus Advanced Computing Building? Computational Fluid Dynamics, and a little enthusiasm for computing, programming, and so on.

It’s an amazing piece of architecture, funded largely by Christopher Klaus, an alumnus of Georgia Tech. The building has a crossover bridge, called the Binary Bridge, because it has the words “Thank you Klaus” printed on it (in binary, of course :D).

I am currently working under Professor Edmond Chow, and it has been an awesome experience. I never imagined myself developing an interest in research work, but I really enjoy the 8-10 hours I spend every day in our lab. There are other students from India here as well, working under other faculty on topics ranging from High Performance Computing to Machine Learning to Bioinformatics.

What project am I currently working on? It’s basically particle simulation methods. I have ‘n’ particles within a boundary (think of them as a large number of molecules inside a cell membrane). They interact with each other, move within the boundary, exert forces on other particles, experience forces from other particles, and so on. What does this movement actually look like? And what laws govern their interaction and movement? We are currently working in a regime where Brownian Dynamics largely governs the particle behavior. It takes into account the force interactions between each particle pair, as well as a Brownian motion for each of the particles in consideration. You could go through the introduction given here for the related fundamentals.

Atlanta is a really nice place to live in. It isn’t noisy, there’s hardly any traffic on roads, the natives are quite helpful and of course, there’s every popular junk food outlet for foodies like me.

There’ll definitely be further updates on the project and of course, the USA!

This is the final part of this problem. Previous posts can be found **here**.

This post mainly focuses on validating the computation: can we be sure it was mathematically correct? We compare our results with those obtained by solving the same problem in ANSYS FLUENT. I will demonstrate results for two sets of boundary conditions:

- T_AB = 200, T_BC = 400, T_CD = 200, T_DA = 200.
- T_A = 200, T_B = 300, T_C = 300, T_D = 200 (edges assumed to have a linear temperature distribution).

1. We solved this problem with an initial internal nodal temperature of 200. The CUDA code returned the temperature values, and the image below shows their contour plot.

The temperature contour obtained from ANSYS FLUENT is shown below.

As we see, the FLUENT plot uses more color shades in its contour. We therefore compare the center-line temperatures of the two contours and obtain the following curve.

2. For the second set of boundary conditions, we plot the same data in a similar fashion and obtain the results shown below.

We can conclude that the code computes the results correctly.

This is the final post in the 2-D Transient Heat Conduction series. In the previous posts (**part 1** & **part 2**) I demonstrated the mathematical model for this problem. We also saw a way of numerically expressing the Laplace Equation and solving it with the Gauss-Seidel iterative algorithm. Once you have understood the serial code written in C, you will probably have very little trouble understanding its parallel version. This was run on an Nvidia GT200, with the code written in CUDA.

A brief introduction to the CUDA environment can be found in my **previous posts** in this category.

In CUDA, we can think of the entire grid as made up of sub-grids (blocks). These are two-dimensional blocks, square in our case, though they need not be square. Every node in our grid is represented by a unique thread, and a number of threads make up a block of threads (depending on the block dimensions). We can even try a rectangular block of threads just to compare computation times. So the grid would look like:

The blue squares indicate the two dimensional blocks of our grid. Each element in a particular block will have a thread index in both the directions from 0 to (n-1), where n = blockDim.x or blockDim.y, i.e. dimension of that block in the respective direction.

We explicitly define the level of parallelism we want by setting the dimensions of the grid and the blocks. We then send the temperature array to the GPU kernel and get back an updated array. Note that the updated array Tout has to be swapped with Tin; only then will the iterative algorithm calculate correct results. This is achieved by the following block of code:

// compute grid and block dimensions
grid_dim = dim3(ni/NI_TILE + 1, nj/NJ_TILE + 1, 1);
block_dim = dim3(NI_TILE, NJ_TILE, 1);

// launch kernel
step_kernel_gpu<<<grid_dim, block_dim>>>(ni, nj, tfac, temp1_d, temp2_d);

// swap the temperature pointers
temp_tmp = temp1_d;
temp1_d = temp2_d;
temp2_d = temp_tmp;

Note that we define *NI_TILE* and *NJ_TILE* before this part. I tested (64,8), (32,32) and some more combinations; you can change these values and compare the computation times. The *_d* suffix on *temp1_d* and *temp2_d* means these variables have been allotted memory on the device. *tfac* is the value of αΔt/Δx².

Once we get this part, the next step is understanding the GPU kernel. The simplest method is to read from and write to the GPU’s global memory. Using shared memory does improve the timings, but for now let’s look at the most basic method, i.e. global memory:

// kernel to update temperatures - GPU version (not using shared mem)
__global__ void step_kernel_gpu(int ni, int nj, float tfac,
                                float *temp_in, float *temp_out)
{
    int i, j, ti, tj, i00, im10, ip10, i0m1, i0p1;
    float d2tdx2, d2tdy2;

    // find i and j indices of this thread
    ti = threadIdx.x;
    tj = threadIdx.y;
    i = blockIdx.x*NI_TILE + ti;
    j = blockIdx.y*NJ_TILE + tj;

    // find indices into linear memory
    i00  = I2D(ni, i, j);
    im10 = I2D(ni, i-1, j);
    ip10 = I2D(ni, i+1, j);
    i0m1 = I2D(ni, i, j-1);
    i0p1 = I2D(ni, i, j+1);

    // check that thread is within domain (not on boundary or outside domain)
    if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {
        // evaluate derivatives
        d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];
        d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];

        // update temperature
        temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);
    }
}

Note that the variable naming may not be self-explanatory, so here is what each index means:

1. i00 :- (i, j)

2. im10 :- (i-1, j)

3. ip10 :- (i+1, j)

4. i0m1 :- (i, j-1)

5. i0p1 :- (i, j+1)

6. I2D(n, x, y) = y*n + x

This kernel was called *time_steps* times from within a loop. The results obtained from the GPU kernel exactly matched those from the CPU version. An update will be made to this post regarding the speed-ups with respect to grid dimensions and time steps.

In **this** previous post, I have demonstrated the mathematical concept behind a Gauss-Seidel iterative solution to the Laplace Equation, i.e. the equation below:

∂²T/∂x² + ∂²T/∂y² = 0

For a numerical computation, this differential equation has been discretized by the Finite Difference Scheme to obtain an approximate representation:

T_{i,j}^{new} = T_{i,j} + (αΔt/Δx²)·(T_{i-1,j} + T_{i+1,j} + T_{i,j-1} + T_{i,j+1} - 4·T_{i,j})

Now coming to HOW we obtain the value αΔt/Δx² = 0.2.

Consider Aluminium as your 2-D sheet material. As we know, α = k/(ρ·c_p). For Aluminium, the thermal diffusivity α is roughly 84.18 mm²/s. Next, we chose the total number of time steps as 30,000, so taking one time step as Δt = 0.01 s, we are allowing 300 seconds to reach steady state (sounds comfortable). Our 2-D square metal sheet has 512 grid nodes in each direction. Since αΔt/Δx² = 0.2, we know that Δx = √(αΔt/0.2) (where Δx is the grid spacing and 512·Δx is the side length of the square). Since we chose Δt = 0.01 s, we can now calculate that Δx = √(84.18 × 0.01/0.2) ≈ 2.05, i.e. roughly 2 mm.

That means we are approximating a plate of side 512 × 2.05 mm ≈ 1.05 m, which seems perfectly fine (a grid size of about 1×1 m²).

This is an update on my project work. The previous post, which provides some speed-up stats for matrix multiplication using CUDA, can be found **here**.

The next problem to tackle before targeting fluid-flow problems is related to heat transfer. As we know, heat conduction is governed by a single partial differential equation:

∂T/∂t = α·∇²T

where **T** is temperature, **t** is time, and **α** is thermal diffusivity.

And for a 2-dimensional plate, this equation looks like:

∂T/∂t = α·(∂²T/∂x² + ∂²T/∂y²)

In order to solve this equation numerically, you need a grid. I chose a square plate and divided it into a grid of 512 (or 1024) nodes in each direction. It would look something like this:

And since we intend to solve the above PDE numerically, we must find a way to represent it by a Finite Difference Scheme. A possible approximation is:

(T_{i,j}^{new} - T_{i,j})/Δt = α·[(T_{i-1,j} - 2T_{i,j} + T_{i+1,j})/Δx² + (T_{i,j-1} - 2T_{i,j} + T_{i,j+1})/Δy²]

Since we choose a square plate and a square grid, we divide the x and y directions into the same number of small divisions, i.e. Δx = Δy.

In order to convert some of the parameters to non-dimensional form, we rewrite the equation as:

T_{i,j}^{new} = T_{i,j} + (αΔt/Δx²)·(T_{i-1,j} + T_{i+1,j} + T_{i,j-1} + T_{i,j+1} - 4·T_{i,j})

and we denote the expression αΔt/Δx² as *tfac*.

I looked it up in the **bible** of Heat Transfer by Prof. Frank Incropera and found a numerical example to use as a reference for my problem. According to the literature, a decent assumption would be the stability criterion for the 2-D explicit scheme, αΔt/Δx² ≤ 1/4; we choose αΔt/Δx² = 0.2.

So my inputs to the problem were:

- 512×512 square grid.
- Initialized temperature for the 4 corners of the plate, i.e. indirectly setting up the boundary edge temperatures.
- Time steps till approximate steady-state arrival

We can now iterate over every grid node, updating it in each time step, and run this operation over a period defined by 30,000 time steps.

This can be done easily by a C code:

void cpu_compute(int ni, int nj, float alpha, float *temp_in, float *temp_out)
{
    int i, j, i00, im10, ip10, i0m1, i0p1;
    float d2tdx2, d2tdy2;

    for (j = 1; j < nj-1; j++) {
        for (i = 1; i < ni-1; i++) {
            i00  = I2D(ni, i, j);
            im10 = I2D(ni, i-1, j);
            ip10 = I2D(ni, i+1, j);
            i0m1 = I2D(ni, i, j-1);
            i0p1 = I2D(ni, i, j+1);

            d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];
            d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];

            temp_out[i00] = temp_in[i00] + alpha*(d2tdx2 + d2tdy2);
        }
    }
}

We use the Gauss-Seidel iterative method here for updating the node values. After the function is run, we swap the input and output temperature arrays. This continues for the initialized number of time steps.

The GPU version of this can be found **here** in another post. I have primarily explained the functioning of the GPU kernel.

The initial temperatures at the four corners A, B, C and D were taken as 200, 200, 300 and 300 K respectively. Since the temperatures are only added or subtracted, they need not be in the Kelvin scale; we may even use the Celsius scale. After the corner temperatures are set, we calculate the border (edge) temperatures. We argue that the steady-state temperature along edge AB would be constant at 200 K, and similarly edge CD would have a constant temperature of 300 K. Edges AD and BC are approximated as having temperatures varying linearly with distance (from T = 200 K to T = 300 K). The Gauss-Seidel iterative algorithm is then used to find the temperatures of the inner grid nodes, i.e. the 510×510 centre square.

Satisfactory results were obtained after 30,000 iterations. We may export the obtained data as a text file or write it to a data file (.dat). I used MATLAB to plot 2-D and 3-D contours of these outputs. The following contours were obtained.

A detailed explanation for the dimensional analysis can be found **here**. I am currently trying to simulate the same problem on ANSYS FLUENT and compare results. Any new developments would be added here.