Skip to content

Atlanta weekends…

When you’re in Atlanta for 1-2 months, there are some things you just can’t miss out on. Its been a month now that I am here, and I have been to some good places. Popular tourist spots in the city are Georgia Aquarium, the World of Coca Cola and the CNN center.

So a little general knowledge for you readers. Coca Cola drink was first invented by John Pemberton some time in the 1980s and was later on bought by businessman Griggs Candler. World of Coca Cola is a place they built as an exhibition for tourists which features the history of Coca-Cola, their production and manufacturing technology.

Entrance to the World of Coke

Entrance to the World of Coke

But most importantly, time and time again they try emphasizing on their SECRET formula (chemical composition of Coca-Cola). They have made a smart choice by building a secret, high-secure vault, where they apparently have stored their chemical formula. Seriously! Not even teenagers would fall for that. But just because you pay around 20 dollars to get in, you fall prey to their strategy. Credits to the Coca-Cola team, because they do entertain you throughout your journey and you probably wouldn’t be bored.

The secret vault

The secret vault

Look at me! Moments after mocking their lame attempt to excite the audience, here I am smiling at the camera!

Next, they give you a chance to meet the Coca-Cola Polar Bear! Ahh, that experience was damn cool! I still don’t know the name of the bear, but wait. You get to capture yourself with him, you get to tickle him, hug him and may be play around a bit if you’re a kid.

With the Coca-Cola polar bear

Group click!

Yes he's damn tall





Now when you’re done with all this, comes the most exciting part of the tour. Coca-Cola actually lets you taste all, ALL the drinks they have produced or bought till date. There are 63 drinks that come from different parts of the world, Thumbs up is the drink from India. The cranberry flavored soda, some lime flavored soda from Italy I think, and a few others were great! My friends and me tasted all the 63 and the only side effect was…. burps, burps and burps. More burps, random flavored burps!

Oh yes and they show you a 4D movie clip, where a chemist working in his laboratory goes through a bumpy ride to finally discover coke’s unique and secret formula, which is in fact… Uniformity, Universal availability and most importantly, you (U). So they end it by saying that coke is coke because of us, the people in this world. Yes its true 😐 Nice trick to hold on to the audience for an hour. Its definitely worth a visit if you happen to be in Atlanta, but I would be quite lenient if I gave it a 4 on 5. I guess 3 on 5 would be decent.

Next up would be the Georgia Aquarium, I hope it turns out to be better than the World of Coke.

Fortran O my Fortran

Ever wondered why research scholars, PhD students and nowadays interns start hating Fortran?

I experienced my share today. And it was amazing. I was in mixed emotions. So fortran comes in to the picture because the Fast Multipole Method (FMM) library is a Fortran library. You can find the FMM in the last century’s top 10 algorithms here. In our N-body simulation problem, we aim to replicate a biological system consisting of a cell membrane and a number of molecules/particles moving within it. The problem has to be scaled to a large number of particles, about a million!! That is how it would be close to a real system.

So I have a C code and my goal is to check whether the Fast Multipole approximation works for this case. How to call a fortran subroutine from a C code?

From what I knew, I needed an object file (.o extension) and linking the correct object file to the fortran code would let me use it. Let me begin by explaining how you call a fortran subroutine from a C code. Step 1 is to create a header file that declares the subroutine you wish to call and what kind of variables you would need.

Fortran reminder number 1 – You must pass 1-dimensional arrays. If you ever happen to pass a 2-dimensional arrays, don’t be scared by the junk values it outputs, because C may not allocate the pointers in contiguous memory. The fortran compiler isn’t smart enough and it fails to locate them. So the simplest and an important note! Avoid 2-dimensional arrays, make use of 1-dimensional arrays. Get used to indexing. But remember!

Fortran reminder number 2 – Fortran interprets 2-dimensional arrays (matrices) by the column major rule. So it’d be better to know this before you start coding in C and using a fortran library.

You create a header file and include it in your C code.

extern void your_fortran_subroutine_(argument list);

and save it as “your_header_file.h”.

Include this .h file in your C code. While compiling your code, just link your object file for the fortran code. Something similar to this:

$ gcc /your_object_file_directory/object_file.o -lgfortran 

But just the way I came to know, its always better to create a library out of all your object files, so that you can link the entire library at once.

$ ar -cvq libctest.a ctest1.o ctest2.o

Be sure that you add an underscore character in your fortran subroutine, this is the standard way of calling fortran from C.

But even if the entire code looks flawless, there may be an error popping out which says “can’t find ‘your_fortran_subroutine_()'”. You may wonder if the compiler isn’t able to locate the fortran object file, or you are missing out on including some necessary files. Finally, after an hour of intense debugging, I came to know that fortran is CASE-INSENSITIVE. Yes people, when fortran was made, it was all in Caps. But slowly, people preferred writing in lower-case and so in order to incorporate this feature without making undesirable tweaks, fortran was made case-insensitive.

And another point to be noted… Fortran interprets matrices by COLUMN MAJOR rule and NOT row major. So what you pass to a fortran subroutine is the transpose of the regular matrix.

Remember guys…. a friend in need, will never ever be fortran indeed 🙂

But as we know, some of the most useful libraries, for example LAPACK, BLAS (linear algebra libraries) are Fortran libraries.

Research Work Begins!

Blogging after a lot of time guys!

I’ve been here in the United States for a fortnight now, working on a research project at the Georgia Institute of Technology, College of Computing. Ever wondered how a Mechanical Engineering undergrad could work in the Klaus Advanced Computing Building? Computational Fluid Dynamics, and a little enthusiasm in computing, programming, etc, etc.


The Klaus Advanced Computing Building

Its an amazing architectural creation, funded majorly by Christopher Klaus, an alumni of the Georgia Tech. This building has a cross-over bridge, called the Binary Bridge, because it has the words “Thank you Klaus” printed on it (in the binary language of course :D).


The Binary Bridge, Klaus Advanced Computing Building


“Thank you Klaus” in binary

I am currently working under professor Edmond Chow and yes it has been an awesome experience. I never imagined myself developing interest in research work, but I really enjoy the 8-10 hours I spend everyday in our lab. There are other students from India as well, working under other faculty and on various topics from High Performance Computing to Machine Learning to Bioinformatics.

What project am I currently working on? Its basically particle simulation methods. I have ‘n’ particles within a boundary (think of it as a large number of molecules inside a cell membrane). They interact with each other, move within the boundary, exert force on other particles, experience forces from other particles, etc. How does this movement exactly look like! And what are the laws that govern their interaction and movement! We are currently working in an area where Brownian Dynamics largely governs the particle behavior. It takes in to account force interactions between each particle pair, and also a Brownian motion for each of the particles in consideration. You could go through the introduction given here for related fundamentals.

Atlanta is a really nice place to live in. It isn’t noisy, there’s hardly any traffic on roads, the natives are quite helpful and of course, there’s every popular junk food outlet for foodies like me.

There’ll definitely be further updates on the project and of course, the USA!

2-D Transient Heat Conduction – Part 4

This is the final part of this problem. Previous posts can be found here.

This very post mainly focuses on the validation of this problem, whether we can be sure that the computation was mathematically correct. We compare our results with results obtained by solving the same problem in ANSYS FLUENT. I will demonstrate results of two sets of boundary conditions:

  1. Tab = 200 , Tbc = 400 , Tcd = 200 , Tda = 200.
  2. Ta = 200 , Tb = 300 , Tc = 300 , Td = 200 (edges supposed to have linear temperature distribution).

1. We solved this problem with initial internal nodal temperature of 200. The CUDA code returned temperature values, and the image below shows its contour plot.


The temperature contour obtained from ANSYS FLUENT is shown below.


As we see, the FLUENT plot includes more number of color shades in the contour. We therefore compare the center line temperatures of both these contours, and obtain the following curve.


2. For the second set of boundary conditions, we plot the same data and in a similar fashion, we obtain the below shown results.




The code computes results correctly, we can conclude.

2-D Transient Heat Conduction CUDA – Part 3

This is the final post to the 2-D Transient Heat Conduction series. In the previous posts (part 1 & part 2) I have demonstrated the mathematical model for this problem. We even saw a way of numerically expressing the Laplace Equation and solving it by the Gauss-Seidel iterative algorithm. After you have understood the serial code written in C, you will probably have very little problem understanding its parallel version. This was run on a Nvidia GT200 and code written in CUDA.

A brief introduction to the CUDA environment can be found in my previous posts in this category.

In CUDA, we could think of the entire grid made of sub-grids (blocks). These blocks are two dimensional blocks, square blocks in our case, but need not be square necessarily. Every node in our grid is represented by a unique thread, and a number of threads make up a block of threads (depending on the block dimension). We can even try out a rectangular block of threads just to compare computation times. So the grid would look like:


The blue squares indicate the two dimensional blocks of our grid. Each element in a particular block will have a thread index in both the directions from 0 to (n-1), where n = blockDim.x or blockDim.y, i.e. dimension of that block in the respective direction.

We explicitly define the level of parallelism we wish. This is done by defining the dimensions of grid and block. Later on, we send in the temperature array to the GPU kernel, and get back an updated array. It has to be noted that the updated array Tout has to be swapped with Tin, only then will the iterative algorithm calculate right results. This can be achieved by the following block of code:

//compute grid and block dimensions
grid_dim = dim3(ni/NI_TILE + 1, nj/NJ_TILE + 1, 1);
block_dim = dim3(NI_TILE, NJ_TILE, 1);

// launch kernel
step_kernel_gpu<<<grid_dim, block_dim>>>(ni, nj, tfac, temp1_d, temp2_d);

// swap the temperature pointers
temp_tmp = temp1_d;
temp1_d = temp2_d;
temp2_d = temp_tmp;

Note that we already define NI_TILE and NJ_TILE before this part. I tested (64,8), (32,32) and some more combinations. You can change these values and compare changes in the computation times. The temp1_d and temp2_d mean that these variables have been allotted memory in the device. tfac is the value of \alpha_{factor}.

Once we get this part, the next part is understanding the GPU kernel. The simplest method would be to read as well as write to Global memory of the GPU. Using its shared memory, we do enhance our timings, but for now let’s just have a look at the most basic method, i.e. Global memory:

// kernel to update temperatures - GPU version (not using shared mem)
__global__ void step_kernel_gpu(int ni, int nj, float tfac, float *temp_in, float *temp_out)
    int i, j, ti, tj, i00, im10, ip10, i0m1, i0p1;
    float d2tdx2, d2tdy2;

    // find i and j indices of this thread
    ti = threadIdx.x;
    tj = threadIdx.y;
    i = blockIdx.x*(NI_TILE) + ti;
    j = blockIdx.y*(NJ_TILE) + tj;

    // find indices into linear memory
    i00 = I2D(ni, i, j);
    im10 = I2D(ni, i-1, j);
    ip10 = I2D(ni, i+1, j);
    i0m1 = I2D(ni, i, j-1);
    i0p1 = I2D(ni, i, j+1);

    // check that thread is within domain (not on boundary or outside domain)
    if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {
            // evaluate derivatives
            d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];
            d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];

            // update temperature
            temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);


Note that the terminologies used for variable naming might not be completely understood.

1. i00 :- (i, j)
2. im10 :- (i-1, 0)
3. ip10 :- (i+1, 0)
4. i0m1 :- (i, j-1)
5. i0p1 :- (i, j+1)
6. I2D(n, x, y) = y*n + x

This kernel was called time_steps number of times from within a loop. The results obtained from the GPU kernel exactly matched those obtained from the CPU kernel. An update will be made to this post regarding the speed-ups with respect to grid dimensions and time steps.

2-D Transient Heat Conduction – Part 2

In this previous post, I have demonstrated the mathematical concept behind a Gauss-Seidel iterative solution to the Laplace Equation, i.e. the equation below:

\frac{\partial T}{\partial t} = \alpha\nabla^2T.

For a numerical computation, this differential equation has been discretized by the Finite Difference Scheme to obtain an approximate representation:

\frac{\Delta T}{\Delta t} = \alpha(\frac{T_{i+1,j} - 2T_{i,j} + T_{i-1,j}}{\Delta x^2} + \frac{T_{i,j+1} - 2T_{i,j} + T_{i,j-1}}{\Delta y^2}).

Now coming to HOW we obtain a value of   \alpha_{factor}. = 0.2.

Consider Aluminium as your 2-D sheet material. As we know, \alpha_{factor} = \frac{\alpha.\Delta t}{\Delta x^2}. For Aluminium, the Thermal Diffusivity is roughly 84.18 mm2/s. Next, we chose total number of time steps as 30,000. So taking one time step as 0.01 seconds, we are allowing 300 seconds to reach steady state (sounds comfortable). Our 2-D square metal sheet has 512 grid nodes in each direction. Since  \Delta x = \Delta y, we know that 511\times\Delta x = l (where l is the side length of the square). Since we chose   \alpha_{factor} = 0.2 , we can now calculate that   \Delta x = 2.052 mm, i.e. roughly 2 mm.

That means we are approximating a grid length of 511\times 2 = 1022 mm , which seems perfectly fine (a grid size of 1×1 m2).

2-D Transient Heat Conduction – Part 1

This is an update to my project work. The previous post can be found here, that provides some speed-up stats in matrix multiplication using CUDA.

The next problem to tackle before targeting Fluid Flow problems would be related to heat transfer. As we know, heat conduction is governed by a single partial differential equation:

\frac{\partial T}{\partial t} = \alpha\nabla^2T.

Where, T is Temperature; t is Time; \alpha is Thermal Diffusivity

And for a 2-Dimensional plate, this equation would look like:

\frac{\partial T}{\partial t} = \alpha(\frac{\partial^2 T}{\partial^2 x} + \frac{\partial^2 T}{\partial^2 y}).

In order to solve this equation numerically, you need a grid. I chose a square plate and divided it into a grid consisting of 512 or 1024 nodes. It would look something similar to this:


And since we intend to solve the above PDE numerically, we must find out a way to represent it by a Finite Difference Scheme. A possible approximation could be:

\frac{\Delta T}{\Delta t} = \alpha(\frac{T_{i+1,j} - 2T_{i,j} + T_{i-1,j}}{\Delta x^2} + \frac{T_{i,j+1} - 2T_{i,j} + T_{i,j-1}}{\Delta y^2}).

Since we choose a square plate and a square grid, we divide the x and y directions in the same number of small divisions, i.e. \Delta x = \Delta y.

In order to convert some of the parameters to non-dimensional form, we rewrite the equation in the form:

\Delta T = \frac{\alpha.\Delta t}{\Delta x^2}(T_{i+1,j} - 2T_{i,j} + T_{i-1,j} + T_{i,j+1} - 2T_{i,j} + T_{i,j-1}).

and we denote the expression \frac{\alpha.\Delta t}{\Delta x^2}   as   \alpha_{factor}.

I looked up in the bible for Heat Transfer by Prof. Frank Incropera and found a numerical example as a reference to my problem. According to the literature, a decent assumption would be:   \alpha_{factor} < 0.25.

So my inputs to the problem were:

  1. 512×512 square grid.
  2. Initialized temperature for the 4 corners of the plate, i.e. indirectly setting up the boundary edge temperatures.
  3. Time steps till approximate steady-state arrival t_{steps} = 30000.
  4. \alpha_{factor} = 0.2.

We can now iterate over each grid node, updating it in each time step and running this operation over a period of time defined by 30000 time steps.

This can be done easily by a C code:

void cpu_compute(int ni, int nj, float alpha, float *temp_in, float *temp_out){

    int i, j, i00, im10, ip10, i0m1, i0p1;
    float d2tdx2, dt2dy2;

    for(j=1; j<nj-1; j++){
        for(i=1; i<ni-1; i++){
            i00 = I2D(ni, i, j);
            im10 = I2D(ni, i-1, j);
            ip10 = I2D(ni, i+1, j);
            i0m1 = I2D(ni, i, j-1);
            i0p1 = I2D(ni, i, j+1);

            d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];
            d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];

            temp_out[i00] = temp_in[i00] + alpha*(d2tdx2 + d2tdy2);

We use the Gauss-Seidel iterative method here for updating the node values. After the function is run, we swap the input and output temperature arrays. This continues for the initialized number of time steps.

The GPU version of this can be found here in another post. I have primarily explained the functioning of the GPU kernel.

The initial temperatures at the 4 corners A, B, C and D were taken as 200, 200, 300 and 300 K respectively. Since the temperatures are only added or subtracted, they need not compulsorily be in Kelvin scale, we may even use the Celsius scale. After the corner temperatures are set, we calculate the border or edge temperatures. We could argue that the steady state temperature at the edge AB would be constant at 200 K. Similarly edge CD would have a constant temperature of 300 K. Now, edge AD and BC could be approximated to have temperatures varying linearly with distance (from T = 200 K to T = 300 K). The Gauss-Seidel iterative algorithm is used to find the temperature of the inner grid nodes, i.e. the 510×510 center square.

Satisfactory results were obtained after 30000 iterations. We may export the obtained data as a text file or write it to a data file (.dat). I used MATLAB to plot 2-D and 3-D contours for these outputs. The following contours were obtained.


A detailed explanation for the dimensional analysis can be found here. I am currently trying to simulate the same problem on ANSYS FLUENT and compare results. Any new developments would be added here.

CUDA – Matrix Multiplication

First update on my Bachelor’s Project work.

In order to demonstrate the computing power of a GPU, I performed Matrix Multiplication on a CPU and a GPU. As expected, the GPU beat the CPU by a satisfactory ratio (given that this GPU belongs to one of the older generations). I used the Nvidia GeForce 210 for my computation. Given that it only has 16 CUDA cores (2 Multiprocessors and 8 Cores per MP), I did not expect a huge speed-up ratio.

My code allowed non-square matrix multiplications, but I did not compare their results. Instead, I only compared results for square matrices. The following results were obtained:



I avoided going beyond 1000 elements, although its not really a big issue for a processor. I can provide results for a larger order of dimension on demand. So as you can see, a GPU is more efficient than a CPU IF AND ONLY IF your task is computationally expensive. For 3×3 Matrices or even up to 25×25 Matrices a GPU kernel actually ran slower than a CPU function, meaning the task wasn’t really compute oriented and you never really needed parallelism. A plot of the computation times is shown below:


As you cross the 25×25 barrier and go towards a 50×50 dimension, you can see that the GPU makes the most of its parallel architecture. And thereon, the GPU starts dominating. The speed-up ratio keeps increasing at a higher and higher pace – which clearly shows that as the task becomes computationally more expensive, the GPU outperforms a CPU with a bigger margin. A plot of Speed-up ratio versus dimension is shown below:



In case we go beyond the 1000 elements mark, we will see an even greater speed up factor. I am currently working on the Cholesky Decomposition Algorithm for Matrix Inversion. Apparently, a modern GPU can compute large dimensional matrix inversion faster than the inbuilt MATLAB inversion based on Cholesky Algorithm. Lets see.

Note: Code will be provided, only if contacted.

Scrapy – Web Crawling with a Proxy Network

I have been using Scrapy for a couple of weeks now. It wasn’t giving me any sort of errors. The day I changed my system proxy, it showed up an error, something like this:

proxy error

So when some error like this shoots up, you know its because of a manual proxy setting.
Scrapy provides a simple solution to this:

  • A new python script to be added.
  • Editing to be done in the

1. Go into your project directory (lets say /home/you/Documents/Project/sample).

2. Create a file and add the following code:

# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # setup basic authentication for the proxy
        encoded_user_pass = base64.encodestring(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

3. Add the following lines in your script:

    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'sample.middlewares.ProxyMiddleware': 100,

and try crawling with the same spider again. You’ll find that it works now 😀 .

Note:- You must make changes to the above mentioned code depending on your Project name and Proxy.

Make Subtitles (.srt file) in text editor

Its late night, was just trying out silly stuff, and I don’t know what made me do this..
but I spent 15 odd minutes into this!

Yes there are some softwares with very attractive GUIs and most importantly, that are user-friendly. But the most primitive way to create a subtitle file (.srt file) is quite easy (although boring). Please DO NOT fall a prey to this post and go on to create subtitles for a 90-mins movie… No please STAY SAFE, DO NOT TRY THIS AT HOME 😛

Firstly, there’s a chance you might see a similar post elsewhere, but nobaady’s gonna tell you this:
This thing doesn’t seem to work for avi files :/

So just record a video and say whatever you want to. Now make sure you’ve saved the video as an mp4 !!! Once you have the video, simply open up any text editor (I used gedit). Mentioned below is the standard syntax for an srt file:

start_time --> end_time

The format of start_time & end_time is:


You could download my video (mp4) and srt file, play around with the srt file to try out how it works. Its child’s stuff. My video can be downloaded from YouTube, I’ve also shared it below:

My srt file looks something like this (be careful with the ‘periods’ and ‘commas’):

00:00:00,000 --> 00:00:02,000

00:00:02,200 --> 00:00:04,000
My name's Rohit...

00:00:04,800 --> 00:00:06,800
Yeah this is really silly,

00:00:07,000 --> 00:00:08,300

00:00:08,500 --> 00:00:11,000
Yeah I just wanted to try this out.