
CUDA – Matrix Multiplication

First update on my Bachelor’s Project work.

In order to demonstrate the computing power of a GPU, I performed Matrix Multiplication on a CPU and a GPU. As expected, the GPU beat the CPU by a satisfactory ratio (given that this GPU belongs to one of the older generations). I used the Nvidia GeForce 210 for my computation. Given that it only has 16 CUDA cores (2 Multiprocessors and 8 Cores per MP), I did not expect a huge speed-up ratio.

My code allowed non-square matrix multiplications, but I did not compare their results. Instead, I only compared results for square matrices. The following results were obtained:



I avoided going beyond 1000 elements, although it’s not really a big issue for a processor. I can provide results for larger dimensions on demand. So as you can see, a GPU is more efficient than a CPU IF AND ONLY IF your task is computationally expensive. For 3×3 matrices, or even up to 25×25 matrices, the GPU kernel actually ran slower than the CPU function: the task wasn’t really compute-bound, so the overhead of handing the work to the GPU outweighed any gain from parallelism. A plot of the computation times is shown below:


As you cross the 25×25 barrier and go towards a 50×50 dimension, you can see the GPU making the most of its parallel architecture. From there on, the GPU dominates. The speed-up ratio keeps increasing at a faster and faster pace, which clearly shows that as the task becomes computationally more expensive, the GPU outperforms the CPU by a bigger margin. A plot of speed-up ratio versus dimension is shown below:



If we go beyond the 1000 elements mark, we should see an even greater speed-up factor. I am currently working on the Cholesky Decomposition Algorithm for Matrix Inversion. Apparently, a modern GPU can invert large matrices faster than the built-in MATLAB inversion based on the Cholesky algorithm. Let’s see.
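For reference, the CPU side of a comparison like this can be as simple as the textbook triple loop. The sketch below is only an illustration in Python, not the code behind the numbers in this post; the names `matmul` and `multiply_add_count` are made up for this example. It also shows why small matrices give the GPU so little to work with: the number of multiply-add steps grows as the cube of the dimension.

```python
def matmul(a, b):
    """Naive triple-loop product of an (n x m) and an (m x p) matrix."""
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def multiply_add_count(n):
    """Number of multiply-add steps for two square n x n matrices."""
    return n ** 3
```

A 3×3 product needs only 27 multiply-adds, while a 1000×1000 product needs a billion. Each output element is independent of the others, which is what makes the problem a natural fit for a GPU once it is large enough.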

Note: Code will be provided on request.


Scrapy – Web Crawling with a Proxy Network

I have been using Scrapy for a couple of weeks now. It wasn’t giving me any sort of errors. The day I changed my system proxy, it threw an error, something like this:

proxy error

So when an error like this pops up, you know it’s because of a manual proxy setting.
Scrapy provides a simple solution to this:

  • A new python script to be added.
  • Editing to be done in the

1. Go into your project directory (let’s say /home/you/Documents/Project/sample).

2. Create a file and add the following code:

# base64 is needed ONLY if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
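        # set up basic authentication for the proxy; b64encode works on
        # bytes and, unlike encodestring, does not append a newline
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')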
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

3. Add the following lines in your script:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'sample.middlewares.ProxyMiddleware': 100,
}

and try crawling with the same spider again. You’ll find that it works now 😀 .

Note: You must change the above code to match your project name and proxy. In Scrapy 1.0 and later, the built-in middleware lives at 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware'.
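The authentication part of step 2 boils down to building one header value. Here is a small stand-alone sketch of that step, assuming HTTP Basic authentication; the helper name `proxy_auth_header` is made up for this example and is not part of Scrapy.

```python
import base64


def proxy_auth_header(username, password):
    """Build the value of a Proxy-Authorization header for HTTP Basic auth."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    # b64encode returns bytes with no trailing newline, so the header stays on one line
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token
```

Inside the middleware you would assign the returned string to request.headers['Proxy-Authorization']. Keeping it in a separate function makes it easy to check the encoding without running a crawl.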

Make Subtitles (.srt file) in text editor

It’s late at night, I was just trying out silly stuff, and I don’t know what made me do this..
but I spent 15 odd minutes on this!

Yes, there are programs with very attractive GUIs that are, most importantly, user-friendly. But the most primitive way to create a subtitle file (.srt file) is quite easy (although boring). Please DO NOT fall prey to this post and go on to create subtitles for a 90-minute movie… No please STAY SAFE, DO NOT TRY THIS AT HOME 😛

Firstly, there’s a chance you might see a similar post elsewhere, but nobaady’s gonna tell you this:
This thing doesn’t seem to work for avi files :/

So just record a video and say whatever you want to. Now make sure you’ve saved the video as an mp4 !!! Once you have the video, simply open up any text editor (I used gedit). Mentioned below is the standard syntax for an srt file:

start_time --> end_time

The format of start_time & end_time is:

hours:minutes:seconds,milliseconds (for example 00:00:02,200)

You could download my video (mp4) and srt file, and play around with the srt file to see how it works. It’s child’s play. My video can be downloaded from YouTube; I’ve also shared it below:

My srt file looks something like this (be careful with the ‘periods’ and ‘commas’):

00:00:00,000 --> 00:00:02,000

00:00:02,200 --> 00:00:04,000
My name's Rohit...

00:00:04,800 --> 00:00:06,800
Yeah this is really silly,

00:00:07,000 --> 00:00:08,300

00:00:08,500 --> 00:00:11,000
Yeah I just wanted to try this out.
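If typing the timestamps by hand gets tedious, the timing lines can be generated. This is a minimal sketch that follows the layout of the file above (timing line, then text); the function names `srt_timestamp` and `srt_cue` are made up for this example. Note that .srt files normally also carry a running cue number on the line above each timing line, which this sketch leaves out.

```python
def srt_timestamp(seconds):
    """Format a time given in seconds as hh:mm:ss,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    secs, millis = divmod(rest, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, secs, millis)


def srt_cue(start, end, text):
    """Return the timing line followed by the subtitle text."""
    return "{} --> {}\n{}\n".format(srt_timestamp(start), srt_timestamp(end), text)
```

For example, srt_cue(8.5, 11, "Yeah I just wanted to try this out.") produces the last entry of the file above. Mind the comma before the milliseconds: it is a comma, not a period.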

Un-learn and Re-learn!

I am new to blogging, I wish I had started a lot earlier. What am I doing these days?…. sighh…. I’m really tired (never physically, but mentally YES! HELL YES).
Thesis work hasn’t yet kicked off as well as I hoped (although I’m gunning for it now);
I’ve been inspired by a couple of friends to Read and NOT Watch (books/movies in this case);
I still continue looking for a research group willing to fund summer-2014 for my internship (it’s frustrating at times);
And yes programming will always be there on my mind.

So I’ve thought of trying my hand at PHP-MySQL, nothing complicated, just building a database of something (No you aren’t supposed to know 😛 ). Found an awesome Python library ‘Scrapy’ which will help me script my web crawler to scrape data from on-the-line ( 😛 have been watching THE INTERNSHIP, yes it’s damn funny). Scrapy’s documentation does NOT show you how to export the data to a MySQL database. But HOW LONG can you ignore MySQL huh? C’mon you can let us write items to JSON, XML, CSV, blah blah! And NOT MySQL? You gotta be kidding me! Well I’m really late to see Scrapy-Dappy-Doo, but then came a Messiah (in 2009! I’m so late!) who edited the usual Item Pipeline procedure and showed us all how to write scraped data to a MySQL database.
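To give an idea of what such a pipeline looks like, here is a rough sketch of my own (not the 2009 code mentioned above). It uses Python’s built-in sqlite3 module as a stand-in for a MySQL driver so that it runs anywhere; the class name `SQLitePipeline`, the table and the item fields `name` and `price` are all assumptions made for this example. The three method names are the hooks Scrapy calls on an item pipeline.

```python
import sqlite3


class SQLitePipeline(object):
    """Hypothetical item pipeline; sqlite3 stands in for a MySQL driver."""

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE items (name TEXT, price TEXT)")

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item
        self.conn.execute(
            "INSERT INTO items (name, price) VALUES (?, ?)",
            (item["name"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()
```

With a real MySQL driver, the connect call and the placeholder style in the INSERT statement would change, but the shape of the class stays the same.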

It’s my first day with this, and I’m still working on my first Spider-Pipeline. I also came to know about a cool Chrome extension (yes it is FREE!) from a friend Vipul. It’s called ‘Selector Gadget’ and it’s actually something you can call cool! Once you install it, you click on its icon at the top-right corner, and boom! up comes a cool widget kind of thing (whatever you call it):

Selector Gadget

Once you go through the basics of Scrapy, you’d know that your crawler/spider/scraper needs the XPath of the data field you’re looking for. So you can easily select the data you want, deselect any additional data coming along, and look at its XPath with Selector Gadget!

XPath of the 'price' field
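To see what an XPath like that actually does, here is a tiny sketch using only Python’s standard library (its ElementTree module understands a small subset of XPath). The markup in `PAGE` is a made-up, well-formed stand-in for a product page, and `extract_price` is a name invented for this example; real pages are rarely valid XML, which is one reason to let Scrapy’s own selectors do this job.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a product page
PAGE = """
<html><body>
  <div class="product">
    <span class="name">Sample item</span>
    <span class="price">499</span>
  </div>
</body></html>
"""


def extract_price(markup):
    """Return the text of the first <span class="price"> element, or None."""
    root = ET.fromstring(markup.strip())
    node = root.find(".//span[@class='price']")
    return node.text if node is not None else None
```

The expression .//span[@class='price'] reads as: anywhere below the root, find a span element whose class attribute is exactly price.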

Follow the Scrapy at a glance and then its Tutorial to get a basic idea about it. I’ll keep you informed with new developments. Oh yes, it seems Android has officially launched the BBM App.

But after going through this post, you must be wondering WHY the title of this post. It’s because just a few minutes back, I’d come across a blog post by an awesome, fantastic personality (yes he’s my old pal! from the IIT-JEE days) Gaurav Ojha. Guys, have a look at this post (it’s a page-turner, or rather, a page scroller). He’s made his own webpage, and it is an exhibition of true creativity. I’d heard a few poems from him back in the JEE days (it’s been 4 years now) and I knew he was quite enthusiastic. The title of this post comes from his description of himself 🙂

That’s all!

CUDA Installation – Ubuntu 10.04

This setup was done in the Computational Fluid Dynamics Laboratory.

Firstly, as I said in the previous post, the CUDA toolkit is specifically for Nvidia GPUs. So you need to know which Video Card (Video Graphics Array – VGA) is currently a part of your system (if there is any!). So open up your terminal and try the following command, and check the entry in the VGA section.

$ lspci -v

You might get a long list, but look for ‘VGA’ and you’ll find something like this:

Okay my laptop does not have an Nvidia card (it’s an Intel HD 4000 series) but then for the Nvidia users….
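If scrolling through the long lspci listing is a pain, the check can be scripted. This is a small sketch that filters text you have already captured from lspci; the function names `vga_lines` and `has_nvidia_card` are made up for this example, and the match is a plain substring test, so treat the answer as a hint rather than a guarantee.

```python
def vga_lines(lspci_output):
    """Return the lines of lspci output that mention a VGA controller."""
    return [line.strip() for line in lspci_output.splitlines() if "VGA" in line]


def has_nvidia_card(lspci_output):
    """True if any VGA line mentions NVIDIA (case-insensitive)."""
    return any("nvidia" in line.lower() for line in vga_lines(lspci_output))
```

You could feed it the text from subprocess.check_output(["lspci"]).decode(), or simply paste the terminal output into a string.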

You’ll need 3 things here:
1. Nvidia Developer Drivers (260.19.26)
2. CUDA toolkit (Ubuntu 10.04; 64-bit)
3. GPU Computing SDK Sample examples (optional, but usually for verification)

Go to this link and download these.

Note: Make sure you download the correct version, i.e. 64-bit (some errors in 32-bit installation have been noticed), and also note down the version of ‘Nvidia Developer Drivers’ that you download (260.19.26 at present).

1. First and foremost, get rid of some drivers that might possibly interfere with our installation.

a) Blacklist kernel modules:

$ gksudo gedit /etc/modprobe.d/blacklist.conf

b) Add the following lines to the file that opens up, save and quit:

blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv

c) Remove any Nvidia driver already installed:

$ sudo apt-get --purge remove nvidia-*

d) Your system might throw up an error (nouveau module still running); this is covered later on.

2. Reboot your PC.

3. Go to virtual terminal (CTRL+ALT+F5) [to return back to normal mode, CTRL+ALT+F7 / CTRL+ALT+F8]

4. Log in at the virtual terminal and run the following line of code

$ sudo service gdm stop

5. Install the Nvidia Development Drivers.

a) Go to the location where you saved the downloaded file, become super user and run:

$ chmod +x
$ ./

b) Accept the license agreement, say ‘yes’ to “Install Nvidia’s 32-bit compatibility OpenGL libraries?”

c) Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Say ‘yes’

Note: Some systems might throw up an error that has something to do with the nouveau module. Trust me, it troubled me a lot, for around 30-40 mins before I figured a way out. The error occurs because the nouveau module is still running, and hence Nvidia can’t go ahead with its driver installation.
Step 6 is only for users facing this problem!

6. So the first thought that might come to your mind is, why not simply remove the nouveau module like we did in case of the already existing Nvidia drivers? Yes this might help, so go on with the following command:

$ sudo apt-get --purge remove xserver-xorg-video-nouveau

Reboot, then check if it’s still running or not:

$ sudo modprobe -r nouveau

In case it is still running (even purging it hasn’t helped) you will see the following output (yes I got this)

FATAL: Module nouveau is in use.

Then I found the solution here. Just follow these instructions to help yourself out of this.

a) Purge nouveau:

$ sudo apt-get --purge remove xserver-xorg-video-nouveau

b) Open ‘/etc/default/grub’ file:

$ gksudo gedit /etc/default/grub

c) Edit this file to add the following line to it:


d) Update the grub and later, reboot:

$ sudo update-grub

7. Now you can retry the step 5 and see that it works. Once you’re done with step 5, move on to install CUDA:

a) Go to the directory where you’ve saved your file and run it.

$ chmod +x
$ ./

b) When it asks you for the path of installation, simply choose the default.

8. Now we set up the environment variables:

a) Set PATH:

$ gksudo gedit /etc/environment

b) When the text opens, append the path to CUDA libraries to the existing path, save and quit editor.


change to…


c) Reload the newly edited path, so that the system updates it.

$ source /etc/environment


$ gksudo gedit /etc/

e) When a new text file opens up, paste the following lines into it, save and quit.


f) Reload this path:

$ sudo ldconfig

9. CUDA is now installed, but some more repairs and we’re done. You can now install the ‘GPU Computing SDK’ so that you can compile and verify CUDA.

a) Go to the directory location and:

$ chmod +x
$ ./

b) Install compiler, enter Y when asked:

$ sudo apt-get install g++

c) Repair the broken libGL dependency. Here we also generate a symbolic link between the common name and the existing one (Take care of your version. Mine was 260.19.26, please check yours)

$ sudo rm /usr/lib/
$ sudo ln -s /usr/lib/ /usr/lib/

d) Create link to common name

$ sudo ln -s /usr/lib/ /usr/lib/

e) Install some additional libraries for functioning of CUDA:

$ sudo apt-get install freeglut3-dev libxi-dev

10. Now you can go to the directory where the ‘GPU Computing SDK’ was extracted.

$ cd ~/NVIDIA_GPU_Computing_SDK
$ cd C

11. Build the examples provided:

$ make

Now you have successfully configured your Nvidia GPU in a Linux environment and also installed CUDA. You can verify the installation by simply running a sample code that comes along with the ‘GPU Computing SDK’. All the compiled example codes can be found under “~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/”.

Once you are into that directory, you can run the ‘deviceQuery’ code.

$ ./deviceQuery

If it outputs something like the following figure, you have successfully completed your installation.

cuda installation verification - rohit

Doubts, issues, installation problems – do comment! That’s all!


CUDA for beginners

Note: This is the follow-up post for this post (second in the GPGPU series).

If you’ve ever had a glance at any assembly language, you might know how complicated the concept of parallelism is. You need to overlap multiple instructions, so that more computing is done in the same amount of time. But you also need to make sure that instruction-2 does not require data updated by instruction-1 before instruction-1 actually updates it. So this forces you to incorporate lags or stalls in the instructions.

CUDA lets you implement your parallel algorithms at ease (trust me when I say “at ease”). It handles the thread distribution to different cores in the GPU on its own. The user is required to do the following tasks:
1. Coordinate the scheduling of computation on the CPU & GPU.
2. Memory allocation for variables (both CPU as well as GPU).
3. Transfer of data between the two memories (system memory and global memory).

Writing code to be executed on multiple devices used to be tricky and not-so-user-friendly. So Nvidia decided to develop a C-like language and programming environment that would improve the productivity of GPU programmers (earlier ways proved tough for entry-level programmers). They named it CUDA, which stands for Compute Unified Device Architecture. In CUDA terminology, the system processor is the host and the GPU is the device: you write C/C++ for the host and a C/C++ dialect for the device. OpenCL (Open Computing Language) is a similar platform that is vendor-independent and can be used for a variety of GPUs (Nvidia as well as AMD).

Each parallel task is done by something called a CUDA Thread. Various styles of parallelism can be utilized by these CUDA Threads: multithreading, MIMD (multiple instruction, multiple data), SIMD (single instruction, multiple data) as well as instruction-level parallelism. Due to the popularity of the term Thread in CUDA, Nvidia classified this model as SIMT (single instruction, multiple thread).

So a number of threads come together to form a block of threads, a number of blocks form a grid of blocks; and your task is divided into these number of blocks that do computations in their respective zones by multiple threads in that particular block. These threads then send back their results to Global memory, where all the blocks come back together to give you the overall solution.

You might not be able to understand this completely, so I’d like to demonstrate this with a simple example (the most common when it comes to GPGPU).

Question: Let’s say you’ve got 2 arrays A[] and B[], both containing n elements. You want to form an output array C[], where C[i] = k*A[i] + B[i] ….. ( k is a constant ).

Solution-1: When you think of such a situation, you have a pretty easy and basic C-code:

void calculate(int n, int k, const int *A, const int *B, int *C){
    // one loop iteration per output element
    for(int i=0; i<n; i++)
        C[i] = k*A[i] + B[i];
}

// call site:
calculate(n, 2, A, B, C);

Solution-2: Now comes the time to showcase a really cool and awesome block of code. Have a look at it and try to understand it; the explanation follows below.

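__global__ void calculate(int n, int k, const int *A, const int *B, int *C){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)    // threads past the end of the arrays do nothing
        C[i] = k*A[i] + B[i];
}

// call site (A, B and C must point to GPU global memory):
int nblocks = (n + 255)/256;
calculate<<<nblocks, 256>>>(n, 2, A, B, C);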

Now the explanation. GPUs of compute capability 1.x can handle a maximum of 512 CUDA Threads in a block of threads (newer ones allow 1024). But it’s better to go for 256 threads in a block for now (what if your GPU doesn’t support more? Be careful here). Now suppose we have only 1 element in our array (n = 1): we need only 1 block, and only 1 of its threads has work to do. When n = 255, we still need nblocks = 1. When n = 256, nblocks = 1; but for n = 257, nblocks = 2. Taking care of these cases, we get the following relation between n and nblocks (the division is integer division, so this is n/256 rounded up):
nblocks = (n + 255)/256
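The same rounding-up trick can be checked quickly outside CUDA. A minimal sketch in Python, where `//` is integer division like `/` on ints in C; the function name `blocks_needed` is made up for this example.

```python
def blocks_needed(n, threads_per_block=256):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block
```

Calling it with n = 1, 255 and 256 gives 1 block each, and n = 257 gives 2, matching the cases listed above.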

Calling a function in CUDA (a function that runs on the GPU is known as a kernel) is quite similar to calling one in C. Along with the regular arguments (the array pointers, k and n), it takes 2 additional arguments: the number of blocks (nblocks) and the number of threads in each block (256 in this case). These go inside the <<< >>> brackets, as shown above.

Now comes the device part (the GPU kernel, declared with __global__). As I said earlier, CUDA takes care of running the kernel many times over, once per thread, across the different cores. Whenever the kernel runs, it has access to the index of its thread within its block in the x-dimension (threadIdx.x), the index of that particular block (blockIdx.x) (as I mentioned above, we only consider a 1-dimensional array for the time being) and also the size or dimension of the block (blockDim.x). Using these 3 values, we calculate the absolute index of that particular element:
int i = blockIdx.x*blockDim.x + threadIdx.x

Try deriving this result on your own, it’s pretty easy. In case you do not follow, please ask for an edit.

So once you have the absolute index of this array element, simply compute the value in the output array for the same absolute index. The parallelism in this example comes because no element depends on any other element in the same array, so you can perform all the multiplications and additions parallel to each other (independent of each other). You need to understand that parallelism is useful when you’ve got HUGE amount of computation to be done on a multi-million number of elements. There is enormous increase in the speed-up as we increase the number of elements in those vectors (arrays).
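If you don’t have a GPU at hand, the indexing can still be tried out by playing every block and every thread one after the other on the CPU. This is only a sequential emulation in Python (so no speed-up at all), and the function name `calculate_on_grid` is made up for this example; it just shows that each element is visited exactly once and that threads beyond the end of the array must be skipped.

```python
def calculate_on_grid(n, k, a, b, threads_per_block=256):
    """Sequentially emulate a 1-D grid computing c[i] = k*a[i] + b[i]."""
    c = [0] * n
    n_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(n_blocks):                # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # threads past the end of the array do nothing
                c[i] = k * a[i] + b[i]
    return c
```

On the GPU the two loops disappear: every (block, thread) pair runs the loop body on its own, which is exactly where the parallelism comes from.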

Just to show you guys the speed-up levels achieved by parallel computation…. This is application of CUDA technology in the field of Computational Fluid Dynamics (I bet you’re :O by the level of speed-up)

Here is a link to the ‘Intro to parallel programming’ course at Udacity. Happy coding!

My next post would be on the basics of GPU architecture…. Until then!

Graphics Processing Units & GPGPU

So GPGPU stands for General Purpose Computing on GPUs. Although this concept of using a GPU as a computation tool isn’t very popular in India, it has been a major part of research worldwide. Why do I say it is not so popular in India? Because I study in one of the Indian Institutes of Technology, and only 1 out of 10 people around me might know what GPGPU is. What everybody knows, is that you need GPUs for playing high-end graphic games.

Who can exploit this computing power? Anybody studying engineering can, WHATEVER subject he/she is majoring in. This high performance and computing power is not an option nowadays, but a necessity. It has applications in EVERY field of science or engineering ranging from Bioinformatics, to Fluid mechanics, to Finance, image processing and many more. Yes you must know the basics of programming in order to learn parallel programming.

You put in a few hundred dollars and you have a GPU with hundreds of parallel floating-point units (the latest GPUs have thousands of parallel ALUs). What we’re talking about is dedicated graphics: GPUs that have their own RAM, also known as the Frame Buffer (Global memory, if you already know CUDA). Nvidia and ATI (now AMD) are the 2 biggest manufacturers of GPUs, followed by Intel, which manufactures Integrated Graphics (you might have heard of the Intel HD series). It might sound unbelievable but the latest Kepler Architecture GPUs from Nvidia ACTUALLY have thousands of cores in them (where a quad-core CPU contains 4 cores, this one has over a thousand). What the Kepler Architecture is will be discussed in the following posts.

Parallel computation is a relatively old concept. We’re able to work with a number of different applications at the same time because even the CPU is designed to exploit thread-level and instruction-level parallelism. But implementing such parallel algorithms on GPU hardware is still a new concept. That is exactly why Nvidia provides a lot of support and help regarding GPGPU. CUDA (Compute Unified Device Architecture) is a platform designed by Nvidia, so that you can write programs that make use of multiple computing devices (CPU as well as GPU). Setting up your system so that the compiler finds the CUDA libraries takes some work, but once you can code, it is fun.

I have been studying some of the details pertaining to parallel architecture and parallel computing for around six months now. It has a wide scope in Computational Fluid Dynamics, which is my B.Tech project topic. GPUs are relatively cheap, provide about 10-100x performance rise in various applications (depending on the scope for parallelism in your algorithm) and the CUDA toolkit is for FREE!! I’ll be putting up more stuff on this……