CUDA Extensions for Python

Over the winter break, I have been watching iTunes U lectures on a programming language called CUDA. CUDA is an extension to C that allows you to run code on your NVIDIA graphics card (GPU). GPUs are designed to run parallel programs very efficiently. Instead of executing a single thread very quickly (as a CPU does), a GPU has high throughput for a block of threads, where each individual thread runs more slowly. This is very appealing for numerical calculations that can be parallelized. However, this post is not a tutorial on programming in CUDA; there are plenty of resources on the web. Instead, I want to talk about combining the power of Python with CUDA.

In the past few years, a growing number of researchers have been moving their numerical simulations from MATLAB to Python. There are several advantages to using Python. As a programmer, I like it because it is an actual programming language instead of a glorified calculator. Python also allows you to create C extensions to speed up bottlenecks in your code, similar to MEX files in MATLAB. The question is what you should do if you want to utilize the benefits of a GPU. If you are using Python, you could use PyCUDA. However, I found all the code examples horrifically ugly. It is never a good sign when you are forced to write C code inside a Python string; it just strips all the elegance of Python away. Granted, I have spent practically no time with it. I just saw code examples and was repulsed by the smell. An alternative to PyCUDA (or other Python bindings) is CUDA extensions. By this I mean we create a standard C extension for Python and use the nvcc compiler instead of gcc, which lets us use all the CUDA goodness in our extension.
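To show what I mean, here is roughly what vector addition looks like in the PyCUDA style (a sketch based on PyCUDA's documented API, not code from this post; it needs a CUDA-capable GPU to run). Notice that the kernel lives inside a Python string:

```python
# Sketch of the PyCUDA style: the CUDA kernel is a C string inside Python.
import numpy as np
import pycuda.autoinit                  # creates and initializes a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
""")
add = mod.get_function("add")

a = np.random.randn(5).astype(np.float32)
b = np.random.randn(5).astype(np.float32)
c = np.empty_like(a)
add(drv.In(a), drv.In(b), drv.Out(c), block=(5, 1, 1), grid=(1, 1))
```

It works, but the C code in the string gets no syntax checking, no highlighting, and none of Python's elegance.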

Before you continue, you should make sure that you can create a normal C extension. I went through a tutorial on the SciPy website that shows how to create an extension that interacts with NumPy arrays, which is really useful for numerical simulations. In my example, I am using CUDA to add two NumPy vectors together. For the record, I am not suggesting you use this code for anything other than an example; the goal is merely to get CUDA code running from Python without additional bindings.

The simple part is writing the CUDA program. The code that I wrote can be found here. There are two things I should point out about the code that differ from standard C extensions and CUDA programs. First, I had to declare the functions AddVector and init_CU_AddVector with extern "C". The reason is that the function names were being mangled (nvcc compiles as C++) when everything was linked into a shared library, which caused Python to look for function names that no longer existed. Second, my GPU only supports single-precision floats, which meant I had to cast and copy an array of doubles into an array of floats. This isn't the best way of dealing with the problem, but it was the quickest, which left more time to get everything else working. Please note that many GPUs do support double-precision operations; you can find a list in the CUDA Programmer's Guide. Any GPU with Compute Capability 1.3 or higher supports double precision.
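For reference, the core of such an extension looks roughly like this. This is a hedged sketch, not my exact file: the names AddVector, init_CU_AddVector, and _CU_AddVector match the post, while the kernel name and the helper details are illustrative (Python 2 API, since init-style module initializers are Python 2):

```cuda
// Sketch of a Python 2 C extension compiled with nvcc.
#include <Python.h>
#include <numpy/arrayobject.h>

// CUDA kernel: one thread per element.
__global__ void add_kernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// extern "C" stops nvcc (a C++ compiler) from mangling the names Python looks up.
extern "C" PyObject *AddVector(PyObject *self, PyObject *args)
{
    PyArrayObject *v1, *v2;
    if (!PyArg_ParseTuple(args, "O!O!", &PyArray_Type, &v1, &PyArray_Type, &v2))
        return NULL;

    int n = (int)PyArray_DIM(v1, 0);

    // The card only does single precision, so cast the doubles down to floats.
    float *h_a = (float *)malloc(n * sizeof(float));
    float *h_b = (float *)malloc(n * sizeof(float));
    float *h_c = (float *)malloc(n * sizeof(float));
    double *in1 = (double *)PyArray_DATA(v1);
    double *in2 = (double *)PyArray_DATA(v2);
    for (int i = 0; i < n; i++) { h_a[i] = (float)in1[i]; h_b[i] = (float)in2[i]; }

    // Move the data to the device, run the kernel, and copy the result back.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
    add_kernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Build the result as a NumPy array of doubles again.
    npy_intp dims[1] = { n };
    PyObject *out = PyArray_SimpleNew(1, dims, NPY_DOUBLE);
    double *res = (double *)PyArray_DATA((PyArrayObject *)out);
    for (int i = 0; i < n; i++) res[i] = (double)h_c[i];

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return out;
}

static PyMethodDef methods[] = {
    { "AddVector", AddVector, METH_VARARGS, "Add two vectors on the GPU." },
    { NULL, NULL, 0, NULL }
};

extern "C" void init_CU_AddVector(void)
{
    Py_InitModule("_CU_AddVector", methods);
    import_array();   // required before using the NumPy C API
}
```

A production version would check the array types and lengths and handle cudaMalloc failures; this sketch skips that to keep the structure visible.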

Once you have written an extension, you need to compile and create a shared library. I used the following Makefile.

PYTHON_DIR = /Library/Frameworks/Python.framework/Versions/Current

# ---- Link ---------------------------
_CU_AddVector.so: CU_AddVector.o
	gcc -bundle -m32 -flat_namespace -undefined suppress \
	    -o _CU_AddVector.so CU_AddVector.o \
	    -L/usr/local/cuda/lib -lcudart

# ---- Cuda compile ------------------
CU_AddVector.o: CU_AddVector.cu
	nvcc -c -m32 CU_AddVector.cu \
	    -I$(PYTHON_DIR)/include/python2.7 \
	    -I$(PYTHON_DIR)/lib/python2.7/site-packages/numpy/core/include

(Adjust the two include paths to wherever Python.h and NumPy's arrayobject.h live in your installation.)

The key is to first compile your CUDA extension to an object file using the nvcc compiler. You will need to provide include paths for Python.h (in your Python installation) and arrayobject.h (in your NumPy installation). Second, you use gcc to link everything into a shared library. You will need to provide the path to the CUDA library and pass the -lcudart flag so that the CUDA runtime is linked into the shared library.

Lastly, now that we have created a shared library, we can import it from Python. This is demonstrated in the following script.

import random

import numpy as np

import _CU_AddVector

v1 = np.array([random.uniform(0, 20) for i in range(5)])
v2 = np.array([random.uniform(0, 20) for i in range(5)])

result = _CU_AddVector.AddVector(v1, v2)
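One caveat when checking the result: because the extension rounds the doubles through single precision on my GPU, the returned vector will not be bit-identical to v1 + v2 computed in NumPy, so compare with a tolerance. This snippet emulates the double → float → double round trip on the CPU (no GPU needed) to show the error is tiny but real:

```python
import numpy as np

v1 = np.array([0.1, 1.7, 12.345, 19.99, 3.0])
v2 = np.array([5.5, 0.002, 7.25, 11.11, 8.125])

exact = v1 + v2  # full double-precision sum

# Emulate the extension's double -> float -> double round trip.
approx = (v1.astype(np.float32) + v2.astype(np.float32)).astype(np.float64)

# Not bit-identical, but well within single-precision tolerance.
print(np.allclose(approx, exact, rtol=1e-6, atol=1e-6))  # True
```

So use np.allclose, not ==, when testing the extension's output against NumPy.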