CUDA programming in Python is a good way to start learning to leverage the power of GPU hardware. At the same time, wouldn’t it be nice to be able to speed up those bottleneck Python functions? Looky here:
Background
Python has a reputation for being ‘slow’. In part this is because it is an interpreted language. That means a piece of software, the interpreter, executes your program on an emulated machine rather than directly on the hardware. The alternative, compiled languages, take source code and compile it directly to the machine language of the hardware on which it runs.
Many computer languages have reputations of being fast or slow. Sometimes this is due to the level of abstraction of the language. For example, the programming language C is very close to a description of machine code, so it is considered fast. It’s only a couple of levels up the abstraction ladder.
On the other hand, there are very many more languages with more sophisticated abstract data structures and algorithms. These abstractions can make the language appear ‘slow’. It’s not a fair comparison of course. These abstractions make programming easier. Most people have difficulty programming these abstractions in something like C. As Philip Greenspun’s Tenth Rule of Programming half jokingly states:
Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
Over the years this has wandered away from Common Lisp somewhat to other full-featured interpreted languages. The basic sentiment is that complex systems implemented in low-level languages cannot avoid reinventing and/or reimplementing (poorly on both counts) the facilities built into higher-level languages. It’s safe to consider Python a high-level abstract language.
Just In Time (JIT) Compiler
Back in the late 1980s/early 90s, researchers at Stanford and Sun Microsystems implemented the modern Just In Time (JIT) compilers for interpreted languages. For most computer programs, the majority of the execution time is isolated to a small area of code. It’s something like an 80/20 rule for software. The idea of the JIT compiler is to monitor the program and figure out where the time is being spent. Then it takes the functions which are using the most time and compiles them into machine code, then links, loads, and runs them directly. Subsequent calls to the function are redirected to run in native code.
There are a couple of drawbacks to this approach. The main one is that you have to compile the function while the program is running. This takes a little bit of time. The speedups though are dramatic. Java was one of the first commercial projects to take advantage of this. JavaScript adopted this approach a few years later. Python has shied away from this approach. The standard CPython reference implementation does not take advantage of this technology.
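Here’s a toy sketch of that monitor-then-redirect idea in plain Python. It is not how any production JIT is built: nothing is actually compiled, and the names toy_jit, triangle, and triangle_fast are made up for illustration. A hand-written faster function stands in for the native code a real JIT would generate.

```python
import functools


def toy_jit(threshold, compiled_version):
    """Count calls to a function; once it is 'hot', redirect to compiled_version."""
    def decorator(interpreted_version):
        state = {"calls": 0, "hot": False}

        @functools.wraps(interpreted_version)
        def dispatcher(*args):
            if state["hot"]:
                # Subsequent calls skip the slow path entirely.
                return compiled_version(*args)
            state["calls"] += 1
            if state["calls"] >= threshold:
                # A real JIT would compile, link, and load native code here.
                state["hot"] = True
            return interpreted_version(*args)

        dispatcher.state = state
        return dispatcher
    return decorator


def triangle_fast(n):
    # Stand-in for the compiled code: closed form instead of a loop.
    return n * (n + 1) // 2


@toy_jit(threshold=3, compiled_version=triangle_fast)
def triangle(n):
    # The 'interpreted' version: a plain Python loop.
    total = 0
    for i in range(n + 1):
        total += i
    return total


results = [triangle(100) for _ in range(5)]
print(results, triangle.state)
```

The first three calls run the loop, and the remaining calls are redirected to the fast version without the caller noticing any difference.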
As a note, PyPy is a Python interpreter which is JIT enabled. They state it’s about 5 times faster than CPython.
Multi-Processing
Back around 2005 there were a couple of landmark changes in the computer industry. The first is multi-core/multi-threading microprocessors. The second is the availability of general purpose computing on graphics processing units, introduced by NVIDIA as CUDA. As a result, the nature of how to write software changed significantly.
In addition, we are seeing today that there is an effort to get both the CPU cores and the GPU units on the same chip. Apple has three generations under its belt with the M3 chips. On the embedded computer side, NVIDIA has the Jetson. Going forward, we expect more offerings like this such as the rumored NVIDIA ARM/GPU Windows desktop chips.
That raises the question, “How do you take advantage of the hardware most efficiently?” GPUs are really good at two things: math and lightweight multiprocessing. But there’s a catch in most of today’s hardware architectures. In many cases, the memory in use has to be moved back and forth between the host computer and the CUDA device (GPU). This overhead can use up the advantage gained from faster math. This changes when you have a unified memory architecture, where the CPU and the GPU share memory.
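A back-of-the-envelope sketch makes the trade-off concrete. The model is deliberately simplified (it ignores launch latency and any overlap of copies with compute), and every number below is a hypothetical chosen for illustration, not a measurement.

```python
def gpu_total_time(cpu_seconds, speedup, bytes_moved, bytes_per_second):
    """Estimated GPU wall time: compute time plus host<->device copy time."""
    compute = cpu_seconds / speedup
    transfer = bytes_moved / bytes_per_second
    return compute + transfer


# Hypothetical numbers, for illustration only.
cpu_seconds = 0.010              # assume 10 ms of math on the CPU
speedup = 20                     # assume the GPU does the math 20x faster
bytes_moved = 2 * 100_000_000    # 100 MB to the device and 100 MB back
bytes_per_second = 10e9          # assume a 10 GB/s host<->device link

discrete = gpu_total_time(cpu_seconds, speedup, bytes_moved, bytes_per_second)
# With unified memory, model the copy cost as zero bytes moved.
unified = gpu_total_time(cpu_seconds, speedup, 0, bytes_per_second)

print(f"CPU only:        {cpu_seconds * 1000:.1f} ms")
print(f"GPU with copies: {discrete * 1000:.1f} ms")
print(f"GPU, unified:    {unified * 1000:.1f} ms")
```

Under these assumed numbers the copies cost more than the math saves, so the GPU version is slower than staying on the CPU. Take the copies away and the GPU wins easily.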
Numba
There are several libraries which allow you to write CUDA code in Python. CuPy is a direct replacement for NumPy which utilizes the GPU for computation. CUDA Python is the official NVIDIA on-ramp to being able to access the CUDA driver using Python wrappers.
Numba is another library in the ecosystem which allows people entry into GPU-accelerated computing using Python with a minimum of new syntax and jargon. As a bonus, Numba also provides JIT compilation of Python functions.
To use a JIT version of a function, you only need to add a decorator to the code. For example:
from numba import jit
@jit(nopython=True)
def sobel_filter(input_image):
    ...  # function body omitted
It’s similar for CUDA:
from numba import cuda
@cuda.jit
def my_cuda_kernel(io_array):
    ...  # kernel body omitted
Numba compiles the function into either machine code or PTX (CUDA) code on the first call. For full examples, you can look at the JetsonHacks GitHub repository cuda-using-numba that we went through in the video.
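You can watch the compile-on-first-call behavior on the CPU side with a sketch like the one below. The function sum_of_squares is a made-up example, not something from the repository. As a convenience, and as my own assumption about what is useful here, the sketch falls back to a do-nothing decorator when Numba is not installed, so it still runs but without any speedup.

```python
import time

try:
    from numba import jit
except ImportError:
    # Fallback so the sketch still runs without Numba (no speedup, though).
    # It only supports the @jit(...) form with arguments, as used below.
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


# First call: with Numba present, this includes the compilation time.
start = time.perf_counter()
sum_of_squares(10)
first_call = time.perf_counter() - start

# Second call with the same argument type: reuses the compiled code.
start = time.perf_counter()
result = sum_of_squares(10)
second_call = time.perf_counter() - start

print(result)
print(f"first call: {first_call:.6f}s, second call: {second_call:.6f}s")
```

With Numba installed, the first call should take noticeably longer than the second, because that is where the compilation happens.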
While it’s simple to start with Numba, for best performance you’ll need to learn a lot more. Fortunately it’s well documented on the Numba website, and enough people are using it so that there are some great examples out there.
Conclusion
I’ve always found interpreters fascinating. I was shocked by how complex production compilers and interpreters are in practice after learning about them at university. The ones in school seem so simple, but when they are out in the wild they’re a whole different beast.
There’s a great resource on how Numba takes Python source code, turns it into a CUDA kernel, and launches it: the life-of-numba-kernel repository on the gmarkall account on GitHub. It’s by Graham Markall from NVIDIA, who also worked on Numba at Anaconda. It’s a wonderful overview and walk-through of how exactly this works. Well worth the read.
This is one of those “Hey there’s stuff out there which makes Python faster” articles. I found Numba to be fun to work with, so I thought I would share it. As always, thanks for reading.
One Response
I tried the Numba route a few years back and it turned into a monster of a rabbit hole. Was successful at getting some lightweight stuff running but it was quite difficult and not a lot of fun.
Then came Julia Language. MUCH shorter learning route but will take as long to master as C does (not C++, thank the Lord.) Julia has multiprocessing, multithreading, and multi-computer tools already built in. And, of course, CUDA and ROCm are supported WITHOUT the programmer needing to design the full CUDA overhead himself, as it is already built into the language, as mentioned. One can get up and running with moderate programs relatively quickly, once the basics of Julia are understood. Julia Lang is about 10 years old and still improving rapidly, with over 5,000 packages already available for free, as is Julia Lang itself.
There are still some kludges to be programmed in order to field a pure executable and the binaries are enormous (but I think still smaller than a Netsomething (Microsoft environment) deliverable…)
Julia uses JIT compilation but that is getting better, too.
I’m told there’s a version for the Jetson family, too, but I’ve not personally done that yet. Started it but life got in the way… Other Arm-based Julia versions are available but you have to compile some of them the first time. Of course, Intel and AMD have binaries downloadable.
It’s a fun language and such enormous power!