User-defined CUDA kernels in Lua/C++

I want to allow users to define functions to be used in CUDA kernels (or be called by CUDA kernels).
I don't want to expose CUDA API to the users. The functions should look like typical c++/lua functions.
I've checked PyCUDA, but it seems to only be a wrapper around .cu code.
I'd rather have a .lua or .cc file and use function pointers.
Is it remotely possible?

No, it is not remotely possible.
CUDA kernels, by design, execute on the GPU. They are compiled into (NVIDIA) GPU-specific machine language and executed in an execution environment that is utterly alien to anything a C++ function operates in, let alone Lua. They cannot simply call arbitrary code.
The absolute most that you might do is write a compiler to compile from C++/Lua into a CUDA library. But that would be a substantial undertaking for either language.

Related

Is compiling user made c++ code at runtime as an extension a good idea? [duplicate]

I am writing a research application that will utilise GPGPU using C++ and CUDA. I want to allow users of the application to be able to tailor the program by writing kernel code that will be executed on the GPU.
My only thought so far is outputting the user code into a .cu file, then calling the platform's compiler to create a dynamic library, which can then be loaded at runtime by the host application. Is this viable? Even if it is, I'm very concerned that doing this will make my program unstable and a nightmare to make cross-platform.
Any thoughts/alternatives or comments would be greatly appreciated.
Theoretically it is possible, but I would recommend OpenCL instead of CUDA. It is not as optimized as CUDA on NVIDIA platforms, but it is designed to support run-time compilation (every OpenCL runtime driver includes a compiler that compiles a kernel as the first step of executing it).
Another advantage would be that OpenCL is more portable than CUDA, as OpenCL also runs on ATI (GPU and CPU) and Intel.
You could do it, it's viable, but IMO you would need to have a really good reason for allowing users to edit the CUDA kernel. I'm not sure what you have in mind for a user interface, and how the code that the user runs in the CUDA kernel will interface with the outside world, but this could get tricky. It might be better if you pre-implement a set of CUDA kernels and allow users to set a known set of parameters for each kernel.
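A minimal sketch of that "pre-implemented kernels plus user parameters" idea, written as plain host C++ so it stays self-contained. The names KernelParams, KernelFn, kernel_table and run_named_kernel are hypothetical, and the loops are CPU stand-ins: in a real application each table entry would presumably launch a precompiled CUDA kernel instead.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parameter block: the only knobs the user is allowed to turn.
struct KernelParams {
    float scale = 1.0f;
    float offset = 0.0f;
};

// Every predefined operation shares one signature.
using KernelFn = std::function<void(std::vector<float>&, const KernelParams&)>;

// Fixed table of operations shipped with the application.
// Assumption: a real entry would launch a precompiled CUDA kernel here.
inline const std::map<std::string, KernelFn>& kernel_table() {
    static const std::map<std::string, KernelFn> table = {
        {"scale", [](std::vector<float>& d, const KernelParams& p) {
             for (float& x : d) x *= p.scale;
         }},
        {"affine", [](std::vector<float>& d, const KernelParams& p) {
             for (float& x : d) x = x * p.scale + p.offset;
         }},
    };
    return table;
}

// Look up an operation by name and run it with the user's parameters.
inline void run_named_kernel(const std::string& name, std::vector<float>& data,
                             const KernelParams& params) {
    auto it = kernel_table().find(name);
    if (it == kernel_table().end())
        throw std::invalid_argument("unknown kernel: " + name);
    it->second(data, params);
}
```

The user never sees device code here: they pick a name and fill in a parameter struct, and an unknown name is rejected before anything runs.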
Have you looked at PyCUDA? It basically implements a similar idea to allow Python users to write C++ CUDA kernels inside Python applications. PyCUDA provides functionality that helps users integrate their Python code with the kernels that they write, so that when they run the Python script, the kernel compiles and runs as part of it.
I haven't looked at the inner workings of pycuda but I assume that at its core it is doing something similar to what you are trying to achieve. Looking at pycuda might give you an idea of what's needed to write your own implementation.
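For the approach described in the question (write the user's code into a .cu file, call the compiler, load the result), the first two steps could look roughly like the sketch below. The kernel name user_kernel, its signature and the helper names are hypothetical, and the command line assumes nvcc is on the PATH with a Linux-style host toolchain.

```cpp
#include <fstream>
#include <string>

// Wrap a user-supplied expression in a fixed kernel template.
// Assumption: the user writes an expression in terms of `x`.
inline std::string make_kernel_source(const std::string& user_expr) {
    return
        "extern \"C\" __global__ void user_kernel(float* data, int n) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) {\n"
        "        float x = data[i];\n"
        "        data[i] = " + user_expr + ";\n"
        "    }\n"
        "}\n";
}

// Write the generated source to disk; returns false if the write failed.
inline bool write_source_file(const std::string& path, const std::string& source) {
    std::ofstream out(path);
    out << source;
    return static_cast<bool>(out);
}

// Build the compiler invocation that turns the .cu file into a shared library.
// Assumption: Linux-style flags; a Windows build would need different options.
inline std::string make_compile_command(const std::string& cu_path,
                                        const std::string& lib_path) {
    return "nvcc --shared -Xcompiler -fPIC -o " + lib_path + " " + cu_path;
}
```

The resulting string would then be passed to something like std::system, and the library loaded with the platform's dynamic loader. Keep in mind that this compiles and runs whatever the user typed, so it is only reasonable when the users are trusted.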

How does the C++ standard library work behind the scenes?

This question has been bothering me so much for the past couple of days. I was wondering how the standard library works, in terms of functionality. I couldn't find an answer anywhere, even by checking the source code provided by the LLVM compiler which is, for a beginner like me, a really complicated piece of code.
What I'm basically trying to understand here is how the C++ standard library works. For example, let's take the fstream header file, which consists of a bunch of functions that help to write to and read from files.
How does it work? Does it use the OS specific API (since the library is cross platform), or what? And, if the standard library can do it, aren't I supposed to be able to mess with some files as well without calling the standard fstream file (which to my experience I can't do)?
I apologize if my questions are unclear since I'm not a native English speaker: feel free to modify this text so as to make it clearer.
Does it use the OS specific API (since the library is cross platform), or what?
At some point, the OS specific API is used. The fstream implementation does not necessarily call an OS function directly. It might use other classes, which call functions inherited from C, etc., but eventually the call chain will lead to an OS call. (Yes, the details are often too complicated for an intermediate programmer to follow. So, as a self-described beginner, your findings are not surprising.)
The library is cross-platform in the sense that on your end (the C++ programmer), the interface is the same regardless of platform. It is not, however, the same library on every platform. Each platform has its own library, exposing the same interface on the C++ side, but making use of different OS calls. (In fact, the same platform might have multiple standard libraries, as the library implementation is provided by your toolchain, not by the standards committee.)
And, if the standard library can do it, aren't I supposed to be able to mess with some files as well without calling the standard fstream file (which to my experience I can't do)?
Yes, you are allowed to. Apparently, you have not been able to yet, but with some practice and guidance you should be able to. Everything in the standard library can be recreated in your own code. The point of the standard library (and most libraries, for that matter) is to save you time, not to enable something that was otherwise unavailable. For example, you don't have to implement a file stream for every program you write; it's in the standard library so you can focus on more interesting aspects of your project.
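To make that concrete, here is a minimal sketch of working with a file without fstream, by calling the POSIX functions open, read, write and close directly. It assumes a POSIX system (Linux, macOS); the class name MiniFile is made up for illustration, and a real stream class does far more (buffering, formatting, error states).

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <string>

// Tiny RAII wrapper over a POSIX file descriptor (hypothetical name).
class MiniFile {
public:
    MiniFile(const char* path, int flags, mode_t mode = 0644)
        : fd_(::open(path, flags, mode)) {}
    ~MiniFile() { if (fd_ >= 0) ::close(fd_); }
    MiniFile(const MiniFile&) = delete;
    MiniFile& operator=(const MiniFile&) = delete;

    bool is_open() const { return fd_ >= 0; }

    // Write the whole string, retrying after partial writes.
    bool write_all(const std::string& text) {
        std::size_t done = 0;
        while (done < text.size()) {
            ssize_t n = ::write(fd_, text.data() + done, text.size() - done);
            if (n <= 0) return false;
            done += static_cast<std::size_t>(n);
        }
        return true;
    }

    // Read up to max_bytes from the current file position.
    std::string read_some(std::size_t max_bytes) {
        std::string buf(max_bytes, '\0');
        ssize_t n = ::read(fd_, &buf[0], max_bytes);
        buf.resize(n > 0 ? static_cast<std::size_t>(n) : 0);
        return buf;
    }

private:
    int fd_;
};
```

On Windows the same class would have to be written against CreateFile, ReadFile and WriteFile instead, which is exactly the per-platform work the standard library does for you.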
A compiler is just a program that creates an executable file or a library. You can use the compiler's default libraries to save time, or write your own. The default libraries communicate with the OS for file operations or memory allocation, and provide simple standard classes so that the developer can write a single code base that works on all target platforms supported by the compiler and the libraries. If you want to write your own, you have to write each function for every target OS.
The standard library is cross-platform in a sense that its interface does not change between platforms but its implementation does - or in practical terms - if you only use C++ and its standard library, you can write your code the same way for Linux / Windows / MacOS / Android / Whatever and if you find a C++ compiler for one of those platforms that supports the language features you used, you will be able to compile your code for that platform without rewriting anything.
So while you can use std::vector or std::fstream or any other feature in the library independently of the platform you're writing for and expect the function definitions, type names, etc. to look the same, you cannot expect the executable which you compiled for a PC with Windows 10 to run on a phone with Android. You cannot even expect the same executable to run on the same PC with a different operating system. That is what I mean by "the implementation is different".
There are two main reasons for this difference:
Processors with different architectures (x86-64 and ARM for example) use different instruction sets and as such the C++ source would need to be compiled to a completely different machine code to run properly
Computers with processors of the same architecture which have a different operating system have different ways of dynamically allocating memory, creating files, creating streams, writing to console, creating and scheduling threads etc. - which is part of the system functionality that you use via the standard library
If you really wanted to, you could use HeapAlloc() instead of operator new() or CreateThread() instead of the standard library's std::thread, but that would force you both to rewrite your program every time you wanted to compile it for something other than Windows and to recompile it with the target platform's compiler (and by proxy learn its API). The standard library saves you from that trouble by abstracting away those system calls.
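A small sketch of what "same interface, different implementation" looks like in source form. The function name current_process_id is hypothetical; GetCurrentProcessId and getpid are the real Windows and POSIX calls it forwards to.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

// One interface for the caller; the body is chosen per platform at compile time.
inline unsigned long current_process_id() {
#ifdef _WIN32
    return static_cast<unsigned long>(::GetCurrentProcessId());
#else
    return static_cast<unsigned long>(::getpid());
#endif
}
```

Code that calls current_process_id() compiles unchanged on both systems, and only this one function knows about the OS. Standard library implementations are organised along the same lines, just on a much larger scale.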
As for the fstream in particular, here is what it uses internally on most PCs nowadays.
Basically, fstream, iostream and printf are all built on top of a kernel function; on POSIX systems such as Linux, that is write(). When your code calls printf (we use printf as an example), it eventually calls write() to let the kernel do the actual I/O. After that, write() returns, printf returns, and your code continues.
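Here is a much-simplified sketch of that layering, assuming a POSIX system. The class BufferedOut is hypothetical and is not how any particular C library is written; it only shows the general shape: collect bytes in a user-space buffer, then hand them to the kernel with write().

```cpp
#include <unistd.h>
#include <cstddef>
#include <cstring>

// Hypothetical, heavily simplified model of a stdio-style output stream.
class BufferedOut {
public:
    explicit BufferedOut(int fd) : fd_(fd), used_(0) {}
    ~BufferedOut() { flush(); }
    BufferedOut(const BufferedOut&) = delete;
    BufferedOut& operator=(const BufferedOut&) = delete;

    // Copy text into the buffer; only touch the kernel when the buffer is full.
    void put(const char* text) {
        std::size_t len = std::strlen(text);
        while (len > 0) {
            std::size_t room = sizeof(buf_) - used_;
            std::size_t chunk = len < room ? len : room;
            std::memcpy(buf_ + used_, text, chunk);
            used_ += chunk;
            text += chunk;
            len -= chunk;
            if (used_ == sizeof(buf_)) flush();
        }
    }

    // Hand the buffered bytes to the kernel through the write() system call.
    void flush() {
        std::size_t done = 0;
        while (done < used_) {
            ssize_t n = ::write(fd_, buf_ + done, used_ - done);
            if (n <= 0) break;  // this sketch simply gives up on errors
            done += static_cast<std::size_t>(n);
        }
        used_ = 0;
    }

private:
    int fd_;
    std::size_t used_;
    char buf_[64];
};
```

Constructing BufferedOut with file descriptor 1 and calling put("hi\n") would print to standard output once the object is flushed or destroyed.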
So if you really want to know how printf works internally, you have to read the source code of the kernel.
But you shouldn't do that for now.
As a beginner, do not try to go deeper before you have a basic understanding of how computers work. A computer is a project, just like a building, so the right way to learn it is level by level. First learn how to use brick and cement to build a building; this is what you should do for now. What you shouldn't do is, the first time you pick up a brick, get interested in how bricks are produced and start focusing on that. This is the wrong way to learn IT.
If you are learning C/C++, just learn it. Remember, learn it level by level. For now, knowing how to use printf is enough.

Difference between C++ MEX and C MEX

I wrote a TCP/IP socket connection with server and client in C++, which works quite nicely in Visual Studio. Now I want to use the C++ client in MATLAB/Simulink through MEX-files and later in an S-Function.
I found two descriptions of MEX-files:
C++ MEX File Application (just for C++)
C/C++ MEX Files (C/C++)
Now I am confused about which one to take. I wrote some easy programs with the second, but always ran into problems with datatypes. I think it is because the given examples and functions are only for C and not for C++.
I appreciate any help! Thank you very much!
The differences:
The C interface described in the second link is much, much older (I used this interface way back in 1998). You can create a MEX-file with this interface and have it run on a large set of different versions of MATLAB. You can use it from C as well as C++ code.
The C++-only interface described in the first link is new in MATLAB R2018a (the C++ classes to manipulate MATLAB arrays were introduced in R2017b, but the ability to write a MEX-file was new in R2018a). MEX-files you write with this interface will not run on prior versions of MATLAB.
Additionally, this interface (finally!) allows for creating shared-data copies, in-place operations, etc. (the stuff we have been asking for for many years, but they didn't want to put into the old C interface because they worried it would be too complex for the average MEX-file writer).
Another change to be aware of:
In R2018a, MATLAB also changed the way that complex arrays are stored in memory. In older versions of MATLAB, the real and imaginary components are stored in separate memory blocks. In R2018a and on, they are stored in the same memory block, in the same fashion as you would likely use in your own code.
This affects MEX-files! If your MEX-file uses complex data, it needs to read and write them in the way that MATLAB stores them. If you run a MEX-file compiled for an older version of MATLAB, or compile a MEX-file using the current default building options in R2018a, a complex array will be copied to the old storage model before being passed to the MEX-file. A new compile option to the mex command, -R2018a, creates MEX-files that pass the data in the new storage model unchanged. But those MEX-files will not be compatible with previous versions of MATLAB.
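To illustrate the two storage models without touching the MEX API at all, here is a sketch in plain C++. The struct SplitComplex and the two conversion functions are hypothetical names; the copy in to_split is the kind of work the compatibility path has to do for every complex array.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Separate storage: all real parts in one block, all imaginary parts in another.
// Assumption: re and im always have the same length.
struct SplitComplex {
    std::vector<double> re;
    std::vector<double> im;
};

// Interleaved storage: re0, im0, re1, im1, ... in a single block,
// which is also the memory layout of an array of std::complex<double>.
inline std::vector<std::complex<double>> to_interleaved(const SplitComplex& s) {
    std::vector<std::complex<double>> out(s.re.size());
    for (std::size_t i = 0; i < s.re.size(); ++i)
        out[i] = std::complex<double>(s.re[i], s.im[i]);
    return out;
}

// The reverse copy, from one interleaved block back into two separate blocks.
inline SplitComplex to_split(const std::vector<std::complex<double>>& z) {
    SplitComplex s;
    s.re.reserve(z.size());
    s.im.reserve(z.size());
    for (const std::complex<double>& c : z) {
        s.re.push_back(c.real());
        s.im.push_back(c.imag());
    }
    return s;
}
```

With interleaved storage, a pointer to the data can be handed straight to code that expects std::complex<double>, with no conversion pass.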
How to choose?
If you need your MEX-files to run on versions of MATLAB prior to the newest R2018a, use the old C interface; you don't have a choice.
If you want to program in C, use the old C interface.
If you need to use complex data, and don't want to incur the cost of the copy, you need to target R2018a and newer, and R2017b and older, separately. You need to write separate code for these two "platforms". The older versions can only be targeted with the C interface. For the newer versions you can use either interface.
If you appreciate the advantages of modern C++ and would like to take advantage of them, and are targeting only the latest and greatest MATLAB version, then use the new C++ interface. I haven't tried it out yet, but from the documentation it looks to be very well designed.

How does a language expand itself?

I am learning C++ and I've just started learning about some of Qt's capabilities to code GUI programs. I asked myself the following question:
How does C++, which previously had no syntax capable of asking the OS for a window or a way to communicate through networks (with APIs which I don't completely understand either, I admit) suddenly get such capabilities through libraries written in C++ themselves? It all seems terribly circular to me. What C++ instructions could you possibly come up with in those libraries?
I realize this question might seem trivial to an experienced software developer but I've been researching for hours without finding any direct response. It's gotten to the point where I can't follow the tutorial about Qt because the existence of libraries is incomprehensible to me.
A computer is like an onion: it has many, many layers, from the inner core of pure hardware to the outermost application layer. Each layer exposes parts of itself to the next outer layer, so that the outer layer may use some of the inner layer's functionality.
In the case of e.g. Windows, the operating system exposes the so-called WIN32 API to applications running on Windows. The Qt library uses that API to provide its own API to applications using Qt. You use Qt, Qt uses WIN32, WIN32 uses lower levels of the Windows operating system, and so on until it's electrical signals in the hardware.
You're right that in general, libraries cannot make anything possible that isn't already possible.
But the libraries don't have to be written in C++ in order to be usable by a C++ program. Even if they are written in C++, they may internally use other libraries not written in C++. So the fact that C++ didn't provide any way to do it doesn't prevent it from being added, so long as there is some way to do it outside of C++.
At a quite low level, some functions called by C++ (or by C) will be written in assembly, and the assembly contains the required instructions to do whatever isn't possible (or isn't easy) in C++, for example to call a system function. At that point, that system call can do anything your computer is capable of, simply because there's nothing stopping it.
C and C++ have 2 properties that allow all this extensibility that the OP is talking about.
C and C++ can access memory
C and C++ can call assembly code for instructions not in the C or C++ language.
In the kernel or on a basic non-protected mode platform, peripherals like the serial port or disk drive are mapped into the memory map in the same way as RAM is. Memory is a series of switches, and flipping the switches of the peripheral (like a serial port or disk drive) gets your peripheral to do useful things.
In a protected mode operating system, when one wants to access the kernel from userspace (say when writing to the file system or drawing a pixel on the screen), one needs to make a system call. C does not have an instruction to make a system call, but C can call assembler code which can trigger the correct system call. This is what allows one's C code to talk to the kernel.
In order to make programming a particular platform easier, system calls are wrapped in more complex functions which may perform some useful function within one's own program. One is free to call the system calls directly (using assembler) but it is probably easier to just make use of one of the wrapper functions that the platform supplies.
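As a sketch of the difference between a system call and its wrapper, the following assumes Linux with glibc. The generic syscall() function (itself a small piece of library code containing the required assembly) lets you name the system call by number, while getpid() and write() are the friendlier wrappers. The names raw_getpid and raw_write are made up for this example.

```cpp
#include <sys/syscall.h>
#include <unistd.h>

// Ask the kernel for the process ID through the generic syscall() entry point.
inline long raw_getpid() {
    return ::syscall(SYS_getpid);
}

// Write bytes to a file descriptor without going through the write() wrapper.
inline long raw_write(int fd, const char* data, unsigned long len) {
    return ::syscall(SYS_write, fd, data, len);
}
```

raw_getpid() returns the same value as getpid(), and raw_write(1, "hi\n", 3) prints to standard output just like write() would. The wrappers mainly add type safety and portability.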
There is another level of API that is a lot more useful than a system call. Take for example malloc. Not only will this call the system to obtain large blocks of memory, but it will also manage this memory by doing all the bookkeeping on what is taking place.
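A toy version of that bookkeeping, as a sketch only: a "bump" allocator that obtains one large block up front and hands out slices of it. The class name Arena is hypothetical, and std::malloc merely stands in for the request a real allocator would make to the operating system (for example mmap or sbrk on Linux). Real malloc implementations also track freed blocks, which this does not.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical bump allocator: one big block, sliced up on demand.
class Arena {
public:
    // Assumption: std::malloc stands in for asking the OS for a large block.
    explicit Arena(std::size_t capacity)
        : base_(static_cast<char*>(std::malloc(capacity))),
          capacity_(base_ ? capacity : 0),
          used_(0) {}
    ~Arena() { std::free(base_); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Return `size` suitably aligned bytes, or nullptr when the block is exhausted.
    void* allocate(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        std::size_t start = (used_ + align - 1) / align * align;
        if (start > capacity_ || size > capacity_ - start) return nullptr;
        used_ = start + size;  // the bookkeeping: remember what has been handed out
        return base_ + start;
    }

    std::size_t used() const { return used_; }

private:
    char* base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

Many small allocate() calls are served from the same block, so the expensive request to the system happens once rather than per allocation.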
Win32 APIs wrap some graphics functionality with a common platform widget set. Qt takes this a bit further by wrapping the Win32 (or X Window System) API in a cross-platform way.
Fundamentally though, a C compiler turns C code into machine code, and since the computer is designed to use machine code, you should expect C to be able to accomplish the lion's share of what a computer can do. All that the wrapper libraries do is the heavy lifting for you, so that you don't have to.
Languages (like C++11) are specifications, on paper, usually written in English. Look inside the latest C++11 draft (or buy the costly final spec from your ISO vendor).
You generally use a computer with some language implementation. (You could in principle run a C++ program without any computer, e.g. using a bunch of human slaves interpreting it; that would be unethical and inefficient.)
Your C++ implementation generally works above some operating system and communicates with it (using some implementation-specific code, often in some system library). Generally that communication is done through system calls. Look for instance into syscalls(2) for a list of system calls available on the Linux kernel.
From the application's point of view, a syscall is an elementary machine instruction, like SYSCALL on x86-64 (or SYSENTER on 32-bit x86), with some conventions (the ABI).
On my Linux desktop, the Qt libraries sit above X11 client libraries, communicating with the X11 server Xorg through the X Window System protocol.
On Linux, use ldd on your executable to see the (long) list of dependencies on libraries. Use pmap on your running process to see which ones are "loaded" at runtime. By the way, on Linux your application is probably using only free software, so you could study its source code (from Qt, to Xlib, libc, ... the kernel) to understand more of what is happening.
I think the concept you are missing is system calls. Each operating system provides an enormous amount of resources and functionality that you can tap into to do low-level operating system related things. Even when you call a regular library function, it is probably making a system call behind the scenes.
System calls are a low-level way of making use of the power of the operating system, but can be complex and cumbersome to use, so are often "wrapped" in APIs so that you don't have to deal with them directly. But underneath, just about anything you do that involves O/S related resources will use system calls, including printing, networking and sockets, etc.
In the case of windows, Microsoft Windows has its GUI actually written into the kernel, so there are system calls for making windows, painting graphics, etc. In other operating systems, the GUI may not be a part of the kernel, in which case as far as I know there wouldn't be any system calls for GUI related things, and you could only work at an even lower level with whatever low-level graphics and input related calls are available.
Good question. Every new C or C++ developer has this in mind. I am assuming a standard x86 machine for the rest of this post. If you are using Microsoft C++ compiler, open your notepad and type this (name the file Test.c)
int main(int argc, char **argv)
{
    return 0;
}
And now compile this file (using developer command prompt) cl Test.c /FaTest.asm
Now open Test.asm in your notepad. What you see is the translated code - C/C++ is translated to assembler. Do you get the hint ?
_main PROC
push ebp
mov ebp, esp
xor eax, eax
pop ebp
ret 0
_main ENDP
C/C++ programs are designed to run on the metal. Which means they have access to lower level hardware which makes it easier to exploit the capabilities of the hardware. Say, I am going to write a C library getch() on a x86 machine.
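Depending on the assembler, I would type something like this (it relies on BIOS interrupt 16h, so it only works in a 16-bit real-mode environment such as DOS):
_getch proc
    xor AH, AH
    int 16h
    ;AL contains the keycode (AX is already there - so just return)
    ret
_getch endp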
I run it over with an assembler and generate a .OBJ - Name it getch.obj.
I then write a C program (I don't #include anything):
extern char getch(void);
int main(void)
{
    getch();
    return 0;
}
Now name this file - GetChTest.c. Compile this file by passing getch.obj along. (Or compile individually to .obj and LINK GetChTest.Obj and getch.Obj together to produce GetChTest.exe).
Run GetChTest.exe and you would find that it waits for the keyboard input.
C/C++ programming is not just about the language. To be a good C/C++ programmer you need to have a good understanding of the type of machine that it runs on. You will need to know how memory management is handled, how the registers are structured, etc. You may not need all this information for regular programming, but it would help you immensely. Apart from basic hardware knowledge, it certainly helps if you understand how the compiler works (i.e., how it translates), which could enable you to tweak your code as necessary. It is an interesting package!
Many compilers also support inline assembly (for example, the __asm keyword in Microsoft's compiler), which means you could mix assembly language into your code too. Learning C and C++ will make you a better-rounded programmer overall.
It is not necessary to always link with assembler. I mentioned it because I thought it would help you understand better. Most such library calls make use of system calls / APIs provided by the operating system (the OS in turn does the hardware interaction stuff).
How does C++ ... suddenly get such capabilities through libraries written in C++ themselves?
There's nothing magical about using other libraries. Libraries are simply big bags of functions that you can call.
Consider yourself writing a function like this
void addExclamation(std::string &str)
{
str.push_back('!');
}
Now if you include that file you can write addExclamation(myVeryOwnString);. Now you might ask, "how did C++ suddenly get the capability to add exclamation points to a string?" The answer is easy: you wrote a function to do that then you called it.
So to answer your question about how C++ can get capabilities to draw windows through libraries written in C++, the answer is the same. Someone else wrote function(s) to do that, and then compiled them and gave them to you in the form of a library.
The other questions answer how the window drawing actually works, but you sounded confused about how libraries work so I wanted to address the most fundamental part of your question.
The key is that the operating system can expose an API, along with a detailed description of how this API is to be used.
The operating system offers a set of APIs with calling conventions.
The calling convention defines how parameters are passed to the API, how results are returned, and how the actual call is executed.
Operating systems and the compilers creating code for them play nicely together, so you usually do not have to think about it; you just use it.
There is no need for a special syntax for creating windows. All that is required is that the OS provides an API to create windows. Such an API consists of simple function calls for which C++ does provide syntax.
Furthermore C and C++ are so called systems programming languages and are able to access arbitrary pointers (which might be mapped to some device by the hardware). Additionally, it is also fairly simple to call functions defined in assembly, which allows the full range of operations the processor provides. Therefore it is possible to write an OS itself using C or C++ and a small amount of assembly.
It should also be mentioned that Qt is a bad example, as it uses a so-called meta-object compiler (moc) to extend C++'s syntax. This is however not related to its ability to call into the APIs provided by the OS to actually draw or create windows.
First, there's a little misunderstanding, I think.
How does C++, which previously had no syntax capable of asking the OS for a window or a way to communicate through networks
There is no syntax for doing OS operations. It's a question of semantics.
suddenly get such capabilities through libraries written in C++ themselves
Well, the operating system is written mostly in C. You can use shared libraries (.so, .dll) to call external code. Additionally, the operating system code can register system routines on syscalls or interrupts, which you can call using assembly. Those shared libraries often just make the system calls for you, so you are spared using inline assembly.
Here's a nice tutorial on that: http://www.win.tue.nl/~aeb/linux/lk/lk-4.html
It's for Linux, but the principles are the same.
How does the operating system perform operations on graphics cards, network cards, etc.? It's a very broad topic, but mostly you need to access interrupts or ports, or write some data to a special memory region. Since those operations are protected, you need to call them through the operating system anyway.
In an attempt to provide a slightly different view to other answers, I shall answer like this.
(Disclaimer: I am simplifying things slightly, the situation I give is purely hypothetical and is written as a means of demonstrating concepts rather than being 100% true to life).
Think of things from the other perspective, imagine you've just written a simple operating system with basic threading, windowing and memory management capabilities. You want to implement a C++ library to let users program in C++ and do things like make windows, draw onto windows etc. The question is, how to do this.
Firstly, since C++ compiles to machine code, you need to define a way to use machine code to interface with C++. This is where functions come in, functions accept arguments and give return values, thus they provide a standard way of transferring data between different sections of code. They do this by establishing something known as a calling convention.
A calling convention states where and how arguments should be placed in memory so that a function can find them when it gets executed. When a function gets called, the calling function places the arguments in memory and then asks the CPU to jump over to the other function, where it does what it does before jumping back to where it was called from. This means that the code being called can be absolutely anything and it will not change how the function is called. In this case however, the code behind the function would be relevant to the operating system and would operate on the operating system's internal state.
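The point that "the code being called can be absolutely anything" can be seen with a plain function pointer. This is only a sketch with made-up names (add_one, square, call_through): the caller is compiled once against the signature, and which body actually runs is decided later.

```cpp
// Two unrelated implementations that happen to share one signature.
inline int add_one(int x) { return x + 1; }
inline int square(int x) { return x * x; }

// The caller only knows the signature, not the code behind it.
using IntFn = int (*)(int);

// Places the argument where the calling convention says it goes,
// jumps to whatever address fn holds, and picks up the return value.
inline int call_through(IntFn fn, int arg) {
    return fn(arg);
}
```

call_through(add_one, 4) and call_through(square, 4) use the identical call sequence. An operating system entry point exposed with a known signature can be called the same way, even though the code behind it manipulates the OS's internal state.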
So, many months later, you've got all your OS functions sorted out. Your user can call functions to create windows and draw onto them, they can make threads and all sorts of wonderful things. Here's the problem though: your OS's functions are going to be different from Linux's functions or Windows' functions. So you decide you need to give the user a standard interface so they can write portable code. Here is where Qt comes in.
As you almost certainly know, Qt has loads of useful classes and functions for doing the sorts of things that operating systems do, but in a way that appears independent of the underlying operating system. The way this works is that Qt provides classes and functions that are uniform in the way they appear to the user, but the code behind the functions is different for each operating system. For example, Qt's QApplication::closeAllWindows() would actually be calling each operating system's specialised window closing function depending on the version used. On Windows it would most likely call DestroyWindow(hwnd), whereas on an OS using the X Window System it would potentially call XDestroyWindow(display, window).
As is evident, an operating system has many layers, all of which have to interact through interfaces of many varieties. There are many aspects I haven't even touched on, but to explain them all would take a very long time. If you are further interested in the inner workings of operating systems, I recommend checking out the OS dev wiki.
Bear in mind though that the reason many operating systems choose to expose interfaces to C/C++ is that they compile to machine code, they allow assembly instructions to be mixed in with their own code and they provide a great degree of freedom to the programmer.
Again, there is a lot going on here. I would like to go on to explain how libraries like .so and .dll files do not have to be written in C/C++ and can be written in assembly or other languages, but I feel that if I add any more I might as well write an entire article, and as much as I'd love to do that I don't have a site to host it on.
When you try to draw something on the screen, your code calls some other piece of code which calls some other code (etc.) until finally there is a "system call", which is a special instruction that the CPU can execute. These instructions can be either written in assembly or can be written in C++ if the compiler supports their "intrinsics" (which are functions that the compiler handles "specially" by converting them into special code that the CPU can understand). Their job is to tell the operating system to do something.
When a system call happens, a function gets called that calls another function (etc.) until finally the display driver is told to draw something on the screen. At that point, the display driver looks at a particular region in physical memory which is actually not memory, but rather an address range that can be written to as if it were memory. Instead, however, writing to that address range causes the graphics hardware to intercept the memory write, and draw something on the screen.
Writing to this region of memory is something that could be coded in C++, since on the software side it's just a regular memory access. It's just that the hardware handles it differently.
So that's a really basic explanation of how it can work.
Your C++ program is using the Qt library (also coded in C++). The Qt library will be using the Windows CreateWindowEx function (which was coded in C inside user32.dll). Or under Linux it may be using Xlib (also coded in C), but it could as well be sending the raw bytes that in the X protocol mean "Please create a window for me".
Related to your catch-22 question is the historical note that “the first C++ compiler was written in C++”, although actually it was a C compiler with a few C++ notions, enough so it could compile the first version, which could then compile itself.
Similarly, the GCC compiler uses GCC extensions: it is first compiled to a version then used to recompile itself. (GCC build instructions)
As I see it, this is actually a compiler question.
Look at it this way: you write a piece of code in assembly (you can do it in any language) which translates your newly written language, which you want to call Z++, into assembly. For simplicity let's call it a compiler (it is a compiler).
Now you give this compiler some basic functions, so that you can write int, string, arrays etc. Actually, you give it enough abilities so that you can write the compiler itself in Z++. Now you have a compiler for Z++ written in Z++, pretty neat, right?
What's even cooler is that now you can add abilities to that compiler using the abilities it already has, thus expanding the Z++ language with new features by using the previous features.
For example, if you write enough code to draw a pixel in any color, then you can build on it in Z++ to draw anything you want.
The hardware is what allows this to happen. You can think of the graphics memory as a large array (consisting of every pixel on the screen). To draw to the screen you can write to this memory using C++ or any language that allows direct access to that memory. That memory just happens to be accessible by or located on the graphics card.
On modern systems, accessing the graphics memory directly would require writing a driver because of various restrictions, so you use indirect means: libraries that create a window (really just an image like any other image) and then write that image to the graphics memory, which the GPU then displays on screen. Nothing has to be added to the language except the ability to write to specific memory locations, which is what pointers are for.
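A small sketch of the "graphics memory as a large array" picture. Here an ordinary std::vector stands in for the real graphics memory, and the names Framebuffer, put_pixel and draw_hline are hypothetical; on real hardware a driver would obtain a pointer to the actual memory region instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for graphics memory: one 32-bit value per pixel.
struct Framebuffer {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint32_t> pixels;

    Framebuffer(std::size_t w, std::size_t h)
        : width(w), height(h), pixels(w * h, 0) {}

    // Pixels are stored row by row, so (x, y) lives at index y * width + x.
    bool put_pixel(std::size_t x, std::size_t y, std::uint32_t color) {
        if (x >= width || y >= height) return false;
        pixels[y * width + x] = color;
        return true;
    }
};

// Larger drawing operations are built from the single-pixel primitive.
inline void draw_hline(Framebuffer& fb, std::size_t x0, std::size_t x1,
                       std::size_t y, std::uint32_t color) {
    for (std::size_t x = x0; x <= x1; ++x) fb.put_pixel(x, y, color);
}
```

Nothing here needs special language support: drawing is an indexed store, and everything richer (lines, rectangles, text, windows) is layered on top of it in ordinary code.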
