JIT compilation of CUDA __device__ functions

I have a fixed kernel and I want the ability to incorporate user-defined device functions to alter the output. The user-defined functions will always have the same input arguments and will always output a scalar value. If I knew the user-defined functions at compile time, I could just pass them in as pointers to the kernel (and have a default device function that operates on the input if given no function). I have access to the user-defined function's PTX code at runtime and am wondering if I could use something like NVIDIA's jitify to compile the PTX at run time, get a pointer to the device function, and then pass this device function to the precompiled kernel function.
I have seen a few postings that get close to answering this (How to generate, compile and run CUDA kernels at runtime), but most suggest compiling the entire kernel along with the device function at runtime. Given that the device function has fixed inputs and outputs, I don't see any reason why the kernel function couldn't be compiled ahead of time. The piece I am missing is how to compile just the device function at run time and get a pointer to it that I can then pass to the kernel function.

You can do that as follows:
1. Build your CUDA project with --keep, and look up the generated PTX or cubin for your project.
2. At runtime, generate your PTX (in our experiment, we needed to store the function pointer in a device memory region, declaring a global variable).
3. Build a new module at runtime, starting with cuLinkCreate, adding first the PTX or cubin from the --keep output and then your runtime-generated PTX with cuLinkAddData.
4. Finally, call your kernel. But you need to launch the kernel from the freshly generated module, not with the <<<>>> notation; in the latter case the launch would use the module where the function pointer is not known. This last phase should be done using the driver API (you may also want to try the runtime API's cudaLaunchKernel).
The main element is to make sure you call the kernel from the generated module, and not from the module that is magically linked with your program.
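Here is a minimal sketch of that flow using the driver API. It assumes the pre-compiled kernel PTX was saved as kernel.ptx from the --keep build, the user function's PTX is in user_func.ptx, and the kernel is named myKernel and takes no parameters; all of those names are placeholders.

#include <cuda.h>
#include <cstdio>

#define CU_CHECK(x) do { CUresult r = (x); if (r != CUDA_SUCCESS) { \
    const char *msg; cuGetErrorString(r, &msg); \
    std::fprintf(stderr, "%s failed: %s\n", #x, msg); return 1; } } while (0)

int main() {
    CU_CHECK(cuInit(0));
    CUdevice dev; CU_CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CU_CHECK(cuCtxCreate(&ctx, 0, dev));

    // Link the ahead-of-time kernel PTX with the runtime-generated PTX.
    CUlinkState link;
    CU_CHECK(cuLinkCreate(0, nullptr, nullptr, &link));
    CU_CHECK(cuLinkAddFile(link, CU_JIT_INPUT_PTX, "kernel.ptx", 0, nullptr, nullptr));
    CU_CHECK(cuLinkAddFile(link, CU_JIT_INPUT_PTX, "user_func.ptx", 0, nullptr, nullptr));
    void *cubin; size_t cubinSize;
    CU_CHECK(cuLinkComplete(link, &cubin, &cubinSize));

    // Load the linked image and launch the kernel from THIS module,
    // not via the <<<>>> syntax of the statically linked program.
    CUmodule mod; CU_CHECK(cuModuleLoadData(&mod, cubin));
    CUfunction kernel; CU_CHECK(cuModuleGetFunction(&kernel, mod, "myKernel"));
    CU_CHECK(cuLaunchKernel(kernel, 16, 1, 1, 4, 1, 1, 0, nullptr, nullptr, nullptr));
    CU_CHECK(cuCtxSynchronize());

    CU_CHECK(cuLinkDestroy(link));
    CU_CHECK(cuCtxDestroy(ctx));
    return 0;
}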

I have access to the user-defined function's PTX code at runtime and am wondering if I could use something like NVIDIA's jitify to compile the PTX at run time, get a pointer to the device function, and then pass this device function to the precompiled kernel function.
No, you cannot do that. NVIDIA's APIs do not expose device functions, only complete kernels, so there is no way to obtain a pointer to a runtime-compiled device function.
You can, however, perform runtime linking of a pre-compiled kernel (PTX or cubin) with device functions you compile at runtime using NVRTC. You can only do this via the driver module APIs; that functionality is not exposed by the runtime API (and, based on my understanding of how the runtime API works, it probably can't be exposed without some major architectural changes to the way embedded statically compiled code is injected at runtime).
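For the NVRTC half, a sketch along these lines would compile just the user's __device__ function to PTX, which is then what you feed to cuLinkAddData alongside the pre-compiled kernel. The function name and source string are illustrative; the relocatable-device-code option is needed so the function survives as a linkable symbol.

#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

std::string compileUserFuncToPtx(const char *src) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "user_func.cu", 0, nullptr, nullptr);
    // Compile as relocatable device code so it can be linked later.
    const char *opts[] = { "--relocatable-device-code=true" };
    if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        size_t n; nvrtcGetProgramLogSize(prog, &n);
        std::vector<char> log(n);
        nvrtcGetProgramLog(prog, log.data());
        std::fprintf(stderr, "NVRTC: %s\n", log.data());
    }
    size_t n; nvrtcGetPTXSize(prog, &n);
    std::string ptx(n, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    return ptx;
}

// A user function with the fixed signature (extern "C" avoids name mangling).
const char *userSrc =
    "extern \"C\" __device__ float userFunc(float x) { return x * x; }\n";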

Related

A function in an application (.exe) should be called only once, regardless of how many times I run the same application

Suppose there are two functions, one to print "hello" and the other to print "world", and I call these two functions inside the main function. When I compile, it will create a .exe file. When I run this .exe for the first time, both functions will print "hello world". Then the .exe terminates.
But if I run the same .exe a second time, or any number of times after that, only one function must execute, i.e. it should print only "world". I want a piece of code or function that runs only once; after that it should destroy itself and never be executed again, regardless of how many times I run the application (.exe).
I can achieve this by writing some value locally or to the Windows registry once, and then checking whether that value is present; if it is, that piece of code or function is not executed.
Can I achieve this without any external help, so that the application itself is capable of this behaviour?
Any ideas are appreciated. Thanks for reading.
There is no coherent or portable way1 to do this from software without requiring the use of an external resource of some kind.
The issue is that you want each invocation of this process to be aware of the number of times it has been executed, but that count is not a property that is recorded anywhere2. A program has no memory of its previous executions unless you program it to do so.
Your best bet is to write out this information to some canonical location so that it can be read on later executions. This could be a file in the filesystem (such as a hidden .firstrun file or something), or it could be through the registry (Windows-specific), or some other environment-specific form of communication.
The main thing is that this must persist between executions and be available to your process.
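A minimal sketch of the marker-file idea, matching the hello/world example from the question (C++17; the .firstrun file name and its location are placeholders, and a real application would choose a per-user data directory):

#include <filesystem>
#include <fstream>
#include <iostream>

int main() {
    const std::filesystem::path marker = ".firstrun";
    if (!std::filesystem::exists(marker)) {
        std::cout << "hello ";            // runs only on the first invocation
        std::ofstream(marker).put('1');   // persist that fact for later runs
    }
    std::cout << "world\n";               // runs every time
    return 0;
}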
1 You could potentially write code that overwrites the executable itself after the first invocation -- but this is extraordinarily brittle and will be highly specific to the executable format. It is neither an ideal nor a recommended approach to solving this problem.
2 This is not a capability defined by the C or C++ standard. It's possible that some specialized operating systems or flavors of Linux allow querying this, but it is not something seen in most general-purpose operating systems. Generally the approach is to communicate via an external resource.
Can I achieve this without any external help, so that the application itself is capable of this behaviour?
Not by any means defined by C or C++, and probably not on Windows at all.
You have to somehow, somewhere memorialize the fact that the one-time function has been called. If you have nothing but the compiled program to use for that, then the only alternative is to modify the program. Neither C nor C++ provides for live modification of the running program, much less for writing it back to the executable file containing its image.
Conceivably, if the program knows where to find or how to recreate its own source code, and if it knows how to run the compiler, then it could compile a modified version of itself. On Windows, however, it very likely could not overwrite its own executable file while it was running (though that would be possible on various other operating systems), so that would not solve the problem.
Moreover, note that any approach that involves modifying the executable would be at least a bit wonky, for different copies of the program would have their own, semi-independent idea of whether the one-time function had been run.
Basically, then, no.

How does the linker decide where code execution starts? [Embedded]

As a beginner in embedded C programming, I am very curious how every program (every one in my experience) starts execution with the main() function. Is it that the linker recognizes main() and puts the address of that "special" function at the address the reset vector points to?
Usually a linker script creates a special section that is mapped to the reset vector and contains a jump/goto instruction to the C startup code, which, in turn, calls main().
C defines different specifications for code that will run in a "hosted" environment and code that will run in a "freestanding" environment. Most programmers will go their whole careers without ever having to deal with a freestanding environment, but most of the exceptions are among those who work with embedded programming, kernel programming, boot loaders, and other software that runs on bare metal.
In a hosted environment, C specifies that program execution starts with a call to main(). That does not preclude preliminary setup performed by the system before that call, but that's outside the scope of the specification. The C compiler and/or linker is responsible for arranging for that to happen; the details are implementation-dependent.
In a freestanding implementation, on the other hand, the program entry point is determined in a manner chosen by the implementation. There might not be a main() function, and if there is one then its signature does not need to match those permitted to programs run in hosted environments.
It is not the linker that decides; it's the processor. On power-up, the instruction pointer is set to a predefined memory address, usually the same as the reset interrupt vector. Then the linker comes into play by placing the branch instruction to the startup code at that address.
The linker links a module for processor and runtime environment initialisation. That module is entered from the reset vector. In the gcc toolchain, the module is normally called crt0.o and is built from the source crt0.s (assembly code). Your toolchain may vary, but some sort of start-up code will be linked, and the source should be available for customisation.
The start-up code will typically perform hardware initialisation such as configuring the PLL for the desired clock speed, and initialising a memory controller if external memory is used. The C runtime initialisation requires the setting of the stack pointer, and the initialisation of global static data, and possibly runtime library initialisation - heap and stdio initialisation for example. For C++ it also invokes the constructors for any global static objects. Finally main() is called.
Note that it is not the linker specifically that knows about main(); main() is simply an unresolved symbol in the runtime start-up module. If your program did not have a main(), it would fail to link.
You could of course modify the start-up code to use a different symbol other than main(), but main() is defined by the language standard as the entry point.
Some application frameworks or environments may appear to not have a main(); for example in the RTOS VxWorks, applications start at usrAppInit(), but in fact that is simply because main() is defined in the VxWorks library.
The linker locates the start-up code according to either directives in the assembly source, or within the linker script; toolchains may differ.
On ARM Cortex-M devices, the initial stack pointer is defined in the vector table and loaded automatically; as a consequence, it is possible for these devices to run C code directly from reset (albeit in a somewhat limited environment), and allows much of the runtime environment initialisation to be written in C rather than assembler.
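As an illustration of that, here is a sketch of a minimal Cortex-M vector table and reset handler written entirely in C-style C++, performing the start-up work described above. The _estack/_sidata/_sdata/_edata/_sbss/_ebss symbols are assumed to come from a typical GNU linker script; names vary between toolchains.

#include <stdint.h>

extern uint32_t _estack;          // initial stack pointer (linker script)
extern uint32_t _sidata;          // load address of .data in flash
extern uint32_t _sdata, _edata;   // run address of .data in RAM
extern uint32_t _sbss, _ebss;     // bounds of .bss

extern "C" int main(void);

extern "C" void Reset_Handler(void) {
    // Copy initialised data from flash to RAM.
    uint32_t *src = &_sidata, *dst = &_sdata;
    while (dst < &_edata) *dst++ = *src++;
    // Zero the .bss section.
    for (dst = &_sbss; dst < &_ebss; ++dst) *dst = 0;
    // A full runtime would also run library init and C++ constructors here.
    main();
    for (;;) {}                   // main() should not return on bare metal
}

// Word 0 is the initial stack pointer, word 1 the reset vector; the
// Cortex-M hardware loads both automatically, so no assembler is needed.
__attribute__((section(".isr_vector"), used))
const void *vector_table[] = {
    &_estack,
    (const void *)Reset_Handler,
};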
Each processor and toolchain is different. Generally, though, they're set up so that the entry point of the run-time library (often _start) is reached from the reset vector. The run-time library prepares the processor state, clears .bss memory, initializes .data memory, maybe sets up the heap, and calls a few call-outs to allow customization of the startup, then calls all global constructors (if C++), before finally jumping to main().
It's a mix of hardware requirements, tool chain assumptions, run time library, and system code. You can trim a lot of it out, because the only real requirement for C is that you have a stack. The rest is library code you may or may not use.
To meet the standard, or at least programmers' expectations, before main() you need .bss cleared, compile-time-initialized variables set up (globals with an = something, for example), a C library, and other fun things. So you have a chicken-and-egg problem: how can you have C code with such assumptions or requirements, and also have C code that fills those requirements? You don't. There is other code, commonly assembly (though it could come from C written with the knowledge that the assumptions do not yet hold), sometimes called bootstrap code. It doesn't matter whether this is an embedded system or an application running on an operating system; there is some glue between the first instructions in that "program" and main(). If you disassemble something the GNU tools created, you can see this execution path between a label named _start and main. Other toolchains may or may not name their entry point differently.
On a microcontroller, or in any situation where you might be on bare metal (the BIOS on a PC, the startup code that launches the RTOS/OS), the bare minimum, if you don't care about some of the requirements/assumptions of C, is loading the stack pointer and branching to main(). Zeroing out .bss and copying .data from flash to its proper home in RAM are the next two things you need to get closer to the C language requirements, and you will find that those are all the steps you get in some embedded systems.
Probably other processors too, but the ARM Cortex-M hardware has the ability to load the stack pointer and branch to an address on reset (reset always branches to an address or runs code from some known address). Further, the interrupt system saves state for you, so you don't need to wrap asm around interrupt service routines written in C (or use a compiler-specific declaration that does the same thing). (This is the next question you would have needed to ask anyway: 1) reset to C code, 2) interrupts to C code.) As a result, the interrupt vector table can contain addresses of C functions directly. A nice feature of that product line.
Use the toolchain's disassembler and examine the code from the entry point to main(). Some toolchains, certainly in the past, would make assumptions when they saw main() specifically and add extra code, so sometimes you see some other C function name used as the first C function, to avoid the toolchain linking in other stuff.
Clifford hit the nail on the head: the linker is simply looking for unresolved symbols, one being main (with a GNU toolchain, another being _start), and it links in things it already knows about, or that you have provided on the command line, until all the symbols are resolved.

Halide extern methods

I use AOT compilation to use Halide code without Halide libraries.
I see in HalideRuntime.h (available in the sources) that many extern methods are available in my .o files.
halide_dev_malloc and halide_dev_free are very interesting. I already use halide_copy_to_dev without problems, but I see that it also allocates my memory.
If I want to do a simple memcpy between host and device, and use halide_dev_malloc for the allocation instead, is this possible?
Does HalideRuntime.h group all the available extern functions, or do the object files contain a lot of others?
Jay
HalideRuntime.h is intended to document all the routines that can be called or replaced by clients. There are many other symbols in the runtime, but they should be considered internal. We recently moved these other routines into their own namespace to indicate that they are internal.
The runtime for device backends is still a work in progress, and there will be an improved design intended to allow more flexibility and to allow code to do more while still working generically across multiple backends. At present, halide_dev_malloc will allocate the device handle for whichever device backend is selected via the Target at Halide compile time. However, this handle is backend-specific, and thus in order to do anything with it you must know which backend is used and how that backend interacts with the device API. E.g. in order to use the handle with memcpy, you need to know that the device backend supports some sort of uniform memory architecture ("Unified Virtual Address Space" in CUDA terminology) and that the device memory was allocated with the right API calls to make a memory buffer that can be accessed from both the device and the CPU with the same pointer, etc. Depending on which backend you are using and which platform you are on, that may or may not work at present. (Uniform memory designs are a fairly recent thing for the most part; we haven't put a lot of effort into supporting them.)
For CUDA/PTX, halide_dev_malloc calls cuMemAlloc and I think it may be in Unified Virtual Address Space on many systems by default, but I am not sure.
Yes, you can use halide_dev_malloc and copy things yourself manually. See https://github.com/halide/Halide/blob/master/src/runtime/cuda.cpp line 466 for what halide_copy_to_dev actually does.
First it does a halide_dev_malloc, and then uses CUDA's cuMemcpyHtoD. There's a bunch of extra logic there in case the buffer isn't dense in memory, but most of the time it turns into a single cuMemcpyHtoD.
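Under those caveats, a manual copy for a dense buffer might look roughly like this. It assumes the buffer_t layout from that era's HalideRuntime.h, that the CUDA backend is in use so buf->dev holds a CUdeviceptr, and that a CUDA context is current; halide_copy_to_dev's extra logic for non-dense buffers is deliberately omitted.

#include <cuda.h>
#include "HalideRuntime.h"

int copyDenseBufferToDev(buffer_t *buf) {
    // Let the Halide runtime allocate the device handle for this buffer.
    int err = halide_dev_malloc(/*user_context=*/nullptr, buf);
    if (err != 0) return err;
    // Size computation assumes a dense 1-D buffer; real code must honour
    // the extent/stride fields for higher dimensions.
    size_t bytes = (size_t)buf->extent[0] * buf->elem_size;
    return (int)cuMemcpyHtoD((CUdeviceptr)buf->dev, buf->host, bytes);
}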
I believe HalideRuntime.h contains all the useful extern functions. There are a few other internal ones like halide_create_cuda_context that might conceivably be interesting. To see all of them, look for functions marked as WEAK that start with the name halide_ in this folder: https://github.com/halide/Halide/tree/master/src/runtime
Or you can run nm on a halide-generated object file and see all the symbols that start with halide_.

Alternative methods of locating function offsets for use as function pointer?

When writing code that is to be injected into a running process, and that subsequently calls functions from within that application, you sometimes need to create a function pointer to call a function provided by the application itself - in the manner of a computer-based training application, a computer game hack, etc.
Function pointers are easy in C++, if you know the offset of the function. Finding those offsets is the time-consuming part if the application you're working with is frequently updated, because updates to the application may change the offsets.
Are there any methods of automatically tracking these offsets? I seem to recall hearing about fingerprinting methods or something that would attempt to automatically locate the functions for you. Any ideas about those?
This is very dependent on what you're injecting into and the environment you're running.
If you're in a Windows environment, I'd give this a read-through:
x86 code injection into an x86 process from an x64 process
In a Linux-type environment, you could do something with the global offset table.
You could always take a signature-based approach to finding the function. Or perhaps there is an exported function that calls the function you want to hook, and you could trace the logic from there.
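For the signature-based approach, a minimal pattern scan looks something like the following. The pattern bytes, the wildcard mask, and the idea of scanning the target module's .text range are all illustrative; a real scanner gets the module base and size from the OS (e.g. from the PE headers on Windows).

#include <cstdint>
#include <cstddef>

// true entries are wildcards for bytes that change between builds
// (embedded addresses, offsets, etc.).
static const uint8_t pattern[]  = { 0x55, 0x8B, 0xEC, 0x00, 0x00, 0x00, 0x00, 0x83 };
static const bool    wildcard[] = { false, false, false, true, true, true, true, false };

uintptr_t findSignature(const uint8_t *base, size_t size) {
    const size_t n = sizeof(pattern);
    for (size_t i = 0; i + n <= size; ++i) {
        size_t j = 0;
        while (j < n && (wildcard[j] || base[i + j] == pattern[j])) ++j;
        if (j == n) return (uintptr_t)(base + i);  // match survives rebuilds
    }
    return 0;  // not found
}

// Hypothetical usage: scan the module's code range, then cast the result:
//   using TargetFn = int (*)(int);
//   auto fn = (TargetFn)findSignature(textBase, textSize);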

What happens when a CUDA kernel is called?

I'm wondering what happens in a CUDA program when a line like
myKernel<<<16,4>>>(arg1,arg2);
is encountered.
What happens then? Is the CUDA driver invoked and the ptx code passed to it or what?
"It just works". Just kidding. Probably I will get flamed for posting this answer, as my knowledge is not extensive in this area. But this is what I can say:
nvcc is a compiler driver, meaning it uses multiple compilers and steers pieces of code in one direction or another. You might want to read more about the nvcc toolchain here if you have questions like these. Anyway, one of the things the nvcc tool will do is replace the kernel launch syntax mykernel<<<...>>> with a sequence of API calls (served by various CUDA and GPU API libraries). This is how the CUDA driver gets "invoked" under the hood.
As part of this invocation sequence, the driver will perform a variety of tasks. It will inspect the executable to see if it contains appropriate SASS (device assembly) code. The device does not actually execute PTX, which is an intermediate code, but SASS. If no appropriate SASS is available, but PTX code is available in the image, the driver will do a JIT-compile step to create the SASS. (In fact, some of this actually happens at context creation time/CUDA lazy initialization, rather than at the point of the kernel launch.)
Additionally, in the invocation sequence, the driver will do various types of device status checking, data validity checking (e.g. of kernel launch configuration parameters), and data copying (e.g. the kernel SASS code and kernel parameters) to the device.
Finally, the driver will initiate execution on the device, and then immediately return control to the host thread.
Additional insight into kernel execution may be obtained by studying kernel execution in the driver API. To briefly describe the driver API, I could call it a "lower level" API than the cuda runtime API. However, the point of mentioning it is that it may give some insight into how a kernel launch syntax (runtime API) could be translated into a C-level API that actually looks like library calls.
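For a feel of that translation, a launch of myKernel<<<16,4>>>(arg1, arg2) corresponds roughly to the following driver API calls (a sketch only; the runtime API does considerably more bookkeeping, and looking the kernel up by its unmangled name assumes it was declared extern "C"):

#include <cuda.h>

void launchMyKernel(CUmodule mod, int *arg1, float arg2) {
    CUfunction f;
    cuModuleGetFunction(&f, mod, "myKernel");  // look up the compiled entry point
    void *params[] = { &arg1, &arg2 };         // addresses of each kernel argument
    cuLaunchKernel(f,
                   16, 1, 1,   // grid dimensions  (<<<16, ...>>>)
                   4, 1, 1,    // block dimensions (<<<..., 4>>>)
                   0,          // dynamic shared memory bytes
                   nullptr,    // stream (default)
                   params, nullptr);
    // Control returns to the host thread immediately; the launch is asynchronous.
}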
Someone else may come along with a better/more detailed explanation.