What happens when a CUDA kernel is called?

I'm wondering what happens in a CUDA program when a line like
myKernel<<<16,4>>>(arg1,arg2);
is encountered.
What happens then? Is the CUDA driver invoked and the PTX code passed to it, or what?

"It just works". Just kidding. Probably I will get flamed for posting this answer, as my knowledge is not extensive in this area. But this is what I can say:
The nvcc code processor is a compiler driver, meaning it uses multiple compilers and steers pieces of code in one direction or another. You might want to read more about the nvcc toolchain here if you have questions like these. Anyway, one of the things the nvcc tool will do is replace the kernel launch syntax myKernel<<<...>>> with a sequence of API calls (served by various CUDA and GPU API libraries). This is how the CUDA driver gets "invoked" under the hood.
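As a rough sketch only (the code nvcc actually generates uses internal helpers and differs between CUDA versions), the launch in the question is conceptually close to the runtime API call below; the argument types are assumptions made for illustration:

    #include <cuda_runtime.h>

    __global__ void myKernel(int arg1, float *arg2) { /* ... */ }

    void launchMyKernel(int arg1, float *arg2)
    {
        void *args[] = { &arg1, &arg2 };   // pointers to each kernel argument
        dim3 grid(16);                     // <<<16, ...>>>
        dim3 block(4);                     // <<<..., 4>>>
        cudaError_t err = cudaLaunchKernel((void *)myKernel, grid, block,
                                           args, 0 /*sharedMem*/, 0 /*stream*/);
        // The call returns as soon as the launch is queued; it does not wait
        // for the kernel to finish running on the device.
        if (err != cudaSuccess) { /* handle the launch error */ }
    }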
As part of this invocation sequence, the driver will perform a variety of tasks. It will inspect the executable to see if it contains appropriate SASS (device assembly) code. The device does not actually execute PTX, which is an intermediate code, but SASS. If no appropriate SASS is available, but PTX code is available in the image, the driver will do a JIT-compile step to create the SASS. (In fact, some of this actually happens at context creation time/CUDA lazy initialization, rather than at the point of the kernel launch.)
Additionally, in the invocation sequence, the driver will do various types of device status checking, data validity checking (e.g. kernel launch configuration parameters), and data copying (e.g. kernel SASS code, kernel parameters) to the device.
Finally, the driver will initiate execution on the device, and then immediately return control to the host thread.
Additional insight into kernel execution may be obtained by studying kernel execution in the driver API. To briefly describe the driver API, I could call it a "lower level" API than the CUDA runtime API. However, the point of mentioning it is that it may give some insight into how the kernel launch syntax (runtime API) could be translated into a sequence of C-level API calls that look like ordinary library calls.
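For example, a minimal driver API launch looks roughly like the sketch below. The file name "kernels.ptx" and the kernel name "myKernel" are placeholders (the kernel would need an unmangled, e.g. extern "C", name), and error checking is omitted:

    #include <cuda.h>

    void driverApiLaunch(CUdeviceptr d_data, int n)
    {
        cuInit(0);
        CUdevice  dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

        CUmodule   mod; cuModuleLoad(&mod, "kernels.ptx");        // JIT-compiled to SASS here if needed
        CUfunction fn;  cuModuleGetFunction(&fn, mod, "myKernel");

        void *params[] = { &d_data, &n };                         // kernel parameters
        cuLaunchKernel(fn,
                       16, 1, 1,    // grid dimensions  (<<<16, ...>>>)
                       4,  1, 1,    // block dimensions (<<<..., 4>>>)
                       0,           // dynamic shared memory bytes
                       0,           // stream
                       params, nullptr);
        cuCtxSynchronize();         // wait for the kernel to finish
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
    }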
Someone else may come along with a better/more detailed explanation.

JIT compilation of CUDA __device__ functions

I have a fixed kernel and I want the ability to incorporate user defined device functions to alter the output. The user defined functions will always have the same input arguments and will always output a scalar value. If I knew the user defined functions at compile time I could just pass them in as pointers to the kernel (and have a default device function that operates on the input if given no function). I have access to the user defined function's PTX code at runtime and am wondering if I could use something like NVIDIA's jitify to compile the PTX at run time, get a pointer to the device function, and then pass this device function to the precompiled kernel function.
I have seen a few postings that get close to answering this (How to generate, compile and run CUDA kernels at runtime) but most suggest compiling the entire kernel along with the device function at runtime. Given that the device function has fixed inputs and outputs, I don't see any reason why the kernel function couldn't be compiled ahead of time. The piece I am missing is how to compile just the device function at runtime and get a pointer to it that I can then pass to the kernel function.
You can do that as follows:
Generate your CUDA project with --keep, and look up the generated PTX or cubin for your project.
At runtime, generate your PTX (in our experiment, we needed to store the function pointer in a device memory region, declared as a global variable).
Build a new module at runtime, starting with cuLinkCreate, adding first the PTX or cubin from the --keep output and then your runtime-generated PTX with cuLinkAddData.
Finally, call your kernel. But you need to call the kernel using the freshly generated module, not using the <<<>>> notation; the latter would launch the kernel from the statically linked module, where the function pointer is not known. This last phase should be done using the driver API (you may also want to try the runtime API's cudaLaunchKernel).
The main point is to make sure to call the kernel from the generated module, and not from the module that is magically linked with your program; a sketch of this flow follows.
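A hedged sketch of that module-linking flow with the driver API is below. The file names ("kernel.cubin", "userfunc.ptx") and the kernel name "myKernel" are placeholders, the kernel is assumed to have been built with relocatable device code (-rdc=true), and error checking is omitted:

    #include <cuda.h>
    #include <string>

    CUfunction linkAndGetKernel(const std::string &userPtx)
    {
        CUlinkState link;
        cuLinkCreate(0, nullptr, nullptr, &link);

        // 1. The ahead-of-time compiled kernel (taken from the --keep output).
        cuLinkAddFile(link, CU_JIT_INPUT_CUBIN, "kernel.cubin", 0, nullptr, nullptr);

        // 2. The runtime-generated PTX containing the user's __device__ function.
        cuLinkAddData(link, CU_JIT_INPUT_PTX,
                      (void *)userPtx.c_str(), userPtx.size() + 1,
                      "userfunc.ptx", 0, nullptr, nullptr);

        void *cubin = nullptr; size_t cubinSize = 0;
        cuLinkComplete(link, &cubin, &cubinSize);   // produce the linked image

        CUmodule mod;
        cuModuleLoadData(&mod, cubin);              // load the freshly linked module
        cuLinkDestroy(link);                        // safe: the image is already loaded

        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "myKernel");  // launch this with cuLaunchKernel,
        return fn;                                  // not with the <<<>>> notation
    }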
"I have access to the user defined function's PTX code at runtime and am wondering if I could use something like NVIDIA's jitify to compile the PTX at run time, get a pointer to the device function, and then pass this device function to the precompiled kernel function."
No, you cannot do that. NVIDIA's APIs do not expose device functions, only complete kernels, so there is no way to obtain a pointer to a runtime-compiled device function.
You can perform runtime linking of a pre-compiled kernel (PTX or cubin) with device functions you runtime compile using NVRTC. However, you can only do this via the driver module APIs. That functionality is not exposed by the runtime API (and based on my understanding of how the runtime API works it probably can't be exposed without some major architectural changes to the way embedded statically compiled code is injected at runtime).
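For the runtime-compilation half of that approach, compiling just the user's __device__ function to PTX with NVRTC might look roughly like this (the file name and architecture option are illustrative, and error checking is omitted); the resulting PTX string can then be fed to the cuLink flow sketched above:

    #include <nvrtc.h>
    #include <string>

    std::string compileDeviceFunctionToPtx(const char *deviceFuncSource)
    {
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, deviceFuncSource, "userfunc.cu", 0, nullptr, nullptr);

        // Relocatable device code so the function can later be linked against
        // the pre-compiled kernel.
        const char *opts[] = { "--relocatable-device-code=true",
                               "--gpu-architecture=compute_70" };
        nvrtcCompileProgram(prog, 2, opts);

        size_t ptxSize = 0;
        nvrtcGetPTXSize(prog, &ptxSize);
        std::string ptx(ptxSize, '\0');
        nvrtcGetPTX(prog, &ptx[0]);

        nvrtcDestroyProgram(&prog);
        return ptx;
    }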

kernel vs user-space audio device driver on macOS

I need to develop an audio device driver for system audio capture (based on Soundflower). But a problem soon appeared: the IOAudioFamily stack seems to be deprecated in OS X 10.10 and later. Looking through the IOAudioDevice and IOAudioEngine header files, it seems that Apple now recommends using the <CoreAudio/AudioServerPlugIn.h> API, which runs in user space. But I can't find much information on this user-space device driver topic. It seems that the only resource is the Apple-provided sample devices from https://developer.apple.com/library/prerelease/content/samplecode/AudioDriverExamples/Introduction/Intro.html
Looking through the examples, I find that it's a lot more work to develop a user-space driver than an I/O Kit kernel-based one.
So the question arises: what should motivate one to develop a device driver in user space instead of kernel space?
The "SimpleAudioDriver" example is somewhat misnamed. It demonstrates pretty much every feature of the API. This is handy as a reference if you actually need to use those features. It's also structured in a way that's maybe a little more complicated than necessary.
For a virtual device, the NullAudioDriver is probably a much better base, and much, much easier to understand (single source file, if I remember correctly). SimpleAudioDriver is more useful for dealing with issues such as hotplugging, multiple instances of identical devices, etc.
IOAudioEngine is deprecated as you say, and has been since OS X 10.10. Expect it to go away eventually, so if you build your driver with it, you'll probably need to rewrite it sooner than if you create a Core Audio Server Plugin based one.
Testing and debugging audio drivers is awkward either way (due to being so time sensitive), but I'd say userspace ones are slightly less frustrating to deal with. You'll still want to test on a different machine than your development Mac, because if coreaudiod crashes or hangs, apps usually start locking up too, so being able to just ssh in, delete your plugin and kill coreaudiod is handy. Certainly quicker turnaround than having to reboot.
(FWIW, I've shipped both kernel and userspace OS X audio drivers, and I spend a lot of time working on kexts.)
There is a great book on this subject, available free online here:
http://free-electrons.com/doc/books/ldd3.pdf
See page 37 for a summary of why you might want a user-space driver, copied here for convenience:
The advantages of user-space drivers are:
The full C library can be linked in. The driver can perform many exotic tasks without resorting to external programs (the utility programs implementing usage policies that are usually distributed along with the driver itself).
The programmer can run a conventional debugger on the driver code without having to go through contortions to debug a running kernel.
If a user-space driver hangs, you can simply kill it. Problems with the driver are unlikely to hang the entire system, unless the hardware being controlled is really misbehaving.
User memory is swappable, unlike kernel memory. An infrequently used device with a huge driver won’t occupy RAM that other programs could be using, except when it is actually in use.
A well-designed driver program can still, like kernel-space drivers, allow concurrent access to a device.
If you must write a closed-source driver, the user-space option makes it easier for you to avoid ambiguous licensing situations and problems with changing kernel interfaces.

How does an OpenGL program interface with different graphics cards

From what I understand (correct me if I am wrong), the OpenGL API converts the function calls written by the programmer in the source code into the specific GPU driver calls for our graphics card. Then the GPU driver is able to actually send instructions and data to the graphics card through some hardware interface like PCIe, AGP or PCI.
My question is: does OpenGL know how to interact with different graphics cards simply because there are basically only three types of physical connections (PCIe, AGP and PCI)?
I think it is not that simple, because I always read that different graphics cards have different drivers, so a driver is not just a way to use the physical interface; it also exists because different cards accept different, vendor-specific commands.
I just do not get the big picture.
This is a copy of my answer to the question "How does OpenGL work at the lowest level?" (the question has been marked for deletion, so I add some redundancy here).
This question is almost impossible to answer, because OpenGL by itself is just a front-end API, and as long as an implementation adheres to the specification and the outcome conforms to it, it can be done any way you like.
The question may have been: how does an OpenGL driver work at the lowest level? This is again impossible to answer in general, as a driver is closely tied to some piece of hardware, which may again do things however the developer designed it.
So the question should have been: "How does it look on average behind the scenes of OpenGL and the graphics system?". Let's look at this from the bottom up:
1. At the lowest level there's some graphics device. Nowadays these are GPUs which provide a set of registers controlling their operation (which registers exactly is device dependent), have some program memory for shaders, bulk memory for input data (vertices, textures, etc.) and an I/O channel to the rest of the system over which they receive/send data and command streams.
2. The graphics driver keeps track of the GPU's state and of all the resources of application programs that make use of the GPU. It is also responsible for conversion or any other processing of the data sent by applications (converting textures into the pixel format supported by the GPU, compiling shaders into the machine code of the GPU). Furthermore it provides some abstract, driver-dependent interface to application programs.
3. Then there's the driver-dependent OpenGL client library/driver. On Windows this gets loaded by proxy through opengl32.dll; on Unix systems it resides in two places: the X11 GLX module and driver-dependent GLX driver, and /usr/lib/libGL.so, which may contain some driver-dependent stuff for direct rendering. On MacOS X this happens to be the "OpenGL Framework". It is this part that translates OpenGL calls as you make them into calls to the driver-specific functions in the part of the driver described in (2).
4. Finally there's the actual OpenGL API library, opengl32.dll on Windows, and on Unix /usr/lib/libGL.so; this mostly just passes the commands down to the OpenGL implementation proper.
How the actual communication happens cannot be generalized:
On Unix the 3<->4 connection may happen either over sockets (yes, it may, and does, go over the network if you want to) or through shared memory. On Windows the interface library and the driver client are both loaded into the process address space, so that's not so much communication as simple function calls and variable/pointer passing. On MacOS X this is similar to Windows, only that there's no separation between OpenGL interface and driver client (which is the reason why MacOS X is so slow to keep up with new OpenGL versions; it always requires a full operating system upgrade to deliver the new framework).
Communication between 3<->2 may go through ioctl, read/write, or through mapping some memory into the process address space and configuring the MMU to trigger some driver code whenever changes to that memory are made. This is quite similar on any operating system, since you always have to cross the kernel/userland boundary: ultimately you go through some syscall.
Communication between the system and the GPU happens through the peripheral bus and the access methods it defines, such as PCI, AGP or PCI-E, which work through port I/O, memory-mapped I/O, DMA and IRQs.
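To make the indirection between layers (4)/(3) and the driver concrete, here is a small Unix/GLX illustration of how an application obtains a function pointer that ultimately lands in the vendor's driver code; this is just standard GLX usage, nothing specific to any particular driver:

    #include <GL/glx.h>

    typedef void (*GenericGLProc)(void);

    GenericGLProc loadGLEntryPoint(const char *name)
    {
        // The application only talks to the generic libGL (layers 4/3); the
        // returned pointer resolves into whatever vendor implementation that
        // library has loaded for the current display and driver.
        return glXGetProcAddress((const GLubyte *)name);
    }

So a call like loadGLEntryPoint("glCompileShader") hands back a driver-provided function without the application ever knowing which vendor's code it is calling.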

How do GPUs support both OpenGL and DirectX API execution?

I'm trying to understand how OpenGL and DirectX work with the graphics card.
If I write a program in OpenGL that draws a triangle, and another one in DirectX that does the same thing, what exactly happens on the GPU side?
When we run the programs, does every call to the OpenGL library and every call to the DirectX library produce code for the GPU, and is the GPU machine code produced by the two programs the same? (As if DirectX and OpenGL were like Java bytecode, precompiled, and when actually run they produce the same thing.)
Or does the GPU have two different instruction sets, one for each? I mean, what exactly are OpenGL and DirectX to the GPU, and how can it tell the difference between the two APIs?
Or is the difference only visible from the programmer's perspective?
I already answered this in On Windows, how does OpenGL differ from DirectX?
The relevant answer is the one quoted in full under the previous question above.

How does libc work?

I'm writing a MIPS32 emulator and would like to make it possible to use the whole Standard C Library (maybe with the GNU extensions) when compiling C programs with gcc.
As I understand it at this point, I/O is handled by syscalls on the MIPS32 architecture. To successfully run a program using libc/glibc, how can I tell which syscalls I need to emulate (without trial and error)?
Edit: See this for an example of what I mean by syscalls.
(You can check out the project here if you are interested, any feedback is welcome. Keep in mind that it's in a very early stage)
Very Short Answer
Read the much longer answer.
Short Answer
If you intend to provide a custom libc that uses some feature of your emulator to have the host OS execute your system calls, you have to implement all of them.
Much Longer Answer
Step back for a minute and look at the way things are typically layered in a real (non-emulated) system:
1. The peripherals have some I/O interface (e.g., numbered ports or memory mapping) that the CPU can tickle to make them do whatever they do.
2. The CPU runs software that understands how to manipulate the hardware. This can be a single-purpose program or an operating system that runs other programs. Since libc is in the picture, let's assume there's an OS and that it's something Unix-y.
3. Userspace programs run by the OS use a defined interface between themselves and the OS to ask for certain "system" functions to be carried out.
What you're trying to accomplish takes place between layers 3 and 2, where a function in libc or user code does whatever the OS defines as triggering a system call. This opens up numerous cans of worms:
What the OS defines as triggering a system call differs from OS to OS and (rarely) between versions of the same OS. This problem is mitigated on "real" systems by providing a dynamically-linkable libc that takes care of hiding those details. That aside, if you have a MIPS32 binary you want to run, does it use a system call convention that your emulator supports?
You would need to provide a custom libc that does something your emulator can recognize as making a particular system call and carry it out. Any program you wish to run will have to be cross-compiled to MIPS32 and statically linked with it, as would any other libraries the program requires (libm comes to mind). Alternately, your emulator package will need to provide a simulation of a dynamic linker plus dynamically-linkable copies of all required libraries, because opening those on the host won't work. If you have enough source to recompile the program from scratch, porting might be better than emulation.
Any code that makes assumptions about paths to files on a particular system or other assumptions about what they'll find in certain devices (which are themselves files) won't run correctly.
If you're providing layer 2, you're signing yourself up to provide a complete, correct simulation of the behavior of one particular version of an entire operating system. Some calls like read() and write() would be easy to deal with; others like fork(), uselib() and ioctl() would be much more difficult. There also isn't necessarily a one-to-one mapping of calls and behaviors your program uses with those your host OS provides. All of this assumes the host is Unix and the target program is, too. If the target is compiled for some other environment, all bets are off.
That last point is why most emulators provide just a CPU and the hardware behaviors of some target system (i.e., everything in layer 1). With those in place, you can run an original system's boot ROM, OS and user programs, all unaltered. There are a number of existing MIPS32 emulators that do just this and can run unaltered versions of the operating systems that ran on the hardware they emulate.
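As a purely illustrative sketch of what "providing layer 2" means for the easy cases, here is how an emulator might trap the MIPS syscall instruction and forward a couple of Linux o32 calls to the host ($v0 carries the call number, $a0..$a3 the arguments, and the result goes back in $v0); the struct layout and guest-memory handling are assumptions:

    #include <cstdint>
    #include <unistd.h>

    struct Mips32State {
        uint32_t gpr[32];       // $0..$31; $2 = v0, $4..$7 = a0..a3
        uint8_t *guestMemory;   // flat guest address space (an assumption)
    };

    void handleSyscall(Mips32State &cpu)
    {
        uint32_t nr = cpu.gpr[2];                 // $v0 holds the syscall number
        switch (nr) {
        case 4004: {                              // write(fd, buf, count) in o32 numbering
            uint32_t fd  = cpu.gpr[4];
            uint8_t *buf = cpu.guestMemory + cpu.gpr[5];   // translate guest -> host address
            uint32_t cnt = cpu.gpr[6];
            cpu.gpr[2] = (uint32_t)write(fd, buf, cnt);    // forward to the host OS
            break;
        }
        case 4001:                                // exit(status)
            // stop the emulation loop, using cpu.gpr[4] as the exit code
            break;
        default:
            // every unimplemented call is part of the "complete, correct
            // simulation of an entire OS" problem described above
            break;
        }
    }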
HTH and best of luck on your project.
Most of the ISO standard C library can be written in straight C. Only a few portions need access to lower level OS functionality.
At a minimum, you'll need to emulate basic I/O at the block or character level for fopen, fread, and fwrite. You could take the Unix approach, though, and implement those on top of the lower-level open, read, and write calls.
And you'll have to manage dynamic memory allocation for malloc and free.
And setjmp and longjmp, which need access to the execution stack.
Also time and the signal.h functions.
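If you take the custom-libc route, the pieces listed above typically boil down to a handful of low-level stubs. The sketch below loosely follows the newlib porting convention; the emulator_syscall() shim and the syscall numbers are assumptions (the shim would be a few lines of assembly emitting whatever instruction sequence your emulator traps on):

    #include <cstddef>

    extern "C" long emulator_syscall(long nr, long a0, long a1, long a2); // asm shim (assumed)

    enum { SYS_read = 3, SYS_write = 4, SYS_close = 6 };   // illustrative numbering

    extern "C" {

    // I/O: fopen/fread/fwrite can be layered on top of these.
    int _write(int fd, const void *buf, size_t len) { return (int)emulator_syscall(SYS_write, fd, (long)buf, (long)len); }
    int _read (int fd, void *buf, size_t len)       { return (int)emulator_syscall(SYS_read,  fd, (long)buf, (long)len); }
    int _close(int fd)                              { return (int)emulator_syscall(SYS_close, fd, 0, 0); }

    // Dynamic memory: malloc/free only need a way to grow the heap.
    void *_sbrk(ptrdiff_t incr)
    {
        extern char _end;              // end of static data, provided by the linker script
        static char *heap_end = &_end;
        char *prev = heap_end;
        heap_end += incr;              // no out-of-memory check in this sketch
        return prev;
    }

    } // extern "C"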
I don't know exactly how MIPS works, but on Win32 the OS calls have to be explicitly imported into a process via the DLL/EXE import table. There could be something similar in the executable format used by the MIPS system.
The usual approach is to emulate not only the CPU, but also a representative set of standard peripherals. Then you start an operating system in your emulator, which comes with a libc and hardware drivers included. Libc will invoke the OS's drivers, which drive the virtual hardware in your emulator. For a popular example, see DosBox.
The other interpretation of your question is that you don't want to write a full emulator, but a binary compatibility layer that allows you to execute MIPS32 binaries on a non-MIPS32 system. A popular example of that is Mac OS X (Intel), which can also execute PowerPC applications.
In the latter scenario you need to emulate either the OS's ABI (application binary interface), or maybe you can get away with just libc's ABI. In both cases you need to implement stub code running in the emulator and proxy code running on the host:
The stub serializes the function call arguments
...and transmits them from emulator memory to host memory using some special virtual instructions
The proxy needs to patch the arguments (endianness, integer length, address space ...)
...and executes the function call on the host system
The proxy then patches and serializes the outgoing function arguments
...and transmits them back to the stub
...which returns the data to the caller
Most calls will not work with a generic stub/proxy, but will need specific solutions; a hypothetical sketch of the proxy side follows.
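Everything in this sketch is hypothetical (the function ids, the argument-block layout, and the helpers are made up for illustration); it only shows the shape of a host-side proxy that un-marshals big-endian guest arguments and executes the call on the host:

    #include <cstdint>
    #include <cstdio>

    // Guest is big-endian MIPS; the host is assumed to be little-endian.
    static uint32_t swap32(uint32_t v)
    {
        return (v >> 24) | ((v >> 8) & 0xFF00u) | ((v << 8) & 0xFF0000u) | (v << 24);
    }

    // Marshalled call as laid out by the guest-side stub (layout is an assumption).
    struct GuestCall {
        uint32_t function_id;   // which libc function the stub is forwarding
        uint32_t args[4];       // raw 32-bit big-endian arguments
    };

    void proxyDispatch(const GuestCall &call, uint8_t *guestMemory)
    {
        switch (swap32(call.function_id)) {
        case 1: {   // hypothetical id for puts()
            const char *s = (const char *)(guestMemory + swap32(call.args[0]));
            puts(s);            // execute the call on the host system
            break;
        }
        default:
            // as noted above, most calls need a specific, hand-written proxy
            break;
        }
    }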
Good luck!