CUDA kernel invocation blocking? - c++

I'm running on Arch Linux.
I have read in multiple places that kernel invocation is asynchronous with respect to the CPU (it will return immediately and allow the CPU to continue). However, I'm not getting that behavior.
e.g.
kernel<<<blocks,threads>>>();
printf("print immediately\n");
check_cuda_error();
The CPU seems to lock up, and nothing is printed to the console (likewise nothing else is executed) until the kernel has completed. I tested with kernels of all sorts of different execution times (1s, 2s, 3s, etc.) and calculations to make sure it wasn't my kernel.
Is this a driver issue, or am I misinterpreting something?

I found that when I run outside of X (in a non-graphical environment) I get the expected behavior. My hypothesis is that while my GPU was working hard in the kernel, it wasn't updating the on-screen graphics and therefore appeared to "hang" before printing to the console.
Running from the shell provided the expected results, so I'm considering my own question answered. Comment below with any more insight you might have.
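For anyone who wants to double-check this on their own machine, here is a minimal sketch (busy_kernel and its spin length are illustrative, not from the original program): the launch itself should return within microseconds, and the CPU only blocks at cudaDeviceSynchronize().

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Illustrative kernel that just spins for roughly the requested number of GPU cycles.
__global__ void busy_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    busy_kernel<<<1, 1>>>(2000000000LL);   // returns immediately; the kernel itself runs for a second or two
    auto t1 = clk::now();
    printf("launch returned after %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());

    cudaDeviceSynchronize();               // this is where the CPU actually waits for the GPU
    auto t2 = clk::now();
    printf("kernel finished after %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t0).count());
    return 0;
}

If both lines only appear once the kernel has finished, it is the console update that is being held back (e.g. by X), not the launch itself, which matches the behaviour described above.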

Related

How do I measure GPU time on Metal?

I want to see programmatically how much GPU time a part of my application consumes on macOS and iOS. On OpenGL and D3D I can use GPU timer query objects. I searched and couldn't find anything similar for Metal. How do I measure GPU time on Metal without using Instruments, etc.? I'm using Objective-C.
There are a couple of problems with this method:
1) Most of the time you really want to know the GPU-side latency within a command buffer, not the round trip to the CPU. This is better measured as the time difference between running 20 instances of the shader and 10 instances of the shader. However, that approach can add noise, since the error is the sum of the errors associated with the two measurements.
2) Waiting for completion causes the GPU to clock down when it stops executing. When it starts back up again, the clock is in a low power state and may take quite a while to come up again, skewing your results. This can be a serious problem and may understate your performance in benchmark vs. actual by a factor of two or more.
3) If you start the clock on scheduled and stop it on completed, but the GPU is busy running other work, then your elapsed time includes time spent on the other workload. If the GPU is not busy, then you get the clock-down problems described in (2).
This problem is considerably harder to do right than most benchmarking cases I've worked with, and I have done a lot of performance measurement.
The best way to measure these things is to use on device performance monitor counters, as it is a direct measure of what is going on, using the machine's own notion of time. I favor ones that report cycles over wall clock time because that tends to weed out clock slewing, but there is not universal agreement about that. (Not all parts of the hardware run at the same frequency, etc.) I would look to the developer tools for methods to measure based on PMCs and if you don't find them, ask for them.
You can add scheduled and completed handler blocks to a command buffer. You can take timestamps in each and compare. There's some latency, since the blocks are executed on the CPU, but it should get you close.
With Metal 2.1, Metal now provides "events", which are more like fences in other APIs. (The name MTLFence was already used for synchronizing shared heap stuff.) In particular, with MTLSharedEvent, you can encode commands to modify the event's value at particular points in the command buffer(s). Then, you can either wait for the event to have that value or ask for a block to be executed asynchronously when the event reaches a target value.
That still has problems with latency, etc. (as Ian Ollmann described), but is more fine grained than command buffer scheduling and completion. In particular, as Klaas mentions in a comment, a command buffer being scheduled does not indicate that it has started executing. You could put commands to set an event's value at the beginning and (with a different value) at the end of a sequence of commands, and those would only notify at actual execution time.
Finally, on iOS 10.3+ but not macOS, MTLCommandBuffer has two properties, GPUStartTime and GPUEndTime, with which you can determine how much time a command buffer took to execute on the GPU. This should not be subject to latency in the same way as the other techniques.
As an addition to Ken's comment above, GPUStartTime and GPUEndTime are now available on macOS too (10.15+):
https://developer.apple.com/documentation/metal/mtlcommandbuffer/1639926-gpuendtime?language=objc

OpenGL, measuring rendering time on GPU

I have some big performance issues here, so I would like to take some measurements on the GPU side.
Following this thread, I wrote this code around my draw functions, including the GL error check and the swapBuffers() call (auto-swapping is indeed disabled):
gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
{
    draw(gl4);
    checkGlError(gl4);
    glad.swapBuffers();
}
gl4.glEndQuery(GL4.GL_TIME_ELAPSED);
gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);
Since OpenGL rendering commands are supposed to be asynchronous (the driver can buffer up to X commands before sending them all together in one batch), my question is essentially whether:
the code above is correct
I am right in assuming that at the beginning of a new frame all the previous GL commands (from the previous frame) have been sent, executed and terminated on the GPU
I am right in assuming that when I get the query result with glGetQueryObjectiv and GL_QUERY_RESULT, all the GL commands issued so far have terminated? That is, OpenGL will wait until the result becomes available (as stated in the thread)?
Yes, when you query the timer it will block until the data is available, i.e. until the GPU is finished with everything that happened between beginning and ending the query. To avoid synchronising with the GPU, you can use GL_QUERY_RESULT_AVAILABLE to check whether the results are already available and only read them once they are. That might require less straightforward code to keep tabs on open queries and check them periodically, but it will have the least performance impact. Waiting for the value every time is a sure way to kill your performance.
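A sketch of that polling pattern in plain C/C++ against the raw GL API (the JOGL calls map one-to-one); the pool of four queries and the draw() call are placeholders, and a current GL 3.3+ context is assumed:

GLuint queries[4];
glGenQueries(4, queries);
int submitIndex = 0;

// each frame: record into the next query of the pool
glBeginQuery(GL_TIME_ELAPSED, queries[submitIndex]);
draw();                                   // placeholder for the actual rendering
glEndQuery(GL_TIME_ELAPSED);
submitIndex = (submitIndex + 1) % 4;

// a few frames later: poll instead of blocking
for (int i = 0; i < 4; ++i) {
    GLint available = 0;
    glGetQueryObjectiv(queries[i], GL_QUERY_RESULT_AVAILABLE, &available);
    if (available) {
        GLuint64 elapsedNs = 0;
        glGetQueryObjectui64v(queries[i], GL_QUERY_RESULT, &elapsedNs);   // nanoseconds
        // accumulate elapsedNs into your frame statistics
    }
}

With a small pool like this, the result for a given frame is simply read a few frames later; that latency is the price for never stalling the CPU on the query.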
Edit: To address your second question, swapping the buffers doesn't necessarily mean it will block until the operation succeeds. You may see that behaviour, but it's just as likely that it is just an implicit glFlush and the command buffer is not empty yet. That is also the more desirable behaviour, because ideally you want to start on your next frame right away and keep the command buffer filled. Check your implementation's documentation for more info though, as this is implementation-defined.
Edit 2: Checking for errors might end up being an implicit synchronization by the way, so you will probably see the command buffer emptying when you wait for error checking in the command stream.

Long-running CUDA kernel stops when TDR kicks in

I'm new to GPGPU, and I've got a small problem. My program needs a really large amount of computation, so when the timeout is reached and the Windows TDR kicks in, it stops the execution.
Sadly I don't have administrator privileges on the computer my program is running on, so modifying the registry keys is not possible. I managed to decompose the problem into smaller ones (one per row of the image being processed), and I tried to call the kernel repeatedly inside a for loop. To ensure that the card does have some time to respond to the OS, I've put some sleep time between the calls, like this:
for (int row = 0; row < image.y; row++) {
    printf("%d/%d\n", row, image.y);
    cudaMemset(dev_matrixes, 0, image.x * image.y * sizeof(short));
    countEnergyOfRow<<<B, BLOCK_DIM>>>(...);
    Sleep(750);
}
At first it seemed to work fine, but at the 21st iteration the driver crashed and TDR struck again. After recovery the CPU kept calling the kernel, and the next 490 times it worked fine. I've run it multiple times, and every time the 21st iteration was fatal. I also tried to start it from a different (18th) index, but again the problem occurred at the 21st iteration (at the 39th index).
What am I doing wrong, is there something I'm missing? Should I somehow make the GPU stop counting manually, or just increase the sleep period?
In addition to the Windows TDR, Windows WDDM systems are subject to batching of operations. So one possibility is that the operations are being batched in such a way that the timeout is exceeded, even if an individual kernel call does not exceed the timeout.
One thing you can try is to simply reduce the execution time of your kernel further. If the execution time of your kernel is reduced to, say, 1/10 of a second, then even the batching of operations should not cause the timeout to be exceeded.
Another thing you could try is to attempt to work around the batching by issuing a cudaStreamQuery(0); call after each kernel call.
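As a rough sketch (kernel arguments elided as in the original code), that workaround applied to the loop above would look like this:

for (int row = 0; row < image.y; row++) {
    printf("%d/%d\n", row, image.y);
    cudaMemset(dev_matrixes, 0, image.x * image.y * sizeof(short));
    countEnergyOfRow<<<B, BLOCK_DIM>>>(/* ... same arguments as before ... */);
    cudaStreamQuery(0);   // query the NULL stream right away so the batched work gets pushed to the GPU
    Sleep(750);
}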
You might also check to see if the 21st iteration is taking longer for some reason; you could add cudaEvent timing to measure the time of each kernel call; I'm sure this would be instructive.
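And a sketch of the cudaEvent timing, again reusing the variables from the question:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int row = 0; row < image.y; row++) {
    cudaEventRecord(start, 0);
    countEnergyOfRow<<<B, BLOCK_DIM>>>(/* ... */);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               // wait for this kernel so the elapsed time can be read
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // milliseconds between the two recorded events
    printf("iteration %d took %.1f ms\n", row, ms);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);

cudaEventSynchronize() should also have the side effect of flushing the batched work each iteration, so if the 21st-iteration crash disappears with this in place, that points at batching rather than a single slow kernel.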
The best solution is to work on a system/GPU that is not subject to WDDM/TDR.

Multiple CUDA streams crashing GPU

This is a continuation of this post.
It seems as though a special case has been solved by adding volatile, but now something else has broken. If I add anything between the two kernel calls, the system reverts back to the old behavior, namely freezing and printing everything at once. This behavior is shown by adding sleep(2); between set_flag and read_flag. Also, when put in another program, this causes the GPU to lock up. What am I doing wrong now?
Thanks again.
There is an interaction with X and the display driver here, as well as an interaction between the standard output queue and the graphical display driver.
A few experiments you can try (with the sleep(2); added between the set_flag and read_flag kernels):
1) Log into your machine over the network via ssh from another machine. I think your program will work. (X is not involved in the display in this case.)
2) Comment out the line that prints "Starting..." I think your program will then work. (This avoids the display driver / print queue deadlock, see below.)
3) Add a sleep(2); between the "Starting..." print line and the first kernel (see the sketch after this list). I think your program will then work. (This allows the display driver to fully service the first printout before the first kernel is launched, so there is no CPU thread stall.)
4) Stop X and run from a console. I think your program will work.
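As a rough sketch of experiment 3 (the launch configurations, streams, and arguments are placeholders, since they come from the original program):

printf("Starting...\n");
sleep(2);                                            // experiment 3: let the display driver service the printout first
set_flag<<<grid, block, 0, stream1>>>(/* ... */);    // first kernel (placeholder launch configuration and stream)
sleep(2);                                            // the intervening sleep that triggered the hang
read_flag<<<grid, block, 0, stream2>>>(/* ... */);   // second kernel (placeholder launch configuration and stream)
cudaDeviceSynchronize();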
When the GPU is both hosting an X display and also running CUDA tasks, it has to switch between the two. For the duration of the CUDA task, ordinary display processing is suspended. You can read more about this here.
The problem here is that when running X, the first printout is sent to the print queue but not actually displayed before the first kernel is launched. This is evident because you don't see the printout before the display freezes. After that, the CPU thread stalls waiting for the display of the text, and the second kernel is not started. The intervening sleep(2); and its interaction with the OS is enough for this stall to occur. Meanwhile, the executing first kernel has the display driver "stopped" for ordinary display tasks, so the OS never gets past its stall, the 2nd kernel doesn't get launched, and you see the apparent hang.
Note that options 1, 2, or 3 in the linked custhelp article would be effective in your case. Option 4 would not.

What could cause my program to not use all cores after a while?

I have written a program that captures and displays video from three video cards. For every frame I spawn a thread that compresses the frame to JPEG and then puts it in a queue for writing to disk. I also have other threads that read from these files and decode them in their own threads. Usually this works fine; it's a pretty CPU-intensive program, using about 70-80 percent of all six CPU cores. But after a while the encoding suddenly slows down, and the program can't handle the video fast enough and starts dropping frames. If I check the CPU utilization I can see that one core (usually core 5) is not doing much anymore.
When this happens, it doesn't matter if I quit and restart my program. CPU 5 will still have a low utilization and the program starts dropping frames immediately. Deleting all saved video doesn't have any effect either. Restarting the computer is the only thing that helps. Oh, and if I set the affinity of my program to use all but the semi-idling core, it works until the same happens to another core. Here is my setup:
AMD X6 1055T (Cool & Quiet OFF)
GA-790FX-UD5 motherboard
4GB RAM, unganged, 1333MHz
Blackmagic Decklink DUO capture cards (x2)
Linux - Ubuntu x64 10.10 with kernel 2.6.32.29
My app uses:
libjpeg-turbo
posix threads
decklink api
Qt
Written in C/C++
All libraries linked dynamically
It seems to me like it would be some kind of problem with the way Linux schedules threads on the cores. Or is there some way my program can mess up so badly that restarting it doesn't help?
Thank you for reading, any and all input is welcome. I'm stuck :)
First of all, make sure it's not your program - maybe you are running into a convoluted concurrency bug, even though that's not all that likely given your program architecture and the fact that restarting the kernel helps. I've found that, usually, a good way is post-mortem debugging: compile with debugging symbols, kill the program with SIGSEGV when it is behaving strangely, and examine the core dump with gdb.
I would try to choose a core round-robin when a new frame-processing thread is spawned and pin the thread to that core. Keep statistics on how long it takes for the thread to run. If this is in fact a bug in the Linux scheduler, your threads will take roughly the same time to run on any core. If the core is actually busy with something else, your threads pinned to that core will get less CPU time.
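A minimal sketch of that approach (encode_frame, the frame count, and the immediate join are placeholders; the real program would keep its existing thread structure). Build with g++ -pthread on Linux:

#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <cstdio>
#include <ctime>

// Hypothetical stand-in for the real per-frame JPEG encoding thread.
static void* encode_frame(void*)
{
    timespec t0, t1;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);

    // ... compress the frame here ...

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("core %d: %.2f ms of CPU time\n", sched_getcpu(), ms);
    return NULL;
}

int main()
{
    const long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    int next_core = 0;

    for (int frame = 0; frame < 100; ++frame) {          // one thread per frame, as in the question
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(next_core, &set);                        // round-robin core choice

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);  // thread starts already pinned

        pthread_t tid;
        pthread_create(&tid, &attr, encode_frame, NULL);
        pthread_attr_destroy(&attr);
        pthread_join(tid, NULL);                         // joined immediately only to keep the sketch short

        next_core = (next_core + 1) % (int)ncores;
    }
    return 0;
}

CLOCK_THREAD_CPUTIME_ID measures the CPU time actually granted to the thread, so threads pinned to the "bad" core should report noticeably less of it if something else is hogging that core.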