This is a continuation of this post.
It seems as though one special case has been solved by adding volatile, but now something else has broken. If I add anything between the two kernel calls, the system reverts to the old behavior, namely freezing and then printing everything at once. This behavior shows up when I add sleep(2); between set_flag and read_flag. Also, when put into another program, this causes the GPU to lock up. What am I doing wrong now?
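For reference, the sequence being described looks roughly like the sketch below. The kernel bodies and launch configuration are guesses based on the names; only the launch order, the "Starting..." printout mentioned in the discussion, and the added sleep(2) come from this thread.

#include <cstdio>
#include <unistd.h>

__device__ volatile int flag = 0;

__global__ void set_flag()
{
    flag = 1;
}

__global__ void read_flag()
{
    printf("flag = %d\n", flag);   // device-side printf, flushed at the sync below
}

int main()
{
    printf("Starting...\n");

    set_flag<<<1, 1>>>();
    sleep(2);                      // adding anything here is what triggers the freeze under X
    read_flag<<<1, 1>>>();

    cudaDeviceSynchronize();       // wait for both kernels and flush device-side printf
    return 0;
}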
Thanks again.
There is an interaction between X and the display driver, and also between the standard output queue and the graphical display driver.
A few experiments you can try (with the sleep(2); added between the set_flag and read_flag kernels):

1. Log into your machine over the network via ssh from another machine. I think your program will work. (X is not involved in the display in this case.)
2. Comment out the line that prints out "Starting...". I think your program will then work. (This avoids the display driver / print queue deadlock; see below.)
3. Add a sleep(2); between the "Starting..." print line and the first kernel. I think your program will then work. (This allows the display driver to fully service the first printout before the first kernel is launched, so there is no CPU thread stall.)
4. Stop X and run from a console. I think your program will work.
When the GPU is both hosting an X display and also running CUDA tasks, it has to switch between the two. For the duration of the CUDA task, ordinary display processing is suspended. You can read more about this here.
The problem here is that when running X, the first printout gets sent to the print queue but is not actually displayed before the first kernel is launched. This is evident because you don't see the printout before the display freezes. After that, the CPU thread stalls waiting for the text to be displayed, and the second kernel never starts. The intervening sleep(2); and its interaction with the OS is enough for this stall to occur. And since the executing first kernel has the display driver "stopped" for ordinary display tasks, the OS never gets past its stall, so the 2nd kernel doesn't get launched, leading to the apparent hang.
Note that options 1, 2, or 3 in the linked custhelp article would be effective in your case. Option 4 would not.
I've been experiencing a strange occasionally occurring bug for the last few days.
I have a console application that also displays a window opened with SDL for graphical output; it continuously runs three threads. The main thread runs the event loop and processes the console input. The second thread uses std::cin.getline to get the console input. This second thread, however, is also responsible for outputting logging information, which can be produced when the user clicks somewhere on the SDL window.
These log messages are sent to a mutex-protected stringstream that is regularly checked by thread 2. If there are log messages, it deletes the prompt, outputs them, and then prints a new prompt. Because of this it can't afford to block on getline, so this thread spawns the third thread, which peeks cin and signals via an atomic when there is data to be read from the input stream, at which point getline is called and the input is passed to the logic on the main thread.
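For context, the mutex-protected log sink described above amounts to something like the following sketch; the class and member names are invented for illustration and are not from the original program.

#include <mutex>
#include <sstream>
#include <string>

class LogSink
{
public:
    // Called from any thread (e.g. in response to a click on the SDL window).
    void push(const std::string& msg)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_ << msg << '\n';
    }

    // Called periodically by the console I/O thread: returns and clears
    // whatever has accumulated since the last call.
    std::string drain()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        std::string out = pending_.str();
        pending_.str("");
        return out;
    }

private:
    std::mutex        mutex_;
    std::stringstream pending_;
};

If drain() returns something non-empty, the I/O thread erases the prompt, writes the messages, and reprints the prompt.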
Here's the bit I haven't quite worked out: about 1 in 30 of these fails, because the program doesn't receive exactly the same input that was typed into the terminal. You can see what I mean in the images here: the first line is what was typed and the second is the Lua stacktrace caused by receiving different (incorrect) input.
This occurs whether I use rlwrap or not. Is this due to peek and getline hitting the input stream at the same time? (This is possible as the peek loop just looks like:
while (!exitRequested_)
{
    if (std::cin.peek())
        inputAvailable_ = true; // this is atomic
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
Any thoughts? I looked at curses quickly, but it looks like quite a lot of effort to use. I've never heard of getline garbling stuff before. But I also printed every string that was received for a while, and they matched what Lua is reporting.
As @davmac suggested, peek appears to have been interfering with getline. My assumption would be that this is linked to peek taking a character and then putting it back at the same time as getline reads from the buffer.
Whatever the underlying cause of the issue is, I am >98% sure that the problem has been fixed by implementing the fix davmac suggested.
In several hours of use I have had no issues.
Moral: don't concurrently access cin, even if one of the functions doesn't modify the stream.
(Note: the above happened with both g++ and clang++, so I assume it's linked to the way the standard library is commonly implemented.)
As @DavidSchwartz pointed out, concurrent access to streams is explicitly prohibited, which clearly explains why the fix works.
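For anyone hitting the same thing, the moral above can be applied by letting a single thread own std::cin entirely and hand complete lines to the rest of the program. The sketch below shows that general pattern with invented names; it is not necessarily the exact fix davmac suggested.

#include <atomic>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>

std::mutex              lineMutex;
std::queue<std::string> lines;
std::atomic<bool>       exitRequested{false};

// The only function that ever touches std::cin; runs on its own thread.
void readerThread()
{
    std::string line;
    while (!exitRequested && std::getline(std::cin, line))
    {
        std::lock_guard<std::mutex> lock(lineMutex);
        lines.push(line);
    }
}

// Called from the logic thread: non-blocking, returns false if no line is ready.
bool tryGetLine(std::string& out)
{
    std::lock_guard<std::mutex> lock(lineMutex);
    if (lines.empty())
        return false;
    out = lines.front();
    lines.pop();
    return true;
}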
I have some big performance issues here, so I would like to take some measurements on the GPU side.
Based on this thread, I wrote the following code around my draw functions, including the GL error check and the swapBuffers() call (auto-swapping is indeed disabled):
gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
{
    draw(gl4);
    checkGlError(gl4);
    glad.swapBuffers();
}
gl4.glEndQuery(GL4.GL_TIME_ELAPSED);

gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);
And since OpenGL rendering commands are supposed to be asynchronous (the driver can buffer up to X commands before sending them all together in one batch), my questions are essentially:

1. Is the code above correct?
2. Am I right in assuming that at the beginning of a new frame, all the previous GL commands (from the previous frame) have been sent, executed, and completed on the GPU?
3. Am I right in assuming that when I get the query result with glGetQueryObjectiv and GL_QUERY_RESULT, all the GL commands so far have completed? That is, will OpenGL wait until the result becomes available (as described in that thread)?
Yes, when you query the timer it will block until the data is available, i.e. until the GPU is finished with everything that happened between beginning and ending the query. To avoid synchronising with the GPU, you can use GL_QUERY_RESULT_AVAILABLE to check whether the results are already available and only read them then. That might require less straightforward code to keep tabs on open queries and periodically check them, but it will have the least performance impact. Waiting for the value every time is a sure way to kill your performance.
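For illustration, the non-blocking pattern looks roughly like the sketch below in plain C/C++ GL calls (the JOGL entry points should carry the same names on the GL4 object); the loader header and function name are assumptions.

#include <GL/glew.h>   // or any loader that declares the query entry points

// Returns true and writes the elapsed GPU time in nanoseconds if the query
// result is ready; returns false immediately (no stall) otherwise.
bool tryReadTimerQuery(GLuint query, GLuint64& elapsedNs)
{
    GLint available = 0;
    glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
    if (!available)
        return false;                                   // GPU hasn't finished that span yet
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
    return true;
}

You would call this once per frame for each in-flight query and only fold the value into your statistics once it returns true.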
Edit: To address your second question, swapping the buffers doesn't necessarily mean it will block until the operation succeeds. You may see that behaviour, but it's just as likely that it is just an implicit glFlush and the command buffer is not empty yet. That is also the more desirable behaviour, because ideally you want to start on your next frame right away and keep the command buffer filled. Check your implementation's documentation for more info, though, as this is implementation-defined.
Edit 2: By the way, checking for errors might end up being an implicit synchronization, so you will probably see the command buffer empty out while you wait for the error check in the command stream.
I have built my first application using glibmm. I'm using a lot of threads, as it does heavy processing. I have tried to follow the guidelines concerning multithreading, i.e. not doing any GUI updates from any thread other than the one where the g_main_loop is running.
I do a lot of graphics rendering in worker threads, but I only ever update a PixBuf, which is later drawn by the widget's on_draw() from the main loop.
All was fine as long as the data I render was read from files. When I started streaming data from a server which I render at regular intervals then the problems started.
Every now and then, especially when executing multiple instances of my application simultaneously, I see that the main thread takes 100% CPU time. Running strace on the process shows that the g_main_loop has ended up in an endless loop of calls to poll:
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=10, events=POLLIN}, {fd=8, events=POLLIN}], 4, 100) = 1 ([{fd=10, revents=POLLIN}])
In /proc I get this for file descriptor 10: 10 -> socket:[1132750]
The poll always returns immediately, as file descriptor 10 has something to offer. This goes on forever, so I assume that the file descriptor is never read. The odd thing is that running 5 instances almost always leads to all 5 ending up in the infinite poll loop after just a couple of minutes, while running only one instance usually works for more than 30 minutes most of the times I try.
Why is this happening and is there any way to debug this?
My mistake was that I called queue_draw() from one of my worker threads. Given that the function is called "queue", I assumed it would queue a redraw that would later be executed by the g_main_loop. As it turned out, this was what broke the g_main_loop. I wish libgtkmm's reference manual had a little more detail about these multithreading restrictions.
My solution to the problem was adding a Glib::Dispatcher member, queueRedraw, to my widget and connecting it to the queue_draw() function:
queueRedraw.connect(sigc::mem_fun(*this, &MyWidgetClass::queue_draw));
Calling queueRedraw() from a worker thread signals the main thread to call the queue_draw() function.
I don't know if this is the best approach, but it solves the problem.
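For reference, the whole pattern looks roughly like the sketch below; MyWidgetClass, requestRedraw, and the base class are stand-ins for the actual widget.

#include <gtkmm.h>

class MyWidgetClass : public Gtk::DrawingArea
{
public:
    MyWidgetClass()
    {
        // The Dispatcher must be created and connected in the thread that runs
        // the main loop; emitting it from a worker thread is then safe.
        queueRedraw.connect(sigc::mem_fun(*this, &MyWidgetClass::queue_draw));
    }

    // Called from a worker thread once a new PixBuf is ready.
    void requestRedraw()
    {
        queueRedraw.emit();   // queue_draw() will run in the main loop's thread
    }

private:
    Glib::Dispatcher queueRedraw;
};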
I'm running on Arch Linux.
I have read in multiple places that kernel invocation is asynchronous with respect to the CPU (the call returns immediately and allows the CPU to continue). However, I'm not getting that behavior.
e.g.
kernel<<<blocks,threads>>>();   // should return immediately, before the kernel finishes
printf("print immediately\n"); // expected to appear while the kernel is still running
check_cuda_error();
The CPU seems to lock up, and nothing is printed to the console (and nothing else is executed) until the kernel has completed. I tested with kernels of all sorts of different execution times (1s, 2s, 3s, etc.) and different calculations to make sure it wasn't my kernel.
Is this a driver issue? Or am I misinterpreting something?
I found that when I run outside of X (in a non-graphical environment) I get the expected behavior. My hypothesis is that while my GPU was working hard in the kernel, it wasn't updating the on-screen graphics and therefore appeared to "hang" before printing to the console.
Running from the shell provided the expected results, so I'm considering my own question answered. Comment below with any more insight you might have.
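One way to convince yourself that the launch itself returns immediately, independently of what the display is doing, is to time the launch and the explicit synchronization separately. This is a sketch; busy_kernel and the cycle count are made up for illustration.

#include <chrono>
#include <cstdio>

// Spins on the GPU for roughly the requested number of clock cycles.
__global__ void busy_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    using host_clock = std::chrono::steady_clock;

    auto t0 = host_clock::now();
    busy_kernel<<<1, 1>>>(2000000000LL);   // a second or more, depending on the GPU clock
    auto t1 = host_clock::now();           // the launch has already returned here
    cudaDeviceSynchronize();               // the host blocks here until the kernel finishes
    auto t2 = host_clock::now();

    printf("launch returned after %lld us, sync took a further %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
    return 0;
}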
I'm running on a Marvell Monahans PXA320 under Green Hills INTEGRITY 5.0.10, using MULTI 4.2.3 for development and an RTSERV connection for debugging. I've been asked to take over a menu-driven program.
I've noticed that if I halt the program (to modify breakpoints) and then resume it, the task gets into an infinite loop displaying the menu in the debugger I/O tab. After each instance of the menu that gets printed, it says that I have made an illegal selection. So, some input is apparently being fed into the task as if I had typed it in (and this input obviously corresponds to an invalid menu selection). I do not see on the display what this phantom input is.
Is there anything I can do to prevent a halt / resume from screwing up the I/O?
Thanks,
Dave
My first guess is that getc() (or your equivalent) is returning -1. This can happen if your input buffers overflowed as a result of halting the application. I/O keeps flowing while the application is halted...
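If that is what's happening, a defensive check in the input loop keeps the overflow from being treated as a menu selection. This is a sketch of the general idea, not code from the original program.

#include <stdio.h>

// Returns the next menu character, or -1 if there is nothing valid to read.
int read_menu_selection(FILE *in)
{
    int c = getc(in);
    if (c == EOF)
    {
        clearerr(in);   // recover the stream state (e.g. after a buffer overflow during a halt)
        return -1;      // caller should skip this iteration rather than report
                        // an illegal selection
    }
    return c;
}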
It is generally not a good idea to halt the whole program when debugging with INTEGRITY. You're generally better off attaching the debugger to a single thread (something idle or infrequently used), setting an "any-task" breakpoint in that thread, and then resuming the thread. (Don't close the window! Doing so will delete the breakpoint.) You'll see a "DebugBrk" status on the thread that hits the breakpoint; then you can double-click and attach to that specific thread.
Following that alternate procedure should (hopefully!) prevent the I/O error.