I'm investigating GPGPU programming with OpenGL + GLSL. One problem is that if you have a shader that takes a long time to finish, it seems to be impossible to cancel it.
After setting up everything, I issue the final glReadPixels call which blocks until all pixels have been rendered to a framebuffer. Depending on the shader, this could take a long time, even seconds. Is there a way to cancel the call (from another thread) or even query the progress? What happens if you set up an infinite loop in a shader?
instead of glReadPixels you could use PixelBufferObjects which are not blocking. glReadPixels will wait (in your main thread) for the results, but PBO will continue... somewhere later in the code you can check if the data in PBO are available.
http://www.songho.ca/opengl/gl_pbo.html
http://www.opengl.org/wiki/Pixel_Buffer_Object
if you need some more advanced calculations then you may want to use OpenCL, that will give you more flexibility.
What happens if you set up an infinite loop in a shader?
I think you will get crash of video driver.
Related
I am developing an application in C++ (MVS2008) and I have problem like described in this thread:
NVIDIA OpenGL driver lost connection
what i want to ask is not for a solution, or why is this happenning (like in the posted thread), i want to ask if i can "catch" this error and do something before the application crushes, like an output log with some relevant information of the application state.
The error occurs every now and then while running the application without clear causes, therefore i would like such a thing.
Modern Windows versions put a number of hard constraints on the responsiveness of applications. If a process spends too much time in a graphics operation driver call or takes too long to fetch events from Windows a watchdog triggers and Windows assumes that the process got stuck in an infinite loop or violates the responsivenes demands and may "do" something about it.
Try what happens if you break down a single glDraw… call into a number of smaller batches. In general you want to minimize the number of glDraw… calls, but if a single glDraw… call takes more than 10ms or so to complete, you're far beyond OpenGL overhead territory anyway.
Note that due to the asynchronous nature of OpenGL that watchdog may bark in a OpenGL finishing call glFinish or SwapBuffers. In that case it may help to help to add glFlush commands between the draw batches. If that doesn't help try glFinish (which will have a performance impact). If that doesn't help, too. Create an auxiliary OpenGL context in a separate thread that renders to a texture using a FBO and have the main thread display only the contents of this intermediary texture. And if that doesn't help, create the auxiliary context on a PBuffer instead on the same window as your main context.
Is there a way to increase the speed of glReadPixels? Currently I do:
Gdx.gl.glReadPixels(0, 0, Gdx.graphics.getWidth(), Gdx.graphics.getHeight(), GL20.GL_RGBA, GL20.GL_UNSIGNED_BYTE, pixels);
The problem is that it blocks the rendering and is slow.
I have heard of Pixel Buffer Objects, but I am quite unsure on how to wire it up and whether it is faster or not.
Also is there any other solutation than glReadPixels?
Basically, I want to take a screenshot as fast as possible, without blocking the drawing of the next scene.
Is there a way to increase the speed of glReadPixels?
Well, the speed of that operation is actually not the main issue. It has to transfer a certain amount of bytes from the framebuffer to your system memory. In your typical desktop system with a discrete GPU, that involves sending the data over PCI-Express, and there is no way around that.
But as you already stated, the implicit synchronization is a big issue. If you need that pixel data as soon as possible, you can't really do much better than that synchronous readback. But if you can live with getting that data later, asynchronous readback via pixel buffer objects (PBOs) is the way to go.
The pseudo code for that is:
create PBO
bind PBO as GL_PIXEL_PACK_BUFFER
do the glReadPixels
do something else. Both work on the CPU and issuing new commands for the GPU is ideal.
Read back the data from PBO by either using glGetBufferSubData or by mapping the PBO for reading.
The crucial point is the timing of step 5. I you do that to early, you still blocking the client side, as it will wait for the data to become available. For some screenshots, It should not be hard to delay that step for even one or two frames. That way, it will have only a slight impact on the overall render performance, and it will neither stall the GPU nor the CPU.
I am using an FBO+RBO, and instead of regular double buffering on the default framebuffer, I am drawing to the RBO and then blit directly on the GL_FRONT buffer of the default FBO (0) in a single buffered OpenGL context.
It is fine and I dont get any flickering, but if the scene gets a bit complex, I experience a HUGE drop in fps, something so weird that I knew something had to be wrong. And I dont mean from 1/60 to 1/30 because of a skipped sync, I mean a sudden 90% fps drop.
I tried a glFlush() after the blit - no difference, then I tried a glFinish() after the blit, and I had a 10x fps boost.
So I used regular doble buffering on the default framebuffer and swapbuffers(), and the fps got a boost as well, as when using glFinish().
I cannot figure out what is happening. Why glFinish() makes so much of a difference when it should not? and, is it ok to use a RBO and blit directly on the front buffer, instead of using a swapbuffers call in a double buffering context? I know Im missing vsync but the composite manager will sync anyway (infact im not seeing any tearing), it is just as if the monitor is missing 9 out of 10 frames.
And just out of curiosity, does a native swapbuffers() use glFinish() on either windows or linux?
I believe it is a sync-related issue.
When rendering directly to the RBO and blitting to the front buffer, there is simply no sync whatsoever. Thus on complex scenes the GPU command queue will fill quite quickly, then the CPU driver queue will fill quickly as well, until a CPU sync will be forced by the driver during an OpenGL command. At that point the CPU thread will be halted.
What I mean is that, without any form of sync, complex renderings (renderings for which one or more OpenGL command will be put in a queue) will always cause the CPU thread to be halted at some point, since as the queues will fill, the CPU will be issuing more and more commands.
In order to get a smooth (more constant) user interaction, a sync is needed (either with a platform-specific swapbuffers() or a glFinish()) so to stop the CPU from making things worse issuing more and more commands (which in turn would bring the CPU thread to a stop later)
reference: OpenGL Synchronization
There are separate issues here, that are also a little bit connected.
1) Re-implementing double buffering yourself, while on spec the same thing, is not the same thing to the driver. Drivers are highly optimized for the common case. For example, many chips have distinct 2d and 3d units. The swap in swapBuffers is often handled by the 2d unit. Blitting a buffer is probably still done with the 3d unit.
2) glFlush (and Finish) are ignored by many drivers. Flush is a relic of client server rendering. Finish was intended for profiling. But it got abused to work around driver bugs. So now drivers often ignore it to improve the performance of legacy code that used Finish as a workaround.
3) Just don't do single buffered. There is no performance benefit and you are working off the "good" path of the driver. Window managers are super optimized for double buffered opengl.
4) What you are seeing looks a lot like you are simply leaking resources. Do you allocate buffers without freeing them? A quick and dirty way to check is if any glGen* functions return ever increasing ids.
I am developing an application that needs to read back the whole frame from the front buffer of an openGL application. I can hijack the application's opengl library and insert my code on swapbuffers. At the moment I am successfully using a simple but excruciating slow glReadPixels command without PBO's.
Now I read about using multiple PBO's to speed things up. While I think I've found enough resources to actually program that (isn't that hard), I have some operational questions left. I would do something like this:
create a series (e.g. 3) of PBO's
use glReadPixels in my swapBuffers override to read data from front buffer to a PBO (should be fast and non-blocking, right?)
Create a seperate thread to call glMapBufferARB, once per PBO after a glReadPixels, because this will block until the pixels are in client memory.
Process the data from step 3.
Now my main concern is of course in steps 2 and 3. I read about glReadPixels used on PBO's being non-blocking, will this be an issue if I issue new opengl commands after that very fast? Will those opengl commands block? Or will they continue (my guess), and if so, I guess only swapbuffers can be a problem, will this one stall or will glReadPixels from front buffer be many times faster than swapping (about each 15->30ms) or, worst case scenario, will swapbuffers be executed while glReadPixels is still reading data to the PBO? My current guess is this logic will do something like this: copy FRONT_BUFFER -> generic place in VRAM, copy VRAM->RAM. But I have no idea which of those 2 is the real bottleneck and more, what the influence on the normal opengl command stream is.
Then in step 3. Is it wise to do this asynchronously in a thread separated from normal opengl logic? At the moment I think not, It seems you have to restore buffer operations to normal after doing this and I can't install synchronization objects in the original code to temporarily block those. So I think my best option is to define a certain swapbuffer delay before reading them out, so e.g. calling glReadPixels on PBO i%3 and glMapBufferARB on PBO (i+2)%3 in the same thread, resulting in a delay of 2 frames. Also, when I call glMapBufferARB to use data in client memory, will this be the bottleneck or will glReadPixels (asynchronously) be the bottleneck?
And finally, if you have some better ideas to speed up frame readback from GPU in opengl, please tell me, because this is a painful bottleneck in my current system.
I hope my question is clear enough, I know the answer will probably also be somewhere on the internet but I mostly came up with results that used PBO's to keep buffers in video memory and do processing there. I really need to read back the front buffer to RAM and I do not find any clear explanations about performance in that case (which I need, I cannot rely on "it's faster", I need to explain why it's faster).
Thank you
Are you sure you want to read from the front buffer? You do not own this buffer, and depending on your OS it might be destroyed, e.g., by another window on top of it.
For your use case, people typically do
draw N
start PBO read N from back buffer
draw N+1
start PBO read N+1
sync PBO read N
process N
...
from a single thread.
Currently I am loading an image in to memory on a 2nd thread, and then during the display loop (if there is a texture load required), load the texture.
I discovered that I could not load the texture on the 2nd thread because OpenGL didn't like that; perhaps this is possible but I did something wrong - so please correct me if this is actually possible.
On the other hand, if my failure was valid - how do I load a texture without disrupting the rendering loop? Currently the textures take around 1 second to load from memory, and although this isn't a major issue, it can be slightly irritating for the user.
You can load a texture from disk to memory on any thread you like, using any tool you wish for reading the files.
However, when you bind it to OpenGL, it's going to need to be handled on the same thread as the rendering for that OpenGL context. That being said, this discussion suggests that using a PBO in a second thread is an option, and can speed up the process.
You can certainly load the texture from disk into RAM in any number of threads you like, but OpenGL won't upload to VRAM in multiple threads for the reason mentioned in Reed's answer.
Given the loading from disk is the slowest part, thats the bit you'll probably want to thread. The loading thread(s) build up a queue of textures to be uploaded, then this queue is consumed by the thread that owns the GL context (mind your access to that queue by the various threads however). You could also consider a non-threaded approach of uploading N textures per frame, where N is a number that doesn't slow the rendering down too much.