Reading data back from the GPU is a very slow operation if you do it synchronously with the rest of your application. One possibility is to read asynchronously with the help of Pixel Buffer Objects. Unfortunately, I cannot see how this is actually done.
First, I create a Pixel Buffer Object:
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, pbo_size, 0, GL_DYNAMIC_READ);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
Then I want to read pixels from a Frame Buffer Object:
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, 0);
GLfloat *ptr = (GLfloat *)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
memcpy(pixels, ptr, pbo_size);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
But how is this asynchronous? Is glReadPixels or glMapBufferRange blocking the application until the GPU is 'ready'?
The glReadPixels call starts the copy into a CPU-visible buffer (whenever the command actually gets submitted to the GPU; you can force submission with glFlush). So you are starting the read asynchronously.
glMapBufferRange will force the glReadPixels call to finish if it hasn't already, since you are now accessing the data on the CPU; there is no way around that.
So don't do the two calls back-to-back; map the buffer significantly later, e.g. a frame or two after the read.
To add on to Bahbar's answer:
After glReadPixels, if you plan on reading back the data, I believe you should call glMemoryBarrier(GL_PIXEL_BUFFER_BARRIER_BIT);.
After glReadPixels and glMemoryBarrier, you can create a fence sync with glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0). Then either check whether the GPU has finished executing all commands before the sync with glGetSynciv(fence_sync, GL_SYNC_STATUS, sizeof(GLint), NULL, &result), or wait for the GPU to finish executing all commands before the sync with glClientWaitSync(fence_sync, GL_SYNC_FLUSH_COMMANDS_BIT, nanosecond_timeout).
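Putting the pieces together, a minimal sketch of the deferred readback might look like this (pbo, pbo_size, w, h and pixels are the names from the question; the fence handling follows the answers above and is only one way to do it):
// Frame N: start the asynchronous read into the PBO.
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, 0);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush(); // make sure the commands are actually submitted to the GPU
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// ...do other CPU/GPU work here...

// Later (e.g. the next frame): map only once the fence has signaled.
GLint status = GL_UNSIGNALED;
glGetSynciv(fence, GL_SYNC_STATUS, sizeof(GLint), NULL, &status);
if (status == GL_SIGNALED)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    GLfloat *ptr = (GLfloat *)glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, pbo_size, GL_MAP_READ_BIT);
    memcpy(pixels, ptr, pbo_size);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    glDeleteSync(fence);
}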
As the question implies, I'm trying to transfer some buffer data from the gpu to the cpu and I want to do it fast.
Specifically, I'd like to transfer a 640x480 float buffer in less than 1ms.
Question 1: Is this possible in less than 1ms?
Whether it is or not, I'd like to find out what the fastest way is. Everything I've tried up to this point uses FBOs. Here are the different methods and their respective average transfer times. All of these are run right after binding the FBO and rendering to the textures. As I'm no expert, there might be mistakes in my code or I might be doing unnecessary steps, so please let me know. The transfers, however, have all been verified to be successful. I transfer everything into cv::Mat objects.
1) Using glReadPixels - <3.1 ms
glBindTexture(GL_TEXTURE_2D,depthTexture);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT,mat.data);
2) Using glGetTexImage - <2.9 ms
glBindTexture(GL_TEXTURE_2D,depthTexture);
glGetTexImage(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, GL_FLOAT, mat.data);
3) Using PBO with glGetTexImage - <2.3 ms
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, depthTexture);
glGetTexImage(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, GL_FLOAT, 0);
mat.data = (uchar*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
Before I go on, I understand that I don't use PBOs to their full potential since I immediately call glMapBuffer, but there is no other work for the CPU to do at the moment. The texture is drawn the moment I have the necessary data, and the texture data is necessary for me to move on. Despite all this, PBOs still seem faster.
Here's something interesting (to me at least). These are measured in debug mode. In release mode they are all 1 ms slower.
Question 2: Why are they slower in release mode? Can I change this?
Question 3: Are there any other ways I can try to do the transfer?
Extra notes on Q3:
I read somewhere that the integrated graphics card can have faster access. Is this a thing? How would I make use of this? Is this connected to GL_INTEL_map_texture?
I barely know what CUDA is but it seems there is a way to do the transfer faster using it. Is this true?
Will reading the depth buffer instead of a texture be faster?
I am reading a single pixel's depth from the framebuffer to implement picking. Originally my glReadPixels() was taking a very long time (5ms or so) and on nVidia it would even burn 100% CPU during that time. On Intel it was slow as well, but with idle CPU.
Since then, I have used the Pixel Buffer Object (PBO) functionality to make glReadPixels asynchronous, and also double-buffered it using this well-known example.
This approach works well and lets me make the glReadPixels() call asynchronous, but only if I read RGBA values. If I use the same PBO approach to read depth values, glReadPixels() blocks again.
Reading RGBA: glReadPixels() takes 12µs.
Reading DEPTH: glReadPixels() takes 5ms.
I tried this on nVidia and Intel drivers, with different format/type combinations. I tried:
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, 0 );
and:
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );
and:
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_STENCIL, GL_FLOAT_32_UNSIGNED_INT_24_8_REV, 0 );
None of these would result in an asynchronous glReadPixels() call. But if I read RGBA values with the following call:
glReadPixels( srcx, srcy, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, 0 );
then glReadPixels() returns immediately and no longer blocks.
Before reading the single pixel, I do:
glReadBuffer( GL_FRONT );
glBindBuffer( GL_PIXEL_PACK_BUFFER, pboid );
And I create the double buffered PBO with:
glGenBuffers( NUMPBO, pboids );
for ( int i=0; i<NUMPBO; ++i )
{
const int pboid = pboids[i];
glBindBuffer( GL_PIXEL_PACK_BUFFER, pboid );
glBufferData( GL_PIXEL_PACK_BUFFER, DATA_SIZE, 0, GL_STREAM_READ );
...
I create my framebuffer using SDL2 with depth size 24, stencil size 8, and the default double buffer.
I am using OpenGL Core Profile 3.3 on Ubuntu LTS.
I don't actually read the pixel depth (via glMapBuffer) until the next frame, so there is no synchronization going on. The glReadPixels call should have triggered an async operation and returned immediately (as it does for RGBA). But it does not when reading depth.
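For reference, the frame-delayed readback looks roughly like this (a sketch; writeidx, readidx and pickeddepth are hypothetical names, with the two indices alternating between the NUMPBO buffers each frame):
// Frame N: start an async read of the picked pixel into one PBO.
glReadBuffer( GL_FRONT );
glBindBuffer( GL_PIXEL_PACK_BUFFER, pboids[writeidx] );
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, 0 );

// Frame N+1: map the PBO that was filled during the previous frame.
glBindBuffer( GL_PIXEL_PACK_BUFFER, pboids[readidx] );
const GLfloat *ptr = (const GLfloat *)glMapBuffer( GL_PIXEL_PACK_BUFFER, GL_READ_ONLY );
if ( ptr )
{
    pickeddepth = *ptr;
    glUnmapBuffer( GL_PIXEL_PACK_BUFFER );
}
glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );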
That would require there to be two depth buffers. But there aren't. Multi-buffering refers to the number of color buffers, since those are what actually get displayed. Implementations pretty much never give you multiple depth buffers.
In order to service a read from the depth buffer, that read has to happen before "the next frame" takes place. So there would need to be synchronization.
Generally speaking, it's best to read from your own images. That way, you have complete control over things like format, when they get reused, and the like, so that you can control issues of synchronization. If you need two depth buffers so that you can read from one while using the other, then you need to create that.
And FYI: reading from the default framebuffer at all is dubious due to pixel ownership issues and such. But reading from the front buffer is pretty much always the wrong thing.
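A minimal sketch of that approach, assuming the scene is rendered into an FBO you own with a depth texture attached (all names here are hypothetical, and whether this actually makes the depth read non-blocking is up to the driver):
// One-time setup: a depth texture attached to your own FBO.
GLuint depthTex, myFbo;
glGenTextures(1, &depthTex);
glBindTexture(GL_TEXTURE_2D, depthTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT24, width, height, 0,
             GL_DEPTH_COMPONENT, GL_UNSIGNED_INT, NULL);
glGenFramebuffers(1, &myFbo);
glBindFramebuffer(GL_FRAMEBUFFER, myFbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, depthTex, 0);
// ...attach a color texture as well, render the frame into this FBO,
// then blit or draw the color result to the default framebuffer...

// Per frame: read the picked depth from the FBO you own instead of GL_FRONT.
glBindFramebuffer(GL_READ_FRAMEBUFFER, myFbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pboid);
glReadPixels(srcx, srcy, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, 0);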
I have the following pipeline:
1. Render into a texture attachment of a custom FBO.
2. Bind that texture attachment as an image.
3. Run a compute shader, sampling from the image above using imageLoad/Store.
4. Write the results into an SSBO or image.
5. Map the SSBO (or image) as a CUDA CUgraphicsResource and process the data from that buffer using CUDA.
Now, the problem is synchronizing the data between stages 4 and 5. Here are the sync solutions I have tried.
glFlush - doesn't really work, as it doesn't guarantee completion of all the submitted commands.
glFinish - this one works, but it is not recommended as it waits for completion of all commands submitted to the driver.
ARB_sync - here it is said that it is not recommended because it heavily impacts performance.
glMemoryBarrier - this one is interesting, but it simply doesn't work.
Here is an example of the code:
glMemoryBarrier(GL_ALL_BARRIER_BITS);
And also tried:
glTextureBarrierNV()
The code execution goes like this:
//rendered into the fbo...
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(16, 16, 1);
glFinish(); // <-- must sync here, otherwise the CUDA buffer doesn't receive all the data
//cuda maps the image to CUDA buffer here..
Moreover, I tried unbinding the FBOs and unbinding the textures from the context before launching the compute shader. I even tried to launch one compute after another with a glMemoryBarrier set between them, and fetching the target image from the first compute launch into CUDA. Still no sync. (Well, that makes sense, as two computes also run out of sync with each other.)
I insert the barrier after the compute shader stage, but it doesn't sync. It only works when I replace it with glFinish, or with any other operation which completely stalls the pipeline, like glMapBuffer(), for example.
So should I just use glFinish(), or am I missing something here?
Why doesn't glMemoryBarrier() sync the compute shader work before CUDA takes over control?
UPDATE
I would like to refactor the question a little bit as the original one is pretty old. Nevertheless, even with the latest CUDA and Video Codec SDK (NVENC) the issue is still alive. So, I don't care about why glMemoryBarrier doesn't sync. What I want to know is:
Whether it is possible to synchronize the completion of OpenGL compute shader execution with CUDA's usage of that shared resource (in my case, an OpenGL image) without stalling the whole rendering pipeline.
If the answer is 'yes', then how?
I know this is an old question, but if any poor soul stumbles upon this...
First, the reason glMemoryBarrier does not work: it only makes the OpenGL driver insert a barrier into the OpenGL pipeline, and CUDA does not care about the OpenGL pipeline at all.
Second, the only other way outside of glFinish is to use glFenceSync in combination with glClientWaitSync:
....
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute(16, 16, 1);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
... other work you might want to do that does not impact the buffer...
GLenum res = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutInNs);
if(res == GL_TIMEOUT_EXPIRED || res == GL_WAIT_FAILED) {
...handle timeouts and failures
}
cudaGraphicsMapResources(1, &gfxResource, stream);
...
This will cause the CPU to block until the GPU has finished all commands up to the fence. This includes memory transfers and compute operations.
Unfortunately, there is no way to tell CUDA to wait on an OpenGL memory barrier/fence. If you really require the extra bit of asynchronicity, you'll have to switch to DirectX 12, for which CUDA supports importing fences/semaphores and waiting on as well as signaling them from a CUDA stream via cuImportExternalSemaphore, cuWaitExternalSemaphoresAsync, and cuSignalExternalSemaphoresAsync.
I want to copy parts of one texture that is already in video memory to a subregion of another texture that is also already in video memory. Fast. Without going through CPU side memory.
That's the way I try to do it:
glFramebufferTexture2D(GL_READ_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, src_texId, 0);
glReadBuffer(GL_COLOR_ATTACHMENT0);
glBindTexture(GL_TEXTURE_2D, dst_texId);
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, dst_x, dst_y, src_x, src_y, width, height);
glBindTexture(GL_TEXTURE_2D, 0);
The code compiles, and my destination texture does receive an update, but it does not seem to work correctly, as it is updated with blueish junk data. Is my approach wrong?
I want to use CUDA to manipulate a texture which I use in OpenGL. Knowing that I need to use a PBO for this, I wonder if I have to recreate the texture every time I make changes to the PBO, like this:
// Select the appropriate buffer
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, bufferID);
// Select the appropriate texture
glBindTexture( GL_TEXTURE_2D, textureID);
// Make a texture from the buffer
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, Width, Height, GL_BGRA, GL_UNSIGNED_BYTE, NULL);
Do glTexSubImage2D and the like copy the data from the PBO?
All pixel transfer operations work with buffer objects. Since glTexSubImage2D initiates a pixel transfer operation, it can be used with buffer objects.
There is no long-term connection made between buffer objects used for pixel transfers and textures. The buffer object is used much like a client memory pointer would be used for glTexSubImage2D calls. It's there to store the data while OpenGL formats and pulls it into the texture. Once it's done, you can do whatever you want with it.
The only difference is that, because OpenGL manages the buffer object, the upload from the buffer can be asynchronous. Well, that, and you get to play games like filling the buffer object from GPU operations (whether from OpenGL, OpenCL, or CUDA).
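For illustration, a buffer-backed upload in that style might look roughly like this (a sketch using the names from the question; the 4-byte BGRA layout and the buffer-orphaning step are assumptions):
// Fill the PBO (here from the CPU; CUDA or OpenCL could write into it instead).
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, bufferID);
glBufferData(GL_PIXEL_UNPACK_BUFFER, Width * Height * 4, NULL, GL_STREAM_DRAW); // orphan the old storage
void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
// ...write Width * Height * 4 bytes of BGRA pixels into dst...
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

// Start the (potentially asynchronous) transfer from the buffer into the existing texture.
glBindTexture(GL_TEXTURE_2D, textureID);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, Width, Height, GL_BGRA, GL_UNSIGNED_BYTE, NULL);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

// The same textureID can be re-filled from the buffer every frame; there is no need to recreate it.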