I am reading a single pixel's depth from the framebuffer to implement picking. Originally my glReadPixels() was taking a very long time (5ms or so) and on nVidia it would even burn 100% CPU during that time. On Intel it was slow as well, but with idle CPU.
Since then, I have used Pixel Buffer Objects (PBOs) to make the glReadPixels() call asynchronous and double buffered, following this well-known example.
This approach works well and lets me make the glReadPixels() call asynchronous, but only if I read RGBA values. If I use the same PBO approach to read depth values, glReadPixels() blocks again.
Reading RGBA: glReadPixels() takes 12µs.
Reading DEPTH: glReadPixels() takes 5ms.
I tried this on both the nVidia and Intel drivers, with different format/type combinations. I tried:
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, 0 );
and:
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );
and:
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_STENCIL, GL_FLOAT_32_UNSIGNED_INT_24_8_REV, 0 );
None of these would result in an asynchronous glReadPixels() call. But if I read RGBA values with the following call:
glReadPixels( srcx, srcy, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, 0 );
then glReadPixels() returns immediately and no longer blocks.
Before reading the single pixel, I do:
glReadBuffer( GL_FRONT );
glBindBuffer( GL_PIXEL_PACK_BUFFER, pboid );
And I create the double buffered PBO with:
glGenBuffers( NUMPBO, pboids );
for ( int i=0; i<NUMPBO; ++i )
{
    const int pboid = pboids[i];
    glBindBuffer( GL_PIXEL_PACK_BUFFER, pboid );
    glBufferData( GL_PIXEL_PACK_BUFFER, DATA_SIZE, 0, GL_STREAM_READ );
    ...
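Each frame I alternate which PBO receives the new read and which one gets mapped, roughly like this (a simplified sketch; the frame/index handling is illustrative, not my exact code):

// Frame N: start the async read into one PBO, map the other PBO
// (which was filled during frame N-1).
const int writeidx = frame % NUMPBO;
const int readidx  = ( frame + 1 ) % NUMPBO;

glReadBuffer( GL_FRONT );
glBindBuffer( GL_PIXEL_PACK_BUFFER, pboids[ writeidx ] );
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, 0 );  // expected to return immediately

glBindBuffer( GL_PIXEL_PACK_BUFFER, pboids[ readidx ] );
const float* depth = (const float*) glMapBuffer( GL_PIXEL_PACK_BUFFER, GL_READ_ONLY );
if ( depth )
{
    // use depth[0], which was read during the previous frame
    glUnmapBuffer( GL_PIXEL_PACK_BUFFER );
}
glBindBuffer( GL_PIXEL_PACK_BUFFER, 0 );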
I create my framebuffer using SDL2 with depth size 24, stencil size 8, and the default double buffer.
I am using OpenGL Core Profile 3.3 on Ubuntu LTS.
I don't actually read the pixel depth (via glMapBuffer) until the next frame, so there is no synchronization going on. The glReadPixels() call should trigger an async operation and return immediately (as it does for RGBA), but it does not when reading depth.
That would require there to be two depth buffers. But there aren't. Multi-buffering refers to the number of color buffers, since those are what actually get displayed. Implementations pretty much never give you multiple depth buffers.
In order to service a read from the depth buffer, that read has to happen before "the next frame" takes place. So there would need to be synchronization.
Generally speaking, it's best to read from your own images. That way, you have complete control over things like format, when they get reused, and the like, so that you can control issues of synchronization. If you need two depth buffers so that you can read from one while using the other, then you need to create that.
And FYI: reading from the default framebuffer at all is dubious due to pixel ownership issues and such. But reading from the front buffer is pretty much always the wrong thing.
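If it helps, a rough sketch of that approach, rendering into your own FBOs with two depth texture attachments and alternating between them (the names and the color attachment setup are illustrative, error checking omitted):

// Two depth textures so that one can be read back while the other is being rendered to.
GLuint depthtex[2], fbo[2];
glGenTextures( 2, depthtex );
glGenFramebuffers( 2, fbo );
for ( int i = 0; i < 2; ++i )
{
    glBindTexture( GL_TEXTURE_2D, depthtex[i] );
    glTexImage2D( GL_TEXTURE_2D, 0, GL_DEPTH24_STENCIL8, width, height, 0,
                  GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, 0 );
    glBindFramebuffer( GL_FRAMEBUFFER, fbo[i] );
    glFramebufferTexture2D( GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT,
                            GL_TEXTURE_2D, depthtex[i], 0 );
    // ... attach a color texture/renderbuffer here as well, then check completeness ...
}

// Per frame: render into fbo[frame % 2], blit its color buffer to the default framebuffer,
// then start the PBO read from the *other* FBO, whose rendering finished a frame ago.
glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo[ (frame + 1) % 2 ] );
glBindBuffer( GL_PIXEL_PACK_BUFFER, pboid );
glReadPixels( srcx, srcy, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, 0 );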
Related
I am trying to make ffmpeg decode frames and convert the pixels into rgb8 format, write them into a mapped pixel buffer, and use buffer streaming to update an OpenGL texture, which is then rendered to an SDL window.
The decoding and uploading happen in a dedicated thread (sws_scale writes into the mapped buffer), and the rendering is done in a render thread using another context with sharing. (The PBO actually holds several frames, and the texture is a 2D array texture, to decouple the positions.)
Things work fine if I flush the mapped range in the decoding thread and use glTextureSubImage3D in the render thread to update the texture at the needed index. The integrated Intel GPU works fast (as it should) in this scenario, but the NV driver complains: "Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering."
I thought the reason might be that only glTextureSubImage3D actually does the upload, so I moved the glTextureSubImage3D call right after the flush operation. This time the NV GPU works fine and the warning disappears, whereas the Intel GPU gives a black window and only shows decoded content on closing.
The code is something like this:
//render thread
void RenderFrame(SDL_Window* window, GLobjects& glo, int index, int width, int height) {
    glUniform1f(glo.index_location, index);
    //The function in question
    glTextureSubImage3D(glo.texture, 0, 0, 0, index, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(index * width * height * 4));
    glClear(GL_COLOR_BUFFER_BIT);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    SDL_GL_SwapWindow(window);
}

//Decode thread
int DecodeFrameToPBO(GLobjects& glo, int index) {
    //fill the mapped range needed
    glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER, index * width * height * 4, 4 * width * height);
    //The function in question
    //glTextureSubImage3D(glo.texture, 0, 0, 0, index, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(index * width * height * 4));
}
I'm really confused by the idea of client-side memory and how the driver asynchronously uploads the texture. Where exactly is the upload supposed to happen, and what does glTextureSubImage3D actually do when a GL_PIXEL_UNPACK_BUFFER is bound?
EDIT:
After adding a glFlush() call to flush the upload context's command queue after each upload, the Intel version works properly, without a black screen.
UPDATE:
Adding the glFlush() makes the NV GPU emit the warning "Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering." again, and on the same video sample GPU utilization grew from 8% to 10%. It seems the glFlush() triggers some internal synchronization that perhaps makes things busy-wait. The Intel GPU cannot work without the glFlush(), even with a glClientWaitSync() version that has the flush-commands bit set, and calling flush explicitly on the render side does not work either. So what should be done to make both drivers happy (and reduce utilization)?
I think you are misled by the nvidia warning. It does not imply that there is a CPU-GPU synchronization; it only tells you that rendering from the texture is synchronized with (has to wait for) the texture upload, which is fine. See this answer for more details.
So my answer is: there is no issue, hence the solution is to not change anything.
I thought the reason might be that only glTextureSubImage3D actually does the upload, so I moved the glTextureSubImage3D call right after the flush operation [into the decode thread].
If you do that, you now have to manually synchronize the rendering with the texture upload, or you will encounter half-written frames or even undefined content at times, and basically have a race condition.
You could do such synchronization with OpenGL Sync Objects. But in the end, you would not get more performance than in your original variant.
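For completeness, a minimal sketch of what that synchronization could look like, with a fence created in the decode thread and waited on (GPU-side) in the render thread; how you hand the sync handle between threads is up to you:

// Decode thread, right after the glTextureSubImage3D call for slot `index`:
GLsync uploadDone = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();  // ensure the upload and the fence are actually submitted to the GPU
// pass `uploadDone` to the render thread together with `index`

// Render thread, before drawing with slot `index`:
glWaitSync(uploadDone, 0, GL_TIMEOUT_IGNORED);  // the GPU waits; the CPU is not blocked
glDeleteSync(uploadDone);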
whereas the Intel GPU gives a black window, and only shows decoded content on closing
It is not clear if this is only a result of the missing synchronization, a bug in your code, or even a driver bug.
I know DirectX, for DX9 at least, has a texture object where you are able to get only a small portion of the texture into CPU-accessible memory. It was a function called "LockRect", I believe. OpenGL has glGetTexImage(), but it grabs the entire image, and if the format isn't the same as the texture's, it is going to have to convert the entire texture into the new pixel format on top of transferring the entire texture. This function is also not in OpenGL ES. Framebuffers are another option: I could bind a framebuffer whose color attachment is connected to the texture. Then there is glReadPixels(), which reads from the framebuffer, so it should be reading from the texture. glReadPixels() has limited pixel format options, so a conversion is going to have to happen, but I can read just the pixels I need (which is only 1 pixel). I haven't used this method, but it seems possible. Can anyone confirm that the framebuffer method is a working alternative? That method would also work for OpenGL ES 2+.
Are there any other methods? How efficient is the framebuffer method (if it works)? Does it end up converting the entire texture to the desired format before it reads the pixels, or is that entirely implementation defined?
Edit: @Nicol_Bolas Please stop removing OpenGL from the tags and adding OpenGL-ES; OpenGL-ES isn't applicable here, OpenGL is. This is for OpenGL specifically, but I would like it to be OpenGL ES 2+ compatible if possible, though it doesn't have to be. If an OpenGL-only solution is available, I will consider it if it is worth the trade-off. Thank you.
Please note, I do not have that much experience with ES in particular, so there might be better ways to do this specifically in that context. The general gist applies in either plain OpenGL or ES, though.
First off, the most important performance consideration should be when you are doing the reading. If you request data from the video card while you are rendering, your program (the CPU end) will have to halt until the video card returns the data, which will slow rendering due to your inability to issue further render commands. As a general rule, you should always upload, render, download - do not mix any of these processes, it will impact speed immensely, and how much so can be very driver/hardware/OS dependent.
I suggest using glReadPixels( ) at the end of your render cycle. I suspect the limitations on formats for that function are connected to limitations on framebuffer formats; besides, you really should be using 8 bit unsigned or floating point, both of which are supported. If you have some fringe case not allowing any of those supported formats, you should explain what that is, as there may be a way to handle it specifically.
If you need the contents of the framebuffer at a specific point in rendering (rather than the end), create a second texture + framebuffer (again with the same format) to be an effective "backbuffer" and then copy from the target framebuffer to that texture. This occurs on the video card, so it does not impose the bus latency directly reading does. Here is something I wrote that does this operation:
glActiveTexture( GL_TEXTURE0 + unit );
glBindTexture( GL_TEXTURE_2D, backbufferTextureHandle );
glBindFramebuffer( GL_READ_FRAMEBUFFER, framebufferHandle );
glCopyTexSubImage2D(
GL_TEXTURE_2D,
0, // level
0, 0, // offset
0, 0, // x, y
screenX, screenY );
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, framebufferHandle );
Then when you want the data, bind the backbuffer to GL_READ_FRAMEBUFFER and use glReadPixels( ) on it.
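In code, that readback might look roughly like this (assuming backbufferFramebufferHandle is a framebuffer with backbufferTextureHandle as its color attachment; both names are illustrative):

glBindFramebuffer( GL_READ_FRAMEBUFFER, backbufferFramebufferHandle );
glReadBuffer( GL_COLOR_ATTACHMENT0 );

GLubyte pixel[ 4 ];
glReadPixels( x, y, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel );  // stalls until the data arrives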
Finally, you should keep in mind that a download of data will still halt the CPU end. If you download before displaying the framebuffer, you will put off displaying the image until after you can again execute commands, which might result in visible latency. As such, I suggest still using a non-default framebuffer even if you only care about the final buffer state, and ending your render cycle to the effect of:
(1.) Blit to the default framebuffer:
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, 0 ); // Default framebuffer
glBindFramebuffer( GL_READ_FRAMEBUFFER, framebufferHandle );
glBlitFramebuffer(
0, 0, screenX, screenY,
0, 0, screenX, screenY,
GL_COLOR_BUFFER_BIT,
GL_NEAREST );
(2.) Call whatever your swap buffers command may be in your given situation.
(3.) Your download call from the framebuffer (be it glReadPixels( ) or something else).
As for the speed impact of the blit/texcopy operations, it's quite good on most modern hardware and I have not found it to have a noticeable impact even when done 10+ times a frame, but if you are dealing with antiquated hardware, it might be worth a second thought.
I have written an emulator which I am in the process of porting to Linux. At the moment, for the video I am using Direct3D 11, which I am porting to OpenGL (and running on Windows for now). I render to a 1024x1024 texture which I upload to memory every frame (the original hardware doesn't really lend itself to modern hardware acceleration, so I just do it all in software). However, I have found that uploading the texture in OpenGL is a lot slower.
In Direct3D uploading the texture every frame drops the frame rate from 416 to 395 (a 5% drop). In OpenGL it drops from 427 to 297 (a 30% drop!).
Here's the relevant code from my draw function.
Direct3D:
D3D11_MAPPED_SUBRESOURCE resource;
deviceContext_->Map(texture, 0, D3D11_MAP_WRITE_DISCARD, 0, &resource);
uint32_t *buf = reinterpret_cast<uint32_t *>(resource.pData);
memcpy(buf, ...);
deviceContext_->Unmap(texture, 0);
OpenGL:
glBindTexture(GL_TEXTURE_2D, texture);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA,
GL_UNSIGNED_BYTE, textureBuffer);
Can anyone suggest what may be causing this slowdown?
If it makes any odds, I'm running Windows 7 x64 with an NVIDIA GeForce GTX 550 Ti.
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, textureBuffer);
You're doing several things wrong here. First, calling glTexImage2D every frame is the equivalent of creating a new Direct3D texture resource every frame. But in your D3D code you're not creating the resource each frame; you're just uploading to it. You should use glTexImage2D only once per mipmap layer of interest; after that, all uploading should happen with glTexSubImage2D.
Second, your internal format (third parameter from the left) is GL_RGBA. You should always use explicit sizes for your image formats. So use GL_RGBA8. This isn't really a problem, but you should get into the habit now.
Third, you're using GL_RGBA ordering for your pixel transfer format (the third parameter from the right, not the left). This is generally not the most optimal pixel transfer format, as lots of hardware tends to prefer GL_BGRA ordering. But if you're not getting your data from whatever is producing it in that order, then there's not much that can be done.
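Putting the first three points together (and keeping GL_RGBA as the transfer format, since that is presumably the order your software renderer produces), the upload path would look roughly like this:

// Once, at initialization: allocate the texture storage
// (this is the equivalent of creating the D3D resource).
glBindTexture(GL_TEXTURE_2D, texture);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 1024, 1024, 0, GL_RGBA,
             GL_UNSIGNED_BYTE, NULL);

// Every frame: only update the contents of the existing storage.
glBindTexture(GL_TEXTURE_2D, texture);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1024, 1024, GL_RGBA,
                GL_UNSIGNED_BYTE, textureBuffer);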
Fourth, if you have something else you can do between starting the upload and actually rendering with it, you can employ asynchronous pixel transfer operations. You write your data to a buffer object (which can be mapped, so that you don't have to copy into it). Then you use glTexSubImage2D to transfer this data to OpenGL. Because the source data and the destination image are part of OpenGL's memory, it doesn't have to copy the data out of client memory before glTexSubImage2D returns.
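A sketch of that approach; the buffer re-specification ("orphaning") via glBufferData is one common way to keep per-frame uploads from stalling on each other, and the sizes assume your 1024x1024 RGBA texture:

// Once: create the unpack PBO.
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 1024 * 1024 * 4, NULL, GL_STREAM_DRAW);

// Every frame:
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 1024 * 1024 * 4, NULL, GL_STREAM_DRAW);  // orphan the old storage
void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
// Ideally the emulator would render straight into `dst`; a memcpy works too.
memcpy(dst, textureBuffer, 1024 * 1024 * 4);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

// With an unpack PBO bound, the last argument is an offset into the buffer, not a pointer.
glBindTexture(GL_TEXTURE_2D, texture);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1024, 1024, GL_RGBA,
                GL_UNSIGNED_BYTE, (void*)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);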
Granted, that's probably not going to help you much, since you're already effectively doing that copy in the D3D case.
In OpenGL it drops from 427 to 297 (a 30% drop!)
The more important statistic is that it's about a 1 millisecond difference: 1000/427 ≈ 2.34 ms per frame versus 1000/297 ≈ 3.37 ms per frame. You should look at your timings in absolute time, not in frames per second, nor in percentage drops of FPS.
glTexImage2D reallocates the texture's memory as well as updating it. Try to use glTexSubImage2D instead.
After having used PyOpenGL happily for some time, I'm now seriously stuck. I am working on a Python package that allows me to use GLSL shaders and OpenCL programs for image processing, using textures as the standardized way to get my data in and out of the GLSL shaders and OpenCL programs.
Everything works, except that I can not succeed in copying a texture into a pbo (pixel buffer object).
I'm using PBOs to get my texture data into and out of OpenCL, and that works nicely and fast in PyOpenCL: I can copy my OpenCL output from its
PBO to a texture and display it, and I can also load data from the CPU into a PBO. But I am hopelessly stuck trying to fill my PBO with texture data already on the GPU, which is what I need to do to load the images produced by my GLSL shaders into OpenCL for further processing.
I've read about two ways to do this:
variant 1 binds the pbo, binds the texture and uses glGetTexImage()
variant 2 attaches the texture to a frame buffer object, binds the fbo and the pbo and uses glReadPixels()
I also read that the PyOpenGL versions of both glReadPixels() and glGetTexImage() have trouble with the 'Null'-pointers one should use when having a bound pbo, so for that reason I am using the OpenGL.raw.GL variants.
But in both these cases I get an 'Invalid Operation' error, and I really do not see what I am doing wrong. Below are two versions
of the load_texture() method of my pixel buffer Python class; I hope I didn't strip them down too far...
variant 1:
def _load_texture(self, texture):
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, self.id)
    glEnable(texture.target)
    glActiveTexture(GL_TEXTURE0_ARB)
    glBindTexture(texture.target, texture.id)
    OpenGL.raw.GL.glGetTexImage(texture.target, 0, texture.gl_imageformat,
                                texture.gl_dtype, ctypes.c_void_p(0))
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0)
    glDisable(texture.target)
variant 2:
def _load_texture(self, texture):
    fbo = FrameBufferObject.from_textures([texture])
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           texture.target, texture.id, 0)
    glReadBuffer(GL_COLOR_ATTACHMENT0)
    glBindFramebuffer(GL_FRAMEBUFFER, fbo.id)
    glBindBuffer(GL_PIXEL_PACK_BUFFER, self.id)
    OpenGL.raw.GL.glReadPixels(0, 0, self.size[0], self.size[1],
                               texture.gl_imageformat, texture.gl_dtype,
                               ctypes.c_void_p(0))
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0)
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_RECTANGLE_ARB, 0, 0)
    glBindFramebuffer(GL_FRAMEBUFFER, 0)
EDIT (adding some information about the error and the initialization of my PBO):
The error I am getting for variant 1 is:
OpenGL.error.GLError: GLError(
    err = 1282,
    description = 'invalid operation',
    baseOperation = glGetTexImage,
    cArguments = (
        GL_TEXTURE_RECTANGLE_ARB,
        0,
        GL_RGBA,
        GL_UNSIGNED_BYTE,
        c_void_p(None),
    )
)
and I'm initializing my PBO like this:
self.usage = usage
if isinstance(size, tuple):
    size = size[0] * size[1] * self.imageformat.planecount
bytesize = self.imageformat.get_bytesize_per_plane() * size
glBindBuffer(self.arraytype, self.id)
glBufferData(self.arraytype, bytesize, None, self.usage)
glBindBuffer(self.arraytype, 0)
The self.arraytype is GL_ARRAY_BUFFER; for self.usage I have tried all the possibilities just in case, but GL_STREAM_READ seemed the most logical for my type of use.
The size I am typically using is 1024 by 1024 with 4 planes and 1 byte per plane, since they are unsigned 8-bit integers. This works fine when transferring pixel data from the host.
Also, I am on Kubuntu 11.10, using an NVIDIA GeForce GTX 580 with 3 GB of memory on the GPU, with the proprietary driver, version 295.33.
What am I missing?
Found a solution myself without really understanding why it makes that huge difference.
The code I had (for both variants) was basically correct, but it needs the call to glBufferData in there for it to work. I already made that identical call when initializing my PBO in my original code, but my guess is that there was enough going on between that initialization and my attempt to load the texture for the PBO memory somehow to become deallocated in the meantime.
Now I have only moved that call closer to my glGetTexImage call, and it works without changing anything else.
Strange; I'm not sure if that is a bug or a feature, or whether it is related to PyOpenGL, to the NVIDIA driver, or to something else. If it is expected behaviour, it sure is not documented anywhere easy to find.
The variant 1 code below works and is mighty fast too; variant 2 works fine as well when treated in the same way, but at about half the speed.
def _load_texture(self, texture):
    bytesize = (self.size[0] * self.size[1] *
                self.imageformat.planecount *
                self.imageformat.get_bytesize_per_plane())
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, self.id)
    glBufferData(GL_PIXEL_PACK_BUFFER_ARB,
                 bytesize,
                 None, self.usage)
    glEnable(texture.target)
    glActiveTexture(GL_TEXTURE0_ARB)
    glBindTexture(texture.target, texture.id)
    OpenGL.raw.GL.glGetTexImage(texture.target, 0, texture.gl_imageformat,
                                texture.gl_dtype, ctypes.c_void_p(0))
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0)
    glDisable(texture.target)
I want to manipulate a texture which I use in OpenGL using CUDA. Knowing that I need to use a PBO for this, I wonder if I have to recreate the texture every time I make changes to the PBO, like this:
// Select the appropriate buffer
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, bufferID);
// Select the appropriate texture
glBindTexture( GL_TEXTURE_2D, textureID);
// Make a texture from the buffer
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, Width, Height,GL_BGRA, GL_UNSIGNED_BYTE, NULL);
Do glTexSubImage2D and the like copy the data from the PBO?
All pixel transfer operations work with buffer objects. Since glTexSubImage2D initiates a pixel transfer operation, it can be used with buffer objects.
There is no long-term connection made between buffer objects used for pixel transfers and textures. The buffer object is used much like a client memory pointer would be used for glTexSubImage2D calls. It's there to store the data while OpenGL formats and pulls it into the texture. Once it's done, you can do whatever you want with it.
The only difference is that, because OpenGL manages the buffer object, the upload from the buffer can be asynchronous. Well that and you get to play games like filling the buffer object from GPU operations (whether from OpenGL, OpenCL, or CUDA).
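To make that concrete, a sketch of one update cycle (here the new data comes from the CPU via glBufferSubData into a hypothetical pixels array; with CUDA you would instead write into the buffer through cudaGraphicsMapResources / cudaGraphicsResourceGetMappedPointer before the upload):

// Put new data into the PBO.
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, bufferID);
glBufferSubData( GL_PIXEL_UNPACK_BUFFER, 0, Width * Height * 4, pixels);

// Update the existing texture from the PBO; no glTexImage2D call is needed,
// the texture keeps its storage and only its contents are replaced.
glBindTexture( GL_TEXTURE_2D, textureID);
glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0, Width, Height, GL_BGRA, GL_UNSIGNED_BYTE, NULL);

glBindBuffer( GL_PIXEL_UNPACK_BUFFER, 0);
// After this, the buffer object is free to be refilled (or mapped by CUDA again);
// OpenGL synchronizes the pending upload with any new writes to the buffer.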