I am trying to have ffmpeg decode frames, convert the pixels into rgb8 format, and write them into a mapped pixel buffer (PBO), which is then streamed into an OpenGL texture and rendered to an SDL window.
The decoding and uploading happen in a dedicated thread (sws_scale writes directly to the mapped buffer), and the rendering is done in the render thread, in another context with sharing. (The PBO actually holds several frames, and the texture is a 2D array texture, to decouple the positions.)
Things work fine if I flush the mapped range in the decoding thread and use glTextureSubImage3D in the render thread to update the texture at the needed index. The integrated Intel GPU runs pretty fast (as it should) in this scenario, but the NV driver complains: "Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering."
I thought that might be that only glTextureSubImage3D actually does the upload, so I moved the glTextureSubImage3D right after the flush operation. This time the NV gpu works fine, and the warning disappears, whereas the intel gpu gives a black window, and only shows decoded content on closing.
The code is something like this:
//render thread
void RenderFrame(SDL_Window* window,GLobjects& glo, int index, int width, int height) {
glUniform1f(glo.index_location,index);
//The function in question
glTextureSubImage3D(glo.texture, 0, 0, 0, index, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(index * width * height * 4));
glClear(GL_COLOR_BUFFER_BIT);
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
SDL_GL_SwapWindow(window);
}
//Decode thread
int DecodeFrameToPBO(GLobjects& glo, int index){
//fill the mapped range needed
glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER, index * width * height * 4, 4 * width * height);
//The function in question
//glTextureSubImage3D(glo.texture, 0, 0, 0, index, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(index * width * height * 4));
}
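A side note on the offset expressions above: they multiply plain int values, which can overflow for large frames or many slots, and the result is then cast to a pointer. A minimal sketch of doing the same arithmetic in std::size_t follows, assuming tightly packed 4-byte pixels as in the snippet. FrameBytes, SlotOffset and AsBufferOffset are hypothetical helper names, not part of the original code or of OpenGL.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: every pixel occupies 4 bytes (GL_RGBA + GL_UNSIGNED_BYTE)
// and rows are tightly packed.
constexpr std::size_t kBytesPerPixel = 4;

// Size in bytes of one frame slot inside the PBO.
constexpr std::size_t FrameBytes(int width, int height) {
    return static_cast<std::size_t>(width) *
           static_cast<std::size_t>(height) * kBytesPerPixel;
}

// Byte offset of slot `index`, computed in size_t so that large
// frames or many slots do not overflow a 32-bit int.
constexpr std::size_t SlotOffset(int index, int width, int height) {
    return static_cast<std::size_t>(index) * FrameBytes(width, height);
}

// While a buffer is bound to GL_PIXEL_UNPACK_BUFFER, the "pointer"
// argument of the upload call is interpreted as a byte offset into
// that buffer, so the offset is passed disguised as a pointer.
inline const void* AsBufferOffset(std::size_t offset) {
    return reinterpret_cast<const void*>(static_cast<std::uintptr_t>(offset));
}
```

With these, the last argument of the glTextureSubImage3D call would become AsBufferOffset(SlotOffset(index, width, height)), and the flush call would take SlotOffset(...) and FrameBytes(...) as its offset and length.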
I'm really confused by the idea of client-side memory and how the driver asynchronously uploads the texture. Where exactly is the upload supposed to happen, and what does glTextureSubImage3D actually do when a GL_PIXEL_UNPACK_BUFFER is bound?
EDIT:
After adding a glFlush() command to flush the upload context's command queue after each upload, the intel version works properly without black screen.
UPDATE:
Adding the glFlush() seems to make the NV GPU emit the warning 'Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.' again, and GPU utilization for the same video sample grew from 8% to 10%. It seems glFlush() triggers some internal synchronization that perhaps puts things into a busy wait? The Intel GPU cannot work without the glFlush(): the glClientWaitSync version with the flush commands bit set does not help, and calling flush explicitly on the render side does not work either. So what should be done to make both drivers happy (and reduce utilization)?
I think you are misled by the NVIDIA warning. It does not imply that there is a CPU-GPU synchronization; it only tells you that rendering with the texture is synchronized with (has to wait for) the upload of the texture, which is fine. See this answer for more details.
So my answer is: there is no issue, hence the solution is to not change anything.
I thought that might be that only glTextureSubImage3D actually does the upload, so I moved the glTextureSubImage3D right after the flush operation [into the decode thread].
If you do that, you now have to manually synchronize the rendering with the texture upload, or you will encounter half-written frames or even undefined content at times, and basically have a race condition.
You could do such synchronization with OpenGL Sync Objects. But in the end, you would not get more performance than in your original variant.
whereas the intel gpu gives a black window, and only shows decoded content on closing
It is not clear if this is only a result of the missing synchronization, a bug in your code, or even a driver bug.
Related
As the question implies, I'm trying to transfer some buffer data from the gpu to the cpu and I want to do it fast.
Specifically, I'd like to transfer a 640x480 float buffer in less than 1ms.
Question 1: Is this possible in less than 1ms?
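For scale, the required throughput can be worked out directly. A small sketch, assuming one 32-bit float per pixel and no row padding; BufferBytes and RequiredGBPerSecond are names made up for this illustration.

```cpp
#include <cstddef>

// Total size in bytes of a tightly packed width x height buffer.
constexpr std::size_t BufferBytes(int width, int height,
                                  std::size_t bytesPerPixel) {
    return static_cast<std::size_t>(width) *
           static_cast<std::size_t>(height) * bytesPerPixel;
}

// Sustained throughput (decimal GB/s) needed to move `bytes`
// within `budgetMs` milliseconds.
constexpr double RequiredGBPerSecond(std::size_t bytes, double budgetMs) {
    return static_cast<double>(bytes) / (budgetMs * 1e-3) / 1e9;
}
```

A 640x480 float buffer is 1,228,800 bytes, so a 1 ms budget corresponds to roughly 1.23 GB/s of sustained readback, before any synchronization cost is counted.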
Whether it is or not, I'd like to find out what the fastest way is. Everything I've tried up to this point is by using FBOs. Here are the different methods and their respective average time for transfer. All of these are run right after binding to the FBO and rendering on the textures. As I'm no expert, there might be mistakes in my code or I might be doing unnecessary steps so please let me know. The transfers, however, have all been checked to be successful. I transfer everything to cv::Mat objects.
1)Using glReadPixels - < 3.1ms
glBindTexture(GL_TEXTURE_2D,depthTexture);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT,mat.data);
2)Using glGetTexImage - <2.9ms
glBindTexture(GL_TEXTURE_2D,depthTexture);
glGetTexImage(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, GL_FLOAT, mat.data);
3)Using PBO with glGetTexImage - <2.3ms
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, depthTexture);
glGetTexImage(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, GL_FLOAT, 0);
mat.data = (uchar*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
Before I go on, I understand that I don't use PBOs to their full potential since I immediately call glMapBuffer but there is no other process for the cpu to do at the moment. The texture is drawn the moment I have the necessary data and the texture data is necessary for me to move on. Despite all this, PBOs still seem faster.
Here's something interesting(to me at least). These are measured in debug mode. In release mode they are all 1ms slower.
Question 2: Why are they slower in release mode? Can I change this?
Question 3: Are there any other ways I can try to do the transfer?
Extra notes on Q3:
I read somewhere that the integrated graphics card can have faster access. Is this a thing? How would I make use of this? Is this connected to GL_INTEL_map_texture?
I barely know what CUDA is but it seems there is a way to do the transfer faster using it. Is this true?
Will reading the depth buffer instead of a texture be faster?
I know DirectX (DX9 at least) has a texture object that lets you get only a small portion of the texture into CPU-accessible memory; I believe the function is called "LockRect". OpenGL has glGetTexImage(), but it grabs the entire image, and if the requested format isn't the same as the texture's, it has to convert the entire texture into the new pixel format on top of transferring it. This function is also not in OpenGL ES. Framebuffers are another option: I could potentially bind a framebuffer whose color attachment is connected to the texture. glReadPixels reads from the framebuffer, so it should then be reading from the texture. glReadPixels has limited pixel format options, so a conversion will have to happen, but I can read just the pixels I need (which is only 1 pixel). I haven't used this method, but it seems possible. Can anyone confirm that the framebuffer method is a working alternative? If so, it would also work for OpenGL ES 2+.
Are there any other methods? How efficient is the framebuffer method (if it works), does it end up having to convert the entire texture to the desired format before it reads the pixels or is it entirely implementation defined?
Edit: #Nicol_Bolas Please stop removing OpenGL from the tags and adding OpenGL-ES; OpenGL-ES isn't applicable, OpenGL is. This is for OpenGL specifically, but I would like it to be OpenGL ES 2+ compatible if possible, though it doesn't have to be. If an OpenGL-only solution is available, I will consider whether it is worth the trade-off. Thank you.
Please note, I do not have that much experience with ES in particular, so there might be better ways to do this specifically in that context. The general gist applies in either plain OpenGL or ES, though.
First off, the most important performance consideration should be when you are doing the reading. If you request data from the video card while you are rendering, your program (the CPU end) will have to halt until the video card returns the data, which will slow rendering due to your inability to issue further render commands. As a general rule, you should always upload, render, download - do not mix any of these processes, it will impact speed immensely, and how much so can be very driver/hardware/OS dependent.
I suggest using glReadPixels( ) at the end of your render cycle. I suspect the limitations on formats for that function are connected to limitations on framebuffer formats; besides, you really should be using 8 bit unsigned or floating point, both of which are supported. If you have some fringe case not allowing any of those supported formats, you should explain what that is, as there may be a way to handle it specifically.
If you need the contents of the framebuffer at a specific point in rendering (rather than the end), create a second texture + framebuffer (again with the same format) to be an effective "backbuffer" and then copy from the target framebuffer to that texture. This occurs on the video card, so it does not impose the bus latency directly reading does. Here is something I wrote that does this operation:
glActiveTexture( GL_TEXTURE0 + unit );
glBindTexture( GL_TEXTURE_2D, backbufferTextureHandle );
glBindFramebuffer( GL_READ_FRAMEBUFFER, framebufferHandle );
glCopyTexSubImage2D(
GL_TEXTURE_2D,
0, // level
0, 0, // offset
0, 0, // x, y
screenX, screenY );
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, framebufferHandle );
Then when you want the data, bind the backbuffer to GL_READ_FRAMEBUFFER and use glReadPixels( ) on it.
Finally, you should keep in mind that a download of data will still halt the CPU end. If you download before displaying the framebuffer, you will put off displaying the image until after you can again execute commands, which might result in visible latency. As such, I suggest still using a non-default framebuffer even if you only care about the final buffer state, and ending your render cycle to the effect of:
(1.) Blit to the default framebuffer:
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, 0 ); // Default framebuffer
glBindFramebuffer( GL_READ_FRAMEBUFFER, framebufferHandle );
glBlitFramebuffer(
0, 0, screenX, screenY,
0, 0, screenX, screenY,
GL_COLOR_BUFFER_BIT,
GL_NEAREST );
(2.) Call whatever your swap buffers command may be in your given situation.
(3.) Your download call from the framebuffer (be it glReadPixels( ) or something else).
As for the speed impact of the blit/texcopy operations, it's quite good on most modern hardware and I have not found it to have a noticeable impact even done 10+ times a frame, but if you are dealing with antiquated hardware, it might be worth a second thought.
I have written an emulator which I am in the process of porting to Linux. At the moment to do the video I am using Direct3D 11, which I am porting to OpenGL (which I'm running on Windows for now). I render to a 1024x1024 texture which I upload to memory every frame (the original hardware doesn't really lend itself to modern hardware acceleration, so I just do it all in software). However, I have found that uploading the texture in OpenGL is a lot slower.
In Direct3D uploading the texture every frame drops the frame rate from 416 to 395 (a 5% drop). In OpenGL it drops from 427 to 297 (a 30% drop!).
Here's the relevant code from my draw function.
Direct3D:
D3D11_MAPPED_SUBRESOURCE resource;
deviceContext_->Map(texture, 0, D3D11_MAP_WRITE_DISCARD, 0, &resource);
uint32_t *buf = reinterpret_cast<uint32_t *>(resource.pData);
memcpy(buf, ...);
deviceContext_->Unmap(texture, 0);
OpenGL:
glBindTexture(GL_TEXTURE_2D, texture);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA,
GL_UNSIGNED_BYTE, textureBuffer);
Can anyone suggest what may be causing this slowdown?
If it makes any odds, I'm running Windows 7 x64 with an NVIDIA GeForce GTX 550 Ti.
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, textureBuffer);
You're doing several things wrong here. First, glTexImage2D is the equivalent of creating a Direct3D texture resource every frame. But you're not creating it; you're just uploading to it. You should use glTexImage2D only once per mipmap layer of interest; after that, all uploading should happen with glTexSubImage2D.
Second, your internal format (third parameter from the left) is GL_RGBA. You should always use explicit sizes for your image formats. So use GL_RGBA8. This isn't really a problem, but you should get into the habit now.
Third, you're using GL_RGBA ordering for your pixel transfer format (the third parameter from the right, not the left). This is generally not the most optimal pixel transfer format, as lots of hardware tends to prefer GL_BGRA ordering. But if you're not getting your data from whatever is producing it in that order, then there's not much that can be done.
Fourth, if you have something else you can do between starting the upload and actually rendering with it, you can employ asynchronous pixel transfer operations. You write your data to a buffer object (which can be mapped, so that you don't have to copy into it). Then you use glTexSubImage2D to transfer this data to OpenGL. Because the source data and the destination image are part of OpenGL's memory, it doesn't have to copy the data out of client memory before glTexSubImage2D returns.
Granted, that's probably not going to help you much, since you're already effectively doing that copy in the D3D case.
In OpenGL it drops from 427 to 297 (a 30% drop!)
The more important statistic is that it's a 1 millisecond difference. You should look at your timings in absolute time, not in frames-per-second, nor in percentage drops of FPS.
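To make the arithmetic concrete, here is a small sketch that converts the frame rates quoted in the question into per-frame cost. FrameTimeMs and UploadCostMs are hypothetical helper names used only for this illustration.

```cpp
// Time spent on one frame, in milliseconds, at a given frame rate.
constexpr double FrameTimeMs(double fps) {
    return 1000.0 / fps;
}

// Extra milliseconds per frame implied by a drop from fpsBefore to fpsAfter.
constexpr double UploadCostMs(double fpsBefore, double fpsAfter) {
    return FrameTimeMs(fpsAfter) - FrameTimeMs(fpsBefore);
}
```

UploadCostMs(427, 297) comes out at roughly 1.03 ms for the OpenGL path, and UploadCostMs(416, 395) at roughly 0.13 ms for the Direct3D path. Those absolute numbers are what should be compared, since the same 1 ms would look like a much smaller percentage drop at 60 FPS.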
glTexImage2D does memory reallocation as well as the update. Try to use glTexSubImage2D instead.
I render to a floating point texture in an FBO and need the average value of all pixels of that texture on the CPU. So I thought using mipmapping to calculate the average into the 1x1 mipmap would be pretty convenient, because I save CPU computation time and only need to transfer 1 pixel to the CPU instead of, let's say, 1024x1024 pixels.
So I use this line:
glGetTexImage(GL_TEXTURE_2D, variableHighestMipMapLevel, GL_RGBA, GL_FLOAT, fPixel);
But despite the fact that I specifically request only the highest mipmap level, which is always 1x1 pixel in size, the time it takes for that line of code to complete depends on the size of the level 0 mipmap of the texture, which makes no sense to me. In my tests, for example, this line takes around 12 times longer for a 1024x1024 base texture than for a 32x32 base texture.
The result in fPixel is correct and only contains the wanted pixel, but the time clearly tells that the whole texture set is transferred, which kills the main reason for me, because the transfer to the CPU is clearly the bottleneck for me.
I use Win7 and opengl and tested this on an ATI Radeon HD 4800 and a GeForce 8800 GTS.
Does anybody know anything about that problem or has a smart way to only transfer the one pixel of the highest mipmap to the CPU?
glGenerateMipmap( GL_TEXTURE_2D );
float *fPixel = new float[4];
Timer.resume();
glGetTexImage(GL_TEXTURE_2D, highestMipMapLevel, GL_RGBA, GL_FLOAT, fPixel);
Timer.stop();
Let this be a lesson to you: always provide complete information.
The reason it takes 12x longer is because you're measuring the time it takes to generate the mipmaps, not the time it takes to transfer the mipmap to the CPU. glGenerateMipmap, like most rendering commands, will not actually have finished by the time it returns. Indeed, odds are good that it won't have even started. This is good, because it allows OpenGL to run independently of the CPU. You issue a rendering command, and it completes sometime later.
However, the moment you start reading from that texture, OpenGL has to stall the CPU and wait until everything that will touch that texture has finished. Therefore, your timing is measuring the time it takes to perform all operations on the texture as well as the time to transfer the data back.
If you want a more accurate measurement, issue a glFinish before you start your timer.
More importantly, if you want to perform an asynchronous read of pixel data, you will need to do the read into a buffer object. This allows OpenGL to avoid the CPU stall, but it is only helpful if you have other work you could be doing in the meantime.
For example, if you're doing this to figure out the overall lighting for a scene for HDR tone mapping, you should be doing this for the previous frame's scene data, not the current one. Nobody will notice. So you render a scene, generate mipmaps, read into a buffer object, then render the next frame's scene, generate mipmaps, read into a different buffer object, then start reading from the previous scene's buffer.
That way, by the time you start reading the previous read's results, they will actually be there and no CPU stall will happen.
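The buffer rotation described above can be sketched without any GL calls. This is only the index bookkeeping, assuming the application owns bufferCount pixel pack buffers (at least two); ReadbackSchedule and its members are hypothetical names. In a real frame, WriteSlot() would select the buffer bound to GL_PIXEL_PACK_BUFFER for the read, and ReadSlot() the buffer to map.

```cpp
#include <cstddef>

// Round-robin schedule for asynchronous readbacks: each frame writes
// into one buffer and reads the buffer that was filled longest ago.
class ReadbackSchedule {
public:
    explicit ReadbackSchedule(std::size_t bufferCount)
        : bufferCount_(bufferCount) {}

    // Buffer that the current frame's read command should target.
    std::size_t WriteSlot() const { return frame_ % bufferCount_; }

    // True once an older frame's result exists in another buffer.
    bool HasResult() const { return frame_ + 1 >= bufferCount_; }

    // Buffer written (bufferCount - 1) frames ago; only meaningful
    // when HasResult() is true.
    std::size_t ReadSlot() const { return (frame_ + 1) % bufferCount_; }

    // Advance to the next frame.
    void NextFrame() { ++frame_; }

private:
    std::size_t bufferCount_;
    std::size_t frame_ = 0;
};
```

With two buffers the result being read is always one frame old, which matches the "previous frame's scene data" suggestion; more buffers trade extra latency for more slack before a map can stall.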
My aim is to render an OpenGL scene without a window, directly into a file. The scene may be larger than my screen resolution.
How can I do this?
I want to be able to choose the render area size to any size, for example 10000x10000, if possible?
It all starts with glReadPixels, which you will use to transfer the pixels stored in a specific buffer on the GPU to the main memory (RAM). As you will notice in the documentation, there is no argument to choose which buffer. As is usual with OpenGL, the current buffer to read from is a state, which you can set with glReadBuffer.
So a very basic offscreen rendering method would be something like the following. I use c++ pseudo code so it will likely contain errors, but should make the general flow clear:
//Before swapping
std::vector<std::uint8_t> data(width*height*4);
glReadBuffer(GL_BACK);
glReadPixels(0,0,width,height,GL_BGRA,GL_UNSIGNED_BYTE,&data[0]);
This will read the current back buffer (usually the buffer you're drawing to). You should call this before swapping the buffers. Note that you can also perfectly read the back buffer with the above method, clear it and draw something totally different before swapping it. Technically you can also read the front buffer, but this is often discouraged as theoretically implementations were allowed to make some optimizations that might make your front buffer contain rubbish.
There are a few drawbacks with this. First of all, we don't really do offscreen rendering, do we? We render to the screen buffers and read from those. We can emulate offscreen rendering by never swapping in the back buffer, but it doesn't feel right. Next to that, the front and back buffers are optimized to display pixels, not to read them back. That's where Framebuffer Objects come into play.
Essentially, an FBO lets you create a non-default framebuffer (like the FRONT and BACK buffers) that allows you to draw to a memory buffer instead of the screen buffers. In practice, you can either draw to a texture or to a renderbuffer. The first is optimal when you want to re-use the pixels in OpenGL itself as a texture (e.g. a naive "security camera" in a game), the latter if you just want to render/read-back. With this, the code above would become something like this; again pseudo-code, so don't kill me if I mistyped or forgot some statements.
//Somewhere at initialization
GLuint fbo, render_buf;
glGenFramebuffers(1,&fbo);
glGenRenderbuffers(1,&render_buf);
glBindRenderbuffer(GL_RENDERBUFFER, render_buf); // the target argument is required
glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, width, height); // sized internal format; BGRA is only a transfer format
glBindFramebuffer(GL_DRAW_FRAMEBUFFER,fbo);
glFramebufferRenderbuffer(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, render_buf);
//At deinit:
glDeleteFramebuffers(1,&fbo);
glDeleteRenderbuffers(1,&render_buf);
//Before drawing
glBindFramebuffer(GL_DRAW_FRAMEBUFFER,fbo);
//after drawing
std::vector<std::uint8_t> data(width*height*4);
glBindFramebuffer(GL_READ_FRAMEBUFFER,fbo); // glReadPixels reads from the READ framebuffer binding
glReadBuffer(GL_COLOR_ATTACHMENT0);
glReadPixels(0,0,width,height,GL_BGRA,GL_UNSIGNED_BYTE,&data[0]);
// Return to onscreen rendering:
glBindFramebuffer(GL_FRAMEBUFFER,0); // resets both the draw and the read binding
This is a simple example; in reality you likely also want storage for the depth (and stencil) buffer. You also might want to render to texture, but I'll leave that as an exercise. In any case, you will now perform real offscreen rendering, and it might work faster than reading the back buffer.
Finally, you can use pixel buffer objects to make glReadPixels asynchronous. The problem is that glReadPixels blocks until the pixel data is completely transferred, which may stall your CPU. With PBOs the implementation may return immediately, as it controls the buffer anyway. It is only when you map the buffer that the pipeline will block. However, PBOs may be optimized to buffer the data solely in RAM, so this block could take a lot less time. The read pixels code would become something like this:
//Init:
GLuint pbo;
glGenBuffers(1,&pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width*height*4, NULL, GL_DYNAMIC_READ);
//Deinit:
glDeleteBuffers(1,&pbo);
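//Reading:
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0,0,width,height,GL_BGRA,GL_UNSIGNED_BYTE,0); // 0 instead of a pointer, it is now an offset in the buffer.
//DO SOME OTHER STUFF (otherwise this is a waste of your time)
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo); //Might not be necessary...
void* pixel_data = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
//... use pixel_data here, then release the mapping:
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0); // later glReadPixels calls take a client pointer again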
The part in caps is essential. If you just issue a glReadPixels to a PBO, followed by a glMapBuffer of that PBO, you gained nothing but a lot of code. Sure the glReadPixels might return immediately, but now the glMapBuffer will stall because it has to safely map the data from the read buffer to the PBO and to a block of memory in main RAM.
Please also note that I use GL_BGRA everywhere; this is because many graphics cards internally use this as the optimal rendering format (or the GL_BGR version without alpha). It should be the fastest format for pixel transfers like this. I'll try to find the NVIDIA article I read about this a few months back.
When using OpenGL ES 2.0, GL_DRAW_FRAMEBUFFER might not be available, you should just use GL_FRAMEBUFFER in that case.
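Since the goal is a file, the bytes from glReadPixels still need to be reordered: OpenGL returns rows bottom-up, and the code above requests GL_BGRA. A minimal sketch of that CPU-side step follows, assuming tightly packed 4-byte pixels. BgraBottomUpToRgbTopDown and WritePpm are hypothetical helpers, and binary PPM is just a convenient format picked for this example, not something the answer above prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Convert a bottom-up BGRA image (the layout glReadPixels produces with
// GL_BGRA / GL_UNSIGNED_BYTE) into a top-down RGB image.
std::vector<std::uint8_t> BgraBottomUpToRgbTopDown(
        const std::vector<std::uint8_t>& bgra, int width, int height) {
    const std::size_t w = static_cast<std::size_t>(width);
    const std::size_t h = static_cast<std::size_t>(height);
    std::vector<std::uint8_t> rgb(w * h * 3);
    for (std::size_t y = 0; y < h; ++y) {
        // Output row y comes from input row (h - 1 - y): vertical flip.
        const std::uint8_t* src = bgra.data() + (h - 1 - y) * w * 4;
        std::uint8_t* dst = rgb.data() + y * w * 3;
        for (std::size_t x = 0; x < w; ++x) {
            dst[x * 3 + 0] = src[x * 4 + 2]; // R
            dst[x * 3 + 1] = src[x * 4 + 1]; // G
            dst[x * 3 + 2] = src[x * 4 + 0]; // B (alpha is dropped)
        }
    }
    return rgb;
}

// Write a top-down RGB image as a binary PPM (P6) file.
bool WritePpm(const std::string& path,
              const std::vector<std::uint8_t>& rgb, int width, int height) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out << "P6\n" << width << ' ' << height << "\n255\n";
    out.write(reinterpret_cast<const char*>(rgb.data()),
              static_cast<std::streamsize>(rgb.size()));
    return static_cast<bool>(out);
}
```

The data vector filled by glReadPixels would be passed to BgraBottomUpToRgbTopDown, and the result to WritePpm together with the same width and height.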
I'll assume that creating a dummy window (you don't render to it; it's just there because the API requires you to make one) that you create your main context into is an acceptable implementation strategy.
Here are your options:
Pixel buffers
A pixel buffer, or pbuffer (which isn't a pixel buffer object), is first and foremost an OpenGL context. Basically, you create a window as normal, then pick a pixel format from wglChoosePixelFormatARB (pbuffer formats must be gotten from here). Then, you call wglCreatePbufferARB, giving it your window's HDC and the pixel buffer format you want to use. Oh, and a width/height; you can query the implementation's maximum width/heights.
The default framebuffer for pbuffer is not visible on the screen, and the max width/height is whatever the hardware wants to let you use. So you can render to it and use glReadPixels to read back from it.
You'll need to share your context with the given context if you have created objects in the window context. Otherwise, you can use the pbuffer context entirely separately. Just don't destroy the window context.
The advantage here is greater implementation support (though most drivers that don't support the alternatives are either old drivers for hardware that's no longer being supported, or drivers for Intel hardware).
The downsides are these. Pbuffers don't work with core OpenGL contexts. They may work for compatibility, but there is no way to give wglCreatePbufferARB information about OpenGL versions and profiles.
Framebuffer Objects
Framebuffer Objects are more "proper" offscreen rendertargets than pbuffers. FBOs are within a context, while pbuffers are about creating new contexts.
FBOs are just a container for images that you render to. The maximum dimensions that the implementation allows can be queried; you can assume it to be GL_MAX_VIEWPORT_DIMS (make sure an FBO is bound before checking this, as it changes based on whether an FBO is bound).
Since you're not sampling textures from these (you're just reading values back), you should use renderbuffers instead of textures. Their maximum size may be larger than those of textures.
The upside is the ease of use. Rather than have to deal with pixel formats and such, you just pick an appropriate image format for your glRenderbufferStorage call.
The only real downside is the narrower band of hardware that supports them. In general, anything that AMD or NVIDIA makes that they still support (right now, GeForce 6xxx or better [note the number of x's], and any Radeon HD card) will have access to ARB_framebuffer_object or OpenGL 3.0+ (where it's a core feature). Older drivers may only have EXT_framebuffer_object support (which has a few differences). Intel hardware is potluck; even if they claim 3.x or 4.x support, it may still fail due to driver bugs.
If you need to render something that exceeds the maximum FBO size of your GL implementation, libtr works pretty well:
The TR (Tile Rendering) library is an OpenGL utility library for doing
tiled rendering. Tiled rendering is a technique for generating large
images in pieces (tiles).
TR is memory efficient; arbitrarily large image files may be generated
without allocating a full-sized image buffer in main memory.
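The tiling idea itself is simple to sketch. The following is not the TR library's API, only an illustration of how a large image could be cut into tiles that fit a size limit. Tile and MakeTiles are hypothetical names, and maxTile is assumed to be positive (for example a value derived from GL_MAX_VIEWPORT_DIMS or GL_MAX_RENDERBUFFER_SIZE).

```cpp
#include <algorithm>
#include <vector>

// One rectangular piece of the full image, in image pixel coordinates.
struct Tile {
    int x;
    int y;
    int width;
    int height;
};

// Split an imageWidth x imageHeight image into tiles of at most
// maxTile x maxTile pixels, row by row. Edge tiles are smaller when
// the image size is not a multiple of maxTile.
std::vector<Tile> MakeTiles(int imageWidth, int imageHeight, int maxTile) {
    std::vector<Tile> tiles;
    for (int y = 0; y < imageHeight; y += maxTile) {
        for (int x = 0; x < imageWidth; x += maxTile) {
            tiles.push_back(Tile{x, y,
                                 std::min(maxTile, imageWidth - x),
                                 std::min(maxTile, imageHeight - y)});
        }
    }
    return tiles;
}
```

For a 10000x10000 image and a limit of 4096 this gives a 3x3 grid, with 1808-pixel tiles along the last row and column. Each tile would then be rendered with a projection restricted to its rectangle and read back separately, so no full-sized buffer is ever needed.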
The easiest way is to use something called Frame Buffer Objects (FBO). You will still have to create a window to create an OpenGL context though (but this window can be hidden).
The easiest way to fulfill your goal is to use an FBO for off-screen rendering. You don't need to render to a texture and then fetch the texture image; just render to a buffer and use glReadPixels. This link will be useful. See Framebuffer Object Examples