OpenGL glReadPixels Performance - opengl

I am trying to implement auto exposure for HDR tone mapping, and I want to reduce the cost of finding the average brightness of my scene. I seem to have hit a choke point with glReadPixels. Here is my setup:
1: I create a downsampled FBO to reduce the cost of the read with glReadPixels, using only the GL_RED values in GL_BYTE format.
private void CreateDownSampleExposure() {
    DownFrameBuffer = glGenFramebuffers();
    DownTexture = GL11.glGenTextures();
    glBindFramebuffer(GL_FRAMEBUFFER, DownFrameBuffer);
    GL11.glBindTexture(GL11.GL_TEXTURE_2D, DownTexture);
    GL11.glTexImage2D(GL11.GL_TEXTURE_2D, 0, GL11.GL_RED, 1600/8, 1200/8,
            0, GL11.GL_RED, GL11.GL_BYTE, (ByteBuffer) null);
    // Attach the texture to the FBO (the call name and attachment point are
    // reconstructed; that line was lost from the original listing)
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
            GL11.GL_TEXTURE_2D, DownTexture, 0);
    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
        // framebuffer is incomplete
    } else {
        // framebuffer is complete
    }
    GL11.glBindTexture(GL11.GL_TEXTURE_2D, 0);
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
2: I set up the ByteBuffer and read back the FBO texture created above.
byte[] testByte = new byte[1600/8*1000/8];
ByteBuffer testByteBuffer = BufferUtils.createByteBuffer(testByte.length);
// Render scene and store result into downSampledFBO texture
GL11.glBindTexture(GL11.GL_TEXTURE_2D, DeferredFBO.getDownTexture());
// GL11.glGetTexImage(GL11.GL_TEXTURE_2D, 0, GL11.GL_RED, GL11.GL_BYTE,
//         testByteBuffer); <- This is slower than readPixels.
GL11.glReadPixels(0, 0, DisplayManager.Width/8, DisplayManager.Height/8,
        GL11.GL_RED, GL11.GL_BYTE, testByteBuffer);
int x = 0;
for (int i = 0; i < testByteBuffer.capacity(); i++) {
    x += testByteBuffer.get(i);
}
System.out.println(x); // Print out accumulated value of brightness.
// Adjust exposure depending on brightness.
The problem is that I can downsample my FBO texture by a factor of 100, so that glReadPixels reads only 16x10 pixels, and there is little to no performance gain. There is a substantial performance gain compared with no downsampling, but once I get past dividing the width and height by about 8, the gain falls off. It seems there is a huge fixed overhead in just calling this function. Is there something I am doing incorrectly or not considering when calling glReadPixels?

glReadPixels is slow because the CPU must wait until the GPU has finished all of its rendering before it can give you the results. This is the dreaded sync point.
One way to make glReadPixels fast is to use some sort of double/triple buffering scheme, so that you only call glReadPixels on render-to-textures that you expect the GPU has already finished with. This is only viable if waiting a couple of frames before receiving the result of glReadPixels is acceptable in your application. For example, in a video game the latency could be justified as a simulation of the pupil's response time to a change in lighting conditions.
However, for your particular tone-mapping example, presumably you want to calculate the average brightness only to feed that information back into the GPU for another rendering pass. Instead of glReadPixels, calculate the average by copying your image to successively half-sized render targets with linear filtering (a box filter), until you're down to a 1x1 target.
That 1x1 target is now a texture containing your average brightness, and you can use that texture in your tone-mapping rendering pass. No sync points.
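As a rough CPU-side illustration of why the repeated half-size copy ends up at the mean (a sketch only, not GPU code): each pass below averages 2x2 blocks, which is what a linearly filtered copy into a half-sized target amounts to when each output texel samples the centre of a 2x2 block. The names `Image`, `halve`, and `averageBrightness` are hypothetical, and the sketch assumes a square power-of-two image; for other sizes the result is only approximately the mean.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-channel image; stands in for a render target.
struct Image {
    std::size_t size;           // width == height, assumed to be a power of two
    std::vector<float> pixels;  // row-major, size * size values
};

// One downsample pass: each output pixel is the mean of a 2x2 input block.
Image halve(const Image& src) {
    assert(src.size >= 2 && src.size % 2 == 0);
    const std::size_t half = src.size / 2;
    Image dst{half, std::vector<float>(half * half)};
    for (std::size_t y = 0; y < half; ++y) {
        for (std::size_t x = 0; x < half; ++x) {
            const std::size_t sx = 2 * x;
            const std::size_t sy = 2 * y;
            const float sum = src.pixels[sy * src.size + sx]
                            + src.pixels[sy * src.size + sx + 1]
                            + src.pixels[(sy + 1) * src.size + sx]
                            + src.pixels[(sy + 1) * src.size + sx + 1];
            dst.pixels[y * half + x] = sum * 0.25f;
        }
    }
    return dst;
}

// Repeats the pass until a single pixel remains; that pixel is the average.
float averageBrightness(Image img) {
    assert(img.size >= 1 && img.pixels.size() == img.size * img.size);
    while (img.size > 1) {
        img = halve(img);
    }
    return img.pixels[0];
}
```

On the GPU the equivalent of `halve` would be one draw (or blit) per level into a target half the size of the previous one, so the result never has to leave GPU memory.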


Non uniform pixel painting in low Frame rate

I am making an image editing program, and while writing the brush tool I have run into a problem. It shows up when the frame rate is very low: the program reads the mouse position once per frame and paints only the pixel below it, so the painted pixels end up unevenly spaced. What solution could I use to fix this? I am using IMGUI and OpenGL.
Also, I'm using this code to update the image on the screen.
UpdateImage() {
    if (this->CurrentImage->channels == 4) {
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, this->CurrentImage->width, this->CurrentImage->height, 0, GL_RGBA, GL_UNSIGNED_BYTE, this->CurrentImage->data);
    }
    else {
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, this->CurrentImage->width, this->CurrentImage->height, 0, GL_RGB, GL_UNSIGNED_BYTE, this->CurrentImage->data);
    }
}
And the code for the pencil/brush
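toolPencil(int _MouseImagePositionX, int _MouseImagePositionY) {
    int index = ((_MouseImagePositionY - 1) * this->CurrentImage.width * this->CurrentImage.channels) + ((_MouseImagePositionX - 1) * this->CurrentImage.channels);
    // Paint the pixel black (the array name was lost from the original listing;
    // CurrentImage.data is assumed from the upload code above)
    this->CurrentImage.data[index] = 0;
    this->CurrentImage.data[index + 1] = 0;
    this->CurrentImage.data[index + 2] = 0;
    if (CurrentImage.channels == 4) {
        this->CurrentImage.data[index + 3] = 255;
    }
}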
Sample your mouse in its event without redrawing there ...
Redraw on mouse change when you can (this depends on the fps or architecture of your app).
Instead of using the mouse points directly, use them as piecewise cubic curve control points. See:
How can i produce multi point linear interpolation?
Catmull-Rom interpolation on SVG Paths
So you simply use the sampled points as cubic curve control points and interpolate/rasterize the missing pixels. Either sample each segment with 10 lines (incrementing the parameter by 1.0/10.0), or sample it with a step small enough that each step is shorter than a pixel (based on the distance between control points).
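A minimal sketch of that idea, assuming a uniform Catmull-Rom spline (one of the cubic options mentioned in the links above). `Point`, `catmullRom`, and `strokePoints` are hypothetical names, not part of IMGUI or the asker's code. The step count per segment is derived from the distance between the two control points, which keeps consecutive samples at roughly a pixel apart; because the curve can be longer than the straight-line distance, a step may come out slightly larger than one pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { float x, y; };

// Uniform Catmull-Rom: interpolates from p1 (t = 0) to p2 (t = 1);
// p0 and p3 are the neighbouring samples that shape the tangents.
Point catmullRom(Point p0, Point p1, Point p2, Point p3, float t) {
    assert(t >= 0.0f && t <= 1.0f);
    const float t2 = t * t;
    const float t3 = t2 * t;
    auto blend = [&](float a, float b, float c, float d) {
        return 0.5f * ((2.0f * b) + (-a + c) * t
                       + (2.0f * a - 5.0f * b + 4.0f * c - d) * t2
                       + (-a + 3.0f * b - 3.0f * c + d) * t3);
    };
    return {blend(p0.x, p1.x, p2.x, p3.x), blend(p0.y, p1.y, p2.y, p3.y)};
}

// Turns sparse mouse samples into a dense list of positions to paint.
// The first and last samples are reused as their own missing neighbours.
std::vector<Point> strokePoints(const std::vector<Point>& samples) {
    std::vector<Point> out;
    if (samples.empty()) return out;
    out.push_back(samples.front());
    const std::size_t n = samples.size();
    for (std::size_t i = 0; i + 1 < n; ++i) {
        const Point p0 = samples[i == 0 ? 0 : i - 1];
        const Point p1 = samples[i];
        const Point p2 = samples[i + 1];
        const Point p3 = samples[std::min(i + 2, n - 1)];
        // One step per pixel of straight-line distance between control points.
        const float dist = std::hypot(p2.x - p1.x, p2.y - p1.y);
        const int steps = std::max(1, static_cast<int>(std::ceil(dist)));
        for (int s = 1; s <= steps; ++s) {
            out.push_back(catmullRom(p0, p1, p2, p3, static_cast<float>(s) / steps));
        }
    }
    return out;
}
```

Each returned point would then be rounded to integer image coordinates and passed to the existing pencil routine, so a fast mouse movement at a low frame rate still produces a connected stroke.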

OpenGL, glMapNamedBuffer takes a long time

I've been writing an OpenGL program that generates vertices on the GPU using compute shaders. The problem is that I need to read back, on the CPU, the number of vertices from a buffer written by one compute shader dispatch, so that I can allocate a buffer of the right size for the next compute shader dispatch to fill with vertices.
// Stage 1 - Populate the 3D texture with voxel values
glBindTexture(GL_TEXTURE_3D, _RandomSeedTexture);
glBindImageTexture(2, _VoxelValuesTexture, 0, GL_TRUE, NULL, GL_READ_WRITE, GL_R32F);
_EvaluateVoxels.SetVec3("CellSize", voxelCubeDims);
_EvaluateVoxels.SetVec3("StartPos", chunkPosLL);
glDispatchCompute(voxelDim.x + 1, voxelDim.y + 1, voxelDim.z + 1);
// Stage 2 - Calculate the marching cubes case for each cube of 8 voxels,
// listing those that contain polygons and counting the number of vertices that will be produced
_GetNonEmptyVoxels.SetFloat("IsoLevel", isoValue);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, _IntermediateDataSSBO);
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, _AtomicCountersBuffer);
glDispatchCompute(voxelDim.x, voxelDim.y, voxelDim.z);
//printStage2(_IntermediateDataSSBO, true);
// this line takes a long time
unsigned int* vals = (unsigned int*)glMapNamedBuffer(_AtomicCountersBuffer, GL_READ_WRITE);
unsigned int vertex_counter = vals[1];
unsigned int index_counter = vals[0];
vals[0] = 0;
vals[1] = 0;
The image below shows the time in milliseconds that each stage of the code takes to run. "timer Evaluate" refers to the method as a whole, i.e. the sum total of the previous stages. "getvertexcounter" refers only to the mapping, reading and unmapping of a buffer containing the number of vertices. Please see the code for more detail.
I've found this to be by far the slowest stage in the process, and I gather it has something to do with the asynchronous nature of the communication between OpenGL and the GPU, and the need to synchronise data that was written by the compute shader so it can be read by the CPU. My question is this: is this delay avoidable? I don't think that the overall approach is flawed, because I know that someone else has implemented the algorithm in a similar way, albeit using DirectX (I think).
You can find my code at , the code in question is in the file ComputeShaderMarcher.cpp and the method unsigned int ComputeShaderMarcher::GenerateMesh(const glm::vec3& chunkPosLL, const glm::vec3& chunkDim, const glm::ivec3& voxelDim, float isoValue, GLuint VBO)
In order to access data from a buffer that you have had OpenGL write some data to, the CPU must halt execution until the GPU has actually written that data. Whatever process you use to access this data (glMapBufferRange, glGetBufferSubData, etc), that process must halt until the GPU has finished generating the data.
So don't try to access GPU-generated data until you're sure the GPU has actually generated it (or you have absolutely nothing better to do on the CPU than wait). Use fence sync objects to test whether the GPU has finished executing past a certain point.

SDL_GL_SwapWindow bad performance

I did some performance testing and came up with this:
for(U32 i=0;i<objectList.length();++i)
{
    PC d("draw");
    VoxelObject& obj = *objectList[i];
    tmpM = usedView->projection * usedView->transform * obj.transform;
    glUniformMatrix4fv(shader.modelViewMatrixLoc, 1, GL_FALSE, /* matrix pointer argument lost in the original listing */);
    glBindTexture(GL_TEXTURE_2D, typesheet.tbo);
    glUniform1i(shader.typesheetLoc, 0);
    glDrawArrays(GL_TRIANGLES, 0, VoxelObject::VERTICES_PER_BOX*obj.getNumBoxes());
    d.out(); // 2 calls 0.000085s and 0.000043s each
}
PC swap("swap");
SDL_GL_SwapWindow(mainWindow); // 1 call 0.007823s
The call to SDL_GL_SwapWindow(mainWindow); is taking 200 times longer than the draw calls! My understanding was that all this function is supposed to do is swap buffers. That would mean the time it takes to swap should scale with the screen size, right? No, it scales with the amount of geometry... I did some searching online; I have double buffering enabled and vsync is turned off. I am stumped.
Your OpenGL driver is most likely deferring the rendering work.
That means the calls to glDrawArrays and friends don't draw anything immediately. Instead, they buffer all the information required to perform the operation later on.
The actual rendering happens, or is at least waited for, inside SDL_GL_SwapWindow, which is why its cost scales with the amount of geometry.
This behavior is typical these days because you want to avoid having to synchronize between the CPU and the GPU as much as possible.

DirectX using multiple Render Targets as input to each other

I have a fairly simple DirectX 11 framework setup that I want to use for various 2D simulations. I am currently trying to implement the 2D Wave Equation on the GPU. It requires I keep the grid state of the simulation at 2 previous timesteps in order to compute the new one.
How I went about it was this - I have a class called FrameBuffer, which has the following public methods:
bool Initialize(D3DGraphicsObject* graphicsObject, int width, int height);
void BeginRender(float clearRed, float clearGreen, float clearBlue, float clearAlpha) const;
void EndRender() const;
// Return a pointer to the underlying texture resource
const ID3D11ShaderResourceView* GetTextureResource() const;
In my main draw loop I have an array of 3 of these buffers. Every loop I use the textures from the previous 2 buffers as inputs to the next frame buffer and I also draw any user input to change the simulation state. I then draw the result.
int nextStep = simStep+1;
if (nextStep > 2)
nextStep = 0;
ID3D11ShaderResourceView* texArray[2] = { mFrameArray[simStep]->GetTextureResource(),
mFrameArray[prevStep]->GetTextureResource() };
result = mWaveShader->Render(d3dGraphicsObj, mQuad->GetRenderer()->GetIndexCount(), texArray);
if (!result)
return false;
// perform any extra input
I_InputSystem *inputSystem = ServiceProvider::Instance().GetInputSystem();
if (inputSystem->IsMouseLeftDown()) {
    int x,y;
    int width,height;
    float xPos = MapValue((float)x,0.0f,(float)width,-1.0f,1.0f);
    float yPos = MapValue((float)y,0.0f,(float)height,-1.0f,1.0f);
    mColorQuad->mTransform.position = Vector3f(xPos,-yPos,0);
    result = mColorQuad->Render(&viewMatrix,&orthoMatrix);
    if (!result)
        return false;
}
prevStep = simStep;
simStep = nextStep;
ID3D11ShaderResourceView* currTexture = mFrameArray[nextStep]->GetTextureResource();
// Render texture to screen
result = mQuad->Render(&viewMatrix,&orthoMatrix);
if (!result)
return false;
The problem is that nothing is happening. Whatever I draw appears on the screen (I draw using a small quad), but no part of the simulation is actually run. I can provide the shader code if required, but I am certain it works, since I've implemented this before on the CPU using the same algorithm. I'm just not certain how D3D render targets work and whether I'm simply drawing incorrectly every frame.
Here is the code for the begin and end render functions of the frame buffers:
void D3DFrameBuffer::BeginRender(float clearRed, float clearGreen, float clearBlue, float clearAlpha) const {
ID3D11DeviceContext *context = pD3dGraphicsObject->GetDeviceContext();
context->OMSetRenderTargets(1, &(mRenderTargetView._Myptr), pD3dGraphicsObject->GetDepthStencilView());
float color[4];
// Setup the color to clear the buffer to.
color[0] = clearRed;
color[1] = clearGreen;
color[2] = clearBlue;
color[3] = clearAlpha;
// Clear the back buffer.
context->ClearRenderTargetView(mRenderTargetView.get(), color);
// Clear the depth buffer.
context->ClearDepthStencilView(pD3dGraphicsObject->GetDepthStencilView(), D3D11_CLEAR_DEPTH, 1.0f, 0);
void D3DFrameBuffer::EndRender() const {
Edit 2: OK, after I set up the DirectX debug layer I saw that I was using an SRV as a render target while it was still bound to the pixel stage in one of the shaders. I fixed that by setting the shader resources to NULL after I render with the wave shader, but the problem still persists: nothing actually gets run or updated. I took the render target code from here and slightly modified it, if it's any help:
Okay, if I understand correctly, you need multipass render-to-texture.
Basically, you do it as I've described here: link
Create the textures with both the D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET bind flags.
Create shader resource views and render target views from those textures.
Set the first texture as input (*SetShaderResources()) and the second texture as output (OMSetRenderTargets()).
Call Draw*().
Then bind the second texture as input and the third as output.
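The bookkeeping behind these steps can be sketched without any D3D calls. The helper below is hypothetical (`PingPong3` and its members are not part of the question's framework); it only tracks which of the three buffers are inputs and which one is the output. A simulation that needs two previous time steps must never render into a buffer it is also reading from.

```cpp
#include <cassert>

// Tracks the role of each of three render-target textures (indices 0..2)
// for a simulation that reads two previous states and writes a new one.
struct PingPong3 {
    int prev = 0;  // state at t-1, bound as a shader input
    int curr = 1;  // state at t,   bound as a shader input
    int next = 2;  // state at t+1, bound as the render target

    // Call after drawing into `next`: the output becomes the newest input,
    // and the oldest input is recycled as the following output.
    void advance() {
        assert(prev != curr && curr != next && prev != next);
        const int recycled = prev;
        prev = curr;
        curr = next;
        next = recycled;
    }
};
```

Per simulation step the order would be: unbind the shader resources from the previous pass, set buffer `next` with OMSetRenderTargets, bind buffers `curr` and `prev` with PSSetShaderResources, draw, and then call `advance()`.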
Additional advice:
If your target GPU is capable of writing to UAVs from non-compute shaders, you can use that. It is much simpler and less error-prone.
If your target GPU is suitable, consider using a compute shader. It is a pleasure.
Don't forget to enable the DirectX debug layer. Sometimes we make obvious errors and the debug output can point to them.
Use a graphics debugger to review your textures after each draw call.
Edit 1:
As I see it, you call BeginRender and OMSetRenderTargets only once, so all rendering goes into mRenderTargetView. But what you need is to interleave:
Also, we don't know what mRenderTargetView is yet.
So, somewhere before
result = mColorQuad->Render(&viewMatrix,&orthoMatrix);
there must be an OMSetRenderTargets call.
It is probably better to review your Begin()/End() design to make resource binding more clearly visible.
Happy coding! =)

Removal of OpenGL rubber banding artefacts

I'm working with some OpenGL code for scientific visualization and I'm having issues getting its rubber banding working on newer hardware. The code is drawing a "Zoom Window" over an existing scene with one corner of the "Zoom Window" at the stored left-click location, and the other under the mouse as it is moved. On the second left-click the scene zooms into the selected window.
The symptoms I am seeing as the mouse is moved across the scene are:
Rubber banding artefacts appearing which are the lines used to create the "Zoom Window" not being removed from the buffer by the second "RenderLogic" pass (see code below)
I can clearly see the contents of the previous buffer flashing up and disappearing as the buffers are swapped
The above problem doesn't happen on low-end hardware such as the integrated graphics on a netbook I have. Also, I don't recall this problem occurring ~5 years ago when this code was written.
Here are the relevant code sections, trimmed down to focus on the relevant OpenGL:
// Called by every mouse move event
// Makes use of current device context (m_hDC) and rendering context (m_hRC)
void MyViewClass::DrawLogic()
// Make the rendering context current
if (!wglMakeCurrent(m_hDC, m_hRC))
// ... error handling
// Perform the logic rendering
// Draws the rectangle on the buffer using XOR op
bSwapRv = ::SwapBuffers(m_hDC);
// Removes the rectangle from the buffer via a second pass
// Release the rendering context
if (!wglMakeCurrent(NULL, NULL))
// ... error handling
void MyViewClass::RenderLogic(void)
glLineStipple(1, 0x0F0F);
// Uses custom "Point" class with Coords() method returning double*
// Draw rectangle with corners at clicked location and current location
glVertex2d(m_pntCurrLoc.X(), m_pntClickLoc.Y());
glVertex2d(m_pntClickLoc.X(), m_pntCurrLoc.Y());
// Setup code that might be relevant to the buffer configuration
bool MyViewClass::SetupPixelFormat()
1, // Version number (?)
PFD_DRAW_TO_WINDOW // Format must support window
| PFD_SUPPORT_OPENGL // Format must support OpenGL
| PFD_DOUBLEBUFFER, // Must support double buffering
PFD_TYPE_RGBA, // Request an RGBA format
32, // Select a 32 bit colour depth
0, 0, 0, 0, 0, 0, // Colour bits ignored (?)
8, // Alpha buffer bits
0, // Shift bit ignored (?)
0, // No accumulation buffer
0, 0, 0, 0, // Accumulation bits ignored
16, // 16 bit Z-buffer
0, // No stencil buffer
0, // No accumulation buffer (?)
PFD_MAIN_PLANE, // Main drawing layer
0, // Number of overlay and underlay planes
0, 0, 0 // Layer masks ignored (?)
memset(&chosen_pfd, 0, sizeof(PIXELFORMATDESCRIPTOR));
chosen_pfd.nSize = sizeof(PIXELFORMATDESCRIPTOR);
// Find the closest match to the pixel format
m_uPixelFormat = ::ChoosePixelFormat(m_hDC, &pfd);
// Make sure a pixel format could be found
if (!m_uPixelFormat)
return false;
::DescribePixelFormat(m_hDC, m_uPixelFormat, sizeof(PIXELFORMATDESCRIPTOR), &chosen_pfd);
// Set the pixel format for the view
::SetPixelFormat(m_hDC, m_uPixelFormat, &chosen_pfd);
return true;
Any pointers on how to remove the artefacts will be much appreciated.
#Krom - image below
With OpenGL it's canonical to redraw the whole viewport whenever anything changes. Consider this: modern systems draw animated, complex scenes at well over 30 FPS.
But I understand that you may want to avoid this. The usual approach is to first copy the front buffer into a texture before drawing the first rubber band. Then, for each rubber band redraw, "reset" the image by drawing a framebuffer-filling quad with that texture.
I know I'm posting to a year and half old question but in case anyone else comes across this.
I've had this happen myself; it is because you are trying to remove the lines from the wrong buffer. For example, you draw your rectangle on buffer A, call SwapBuffers, and then try to remove the rectangle from buffer B. What you want to do is keep track of two "zoom window" rectangles while you're drawing, one for buffer A and one for buffer B, and keep track of which one is the most recent.
If you're using Vista/7 and Aero, try switching to the Classic theme.