I'm basing my tests on this popular PBO example (see pboUnpack.zip from http://www.songho.ca/opengl/gl_pbo.html). Tests are done on PBO Mode 1 per the example.
Running the original sample, I found that on my NVIDIA 560GTX PCIe x16 (driver v334.89 Win7 PRO x64 Core i5 Ivy Bridge 3.6GHz), glMapBufferARB() blocks for 15ms even when the glBufferDataARB() preceding was meant to prevent it from blocking (i.e. discard PBO).
I then changed the image size from the original 1024*1024 to 400*400, thinking surely it would reduce the blocking time. To my surprise, it remained at 15ms! CPU utilization remained high.
Experimenting further, I increased the image size to 4000*4000 and yet again I was surprised - glBufferDataARB reduced from 15ms to 0.1ms and CPU utilization reduced tremendously at the same time.
I am at a loss to explain what is going on here, and I am hoping someone familiar with this kind of issue can shed some light.
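For scale, the three image sizes tried above differ by two orders of magnitude in bytes per frame. Here is a minimal sketch of that arithmetic, assuming 4 bytes per pixel (8-bit RGBA or BGRA, which is an assumption on my part). `bytesPerFrame` is a hypothetical helper, not something from the sample.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed for one tightly packed frame.
// channels = 4 is an assumption (8-bit RGBA/BGRA).
constexpr std::size_t bytesPerFrame(std::size_t width, std::size_t height,
                                    std::size_t channels = 4) {
    return width * height * channels;
}

static_assert(bytesPerFrame(400, 400) == 640000, "about 625 KiB");
static_assert(bytesPerFrame(1024, 1024) == 4194304, "exactly 4 MiB");
static_assert(bytesPerFrame(4000, 4000) == 64000000, "about 61 MiB");
```

Under that assumption the 4000*4000 case moves 100 times more data per frame than the 400*400 case, which is what makes the reported timings so counterintuitive.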
Code of interest:
// bind PBO to update pixel values
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[nextIndex]);
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
GLubyte* ptr = (GLubyte*)glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if(ptr)
{
    // update data directly on the mapped buffer
    updatePixels(ptr, DATA_SIZE);
    glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB); // release pointer to mapping buffer
}
This article is commonly referenced when anyone asks about video streaming textures in OpenGL.
It says:
To maximize the streaming transfer performance, you may use multiple pixel buffer objects. The diagram shows that 2 PBOs are used simultaneously; glTexSubImage2D() copies the pixel data from a PBO while the texture source is being written to the other PBO.
For nth frame, PBO 1 is used for glTexSubImage2D() and PBO 2 is used to get new texture source. For n+1th frame, 2 pixel buffers are switching the roles and continue to update the texture. Because of asynchronous DMA transfer, the update and copy processes can be performed simultaneously. CPU updates the texture source to a PBO while GPU copies texture from the other PBO.
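The role switching described in that passage comes down to two indices that trade places every frame. This is a minimal sketch of the bookkeeping only, with no GL calls. `PboRoles` and its member names are hypothetical and are not taken from the article's code.

```cpp
#include <cassert>

// Hypothetical bookkeeping for the two-PBO scheme described above.
struct PboRoles {
    int uploadIndex = 0; // PBO that glTexSubImage2D() would read from this frame
    int fillIndex = 1;   // PBO the CPU would map and write this frame

    // Swap the roles at the start of the next frame.
    void advance() {
        uploadIndex = (uploadIndex + 1) % 2;
        fillIndex = (uploadIndex + 1) % 2;
    }
};
```

Each frame you would bind `pboIds[uploadIndex]` for the texture update, bind `pboIds[fillIndex]` for mapping, and then call `advance()`.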
They provide a simple benchmark program which allows you to cycle between texture updates without PBOs, with a single PBO, and with two PBOs used as described above.
I see a slight performance improvement when enabling one PBO.
But the second PBO makes no real difference.
Right before the code maps the PBO with glMapBuffer, it calls glBufferData with the pointer set to NULL. It does this to avoid a sync stall.
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
So, here is my question...
Doesn't this make the second PBO completely useless? Just a waste of memory?
With two PBOs the texture data is stored three times: once in the texture, and once in each PBO.
With a single PBO, there are two copies of the data. A third exists only temporarily, in the event that glMapBuffer creates a new buffer because the existing one is presently being DMA'ed to the texture.
The comments seem to suggest that OpenGL drivers are internally capable of creating the second buffer if, and only when, it is required to avoid stalling the pipeline. The in-use buffer is being DMA'ed, and my call to map yields a new buffer for me to write to.
The author of that article appears to be more knowledgeable in this area than I am. Have I completely misunderstood the point?
Answering my own question... but I won't accept it as an answer (yet).
There are many problems with the benchmark program linked to in the question. It uses immediate mode. It uses GLUT!
The program was spending most of its time doing things we are not interested in profiling, mainly rendering text via GLUT and writing pretty stripes to the texture. So I have removed those functions.
I cranked the texture resolution up to 8K, and added more PBO modes.
No PBO (yields 6 fps).
1 PBO. Orphan previous buffer (yields 12.2 fps).
2 PBOs. Orphan previous buffer (yields 12.2 fps).
1 PBO. DON'T orphan previous PBO (possible stall; added by myself; yields 12.4 fps).
2 PBOs. DON'T orphan previous PBO (possible stall; added by myself; yields 12.4 fps).
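To put those frame rates in perspective, they can be converted into an effective upload rate. The sketch below rests on two assumptions the post does not state: that "8K" means 8192 x 8192 texels, and that each texel is 4 bytes. `uploadMiBPerSecond` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Assumptions (not stated in the post): "8K" means 8192 x 8192 texels,
// and each texel occupies 4 bytes.
constexpr std::uint64_t kTextureBytes = 8192ull * 8192ull * 4ull; // 256 MiB

// Effective upload rate in MiB per second for a measured frame rate.
constexpr double uploadMiBPerSecond(double framesPerSecond) {
    return framesPerSecond * static_cast<double>(kTextureBytes) / (1024.0 * 1024.0);
}
```

Under those assumptions, 6 fps corresponds to 1536 MiB/s and 12.2 to 12.4 fps to roughly 3.05 to 3.1 GiB/s. One possible reading is that all four PBO variants hit the same transfer limit, but this arithmetic alone cannot confirm that.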
If anyone else would like to examine my code, it is available here.
I have experimented with different texture sizes and different updatePixels functions. Despite my best efforts, I cannot get the double-PBO implementation to perform any better than the single-PBO implementation.
Furthermore, NOT orphaning the previous buffer actually yields better performance, exactly the opposite of what the article claims.
Perhaps modern drivers and hardware do not suffer from the problem that this design is attempting to fix...
Perhaps my graphics hardware or driver is buggy, and is not taking advantage of the double PBO...
Perhaps the commonly referenced article is completely wrong?
Who knows. . . .
My test hardware is Intel(R) HD Graphics 5500 (Broadwell GT2).
I am uploading image data into GL texture asynchronously.
In debug output I am getting these warnings during the rendering:
Source: OpenGL, type: Other, id: 131185, severity: Notification
Message: Buffer detailed info: Buffer object 1 (bound to mapped WRITE_ONLY in SYSTEM HEAP memory (fast).
Source: OpenGL, type: Performance, id: 131154, severity: Medium
Message: Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.
I can't see any wrong usage of PBOs in my case, or any errors. So the question is whether these warnings are safe to discard, or whether I am actually doing something wrong.
My code for that part:
// start copying pixels into the PBO from RAM:
const uint32_t buffSize = pipe->GetBufferSize();
GLubyte* ptr = (GLubyte*)mPBOs[mCurrentPBO].MapRange(0, buffSize, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
if (ptr)
memcpy(ptr, pipe->GetBuffer(), buffSize);
// copy pixels from the other, already full PBO (except on the first frame) into the texture
// mCopyTex is bound to mCopyFBO as an attachment
glTextureSubImage2D(mCopyTex->GetHandle(), 0, 0, 0, mClientSize.x, mClientSize.y,
mCurrentPBO = 1 - mCurrentPBO;
Then I just blit the result to default frame buffer. No rendering of geometry or anything like that.
0,//default FBO id
Running on NVIDIA GTX 960 card.
This performance warning is NVIDIA-specific, and it is intended as a hint that you are not going to use a separate hardware transfer queue. That is no wonder, since you use a single-thread, single-GL-context model in which both rendering (at least your blit) and transfer are carried out.
See this NVIDIA presentation for some details about how NVIDIA handles this. Page 22 also explains this specific warning. Note that this warning does not mean that your transfer is not asynchronous. It is still fully asynchronous with respect to the CPU thread. It will just be processed synchronously on the GPU, with respect to the render commands in the same command queue. You are not using the asynchronous copy engine, which could do these copies independently of the rendering commands in a separate command queue.
I can't see any wrong usage of PBOs in my case, or any errors. So the question is whether these warnings are safe to discard, or whether I am actually doing something wrong.
There is nothing wrong with your PBO usage.
It is not clear if your specific application could even benefit from using a more elaborate separate transfer queue scheme.
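If you decide these two messages are noise for your application, one option is to filter them by id at the top of the callback you registered with glDebugMessageCallback. This is a minimal sketch of such a filter. The ids are the ones from the debug output quoted in the question and are driver-specific, so treat them as assumptions on any other driver. `shouldReport` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Message ids taken from the debug output quoted in the question.
constexpr std::uint32_t kBufferDetailedInfoId = 131185;   // "Buffer detailed info"
constexpr std::uint32_t kPixelPathSyncWarningId = 131154; // "Pixel transfer is synchronized with 3D rendering"

// Hypothetical predicate to call at the top of a debug message callback.
// Returns true if the message should still be reported.
constexpr bool shouldReport(std::uint32_t id) {
    return id != kBufferDetailedInfoId && id != kPixelPathSyncWarningId;
}
```

Inside the callback you would return early when `shouldReport(id)` is false and log everything else as before.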
What I'm used to from OpenGL is that, inside the command buffer, resources are bound to the shader, as with glUniformMatrix4fv.
Now, as far as I can see, the only equivalent is vkCmdPushConstants.
But there is also the option to create a large buffer with the data of all the objects, and then use vkCmdBindDescriptorSets to change the offset so the shader uses the data for the corresponding object (correct me if something is wrong here; this is how I suppose it could be done).
But what is the "right" way to get per-object data to your shader? And in what way does it depend on the amount of data that has to change per object?
The other question I have has to do with synchronizing the GPU and CPU.
You need to wait for a frame to be ready before you copy the data for the next frame onto the GPU.
So how can you let the buffer copy happen in a command buffer? I mean something like a vkFlushMappedMemoryRanges that takes a command buffer. Then you could set semaphores and wait for the usages of the data to be complete before overwriting the old data on the GPU with new data for the next frame from RAM. And in RAM you could use a separate buffer for each image in the swapchain, so you can already start writing the data for the next frames (up to the swapchain image count).
If you cannot synchronize that buffer copy, it seems to me you would need a buffer on the GPU with per-object data for each swapchain image, and that seems like a waste of space to me.
To explain the problem I see a bit more: suppose there is only one buffer containing shader data, both in RAM and in GPU memory, and you do not want the GPU to be idle after each frame (I think you only want to wait if you have already submitted the command buffers for all the frames that fit in the swapchain):
cpu pushes objects positions for frame 0 to the gpu
cpu commits the command buffer for frame 0
gpu starts rendering frame 0
cpu pushes object positions for frame 1 to the gpu
cpu commits the command buffer for frame 1
gpu finishes frame 0
gpu starts frame 1
gpu finishes frame 1
In this example the data for frame 1 is already pushed to the GPU while it is still rendering frame 0, and this corrupts frame 0.
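The timeline above can be reproduced with a toy simulation in which the CPU always runs one frame ahead of the GPU. This is only a sketch of the hazard, not Vulkan code. `framesSeeOwnData` and the notion of a "slot" (one copy of the per-object data) are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <vector>

// Returns true if every frame still finds its own data in its slot at the
// moment the GPU finishes rendering it. Assumes the CPU writes the data for
// frame N+1 before the GPU has finished frame N, as in the timeline above.
bool framesSeeOwnData(int slotCount, int frameCount) {
    assert(slotCount > 0);
    std::vector<int> slots(slotCount, -1); // which frame's data each slot holds
    for (int frame = 0; frame <= frameCount; ++frame) {
        if (frame < frameCount) {
            slots[frame % slotCount] = frame; // CPU pushes data for this frame
        }
        const int gpuFrame = frame - 1;       // GPU finishes the previous frame
        if (gpuFrame >= 0 && slots[gpuFrame % slotCount] != gpuFrame) {
            return false;                     // data was overwritten while in use
        }
    }
    return true;
}
```

With one slot the function returns false (frame 0 is corrupted by the data for frame 1). With two or more slots it returns true, which is the "one buffer per frame in flight" trade-off the question describes.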
I'm sorry if my post is a bit incoherent or vague, but it's hard to explain a problem that you do not fully understand.
per-vertex should have been per-object.
A function I would be looking for is:
VkCmdFillGpuMemory(VkCommandBuffer commandbuffer, VkDeviceMemory myMemory, void* ramData).
Preferably it would also have a range option to copy only a part of the data (so there is the option to copy data only for objects whose data changed).
All uniform data now has to go through either push constants or UBO-like buffers.
Per-vertex state (vertex attributes) is set with VkPipelineVertexInputStateCreateInfo in the VkGraphicsPipelineCreateInfo and you set the buffers to be used with vkCmdBindVertexBuffers.
There is a vkCmdCopyBuffer to copy data between buffers.
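The "one large buffer plus an offset per object" idea from the question maps onto dynamic uniform buffer descriptors, where vkCmdBindDescriptorSets takes dynamic offsets that must be multiples of the device's minUniformBufferOffsetAlignment. This is a minimal sketch of the offset arithmetic only. `alignUp` and `PerObjectLayout` are hypothetical, and the layout (all objects of one frame slot stored contiguously) is just one possible choice.

```cpp
#include <cassert>
#include <cstdint>

// Round size up to a multiple of alignment. The alignment must be a power of
// two, which holds for minUniformBufferOffsetAlignment.
constexpr std::uint64_t alignUp(std::uint64_t size, std::uint64_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Hypothetical layout: [slot 0: obj 0, obj 1, ...][slot 1: obj 0, obj 1, ...]
struct PerObjectLayout {
    std::uint64_t objectStride; // aligned size of one object's data
    std::uint64_t objectCount;  // objects stored per frame slot

    // Byte offset of one object's data within the big buffer.
    std::uint64_t offset(std::uint64_t frameSlot, std::uint64_t objectIndex) const {
        assert(objectIndex < objectCount);
        return (frameSlot * objectCount + objectIndex) * objectStride;
    }
};
```

For example, a 64-byte matrix with a 256-byte alignment (both example values) gives `alignUp(64, 256) == 256`, so object 3 in slot 0 sits at offset 768. Note that the dynamic offsets passed to vkCmdBindDescriptorSets are 32-bit, so the result would need a range check before narrowing.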
Is there a way to increase the speed of glReadPixels? Currently I do:
Gdx.gl.glReadPixels(0, 0, Gdx.graphics.getWidth(), Gdx.graphics.getHeight(), GL20.GL_RGBA, GL20.GL_UNSIGNED_BYTE, pixels);
The problem is that it blocks the rendering and is slow.
I have heard of Pixel Buffer Objects, but I am quite unsure on how to wire it up and whether it is faster or not.
Also, is there any other solution than glReadPixels?
Basically, I want to take a screenshot as fast as possible, without blocking the drawing of the next scene.
Is there a way to increase the speed of glReadPixels?
Well, the speed of that operation is actually not the main issue. It has to transfer a certain amount of bytes from the framebuffer to your system memory. In your typical desktop system with a discrete GPU, that involves sending the data over PCI-Express, and there is no way around that.
But as you already stated, the implicit synchronization is a big issue. If you need that pixel data as soon as possible, you can't really do much better than that synchronous readback. But if you can live with getting that data later, asynchronous readback via pixel buffer objects (PBOs) is the way to go.
The pseudo code for that is:
1. Create the PBO.
2. Bind the PBO to GL_PIXEL_PACK_BUFFER.
3. Do the glReadPixels.
4. Do something else. Both work on the CPU and issuing new commands for the GPU are ideal.
5. Read back the data from the PBO, either by using glGetBufferSubData or by mapping the PBO for reading.
The crucial point is the timing of step 5. If you do that too early, you are still blocking the client side, as it will wait for the data to become available. For screenshots, it should not be hard to delay that step by one or two frames. That way, it will have only a slight impact on the overall render performance, and it will stall neither the GPU nor the CPU.
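One way to get that one-or-two-frame delay is a small ring of PBOs: each frame issues its glReadPixels into one PBO and maps the one that was filled a few frames earlier. This sketch shows only the index scheduling, with no GL calls. `ReadbackRing` and its members are hypothetical names.

```cpp
#include <cassert>

// Hypothetical schedule for a ring of pack PBOs.
struct ReadbackRing {
    int pboCount; // 2 delays the read by one frame, 3 by two frames

    // PBO that this frame's glReadPixels would write into.
    int issueSlot(int frame) const { return frame % pboCount; }

    // Frame whose pixels are mapped during `frame`, or -1 if none is ready yet.
    int mappedFrame(int frame) const {
        const int source = frame - (pboCount - 1);
        return source >= 0 ? source : -1;
    }

    // PBO to map during `frame` (meaningful only when mappedFrame(frame) >= 0).
    int mapSlot(int frame) const { return (frame + 1) % pboCount; }
};
```

With `pboCount` set to 3, frame 5 reads into slot 2 and maps slot 0, which holds the pixels requested in frame 3.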
Okay, I read everything about PBOs here: http://www.opengl.org/wiki/Pixel_Buffer_Object
and there: http://www.songho.ca/opengl/gl_pbo.html, but I still have a question, and I don't know if I'll get any benefit out of a PBO in my case:
I'm doing video streaming. Currently I have a function copying my data buffers to 3 different textures, and then I'm doing some maths in a fragment shader and I display the texture.
I thought a PBO could reduce the CPU -> GPU upload time, but here is my problem. Let's say we have this example, taken from the second link above.
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[nextIndex]);
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
GLubyte* ptr = (GLubyte*)glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if(ptr)
{
    // update data directly on the mapped buffer
    updatePixels(ptr, DATA_SIZE);
    glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB); // release pointer to mapping buffer
}
// measure the time modifying the mapped buffer
updateTime = t1.getElapsedTimeInMilliSec();
// it is good idea to release PBOs with ID 0 after use.
// Once bound with 0, all pixel operations behave normal ways.
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
Well, whatever the behavior of the updatePixels function is, it is still using CPU cycles to copy the data to the mapped buffer, isn't it?
So let's say I wanted to use a PBO in such a manner, that is, to write my frame pixels to the PBO in one function, and then call glTexSubImage2D (which should return immediately) in the display function. Would I see any speed-up in terms of performance?
I can't see why it would be faster. Okay, we're not waiting anymore during the glTex* call, but we're waiting during the function that uploads the frame to the PBO, aren't we?
Could someone clear that up for me, please?
The point about buffer objects is that they can be used asynchronously. You can map a BO and then have some other part of the program update it (think threads, think asynchronous IO) while you keep issuing OpenGL commands. A typical usage scenario with triple-buffered PBOs may look like this:
glUnmapBuffer buffer[k-2]
glTexSubImage2D from buffer[k-2]
buffer[k] = glMapBuffer
This allows your program to do useful work, and even to upload data to OpenGL, while OpenGL is also busy rendering.