Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering - c++

I am uploading image data into a GL texture asynchronously.
In the debug output I am getting these warnings during rendering:
Source: OpenGL, type: Other, id: 131185, severity: Notification
Message: Buffer detailed info: Buffer object 1 (bound to GL_PIXEL_UNPACK_BUFFER_ARB, usage hint is GL_DYNAMIC_DRAW) has been mapped WRITE_ONLY in SYSTEM HEAP memory (fast).

Source: OpenGL, type: Performance, id: 131154, severity: Medium
Message: Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.
I can't see any wrong usage of PBOs in my case or any errors. So the question is: are these warnings safe to ignore, or am I actually doing something wrong?
My code for that part:
// start copying pixels into the PBO from RAM:
mPBOs[mCurrentPBO].Bind(GL_PIXEL_UNPACK_BUFFER);
const uint32_t buffSize = pipe->GetBufferSize();
GLubyte* ptr = (GLubyte*)mPBOs[mCurrentPBO].MapRange(0, buffSize, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
if (ptr)
{
    memcpy(ptr, pipe->GetBuffer(), buffSize);
    mPBOs[mCurrentPBO].Unmap();
}

// copy pixels from the other, already filled PBO into the texture (except on the first frame)
mPBOs[1 - mCurrentPBO].Bind(GL_PIXEL_UNPACK_BUFFER);
// mCopyTex is attached to mCopyFBO as a color attachment
glTextureSubImage2D(mCopyTex->GetHandle(), 0, 0, 0, mClientSize.x, mClientSize.y,
                    GL_RGBA, GL_UNSIGNED_BYTE, 0);
mCurrentPBO = 1 - mCurrentPBO;
Then I just blit the result to the default framebuffer. No rendering of geometry or anything like that.
glBlitNamedFramebuffer(
    mCopyFBO,
    0,                                        // default FBO id
    0, 0, mViewportSize.x, mViewportSize.y,   // source rect
    0, 0, mViewportSize.x, mViewportSize.y,   // destination rect
    GL_COLOR_BUFFER_BIT,
    GL_LINEAR);
Running on an NVIDIA GTX 960 card.

This performance warning is NVIDIA-specific and is intended as a hint telling you that you are not going to use a separate hardware transfer queue, which is no wonder, since you use a single-thread, single-GL-context model in which both the rendering (at least your blit) and the transfer are carried out.
See this NVIDIA presentation for some details about how NVIDIA handles this; page 22 also explains this specific warning. Note that this warning does not mean that your transfer is not asynchronous. It is still fully asynchronous with respect to the CPU thread. It will just be processed synchronously on the GPU with respect to the render commands that are in the same command queue, and you are not using the asynchronous copy engine, which could do these copies in a separate command queue, independent of the rendering commands.
I can't see any wrong usage of PBOs in my case or any errors. So the question is: are these warnings safe to ignore, or am I actually doing something wrong?
There is nothing wrong with your PBO usage.
It is not clear if your specific application could even benefit from using a more elaborate separate transfer queue scheme.
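For completeness, here is a very rough sketch of what such a scheme could look like: a second GL context that shares objects with the main one, with the texture uploads issued from a dedicated upload thread. This is only an assumption-laden illustration, not code from the question; CreateSharedContext/MakeCurrent stand for the platform-specific wgl/glX/EGL calls, and whether the driver actually schedules the transfer on the copy engine is entirely up to it.

// Hypothetical sketch: issue the upload from a second, shared GL context so the
// driver *may* route the transfer through the asynchronous copy engine.
#include <GL/glew.h>
#include <atomic>
#include <cstddef>
#include <cstring>

void MakeCurrent(void* sharedCtx);   // placeholder for wglMakeCurrent/glXMakeCurrent/eglMakeCurrent

std::atomic<GLsync> gUploadFence{nullptr};

// Runs on a dedicated upload thread with 'sharedCtx' current.
void UploadFrame(void* sharedCtx, GLuint pbo, GLuint tex,
                 int w, int h, const void* pixels, size_t size)
{
    MakeCurrent(sharedCtx);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    void* dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    std::memcpy(dst, pixels, size);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    // The transfer is issued from this context/thread, not from the render thread.
    glTextureSubImage2D(tex, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    // Fence so the render thread knows when the texture is ready; flush to submit it.
    gUploadFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();
}

// Render thread, once per frame, before using 'tex':
//   if (GLsync f = gUploadFence.exchange(nullptr)) {
//       glWaitSync(f, 0, GL_TIMEOUT_IGNORED);   // GPU-side wait, no CPU stall
//       glDeleteSync(f);
//   }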

Related

Read pixel data from default framebuffer in OpenGL: Performance of FBO vs. PBO

My goal is to read the contents of the default OpenGL framebuffer and store the pixel data in a cv::Mat. Apparently there are two different ways of achieving this:
1) Synchronous: use FBO and glReadPixels
cv::Mat a = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, a.data);
2) Asynchronous: use PBO and glReadPixels
cv::Mat b = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo_userImage);
glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, 0);
unsigned char* ptr = static_cast<unsigned char*>(glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY));
std::copy(ptr, ptr + 1920 * 1080 * 3 * sizeof(unsigned char), b.data);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
From all the information I collected on this topic, the asynchronous version 2) should be much faster. However, comparing the elapsed time for both versions shows that the differences are often minimal, and sometimes version 1) even outperforms the PBO variant.
For performance checks, I've inserted the following code (based on this answer):
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
....
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << std::endl;
I've also experimented with the usage hint when creating the PBO: I didn't find much of a difference between GL_DYNAMIC_COPY and GL_STREAM_READ here.
I'd be happy for suggestions on how to speed up this pixel-read operation from the framebuffer even further.
Your second version is not asynchronous at all, since you're mapping the buffer immediately after triggering the copy. The map call will then block until the contents of the buffer are available, effectively becoming synchronous.
Or, depending on the driver, it will block when you actually read from it. In other words, the driver may implement the mapping in such a way that it causes a page fault and a subsequent synchronization. It doesn't really matter in your case, since you are still accessing that data straight away due to the std::copy.
The proper way of doing this is by using sync objects and fences.
Keep your PBO setup, but after issuing the glReadPixels into a PBO, insert a sync object into the stream via glFenceSync. Then, some time later, poll for that fence sync object to be complete (or just wait for it altogether) via glClientWaitSync.
If glClientWaitSync returns that the commands before the fence are complete, you can now read from the buffer without an expensive CPU/GPU sync. (If the driver is particularly stupid and didn't already move the buffer contents into mappable addresses, in spite of your usage hints on the PBO, you can use another thread to perform the map. glGetBufferSubData can therefore be cheaper, as the data doesn't need to be in a mappable range.)
If you need to do this on a frame-by-frame basis, you'll notice that you'll very likely need more than one PBO, that is, a small pool of them. This is because at the next frame the readback of the previous frame's data is not complete yet and the corresponding fence is not yet signalled. (Yes, GPUs are massively pipelined these days, and they will be some frames behind your submission queue.)
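To illustrate, a minimal sketch of such a fenced, multi-buffered readback could look like the following. The pool size, resolution and GL_BGR format are assumptions carried over from the snippets above; treat it as a starting point, not a drop-in implementation.

// Minimal sketch of fenced, n-buffered readback (sizes/formats are assumptions).
#include <GL/glew.h>
#include <cstring>

constexpr int kNumPbos = 3;
constexpr int kWidth = 1920, kHeight = 1080;
constexpr size_t kBytes = kWidth * kHeight * 3;   // GL_BGR, GL_UNSIGNED_BYTE

GLuint pbos[kNumPbos];
GLsync fences[kNumPbos] = {};
int frame = 0;

void InitPbos()
{
    glGenBuffers(kNumPbos, pbos);
    for (int i = 0; i < kNumPbos; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, kBytes, nullptr, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

// Call once per frame after rendering; copies into 'out' when a readback is ready.
bool ReadbackFrame(unsigned char* out)
{
    const int writeIdx = frame % kNumPbos;           // PBO we read *into* this frame
    const int readIdx  = (frame + 1) % kNumPbos;     // oldest PBO, most likely finished

    // If this PBO still has an unconsumed fence (reader fell behind), drop it.
    if (fences[writeIdx]) { glDeleteSync(fences[writeIdx]); fences[writeIdx] = nullptr; }

    // Kick off an asynchronous readback into the current PBO.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[writeIdx]);
    glReadPixels(0, 0, kWidth, kHeight, GL_BGR, GL_UNSIGNED_BYTE, nullptr);
    fences[writeIdx] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    bool gotFrame = false;
    if (fences[readIdx]) {
        // Poll with zero timeout; use a longer timeout or a loop to wait instead.
        GLenum r = glClientWaitSync(fences[readIdx], GL_SYNC_FLUSH_COMMANDS_BIT, 0);
        if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
            glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[readIdx]);
            void* ptr = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, kBytes, GL_MAP_READ_BIT);
            if (ptr) {
                std::memcpy(out, ptr, kBytes);
                glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
                gotFrame = true;
            }
            glDeleteSync(fences[readIdx]);
            fences[readIdx] = nullptr;
        }
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    ++frame;
    return gotFrame;
}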

Compressed framebuffer offloading

I'm trying to write code that offloads the framebuffer from one card to the other, and I'm wondering whether it's possible to make efficient use of compression, since memory bandwidth seems to be the bottleneck in my case.
At the moment, I use simple readback & display routines:
readback:
glWaitSync(..);
glReadPixels(.., GL_BGRA, GL_UNSIGNED_BYTE, NULL);
GLvoid *data = glMapBuffer(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY);
display:
glGenBuffers(2, pbos);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, pbos[curr_ctx ^ 1]);
glBufferData(GL_PIXEL_UNPACK_BUFFER_EXT, width*height*4, NULL,GL_STREAM_DRAW);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, pbos[curr_ctx]);
glBufferData(GL_PIXEL_UNPACK_BUFFER_EXT, width*height*4, NULL,GL_STREAM_DRAW);
...
glBufferSubData(GL_PIXEL_UNPACK_BUFFER_EXT, 0, width*height*4, data);
glDrawPixels(width, height, GL_BGRA, GL_UNSIGNED_BYTE, NULL);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, pbos[curr_ctx ^= 1]);
...
glXSwapBuffers(...);
There is also some synchronization via mutexes and other miscellaneous code in there, but this is the main body of the current code.
Unfortunately, it seems that memory bandwidth is the biggest problem here (on the display-card side, which is a sort-of-capable USB capture card).
Is there any way to optimize this via OpenGL compression (S3TC)?
Preferably, I would like to compress on the render card, copy the result into RAM, and then send it downstream to the capture (display) card.
I do believe I've seen some people do this by copying the framebuffer into a texture and asking GL to compress it, but quite frankly I'm new to GL programming, so I thought I would ask here.
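For what it's worth, the "copy the framebuffer into a texture and ask GL to compress it" idea can look roughly like the sketch below: copy out of the current READ framebuffer into a texture with an S3TC internal format, then read the compressed blocks back. This assumes EXT_texture_compression_s3tc is available and that the driver's on-the-fly compression is acceptable in quality and speed, which varies a lot between vendors; the function and names here are illustrative, not from the question.

// Rough sketch: compress the framebuffer contents on the render card via S3TC,
// then read the compressed blocks back to system memory.
#include <GL/glew.h>
#include <vector>

std::vector<unsigned char> ReadbackCompressed(int width, int height)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);

    // Ask the driver to compress while copying out of the current READ framebuffer.
    glCopyTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA_S3TC_DXT1_EXT,
                     0, 0, width, height, 0);

    // Query how many bytes the compressed image occupies, then fetch it.
    GLint compressedSize = 0;
    glGetTexLevelParameteriv(GL_TEXTURE_2D, 0,
                             GL_TEXTURE_COMPRESSED_IMAGE_SIZE, &compressedSize);

    std::vector<unsigned char> blocks(compressedSize);
    glGetCompressedTexImage(GL_TEXTURE_2D, 0, blocks.data());

    glDeleteTextures(1, &tex);
    return blocks;
}

On the display card you would then re-upload the blocks with glCompressedTexImage2D and draw a textured quad instead of using glDrawPixels; the readback itself can also be combined with a GL_PIXEL_PACK_BUFFER to keep it asynchronous.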

OpenGL PBO texture uploading oddities

I'm basing my tests on this popular PBO example (see pboUnpack.zip from http://www.songho.ca/opengl/gl_pbo.html). Tests are done on PBO Mode 1 per the example.
Running the original sample, I found that on my NVIDIA 560GTX PCIe x16 (driver v334.89, Win7 Pro x64, Core i5 Ivy Bridge 3.6 GHz), glMapBufferARB() blocks for 15 ms even though the preceding glBufferDataARB() call was meant to prevent it from blocking (i.e. discard the PBO).
I then changed the image size from the original 1024*1024 to 400*400, thinking that this would surely reduce the blocking time. To my surprise, it remained at 15 ms, and CPU utilization remained high.
Experimenting further, I increased the image size to 4000*4000 and yet again I was surprised: glBufferDataARB dropped from 15 ms to 0.1 ms, and CPU utilization dropped tremendously at the same time.
I am at a loss to explain what is going on here, and I am hoping someone familiar with this issue can shed some light.
Code of interest:
// bind PBO to update pixel values
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[nextIndex]);
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
GLubyte* ptr = (GLubyte*)glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if(ptr)
{
    // update data directly on the mapped buffer
    updatePixels(ptr, DATA_SIZE);
    glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB); // release pointer to mapping buffer
}

OpenGL PBO for texture upload, can't understand one thing

Okay, I read everything about PBOs here: http://www.opengl.org/wiki/Pixel_Buffer_Object
and here: http://www.songho.ca/opengl/gl_pbo.html , but I still have a question, and I don't know if I'll get any benefit out of a PBO in my case:
I'm doing video streaming. Currently I have a function copying my data buffers to 3 different textures, then I do some maths in a fragment shader and display the texture.
I thought a PBO could speed up the CPU -> GPU upload, but consider this example, taken from the second link above:
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[nextIndex]);
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
GLubyte* ptr = (GLubyte*)glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if(ptr)
{
    // update data directly on the mapped buffer
    updatePixels(ptr, DATA_SIZE);
    glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB); // release pointer to mapping buffer
}
// measure the time modifying the mapped buffer
t1.stop();
updateTime = t1.getElapsedTimeInMilliSec();
///////////////////////////////////////////////////
// it is good idea to release PBOs with ID 0 after use.
// Once bound with 0, all pixel operations behave normal ways.
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
Well, whatever the behavior of the updatePixels function is, it still uses CPU cycles to copy the data into the mapped buffer, doesn't it?
So let's say I wanted to use a PBO in such a manner, that is, to copy my frame's pixels into the PBO in one function, and then in the display function call glTexSubImage2D (which should return immediately)... would I see any speed-up in terms of performance?
I can't see why it would be faster... okay, we're not waiting anymore during the glTex* call, but we're waiting during the function that uploads the frame to the PBO, aren't we?
Could someone clear that out for me please?
Thanks
The point about buffer objects is that they can be used asynchronously. You can map a BO and then have some other part of the program update it (think threads, think asynchronous IO) while you keep issuing OpenGL commands. A typical usage scenario with triple-buffered PBOs may look like this:
wait_for_video_frame_load_complete(buffer[k-2])
glUnmapBuffer buffer[k-2]
glTexSubImage2D from buffer[k-2]
buffer[k] = glMapBuffer
start_load_next_video_frame(buffer[k]);
draw_texture
SwapBuffers
This allows your program to do useful work, and even upload data to OpenGL, while OpenGL is also being used for rendering.
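To make that pseudocode a bit more concrete, here is a rough GL sketch of the same triple-buffered loop. wait_frame_loaded, load_frame_async and draw_textured_quad are placeholder names for whatever loader and draw path the application already has; sizes and formats are assumptions, not part of the original answer.

// Illustrative sketch of the triple-buffered upload loop above.
#include <GL/glew.h>

void wait_frame_loaded(int pboIndex);            // blocks until the loader finished writing
void load_frame_async(int pboIndex, void* dst);  // returns immediately; loader fills dst
void draw_textured_quad(GLuint tex);             // the usual draw path

constexpr int kNumPbos = 3;
GLuint  pbos[kNumPbos];          // created with glGenBuffers elsewhere
GLuint  videoTex;                // target texture, allocated elsewhere
GLsizei texWidth, texHeight;
size_t  frameBytes;              // texWidth * texHeight * bytes per pixel
long    frame = 0;

void PerFrame()
{
    const int mapIdx    = frame % kNumPbos;        // buffer handed to the loader now
    const int uploadIdx = (frame + 1) % kNumPbos;  // buffer the loader filled 2 frames ago

    if (frame >= kNumPbos - 1) {
        wait_frame_loaded(uploadIdx);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbos[uploadIdx]);
        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
        glBindTexture(GL_TEXTURE_2D, videoTex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, texWidth, texHeight,
                        GL_RGBA, GL_UNSIGNED_BYTE, nullptr);   // DMA out of the PBO
    }

    // Hand the next buffer to the loader while the GPU consumes the previous one.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbos[mapIdx]);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, frameBytes, nullptr, GL_STREAM_DRAW);  // orphan
    void* ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, frameBytes,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    load_frame_async(mapIdx, ptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    draw_textured_quad(videoTex);
    // SwapBuffers(...) via the windowing layer
    ++frame;
}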

GL/CL interoperability: Shared Texture [closed]

I intend to do graphics calculations with OpenCL, such as ray casting, ray marching and others, and I want to use OpenGL to display the result of these calculations (pixel images). I use a texture attached to a framebuffer object. OpenCL writes the result into the texture, and then I use glBlitFramebuffer to copy the texture data to the application window's framebuffer.
I ran into a CL/GL interop problem while implementing this, and wrote a simple example to show it. The example shows framebuffer object and texture object initialization, their attachment, and creation of an OpenCL buffer from the GL texture. At the end the main render loop is shown: it writes new data into the texture each frame, attaches the texture to the framebuffer, and blits that framebuffer.
Texture Initialization:
for (int i = 0; i < data.Length; i += 4) {
    data [i] = 255;
}
GL.BindTexture (TextureTarget.Texture2D, tboID [0]);
GL.TexImage2D<byte> (TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba8, w, h, 0,
    PixelFormat.Rgba, PixelType.UnsignedByte, data);
GL.BindTexture (TextureTarget.Texture2D, 0);
TBO+FBO Initialization:
GL.BindFramebuffer (FramebufferTarget.FramebufferExt, fboID [0]);
GL.FramebufferTexture2D (FramebufferTarget.FramebufferExt, FramebufferAttachment.ColorAttachment0,
TextureTarget.Texture2D, tboID [0], 0);
GL.BindFramebuffer (FramebufferTarget.FramebufferExt, 0);
CL/GL Initialization:
bufferID = CL.CreateFromGLTexture2D (context, memInfo, textureTarget, mipLevel, glBufferObject, out errorCode);
Render Loop:
for (int i = 0; i < data.Length; i += 4) {
    data [i] = tt;
}
tt++;
GL.BindTexture (TextureTarget.Texture2D, tboID [0]);
GL.TexImage2D<byte> (TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba8, w, h, 0,
PixelFormat.Rgba, PixelType.UnsignedByte, data);
GL.BindTexture (TextureTarget.Texture2D, 0);
GL.BindFramebuffer (FramebufferTarget.FramebufferExt, fboID [0]);
GL.FramebufferTexture2D (FramebufferTarget.FramebufferExt, FramebufferAttachment.ColorAttachment0,
TextureTarget.Texture2D, tboID [0], 0);
GL.BindFramebuffer (FramebufferTarget.FramebufferExt, 0);

GL.BindFramebuffer (FramebufferTarget.ReadFramebuffer, fboID [0]);
GL.ReadBuffer (ReadBufferMode.ColorAttachment0);
GL.DrawBuffer (DrawBufferMode.Back);
GL.BlitFramebuffer (0, 0, w, h, 0, 0, w, h, ClearBufferMask.ColorBufferBit, BlitFramebufferFilter.Nearest);
GL.BindFramebuffer (FramebufferTarget.ReadFramebuffer, 0);
At first glance this code looks weird, but it fully demonstrates my problem. CL does no work at all here; the application merely creates an OpenCL context and initializes an OpenCL buffer from the GL texture.
What it should do is simple: the screen color should gradually change from black to red. But it does not work that way; the color never changes from the initial red set during texture initialization.
However, it works normally when I comment out the CL/GL initialization (the creation of the CL buffer from the GL texture).
Why is this so? Why does the behavior of the GL buffer change depending on the CL attachment? How can I fix it and make it work?
EDIT 2:
Then you need to check why you are getting an InvalidImageFormatDescriptor. Check if the parameter order is okay and whether your image descriptor in the tbo is misleading (internal image structure - see OpenCL specification). From the spec:
CL_INVALID_IMAGE_FORMAT_DESCRIPTOR if the OpenGL texture internal format does not map to a supported OpenCL image format.
EDIT:
So I understand OpenCL functionality in OpenTK is provided by a separate project named Cloo. For ComputeImage2D their documentation states:
CreateFromGLTexture2D (ComputeContext context, ComputeMemoryFlags flags, int textureTarget, int mipLevel, int textureId)
Compared to yours:
CreateFromGLTexture2D (context, MemFlags.MemReadWrite, TextureTarget.Texture2D, ((uint[])tboID.Clone()) [0], 0);
Looking at that, you have the mip level and the tbo in the wrong order. An incorrect initialization might lead to unknown behaviour.
Hard to tell what the issue might be from the code you're providing. I'm looking into the interop as well right now, just getting into it. The first thing I would try is to put a try/catch block around it and try to get a clue from any possible error code.
Have you verified the obvious: Is the cl_khr_gl_sharing extension available on your device?
Another guess, since you only provided the Texture/Image Initialisation of the actual OpenCl/OpenGL Interop within your sample code: Did you acquire the memory object?
cl_int clEnqueueAcquireGLObjects (cl_command_queue command_queue,
                                  cl_uint num_objects,
                                  const cl_mem *mem_objects,
                                  cl_uint num_events_in_wait_list,
                                  const cl_event *event_wait_list,
                                  cl_event *event)
The OpenCL 1.1 specification states:
The function cl_int clEnqueueAcquireGLObjects is used to acquire OpenCL memory objects that have been created from OpenGL objects. These objects need to be acquired before they can be used by any OpenCL commands queued to a command-queue. The OpenGL objects are acquired by the OpenCL context associated with command_queue and can therefore be used by all command-queues associated with the OpenCL context.
So the issue might be that the memory object hasn't been bound to a specific command queue.
Also, one is responsible for issuing a glFinish() to ensure all the initialization on the OpenGL side is done before the memory object is acquired.
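As a rough illustration of that sequence (not code from the question), the acquire/use/release pattern around a shared image looks something like this in the C API; queue, kernel and sharedImage are placeholders created elsewhere, and error checking is omitted:

#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GL/gl.h>

// Sketch of the GL -> CL handoff for a shared texture.
void RunKernelOnSharedTexture(cl_command_queue queue, cl_kernel kernel,
                              cl_mem sharedImage, size_t width, size_t height)
{
    glFinish();                                            // GL must be done with the texture

    clEnqueueAcquireGLObjects(queue, 1, &sharedImage, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &sharedImage);
    size_t gws[2] = { width, height };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, NULL, 0, NULL, NULL);

    clEnqueueReleaseGLObjects(queue, 1, &sharedImage, 0, NULL, NULL);
    clFinish(queue);                                       // CL must be done before GL reuses it
}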
Finally, we were able to run Itun's code on my system (Windows 7 / AMD Radeon HD 5870). Recall that Itun's code gradually changes the color of the texture from black to red by means of GL after activating GL/CL interop on this texture.
The results are strange, to say the least. On my system it works as intended. However, on Itun's system (Windows 7 / NVIDIA GeForce) the same code does not work at all and does not produce any exceptions or error codes. In addition, I would like to mention that CL works with this texture properly on both systems, so something is wrong with GL in this case.
We have no idea what is going on - it could be either Itun's outdated GPU hardware or buggy NVIDIA drivers.