Fastest way to transfer buffer/texture data from GPU to CPU

Fastest way to transfer buffer/texture data from GPU to CPU - c++

As the question implies, I'm trying to transfer some buffer data from the gpu to the cpu and I want to do it fast.
Specifically, I'd like to transfer a 640x480 float buffer in less than 1ms.
Question 1: Is this possible in less than 1ms?
Whether it is or not, I'd like to find out what the fastest way is. Everything I've tried up to this point is by using FBOs. Here are the different methods and their respective average time for transfer. All of these are run right after binding to the FBO and rendering on the textures. As I'm no expert, there might be mistakes in my code or I might be doing unnecessary steps so please let me know. The transfers, however, have all been checked to be successful. I transfer everything to cv::Mat objects.
1)Using glReadPixels - < 3.1ms
glBindTexture(GL_TEXTURE_2D,depthTexture);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT,mat.data);
2)Using glGetTexImage - <2.9ms
glBindTexture(GL_TEXTURE_2D,depthTexture);
glGetTexImage(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, GL_FLOAT, mat.data);
3)Using PBO with glGetTexImage - <2.3ms
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, depthTexture);
glGetTexImage(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, GL_FLOAT, 0);
mat.data = (uchar*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
Before I go on, I understand that I don't use PBOs to their full potential since I immediately call glMapBuffer but there is no other process for the cpu to do at the moment. The texture is drawn the moment I have the necessary data and the texture data is necessary for me to move on. Despite all this, PBOs still seem faster.
Here's something interesting(to me at least). These are measured in debug mode. In release mode they are all 1ms slower.
Question 2: Why are they slower in release mode? Can I change this?
Question 3: Are there any other ways I can try to do the transfer?
Extra notes on Q3:
I read somewhere that the integrated graphics card can have faster access. Is this a thing? How would I make use of this? Is this connected to GL_INTEL_map_texture?
I barely know what CUDA is but it seems there is a way to do the transfer faster using it. Is this true?
Will reading the depth buffer instead of a texture be faster?

Related

PBO streaming, exactly when does the sync happen?

I was trying to make ffmpeg decode and transform pixels into rgb8 format and write into a mapped pixel buffer and use streaming to update opengl texture, which is then rendered to a sdl window.
The decoding and uploading happens in a dedicated thread(make sws_scale writes to the mapped buffer), and the rendering is done in a the render thread in another context with sharing. (The PBO actually holds several frames, and the texture is a 2d array texture, to decouple the positions.)
Things works fine if I flush the mapped range in the decoding thread, and use glTextureSubImage3D in the render thread to update the texture at needed index. The integrated Intel gpu works pretty fast (it should) in this scenario, but the NV driver complains about Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.
I thought that might be that only glTextureSubImage3D actually does the upload, so I moved the glTextureSubImage3D right after the flush operation. This time the NV gpu works fine, and the warning disappears, whereas the intel gpu gives a black window, and only shows decoded content on closing.
The code is something like this:
//render thread
void RenderFrame(SDL_Window* window,GLobjects& glo, int index, int width, int height) {
glUniform1f(glo.index_location,index);
//The function in question
glTextureSubImage3D(glo.texture, 0, 0, 0, index, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(index * width * height * 4));
glClear(GL_COLOR_BUFFER_BIT);
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
SDL_GL_SwapWindow(window);
}
//Decode thread
int DecodeFrameToPBO(GLobjects& glo, int index){
//fill the mapped range needed
glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER, index * width * height * 4, 4 * width * height);
//The function in question
//glTextureSubImage3D(glo.texture, 0, 0, 0, index, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(index * width * height * 4));
}
I'm really confused by the idea of client-side memory and how the driver asynchronously uploads the texture, where exactly is the upload is supposed to happen and what glTextureSubImage3D actually does when a GL_PIXEL_UNPACK_BUFFER is bound?
EDIT:
After adding a glFlush() command to flush the upload context's command queue after each upload, the intel version works properly without black screen.
UPDATE:
Adding the glFlush() seems to make the NV gpu emit warning 'Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.' again, and same video sample's GPU utilization grew from 8% to 10%. It seems the glFlush() triggers some internal synchronization that perhaps make things go under busy wait? Since the Intel GPU cannot work without the glFlush, even with the clientwaitsync version with the flush command bit set, and calling flush explicitly on the render side also does not work. So what should be done in order to make both driver happy (and reduce utilization)?

I think you are mislead by the nvidia warning. It does not imply that there is a CPU-GPU synchronization, it only tells you that the rendering of the texture is synchronized (has to wait for) uploading the texture, which is fine. See this answer for more details.
So my answer is: there is no issue, hence the solution is to not change anyhing.
I thought that might be that only glTextureSubImage3D actually does the upload, so I moved the glTextureSubImage3D right after the flush operation [into the decode thread].
If you do that, you now have to manually synchronize the rendering with the texture upload, or you will encounter half-written frames or even undefined content at times, and basically have a race condition.
You could do such synchronization with OpenGL Sync Objects. But in the end, you would not get more performance than in your original variant.
whereas the intel gpu gives a black window, and only shows decoded content on closing
It is not clear if this is only a result of the missing synchronization, a bug in your code, or even a driver bug.

Memory barrier fails to sync between compute stage and data access by CUDA

I have the following pipeline:
Render into texture attachment to custom FBO.
Bind that texture attachment as image.
Run compute shader ,sampling from the image above using imageLoad/Store.
Write the results into SSBO or image.
Map the SSBO (or image) as CUDA CUgraphicsResource and process the data from that buffer using CUDA.
Now,the problem is in synchronizing data between the stages 4 and 5. Here are the sync solutions I have tried.
glFlush - doesn't really work as it doesn't guarantee the completeness of the execution of all the commands.
glFinish - this one works. But it is not recommended as it finalizes all the commands submitted to the driver.
ARB_sync Here it is said it is not recommended because it heavily impacts performance.
glMemoryBarrier This one is interesting. But it simply doesn't work.
Here is example of the code:
glMemoryBarrier(GL_ALL_BARRIER_BITS);
And also tried:
glTextureBarrierNV()
The code execution goes like this:
//rendered into the fbo...
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY,GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8));
glDispatchCompute(16, 16, 1);
glFinish(); // <-- must sync here,otherwise cuda buffer doesn't receive all the data
//cuda maps the image to CUDA buffer here..
Moreover, I tried unbinding FBOs and unbinding textures from the context before launching compute, I even tried to launch one compute after other with a glMemoryBarrier set between them, and fetching the target image from the first compute launch to CUDA. Still no synch. (Well,that makes sense as two computes also run out of sync with each other)
after the compute shader stage. It doesn't sync! Only when I replace with glFinish,or with any other operation which completely stall the pipeline.
Like glMapBuffer(), for example.
So should I just use glFinish(), or I am missing something here?
Why glMemoryBarrier() doesn't sync compute shader work before CUDA takes over the control?
UPDATE
I would like to refactor the question a little bit as the original one is pretty old. Nevertheless, even with the latest CUDA and Video Codec SDK (NVENC) the issue is still alive.So, I don't care about why glMemoryBarrier doesn't sync. What I want to know is:
If it is possible to synchronize OpenGL compute shader execution finish with CUDA's usage of that shared resource without stalling the whole rendering pipeline, which is in my case OpenGL image.
If the answer is 'yes', then how?

I know this is an old question, but if any poor soul stumbles upon this...
First, the reason glMemoryBarrier does not work: it requires the OpenGL driver to insert a barrier into the pipeline. CUDA does not care about the OpenGL pipeline at all.
Second, the only other way outside of glFinish is to use glFenceSync in combination with glClientWaitSync:
....
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY,GL_RGBA8);
glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8));
glDispatchCompute(16, 16, 1);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
... other work you might want to do that does not impact the buffer...
GLenum res = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutInNs);
if(res == GL_TIMEOUT_EXPIRED || res == GL_WAIT_FAILED) {
...handle timeouts and failures
}
cudaGraphicsMapResources(1, &gfxResource, stream);
...
This will cause the CPU to block until the GPU is done with all commands until the fence. This includes memory transfers and compute operations.
Unfortunately, there is no way to tell CUDA to wait on an OpenGL memory barrier/fence. If you really require the extra bit of asynchronicity, you'll have to switch to DirectX 12, for which CUDA supports importing fences/semaphores and waiting on as well as signaling them from a CUDA stream via cuImportExternalSemaphore, cuWaitExternalSemaphoresAsync, and cuSignalExternalSemaphoresAsync.

Most Efficient Way to Retrieve Texture Pixel Data?

I know Directx for Dx9 at least, has a texture object where you are able to get only a small portion of the texture to CPU accessible memory. It was a function called "LockRect" I believe. OpenGL has glGetTexImage() but it grabs the entire image and if the format isn't the same as the texture's then it is going to have to convert the entire texture into the new pixel format on top of transferring the entire texture. This function is also not in OpenGL ES. Framebuffers is another option but where I could potentially bind a framebuffer where a color attachment in connected to a texture. Then there is glReadPixels which reads from the framebuffer, so it should be reading from the texture. glReadPixels has limited pixel format options so a conversion is going to have to happen, but I can read the pixels I need (which is only 1 pixel). I haven't used this method but it seems like it is possible. If anyone can confirm the framebuffer method, that it is a working alternative. Then this method would also work for OpenGL ES 2+.
Are there any other methods? How efficient is the framebuffer method (if it works), does it end up having to convert the entire texture to the desired format before it reads the pixels or is it entirely implementation defined?
Edit: #Nicol_Bolas Please stop removing OpenGL from tags and adding OpenGL-ES, OpenGL-ES isn't applicable, OpenGL is. This is for OpenGL specifically but I would like it to be Open ES 2+ compatible if possible, though it doesn't have to be. If a OpenGL only solution is available then it is a consideration I will make if it is worth the trade off. Thank you.

Please note, I do not have that much experience with ES in particular, so there might be better ways to do this specifically in that context. The general gist applies in either plain OpenGL or ES, though.
First off, the most important performance consideration should be when you are doing the reading. If you request data from the video card while you are rendering, your program (the CPU end) will have to halt until the video card returns the data, which will slow rendering due to your inability to issue further render commands. As a general rule, you should always upload, render, download - do not mix any of these processes, it will impact speed immensely, and how much so can be very driver/hardware/OS dependent.
I suggest using glReadPixels( ) at the end of your render cycle. I suspect the limitations on formats for that function are connected to limitations on framebuffer formats; besides, you really should be using 8 bit unsigned or floating point, both of which are supported. If you have some fringe case not allowing any of those supported formats, you should explain what that is, as there may be a way to handle it specifically.
If you need the contents of the framebuffer at a specific point in rendering (rather than the end), create a second texture + framebuffer (again with the same format) to be an effective "backbuffer" and then copy from the target framebuffer to that texture. This occurs on the video card, so it does not impose the bus latency directly reading does. Here is something I wrote that does this operation:
glActiveTexture( GL_TEXTURE0 + unit );
glBindTexture( GL_TEXTURE_2D, backbufferTextureHandle );
glBindFramebuffer( GL_READ_FRAMEBUFFER, framebufferHandle );
glCopyTexSubImage2D(
GL_TEXTURE_2D,
0, // level
0, 0, // offset
0, 0, // x, y
screenX, screenY );
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, framebufferHandle );
Then when you want the data, bind the backbuffer to GL_READ_FRAMEBUFFER and use glReadPixels( ) on it.
Finally, you should keep in mind that a download of data will still halt the CPU end. If you download before displaying the framebuffer, you will put off displaying the image until after you can again execute commands, which might result in visible latency. As such, I suggest still using a non-default framebuffer even if you only care about the final buffer state, and ending your render cycle to the effect of:
(1.) Blit to the default framebuffer:
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, 0 ); // Default framebuffer
glBindFramebuffer( GL_READ_FRAMEBUFFER, framebufferHandle );
glBlitFramebuffer(
0, 0, screenX, screenY,
0, 0, screenX, screenY,
GL_COLOR_BUFFER_BIT,
GL_NEAREST );
(2.) Call whatever your swap buffers command may be in your given situation.
(3.) Your download call from the framebuffer (be it glReadPixels( ) or something else).
As for the speed impact of the blit/texcopy operations, it's quite good on most modern hardware and I have not found it to have a noticeable impact even done 10+ times a frame, but if you are dealing with antiquated hardware, it might be worth a second thought.

Can OpenGL be used to draw real valued triangles into buffer?

I need to implement an image reconstruction which involves drawing triangles in a buffer representing the pixels in an image. These triangles are assigned some floating point value to be filled with. If triangles are drawn such that they overlap, the values of the overlapping regions must be added together.
Is it possible to accomplish this with OpenGL? I would like to take advantage of the fact that rasterizing triangles is a basic graphics task that can be accelerated on the graphics card. I have a cpu-only implementation of this algorithm already but it is not fast enough for my purposes. This is due to the huge number of triangles that need to be drawn.
Specifically my questions are:
Can I draw triangles with a real value using openGL? (Or can I come up with a hack using color etc?)
Can OpenGL add the values where triangles overlap? (Once again I could deal with a hack, like color mixing)
Can I recover the real values for the pixels as an array of floats or similar to be further processed?
Do I have misconceptions about the idea that drawing in OpenGL -> using GPU to draw -> likely faster execution?
Additionally, I would like to run this code on a virtual machine so getting acceleration to work with OpenGL is more feasible than rolling my own implementation in something like Cuda as far as I understand. Is this true?
EDIT: Is an accumulation buffer an option here?

If 32-bit floats are sufficient then it looks like the answer is yes: http://www.opengl.org/wiki/Image_Format#Required_formats
Even under the old fixed pipeline you could use a blending mode with the function GL_FUNC_ADD, though I'm sure fragment shaders can do it more easily now.
glReadPixels() will get you the data back out of the buffer after drawing.
There are software implementations of OpenGL, but you get to choose when you set up the context. Using the GPU should be much faster than the CPU.
No idea. I've never used OpenGL or CUDA on a VM. I've never used CUDA at all.

I guess giving pieces of code as an answer wouldn't be appropriate here as your question is extremely broad. So I'll simply answer your questions individually with bits of hints.
Yes, drawing triangles with openGL is a piece of cake. You provide 3 vertice per triangle and with the proper shaders your can draw triangles, with filling or just edges, whatever you want. You seem to require a large set (bigger than [0, 255]) since a lot of triangles may overlap, and the value of each may be bigger than one. This is not a problem. You can fill a 32bit precision one channel frame buffer. In your case only one channel may suffice.
Yes, the blending exists since forever on openGL. So whatever the version of openGL you choose to use, there will be a way to add up the value of the trianlges overlapping.
Yes, depending on you implement it you may have to use glGetSubData() or glReadPixels or something else. However, depending on the size of the matrix you're filling, it may be a bit long to download the full buffer (2000x1000 pixels for a one channel at 32bit would be around 4-5ms). It may be more efficient to do all your processing on the GPU and extract only few valuable information instead of continuing the processing on the CPU.
The execution will be undoubtedly faster. However, the download of data from the GPU memory is often not optimized (upload is). So the time you will win on the processing may be lost on the download. I've never worked with openGL on VM so the additional loss of performance is unknown to me.
//Struct definition
struct Triangle {
float[2] position;
float intensity;
};
//Init
glGenBuffers(1, &m_buffer);
glBindBuffer(GL_ARRAY_BUFFER, 0, m_buffer);
glBufferData(GL_ARRAY_BUFFER,
triangleVector.data() * sizeof(Triangle),
triangleVector.size(),
GL_DYNAMIC_DRAW);
glBindBufferBase(GL_ARRAY_BUFFER, 0, 0);
glGenVertexArrays(1, &m_vao);
glBindVertexArray(m_vao);
glBindBuffer(GL_ARRAY_BUFFER, m_buffer);
glVertexAttribPointer(
POSITION,
2,
GL_FLOAT,
GL_FALSE,
sizeof(Triangle),
0);
glVertexAttribPointer(
INTENSITY,
1,
GL_FLOAT,
GL_FALSE,
sizeof(Triangle),
sizeof(float)*2);
glBindBuffer(GL_ARRAY_BUFFER, 0);
glBindVertexArray(0);
glGenTextures(1, &m_texture);
glBindTexture(GL_TEXTURE_2D, m_texture);
glTexImage2D(
GL_TEXTURE_2D,
0,
GL_R32F,
width,
height,
0,
GL_RED,
GL_FLOAT,
NULL);
glBindTexture(GL_FRAMEBUFFER, 0);
glGenFrameBuffers(1, &m_frameBuffer);
glBindFrameBuffer(GL_FRAMEBUFFER, m_frameBuffer);
glFramebufferTexture(
GL_FRAMEBUFFER,
GL_COLOR_ATTACHMENT0,
m_texture,
0);
glBindFrameBuffer(GL_FRAMEBUFFER, 0);
After you just need to write your render function. A simple glDraw*() should be enough. Just remember to bind your buffers correctly. To enable the blending with the proper function. You might also want to disable the anti aliasing for your case. At first I'd say you need an ortho projection but I don't have all the element of your problem so it's up to you.
Long story short, if you never worked with openGL, the piece of code above will be relevant only after you read few documentation/tutorials on openGL/GLSL.

Uploading texture very slow in OpenGL

I have written an emulator which I am in the process of porting to Linux. At the moment to do the video I am using Direct3D 11, which I am porting to OpenGL (which I'm running on Windows for now). I render to a 1024x1024 texture which I upload to memory every frame (the original hardware doesn't really lend itself to modern hardware acceleration, so I just do it all in software). However, I have found that uploading the texture in OpenGL is a lot slower.
In Direct3D uploading the texture every frame drops the frame rate from 416 to 395 (a 5% drop). In OpenGL it drops from 427 to 297 (a 30% drop!).
Here's the relevant code from my draw function.
Direct3D:
D3D11_MAPPED_SUBRESOURCE resource;
deviceContext_->Map(texture, 0, D3D11_MAP_WRITE_DISCARD, 0, &resource);
uint32_t *buf = reinterpret_cast<uint32_t *>(resource.pData);
memcpy(buf, ...);
deviceContext_->Unmap(texture, 0);
OpenGL:
glBindTexture(GL_TEXTURE_2D, texture);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA,
GL_UNSIGNED_BYTE, textureBuffer);
Can anyone suggest what may be causing this slowdown?
If it makes any odds, I'm running Windows 7 x64 with an NVIDIA GeForce GTX 550 Ti.

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, textureBuffer);
You're doing several things wrong here. First, glTexImage2D is the equivalent of creating a Direct3D texture resource every frame. But you're not creating it; you're just uploading to it. You should use glTexImage2D only once per mipmap layer of interest; after that, all uploading should happen with glTexSubImage2D.
Second, your internal format (third parameter from the left) is GL_RGBA. You should always use explicit sizes for your image formats. So use GL_RGBA8. This isn't really a problem, but you should get into the habit now.
Third, you're using GL_RGBA ordering for your pixel transfer format (the third parameter from the right, not the left). This is generally not the most optimal pixel transfer format, as lots of hardware tends to prefer GL_BGRA ordering. But if you're not getting your data from whatever is producing it in that order, then there's not much that can be done.
Fourth, if you have something else you can do between starting the upload and actually rendering with it, you can employ asynchronous pixel transfer operations. You write your data to a buffer object (which can be mapped, so that you don't have to copy into it). Then you use glTexSubImage2D to transfer this data to OpenGL. Because the source data and the destination image are part of OpenGL's memory, it doesn't have to copy the data out of client memory before glTexSubImage2D returns.
Granted, that's probably not going to help you much, since you're already effectively doing that copy in the D3D case.
In OpenGL it drops from 427 to 297 (a 30% drop!)
The more important statistic is that it's a 1 millisecond difference. You should look at your timings in absolute time, not in frames-per-second, nor in percentage drops of FPS.

glTexImage2d does memory reallocation as well as update. Try to use glTexSubImage2d instead.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js