My application takes the rendered results from OpenGL (both the depth map and the rendered 2D image)
into CUDA for processing.
One way I did this is to retrieve the image/depth map with glReadPixels(..., image_array_HOST/depth_array_Host)*, and then pass image_HOST/depth_HOST to CUDA
with cudaMemcpy(..., cudaMemcpyHostToDevice). I have done this part, although it sounds redundant (GPU > CPU > GPU).
*image_array_HOST/depth_array_Host are arrays I define on the host.
The other way is to use OpenGL<>CUDA interop.
The first step is to create a pixel buffer in OpenGL and read the image/depth information into it.
A CUDA graphics resource ("cuda token") is then registered and linked to that buffer, and the CUDA-side array is linked to that token.
(As far as I know, there is no direct way to link a pixel buffer to a CUDA array; there has to be a cudaToken for OpenGL to recognize. Please correct me if I am wrong.)
I have also done this part. I thought it should be fairly efficient because the data CUDA is processing is
never transferred anywhere; it stays where OpenGL put it. The processing happens entirely inside the device (GPU).
However, the time I measured for the second method is even (slightly) longer than for the first one (GPU > CPU > GPU).
That really confuses me.
I am not sure if I missed a step, or maybe I didn't do it in an efficient way.
One thing I am also not sure about is glReadPixels(..., *data).
In my understanding, if *data is a pointer to memory on the HOST, it transfers the data from GPU > CPU.
If *data = 0 and a buffer is bound, the data is transferred into that buffer, which should be a GPU > GPU operation.
Maybe some other method can pass the data more efficiently than glReadPixels(.., 0).
I hope someone can explain this.
Following is my code:
--
// OpenGL has finished its rendering, and the data are all stored on the OpenGL side. It is ready to go.
...
// declare a pointer and allocate memory on CUDA for later use.
float *depth_map_Device;
cudaMalloc((void**) &depth_map_Device, sizeof(float) * size);
// initiate CUDA<>OpenGL interop
cudaGLSetGLDevice(0);
// generate a buffer, and link the cuda token to it -- buffer <>cuda token
GLuint gl_pbo;
cudaGraphicsResource_t cudaToken;
size_t data_size = sizeof(float)*number_data; // number_data is defined beforehand
void *data = malloc(data_size);
glGenBuffers(1, &gl_pbo);
glBindBuffer(GL_ARRAY_BUFFER, gl_pbo);
glBufferData(GL_ARRAY_BUFFER, data_size, data, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
cudaGraphicsGLRegisterBuffer(&cudaToken, gl_pbo, cudaGraphicsMapFlagsNone); // now there is a link between gl_buffer and cudaResource
free(data);
// now start to map (link) the data in the buffer to CUDA
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo);
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, 0);
// read the rendered data into the buffer; since it is glReadPixels(.., 0), it should still be fast (GPU > GPU)
// width & height are defined beforehand. The format can be GL_DEPTH_COMPONENT or others as well; this is just an example.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gl_pbo);
cudaGraphicsMapResources(1, &cudaToken, 0); // map the cudaResource that is linked to gl_pbo into the current CUDA context
cudaGraphicsResourceGetMappedPointer((void **)&depth_map_Device, &data_size, cudaToken); // get a device pointer to the mapped buffer (no copy)
// CUDA kernel -- the kernel must run while the resource is still mapped
my_kernel <<<block_number, thread_number>>> (...,depth_map_Device,...);
cudaGraphicsUnmapResources(1, &cudaToken, 0); // unmap it, for the next round
I think I can answer my question partly now, and I hope it is useful for some people.
I was binding the PBO to float CUDA (GPU) memory, but the raw rendered image data in OpenGL is in unsigned char format, so (this is my supposition) the data has to be converted to float before being passed to CUDA memory. I think OpenGL does this format conversion on the CPU, and that is why there is no big difference with and without the PBO.
By reading as unsigned char (glReadPixels(.., GL_UNSIGNED_BYTE, 0)), reading RGB data with the PBO is quicker than without it. I then pass the data to a simple CUDA kernel that does the format conversion, which is more efficient than what OpenGL did. With this the speed is much quicker.
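A minimal sketch of such a conversion kernel (the kernel name, pointer names, and pixel_count are just for illustration), assuming the PBO holds tightly packed 8-bit RGB pixels:
// Hypothetical conversion kernel: each thread normalizes one RGB pixel read
// from the mapped PBO (unsigned char, 0..255) into a float buffer (0.0..1.0).
__global__ void rgb_to_float(const unsigned char *src, float *dst, int pixel_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < pixel_count) {
        dst[3 * i + 0] = src[3 * i + 0] / 255.0f;
        dst[3 * i + 1] = src[3 * i + 1] / 255.0f;
        dst[3 * i + 2] = src[3 * i + 2] / 255.0f;
    }
}
It would be launched, for example, as rgb_to_float<<<(pixel_count + 255) / 256, 256>>>(mapped_pbo_ptr, float_image_Device, pixel_count), while the graphics resource is still mapped.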
However, it doesn't work for the depth buffer.
For some reason, reading the depth map with glReadPixels (with or without a PBO) is slow.
And then, I found two old discussions:
http://www.opengl.org/discussion_boards/showthread.php/153121-Reading-the-Depth-Buffer-Why-so-slow
http://www.opengl.org/discussion_boards/showthread.php/173205-Saving-Restoring-Depth-Buffer-to-from-PBO
They point out the format question, which is exactly what I found for RGB (unsigned char). But I have tried unsigned char, unsigned short, unsigned int, and float for reading the depth buffer, and they all perform at almost the same speed.
So I still have a speed problem for reading the depth buffer.
Related
Usually, when we load a texture into OpenGL, we more or less do it like this:
open file
read all the bytes into system RAM
call glTexImage2D() which sends the bytes into VRAM
free the bytes in system RAM
Is there a way to stream/load the bytes into VRAM, skipping the step of loading them into system RAM? Something like this:
char chunk[1000];
f = openfile("some file")
glBindTexture(....)
glTexParameteri(...)
glTexParameteri(...)
glTexParameteri(...)
while (f.eof == false) {
chunk = f.read_chunk(1000);
glTexImage2D..... (here's the chunk)
}
closefile(f);
I'm wondering if there is some new OpenGL extension that can do this.
Most image file formats are either compressed or heavily encoded, so at least some parts of them must first pass through regular system memory. However, it is not necessary to allocate a buffer for the whole image in system RAM if you use an OpenGL buffer object. The buffer object will, however, likely end up allocated in VRAM itself, so you have the data there twice, at least for some time.
The CliffsNotes version of how to use it is:
size_t buffer_size = ...;
GLuint buffer_id;
glGenBuffers(1, &buffer_id);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer_id);
glBufferData(GL_PIXEL_UNPACK_BUFFER, buffer_size, NULL, GL_STATIC_DRAW);
void *buffer_ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
size_t offset_into_buffer = ...;
read_image_pixels_into_buffer((char*)buffer_ptr + offset_into_buffer);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glTexImage2D(..., (void*)((uintptr_t)offset_into_buffer));
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
glDeleteBuffers(1, &buffer_id);
Take note that performance and efficiency might suffer if buffer objects are used naively, as in the code above. Usually buffer objects are used as memory pools from which chunks are cut on an as-required basis, rather than buffers being constantly created and deleted. There is, however, also streaming via buffer orphaning, which does in fact re-create buffer storage: https://www.khronos.org/opengl/wiki/Buffer_Object_Streaming
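A minimal sketch of the orphaning pattern mentioned above, assuming a persistent buffer_id, a texture already bound to GL_TEXTURE_2D, and a hypothetical fill_pixels helper that writes one frame's worth of data:
// Re-specify (orphan) the buffer storage each frame so the driver can hand back
// fresh memory instead of stalling on data the GPU is still reading.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer_id);
glBufferData(GL_PIXEL_UNPACK_BUFFER, buffer_size, NULL, GL_STREAM_DRAW); // orphan the old storage
void *ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
if (ptr) {
    fill_pixels(ptr, buffer_size);  // write this frame's pixels into the mapped region
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
}
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, (void*)0); // sourced from the PBO
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);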
Update regarding #BDL's comment
With OpenGL you really don't have any control over how things will happen, but you can nudge the code path taken in a certain direction. One thing to always keep in mind is that OpenGL doesn't have a memory model at all. Things are just supposed to work… somehow.
A side effect of that is that, in order to fulfill the requirements, a GPU-based OpenGL implementation may at any moment be forced to evict some of the data in GPU memory to somewhere else: most likely system RAM, but maybe even disk swap space. Since modern operating systems tend to overcommit memory, it is very likely that an OpenGL implementation will simply keep as much actually page-faulted memory around as is required for all allocated OpenGL objects. And for non-streaming data it makes a lot of sense to keep a copy of the data in system memory (or swap) at all times, because this turns eviction into simply clearing a VRAM allocation entry.
Now in the case of buffer objects, the very code path I outlined above may in fact avoid one extra copy through buffer region renaming: when loading the data into the texture, no data is actually copied; only the texture descriptor is set to point at that particular region of the buffer object… until that region of the buffer is overwritten, which then triggers the creation of a copy. Deleting the buffer object (name) immediately after loading it into the texture may in fact hand the whole memory of the buffer over to the texture object. This, however, is just a potential optimization that an OpenGL implementation could implement, but doesn't have to.
The whole OpenGL specification is extremely vague, which makes developing a high-performance OpenGL implementation mind-bogglingly hard work.
It is possible to skip loading the entire texture at once and instead load it chunk by chunk, as you showed in your code. You just need to upload the chunks with glTexSubImage2D and use a file format that supports reading tiles efficiently (e.g. TIFF):
... open file ...
... create texture ...
for each tile in file {
tile = f.read_tile();
glTexSubImage2D(GL_TEXTURE_2D, 0, tilex, tiley, tilew, tileh, format, type, tile);
}
This, however, will probably be slower compared to loading the entire thing and transferring it in one GL call.
It can still be useful if your image loads progressively through a network connection, for example.
There is no way to pass a texture to OpenGL without first loading it into your RAM. OpenGL doesn't handle file access in any way.
Most of the time (except for uncompressed textures or textures with DDS compression), one has to do some processing anyway to get the compressed file content into a format OpenGL can understand.
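For the DDS case mentioned above, where the file payload is already in a GPU-ready block-compressed format, the data can be handed over as-is; a minimal sketch, assuming DXT1 data and that texture_id, width, height, file_data and file_size come straight from the file's header and payload:
// Upload pre-compressed (DXT1/S3TC) texture data without any CPU-side decoding.
glBindTexture(GL_TEXTURE_2D, texture_id);
glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA_S3TC_DXT1_EXT,
                       width, height, 0, file_size, file_data);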
I need to take screenshots every frame and I need very high performance (I'm using FreeGLUT). What I figured out is that it can be done like this inside glutIdleFunc(thisCallbackFunction):
GLubyte *data = (GLubyte *)malloc(3 * m_screenWidth * m_screenHeight);
glReadPixels(0, 0, m_screenWidth, m_screenHeight, GL_RGB, GL_UNSIGNED_BYTE, data);
// and I can access pixel values like this: data[3*(x*512 + y) + color] or whatever
It does work, but I have a huge issue with it: it's really slow. With a 512x512 window it runs no faster than 90 frames per second when only a cube is being rendered; without these two lines it runs at 6500 FPS! Compare that to the Irrlicht graphics engine, where I can do this:
// irrlicht code
video::IImage *screenShot = driver->createScreenShot();
const uint8_t *data = (uint8_t*)screenShot->lock();
// I can access pixel values from data in a similar manner here
and a 512x512 window runs at 400 FPS even with a huge mesh (a Quake 3 map) loaded! Take into account that I'm using OpenGL as the driver inside Irrlicht. To my inexperienced eye it seems like glReadPixels copies every pixel from one place to another, while (uint8_t*)screenShot->lock() just returns a pointer to an already existing array. Can I do something similar to the latter using FreeGLUT? I would expect it to be at least as fast as Irrlicht.
Note that Irrlicht uses OpenGL too (it offers DirectX and other options as well, but in the example above I used OpenGL, which, by the way, was the fastest of the options).
OpenGL calls manage a rendering pipeline: by its nature, while the graphics card is showing an image to the viewer, computations for the next frame are already being done. When you call glReadPixels, the graphics card has to wait for the current frame to be done, read the pixels, and only then start computing the next frame. The pipeline therefore stalls and becomes sequential.
If you hold two buffers and tell the graphics card to read data into them alternately each frame, you can read back from the other buffer one frame late, but without stalling the pipeline. This is called double buffering. You can also do triple buffering with a read-back that is two frames late, and so on.
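A minimal sketch of that double-buffered read-back, assuming two PBOs in pbo[2] already created with glBufferData(GL_PIXEL_PACK_BUFFER, size, NULL, GL_STREAM_READ); the process_pixels helper is hypothetical:
// Each frame: start an asynchronous read into one PBO and map the other one,
// which holds the pixels requested on the previous frame.
static int frame = 0;
int write_idx = frame % 2;        // PBO receiving this frame's pixels
int read_idx  = (frame + 1) % 2;  // PBO filled one frame ago

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[write_idx]);
glReadPixels(0, 0, m_screenWidth, m_screenHeight, GL_RGB, GL_UNSIGNED_BYTE, 0); // returns without waiting for the copy

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[read_idx]);
GLubyte *pixels = (GLubyte *)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (pixels) {
    process_pixels(pixels, m_screenWidth, m_screenHeight); // consume last frame's image
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
frame++;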
There is a relatively old web page describing the phenomenon and implementation here: http://www.songho.ca/opengl/gl_pbo.html
Also there are a lot of tutorials about framebuffers and rendering into a texture on the web. One of them is here: http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-14-render-to-texture/
I am brand new to OpenGL programming, and I know that graphics APIs are notoriously difficult to debug. My question: I have a txt file with 3D vertex data. After I create the vertex and index buffers, is there some way to see if the data was loaded correctly? The only way I can think of is to write a shader to display the points, but that would involve a lot of math, and I want to make sure the data is at least loaded correctly first. That way, if there is a problem, I will know whether it is in the math in my shader or in how I initialized the buffers.
Edit:
In case you're confused as to what I'm asking: is there some sort of function you can use to display the buffer data that is on the GPU?
You can use glGetBufferSubData() to read back the current buffer content.
Another option is glMapBufferRange(), which allows you to obtain a pointer to the buffer content.
If you don't know the current size of the buffer, which you will need for the calls above, you can get it with glGetBufferParameteriv(). For example, for the currently bound array buffer:
GLint bufSize = 0;
glGetBufferParameteriv(GL_ARRAY_BUFFER, GL_BUFFER_SIZE, &bufSize);
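To then inspect the actual content, a minimal sketch using glGetBufferSubData on the currently bound array buffer, assuming it was filled with tightly packed GLfloat vertex data:
// Read the buffer content back into client memory and print it for inspection.
GLfloat *vertices = (GLfloat *)malloc(bufSize);
glGetBufferSubData(GL_ARRAY_BUFFER, 0, bufSize, vertices);
for (size_t i = 0; i < (size_t)bufSize / sizeof(GLfloat); ++i)
    printf("vertex float %zu: %f\n", i, vertices[i]);
free(vertices);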
I'm working on a function that works similarly to OpenGL Profiler on OS X, which lets me extract information about OpenGL's back buffers and what they currently contain. Due to the nature of the function, I do not have access to the application's variables containing the depth buffer IDs and need to rely on GL functions themselves to provide this information.
I've already got another function that copies the actual FBO content into a normal GL texture, and I have successfully extracted the normal draw buffers and saved them as image files using the series of glGetIntegerv() calls in the (sample) function below.
But I couldn't find a constant/function that can be used to pull the buffer information (e.g. type, texture id) out of the depth buffer (and I have already looked through them a few times), which I'm pretty sure has to be possible, considering it's been done before in other GL profiling tools.
This is the first time I've needed to ask a question here, and I'm wondering if anyone knows whether it is possible, or if I really need to catch that value while the application is setting it rather than trying to pull the current value out of the GL context.
// ......
GLint savedReadFBO = GL_ZERO;
GLenum savedReadBuffer = GL_BACK;
glGetIntegerv(GL_READ_FRAMEBUFFER_BINDING, &savedReadFBO);
glGetIntegerv(GL_READ_BUFFER, (GLint *) &savedReadBuffer);
// Try to obtain current draw buffer
GLint currentDrawFBO;
GLenum currentDrawBuffer = GL_NONE;
glGetIntegerv(GL_DRAW_FRAMEBUFFER_BINDING, &currentDrawFBO);
glGetIntegerv(GL_DRAW_BUFFER0, (GLint *) &currentDrawBuffer);
// We'll temporarily bind the drawbuffer for reading to pull out the current data.
// Bind drawbuffer FBO to readbuffer
if (savedReadFBO != currentDrawFBO)
{
glBindFramebuffer(GL_READ_FRAMEBUFFER, currentDrawFBO);
}
// Bind the read buffer and copy image
glReadBuffer(currentDrawBuffer);
// ....commands to fetch actual buffer content here....
// Restore the old read buffer
glBindFramebuffer(GL_READ_FRAMEBUFFER, savedReadFBO);
glReadBuffer(savedReadBuffer);
// .......
Okay, I read everything about PBOs here: http://www.opengl.org/wiki/Pixel_Buffer_Object
and here: http://www.songho.ca/opengl/gl_pbo.html, but I still have a question and I don't know whether I'll get any benefit out of a PBO in my case.
I'm doing video streaming. Currently I have a function that copies my data buffers to 3 different textures, then I do some maths in a fragment shader and display the result.
I thought a PBO could speed up the CPU -> GPU upload, but consider this example, taken from the second link above:
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, pboIds[nextIndex]);
// map the buffer object into client's memory
// Note that glMapBufferARB() causes sync issue.
// If GPU is working with this buffer, glMapBufferARB() will wait(stall)
// for GPU to finish its job. To avoid waiting (stall), you can call
// first glBufferDataARB() with NULL pointer before glMapBufferARB().
// If you do that, the previous data in PBO will be discarded and
// glMapBufferARB() returns a new allocated pointer immediately
// even if GPU is still working with the previous data.
glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
GLubyte* ptr = (GLubyte*)glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
if(ptr)
{
// update data directly on the mapped buffer
updatePixels(ptr, DATA_SIZE);
glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB); // release pointer to mapping buffer
}
// measure the time modifying the mapped buffer
t1.stop();
updateTime = t1.getElapsedTimeInMilliSec();
///////////////////////////////////////////////////
// it is good idea to release PBOs with ID 0 after use.
// Once bound with 0, all pixel operations behave normal ways.
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
Well, whatever the behavior of the updatePixels function is, it is still using CPU cycles to copy the data into the mapped buffer, isn't it?
So let's say I wanted to use a PBO in this manner: update my frame pixels to the PBO in one function, and then in the display function call glTexSubImage2D (which should return immediately)... Would I see any speed-up in terms of performance?
I can't see why it would be faster... okay, we're not waiting anymore during the glTex* call, but we're waiting in the function that uploads the frame to the PBO, aren't we?
Could someone clear that up for me please?
Thanks
The point about buffer objects is that they can be used asynchronously. You can map a BO and then have some other part of the program update it (think threads, think asynchronous IO) while you keep issuing OpenGL commands. A typical usage scenario with triple-buffered PBOs may look like this:
wait_for_video_frame_load_complete(buffer[k-2])
glUnmapBuffer buffer[k-2]
glTexSubImage2D from buffer[k-2]
buffer[k] = glMapBuffer
start_load_next_video_frame(buffer[k]);
draw_texture
SwapBuffers
This allows your program to do useful work, and even upload data to OpenGL, while OpenGL is also being used for rendering.
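A rough per-frame sketch of that scheme, assuming three PBOs in pbo[3] with GL_STREAM_DRAW storage, a mapped_ptr[3] array, and that wait_for_video_frame_load_complete / start_load_next_video_frame are the application's (hypothetical) asynchronous loader hooks running on another thread:
// Frame k: upload the frame whose load finished two frames ago, and hand the
// loader a freshly mapped buffer for a frame to be uploaded later.
int upload_idx = (k + 1) % 3;   // same as (k - 2) % 3: mapped two frames ago, now complete
int fill_idx   = k % 3;         // buffer given to the loader this frame

wait_for_video_frame_load_complete(upload_idx);       // loader has filled mapped_ptr[upload_idx]
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[upload_idx]);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindTexture(GL_TEXTURE_2D, video_texture);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, frame_w, frame_h,
                GL_RGB, GL_UNSIGNED_BYTE, (void*)0);  // sourced from the bound PBO, no CPU wait here

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[fill_idx]);
mapped_ptr[fill_idx] = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
start_load_next_video_frame(mapped_ptr[fill_idx]);    // asynchronous fill on another thread
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

draw_texture();   // render using video_texture, then SwapBuffers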