GL/CL interoperability: Shared Texture [closed] - opengl

I intend to do graphics calculations with OpenCL, such as ray casting and ray marching, and I want to use OpenGL to display the results of these calculations (pixel images). I use a texture attached to a framebuffer object: OpenCL writes the result into the texture, and then I use glBlitFramebuffer to copy the texture data into the application window's framebuffer.
I ran into a CL/GL interop problem while implementing this. I wrote a simple example to show it. The example shows the framebuffer object and texture object initialization, how they are attached to each other, and the creation of the OpenCL buffer from the GL texture. At the end the main render loop is shown: in each frame the texture is filled with new data, attached to the framebuffer, and the framebuffer is blitted.
Texture Initialization:
for (int i = 0; i < data.Length; i += 4) {
    data [i] = 255;
}
GL.BindTexture (TextureTarget.Texture2D, tboID [0]);
GL.TexImage2D<byte> (TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba8, w, h, 0,
    PixelFormat.Rgba, PixelType.UnsignedByte, data);
GL.BindTexture (TextureTarget.Texture2D, 0);
TBO+FBO Initialization:
GL.BindFramebuffer (FramebufferTarget.FramebufferExt, fboID [0]);
GL.FramebufferTexture2D (FramebufferTarget.FramebufferExt, FramebufferAttachment.ColorAttachment0,
TextureTarget.Texture2D, tboID [0], 0);
GL.BindFramebuffer (FramebufferTarget.FramebufferExt, 0);
CL/GL Initialization:
bufferID = CL.CreateFromGLTexture2D (context, memInfo, textureTarget, mipLevel, glBufferObject, out errorCode);
Render Loop:
for (int i = 0; i < data.Length; i += 4) {
    data [i] = tt;
}
tt++;
GL.BindTexture (TextureTarget.Texture2D, tboID [0]);
GL.TexImage2D<byte> (TextureTarget.Texture2D, 0, PixelInternalFormat.Rgba8, w, h, 0,
PixelFormat.Rgba, PixelType.UnsignedByte, data);
GL.BindTexture (TextureTarget.Texture2D, 0);
GL.BindFramebuffer (FramebufferTarget.FramebufferExt, fboID [0]);
GL.FramebufferTexture2D (FramebufferTarget.FramebufferExt, FramebufferAttachment.ColorAttachment0,
TextureTarget.Texture2D, tboID [0], 0);
GL.BindFramebuffer (FramebufferTarget.FramebufferExt, 0);
GL.BindFramebuffer (FramebufferTarget.ReadFramebuffer, fboID [0]);
GL.ReadBuffer (ReadBufferMode.ColorAttachment0);
GL.DrawBuffer (DrawBufferMode.Back);
GL.BlitFramebuffer (0, 0, w, h, 0, 0, w, h, ClearBufferMask.ColorBufferBit, BlitFramebufferFilter.Nearest);
GL.BindFramebuffer (FramebufferTarget.ReadFramebuffer, 0);
At first glance this code looks weird, but it demonstrates my problem completely. CL does no work at all here; in this application the OpenCL context is created and the OpenCL buffer is initialized, nothing more.
The expected behavior is simple: the screen color should gradually change from black to red. But it does not work that way; the color never changes from the initial red set during texture initialization.
It works normally, however, when I comment out the CL/GL initialization (the creation of the CL buffer from the GL texture).
Why is that? Why does the behavior of the GL texture change depending on the CL attachment? How can I fix it and make it work?

EDIT 2:
Then you need to check why you are getting an InvalidImageFormatDescriptor. Check whether the parameter order is okay and whether the internal format of the texture maps to a supported OpenCL image format (see the OpenCL specification). From the spec:
CL_INVALID_IMAGE_FORMAT_DESCRIPTOR if the OpenGL texture internal format does not map to a supported OpenCL image format.
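For reference, the same call with the plain C API looks roughly like this (a sketch only, not the poster's OpenTK/Cloo code; GL_RGBA8 is one of the internal formats that does map to a CL image format, CL_RGBA / CL_UNORM_INT8, and glTextureId is an assumed variable):
cl_int err;
cl_mem clImage = clCreateFromGLTexture2D(context,            /* CL context created with GL sharing */
                                         CL_MEM_READ_WRITE,  /* flags                              */
                                         GL_TEXTURE_2D,      /* texture target                     */
                                         0,                  /* mip level                          */
                                         glTextureId,        /* GL texture name                    */
                                         &err);
/* err == CL_INVALID_IMAGE_FORMAT_DESCRIPTOR would indicate an unsupported internal format. */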
EDIT:
So I understand OpenCL functionality in OpenTK is provided by a separate project named Cloo. For ComputeImage2D their documentation states:
CreateFromGLTexture2D (ComputeContext context, ComputeMemoryFlags flags, int textureTarget, int mipLevel, int textureId)
Compared to yours:
CreateFromGLTexture2D (context, MemFlags.MemReadWrite, TextureTarget.Texture2D, ((uint[])tboID.Clone()) [0], 0);
Looking at that, you have the mip level and the texture id in the wrong order; swapping the last two arguments should make the call match the documented signature. An incorrect initialization like that can lead to all kinds of unexpected behaviour.
It's hard to tell what the issue might be from the code you're providing. I'm looking at the interop as well right now, just getting into it. The first thing I would try is to put a try/catch block around the call and see whether it yields any error code.
Have you verified the obvious: Is the cl_khr_gl_sharing extension available on your device?
Another guess, since your sample code only shows the texture/image initialization part of the actual OpenCL/OpenGL interop: did you acquire the memory object?
cl_int clEnqueueAcquireGLObjects (cl_command_queue command_queue,
                                  cl_uint num_objects,
                                  const cl_mem *mem_objects,
                                  cl_uint num_events_in_wait_list,
                                  const cl_event *event_wait_list,
                                  cl_event *event)
The OpenCL 1.1 specification states:
clEnqueueAcquireGLObjects is used to acquire OpenCL memory objects that have been created from OpenGL objects. These objects need to be acquired before they can be used by any OpenCL commands queued to a command-queue. The OpenGL objects are acquired by the OpenCL context associated with command_queue and can therefore be used by all command-queues associated with the OpenCL context.
So the issue might be that the memory object hasn't been bound to a specific command queue.
Also, one is responsible for issuing a glFinish() to ensure all the initialization on the OpenGL side is done before the memory object is acquired.
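To make the required ordering concrete, a typical frame that lets a CL kernel write into the shared texture looks roughly like this (a minimal sketch using the C API, not the poster's OpenTK code; queue, kernel, clImage, w and h are assumed to exist already):
glFinish();                                        /* make sure GL is done with the texture      */
clEnqueueAcquireGLObjects(queue, 1, &clImage, 0, NULL, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &clImage);
size_t gws[2] = { (size_t)w, (size_t)h };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, NULL, 0, NULL, NULL);
clEnqueueReleaseGLObjects(queue, 1, &clImage, 0, NULL, NULL);
clFinish(queue);                                   /* make sure CL is done before GL blits       */
Without the acquire/release pair and the surrounding synchronization, the contents of the texture are undefined as soon as both APIs touch it, which would fit the symptoms described in the question.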

Finally we were able to run Itun's code on my system (Windows 7 / AMD Radeon HD 5870). Recall that Itun's code gradually changes the texture color from black to red by means of GL after activating GL/CL interop on that texture.
The results are at least strange. On my system it works as intended. However, on Itun's system (Windows 7 / NVIDIA GeForce) the same code does not work at all and does not produce any exceptions or error codes. In addition, I would like to mention that CL works with this texture properly on both systems; therefore, something is wrong with GL in this case.
We have no idea what is going on - it could be either Itun's outdated GPU hardware or buggy NVIDIA drivers.

Related

Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering

I am uploading image data into a GL texture asynchronously.
In the debug output I am getting these warnings during rendering:
Source: OpenGL, type: Other, id: 131185, severity: Notification
Message: Buffer detailed info: Buffer object 1 (bound to GL_PIXEL_UNPACK_BUFFER_ARB, usage hint is GL_DYNAMIC_DRAW) has been mapped WRITE_ONLY in SYSTEM HEAP memory (fast).
Source: OpenGL, type: Performance, id: 131154, severity: Medium
Message: Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.
I can't see any wrong usage of PBOs in my case, or any errors. So the question is whether these warnings are safe to ignore, or whether I am actually doing something wrong.
My code for that part:
// start copying pixels into the PBO from RAM:
mPBOs[mCurrentPBO].Bind(GL_PIXEL_UNPACK_BUFFER);
const uint32_t buffSize = pipe->GetBufferSize();
GLubyte* ptr = (GLubyte*)mPBOs[mCurrentPBO].MapRange(0, buffSize, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
if (ptr)
{
memcpy(ptr, pipe->GetBuffer(), buffSize);
mPBOs[mCurrentPBO].Unmap();
}
// copy pixels from the other, already filled PBO into the texture (except on the first frame):
mPBOs[1 - mCurrentPBO].Bind(GL_PIXEL_UNPACK_BUFFER);
//mCopyTex is bound to mCopyFBO as attachment
glTextureSubImage2D(mCopyTex->GetHandle(), 0, 0, 0, mClientSize.x, mClientSize.y,
GL_RGBA, GL_UNSIGNED_BYTE, 0);
mCurrentPBO = 1 - mCurrentPBO;
Then I just blit the result to default frame buffer. No rendering of geometry or anything like that.
glBlitNamedFramebuffer(
    mCopyFBO,
    0, // default FBO id
    0, 0, mViewportSize.x, mViewportSize.y,
    0, 0, mViewportSize.x, mViewportSize.y,
    GL_COLOR_BUFFER_BIT,
    GL_LINEAR);
Running on NVIDIA GTX 960 card.
This performance warning is NVIDIA-specific and is intended as a hint that you're not going to use a separate hardware transfer queue, which is no wonder since you use a single-thread, single-GL-context model where both rendering (at least your blit) and transfer are carried out.
See this NVIDIA presentation for some details about how NVIDIA handles this; page 22 also explains this specific warning. Note that this warning does not mean that your transfer is not asynchronous. It is still fully asynchronous with respect to the CPU thread. It will just be processed synchronously on the GPU with respect to the render commands in the same command queue, and you're not using the asynchronous copy engine, which could do these copies in a separate command queue, independent of the rendering commands.
I can't see any wrong usage of PBOs in my case, or any errors. So the question is whether these warnings are safe to ignore, or whether I am actually doing something wrong.
There is nothing wrong with your PBO usage.
It is not clear if your specific application could even benefit from using a more elaborate separate transfer queue scheme.
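If it could, the usual scheme is to do the uploads in a second thread that owns a GL context shared with the render context, so the driver can schedule the transfers on the copy engine. A rough sketch, assuming GLFW for context creation and leaving the frame hand-off (the Frame type, nextFrame, publishFence) as hypothetical helpers:
// hidden context that shares objects with the main window's context (error handling omitted)
glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);
GLFWwindow* uploadCtx = glfwCreateWindow(1, 1, "upload", nullptr, mainWindow);
std::thread uploader([&] {
    glfwMakeContextCurrent(uploadCtx);
    for (;;) {
        Frame f = nextFrame();                       // hypothetical: blocks until new pixels arrive
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        void* p = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, f.size,
                                   GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
        memcpy(p, f.pixels, f.size);
        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
        glTextureSubImage2D(copyTex, 0, 0, 0, f.w, f.h, GL_RGBA, GL_UNSIGNED_BYTE, 0);
        GLsync done = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glFlush();                                   // submit the fence to the GPU
        publishFence(done);                          // hypothetical: render thread glWaitSync()s on it before blitting
    }
});
Whether this actually wins anything over the current single-context version depends on how much rendering overlaps the transfers, so it is worth profiling before committing to the extra complexity.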

nVidia GPU returns -1000 for clEnqueueWriteImage

Trying to figure out what the issue (and error code) is for this call. To preface: this works just fine on AMD; it only has issues on NVIDIA.
unsigned char *buffer;
...
cl_int status;
cl::size_t<3> origin;
cl::size_t<3> region;
origin[0]=0;
origin[1]=0;
origin[2]=0;
region[0]=m_width;
region[1]=m_height;
region[2]=1;
status=clEnqueueWriteImage(m_commandQueue, m_image, CL_FALSE, origin, region, 0, 0, buffer, 0, NULL, NULL);
status returns -1000, which is not one of the standard OpenCL error codes in cl.h. All other functions related to opening the device, context, and command queue succeed. The context is interop'ed with OpenGL, and again this is all completely functional on AMD.
For future reference, it seems the error happens if the image is interop'ed with an OpenGL texture and the call is made before the image has been acquired with clEnqueueAcquireGLObjects. I did acquire the images later, when they were used, but not right before this write. AMD's driver does not appear to care about this little detail.
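(Incidentally, -1000 appears to be CL_INVALID_GL_SHAREGROUP_REFERENCE_KHR from cl_gl.h, i.e. a GL-interop error code rather than a core OpenCL one, which fits this explanation.) A minimal sketch of the fix, reusing the names from the question and adding the acquire/release pair that was missing:
size_t origin[3] = { 0, 0, 0 };
size_t region[3] = { m_width, m_height, 1 };
glFinish();                                  /* GL must be finished with the texture first */
cl_int status = clEnqueueAcquireGLObjects(m_commandQueue, 1, &m_image, 0, NULL, NULL);
status = clEnqueueWriteImage(m_commandQueue, m_image, CL_FALSE,
                             origin, region, 0, 0, buffer, 0, NULL, NULL);
status = clEnqueueReleaseGLObjects(m_commandQueue, 1, &m_image, 0, NULL, NULL);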

Compressed framebuffer offloading

I'm trying to write code that offloads the framebuffer from one card to the other, and I'm wondering whether it's possible to use compression efficiently, since memory bandwidth seems to be the bottleneck in my case.
At the moment, I use simple readback & display routines:
readback:
glWaitSync(..);
glReadPixels(.., GL_BGRA, GL_UNSIGNED_BYTE, NULL);
GLvoid *data = glMapBuffer(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY);
display:
glGenBuffers(2, pbos);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, pbos[curr_ctx ^ 1]);
glBufferData(GL_PIXEL_UNPACK_BUFFER_EXT, width*height*4, NULL,GL_STREAM_DRAW);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, pbos[curr_ctx]);
glBufferData(GL_PIXEL_UNPACK_BUFFER_EXT, width*height*4, NULL,GL_STREAM_DRAW);
...
glBufferSubData(GL_PIXEL_UNPACK_BUFFER_EXT, 0, width*height*4, data);
glDrawPixels(width, height, GL_BGRA, GL_UNSIGNED_BYTE, NULL);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_EXT, pbos[curr_ctx ^= 1]);
...
glXSwapBuffers(...);
There is also some synchronization via mutexes and other miscellaneous code in there, but this is the main body of the current code.
Unfortunately, it seems that memory bandwidth is the biggest problem here (on the display-card side, which is a sort-of capable USB capture card).
Is there any way to optimize this via OpenGL compression (S3TC)?
Preferably, I would like to compress on the render card, copy the result into RAM, and then send it downstream to the capture (display) card.
I believe I've seen people do this by copying the framebuffer into a texture and asking GL to compress it, but quite frankly I'm new to GL programming, so I thought I would ask here.
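For what it's worth, the "copy the framebuffer into a texture and let GL compress it" approach mentioned above would look roughly like this (just a sketch; width, height and compressedData are placeholders, and driver-side S3TC compression tends to be slow and of mediocre quality, so it needs measuring):
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
// copy the current read framebuffer into a texture with a compressed internal format;
// the driver performs the DXT1 compression
glCopyTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_S3TC_DXT1_EXT, 0, 0, width, height, 0);
GLint compressedSize = 0;
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_COMPRESSED_IMAGE_SIZE, &compressedSize);
// read the compressed blocks back, either into client memory or into a bound GL_PIXEL_PACK_BUFFER
glGetCompressedTexImage(GL_TEXTURE_2D, 0, compressedData);
On the display side, the blocks can then be re-uploaded with glCompressedTexImage2D and drawn as a textured quad instead of going through glDrawPixels.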

CUDA OPENGL Interoperability: slow mapping

My application takes the rendered results from OpenGL (both the depth map and the rendered 2D image) into CUDA for processing.
One way I did this is to retrieve the image/depth map with glReadPixels(..., image_array_HOST/depth_array_Host)* and then pass image_array_HOST/depth_array_Host to CUDA with cudaMemcpy(..., cudaMemcpyHostToDevice). I have done this part, although it sounds redundant (GPU > CPU > GPU).
*image_array_HOST/depth_array_Host are arrays I define on the host.
The other way is to use OpenGL<>CUDA interop.
The first step is to create a pixel buffer in OpenGL and read the image/depth information into that buffer.
A CUDA token (a registered graphics resource) is also created and linked to that buffer, and the CUDA-side pointer is then obtained from that token.
(As far as I know there seems to be no direct way to link a pixel buffer to a CUDA array; there has to be a token that is registered against the OpenGL buffer. Please correct me if I am wrong.)
I have done this part too. I thought it should be fairly efficient because the data CUDA processes is never transferred anywhere; it stays where OpenGL put it, so it is data processing entirely inside the device (GPU).
However, the time I measured for the second method is even (slightly) longer than for the first one (GPU > CPU > GPU).
That really confuses me.
I am not sure whether I missed something, or whether I simply didn't do it efficiently.
One thing I am also not sure about is glReadPixels(..., *data).
In my understanding, if *data points to host memory, it will transfer the data GPU > CPU.
If *data is 0 and a buffer is bound to GL_PIXEL_PACK_BUFFER, the data is transferred into that buffer, which should be a GPU > GPU copy.
Maybe some other method can pass the data more efficiently than glReadPixels(.., 0).
I hope someone can explain my question.
Following is my code:
--
// OpenGL has finished its rendering and the data is all stored on the OpenGL side. It is ready to go.
...
// declare one pointer and memory location on cuda for later use.
float *depth_map_Device;
cudaMalloc((void**) &depth_map_Device, sizeof(float) * size);
// initiate the CUDA<>OpenGL interop
cudaGLSetGLDevice(0);
// generate a buffer, and link the cuda token to it -- buffer <>cuda token
GLuint gl_pbo;
cudaGraphicsResource_t cudaToken;
size_t data_size = sizeof(float)*number_data; // number_data is defined beforehand
void *data = malloc(data_size);
glGenBuffers(1, &gl_pbo);
glBindBuffer(GL_ARRAY_BUFFER, gl_pbo);
glBufferData(GL_ARRAY_BUFFER, data_size, data, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
cudaGraphicsGLRegisterBuffer(&cudaToken, gl_pbo, cudaGraphicsMapFlagsNone); // now there is a link between gl_buffer and cudaResource
free(data);
// now it start to map(link) the data on buffer to cuda
glBindBuffer(GL_PIXEL_PACK_BUFFER, gl_pbo);
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, 0);
// map the rendered data to buffer, since it is glReadPixels(..,0), it should be still fast? (GPU>GPU)
// width & height are defined beforehand. It can be GL_DEPTH_COMPONENT or others as well, just an example here.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gl_pbo);
cudaGraphicsMapResources(1, &cudaToken, 0); // map cudaToken (which is linked to gl_pbo) into the current CUDA context
cudaGraphicsResourceGetMappedPointer((void **)&depth_map_Device, &data_size, cudaToken); // get a device pointer to the mapped buffer (no copy)
cudaGraphicsUnmapResources(1, &cudaToken, 0); // unmap it, for the next round
// CUDA kernel
my_kernel <<<block_number, thread_number>>> (...,depth_map_Device,...);
I think I can partly answer my own question now, and I hope it is useful for some people.
I was binding the PBO to float CUDA (GPU) memory, but it seems the raw rendered OpenGL image data is in unsigned char format (the following is my supposition), so this data needs to be converted to float and then passed to CUDA memory. I think what OpenGL does is use the CPU for this format conversion, and that is why there is no big difference between using the PBO and not using it.
By reading as unsigned char (glReadPixels(.., GL_UNSIGNED_BYTE, 0)), reading RGB data with the PBO bound is quicker than without it. I then pass the data to a simple CUDA kernel that does the format conversion, which is more efficient than what OpenGL did. This way the overall speed is much better.
However, it doesn't work for the depth buffer.
For some reason, reading the depth map with glReadPixels is slow no matter whether a PBO is used or not.
And then, I found two old discussions:
http://www.opengl.org/discussion_boards/showthread.php/153121-Reading-the-Depth-Buffer-Why-so-slow
http://www.opengl.org/discussion_boards/showthread.php/173205-Saving-Restoring-Depth-Buffer-to-from-PBO
They point out the format question, and that is exactly what I found for RGB (unsigned char). But I have tried unsigned char, unsigned short, unsigned int, and float for reading the depth buffer, and the performance is almost the same for all of them.
So I still have a speed problem with reading the depth buffer.

Error when calling glGetTexImage (atioglxx.dll)

I'm experiencing a difficult problem on certain ATI cards (Radeon X1650, X1550 + and others).
The message is: "Access violation at address 6959DD46 in module 'atioglxx.dll'. Read of address 00000000"
It happens on this line:
glGetTexImage(GL_TEXTURE_2D,0,GL_RGBA,GL_FLOAT,P);
Note:
Latest graphics drivers are installed.
It works perfectly on other cards.
Here is what I've tried so far (with assertions in the code):
That the pointer P is valid and allocated enough memory to hold the image
Texturing is enabled: glIsEnabled(GL_TEXTURE_2D)
Test that the currently bound texture is the one I expect: glGetIntegerv(GL_TEXTURE_BINDING_2D)
Test that the currently bound texture has the dimensions I expect: glGetTexLevelParameteriv( GL_TEXTURE_WIDTH / HEIGHT )
Test that no errors have been reported: glGetError
It passes all those tests and then still fails with the message.
I feel I've tried everything and have no more ideas. I really hope some GL guru here can help!
EDIT:
After concluding it is probably a driver bug, I posted about it here too: http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=295137#Post295137
I also tried GL_PACK_ALIGNMENT and it didn't help.
With some more investigation I found that it only happens on textures that I previously filled with pixels using glCopyTexSubImage2D. So I could produce a workaround by replacing the glCopyTexSubImage2D call with calls to glReadPixels and then glTexImage2D instead.
Here is my updated code:
{
glCopyTexSubImage2D cannot be used here because the combination of calling
glCopyTexSubImage2D and then later glGetTexImage on the same texture causes
a crash in atioglxx.dll on ATI Radeon X1650 and X1550.
Instead we copy to the main memory first and then update.
}
// glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, PixelWidth, PixelHeight); //**
GetMem(P, PixelWidth * PixelHeight * 4);
glReadPixels(0, 0, PixelWidth, PixelHeight, GL_RGBA, GL_UNSIGNED_BYTE, P);
SetMemory(P,GL_RGBA,GL_UNSIGNED_BYTE);
You might take care of GL_PACK_ALIGNMENT. This parameter tells GL how each pixel row is padded when it is packed into client memory. For example, if a row of your image is 645 bytes long (say 645 one-byte pixels):
With GL_PACK_ALIGNMENT at 4 (the default value), each row is padded to 648 bytes.
With GL_PACK_ALIGNMENT at 1, each row stays at 645 bytes.
So ensure that the pack value matches your buffer by calling:
glPixelStorei(GL_PACK_ALIGNMENT, 1);
before your glGetTexImage(), or align your texture memory to the GL_PACK_ALIGNMENT.
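As a quick sanity check, the row size GL will actually write for a given alignment can be computed like this (a small illustration; width and bytesPerPixel are placeholders):
int align = 4;                                              // current GL_PACK_ALIGNMENT
int rowBytes = width * bytesPerPixel;                       // e.g. 645 * 1 = 645
int paddedRowBytes = (rowBytes + align - 1) & ~(align - 1); // 648 when align == 4, 645 when align == 1
If the buffer passed to glGetTexImage was sized with the unpadded rowBytes, GL may write past its end.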
This is most likely a driver bug. Having written 3D APIs myself, it is easy to see how: you are doing something that is weird and rare enough that tests won't cover it, namely converting float data to 8 bit during upload. Nobody is going to optimize that path; the generic CPU conversion function probably kicks in, and somebody messed up a table that drives the allocation of the temporary buffers it needs.
You should also reconsider what you are doing in the first place. Using an external float format with an internal 8-bit format is the kind of conversion in the GL API that usually points to a programming error. If your data is float and you want to keep it as such, you should use a float texture and not RGBA8. If you want 8 bit, why is your input float?
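If the data really is float end to end, that suggestion amounts to using a float internal format so neither the upload nor glGetTexImage has to convert; a minimal sketch (GL_RGBA32F requires ARB_texture_float on hardware of that generation, and w, h, floatPixels are placeholders):
// upload float data into a float texture - no float-to-8-bit conversion on the way in ...
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, w, h, 0, GL_RGBA, GL_FLOAT, floatPixels);
// ... and read it back as float as well, so glGetTexImage does not convert either
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_FLOAT, floatPixels);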