I have an application that decodes a video file using FFmpeg in one thread and renders the decoded frames as a texture, using PBOs, in another. All the PBO do-hickey happens in the following function:
void DynamicTexture::update()
{
if(!_isDirty)
{
return;
}
/// \todo Check to make sure that PBOs are supported
if(_usePbo)
{
// In multi PBO mode, we keep swapping between the PBOs
// We use one PBO to actually set the texture data that we will upload
// and the other we use to update/modify. Once modification is complete,
// we simply swap buffers
// Unmap the PBO that was updated last so that it can be released for rendering
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, _pboIds[_currentPboIndex]);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
Util::GLErrorAssert();
// bind the texture
glBindTexture(GL_TEXTURE_2D, _textureId);
Util::GLErrorAssert();
// copy pixels from PBO to texture object
// Use offset instead of pointer.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, _width, _height,
(_channelCount==4)?GL_RGBA:GL_RGB,
GL_UNSIGNED_BYTE, 0);
Util::GLErrorAssert();
// Now swap the pbo index
_currentPboIndex = (_currentPboIndex + 1) % _numPbos;
// bind PBO to update pixel values
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, _pboIds[_currentPboIndex]);
Util::GLErrorAssert();
// map the next buffer object into client's memory
// Note that glMapBuffer() causes sync issue.
// If GPU is working with this buffer, glMapBuffer() will wait(stall)
// for GPU to finish its job
GLubyte* ptr = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
Util::GLErrorAssert();
if(ptr)
{
// update data directly on the mapped buffer
_currentBuffer = ptr;
Util::GLErrorAssert();
}
else
{
printf("Unable to map PBO!");
assert(false);
}
// It is good idea to release PBOs with ID 0 after use.
// Once bound with 0, all pixel operations behave normal ways.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
Util::GLErrorAssert();
// If a callback was registered, call it
if(_renderCallback)
{
(*_renderCallback)(this);
}
}
else
{
glBindTexture(GL_TEXTURE_2D, _textureId);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0,
_width, _height, (_channelCount==4)?GL_RGBA:GL_RGB,
GL_UNSIGNED_BYTE,
&(_buffer[0])
);
Util::GLErrorAssert();
}
// Reset the dirty flag after updating
_isDirty = false;
}
In the decoding thread, I simply update the contents of _currentBuffer and set the _isDirty flag to true. The update() function above is called in the render thread.
When I use a single PBO, i.e. when _numPbos=1 in the above code, rendering works fine without any stutter. However, when I use more than one PBO, there is a visible stutter in the video. You can find a sample of me rendering 5 videos with _numPbos=2 here. The more PBOs I use, the worse the stutter becomes.
Theoretically, the buffer that I am updating and the buffer that I am using for rendering are different, so there should be no glitch of this sort. I want to use double/triple buffering to increase rendering performance.
I am looking for some pointers/hints as to what could be going wrong.
I don't know if this is your problem, but after you call this:
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, _pboIds[_currentPboIndex]);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
Util::GLErrorAssert();
you then call glBindTexture, but you are still operating on the buffer at index _currentPboIndex.
In my code, I have two indices: index and nextIndex.
At initialization I set:
index = 0;
nextIndex = 1;
Then my update pipeline looks like this:
index = (index + 1) % 2;
nextIndex = (nextIndex + 1) % 2;
uint32 textureSize = sizeof(RGB) * width * height;
GL_CHECK( glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[nextIndex]) );
GL_CHECK( glBufferData(GL_PIXEL_UNPACK_BUFFER, textureSize, 0, GL_STREAM_DRAW_ARB) );
GL_CHECK( gpuDataPtr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, textureSize, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT) );
//update data gpuDataPtr
GL_CHECK( glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER_ARB) );
//bind texture
GL_CHECK( glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, pbo[index]) );
GL_CHECK( glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0,
width, height, glFormat, GL_UNSIGNED_BYTE, 0) );
GL_CHECK( glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, 0) );
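The two-index rotation described above can be sketched independently of OpenGL. This is only a minimal sketch, not the code from the answer: PboRing and its member names are hypothetical, and it models nothing more than which slot is uploaded and which slot is written on each frame, assuming the write slot is always the one right after the upload slot.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: tracks which PBO slot feeds the texture upload and
// which slot the CPU fills next. It makes no OpenGL calls itself.
class PboRing {
public:
    explicit PboRing(std::size_t count) : count_(count), upload_(0) {
        // With a single slot, upload and write would target the same buffer.
        assert(count >= 2);
    }

    // Slot to bind before glTexSubImage2D on this frame.
    std::size_t uploadIndex() const { return upload_; }

    // Slot to map and fill with the next frame's pixels.
    std::size_t writeIndex() const { return (upload_ + 1) % count_; }

    // Call once per frame to rotate both roles by one slot.
    void advance() { upload_ = (upload_ + 1) % count_; }

private:
    std::size_t count_;
    std::size_t upload_;
};
```

With two slots the roles simply swap every frame. The GL calls would then be issued against pbo[ring.writeIndex()] for mapping and pbo[ring.uploadIndex()] for the texture upload, so the two operations never touch the same buffer within a frame.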
I am trying to render to the OpenGL framebuffer via an OpenGL renderbuffer from an OpenCL kernel. The issue is: even though I can (probably) write to the renderbuffer from an OpenCL kernel, the screen stays empty (black).
I am reaching the limits of what I can test in finite time, so I am asking someone with much more experience for a tip about what I am missing.
I personally suspect that I forgot to bind a buffer at the right point, but since I don't see which one or where, this is practically impossible to check.
Now for some reduced code, so you don't have to look at all the error checking etc. This is the function that is called during the render routine:
void TestBuffer(){
GLubyte *buffer = (GLubyte *) malloc(1000 * 1000 * 4);
glReadBuffer(GL_COLOR_ATTACHMENT0);
error = glGetError();
if(error != GL_NO_ERROR){
printf("error with readBuffer, %i\n", error);
}
glReadPixels(0, 0, 1000, 1000, GL_RGBA, GL_UNSIGNED_BYTE, (GLvoid *)buffer);
error = glGetError();
if(error != GL_NO_ERROR){
printf("error with readpixels\n");
}
for(int i = 0; i < 1000*100; i++){
if(buffer[i] != 0){
printf("buffer was not empty # %i: %u\n", i, buffer[i]);
free(buffer);
return;
}
}
printf("buffer was empty\n");
free(buffer);
}
void runShader(){
glFinish(); //Make sure, that OpenGL isn't using our objects
ret = clEnqueueAcquireGLObjects(command_queue, 1, &cl_renderBuffer, 0, NULL, NULL);
// Execute the OpenCL kernel on the list
size_t global_item_size = 1000 * 1000; // Process the entire lists
size_t local_item_size = 1000; // Divide work items into groups of ScreenWidth
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);
ret = clEnqueueReleaseGLObjects(command_queue, 1, &cl_renderBuffer, 0, NULL, NULL);
clFlush(command_queue);
clFinish(command_queue);
// We are going to blit into the window (default framebuffer)
glBindFramebuffer (GL_DRAW_FRAMEBUFFER, 0);
glDrawBuffer (GL_BACK); // Use backbuffer as color dst.
// Read from your FBO
glBindFramebuffer (GL_READ_FRAMEBUFFER, gl_frameBuffer);
glReadBuffer (GL_COLOR_ATTACHMENT0); // Use Color Attachment 0 as color src.
// Copy the color and depth buffer from your FBO to the default framebuffer
glBlitFramebuffer (0,0, 1000, 1000, 0,0, 1000, 1000, GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT, GL_NEAREST);
TestBuffer();
}
My ideas were:
Blit the contents of the renderbuffer to the screen buffer, in case I messed up binding the new framebuffer object (created earlier) or attaching the renderbuffer (which you can see in the last few lines of the code).
Check whether I messed up the double buffering or something similar: this is the TestBuffer() function.
Flush before finishing, just in case.
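Note that the loop in TestBuffer() above inspects only the first 1000*100 bytes of the 1000*1000*4-byte readback. A minimal sketch of a scan over the whole buffer follows. firstNonZero is a hypothetical helper name, and the sketch assumes the pixels have already been read back (for example with glReadPixels) into a std::vector of tightly packed RGBA bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first non-zero byte in an RGBA8 readback,
// or rgba.size() if every byte is zero.
inline std::size_t firstNonZero(const std::vector<unsigned char>& rgba,
                                std::size_t width, std::size_t height) {
    // Assumption: 4 bytes per pixel, rows tightly packed.
    assert(rgba.size() == width * height * 4);
    auto it = std::find_if(rgba.begin(), rgba.end(),
                           [](unsigned char b) { return b != 0; });
    return static_cast<std::size_t>(it - rgba.begin());
}
```

A return value equal to the buffer size means the readback really was empty everywhere, not just in its first rows.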
The shader/kernel code is simple on purpose, to see if the other stuff actually works (.w should be alpha, which should be opaque, so we can see the result, the rest is just a gray rainbow):
#pragma OPENCL EXTENSION all : enable
#define ScreenWidth 1000
#define ScreenHeight 1000
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;
__kernel void rainbow(__write_only image2d_t asd) {
int i = get_global_id(0);
unsigned int x = i%ScreenWidth;
unsigned int y = i/ScreenWidth; // row index: divide by the row length (the width)
uint4 pixel; //I wish, I could access this as an array
pixel.x = i;
pixel.y = i;
pixel.z = i;
pixel.w = 255;
write_imageui(asd, (int2)(x, y), pixel);
}
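The mapping from the one-dimensional work-item id to a pixel position can be checked on the host. The following is a small sketch with hypothetical names (PixelCoord, toPixel), assuming a row-major image: both coordinates are derived from the image width, and dividing by the height instead gives the same result only when the image is square.

```cpp
#include <cassert>

// Hypothetical pixel position type for this sketch.
struct PixelCoord {
    unsigned x;
    unsigned y;
};

// Maps a linear work-item id to a pixel in a row-major image.
// The column is the remainder and the row is the quotient, both by width.
inline PixelCoord toPixel(unsigned id, unsigned width) {
    assert(width > 0);
    return PixelCoord{ id % width, id / width };
}
```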
Some further information:
I am only rendering to COLOR_ATTACHMENT0, since I don't care about the depth or stencil buffers in my use case. This could be an issue though (I didn't even create buffers for them).
I am compiling for Windows 10
The format of the Renderbuffer is RGBA8, but I think the natural format is RGBA24. It once was just RGBA as you can see in the TestBuffer Routine, but I think this should be fine.
What could cause the screen to stay black/empty?
As per the subject, I have the following pseudo-code to set up window capture in X (Linux):
xdisplay = XOpenDisplay(NULL);
win_capture = ...find the window to capture...
XCompositeRedirectWindow(xdisplay, win_capture, CompositeRedirectAutomatic);
XGetWindowAttributes(xdisplay, win_capture, &win_attr); // attributes used later
GLXFBConfig *configs = glXChooseFBConfig(xdisplay, win_attr.root, config_attrs, &nelem);
// cycle through the configs to
// find a valid one
...
win_pixmap = XCompositeNameWindowPixmap(xdisplay, win_capture);
const int pixmap_attrs[] = {GLX_TEXTURE_TARGET_EXT, GLX_TEXTURE_2D_EXT,
GLX_TEXTURE_FORMAT_EXT,
GLX_TEXTURE_FORMAT_RGBA_EXT, None};
gl_pixmap = glXCreatePixmap(xdisplay, config, win_pixmap, pixmap_attrs);
gl_ctx = glXCreateNewContext(xdisplay, config, GLX_RGBA_TYPE, 0, 1);
glXMakeCurrent(xdisplay, gl_pixmap, gl_ctx);
glEnable(GL_TEXTURE_2D);
glGenTextures(1, &gl_texmap);
glBindTexture(GL_TEXTURE_2D, gl_texmap);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, win_attr.width, win_attr.height, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
Then, much later on, this would be the loop to capture the frames:
glXMakeCurrent(xdisplay, gl_pixmap, gl_ctx);
glBindTexture(GL_TEXTURE_2D, gl_texmap);
glXBindTexImageEXT(xdisplay, gl_pixmap, GLX_FRONT_LEFT_EXT, NULL);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, data); // data is output RGBA buffer
glXReleaseTexImageEXT(xdisplay, gl_pixmap, GLX_FRONT_LEFT_EXT);
I basically do glXBindTexImageEXT -> glGetTexImage -> glXReleaseTexImageEXT so that I get an updated picture.
It does work, but I'm not sure I'm doing the right/optimal thing.
Is there a better/more optimized way to get such a picture/context?
As of now I've found a slightly better way to fetch the composite window through OpenGL: via a PBO. The advantage of this approach is that you can initiate the transfer asynchronously and retrieve the RGBA buffer from system memory later, while the OpenGL driver performs the data transfer in the meantime.
Sample pseudocode:
// setup a PBO
GLuint cur_pbo;
glGenBuffers(1, &cur_pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, cur_pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, size, NULL, GL_STREAM_READ);
Then much later on
glXMakeCurrent(xdisplay, gl_pixmap, gl_ctx);
glBindTexture(GL_TEXTURE_2D, gl_texmap);
glXBindTexImageEXT(xdisplay, gl_pixmap, GLX_FRONT_LEFT_EXT, NULL);
glBindBuffer(GL_PIXEL_PACK_BUFFER, cur_pbo);
// This initiates the data transfer. With a PBO bound, the last
// argument is no longer a pointer but a byte offset into the buffer
// allocated by the earlier glBufferData call
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
// do something else
...
...
...
// then later on, when we _really_ need the data,
// perform this call, which will wait if the RGBA
// data is not available yet
void* rgba_ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
// Then, when finished with rgba_ptr, release it
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glXReleaseTexImageEXT(xdisplay, gl_pixmap, GLX_FRONT_LEFT_EXT);
This approach is definitely better than the original approach (in the question) if you can use the CPU/same thread to do something else between the calls to glGetTexImage and glMapBuffer.
It may still be better even if you perform these calls back to back (instead of calling glGetTexImage without a PBO), because the driver may still optimize the transfer and manages the system memory buffer itself.
I'm trying to do high-throughput video streaming using OpenGL. I thought I'd figured it all out with my genius programming architecture, but - surprise - when doing more serious tests, I've been stonewalled with a performance problem.
The story goes like this:
It all starts by reserving a stack of PBOs (say, a hundred or more):
glGenBuffers(1, &index);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, index);
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, 0, GL_STREAM_DRAW); // reserve size bytes for the PBO with handle index
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); // unbind (not mandatory)
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, index); // rebind (not mandatory)
payload = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER); // release pointer to mapping buffer ** MANDATORY **
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); // unbind ** MANDATORY **
YUV pixel data is copied into PBOs by separate decoder/uploader threads that use a common stack of available PBOs. The "payload" pointers you see above, are accessed from these threads and data is copied (with memcpy) "directly" to the gpu. Once a PBO is used, it is returned to the stack.
I also pre-reserve textures for each separate video stream. I reserve three textures (y, u and v), like this:
glEnable(GL_TEXTURE_2D);
glGenTextures(1, &index);
glBindTexture(GL_TEXTURE_2D, index);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, format, w, h, 0, format, GL_UNSIGNED_BYTE, 0); // no upload, just reserve
glBindTexture(GL_TEXTURE_2D, 0); // unbind
Rendering is done in a "master thread" (remember, the decoder / uploader threads are separate beasts) that reads frames from a fifo queue.
A critical step in rendering is to copy data from PBOs to textures (tex->format is GL_RED):
// y
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo->y_index);
glBindTexture(GL_TEXTURE_2D, tex->y_index); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, tex->w, tex->h, tex->format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
// u
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo->u_index);
glBindTexture(GL_TEXTURE_2D, tex->u_index); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, tex->w/2, tex->h/2, tex->format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
// v
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo->v_index);
glBindTexture(GL_TEXTURE_2D, tex->v_index); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, tex->w/2, tex->h/2, tex->format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); // unbind // important!
glBindTexture(GL_TEXTURE_2D, 0); // unbind
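For reference, the per-plane byte counts implied by the three glTexSubImage2D calls above (full resolution for y, tex->w/2 by tex->h/2 for u and v, one byte per sample) can be sketched as follows. PlaneSizes and planeSizes are hypothetical names. The sketch assumes tightly packed rows, which holds only if every row length is a multiple of GL_UNPACK_ALIGNMENT (4 by default) or the alignment has been set to 1.

```cpp
#include <cassert>
#include <cstddef>

// Byte sizes of the three planes of one frame, one byte per sample.
struct PlaneSizes {
    std::size_t y;
    std::size_t u;
    std::size_t v;
    std::size_t total() const { return y + u + v; }
};

// Assumption: chroma planes are subsampled by two in both directions and
// rows carry no padding (see the note on GL_UNPACK_ALIGNMENT above).
inline PlaneSizes planeSizes(std::size_t w, std::size_t h) {
    assert(w % 2 == 0 && h % 2 == 0);
    const std::size_t luma = w * h;
    const std::size_t chroma = (w / 2) * (h / 2);
    return PlaneSizes{ luma, chroma, chroma };
}
```

For a 1920x1080 frame this gives 2073600 bytes for y and 518400 bytes each for u and v, which is the amount each PBO-to-texture copy has to move per frame and per stream.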
And finally, the image is drawn using the OpenGL shading language (which is another story).
The Question : Do you see any OpenGL performance bottlenecks here?
Step (3) seems like a bottleneck, as it starts to consume too much time (up to 10+ milliseconds!) when I'm trying to do this with several cameras.
Of course, this could be due to something else clogging the OpenGL pipeline - but everything else (glDrawElements, etc.) seems to take max. 1 millisecond.
I've been reading about problems people are having with glTexSubImage2D, but in my case, I'm simply filling the textures from PBOs. This should be lightning fast - right? Could the GL_RED format pose a problem by being non-optimal for the driver?
Another thing: I'm not de/reallocating here (I am using the same stack of pre-reserved PBOs), but re-allocating seems to be fast as well, if I understood this page correctly:
https://www.khronos.org/opengl/wiki/Buffer_Object_Streaming
Any insight highly appreciated..!
P. S. The complete project is here: https://github.com/elsampsa/valkka-core
EDIT 1:
I did some profiling: Every now and then during the streaming, both the PBO=>texture loading (as shown in the code snippet) and glXMakeCurrent go completely crazy and they both consume 10+ milliseconds (!) This happens quite sporadically. I tried to add some glFinish calls after each PBO=>texture load, but with little success (it seemed to stabilize things a bit .. but actually I'm not sure)
EDIT 2:
I am slowly getting there .. Ran some tests where I (a) upload with PBO to GPU and then (b) copy from PBO to texture (like in that sample code). The speed seems to depend on the texture format in "glTexImage2D". I try to match the texture's format and OpenGL internal format, by setting them to GL_RED and GL_RED (or GL_R8), respectively. But that is slow. Instead, if I use GL_RGBA for both, PBO=>TEX is lightning fast.. 100x faster !
Here:
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glTexImage2D.xhtml
it says that
GL_RED : Each element is a single red component. The GL converts it to floating point and assembles it into an RGBA element by attaching 0 for green and blue, and 1 for alpha. Each component is clamped to the range [0,1].
.. but I don't want OpenGL to do that! How can I tell it that it's just plain LUMA, i.e. one-byte-per-pixel and no need to convert/fill it, cause I will just use it in the shader program.
Maybe this is impossible and I should use buffer textures instead (as suggested in the comments) .. ? Buffer textures don't try to convert anything.. they just handle it as raw payload, right?
EDIT 3:
I'm trying to get dma to the texture buffer object:
// let's reserve a TBO
glGenBuffers(1, &tbo_index); // a buffer
glBindBuffer(GL_TEXTURE_BUFFER, tbo_index); // .. what is it
glBufferData(GL_TEXTURE_BUFFER, size, 0, GL_STREAM_DRAW); // .. how much
std::cout << "tbo " << tbo_index << std::endl;
glBindBuffer(GL_TEXTURE_BUFFER, 0); // unbind
// generate a texture
glGenTextures(1, &tex_index);
std::cout << "texture " << tex_index << std::endl;
// let's try to get dma to the texture buffer
glBindBuffer(GL_TEXTURE_BUFFER, tbo_index); // bind
payload = (GLubyte*)glMapBuffer(GL_TEXTURE_BUFFER, GL_WRITE_ONLY); // ** TODO: doesn't work
glUnmapBuffer(GL_TEXTURE_BUFFER); // release pointer to mapping buffer
glBindBuffer(GL_TEXTURE_BUFFER, 0); // unbind
std::cout << "tbo " << tbo_index << " at " << (long unsigned int)payload << std::endl;
Doesn't work.. payload is always a null pointer. glMapBuffer works ok with PBOs though. It should work with TBO's as well.
I am trying to get DirectX-OpenGL interop to work, with no success so far. In my case rendering is done in OpenGL (by the OSG library), and I would like to have the rendered image as a DirectX Texture2D. What I have tried so far:
Initialization:
ID3D11Device *dev3D;
// init dev3D with D3D11CreateDevice
ID3D11Texture2D *dxTexture2D;
// init dxTexture2D with CreateTexture2D, with D3D11_USAGE_DEFAULT, D3D11_BIND_SHADER_RESOURCE
HANDLE hGlDev = wglDXOpenDeviceNV(dev3D);
GLuint glTex;
glGenTextures(1, &glTex);
HANDLE hGLTx = wglDXRegisterObjectNV(hGlDev, (void*) dxTexture2D, glTex, GL_TEXTURE_2D, WGL_ACCESS_READ_WRITE_NV);
I get a callback for every frame rendered by the OSG camera. First I call glReadBuffer(GL_FRONT), and everything seems to be OK up to that point, as I am able to read the rendered buffer into memory with glReadPixels. The problem is that I can't copy the pixels to the previously created GL_TEXTURE_2D:
BOOL lockOK = wglDXLockObjectsNV(hGlDev, 1, &hGLTx);
glBindTexture(GL_TEXTURE_2D, glTex);
glCopyTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 0, 0, width, height, 0);
auto err = glGetError();
The last call to glCopyTexImage2D creates an error 0x502 (GL_INVALID_OPERATION), and I can't figure out why. Until this point everything else looks fine.
Any help is appreciated.
Found the problem. Instead of calling glCopyTexImage2D (which redefines the texture's storage), I needed to use glCopyTexSubImage2D:
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);
I have a WinForms application with a panel (500x500 pixels) that I want to render something in. At this point I am just trying to fill it in with a specific color. I want to use OpenGL/CUDA interop to do this.
I got the panel configured to be the region to render stuff in, however when I run my code, the panel just gets filled with the glClear(..) color, and nothing assigned by the kernel is displayed. It sort of worked this morning (inconsistently), and in my attempt to sort out the SwapBuffers() mess, I think I screwed it up.
Here is the pixel format initialization for OpenGL. It seems to work fine, I have the two buffers as I expected, and the context is correct:
static PIXELFORMATDESCRIPTOR pfd=
{
sizeof(PIXELFORMATDESCRIPTOR), // Size Of This Pixel Format Descriptor
1, // Version Number
PFD_DRAW_TO_WINDOW | // Format Must Support Window
PFD_SUPPORT_OPENGL | // Format Must Support OpenGL
PFD_DOUBLEBUFFER, // Must Support Double Buffering
PFD_TYPE_RGBA, // Request An RGBA Format
16, // Select Our Color Depth
0, 0, 0, 0, 0, 0, // Color Bits Ignored
0, // No Alpha Buffer
0, // Shift Bit Ignored
0, // No Accumulation Buffer
0, 0, 0, 0, // Accumulation Bits Ignored
16, // 16Bit Z-Buffer (Depth Buffer)
0, // No Stencil Buffer
0, // No Auxiliary Buffer
PFD_MAIN_PLANE, // Main Drawing Layer
0, // Reserved
0, 0, 0 // Layer Masks Ignored
};
GLint iPixelFormat;
// get the device context's best, available pixel format match
if((iPixelFormat = ChoosePixelFormat(hdc, &pfd)) == 0)
{
MessageBox::Show("ChoosePixelFormat Failed");
return 0;
}
// make that match the device context's current pixel format
if(SetPixelFormat(hdc, iPixelFormat, &pfd) == FALSE)
{
MessageBox::Show("SetPixelFormat Failed");
return 0;
}
if((m_hglrc = wglCreateContext(m_hDC)) == NULL)
{
MessageBox::Show("wglCreateContext Failed");
return 0;
}
if(wglMakeCurrent(m_hDC, m_hglrc) == FALSE)
{
MessageBox::Show("wglMakeCurrent Failed");
return 0;
}
After this is done, I set up the ViewPort as such:
glViewport(0,0,iWidth,iHeight); // Reset The Current Viewport
glMatrixMode(GL_MODELVIEW); // Select The Modelview Matrix
glLoadIdentity(); // Reset The Modelview Matrix
glEnable(GL_DEPTH_TEST);
Then I set up the clear color and do a clear:
glClearColor(1.0f, 0.0f, 0.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT| GL_DEPTH_BUFFER_BIT);
Now I set up the CUDA/OpenGL interop:
cudaDeviceProp prop; int dev;
memset(&prop, 0, sizeof(cudaDeviceProp));
prop.major = 1; prop.minor = 0;
checkCudaErrors(cudaChooseDevice(&dev, &prop));
checkCudaErrors(cudaGLSetGLDevice(dev));
glBindBuffer = (PFNGLBINDBUFFERARBPROC)GET_PROC_ADDRESS("glBindBuffer");
glDeleteBuffers = (PFNGLDELETEBUFFERSARBPROC)GET_PROC_ADDRESS("glDeleteBuffers");
glGenBuffers = (PFNGLGENBUFFERSARBPROC)GET_PROC_ADDRESS("glGenBuffers");
glBufferData = (PFNGLBUFFERDATAARBPROC)GET_PROC_ADDRESS("glBufferData");
GLuint bufferID;
cudaGraphicsResource * resourceID;
glGenBuffers(1, &bufferID);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, bufferID);
glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, fWidth*fHeight*4, NULL, GL_DYNAMIC_DRAW_ARB);
checkCudaErrors(cudaGraphicsGLRegisterBuffer( &resourceID, bufferID, cudaGraphicsMapFlagsNone ));
Now I try to call my kernel (which just paints each pixel a specific color) and have that displayed.
uchar4* devPtr;
size_t size;
// First clear the back buffer:
glClearColor(1.0f, 0.5f, 0.0f, 0.0f); // orange
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
checkCudaErrors(cudaGraphicsMapResources(1, &resourceID, NULL));
checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &size, resourceID));
animate(devPtr); // This will call the kernel and do a sync (see later)
checkCudaErrors(cudaGraphicsUnmapResources(1, &resourceID, NULL));
// Swap buffers to bring back buffer forward:
SwapBuffers(m_hDC);
At this point I expect to see the kernel colors on the screen, but no! I see orange, which is the clear color that I just set.
Here is the call to the kernel:
void animate(uchar4* dispPtr)
{
checkCudaErrors(cudaDeviceSynchronize());
animKernel<<<blocks, threads>>>(dispPtr, envdim);
checkCudaErrors(cudaDeviceSynchronize());
}
Here envdim is just the dimensions (so 500x500). The kernel itself:
__global__ void animKernel(uchar4 *optr, dim3 matdim)
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * matdim.x;
if (x < matdim.x && y < matdim.y)
{
// BLACK:
optr[offset].x = 0; optr[offset].y = 0; optr[offset].z = 0;
}
}
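The index arithmetic used by the kernel, and the launch geometry it relies on, can be checked on the host. This is only a sketch with hypothetical helper names (blocksFor, pixelOffset, rgbaBufferBytes). The question does not show how blocks and threads are set up, so the block edge of 16 mentioned below is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Number of blocks needed to cover `extent` elements when each block
// holds `threadsPerBlock` threads along that axis (rounds up).
inline unsigned blocksFor(unsigned extent, unsigned threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (extent + threadsPerBlock - 1) / threadsPerBlock;
}

// Row-major offset of pixel (x, y), matching `x + y * matdim.x` in the kernel.
inline std::size_t pixelOffset(unsigned x, unsigned y, unsigned width) {
    return static_cast<std::size_t>(y) * width + x;
}

// Bytes needed for a width x height image of 4-byte pixels (uchar4).
inline std::size_t rgbaBufferBytes(unsigned width, unsigned height) {
    return static_cast<std::size_t>(width) * height * 4;
}
```

For the 500x500 panel with an assumed 16x16 block, 32 blocks are needed per axis. The rounded-up grid covers 512 threads per axis, which is why the kernel's bounds check on x and y matters.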
Things I've done:
The size returned by cudaGraphicsResourceGetMappedPointer is 1000000, which corresponds to the 500x500 matrix of uchar4 (500 * 500 * 4 bytes), so that's good.
Each kernel thread printed the value and location that it was writing to, and that seemed OK.
Played with the alpha value for the clear color, but that doesn't seem to do anything (yet?)
Ran the animate() function several times. Don't know why I thought that would help, but I tried it...
So I guess I'm missing something, but I'm going kind of crazy looking for it. Any advice? Help?
It's another one of those questions I answer myself! Hmph, as I figured, it was a one line issue. The problem resides in the rendering call itself.
The configuration is fine. The one issue with the code above is this:
I never called glDrawPixels(), which is necessary for the OpenGL driver to copy the shared buffer (bound to GL_PIXEL_UNPACK_BUFFER_ARB) to the display buffer. The correct rendering sequence is then:
uchar4* devPtr;
size_t size;
// First clear the back buffer:
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
checkCudaErrors(cudaGraphicsMapResources(1, &resourceID, NULL));
checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &size, resourceID));
animate(devPtr); // This will call the kernel and do a sync (see later)
checkCudaErrors(cudaGraphicsUnmapResources(1, &resourceID, NULL));
// This is necessary to copy the shared buffer to display
glDrawPixels(fWidth, fHeight, GL_RGBA, GL_UNSIGNED_BYTE, 0);
// Swap buffers to bring back buffer forward:
SwapBuffers(m_hDC);
I'd like to thank the Acade-- uh, CUDA By Example, once again for helping me. Even though the example code from the book used GLUT (which was completely useless for this...), the book referenced normal gl functions.