Streaming several (YUV) videos using OpenGL - c++

I'm trying to do high-throughput video streaming using OpenGL. I thought I'd figured it all out with my genius programming architecture, but - surprise - when doing more serious tests, I've been stonewalled with a performance problem.
The story goes like this:
It all starts by reserving a stack of PBOs (say, a hundred or so):
glGenBuffers(1, &index);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, index);
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, 0, GL_STREAM_DRAW); // reserve size bytes for the buffer named by index
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); // unbind (not mandatory)
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, index); // rebind (not mandatory)
payload = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER); // release the mapped pointer ** MANDATORY **
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); // unbind ** MANDATORY **
YUV pixel data is copied into PBOs by separate decoder/uploader threads that use a common stack of available PBOs. The "payload" pointers you see above are accessed from these threads, and data is copied (with memcpy) "directly" to the GPU. Once a PBO is used, it is returned to the stack.
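In textbook form, each per-frame upload would look like this (a sketch; names are illustrative, and note that in my code the pointer is obtained once, up front, as shown above):
// per-frame upload of one plane into a PBO taken from the stack
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo_id);
GLubyte* dst = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
if (dst) {
    memcpy(dst, src, n_bytes);              // the "direct" copy towards the GPU
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);  // dst is invalid after this
}
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);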
I also pre-reserve textures for each separate video stream. I reserve three textures (y, u and v), like this:
glEnable(GL_TEXTURE_2D);
glGenTextures(1, &index);
glBindTexture(GL_TEXTURE_2D, index);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, format, w, h, 0, format, GL_UNSIGNED_BYTE, 0); // no upload, just reserve
glBindTexture(GL_TEXTURE_2D, 0); // unbind
Rendering is done in a "master thread" (remember, the decoder/uploader threads are separate beasts) that reads frames from a FIFO queue.
A critical step in rendering is to copy data from PBOs to textures (tex->format is GL_RED):
// y
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo->y_index);
glBindTexture(GL_TEXTURE_2D, tex->y_index); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, tex->w, tex->h, tex->format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
// u
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo->u_index);
glBindTexture(GL_TEXTURE_2D, tex->u_index); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, tex->w/2, tex->h/2, tex->format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
// v
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo->v_index);
glBindTexture(GL_TEXTURE_2D, tex->v_index); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, tex->w/2, tex->h/2, tex->format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); // unbind // important!
glBindTexture(GL_TEXTURE_2D, 0); // unbind
And finally, the image is drawn using the OpenGL shading language (which is another story).
The question: Do you see any OpenGL performance bottlenecks here?
The PBO => texture copy step seems like a bottleneck: it starts to consume far too much time (10+ milliseconds!) when I try to do this with several cameras.
Of course, this could be due to something else clogging the OpenGL pipeline, but everything else (glDrawElements, etc.) takes at most 1 millisecond.
I've been reading about problems people have with glTexSubImage2D, but in my case I'm simply filling the textures from PBOs. This should be lightning fast, right? Could the GL_RED format pose a problem by being non-optimal for the driver?
Another thing: I'm not deallocating or reallocating anything here (I reuse the same stack of pre-reserved PBOs), though re-allocation is supposed to be fast as well, if I understood this page correctly:
https://www.khronos.org/opengl/wiki/Buffer_Object_Streaming
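(For reference, the orphaning pattern that page describes looks roughly like this; index and size as in the snippet above:)
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, index);
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW); // orphan: driver may hand back fresh memory
GLubyte* p = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
// ... fill p ...
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);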
Any insight highly appreciated..!
P. S. The complete project is here: https://github.com/elsampsa/valkka-core
EDIT 1:
I did some profiling: every now and then during streaming, both the PBO => texture load (as shown in the code snippet) and glXMakeCurrent go completely crazy and each consumes 10+ milliseconds(!). This happens quite sporadically. I tried adding glFinish calls after each PBO => texture load, but with little success (it seemed to stabilize things a bit, though I'm actually not sure).
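A less brutal alternative to glFinish that I might try is a fence sync, something like this (a sketch):
// after issuing the PBO => texture copies, drop a fence instead of glFinish
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
// ... later, before recycling the PBO:
GLenum r = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0); // timeout 0 = just poll
if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
    // the copy has finished; safe to return the PBO to the stack
}
glDeleteSync(fence);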
EDIT 2:
I am slowly getting there. I ran some tests where I (a) upload to the GPU with a PBO and then (b) copy from the PBO to a texture (as in the sample code above). The speed seems to depend on the texture format passed to glTexImage2D. I tried matching the texture's format and OpenGL internal format by setting them to GL_RED and GL_RED (or GL_R8), respectively, but that is slow. If I instead use GL_RGBA for both, the PBO => texture copy is lightning fast: around 100x faster!
Here:
https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glTexImage2D.xhtml
it says that
GL_RED : Each element is a single red component. The GL converts it to floating point and assembles it into an RGBA element by attaching 0 for green and blue, and 1 for alpha. Each component is clamped to the range [0,1].
.. but I don't want OpenGL to do that! How can I tell it that this is just plain luma, i.e. one byte per pixel, with no need to convert or fill it, because I will just use it in the shader program?
Maybe this is impossible and I should use buffer textures instead (as suggested in the comments)? Buffer textures don't try to convert anything; they just treat the data as raw payload, right?
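One thing I want to rule out first: the RGBA assembly described in the reference page is sampling behavior, not a storage conversion, and a sized GL_R8 internal format really is stored one byte per texel. The default 4-byte row alignment can hurt single-channel uploads when rows aren't a multiple of 4, though, so a sketch of what I mean to test:
glPixelStorei(GL_UNPACK_ALIGNMENT, 1); // luma rows are tightly packed, not 4-byte aligned
glTexImage2D(GL_TEXTURE_2D, 0, GL_R8,  // sized one-byte internal format, no conversion
             w, h, 0, GL_RED, GL_UNSIGNED_BYTE, 0);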
EDIT 3:
I'm trying to get dma to the texture buffer object:
// let's reserve a TBO
glGenBuffers(1, &tbo_index); // a buffer
glBindBuffer(GL_TEXTURE_BUFFER, tbo_index); // .. what is it
glBufferData(GL_TEXTURE_BUFFER, size, 0, GL_STREAM_DRAW); // .. how much
std::cout << "tbo " << tbo_index << std::endl;
glBindBuffer(GL_TEXTURE_BUFFER, 0); // unbind
// generate a texture
glGenTextures(1, &tex_index);
std::cout << "texture " << tex_index << std::endl;
// let's try to get dma to the texture buffer
glBindBuffer(GL_TEXTURE_BUFFER, tbo_index); // bind
payload = (GLubyte*)glMapBuffer(GL_TEXTURE_BUFFER, GL_WRITE_ONLY); // ** TODO: doesn't work
glUnmapBuffer(GL_TEXTURE_BUFFER); // release pointer to mapping buffer
glBindBuffer(GL_TEXTURE_BUFFER, 0); // unbind
std::cout << "tbo " << tbo_index << " at " << (long unsigned int)payload << std::endl;
This doesn't work: payload is always a null pointer. glMapBuffer works fine with PBOs, though, and it should work with TBOs as well.
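(One thing I notice: I never attach the buffer to the texture. Presumably a buffer texture also needs glTexBuffer, something like this sketch, assuming a GL 3.1+ context:)
// associate the TBO's data store with the texture; GL_R8 = one byte per texel
glBindTexture(GL_TEXTURE_BUFFER, tex_index);
glTexBuffer(GL_TEXTURE_BUFFER, GL_R8, tbo_index);
glBindTexture(GL_TEXTURE_BUFFER, 0);
// a null pointer from glMapBuffer usually means a GL error was raised:
GLenum err = glGetError(); // worth checking right after the map call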

Related

Is there a better/more efficient way to capture composite X windows in Linux?

As per subject I have the following pseudo-code to setup window capture in X (Linux):
xdisplay = XOpenDisplay(NULL);
win_capture = ...find the window to capture...
XCompositeRedirectWindow(xdisplay, win_capture, CompositeRedirectAutomatic);
XGetWindowAttributes(xdisplay, win_capture, &win_attr); // attributes used later
GLXFBConfig *configs = glXChooseFBConfig(xdisplay, XScreenNumberOfScreen(win_attr.screen), config_attrs, &nelem); // takes a screen number, not a root window
// cycle through the configs to
// find a valid one
...
win_pixmap = XCompositeNameWindowPixmap(xdisplay, win_capture);
const int pixmap_attrs[] = {GLX_TEXTURE_TARGET_EXT, GLX_TEXTURE_2D_EXT,
GLX_TEXTURE_FORMAT_EXT,
GLX_TEXTURE_FORMAT_RGBA_EXT, None};
gl_pixmap = glXCreatePixmap(xdisplay, config, win_pixmap, pixmap_attrs);
gl_ctx = glXCreateNewContext(xdisplay, config, GLX_RGBA_TYPE, 0, 1);
glXMakeCurrent(xdisplay, gl_pixmap, gl_ctx);
glEnable(GL_TEXTURE_2D);
glGenTextures(1, &gl_texmap);
glBindTexture(GL_TEXTURE_2D, gl_texmap);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, win_attr.width, win_attr.height, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
Then, much later on, this would be the loop to capture the frames:
glXMakeCurrent(xdisplay, gl_pixmap, gl_ctx);
glBindTexture(GL_TEXTURE_2D, gl_texmap);
glXBindTexImageEXT(xdisplay, gl_pixmap, GLX_FRONT_LEFT_EXT, NULL);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, data); // data is output RGBA buffer
glXReleaseTexImageEXT(xdisplay, gl_pixmap, GLX_FRONT_LEFT_EXT);
I basically do glXBindTexImageEXT -> glGetTexImage -> glXReleaseTexImageEXT so that I get an updated picture.
It does work, but not sure I'm doing the right/optimal thing.
Is there a better/more optimized way to get such picture/context?
As of now I've found a slightly better way to fetch the composited window through OpenGL, via a PBO; the advantage of this approach is that you can initiate the transfer asynchronously and then retrieve the RGBA buffer from system memory while the OpenGL driver performs the data transfer.
Sample pseudocode:
// setup a PBO
GLuint cur_pbo;
glGenBuffers(1, &cur_pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, cur_pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, size, NULL, GL_STREAM_READ);
Then much later on
glXMakeCurrent(xdisplay, gl_pixmap, gl_ctx);
glBindTexture(GL_TEXTURE_2D, gl_texmap);
glXBindTexImageEXT(xdisplay, gl_pixmap, GLX_FRONT_LEFT_EXT, NULL);
glBindBuffer(GL_PIXEL_PACK_BUFFER, cur_pbo);
// This initiates the data transfer; with a PBO bound to
// GL_PIXEL_PACK_BUFFER, the last argument is a byte offset into the
// buffer allocated by the previous glBufferData call, not a pointer
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
// do something else
...
...
...
// then later on, when we _really_ need the data,
// perform this call, which will block if the RGBA
// data is not available yet
void* rgba_ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
// Then when finished to use rgba_ptr, release it
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glXReleaseTexImageEXT(xdisplay, gl_pixmap, GLX_FRONT_LEFT_EXT);
This approach is definitely better than the original one (in the question) if you can use the CPU/same thread to do something between the calls to glGetTexImage and glMapBuffer.
It may still be better even if you perform these calls back to back (compared with glGetTexImage without a PBO), because the driver may optimize the transfer and will manage the system-memory buffer itself.

CUDA/OpenGL Interop: Writing to surface object does not erase previous contents

I am attempting to use a CUDA kernel to modify an OpenGL texture, but am having a strange issue where my calls to surf2Dwrite() seem to blend with the previous contents of the texture, as you can see in the image below. The wooden texture in the back is what's in the texture before modifying it with my CUDA kernel. The expected output would include ONLY the color gradients, not the wood texture behind it. I don't understand why this blending is happening.
Possible Problems / Misunderstandings
I'm new to both CUDA and OpenGL. Here I'll try to explain the thought process that led me to this code:
I'm using a cudaArray to access the texture (rather than e.g. an array of floats) because I read that it's better for cache locality when reading/writing a texture.
I'm using surfaces because I read somewhere that it's the only way to modify a cudaArray
I wanted to use surface objects, which I understand to be the newer way of doing things. The old way is to use surface references.
Some possible problems with my code that I don't know how to check/test:
Am I being inconsistent with image formats? Maybe I didn't specify the correct number of bits/channel somewhere? Maybe I should use floats instead of unsigned chars?
Code Summary
You can find a full minimum working example in this GitHub Gist. It's quite long because of all the moving parts, but I'll try to summarize. I welcome suggestions on how to shorten the MWE. The overall structure is as follows:
create an OpenGL texture from a file stored locally
register the texture with CUDA using cudaGraphicsGLRegisterImage()
call cudaGraphicsSubResourceGetMappedArray() to get a cudaArray that represents the texture
create a cudaSurfaceObject_t that I can use to write to the cudaArray
pass the surface object to a kernel that writes to the texture with surf2Dwrite()
use the texture to draw a rectangle on-screen
OpenGL Texture Creation
I am new to OpenGL, so I'm using the "Textures" section of the LearnOpenGL tutorials as a starting point. Here's how I set up the texture (using the image library stb_image.h)
GLuint initTexturesGL(){
// load texture from file
int numChannels;
unsigned char *data = stbi_load("img/container.jpg", &g_imageWidth, &g_imageHeight, &numChannels, 4);
if(!data){
std::cerr << "Error: Failed to load texture image!" << std::endl;
exit(1);
}
// opengl texture
GLuint textureId;
glGenTextures(1, &textureId);
glBindTexture(GL_TEXTURE_2D, textureId);
// wrapping
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_MIRRORED_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_MIRRORED_REPEAT);
// filtering
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
// set texture image
glTexImage2D(
GL_TEXTURE_2D, // target
0, // mipmap level
GL_RGBA8, // internal format (#channels, #bits/channel, ...)
g_imageWidth, // width
g_imageHeight, // height
0, // border (must be zero)
GL_RGBA, // format of input image
GL_UNSIGNED_BYTE, // type
data // data
);
glGenerateMipmap(GL_TEXTURE_2D);
// unbind and free image
glBindTexture(GL_TEXTURE_2D, 0);
stbi_image_free(data);
return textureId;
}
CUDA Graphics Interop
After calling the function above, I register the texture with CUDA:
void initTexturesCuda(GLuint textureId){
// register texture
HANDLE(cudaGraphicsGLRegisterImage(
&g_textureResource, // resource
textureId, // image
GL_TEXTURE_2D, // target
cudaGraphicsRegisterFlagsSurfaceLoadStore // flags
));
// resource description for surface
memset(&g_resourceDesc, 0, sizeof(g_resourceDesc));
g_resourceDesc.resType = cudaResourceTypeArray;
}
Render Loop
Every frame, I run the following to modify the texture and render the image:
while(!glfwWindowShouldClose(window)){
// -- CUDA --
// map
HANDLE(cudaGraphicsMapResources(1, &g_textureResource));
HANDLE(cudaGraphicsSubResourceGetMappedArray(
&g_textureArray, // array through which to access subresource
g_textureResource, // mapped resource to access
0, // array index
0 // mipLevel
));
// create surface object (compute >= 3.0)
g_resourceDesc.res.array.array = g_textureArray;
HANDLE(cudaCreateSurfaceObject(&g_surfaceObj, &g_resourceDesc));
// run kernel
kernel<<<gridDim, blockDim>>>(g_surfaceObj, g_imageWidth, g_imageHeight);
// unmap
HANDLE(cudaGraphicsUnmapResources(1, &g_textureResource));
// --- OpenGL ---
// clear
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
// use program
shader.use();
// triangle
glBindVertexArray(vao);
glBindTexture(GL_TEXTURE_2D, textureId);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0);
glBindVertexArray(0);
// glfw: swap buffers and poll i/o events
glfwSwapBuffers(window);
glfwPollEvents();
}
CUDA Kernel
The actual CUDA kernel is as follows:
__global__ void kernel(cudaSurfaceObject_t surface, int nx, int ny){
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if(x < nx && y < ny){
uchar4 data = make_uchar4(x % 255,
y % 255,
0, 255);
surf2Dwrite(data, surface, x * sizeof(uchar4), y);
}
}
If I understand correctly, you initially register the texture, map it once, create a surface object for the array representing the mapped texture, and then unmap the texture. Every frame, you then map the resource again, ask for the array representing the mapped texture, and then completely ignore that one and use the surface object created for the array you got back when you first mapped the resource. From the documentation:
[…] The value set in array may change every time that resource is mapped.
You have to create a new surface object every time you map the resource because you might get a different array every time. And, in my experience, you will actually get a different one every so often. It may be a valid thing to do to only create a new surface object whenever the array actually changes. The documentation seems to allow for that, but I never tried, so I can't tell whether that works for sure…
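In code, the per-frame sequence that follows from this would look roughly like so (a sketch reusing the names from the question):
// map, fetch the (possibly different) array, build a fresh surface object,
// run the kernel, then tear the surface object down again and unmap
HANDLE(cudaGraphicsMapResources(1, &g_textureResource));
cudaArray_t arr;
HANDLE(cudaGraphicsSubResourceGetMappedArray(&arr, g_textureResource, 0, 0));
g_resourceDesc.res.array.array = arr;
cudaSurfaceObject_t surf;
HANDLE(cudaCreateSurfaceObject(&surf, &g_resourceDesc));
kernel<<<gridDim, blockDim>>>(surf, g_imageWidth, g_imageHeight);
HANDLE(cudaDestroySurfaceObject(surf));
HANDLE(cudaGraphicsUnmapResources(1, &g_textureResource));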
Apart from that: You generate mipmaps for your texture. You only overwrite mip level 0. You then render the texture using mipmapping with trilinear interpolation. So my guess would be that you just happen to render the texture at a resolution that does not match the resolution of mip level 0 exactly and, thus, you will end up interpolating between level 0 (in which you wrote) and level 1 (which was generated from the original texture)…
It turns out the problem is that I had mistakenly generated mipmaps for the original wood texture, and my CUDA kernel was only modifying the level-0 mipmap. The blending I noticed was the result of OpenGL interpolating between my modified level-0 mipmap and a lower-resolution version of the wood texture.
Here's the correct output, obtained by disabling mipmap interpolation. Lesson learned!
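(For reference, either of these avoids the level-0/level-1 blend; a sketch:)
// Option A: don't minify through mipmaps at all for this texture
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
// Option B: keep mipmaps, but rebuild them from the modified level 0
// after each CUDA write (costs an extra pass per frame)
glBindTexture(GL_TEXTURE_2D, textureId);
glGenerateMipmap(GL_TEXTURE_2D);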

What's wrong with my framebuffer?

I'm creating a framebuffer object to be my gbuffer for deferred shading. I mainly learned from http://ogldev.atspace.co.uk/ and modified the code to be a little more... sane. Here's the source code where I create the framebuffer:
/* Create the FBO */
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
/* Create the gbuffer textures */
glGenTextures(GBUFFER_NUM_TEXTURES, tex);
/* Create the color buffer */
glBindTexture(GL_TEXTURE_2D, tex[GBUFFER_COLOR]);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex[GBUFFER_COLOR], 0);
/* Create the normal buffer */
glBindTexture(GL_TEXTURE_2D, tex[GBUFFER_NORMAL]);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RG16F, width, height, 0, GL_RG, GL_FLOAT, NULL);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1, GL_TEXTURE_2D, tex[GBUFFER_NORMAL], 0);
/* Create the depth-stencil buffer */
glBindTexture(GL_TEXTURE_2D, tex[GBUFFER_DEPTH_STENCIL]);
glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH24_STENCIL8, width, height, 0, GL_DEPTH_STENCIL, GL_FLOAT, NULL);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT, GL_TEXTURE_2D, tex[GBUFFER_DEPTH_STENCIL], 0);
GLenum drawBuffers[] = {GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1};
glDrawBuffers(2, drawBuffers);
glReadBuffer(GL_NONE);
/* Check for errors */
GLenum status = glCheckFramebufferStatus(GL_FRAMEBUFFER);
if (status != GL_FRAMEBUFFER_COMPLETE)
{
error("In GBuffer::init():\n");
errormore("Failed to create Framebuffer, status: 0x%x\n", status);
fbo = 0;
return;
}
// restore default FBO
glBindFramebuffer(GL_FRAMEBUFFER, 0);
When I run this, however, status returns GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT. If it's not clear, I'm trying to create 3 gbuffers:
a 32-bit RGBA color buffer (I'd use 24-bit but I'm scared of alignment penalties),
a 32-bit RG normal buffer (each component using a 16-bit float, but I might get away with a signed short?)
a 24-bit Depth buffer packed with an 8-bit Stencil buffer
(total of 96 bits, or 12 bytes)
Possible problem areas that I can see might be using GL_FLOAT for the normal buffer, and GL_FLOAT for the depth-stencil buffer. I'd imagine GL_HALF_FLOAT would be more appropriate for normals, but that's not on the list of types that I can use with glTexImage2D. Similarly, I have no idea what type is most appropriate to use for a depth-stencil buffer.
What am I doing wrong?
Your use of GL_FLOAT is mostly irrelevant, since no pixel transfer actually happens.
You can supply anything you want there as long as it is a meaningful data type. While no pixel transfer happens when you pass NULL for data, GL still validates the pixel transfer data type against the set of valid types and will raise an error if you do something wrong. If it raises an error, the texture will be incomplete and thus cannot be used as an FBO attachment.
Here is where the problem lies: GL_FLOAT is not a meaningful data type for pixel transfer into a packed GL_DEPTH_STENCIL image format. It expects a packed data type such as GL_UNSIGNED_INT_24_8, or something really exotic like GL_FLOAT_32_UNSIGNED_INT_24_8_REV (a 64-bit packed floating-point depth + stencil format).
In any event, there are actually two components that need to be packed into your data type. GL_FLOAT can only describe one of the two components, because floating-point stencil buffers are meaningless.
By the way, this whole confusing mess about pixel transfer data types can be completely avoided if you use something like glTexStorage2D (...) to only allocate storage for the texture. glTexImage2D (...) does double duty: it allocates storage for a texture LOD and provides a mechanism to initialize it with data. You really do not care about the latter part if you are drawing into the texture with an FBO, since that is the only place it gets any data from.
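A sketch of both options (glTexStorage2D assumes GL 4.2+ or ARB_texture_storage):
// Option A: keep glTexImage2D, but use a packed type that is valid for
// GL_DEPTH_STENCIL pixel transfers
glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH24_STENCIL8, width, height, 0,
             GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, NULL);
// Option B: allocate immutable storage and skip the pixel-transfer
// parameters entirely
glTexStorage2D(GL_TEXTURE_2D, 1, GL_DEPTH24_STENCIL8, width, height);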

Clearing color of GL_TEXTURE_2D_ARRAY with PBO

I have a 2D texture array (GL_TEXTURE_2D_ARRAY). I need to clear the contents of the textures before each draw pass. I am trying to do it with a PBO, but I am getting an INVALID_OPERATION error.
Here is how I create the array of images:
glGenTextures(1,&_texID);
glBindTexture (GL_TEXTURE_2D_ARRAY,_texID);
glTexStorage3D(GL_TEXTURE_2D_ARRAY,1,GL_RGBA32F,width,height,numTextures);
glBindTexture (GL_TEXTURE_2D_ARRAY,0);
glBindImageTexture(0, _texID, 0, GL_FALSE, 0, GL_READ_WRITE, GL_RGBA32F);
Here is how I clear it:
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, clearBuffer);
glBindTexture(GL_TEXTURE_2D_ARRAY, itexArray->GetTexID());
for(int i =0; i <numTextures ;++i) {
glTexSubImage3D(GL_TEXTURE_2D_ARRAY,1,0, 0, 0, _viewportWidth, _viewportHeight, i , GL_RGBA, GL_FLOAT, NULL);
}
glBindTexture(GL_TEXTURE_2D_ARRAY, 0);
I have numTextures = 8, so 8 texture layers in the array. When I start clearing them in the loop, the first 4 are cleared without errors, but from the fourth on I am getting INVALID_OPERATION.
UPDATE:
I solved the PBO INVALID_OPERATION issue by enlarging the PBO from 2048x2048 to 4096x4096, but the textures of the texture array are still not cleared properly. For example, at startup of the program leftovers can still be seen, which disappear only after the rendered objects start moving around the viewport.
Here is the setup for clearing PBO:
GLint frameSize = MAX_FRAMEBUFFER_WIDTH * MAX_FRAMEBUFFER_HEIGHT * sizeof(float);
glGenBuffers(1, &clearBuffer);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER,clearBuffer);
glBufferData(GL_PIXEL_UNPACK_BUFFER,frameSize,NULL,GL_STATIC_DRAW);
//fill the buffer with color:
vec4* data = (vec4*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER,GL_WRITE_ONLY);
memset(data,0x00,frameSize);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
Where MAX_FRAMEBUFFER_WIDTH and MAX_FRAMEBUFFER_HEIGHT are both 4096
Level is the level of detail, i.e. the mipmap level; in most cases it is 0. Depth would be the array index in your case.
Your glTexSubImage3D call is broken.
glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 1,
0, 0, 0, //offset (first image)
_viewportWidth, _viewportHeight, i, //size (getting larger)
GL_RGBA, GL_FLOAT, NULL);
First of all, of course Vasaka is right in that you shouldn't write to mipmap level 1 (which doesn't even exist), but 0. But even then this call will try to put a 3D image of size _viewportWidth * _viewportHeight * i at the first array index, which is surely not what you want. Instead you want to clear a 2D image of size _viewportWidth * _viewportHeight at position i. So your call should actually look this way:
glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0,
0, 0, i, //offset (ith image)
_viewportWidth, _viewportHeight, 1, //size (proper 2D image)
GL_RGBA, GL_FLOAT, NULL);
And your problem of needing a larger PBO than necessary is easily solved by including a 4 in the computation of frameSize. Your PBO is treated (and explained by you) as containing 4-vectors of floats, yet you compute its size in bytes as if it contained single floats. That's why it magically works with doubled dimensions, since that increases the size of the PBO 4 times, as necessary, but it only hides the actual problem of forgetting the component count in the size computation.
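That is, the allocation should read something like:
// 4 floats (RGBA) per pixel, not 1
GLint frameSize = MAX_FRAMEBUFFER_WIDTH * MAX_FRAMEBUFFER_HEIGHT * 4 * sizeof(float);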
EDIT: By the way, instead of maintaining a huge PBO which contains nothing but 0s, you could also try to attach the respective image layer to an FBO and do a simple glClear in each loop iteration. I don't know which one is more efficient (though I'd guess glClear is more optimized than a whole image copy), but it at least makes the large PBO obsolete.
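A sketch of that alternative (fbo being a scratch framebuffer created once beforehand; names are illustrative):
// clear each layer through an FBO instead of uploading zeros from a PBO
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
for (int i = 0; i < numTextures; ++i) {
    glFramebufferTextureLayer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                              itexArray->GetTexID(), 0, i); // level 0, layer i
    glClear(GL_COLOR_BUFFER_BIT);
}
glBindFramebuffer(GL_FRAMEBUFFER, 0);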

Reading pixel values from a Frame Buffer Object (FBO) using a Pixel Buffer Object (PBO)

Can I use a Pixel Buffer Object (PBO) to directly read pixel values (i.e. using glReadPixels) from an FBO (i.e. while the FBO is still bound)?
If yes,
What are the advantages and disadvantages of using PBO with FBO?
What is the problem with the following code?
{
//DATA_SIZE = WIDTH * HEIGHT * 3 (BECAUSE I AM USING 3 CHANNELS ONLY)
// FBO and PBO status is good
.
.
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fboId);
//Draw the objects
// The following glReadPixels works fine:
glReadPixels(0, 0, screenWidth, screenHeight, GL_BGR_EXT, GL_UNSIGNED_BYTE, (uchar*)cvimg->imageData);
// The following glReadPixels DOES NOT WORK :(
glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboId);
//yes, glDrawBuffer is set to the same attachment, and I also checked every possible value
glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
glReadPixels(0, 0, screenWidth, screenHeight, GL_BGR_EXT, GL_UNSIGNED_BYTE, (uchar*)cvimg->imageData);
.
.
glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0); //back to window framebuffer
When using a PBO as target for glReadPixels you have to specify a byte offset into the buffer (0, I suppose) instead of (uchar*)cvimg->imageData as target address. It is similar to using a buffer offset in glVertexPointer when using VBOs.
EDIT: When a PBO is bound to GL_PIXEL_PACK_BUFFER, the last argument to glReadPixels is not treated as a pointer into system memory but as a byte offset into the bound buffer's memory. So to write the pixels into the buffer, just pass 0 (they are written to the start of the buffer's memory). You can then later access the buffer memory (to get the pixels) by means of glMapBuffer. The example you linked in your comment does that, too; just read it thoroughly. I also suggest reading the part about vertex buffer objects mentioned at the start, as these lay the groundwork for understanding buffer objects.
Yes, we can use FBO and PBO together.
Answer 1:
For synchronous reading: glReadPixels without a PBO is fast.
For asynchronous reading: glReadPixels with 2 (or n) PBOs is better: the GPU reads pixels from the framebuffer into PBO (n) while the CPU processes the pixels in PBO (n+1). However, speed is not guaranteed; it is problem- and design-specific.
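A sketch of the two-PBO round robin (names are illustrative):
// frame N starts an async read into pbo[idx] while the CPU processes the
// pixels that arrived in pbo[1 - idx] during frame N-1
static int idx = 0;
idx = 1 - idx;
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[idx]);
glReadPixels(0, 0, screenWidth, screenHeight, GL_BGR, GL_UNSIGNED_BYTE, 0); // async into PBO
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[1 - idx]);
void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY); // last frame's pixels
if (src) {
    // ... process src ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);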
Answer 2:
Christian Rau's explanation is correct, and the revised code is below:
glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pboId);
glReadBuffer(GL_COLOR_ATTACHMENT0_EXT);
//glReadBuffer(GL_DEPTH_ATTACHMENT_EXT);
glReadPixels(0, 0, screenWidth, screenHeight, GL_BGR, GL_UNSIGNED_BYTE, 0);
//GLubyte* src = (GLubyte*)glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
//OR
cvimg->imageData = (char*) glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
if(cvimg->imageData)
{
//Process src OR cvimg->imageData
glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB); // release pointer to the mapped buffer
}
glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);