How could OpenGL buffers' state persist between program runs?

I'm writing an OpenGL program that draws into an Auxiliary Buffer, then the content of the Auxiliary Buffer is accumulated to the Accumulation Buffer before being GL_RETURN-ed to the Back buffer (essentially to be composited to the screen). In short, I'm doing a sort of motion blur. However, the strange thing is that when I recompile and rerun my program, I see the content of the Auxiliary/Accumulation Buffer from the previous program runs. This does not make sense. Am I misunderstanding something? Shouldn't OpenGL's state be completely reset when the program restarts?
I'm writing an SDL/OpenGL program on Gentoo Linux with nVidia drivers 195.36.31 on a GeForce Go 6150.

No - there's no reason for your GPU to ever clear its memory. It's your responsibility to clear out (or initialize) your textures before using them.

Actually, the OpenGL state is initialized to well-defined values.
However, the GL state consists of settings like all binary switches (glEnable), blending, depth test mode... etc, etc. Each of those has its default settings, which are described in OpenGL specs and you can be sure that they will be enforced upon context creation.
The point is, the framebuffer (or texture data or vertex buffers or anything) is NOT a part of what is called "GL state". GL state "exists" in your driver. What is stored in the GPU memory is a totally different thing, and it is uninitialized until you ask the driver (via GL calls) to initialize it. So it's completely possible to have the remains of a previous run in texture memory or even in the frame buffer itself if you don't clear or initialize it at startup.


How to access framebuffer from CPU in Direct3D 11?

I am creating a simple framework for teaching fundamental graphics concepts under C++/D3D11. The framework is required to enable direct manipulation of the screen raster contents via a simple interface function (e.g. Putpixel( x,y,r,g,b )).
Under D3D9 this was a relatively simple goal achieved by allocating a surface buffer on the heap where the CPU would compose a surface. Then the backbuffer would be locked and the heap buffer's contents transferred to the backbuffer. As I understand it, it is not possible to access the backbuffer directly from the CPU under D3D11. One must prepare a texture resource and then draw it to the backbuffer via some fullscreen geometry.
I have considered two systems for such a procedure. The first comprises a D3D11_USAGE_DEFAULT texture and a D3D11_USAGE_STAGING texture. The staging texture is first mapped and then drawn to from the CPU. When the scene is complete, the staging texture is unmapped and copied to the default texture with CopyResource (which uses the GPU to perform the copy if I am not mistaken), and then the default texture is drawn to the backbuffer via a fullscreen textured quad.
The second system comprises a D3D11_USAGE_DYNAMIC texture and a frame buffer allocated on the heap. When the scene is composed, the dynamic texture is mapped, the contents of the heap buffer are copied over to the dynamic texture via the CPU, the dynamic texture is unmapped, and then it is drawn to the backbuffer via a fullscreen textured quad.
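The CPU side of this second system might look roughly like the sketch below. It is only an illustration under stated assumptions, not the framework's actual code: `HeapFrameBuffer` and `CopyToPitched` are hypothetical names, the 32-bit RGBA pixel layout is assumed, and `dst` and `rowPitch` stand in for the `pData` and `RowPitch` fields of the `D3D11_MAPPED_SUBRESOURCE` that mapping the dynamic texture returns. The copy goes row by row because the mapped rows may be padded beyond `width * 4` bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// CPU-side frame buffer kept on the heap (tightly packed, 32-bit pixels).
struct HeapFrameBuffer {
    int width;
    int height;
    std::vector<std::uint32_t> pixels;

    HeapFrameBuffer(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0u) {}

    // Packs the color as 0xAABBGGRR, i.e. R,G,B,A byte order in memory on a
    // little-endian machine (an assumed layout matching an RGBA8 texture).
    void PutPixel(int x, int y, std::uint8_t r, std::uint8_t g, std::uint8_t b) {
        assert(x >= 0 && x < width && y >= 0 && y < height);
        pixels[static_cast<std::size_t>(y) * width + x] =
            0xFF000000u | (std::uint32_t(b) << 16) | (std::uint32_t(g) << 8) | r;
    }
};

// Copies the heap buffer into a destination whose rows are rowPitch bytes
// apart. dst stands in for the pointer obtained by mapping the texture.
inline void CopyToPitched(const HeapFrameBuffer& fb, void* dst, std::size_t rowPitch) {
    const std::size_t rowBytes =
        static_cast<std::size_t>(fb.width) * sizeof(std::uint32_t);
    assert(rowPitch >= rowBytes);
    auto* out = static_cast<std::uint8_t*>(dst);
    for (int y = 0; y < fb.height; ++y) {
        // Write-only, sequential access: nothing is ever read back from dst.
        std::memcpy(out + static_cast<std::size_t>(y) * rowPitch,
                    &fb.pixels[static_cast<std::size_t>(y) * fb.width],
                    rowBytes);
    }
}
```

All drawing happens in ordinary cached memory, and the mapped destination is only ever written, in whole rows.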
I was under the impression that textures created with read and write access and D3D11_USAGE_STAGING would reside in system memory, but the performance tests I have run seem to indicate that this is not the case. Namely, drawing a simple 200x200 filled rectangle via CPU is about 3x slower with the staging texture than with the heap buffer (exact same disassembly for both cases (a tight rep stos loop)), strongly hinting that the staging texture resides in the graphics adapter memory.
I would prefer to use the staging texture system, since it would allow both the work of rendering to the backbuffer and the work of copying from system memory to graphics memory to be offloaded onto the GPU. However, I would like to prioritize CPU access speed over such an ability in any case.
So what method would be optimal for this use case? Any hints, modifications of my two approaches, or suggestions of altogether different approaches would be greatly appreciated.
The dynamic and staging textures are both likely to be in system memory, but there is a good chance that your issue is write-combined memory. It is a cache mode in which single writes are coalesced together, but because the memory is uncached, every read pays the price of a full memory access. You even have to be very careful, because a C++ statement like *data = something; may sometimes also lead to unwanted reads.
There is nothing wrong with a dynamic texture; the GPU can read system memory. But you need to be careful: create a few of them and cycle through them each frame with a map_nooverwrite, to avoid the costly driver buffer renaming caused by the discard. Of course, never map for read and write, only for write, or you will introduce a GPU/CPU sync and kill the parallelism.
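The "create a few of them and cycle each frame" idea could be sketched as a small ring, shown here without any D3D11 calls. `TextureRing` is a hypothetical helper, and `Texture` is a placeholder for whatever wraps the real dynamic texture; the only point is that consecutive frames never hand out the same slot.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hands out N slots in round-robin order, one per frame, so the CPU can fill
// one slot while the GPU may still be reading the previous ones.
template <typename Texture, std::size_t N>
class TextureRing {
    static_assert(N > 0, "the ring needs at least one slot");

public:
    // Returns the slot to write this frame and advances to the next one.
    Texture& Acquire() {
        Texture& slot = slots_[next_];
        next_ = (next_ + 1) % N;
        return slot;
    }

private:
    std::array<Texture, N> slots_{};
    std::size_t next_ = 0;
};
```

With three slots, a slot is reused only after two other frames have been submitted in between.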
Last, if you want a persistent surface and only a few putpixel calls per frame (or even a lot of them), I would go with an unordered access view and a compute shader that consumes a buffer of pixel positions and colors to update. That buffer would be a dynamic buffer with nooverwrite mapping, once again. With that solution, the main surface will reside in video memory.
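The data flow of that last suggestion can be modelled on the CPU as below. This is a toy stand-in and not shader code: `PixelUpdate`, `Surface`, and `ApplyUpdates` are hypothetical names, and the loop plays the role of a compute dispatch in which each thread would handle one queued update and write it through the unordered access view.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One queued putpixel request: position plus packed 32-bit color.
struct PixelUpdate {
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t color;
};

// Stand-in for the persistent surface that would live in video memory.
struct Surface {
    std::uint32_t width;
    std::uint32_t height;
    std::vector<std::uint32_t> pixels;

    Surface(std::uint32_t w, std::uint32_t h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0u) {}
};

// CPU model of one dispatch: every queued update is applied independently,
// out-of-range requests are skipped, untouched pixels keep their old value.
inline void ApplyUpdates(Surface& surface, const std::vector<PixelUpdate>& updates) {
    for (const PixelUpdate& u : updates) {
        if (u.x < surface.width && u.y < surface.height) {
            surface.pixels[static_cast<std::size_t>(u.y) * surface.width + u.x] = u.color;
        }
    }
}
```

Only the small update list crosses from CPU to GPU each frame; the surface itself is never uploaded again.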
On a personal note, I would not even bother to teach CPU surface manipulation. It is almost always a bad practice and a performance killer, and not the way to go on a modern GPU architecture. It had already stopped being a fundamental graphics concept a decade ago.

Common OpenGL cleanup operation WITHOUT destroying context

Recently, I've been doing offscreen GPU acceleration for my real-time program.
I want to create a context and reuse it several times (100+). And I'm using OpenGL 2.1 and GLSL version 1.20.
Each time I reuse the context, I'm going to do the following things:
Compile shaders, link the program, then call glUseProgram (Question 1: should I relink the program or re-create the program each time?)
Generate an FBO and a texture, then bind them so I can do offscreen rendering. (Question 2: should I destroy the FBO and texture?)
Generate a GL_ARRAY_BUFFER and put some vertex data in it. (Question 3: do I even need to clean this up?)
Call glDrawArrays, and so on.
Call glFinish() then copy data from GPU to CPU by calling glReadPixels.
And is there any other necessary cleanup operation that I should consider?
If you can somehow cache or otherwise keep the OpenGL object IDs, then you should not delete them and should instead just reuse them on the next run. Unless you acquire new IDs, reusing the old ones will either replace the existing objects (properly releasing their allocations) or just change their data.
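The "keep the IDs and reuse them" advice could be structured roughly as follows. This sketch makes no OpenGL calls: `OffscreenResources` and `GetOrCreate` are hypothetical names, and the `create` callback is where the real object generation (shader program, FBO, texture, vertex buffer) is assumed to happen exactly once per context.

```cpp
#include <cassert>

// Object IDs that stay valid for as long as the context lives.
struct OffscreenResources {
    unsigned program = 0;
    unsigned fbo = 0;
    unsigned texture = 0;
    unsigned vertexBuffer = 0;
    bool created = false;
};

// Runs create(res) only on the first call; later runs get the cached IDs
// back and can simply re-upload data into the existing objects.
template <typename CreateFn>
OffscreenResources& GetOrCreate(OffscreenResources& res, CreateFn create) {
    if (!res.created) {
        create(res);
        res.created = true;
    }
    return res;
}
```

Each of the 100+ runs would then call `GetOrCreate` at the start and skip compilation, linking, and object generation on every run but the first.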
The call to glFinish before glReadPixels is superfluous, because glReadPixels causes an implicit synchronization and finish.

Cuda and/or OpenGL for geometric image transformation

My question concerns the most efficient way of performing geometric image transformations on the GPU. The goal is essentially to remove lens distortion from acquired images in real time. I can think of several ways to do it, e.g. as a CUDA kernel (which would be preferable) doing an inverse transform lookup + interpolation, or the same in an OpenGL shader, or rendering a forward transformed mesh with the image texture mapped to it. It seems to me the last option could be the fastest because the mesh can be subsampled, i.e. not every pixel offset needs to be stored but can be interpolated in the vertex shader. Also the graphics pipeline really should be optimized for this. However, the rest of the image processing is probably going to be done with CUDA. If I want to use the OpenGL pipeline, do I need to start an OpenGL context and bring up a window to do the rendering, or can this be achieved anyway through the CUDA/OpenGL interop somehow? The aim is not to display the image, the processing will take place on a server, potentially with no display attached. I've heard this could crash OpenGL if bringing up a window.
I'm quite new to GPU programming, any insights would be much appreciated.
Using the forward transformed mesh method is the more flexible and easier one to implement. However, performance-wise there's no big difference, as the effective limit you're running into is memory bandwidth, and the amount of memory bandwidth consumed depends only on the size of your input image. Whether the transfer is caused by a fragment shader fed by vertices or by a CUDA texture access doesn't matter.
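For reference, the "inverse transform lookup + interpolation" variant from the question can be sketched on the CPU as below; on a GPU each output pixel would map to one thread or fragment. The one-coefficient radial model and its parameter `k1` are assumptions chosen purely for illustration (a real lens needs a calibrated model), and `GrayImage`, `SampleBilinear`, and `Undistort` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct GrayImage {
    int width;
    int height;
    std::vector<float> data;

    GrayImage(int w, int h)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, 0.0f) {}

    // Clamp-to-edge fetch, similar to a clamped texture address mode.
    float At(int x, int y) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Bilinear interpolation between the four texels surrounding (x, y).
inline float SampleBilinear(const GrayImage& img, float x, float y) {
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const float fx = x - static_cast<float>(x0);
    const float fy = y - static_cast<float>(y0);
    const float top = img.At(x0, y0) * (1.0f - fx) + img.At(x0 + 1, y0) * fx;
    const float bottom = img.At(x0, y0 + 1) * (1.0f - fx) + img.At(x0 + 1, y0 + 1) * fx;
    return top * (1.0f - fy) + bottom * fy;
}

// For every output pixel, compute where it came from in the distorted input
// and sample there. Assumed model: r_distorted = r * (1 + k1 * r^2), with r
// measured from the image center in normalized units.
inline GrayImage Undistort(const GrayImage& src, float k1) {
    GrayImage dst(src.width, src.height);
    const float cx = 0.5f * static_cast<float>(src.width - 1);
    const float cy = 0.5f * static_cast<float>(src.height - 1);
    const float norm = std::max(1.0f, std::max(cx, cy));
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            const float ux = (static_cast<float>(x) - cx) / norm;
            const float uy = (static_cast<float>(y) - cy) / norm;
            const float scale = 1.0f + k1 * (ux * ux + uy * uy);
            dst.data[static_cast<std::size_t>(y) * src.width + x] =
                SampleBilinear(src, cx + ux * scale * norm, cy + uy * scale * norm);
        }
    }
    return dst;
}
```

With `k1 = 0` the mapping is the identity, which makes a convenient sanity check before plugging in a real calibration.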
If I want to use the OpenGL pipeline, do I need to start an OpenGL context and bring up a window to do the rendering,
On Windows: Yes, but the window can be an invisible one.
On GLX/X11 you need an X server running, but you can use a PBuffer instead of a window to get an OpenGL context.
In either case use a Framebuffer Object as the actual drawing destination. PBuffers may corrupt their primary framebuffer contents at any time. A Framebuffer Object is safe.
or can this be achieved anyway through the CUDA/OpenGL interop somehow?
No, because CUDA/OpenGL interop is for making OpenGL and CUDA interoperate, not make OpenGL work from CUDA. CUDA/OpenGL Interop helps you with the part you mentioned here:
However, the rest of the image processing is probably going to be done with CUDA.
BTW, maybe OpenGL Compute Shaders (available since OpenGL 4.3) would work for you as well.
I've heard this could crash OpenGL if bringing up a window.
OpenGL actually has no say in those things. It's just an API for drawing stuff on a canvas (canvas = window or PBuffer or Framebuffer Object), but it doesn't deal with actually getting a canvas on the scaffolding, so to speak.
Technically OpenGL doesn't care if there's a window or not. It's the graphics system on which the OpenGL context is created that cares. And unfortunately none of the currently existing GPU graphics systems supports true headless operation. NVidia's latest Linux drivers may allow for some crude hacks to set up a truly headless system, but I have never tried that so far.

glEnableClientState and glDisableClientState of OpenGL

What is the meaning of glEnableClientState and glDisableClientState in OpenGL?
So far I've found that these functions are to enable or disable some client side capabilities.
Well, what exactly is the client or server here?
I am running my OpenGL program on a PC, so what is this referring to?
Why do we even need to disable certain capabilities? ...and more intriguing it's about some sort of an array related thing?
The whole picture is very gray to me.
The original terminology stems from the X11 notation, where the server is the actual graphics display system:
A server program providing access to some kind of display device
Clients connecting to the server to draw on the display device provided by it
glEnableClientState and glDisableClientState set state of the client side part. Vertex Arrays used to be located in the client process memory, so drawing using vertex arrays was a client local process.
Today we have Buffer Objects, that place the data in server memory, rendering the whole client side terminology of vertex arrays counterintuitive. It would make sense to discard client states and enable/disable vertex arrays through the usual glEnable/glDisable functions, like we do with framebuffer objects and textures.
If you draw your graphics by passing buffers to OpenGL (glVertexPointer(), etc) instead of direct calls (glVertex3f()), you need to tell OpenGL which buffers to use.
So instead of calling glVertex and glNormal, you'd create buffers, bind them, and use glVertexPointer and glNormalPointer to point OpenGL at your data. Afterwards a call to glDrawElements (or the like) will use those buffers to do the drawing. However, one other required step is to tell the OpenGL driver which buffers you actually want to use, which is where glEnableClientState() comes in.
This is all very hand-wavy. You need to read up on vertex buffer objects and try them out.
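To make the idea less hand-wavy, here is a toy model of the vertex-array client state in plain C++. It does not call OpenGL at all: `ToyClientState` and its members are hypothetical names that loosely mirror `glEnableClientState(GL_VERTEX_ARRAY)`, `glVertexPointer`, and `glDrawArrays`. The model captures two points: the pointer refers to memory owned by the application (the client), and a draw call only consumes the array while it is enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of the vertex-array part of the client state.
struct ToyClientState {
    bool vertexArrayEnabled = false;
    const float* vertexPointer = nullptr;  // points into application memory
    int componentsPerVertex = 0;

    void EnableVertexArray() { vertexArrayEnabled = true; }
    void DisableVertexArray() { vertexArrayEnabled = false; }

    // Remembers where the data lives; nothing is copied at this point.
    void VertexPointer(int size, const float* pointer) {
        componentsPerVertex = size;
        vertexPointer = pointer;
    }

    // Returns the floats a draw call would read. While the array is disabled
    // (or no pointer is set) nothing is read and nothing would be drawn.
    std::vector<float> DrawArrays(int first, int count) const {
        std::vector<float> consumed;
        if (!vertexArrayEnabled || vertexPointer == nullptr) {
            return consumed;
        }
        assert(first >= 0 && count >= 0 && componentsPerVertex > 0);
        const float* begin =
            vertexPointer + static_cast<std::size_t>(first) * componentsPerVertex;
        consumed.assign(begin, begin + static_cast<std::size_t>(count) * componentsPerVertex);
        return consumed;
    }
};
```

Because the data is read only at draw time, changes the application makes to its own array after `VertexPointer` are still picked up by the next draw.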
In OpenGL terminology, the client is your application, whereas the server is the graphics card (or the driver), I think. The only client-side capabilities are the vertex arrays, as these are stored in CPU memory and therefore on the client-side or more specifically, they are controlled (allocated and freed) by your application and not by the driver.
Vertex buffer objects are a different story. They can be used as vertex arrays, but are controlled by the driver, so the word "client state" doesn't make so much sense anymore when working with buffers.
glEnableClientState and glDisableClientState are mainly used to manage Vertex Arrays and Vertex Buffer Objects.

Self-Referencing Renderbuffers in OpenGL

I have some OpenGL code that behaves inconsistently across different
hardware. I've got some code that:
Creates a render buffer and binds a texture to its color buffer (Texture A)
Sets this render buffer as active, and adjusts the viewport, etc
Activates a pixel shader (gaussian blur, in this instance).
Draws a quad to full screen, with texture A on it.
Unbinds the renderbuffer, etc.
On my development machine this works fine, and has the intended
effect of blurring the texture "in place", however on other hardware
this does not seem to work.
I've gotten it down to two possibilities.
A) Making a renderbuffer render to itself is not supposed to work, and
only works on my development machine due to some sort of fluke.
B) This approach should work, but something else is going wrong.
Any ideas? Honestly I have had a hard time finding specifics about this issue.
A) is the correct answer. Rendering into the same buffer while reading from it is undefined. It might work, it might not - which is exactly what is happening.
In OpenGL's case, framebuffer_object extension has section "4.4.3 Rendering When an Image of a Bound Texture Object is Also Attached to the Framebuffer" which tells what happens (basically, undefined). In Direct3D9, the debug runtime complains loudly if you use that setup (but it might work depending on hardware/driver). In D3D10 the runtime always unbinds the target that is used as destination, I think.
Why is this undefined? One of the reasons GPUs are so fast is that they can make a lot of assumptions. For example, they can assume that units that fetch pixels do not need to communicate with units that write pixels. So a surface can be read, N cycles later the read is completed, N cycles later the pixel shader ends its execution, then the result is put into some output merge buffers on the GPU, and finally at some point it is written to memory. On top of that, GPUs rasterize in "undefined" order (one GPU might rasterize in rows, another in some cache-friendly order, another in a totally different order), so you don't know which portions of the surface will be written to first.
So what you should do is create several buffers. In blur/glow case, two is usually enough - render into first, then read & blur that while writing into second. Repeat this process if needed in ping-pong way.
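The ping-pong scheme can be illustrated on the CPU with two plain arrays standing in for the two render targets. This is a minimal sketch, not GPU code: `BlurPass` and `PingPongBlur` are hypothetical names, and a 3-tap box filter on a 1D row replaces the Gaussian blur to keep it short. What matters is that every pass reads only from one buffer and writes only to the other.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Row = std::vector<float>;

// One blur pass: reads only from src, writes only to dst
// (3-tap box filter with clamped edges).
inline void BlurPass(const Row& src, Row& dst) {
    assert(&src != &dst);  // reading and writing the same buffer is the bug to avoid
    assert(src.size() == dst.size());
    const std::size_t n = src.size();
    for (std::size_t i = 0; i < n; ++i) {
        const float left = src[i == 0 ? 0 : i - 1];
        const float right = src[i + 1 < n ? i + 1 : i];
        dst[i] = (left + src[i] + right) / 3.0f;
    }
}

// Repeats the pass, swapping the roles of the two buffers each time.
inline Row PingPongBlur(Row image, int passes) {
    Row scratch(image.size());
    Row* src = &image;
    Row* dst = &scratch;
    for (int pass = 0; pass < passes; ++pass) {
        BlurPass(*src, *dst);
        std::swap(src, dst);  // the buffer just written becomes the next input
    }
    return *src;
}
```

In the OpenGL version the swap would correspond to changing which texture is attached as the render target and which one is bound for sampling.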
In some special cases, even the backbuffer might be enough. You simply don't do a glClear, and what you have drawn previously is still there. The caveat is, of course, that you can't really read from the backbuffer. But for effects like fading in and out, this works.