How to speed up offscreen OpenGL rendering with large textures on Win32?

I'm developing some C++ code that can do some fancy 3D transition effects between two images, for which I thought OpenGL would be the best option.
I start with a DIB section and set it up for OpenGL, and I create two textures from input images.
Then for each frame I draw just two OpenGL quads, with the corresponding image texture.
The DIB content is then saved to file.
For example, one effect is to place the two quads in 3D space like two billboards, one in front of the other (obscuring it), and then swoop the camera up, forward and down so you can see the second one.
My input images are 1024x768 or so and it takes a really long time to render (100 milliseconds) when the quads cover most of the view. It speeds up if the camera is far away.
I tried rendering each image quad as hundreds of individual tiles, but it takes just as long; the time seems to depend on the number of visible textured pixels.
I assumed OpenGL could do zillions of polygons a second. Is there something I am missing here?
Would I be better off using some other approach?
The GL strings show up for the DIB version as :
Vendor : Microsoft Corporation
Version: 1.1.0
Renderer : GDI Generic
The Onscreen version shows :
Vendor : ATI Technologies Inc.
Version : 3.2.9756 Compatibility Profile Context
Renderer : ATI Mobility Radeon HD 3400 Series
So I guess I'll have to use FBOs. I'm a bit confused as to how to get the rendered data out of the FBO and onto a DIB. Any pointers (pun intended) on that?

It sounds like rendering to a DIB is forcing the rendering to happen in software. I'd render to a frame buffer object, and then extract the data from the generated texture. has a pretty decent tutorial.
Keep in mind, however, that graphics hardware is oriented primarily toward drawing on the screen. Capturing rendered data will usually be slower than displaying it, even when you do get the hardware to do the rendering -- though it should still be quite a bit faster than software rendering.
Edit: Dominik Göddeke has a tutorial that includes code for reading back texture data to CPU address space.
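For the FBO-to-DIB step asked about in the question, the pixels read back from the GPU still have to match the DIB's byte layout. Here is a minimal sketch, assuming the pixels were read with glReadPixels as tightly packed GL_RGBA / GL_UNSIGNED_BYTE data and that the target is a 32-bit bottom-up DIB section (bytes stored as B, G, R, A, rows stored bottom to top, which is the same row order OpenGL returns). Under those assumptions only the red and blue channels need swapping. The helper name rgba_to_bgra is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: swap the R and B bytes of every pixel in place,
// turning tightly packed RGBA8 data into the BGRA8 order a 32-bit DIB uses.
void rgba_to_bgra(std::vector<std::uint8_t>& pixels) {
    assert(pixels.size() % 4 == 0);  // 4 bytes per pixel expected
    for (std::size_t i = 0; i + 3 < pixels.size(); i += 4) {
        std::swap(pixels[i], pixels[i + 2]);  // R <-> B; G and A stay put
    }
}
```

If the driver accepts GL_BGRA as the read-back format, this pass can probably be skipped altogether. After the swap, the buffer can be copied into the DIB's bits with a plain memcpy.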

One problem with your question:
You provided no actual rendering/texture generation code.
The simplest thing you can do is to make sure your texture dimensions are powers of two, i.e. instead of 1024x768 use 1024x1024 and use only part of that texture. Explanation: although most modern hardware supports non-power-of-two textures, they are sometimes treated as a "special case", and using such a texture MAY cause a performance drop on some hardware.
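A small sketch of the padding arithmetic, in case it helps: it computes the padded texture size and the texture coordinates of the image's far corner, so the quad samples only the part of the texture that holds the image. The names next_pow2, PaddedTexture and pad_to_pow2 are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two that is >= n (valid for 1 <= n <= 2^31).
std::uint32_t next_pow2(std::uint32_t n) {
    assert(n >= 1 && n <= 0x80000000u);
    std::uint32_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

struct PaddedTexture {
    std::uint32_t width, height;  // allocated (power-of-two) size
    float u_max, v_max;           // texture coordinates of the image's far corner
};

// Pad an image of w x h pixels up to power-of-two dimensions.
PaddedTexture pad_to_pow2(std::uint32_t w, std::uint32_t h) {
    const std::uint32_t pw = next_pow2(w);
    const std::uint32_t ph = next_pow2(h);
    return {pw, ph,
            static_cast<float>(w) / static_cast<float>(pw),
            static_cast<float>(h) / static_cast<float>(ph)};
}
```

For a 1024x768 image this gives a 1024x1024 texture, and the quad would use texture coordinates from (0, 0) to (1.0, 0.75) instead of (1, 1).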
Yes, you're missing one important thing. There are a few things that limit GPU performance:
1. System memory to video memory transfer rate (probably not your case - only for dynamic textures/geometry when data changes every frame).
2. Computation cost. (If you write a shader with heavy computations, it will be slow).
3. Fill rate (how many pixels a program can put on screen per second), AFAIK depends on memory speed on modern GPUs.
4. Vertex processing rate (not your case) - how many vertices the GPU can process per second.
5. Texture read rate (how many texels per second the GPU can read), on modern GPUs depends on GPU memory speed.
6. Texture read caching (not your case) - i.e. in a fragment shader you can read a texture a few hundred times per pixel with little performance drop IF the coordinates are very close to each other (i.e. almost the same texel in each read) - because results are cached. But performance will drop significantly if you try to access 100 randomly located texels for every pixel.
All those characteristics are hardware dependent.
I.e., depending on the hardware, you may be able to render 1,500,000 polygons per frame (if they take up a small amount of screen space), but you can bring the frame rate to its knees with 100 polygons if each polygon fills the entire screen, uses alpha blending and is textured with a highly detailed texture.
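To put the fill-rate point in numbers, here is a back-of-the-envelope sketch. It assumes every quad covers the whole view and that each covered pixel is shaded once per layer; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Fragments shaded per frame: covered pixels times the number of
// full-view layers drawn on top of each other.
std::uint64_t fragments_per_frame(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t layers) {
    return static_cast<std::uint64_t>(width) * height * layers;
}

// Achieved fill rate in fragments per second for a frame time in milliseconds.
double fill_rate(std::uint64_t fragments, double frame_ms) {
    assert(frame_ms > 0.0);
    return static_cast<double>(fragments) * 1000.0 / frame_ms;
}
```

Plugging in the numbers from the question (two 1024x768 quads, 100 ms per frame) gives roughly 15.7 million fragments per second. Splitting the quads into hundreds of tiles does not change that total, which would explain why tiling made no difference.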
If you think about it, you may notice that there are a lot of videocards that can draw a landscape, but fps drops when you're doing framebuffer effects (like blur, HDR, etc).
Also, you may get a performance drop with textured surfaces if you have a built-in GPU. When I fried the PCIe slot on my previous motherboard, I had to work with the built-in GPU (NVidia 6800 or something). The results weren't pleasant. While the GPU supported shader model 3.0 and could run relatively computationally expensive shaders, fps dropped rapidly whenever there was a textured object on screen. This obviously happened because the built-in GPU used part of system memory as video memory, and transfer rates for dedicated GPU memory and system memory are different.


How can I draw to the display, without OpenGL?

I've been learning OpenGL, and as I sit trying to write my VBOs, PBOs, VAOs, textures, quads, bindings, fragment shaders, vertex shaders, and a whole suite of other modern abstractions upon abstractions built after decades of evolution, I wonder: Isn't the display nothing but a large block of memory?
I've heard of tales, that in the "good ol' days" (such as the Commodore 64), all you had to do was assign a value to an arbitrary byte in memory, and the screen would change a pixel. Extremely simple and elegant. In the modern day, this has changed with layers upon layers of abstractions and safeguards, such that changing a pixel on your display is several hundred feet away.
This begs the question, is it possible in the modern day to just "update a pixel of the screen"? Is it possible to write my own graphics driver or something, where I can send commands to some C wrapper which interfaces with the GPU to change those pixels? This is an extremely broad question, but I'm curious. The answer I'm looking for to this question would provide a rough outline of what you'd have to do in order to be able to arbitrarily get some C code to set a pixel on the screen, as well as a rough outline of why OpenGL has progressed the way it has - what problems did VBOs, PBOs, VAOs, bindings, shaders, etc. solve, and how we got to where we are today.
Yes, it is called a framebuffer.
Your current PC works like that right when you power it up! If you use the CPU to write into video memory, that is called a software renderer.
No, they are not abstractions/safeguards for "changing pixels". Nowadays software renderers are not used anymore. Instead, you have to tell the GPU (which is another computer on its own) how to draw. That "talk" is what the APIs (like OpenGL) do for you.
Now, the GPUs are meant to be fast at drawing, and that requires specialized code and data structures. Those are all the things you mention: VBOs, PBOs, VAOs, shaders, etc. (in OpenGL parlance). There is no way around that, because GPUs are different hardware.
Yes, but that will end up being drawn somehow by the GPU, even if it looks to you like a memory write.
Yes, but that "C wrapper" is the graphics driver. A graphics driver for a modern GPU is very complex.
You cannot write a "C program" to write to a graphical screen because the C standard does not concern itself with graphical displays.
So it depends on your operating system, your hardware, whether you want 2D or 3D acceleration support, the API you choose...
See above.
You can make your own frame buffer - that is just an integer array - and do rasterization on it, then use for example the Windows GDI function SetBitmapBits() to draw it to the display in one go. The final draw-to-display command depends on the operating system.
How you do the rasterization on your framebuffer is completely up to you. You can use the CPU to draw individual pixels or rasterize lines and triangles, see for example this demo of my old CPU graphics engine using Windows GDI:
Using the CPU is fine as long as you do not rasterize large datasets. From my experience, the limit to real-time 60fps rendering on the CPU is ~50k lines per frame.
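As a rough illustration of the "framebuffer is just an integer array" idea, here is a minimal CPU sketch with a clipped set_pixel and a Bresenham-style integer line rasterizer. The struct and function names are my own, the 0xAARRGGBB pixel layout is an assumption, and the final step of handing the array to the operating system (GDI on Windows) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct Framebuffer {
    int width, height;
    std::vector<std::uint32_t> pixels;  // one packed color per pixel, row-major

    Framebuffer(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {
        assert(w > 0 && h > 0);
    }

    // Write one pixel; coordinates outside the buffer are silently clipped.
    void set_pixel(int x, int y, std::uint32_t color) {
        if (x < 0 || y < 0 || x >= width || y >= height) return;
        pixels[static_cast<std::size_t>(y) * width + x] = color;
    }
};

// Integer (Bresenham-style) line from (x0, y0) to (x1, y1), inclusive.
void draw_line(Framebuffer& fb, int x0, int y0, int x1, int y1,
               std::uint32_t color) {
    const int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    const int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;  // combined error term for both axes
    while (true) {
        fb.set_pixel(x0, y0, color);
        if (x0 == x1 && y0 == y1) break;
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  // step in x
        if (e2 <= dx) { err += dx; y0 += sy; }  // step in y
    }
}
```

Everything here runs on the CPU, so the cost grows with the number of pixels touched, which is where the practical limit on lines per frame comes from.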
If you want to rasterize really large datasets, you have to use a GPU in some way. Since the framebuffer is just an integer array, you can transfer it to/from the GPU using OpenCL or CUDA and on the GPU - if your dataset happens to already be in video memory - do all the rasterization extremely fast in parallel. For this you will need an additional z-buffer to decide which pixels to overdraw by occluding geometries. This way you can rasterize approximately 30 Million lines per frame at 60fps. This demo is rendered on the GPU in real time using OpenCL:
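Independently of that demo, the z-buffer idea itself fits in a few lines. This is a single-threaded CPU sketch, assuming that a smaller z means "closer to the camera"; the names are made up, and a parallel OpenCL or CUDA version would likely need atomic operations to make the compare-and-write safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct DepthFramebuffer {
    int width, height;
    std::vector<std::uint32_t> color;  // packed color per pixel
    std::vector<float> depth;          // nearest depth seen so far per pixel

    DepthFramebuffer(int w, int h)
        : width(w), height(h),
          color(static_cast<std::size_t>(w) * h, 0),
          depth(static_cast<std::size_t>(w) * h,
                std::numeric_limits<float>::infinity()) {
        assert(w > 0 && h > 0);
    }

    // Write the pixel only if it is nearer than what is already stored.
    // Returns true if the pixel was written.
    bool plot(int x, int y, float z, std::uint32_t c) {
        if (x < 0 || y < 0 || x >= width || y >= height) return false;
        const std::size_t i = static_cast<std::size_t>(y) * width + x;
        if (z >= depth[i]) return false;  // occluded by nearer geometry
        depth[i] = z;
        color[i] = c;
        return true;
    }
};
```

With this test in place, geometry can be drawn in any order and the nearest surface still wins at each pixel.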
Yes. In Windows for example, you can use SetPixel() to draw a pixel or BitBlt() to draw in bulk. See this Q/A
This works fine, but this means you're using the CPU for rendering and you'll find the GPU is much more effective for this task, especially if you require a decent framerate and non-trivial graphics. The reason there are these "whole suite of other modern abstractions upon abstractions" is to serve as an interface to the GPU, since it has an independent set of memory and a totally different execution model. Other GPU libraries (OpenCL, DirectX, Vulkan, etc.) all have the same kind of abstractions.
I've glossed over many nuances but I hope the point gets across.

What is the difference between clearing the framebuffer using glClear and simply drawing a rectangle to clear the framebuffer?

I think at least some old graphics drivers used to crash if glClear wasn't used and that glClear is probably faster in a lot of cases but why? How are 3-d graphics drivers usually implemented such that these uses would have different results?
On a high level, it can be faster because the OpenGL implementation knows ahead of time that the whole buffer needs to be set to the same color/value. The more you know about what exactly needs to be done, the more you can take advantage of possible accelerations.
Let's say setting a whole buffer to the same value is more efficient than setting the same pixels to variable values. With a glClear(), you know already that all pixels will have the same value. If you draw a screen sized quad with a fragment shader that emits a constant color, the driver would either have to recognize that situation by analyzing the shaders, or the system would have to compare the values coming out of the shader, to know that all pixels have the same value.
The reason why setting everything to the same value can be more efficient has to do with framebuffer compression and related technologies. GPUs often don't actually write each pixel out to the framebuffer, but use various kinds of compression schemes to reduce the memory bandwidth needed for framebuffer writes. If you imagine almost any kind of compression, all pixels having the same value is very favorable.
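To make the idea concrete, here is a toy model of a "fast clear". It is only an illustration of the principle and not how any particular driver or GPU works: the buffer is split into 8x8 tiles (an arbitrary choice), a clear only sets one flag per tile, and a tile's pixels are actually filled in the first time something is drawn into it. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class TiledBuffer {
public:
    static constexpr int kTile = 8;  // tile edge in pixels (arbitrary)

    TiledBuffer(int w, int h)
        : w_(w), h_(h),
          tiles_x_((w + kTile - 1) / kTile),
          pixels_(static_cast<std::size_t>(w) * h, 0),
          cleared_(static_cast<std::size_t>(tiles_x_) *
                       ((h + kTile - 1) / kTile), 0) {}

    // "Fast clear": touch one flag per tile instead of every pixel.
    // Returns the number of flags written.
    std::size_t clear(std::uint32_t color) {
        clear_color_ = color;
        std::fill(cleared_.begin(), cleared_.end(), 1);
        return cleared_.size();
    }

    // Ordinary draw: the tile is filled with the clear color on first touch.
    void write(int x, int y, std::uint32_t color) {
        assert(x >= 0 && y >= 0 && x < w_ && y < h_);
        const std::size_t t = tile_of(x, y);
        if (cleared_[t]) {
            const int x0 = (x / kTile) * kTile, y0 = (y / kTile) * kTile;
            for (int ty = y0; ty < std::min(y0 + kTile, h_); ++ty)
                for (int tx = x0; tx < std::min(x0 + kTile, w_); ++tx)
                    pixels_[static_cast<std::size_t>(ty) * w_ + tx] = clear_color_;
            cleared_[t] = 0;
        }
        pixels_[static_cast<std::size_t>(y) * w_ + x] = color;
    }

    std::uint32_t read(int x, int y) const {
        assert(x >= 0 && y >= 0 && x < w_ && y < h_);
        return cleared_[tile_of(x, y)]
                   ? clear_color_
                   : pixels_[static_cast<std::size_t>(y) * w_ + x];
    }

private:
    std::size_t tile_of(int x, int y) const {
        return static_cast<std::size_t>(y / kTile) * tiles_x_ + x / kTile;
    }

    int w_, h_, tiles_x_;
    std::vector<std::uint32_t> pixels_;
    std::vector<std::uint8_t> cleared_;  // 1 = tile still holds the clear color
    std::uint32_t clear_color_ = 0;
};
```

In this model, clearing a 1024x768 buffer writes 12,288 flags instead of 786,432 pixels. A full-screen quad cannot take that shortcut unless something proves that every pixel it produces has the same value.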
To give you some ideas about the published vendor specific technologies, here are a few sources. You can probably find more with a search.
Article talking about new framebuffer compression method in relatively recent AMD cards:
NVIDIA patent on zero bandwidth clears:
Blurb on ARM web site about Mali framebuffer compression:
Why is it faster? Because it is a function that bypasses most calculations that other types of drawings have to go through.
Alpha function, blend function, logical operation, stenciling, texture mapping, and depth-buffering are ignored by glClear.
Why do some drivers crash without it? It's hard to say, but it should have something to do with the implementation details of OpenGL. The function does what it's supposed to do, but might do more than you know about.
OpenGL might infer from this function call other tasks that it needs to perform.

OpenGL vector graphics rendering performance on mobile devices

It is generally advised not to use vector graphics in mobile games, or to pre-rasterize them, for performance. Why is that? I thought that OpenGL is at least as good at drawing lines / triangles as rendering images on screen...
Rasterizing them caches them as images so less overhead takes place vs calculating every coordinate for vector and drawing (more draw cycles and more cpu usage). Drawing a vector is exactly that, you are drawing arcs from point to point on every single call vs displaying an image at a certain coordinate with a cached image file.
Although using impostors is a great optimization trick, depending on the impostor's shape, how much overdraw is involved and whether you need blending in the process, the trick can make you fill-rate bound. Also, in some scenarios where shapes may change, caching the graphics into impostors may not be feasible or may incur other overheads. It is a matter of balancing your rendering pipeline.
The answer depends on the hardware. Are you using a GPU or NOT?
Today, modern mobile devices running Android and iOS have a GPU embedded in the chipset.
These GPUs are very good with vector graphics. To prove this point, most GPUs have a dedicated geometry processor in addition to one or more pixel processors (for example, the Mali-400 GPU).
For example, let's say you want to draw 200 transparent circles of different colors.
If you do it with modern OpenGL, you will only need one set of geometry (a list of triangles forming a circle) and a list of parameters for each circle, let's say position and color. If you provide this information to the GPU, it will draw it in parallel very quickly.
If you do it using a different texture for each color, your program will be very heavy (in storage size) and will probably be slower due to memory bandwidth problems.
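The shared-geometry approach can be sketched on the CPU like this: the unit circle is built once as a triangle fan, and each circle on screen is described only by a small parameter record. The function place_vertex stands in for what a vertex shader would do per vertex on the GPU. All names here are made up for illustration, and the actual OpenGL upload and draw calls are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Unit circle as a triangle fan: the center first, then segments + 1 rim
// points (the last rim point repeats the first to close the fan).
std::vector<Vec2> make_unit_circle_fan(int segments) {
    assert(segments >= 3);
    std::vector<Vec2> fan;
    fan.reserve(static_cast<std::size_t>(segments) + 2);
    fan.push_back({0.0f, 0.0f});
    const float two_pi = 6.28318530718f;
    for (int i = 0; i <= segments; ++i) {
        const float a = two_pi * static_cast<float>(i) / static_cast<float>(segments);
        fan.push_back({std::cos(a), std::sin(a)});
    }
    return fan;
}

// Per-circle parameters: this is all that differs between the 200 circles.
struct CircleInstance {
    Vec2 center;
    float radius;
    std::uint32_t rgba;  // packed color, including alpha for transparency
};

// Stand-in for the vertex shader: scale and move a unit-circle vertex.
Vec2 place_vertex(Vec2 unit, const CircleInstance& inst) {
    return {inst.center.x + unit.x * inst.radius,
            inst.center.y + unit.y * inst.radius};
}
```

The geometry is stored once no matter how many circles are drawn; only the list of CircleInstance records grows with the circle count.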
It depends on what you want to do, and the hardware. If your hardware doesn't have a GPU you probably should pre-render your graphics.

GPU programming for image processing

I'm working on a project aimed at controlling a biped humanoid robot. Unfortunately we have a very limited set of hardware resources (an RB110 board and its mini PCI graphics card). I'm planning to port image processing tasks from the CPU to the graphics card processor if possible, but I have never done it before... I've been advised to use OpenCV, but that seems to be impossible because our graphics card processor (Volari Z9s) is not supported by the framework. Then I found an interesting post on Linux Journal. The author used OpenGL to process frames retrieved from a v4l device.
I'm a little confused about the relationship between the hardware API and OpenGL/OpenCV. In order to utilize a GPU, does the hardware need to be supported by graphics programming frameworks (OpenGL/OpenCV)? Where can I find such an API?
I googled a lot about my hardware, unfortunately the vendor (XGI Technology) seems to be somehow extinct...
OpenCL and OpenGL are both translated to hardware instructions by the GPU driver, so you need a driver for your operating system that supports these frameworks. Most GPU drivers support some version of OpenGL so that should work.
The OpenGL standard is maintained by the Khronos Group and you can find some tutorials at nehe.
How OpenGL works
OpenGL accepts triangles as input and draws them according to the state it has when the draw is issued. Most OpenGL functions are there to change the operations performed by manipulating this state. Image manipulation can be done by loading the input image as a texture and drawing several vertices with the texture active, resulting in a new image (or, more generically, a new 2D grid of data).
From version 2.0 onward (or with the right ARB extensions), the operations performed on the image can be controlled with GLSL programs called vertex and fragment shaders (there are more shaders, but these are the oldest). A vertex shader is called once per vertex; its results are interpolated and forwarded to the fragment shader. A fragment shader is called every time a new fragment (pixel) is written to the result.
Now this is all about reading and writing images, how to use it for object detection?
Use vertices to span the input texture over the whole viewport. Instead of computing RGB colors and storing them in the result, you can write a fragment shader that computes grayscale images / gradient images and then checks these textures, for each pixel, to see whether the pixel is in the center of a circle of a specific size, part of a line, or just has a relatively high gradient compared to its surroundings (a good feature), or really anything else you can find a good parallel algorithm for. (I haven't done this myself.)
The end result has to be read back to the CPU (sometimes you can use shaders to scale the data down before doing this). OpenCL feels less graphics-oriented and gives a lot more freedom, but is less widely supported.
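The per-pixel model described above can be emulated on the CPU to see what such a fragment shader would compute. In this sketch, gradient_at plays the role of the fragment shader (one call per output pixel, reading only from the input image), and run_per_pixel plays the role of the GPU calling it for every pixel. The names, the Rec. 601 luminance weights and the clamp-to-edge sampling are my own choices for illustration; real code would be written in GLSL and run in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Image {
    int w, h;
    std::vector<float> v;  // one value per pixel, row-major

    // Sample with clamp-to-edge addressing, like a clamped texture lookup.
    float at(int x, int y) const {
        x = std::max(0, std::min(x, w - 1));
        y = std::max(0, std::min(y, h - 1));
        return v[static_cast<std::size_t>(y) * w + x];
    }
};

// RGB to grayscale using Rec. 601 luminance weights.
float luminance(float r, float g, float b) {
    return 0.299f * r + 0.587f * g + 0.114f * b;
}

// "Fragment shader": gradient magnitude at one pixel (central differences).
float gradient_at(const Image& gray, int x, int y) {
    const float dx = 0.5f * (gray.at(x + 1, y) - gray.at(x - 1, y));
    const float dy = 0.5f * (gray.at(x, y + 1) - gray.at(x, y - 1));
    return std::sqrt(dx * dx + dy * dy);
}

// "GPU": invoke the per-pixel function once for every output pixel.
Image run_per_pixel(const Image& in) {
    Image out{in.w, in.h, std::vector<float>(in.v.size(), 0.0f)};
    for (int y = 0; y < in.h; ++y)
        for (int x = 0; x < in.w; ++x)
            out.v[static_cast<std::size_t>(y) * in.w + x] = gradient_at(in, x, y);
    return out;
}
```

Each output pixel depends only on the input image, never on other output pixels, and that independence is what lets a GPU evaluate all pixels in parallel.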
First of all, you need shader support (GLSL or asm).
The usual way is to render a full-screen quad with your image (as a texture) and apply a fragment shader. This is called post-processing, and it is limited by the instruction set and the other limitations of your hardware. At a basic level it lets you apply a simple (single) function to a large data set in parallel, producing another data set. But branching (if it is supported) is the first enemy of performance, because a GPU consists of a couple of SIMD blocks.

optimal pixel-read back strategy

I need to render certain scenes and read the whole image back into main memory. I've searched for this and it seems that most video cards will accelerate the rendering but the read-back will be very slow. After a bit of research I only found this card mentioning "Hardware-Accelerated Pixel Read-Back"
The other approach would be to do software rendering, where the read-back problem doesn't exist, but then the rendering performance will be bad.
Likely, I will have to implement both in order to find the optimal trade-off, but my question is about what other alternatives I have hardware-wise. I understand Quadro is aimed at the modelling and design market segment, which is precisely the target client of this application. Does this mean that I'm not likely to find better pixel read-back performance in other video card lines, e.g. Tesla or Fermi, which don't even have video outputs, by the way?
I don't know if the performance would be any different, but you could at least try rendering to an off-screen buffer, then setting that as a texture of a full-screen quad (or outputting that to video in some other way)