Is double buffering needed any more? - OpenGL

As today's cards seem to keep a list of render commands and flush only on a call to glFlush or glFinish, is double buffering really needed any more? An OpenGL game I am developing on Linux (ATI Mobility Radeon card) with SDL/OpenGL actually flickers less when SDL_GL_SwapBuffers() is replaced by glFinish() and with SDL_GL_SetAttribute(SDL_GL_DOUBLEBUFFER, 0) in the init code. Is this a particular case of my card, or are such things likely on all cards?
EDIT: I've discovered that the cause for this is KWin. It appears that, as datenwolf said, compositing without sync was the cause. When I switched off KWin compositing, the game works fine without ANY source code patches.

Double buffering and glFinish are two very different things.
glFinish blocks the program, until all drawing operations are completed.
Double buffering is used to hide the rendering process from the user. Without double buffering, each and every single drawing operation would become visible immediately, assuming that the display refresh frequency is infinitely high. In practice you will get some display artifacts, like parts of the scene visible in one state, the rest not visible or in some other state, the picture could be incomplete, etc. Double buffering avoids this by first rendering into a back buffer, and only after the rendering has been finished swapping this back with the front buffer, that gets sent to the display device.
Nowadays compositing window management has become prevalent: Windows has Aero, MacOS X has Quartz Extreme, and on Linux at least Unity and the GNOME3 shell use compositing if available. The point is: compositing technically creates double buffering: windows draw to offscreen buffers, and from these the final screen is composited. So if you're running on a machine with compositing, then double buffering performed in your program is kind of redundant, and all it would take is some kind of synchronization mechanism to tell the compositor when the next frame is ready. MacOS X has this. X11 still lacks a proper synchronization scheme, see this post on the mailing list: http://lists.freedesktop.org/archives/xorg/2004-May/000607.html
TL;DR: Double buffering and glFinish are different things, and you need double buffering (of some sort) to make things look good.
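For reference, here is a minimal sketch of the conventional double-buffered SDL 1.2 setup the question starts from, i.e. the path you want to keep; the window size and loop structure are just placeholders:

    // Sketch: request a double-buffered context and swap once per frame
    // instead of calling glFinish().
    #include <SDL/SDL.h>
    #include <SDL/SDL_opengl.h>

    int main(int, char**)
    {
        SDL_Init(SDL_INIT_VIDEO);
        SDL_GL_SetAttribute(SDL_GL_DOUBLEBUFFER, 1);   // keep double buffering on
        SDL_SetVideoMode(800, 600, 0, SDL_OPENGL);     // 800x600 is an arbitrary example size

        bool running = true;
        while (running) {
            SDL_Event ev;
            while (SDL_PollEvent(&ev))
                if (ev.type == SDL_QUIT) running = false;

            glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
            // ... draw the scene into the back buffer ...

            SDL_GL_SwapBuffers();   // present the finished frame; no glFinish() needed
        }
        SDL_Quit();
        return 0;
    }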

I would expect that it has more to do with what you're rendering or your hardware than anything that could be generalized to something not on your machine. So no: don't try to do this.
Oh, and don't forget multisampling. Many implementations only multisample the back buffer; the front buffer is not multisampled. Doing a swap will downsample from the multisampled buffer.
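As a hedged sketch, requesting a multisampled back buffer with the SDL 1.2 attributes used in the first question looks roughly like this (4 samples is an arbitrary choice); the downsample then happens at swap time:

    // Request a double-buffered, multisampled framebuffer before creating the window.
    SDL_GL_SetAttribute(SDL_GL_DOUBLEBUFFER, 1);
    SDL_GL_SetAttribute(SDL_GL_MULTISAMPLEBUFFERS, 1);   // enable MSAA storage
    SDL_GL_SetAttribute(SDL_GL_MULTISAMPLESAMPLES, 4);   // example sample count
    SDL_SetVideoMode(800, 600, 0, SDL_OPENGL);           // example window size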


OSX pushing pixels to screen with minimum latency

I'm trying to develop some very low-latency graphics applications and am getting really frustrated by how long it takes to draw to screen through OpenGL. Every discussion I find about it online addresses optimizing the OpenGL pipeline, but doesn't get anywhere near the results that I need.
Check this out:
https://www.dropbox.com/s/dbz4bq67cxluhs7/MouseLatency.MOV?dl=0
You probably noticed this before: with a C++ OpenGL app, dragging the mouse around the screen and drawing the mouse location in OpenGL, the OpenGL rendering lags behind by 3 or 4 frames. Clearly OSX CAN draw [the cursor] to screen with very low latency, but OpenGL is much slower. So let's say I don't need to do any fancy OpenGL rendering. I just want to push pixels to screen somehow. Is there a way for me to bypass OpenGL completely and draw to screen faster? Or is this kind of functionality going to be locked inside the kernel somewhere that I can't reach it?
datenwolf's answer is excellent. I just wanted to add one thing to this discussion regarding triple buffering at the compositor level, since I am very familiar with the Microsoft Windows desktop compositor.
I know you are asking about OS X here, but the implementation details I am going to discuss are the most sensible way of implementing this stuff and I would expect to see other systems work this way too.
Triple buffering as you might enable at the application level adds a third buffer to the swap-chain that is synchronized to refresh. That way of doing triple buffering does add latency, because that third buffer has to be displayed and nothing is allowed to touch it until this happens (this is D3D's mandated behavior -- the behavior and feature itself are undefined in OpenGL); but the way the Desktop Window Manager (Windows) works is slightly different.
The behavior I have seen most drivers implement for desktop composition is frame dropping. In any situation where multiple frames are finished between refreshes, all but one of those frames is discarded. You actually get lower latency using a window rather than fullscreen + triple buffering, because buffer swaps are not blocked when the third buffer (owned by the compositor) already has a finished frame waiting to be displayed.
It creates a whole different set of visual issues if framerate is not reasonably consistent. Technically, pixels belonging to dropped frames have infinite latency, so the benefits from latency reduction done this way might be worthless if you needed every single frame drawn to appear on screen.
I believe you can get this behavior on OS X (if you want it) by disabling VSYNC and drawing in a window. In this scenario VSYNC basically only serves as a form of frame pacing (trading latency for consistency), and tearing is eliminated by the compositor itself regardless of what rate you draw at.
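If you want to experiment with that, a sketch of turning VSYNC off for a CGL-based context on OS X (an NSOpenGLContext exposes the same setting via its NSOpenGLCPSwapInterval parameter):

    #include <OpenGL/OpenGL.h>

    // Set the swap interval of the current CGL context to 0, i.e. don't wait
    // for the vertical retrace; the compositor still prevents tearing in a window.
    void disableVSync()
    {
        GLint swapInterval = 0;
        CGLSetParameter(CGLGetCurrentContext(), kCGLCPSwapInterval, &swapInterval);
    }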
Regarding mouse cursor latency:
The cursor in any modern window system will always track with minimum latency. There is literally a feature on graphics hardware called a "hardware cursor," where the driver stores the cursor position and then, once per refresh, has the hardware overlay the cursor on top of whatever is sitting in the framebuffer waiting to be scanned out. So even if your application is drawing at 30 FPS on a 60 Hz display, the cursor is updated every 16 ms when the hardware cursor is used.
This bypasses all graphics APIs altogether, but is quite limited (e.g. it uses the OS-defined cursor).
TL;DR: Latency comes in many forms.
If your problem is input latency, then you can mitigate that by reducing the number of pre-rendered frames and avoiding triple buffering. I could not begin to tell you how to reduce the number of driver pre-rendered frames on OS X.
Minimize length of time before something shows up on screen
If your problem is the amount of time that passes between executions of your render loop, you would go the other way. Increase pre-rendered frames, draw in a window and disable VSYNC. You may run into a lot of frames that are drawn but never displayed in this scenario.
Minimize time spent blocking (increase FPS); some frames will never be displayed
Pre-rendered frames are a powerful little feature that you do not get control over at the OpenGL API level. It sets up how deeply the driver is allowed to pipeline everything and depending on the desired task you will trade different types of latency by fiddling with it. Many gamers swear by setting this value to 1 to minimize input latency at the cost of overall framerate "smoothness."
UPDATE:
Pre-rendered frames are one reason for your multi-frame delay. Fixing this in a cross-platform way is difficult (it's a driver setting), but if you have access to Fence Sync Objects you can produce the same behavior as forcing this to 1.
I can explain this in more detail if need be; the general idea is that you insert a fence sync after the buffer swap and then wait for it to be signaled before the first command in the next frame is allowed to begin. Performance may take a nose dive, but latency will be minimized since the CPU won't be rendering ahead of the GPU anymore.
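For illustration, a sketch of that idea, assuming a GL 3.2+/ARB_sync context is current and an extension loader has been set up; SwapBuffers(deviceContext) stands in for whatever platform swap call you actually use:

    // Limit the CPU to one frame ahead of the GPU using a fence sync object.
    GLsync frameFence = 0;

    void endFrame()
    {
        SwapBuffers(deviceContext);   // placeholder for the platform swap call
        frameFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    }

    void beginFrame()
    {
        if (frameFence) {
            // Block (flushing first) until the previous frame's commands have completed.
            glClientWaitSync(frameFence, GL_SYNC_FLUSH_COMMANDS_BIT,
                             1000000000ull);   // 1 s timeout, effectively "wait"
            glDeleteSync(frameFence);
            frameFence = 0;
        }
        // ... issue this frame's commands ...
    }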
There are a number of latencies at play here.
Input event → drawing state latency
In your typical interactive application you have an event loop that usually goes
1. collect user input
2. process user input
3. determine what's to be drawn
4. draw to the back buffer
5. swap back to front buffer
With the usual ways in which event–update–display loops are written, there's almost no delay between step 5 of the previous iteration and step 1 of the following one, which means that steps 2, 3, and 4 operate with data that lags about one frame period behind.
So this is the first source of latency.
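Spelled out as code, a minimal sketch of that loop (SDL-style; pollInput, updateGameState, buildRenderList and drawScene are hypothetical placeholders for your own functions):

    // The input polled at the top only becomes visible when the swap at the
    // bottom completes, so steps 2-4 always work with data that is roughly
    // one frame period old.
    while (running) {
        pollInput(&input);          // 1. collect user input
        updateGameState(input);     // 2. process user input
        buildRenderList();          // 3. determine what's to be drawn
        drawScene();                // 4. draw to the back buffer
        SDL_GL_SwapBuffers();       // 5. swap back to front buffer
    }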
Triple buffering / composition latency
Many graphics pipelines enable triple buffering for smoother display updates. Instead of keeping only a back and a front buffer around, there's also a third buffer in between. The average rate at which these buffers are drawn to matches the display refresh rate, and the buffers themselves are stepped forward at exactly the display refresh period. So this adds another frame period of latency.
If you're running on a system with a window compositor (which is the default on MacOS X) this effectively adds another buffer stage, so if you've got a double buffered mode it gives you triple buffering, and if you had triple buffering it'd give you a "quad" buffer (quotes here, because quad buffering is a term usually used for stereoscopic rendering).
What can you do about this:
Turn off composition
Windows (through the DWM API) and MacOS X allow you to turn off composition or bypass the compositor.
Reducing input lag
Try to collect and integrate the user input as late as possible (use high resolution sleeps). If you've got only a very simple scene you can push the drawing quite close to the V-Sync deadline; in fact the NVidia OpenGL implementation has a vendor-specific extension that allows you to sleep until a specified amount of time before the next V-Sync.
If your scene is complex but separable into parts that require low-latency user input and parts where it doesn't matter so much, you can draw the higher-latency stuff earlier and only at the very last moment integrate the user input into it. Of course if the mouse is used to control the viewing direction, or, even worse, you're rendering for a VR head-mounted display, things are going to become difficult.

double buffering with FBO+RBO and glFinish()

I am using an FBO+RBO, and instead of regular double buffering on the default framebuffer, I am drawing to the RBO and then blitting directly to the GL_FRONT buffer of the default framebuffer (0) in a single buffered OpenGL context.
It is fine and I don't get any flickering, but if the scene gets a bit complex, I experience a HUGE drop in fps, something so weird that I knew something had to be wrong. And I don't mean from 1/60 to 1/30 because of a skipped sync, I mean a sudden 90% fps drop.
I tried a glFlush() after the blit - no difference; then I tried a glFinish() after the blit, and I had a 10x fps boost.
So I used regular double buffering on the default framebuffer and swapbuffers(), and the fps got a boost as well, just as when using glFinish().
I cannot figure out what is happening. Why does glFinish() make so much of a difference when it should not? And is it OK to use an RBO and blit directly on the front buffer, instead of using a swapbuffers call in a double buffering context? I know I'm missing vsync, but the composite manager will sync anyway (in fact I'm not seeing any tearing); it is just as if the monitor is missing 9 out of 10 frames.
And just out of curiosity, does a native swapbuffers() use glFinish() on either Windows or Linux?
I believe it is a sync-related issue.
When rendering directly to the RBO and blitting to the front buffer, there is simply no sync whatsoever. Thus on complex scenes the GPU command queue will fill quite quickly, then the CPU-side driver queue will fill quickly as well, until a CPU sync is forced by the driver during an OpenGL command. At that point the CPU thread will be halted.
What I mean is that, without any form of sync, complex renderings (renderings for which one or more OpenGL commands will be queued) will always cause the CPU thread to be halted at some point, since the CPU keeps issuing more and more commands as the queues fill up.
In order to get smooth (more consistent) user interaction, a sync is needed (either with a platform-specific swapbuffers() or a glFinish()) so as to stop the CPU from making things worse by issuing more and more commands (which would bring the CPU thread to a stop later anyway).
reference: OpenGL Synchronization
There are separate issues here, that are also a little bit connected.
1) Re-implementing double buffering yourself, while on spec the same thing, is not the same thing to the driver. Drivers are highly optimized for the common case. For example, many chips have distinct 2d and 3d units. The swap in swapBuffers is often handled by the 2d unit. Blitting a buffer is probably still done with the 3d unit.
2) glFlush (and Finish) are ignored by many drivers. Flush is a relic of client server rendering. Finish was intended for profiling. But it got abused to work around driver bugs. So now drivers often ignore it to improve the performance of legacy code that used Finish as a workaround.
3) Just don't do single buffered. There is no performance benefit and you are working off the "good" path of the driver. Window managers are super optimized for double buffered OpenGL.
4) What you are seeing looks a lot like you are simply leaking resources. Do you allocate buffers without freeing them? A quick and dirty way to check is if any glGen* functions return ever increasing ids.
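For comparison, a sketch of the double-buffered path suggested in point 3, assuming fbo is your existing FBO (with the RBO attached) and width/height is your render size:

    // Blit the off-screen result into the BACK buffer of the default framebuffer,
    // then let the platform swap present it (and provide the pacing/sync).
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);     // 0 = default framebuffer
    glDrawBuffer(GL_BACK);                         // draw into the back buffer, not GL_FRONT
    glBlitFramebuffer(0, 0, width, height,
                      0, 0, width, height,
                      GL_COLOR_BUFFER_BIT, GL_NEAREST);
    // then: SwapBuffers(hdc) / glXSwapBuffers(dpy, win) / SDL_GL_SwapBuffers()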

OpenGL window systems screen tearing prevention

In my OpenGL application I want to prevent screen tearing, for obvious reasons. So far I have been using vsync. But I would like to replace it with a page flipping buffer swap (changing a pointer to the data the monitor scans out instead of copying the values) to improve performance. My question is: do the important windowing systems (Windows, Cocoa, X11) support this kind of buffer swap at all, and does it need to be requested explicitly or is it the default behavior?
V-Sync is the "vertical retrace synchronization". If V-Sync is enabled it means that the double buffers are exchanged in the timespan when the display is not drawing. It's a term inherited from the time of CRT displays, where an electron beam was used to draw the image line by line from the top left to the bottom. When the beam reached the bottom right it had to be returned to the top left. The electron beam was steered using two pairs of electromagnetic coils and (unlike the electrostatic deflectors in an oscilloscope) could not operate beyond a certain slew rate. That's the V-Sync.
Today, displays still receive their data line by line into a buffer internal to the display. At the end of a whole frame a small pause is inserted.
So the "vertical retrace" is that timespan where you can update the data in your display framebuffer, wihout interfereing with the drawing process.
So far I have been using vsync.
No, you didn't "use" vsync. You used double buffering, whose buffer exchange is synchronized by the V-Sync.
But I would like to replace it with a page flipping buffer swap
This is not your decision to make. What method is used is chosen by the graphics hardware and its driver. Your program lives in userspace and can't even talk to the hardware at such a low level. And normally the method that performs best in the situation is used.
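The one thing you can request explicitly is the swap interval; whether the driver implements the resulting swap as a page flip or a blit is up to it. A sketch, here using SDL2's SDL_GL_SetSwapInterval (wglSwapIntervalEXT and glXSwapIntervalEXT are the raw platform equivalents):

    // Ask for swaps synchronized to the vertical retrace; the driver decides
    // whether that swap is a page flip or a copy.
    if (SDL_GL_SetSwapInterval(1) < 0) {
        // Swap control not supported here; swaps will be unthrottled.
    }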

Perfect V-sync implementation for a lightweight OpenGL game: need one tidbit of information

In the game our Internet-assembled team is programming, we're assuming everybody in our audience will run the game at WAY over full speed.
So, to save video RAM, and hopefully give a little more idle time to the graphics card, using V-sync without double buffering would be our best option. So, in OpenGL, we need to know how to do that.
From my understanding, V-sync is when the graphics card is paused once it's done rendering a single frame until that frame has finished being sent to the display device. Double buffering doesn't pause render operations (or maybe it does, or maybe it's implementation-specific; not sure), because it instead draws to a second buffer before copying to the framebuffer, so that the monitor either gets the full frame or no new frame at all (specifically, the last stored image in the framebuffer). Well, we don't need that feature, as long as the graphics card just writes to the framebuffer ONLY when it damn needs to.
This is a pretty slow online game (But it's VERY creative ^_^). There's very little realtime action. Therefore, extremely precise user input is not a necessity; it can be captured from the OS as a single unit any time before rendering a frame.
So, in order to do EXACTLY this, I need to be able to get a "Frame has finished sending to monitor" message from OpenGL. Is it possible? If not, what is the best alternative?
The game is being programmed for Windows only at the moment but should have work done for Linux in a few months.
You suffer from a misconception of what V-Sync does. There's a part of video RAM that's continuously sent to the display device at a constant rate, the frame refresh rate. So immediately after a full frame has been sent, the next frame gets sent, after a very short blank time. But the time between sending frames is far shorter than the time it takes to send a full frame.
What happens without V-Sync is, that operations on the contents of the framebuffer get visible, for example if the frame is filled alternating with red and green and there's no V-Sync you'll see red and green bands on the monitor. To avoid this, V-Sync swaps the pointer the display driver uses to access the framebuffer just after a full frame has been sent.
Which brings us to what doublebuffering does. Without doublebuffering there's little use for a V-Sync. The action triggered by V-Sync must happen very, very fast. So this boils down to swapping a pointer or a very fast blitting operation (potentially by simply setting CoW attributes for the GPU's MMU).
Without doublebuffering and no V-Sync the effect is that one can see the process in which the picture is rendered piece by piece to the framebuffer. Of course if rendering happens faster than a frame period, this has the effect that, top-down, you'll see an only sparsely populated image with more and more content becoming visible toward the bottom, and somewhere in between it'll hit the lower screen edge, wrapping around to the top. The intersection line will be moving.
TL;DR: Just use double buffering and enable V-Sync for the buffer swap. Don't be afraid of memory consumption. All GPUs in circulation today have more than enough RAM to easily provide the memory for double buffered colour planes. Just do the math: 1920x1200 * RGB ≈ 6.6MiB, and even the smallest GPUs in PCs today come with at least 128MiB of RAM. Mobile devices, let's say an iPad: 1024x768 * RGB ≈ 2.25MiB vs. 32MiB for graphics. The UI of the iPad is double buffered anyway.
You can use wglGetProcAddress to get the address of wglSwapIntervalEXT, and then call wglSwapIntervalEXT(1); to synchronize updates with the vertical synch. When you do this, you don't get a message at the vertical synch -- instead glFlush simply doesn't return until a vertical retrace has happened, and the screen has been updated. So, you have a WM_PAINT handler that looks something like this:
BeginPaint
wglMakeCurrent
do drawing
glFlush
EndPaint
The glFlush is needed in any case, to ensure the drawing you've done gets sent to the screen.
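A sketch of how that outline might look in Win32 code, assuming hwnd and hglrc are your existing window and GL context, with error handling omitted:

    #include <windows.h>
    #include <GL/gl.h>

    // Load and enable the swap interval extension once, after context creation.
    typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int);
    void enableVSync()
    {
        PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
            (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
        if (wglSwapIntervalEXT)
            wglSwapIntervalEXT(1);   // synchronize updates with the vertical retrace
    }

    // WM_PAINT handler following the outline above.
    void onPaint(HWND hwnd, HGLRC hglrc)
    {
        PAINTSTRUCT ps;
        HDC hdc = BeginPaint(hwnd, &ps);
        wglMakeCurrent(hdc, hglrc);

        // ... do drawing ...

        glFlush();                   // per the answer: returns after the retrace
        EndPaint(hwnd, &ps);
    }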

How does Photoshop (Or drawing programs) blit?

I'm getting ready to make a drawing application in Windows. I'm just wondering, do drawing programs have a memory bitmap which they lock, then set each pixel, then blit?
I don't understand how Photoshop can move entire layers without lag or flicker without using hardware acceleration. Also in a program like Expression Design, I could have 200 shapes and move them around all at once with no lag. I'm really wondering how this can be done without GPU help.
Also, I don't think super-efficient algorithms alone could account for that, could they?
Look at this question:
Reduce flicker with GDI+ and C++
All you can do about DC drawing without GPU is to reduce flickering. Anything else depends on the speed of filling your memory bitmap. And here you can use efficient algorithms, multithreading and whatever you need.
Certainly modern Photoshop uses GPU acceleration if available. Another possible tool is DMA. You may also find it helpful to read the source code of existing programs like GIMP.
Double (or more) buffering is the way it's done in games, where we're drawing a ton of crap into a "back" buffer while the "front" buffer is being displayed. Then when the draw is done, the buffers are swapped (a pointer swap, not copies!) and the process continues in the new front and back buffers.
Triple buffering offers another bonus, in that you can start drawing two-frames-from-now when next-frame is done, but without forcing a buffer swap in the middle of the screen refresh. Many games do the buffer swap in the middle of the refresh, but you can sometimes see it as visible artifacts (tearing) on the screen.
Anyway- for an app drawing bitmaps into a window, if you've got some "slow" operation, do it into a not-displayed buffer while presenting the displayed version to the rendering API, e.g. GDI. Let the system software handle all of the fancy updating.
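For a plain GDI drawing app, a sketch of that pattern might look like the following (hwnd is your window; recreating the back bitmap on every paint is wasteful, so a real app would cache it, and note this BitBlt is a copy, not a pointer swap):

    #include <windows.h>

    // Draw into an off-screen "back buffer", then present it in a single blit,
    // so the user never sees partially drawn content.
    void paintDoubleBuffered(HWND hwnd, int width, int height)
    {
        PAINTSTRUCT ps;
        HDC screenDC = BeginPaint(hwnd, &ps);

        // Create an off-screen buffer compatible with the screen DC.
        HDC backDC = CreateCompatibleDC(screenDC);
        HBITMAP backBmp = CreateCompatibleBitmap(screenDC, width, height);
        HBITMAP oldBmp = (HBITMAP)SelectObject(backDC, backBmp);

        // ... draw everything (slow operations included) into backDC ...

        // Present the finished image in one blit.
        BitBlt(screenDC, 0, 0, width, height, backDC, 0, 0, SRCCOPY);

        SelectObject(backDC, oldBmp);
        DeleteObject(backBmp);
        DeleteDC(backDC);
        EndPaint(hwnd, &ps);
    }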