WGL: No double buffering + multi sampling = FAIL? - opengl

I usually create a pixel format using wglChoosePixelFormatARB() with these arguments (among others):
WGL_DOUBLE_BUFFER_ARB = GL_TRUE
WGL_SAMPLE_BUFFERS_ARB = GL_TRUE
WGL_SAMPLES_ARB = 4
i.e. double buffering on and x4 multisampling. This works just fine.
But when I try to turn off double buffering:
WGL_DOUBLE_BUFFER_ARB = GL_FALSE
WGL_SAMPLE_BUFFERS_ARB = GL_TRUE
WGL_SAMPLES_ARB = 4
The call to wglChoosePixelFormatARB() fails (or rather, indicates that it didn't return any format).
When I effectively turn multisampling off:
WGL_DOUBLE_BUFFER_ARB = GL_FALSE
WGL_SAMPLE_BUFFERS_ARB = GL_TRUE
WGL_SAMPLES_ARB = 1
It works fine again.
Is there something inherent that prevents a non-double-buffered pixel format from working with multisampling?
The reason I'm turning double buffering off is to achieve an unconstrained frame rate. With double buffering the frame rate I get is only up to 60 FPS (this laptop LCD works at 60 Hz). But with double buffering off I can get up to 1500 FPS. Is there a way to achieve this with double buffering on?

In theory, drawing in a single-buffer mode means that you're directly modifying what is being presented to the screen (aka the front buffer).
Since that memory is in a specific format already, you don't get to choose another one.
(I'm saying in theory because in practice the platform does whatever it pleases. Aero, for example, does not allow access to the front buffer.)
Moreover, when doing multisampling, the step that converts the X samples/pixel to 1 pixel for drawing is when the back-buffer is copied to the front buffer (what is called the resolve step). In single buffer mode, there is no such step.
As to your 60 FPS locking, you might want to look at WGL_EXT_swap_control. The issue here is that you don't generally want to update what is being shown on screen while the screen refreshes the data; it creates tearing. So by default, Swap only updates during the vertical sync (aka vsync), so you end up locked to the refresh rate of the screen.
If you don't mind your display showing parts of different frames, you can turn it off.
For completeness, there is an alternative mode called triple buffering, which essentially has the GPU ping-pong rendering between 2 back buffers while the front buffer is shown. It is up to the GPU to pick the last finished back buffer when it comes time to change what shows on screen (vsync). Sadly, I am not aware of a WGL method to ask for triple buffering.
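To make the suggestion above concrete (keep double buffering together with multisampling, and unlock the frame rate through the swap interval instead), here is a minimal sketch of the attribute list. The helper name make_pixel_format_attribs is hypothetical, and the token values are copied from the WGL_ARB_pixel_format and WGL_ARB_multisample specs only so that the snippet builds without <GL/wglext.h>.

```cpp
#include <vector>

// Token values from the WGL_ARB_pixel_format and WGL_ARB_multisample
// extension specifications, repeated here so the sketch builds without
// <GL/wglext.h>. Real code should include that header instead.
constexpr int kWglDrawToWindow  = 0x2001; // WGL_DRAW_TO_WINDOW_ARB
constexpr int kWglSupportOpenGL = 0x2010; // WGL_SUPPORT_OPENGL_ARB
constexpr int kWglDoubleBuffer  = 0x2011; // WGL_DOUBLE_BUFFER_ARB
constexpr int kWglSampleBuffers = 0x2041; // WGL_SAMPLE_BUFFERS_ARB
constexpr int kWglSamples       = 0x2042; // WGL_SAMPLES_ARB

// Hypothetical helper: builds the zero-terminated integer attribute list
// that wglChoosePixelFormatARB() takes as its second argument.
std::vector<int> make_pixel_format_attribs(bool double_buffer, int samples) {
    std::vector<int> attribs = {
        kWglDrawToWindow,  1,
        kWglSupportOpenGL, 1,
        kWglDoubleBuffer,  double_buffer ? 1 : 0,
    };
    // Only ask for sample buffers when more than one sample is wanted.
    if (samples > 1) {
        attribs.insert(attribs.end(), {kWglSampleBuffers, 1, kWglSamples, samples});
    }
    attribs.push_back(0); // list terminator
    return attribs;
}
```

With make_pixel_format_attribs(true, 4) the data() pointer of the result can be handed to wglChoosePixelFormatARB(). Once the context is current, calling wglSwapIntervalEXT(0) (from WGL_EXT_swap_control) should lift the 60 FPS cap, at the price of possible tearing.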

Related

Understanding buffer swapping in more detail

This is more a theoretical question. This is what I understand regarding buffer swapping and vsync:
I - When vsync is off, whenever the developer swaps the front/back buffers, the buffer that the GPU is reading from and sending to the monitor will be changed to the new one, regardless of whether the old buffer was being read (i.e. no vblank is needed).
II - When vsync is on, the buffers are not immediately swapped, they will only be changed when the old buffer was completely read (i.e. vblank is needed).
III - Turning vsync off can boost the frame rate above the monitor refresh rate, but screen tearing can appear when buffers are swapped while they are being read.
IV - Turning vsync on prevents tearing, but the monitor refresh rate limits the FPS.
Based on this I tried the following experiment: I disabled vsync, and every frame I rendered all pixels with a solid color using glClearColor + glClear, choosing a new random color per frame. I got ~2400 FPS on a 60 Hz monitor. Since I swapped the buffers every frame, and since the monitor takes 1/60 second to draw the full screen, I was expecting the buffers to be swapped roughly 40 times during each monitor refresh, because there are around 40 buffer swap calls in 1/60 s. Since the clear color is different every time the buffers are swapped, I was expecting to see a really messy image, with lots of different colors, because of the tearing. Instead, when I took some screenshots I didn't see any tearing; every pixel had the same solid color.
Could someone point out the wrong assumptions that I had and why I see such behavior?
Thanks in advance!
The problem was related to the window manager. I could see the expected behavior when I ran in full screen.
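For reference, the arithmetic behind the expectation in the question can be written down as a small sketch. The function names are made up for illustration, and the 1080 visible scanlines used in the note below are an assumption, not something stated in the question.

```cpp
// Number of buffer swaps that land inside one monitor refresh.
constexpr double swaps_per_refresh(double swaps_per_second, double refresh_hz) {
    return swaps_per_second / refresh_hz;
}

// Average height, in scanlines, of each band between two tear lines,
// assuming the swaps are spread evenly over the scan-out of one frame.
constexpr double tear_band_height(int visible_lines, double swaps_per_second,
                                  double refresh_hz) {
    return visible_lines / swaps_per_refresh(swaps_per_second, refresh_hz);
}
```

At 2400 swaps per second on a 60 Hz display this gives 40 swaps per refresh, so on an assumed 1080-line mode one would expect bands of roughly 27 scanlines each when running full screen without a compositor in between.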

OSX pushing pixels to screen with minimum latency

I'm trying to develop some very low-latency graphics applications and am getting really frustrated by how long it takes to draw to screen through OpenGL. Every discussion I find about it online addresses optimizing the OpenGL pipeline, but doesn't get anywhere near the results that I need.
Check this out:
https://www.dropbox.com/s/dbz4bq67cxluhs7/MouseLatency.MOV?dl=0
You have probably noticed this before: with a C++ OpenGL app, dragging the mouse around the screen and drawing the mouse location in OpenGL, the OpenGL drawing lags behind by 3 or 4 frames. Clearly OSX CAN draw [the cursor] to screen with very low latency, but OpenGL is much slower. So let's say I don't need to do any fancy OpenGL rendering. I just want to push pixels to screen somehow. Is there a way for me to bypass OpenGL completely and draw to screen faster? Or is this kind of functionality going to be locked inside the kernel somewhere that I can't reach it?
datenwolf's answer is excellent. I just wanted to add one thing to this discussion regarding triple buffering at the compositor level, since I am very familiar with the Microsoft Windows desktop compositor.
I know you are asking about OS X here, but the implementation details I am going to discuss are the most sensible way of implementing this stuff and I would expect to see other systems work this way too.
Triple buffering as you might enable at the application level adds a third buffer to the swap-chain that is synchronized to refresh. That way of doing triple buffering does add latency, because that third buffer has to be displayed and nothing is allowed to touch it until this happens (this is D3D's mandated behavior -- the behavior and feature itself are undefined in OpenGL); but the way the Desktop Window Manager (Windows) works is slightly different.
The behavior I have seen most drivers implement for desktop composition is frame dropping. In any situation where multiple frames are finished between refreshes, all but one of those frames are discarded. You actually get lower latency using a window rather than fullscreen + triple buffering, because it does not block buffer swaps when the third buffer (owned by the compositor) has a finished frame waiting to be displayed.
It creates a whole different set of visual issues if framerate is not reasonably consistent. Technically, pixels belonging to dropped frames have infinite latency, so the benefits from latency reduction done this way might be worthless if you needed every single frame drawn to appear on screen.
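The frame-dropping behavior described above can be illustrated with a small simulation. This is only a sketch of the idea, not how any particular compositor is implemented; frames_shown_per_refresh and its parameters are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// For each refresh tick (at refresh_ms, 2 * refresh_ms, ...), returns the
// index of the most recently finished frame, or -1 if none has finished yet.
// finish_times_ms must be sorted in ascending order.
std::vector<int> frames_shown_per_refresh(const std::vector<double>& finish_times_ms,
                                          double refresh_ms, int refresh_count) {
    std::vector<int> shown;
    std::size_t next = 0; // first frame that has not finished yet
    int latest = -1;      // newest finished frame so far
    for (int tick = 1; tick <= refresh_count; ++tick) {
        const double vblank_ms = tick * refresh_ms;
        // Every frame finished before this vblank replaces the previous
        // candidate; the replaced frames are the dropped ones.
        while (next < finish_times_ms.size() && finish_times_ms[next] <= vblank_ms) {
            latest = static_cast<int>(next);
            ++next;
        }
        shown.push_back(latest);
    }
    return shown;
}
```

With frames finishing at 4, 8, 12, 20 and 30 ms and a 16 ms refresh, only frames 2 and 4 are ever shown; frames 0, 1 and 3 are drawn but never reach the screen.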
I believe you can get this behavior on OS X (if you want it) by disabling VSYNC and drawing in a window. VSYNC basically only serves as a form of frame pacing (trade latency for consistency) in this scenario and tearing is eliminated by the compositor itself regardless what rate you draw at.
Regarding mouse cursor latency:
The cursor in any modern window system will always track with minimum latency. There is literally a feature on graphics hardware called a "hardware cursor," where the driver stores the cursor position and then once per-refresh, has the hardware overlay the cursor on top of whatever is sitting in the framebuffer waiting to be scanned-out. So even if your application is drawing at 30 FPS on a 60 Hz display, the cursor is updated every 16 ms when the hardware cursor's used.
This bypasses all graphics APIs altogether, but is quite limited (e.g. it uses the OS-defined cursor).
TL;DR: Latency comes in many forms.
If your problem is input latency, then you can mitigate that by reducing the number of pre-rendered frames and avoiding triple buffering. I could not begin to tell you how to reduce the number of driver pre-rendered frames on OS X.
Minimize length of time before something shows up on screen
If your problem is the amount of time that passes between executions of your render loop, you would go the other way. Increase pre-rendered frames, draw in a window and disable VSYNC. You may run into a lot of frames that are drawn but never displayed in this scenario.
Minimize time spent blocking (increase FPS); some frames will never be displayed
Pre-rendered frames are a powerful little feature that you do not get control over at the OpenGL API level. It sets up how deeply the driver is allowed to pipeline everything and depending on the desired task you will trade different types of latency by fiddling with it. Many gamers swear by setting this value to 1 to minimize input latency at the cost of overall framerate "smoothness."
UPDATE:
Pre-rendered frames are one reason for your multi-frame delay. Fixing this in a cross-platform way is difficult (it's a driver setting), but if you have access to Fence Sync Objects you can produce the same behavior as forcing this to 1.
I can explain this in more detail if need be, the general idea is that you insert a fence sync after the buffer swap and then wait for it to be signaled before the first command in the next frame is allowed to begin. Performance may take a nose dive, but latency will be minimized since the CPU won't be rendering ahead of the GPU anymore.
There are a number of latencies at play here.
Input event → drawing state latency
In your typical interactive application you have an event loop that usually goes
1. collect user input
2. process user input
3. determine what's to be drawn
4. draw to the back buffer
5. swap back to front buffer
With the usual ways in which event–update–display loops are written, there's almost no delay between step 5 of the previous iteration and step 1 of the following one, which means that steps 2, 3, and 4 operate on data that lags about one frame period behind.
So this is the first source of latency.
Triple buffering / composition latency
Many graphics pipelines enable triple buffering for smoother display updates. Instead of keeping only a back and a front buffer around, there's also a third buffer in between. On average, these buffers are drawn to once per display refresh period. The buffers themselves are stepped at exactly the display refresh period. So this adds another frame period of latency.
If you're running on a system with a window compositor (which is the default on MacOS X), this effectively adds another buffer stage, so if you've got a double buffer mode it gives you triple buffering, and if you had a triple buffer it'd give you a "quad" buffer (quotes here, because quad buffer is a term usually used with stereoscopic rendering).
What can you do about this:
Turn off composition
Windows (through the DWM API) and MacOS X allow you to turn off composition or bypass the compositor.
Reducing input lag
Try to collect and integrate the user input as late as possible (use high resolution sleeps). If you've got only a very simple scene, you can push the drawing quite close to the V-Sync deadline; in fact the NVidia OpenGL implementation has a vendor-specific extension that allows you to sleep until a specific amount of time before the next V-Sync.
If your scene is complex but can be separated into parts that require low-latency user input and parts where it doesn't matter so much, you can draw the higher-latency stuff earlier and integrate user input into it only at the very last moment. Of course, if the mouse is used to control the viewing direction, or even worse, you're rendering for a VR head mounted display, things are going to become difficult.
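A rough sketch of the "collect input as late as possible" idea: given the time of the last V-Sync, the refresh period and an estimate of how long drawing takes, compute the moment at which input should be sampled. The function name and the assumption that the last V-Sync time is known are mine; obtaining that timestamp is platform specific and not shown.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Returns the latest moment at which input can be sampled so that a frame
// needing `render_budget` to draw is still finished before the next V-Sync.
Clock::time_point input_sample_time(Clock::time_point last_vsync,
                                    Clock::duration refresh_period,
                                    Clock::duration render_budget) {
    return last_vsync + refresh_period - render_budget;
}
```

The render loop would then call std::this_thread::sleep_until() (from <thread>) on the returned time point, read the input, and draw. The render budget needs some safety margin, since missing the deadline costs a whole frame.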

How to do exactly one render per vertical sync (no repeating, no skipping)?

I'm trying to do vertical synced renders so that exactly one render is done per vertical sync, without skipping or repeating any frames. I would need this to work under Windows 7 and (in the future) Windows 8.
It would basically consist of drawing a sequence of QUADS that would fit the screen so that a pixel from the original images matches 1:1 a pixel on the screen. The rendering part is not a problem, either with OpenGL or DirectX. The problem is the correct syncing.
I previously tried using OpenGL, with the WGL_EXT_swap_control extension, by drawing and then calling
SwapBuffers(g_hDC);
glFinish();
I tried all combinations and permutations of these two instructions along with glFlush(), and it was not reliable.
I then tried with Direct3D 10, by drawing and then calling
g_pSwapChain->Present(1, 0);
pOutput->WaitForVBlank();
where g_pSwapChain is an IDXGISwapChain* and pOutput is the IDXGIOutput* associated with that swap chain.
Both versions, OpenGL and Direct3D, give the same result: the first sequence of, say, 60 frames lasts longer than it should (instead of about 1000 ms at 60 Hz, it lasts something like 1030 or 1050 ms), the following ones seem to work fine (about 1000.40 ms), but every now and then it seems to skip a frame. I do the measuring with QueryPerformanceCounter.
On Direct3D, trying a loop of just the WaitForVBlank, the duration of 1000 iterations is consistently 1000.40 with little variation.
So the trouble here is not knowing exactly when each of the functions called return, and whether the swap is done during the vertical sync (not earlier, to avoid tearing).
Ideally (if I'm not mistaken), to achieve what I want, it would be to perform one render, wait until the sync starts, swap during the sync, then wait until the sync is done. How to do that with OpenGL or DirectX?
Edit:
A test loop of just WaitForVBlank 60x consistently takes from 1000.30 ms to 1000.50 ms.
The same loop with Present(1,0) before WaitForVBlank, with nothing else and no rendering, takes the same time, but sometimes it fails and takes 1017 ms, as if a frame had been repeated. There's no rendering, so there's something wrong here.
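One way to check for the occasional skipped frame mentioned above is to record a timestamp per iteration and look for intervals that are clearly longer than one refresh period. A minimal sketch, assuming the timestamps have already been converted to milliseconds (for example from QueryPerformanceCounter); the function name and the 1.5x threshold are arbitrary choices of mine:

```cpp
#include <cstddef>
#include <vector>

// Counts intervals between consecutive present timestamps that are long
// enough to contain a missed refresh (more than 1.5 refresh periods).
int count_skipped_frames(const std::vector<double>& present_times_ms,
                         double refresh_ms) {
    int skipped = 0;
    for (std::size_t i = 1; i < present_times_ms.size(); ++i) {
        const double delta = present_times_ms[i] - present_times_ms[i - 1];
        if (delta > 1.5 * refresh_ms) {
            ++skipped;
        }
    }
    return skipped;
}
```

A 60-iteration run that takes 1017 ms instead of about 1000 ms would show up here as exactly one long interval, which also tells you where in the sequence the hiccup happened.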
I have the same problem in DX11. I want to guarantee that my frame rendering code takes an exact multiple of the monitor's refresh period, to avoid multi-buffering latency.
Just calling pSwapChain->Present(1,0) is not sufficient. That will prevent tearing in fullscreen mode, but it does not wait for the vblank to happen. The Present call is asynchronous, and it returns right away if there are frame buffers remaining to be filled. So if your render code is producing a new frame very quickly (say 10 ms to render everything) and the user has set the driver's "Maximum pre-rendered frames" to 4, then you will be rendering four frames ahead of what the user sees. This means 4 * 16.7 = 67 ms of latency between mouse action and screen response, which is unacceptable. Note that the driver's setting wins: even if your app asked for SetMaximumFrameLatency(1), you'll get 4 frames regardless. So the only way to guarantee no mouse lag regardless of the driver setting is for your render loop to voluntarily wait until the next vertical refresh interval, so that you never use those extra frame buffers.
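The latency figure above is just the number of queued frames times the refresh period. As a tiny sketch (the function name is made up):

```cpp
// Latency added when the CPU runs `queued_frames` frames ahead of scan-out
// on a display refreshing at `refresh_hz`.
constexpr double queued_latency_ms(int queued_frames, double refresh_hz) {
    return queued_frames * 1000.0 / refresh_hz;
}
```

For 4 pre-rendered frames at 60 Hz this gives about 66.7 ms; with the queue limited to a single frame it drops to about 16.7 ms.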
IDXGIOutput::WaitForVBlank() is intended for this purpose. But it does not work! When I call the following
<render something in ~10ms>
pSwapChain->Present(1,0);
pOutput->WaitForVBlank();
and I measure the time it takes for the WaitForVBlank() call to return, I am seeing it alternate between 6 ms and 22 ms, roughly.
How can that happen? How could WaitForVBlank() ever take longer than 16.7 ms to complete? In DX9 we solved this problem using GetRasterStatus() to implement our own, much more accurate version of WaitForVBlank. But that call is no longer available in DX11.
Is there any other way to guarantee that my frame is exactly aligned with the monitor's refresh rate? Is there another way to spy on the current scanline like GetRasterStatus used to do?
I previously tried using OpenGL, with the WGL_EXT_swap_control extension, by drawing and then calling
SwapBuffers(g_hDC);
glFinish();
That glFinish() or glFlush() is superfluous; SwapBuffers already implies a flush.
Could it be that you have set "force V-Blank / V-Sync off" in your graphics driver settings?
We currently use DX9 and want to switch to DX11. We currently use GetRasterStatus() to manually sync to the screen. That goes away in DX11, but I've found that creating a DirectDraw7 device doesn't seem to disrupt DX11. So just add this to your code and you should be able to get the scanline position.
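// Requires <ddraw.h>; link with ddraw.lib and dxguid.lib.
IDirectDraw7* ddraw = nullptr;
HRESULT hr = DirectDrawCreateEx( NULL, reinterpret_cast<LPVOID*>(&ddraw), IID_IDirectDraw7, NULL );
DWORD scanline = 0;
if ( SUCCEEDED(hr) )
    hr = ddraw->GetScanLine( &scanline ); // fails with DDERR_VERTICALBLANKINPROGRESS during vblank
// Call ddraw->Release() once the scanline is no longer needed.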
On Windows 8.1 and Windows 10, you can make use of the DXGI 1.3 DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT. See MSDN. The sample here is for Windows 8 Store apps, but it should be adaptable to classic Win32 window swap chains as well.
You may find this video useful as well.
When creating a Direct3D device, set PresentationInterval parameter of the D3DPRESENT_PARAMETERS structure to D3DPRESENT_INTERVAL_DEFAULT.
If you run in kernel mode (ring 0), you can attempt to read bit 3 from the VGA input status register (03bah, 03dah). The information is quite old, and although it was hinted here that the bit might have changed location or may be obsolete in later versions of Windows (2000 and up), I actually doubt this. The second link has some very old source code that attempts to expose the vblank signal for old Windows versions. It no longer runs, but in theory rebuilding it with the latest Windows SDK should fix this.
The difficult part is building and registering a device driver that exposes this information reliably and then fetching it from your application.

Is double buffering needed any more

As today's cards seem to keep a list of render commands and flush only on a call to glFlush or glFinish, is double buffering really needed any more? An OpenGL game I am developing on Linux (ATI Mobility Radeon card) with SDL/OpenGL actually flickers less when SDL_GL_SwapBuffers() is replaced by glFinish() and with SDL_GL_SetAttribute(SDL_GL_DOUBLEBUFFER,0) in the init code. Is this a particular case of my card, or are such things likely on all cards?
EDIT: I've discovered that the cause of this is KWin. It appears that, as datenwolf said, compositing without sync was the cause. When I switched off KWin compositing, the game worked fine without ANY source code patches.
Double buffering and glFinish are two very different things.
glFinish blocks the program, until all drawing operations are completed.
Double buffering is used to hide the rendering process from the user. Without double buffering, each and every single drawing operation would become visible immediately, assuming that the display refresh frequency is infinitely high. In practice you will get some display artifacts, like parts of the scene visible in one state, the rest not visible or in some other state, the picture could be incomplete, etc. Double buffering avoids this by first rendering into a back buffer, and only after the rendering has been finished swapping this back with the front buffer, that gets sent to the display device.
Nowadays compositing window management has become prevalent: Windows has Aero, MacOS X has Quartz Extreme, and on Linux at least Unity and the GNOME 3 shell use compositing if available. The point is: compositing technically creates double buffering. Windows draw to offscreen buffers, and from these the final screen is composited. So if you're running on a machine with compositing, double buffering is kind of redundant if performed in your program; all it would take is some kind of synchronization mechanism to tell the compositor when the next frame is ready. MacOS X has this. X11 still lacks a proper synchronization scheme; see this post on the mailing list: http://lists.freedesktop.org/archives/xorg/2004-May/000607.html
TL;DR: Double buffering and glFinish are different things, and you need double buffering (of some sort) to make things look good.
I would expect that it has more to do with what you're rendering or your hardware than anything that could be generalized to something not on your machine. So no: don't try to do this.
Oh, and don't forget multisampling. Many implementations only multisample the back buffer; the front buffer is not multisampled. Doing a swap will downsample from the multisampled buffer.

Perfect V-sync implementation for a lightweight OpenGL game: need one tidbit of information

In the game our Internet-assembled team is programming, we're assuming everybody from our audience will have WAY over fullspeed in the game.
So, to save video RAM, and hopefully give a little more idle time to the graphics card, using V-sync without double buffering would be our best option. So, in OpenGL, we need to know how to do that.
From my understanding, V-sync is when the graphics card is paused once it's done rendering a single frame until that frame has finished being sent to the display device. Double buffering doesn't pause render operations (or maybe it does, or maybe it's implementation-specific; not sure), because it instead draws to a second buffer before copying to the framebuffer, so that the monitor either gets the full frame or no new frame at all (specifically, the last stored image in the framebuffer). Well, we don't need that feature, as long as the graphics card just writes to the framebuffer ONLY when it damn needs to.
This is a pretty slow online game (But it's VERY creative ^_^). There's very little realtime action. Therefore, extremely precise user input is not a necessity; it can be captured from the OS as a single unit any time before rendering a frame.
So, in order to do EXACTLY this, I need to be able to get a "Frame has finished sending to monitor" message from OpenGL. Is it possible? If not, what is the best alternative?
The game is being programmed for Windows only at the moment but should have work done for Linux in a few months.
You suffer from a misconception about what V-Sync does. There's a part of video RAM that's continuously sent to the display device at a constant rate, the frame refresh rate. So immediately after a full frame has been sent, the next frame gets sent, after a very short blank time. But the time between sending frames is far shorter than the time it takes to send the full frame.
What happens without V-Sync is that operations on the contents of the framebuffer become visible. For example, if the frame is filled alternately with red and green and there's no V-Sync, you'll see red and green bands on the monitor. To avoid this, V-Sync swaps the pointer the display driver uses to access the framebuffer just after a full frame has been sent.
Which brings us to what doublebuffering does. Without doublebuffering there's little use for V-Sync. The action triggered by V-Sync must happen very, very fast. So this boils down to swapping a pointer or a very fast blitting operation (potentially by simply setting CoW attributes for the GPU's MMU).
Without doublebuffering and without V-Sync, the effect is that one can see the picture being rendered piece by piece into the framebuffer. Of course, if rendering happens faster than a frame period, this has the effect that from the top down you'll see only a sparsely populated image, with more and more content being visible toward the bottom, and somewhere in between it'll hit the lower screen edge, wrapping around to the top. The intersection line will be moving.
TL;DR: Just use double buffering and enable V-Sync for the buffer swap. Don't be afraid of memory consumption. All GPUs in circulation today have more than enough RAM to easily provide the memory for doublebuffered colour planes. Just do the math: 1920x1200 * RGB ≈ 6.6 MiB; even the smallest GPUs in PCs today deliver at least 128 MiB of RAM. For mobile devices, take the iPad: 1024*768 * RGB = 2.25 MiB vs. 32 MiB for graphics. The UI of the iPad is doublebuffered anyway.
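The memory math above is easy to redo for other resolutions. A small sketch, with made-up helper names, that counts only the colour planes (no depth, stencil or multisample storage):

```cpp
#include <cstdint>

// Bytes needed for `buffers` colour buffers of the given size.
constexpr std::uint64_t color_buffer_bytes(std::uint64_t width, std::uint64_t height,
                                           std::uint64_t bytes_per_pixel,
                                           std::uint64_t buffers) {
    return width * height * bytes_per_pixel * buffers;
}

// Converts a byte count to MiB (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::uint64_t bytes) {
    return bytes / (1024.0 * 1024.0);
}
```

For example, to_mib(color_buffer_bytes(1920, 1200, 3, 1)) is about 6.6, and the same resolution double-buffered at 4 bytes per pixel comes to roughly 17.6 MiB, which is still small next to 128 MiB.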
You can use wglGetProcAddress to get the address of wglSwapIntervalEXT, and then call wglSwapIntervalEXT(1); to synchronize updates with the vertical sync. When you do this, you don't get a message at the vertical sync; instead, glFlush simply doesn't return until a vertical retrace has happened and the screen has been updated. So, you have a WM_PAINT handler that looks something like this:
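case WM_PAINT: {                  // inside the window procedure; hwnd is its window handle
    PAINTSTRUCT ps;
    HDC hdc = BeginPaint(hwnd, &ps);
    wglMakeCurrent(hdc, hglrc);   // hglrc: the rendering context created earlier (assumed name)
    // ... do drawing ...
    glFlush();
    EndPaint(hwnd, &ps);
    return 0;
}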
The glFlush is needed in any case, to ensure the drawing you've done gets sent to the screen.