Multisampling and memory usage - OpenGL

The naive interpretation of multisampling would imply that, for instance, 8x MSAA would require a framebuffer that takes 8 times the space of a non-multisampled framebuffer, for all the duplicated samples. Since the latest video cards support even 32x MSAA, that would mean that just the color buffer of a 1600x1200 output would use 1600·1200·4·32 = ~245 MB.
Is this actually the case? I mean, I realize that potential memory optimizations are likely to be implementation-dependent, but is there any information on this? Should I be extremely conscious of, for instance, allocating multisampled textures? (This is my main question.)
I'm asking in the context of OpenGL, but I don't reckon this would be different between DirectX and OpenGL.
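
For reference, here is a minimal sketch of what allocating a multisampled color buffer looks like in OpenGL 3.2+ (the 8-sample count and the 1600x1200 size are just the figures from the question); how much memory the driver actually reserves behind these calls is implementation-dependent:

    /* Sketch only: assumes a current OpenGL 3.2+ context. */
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D_MULTISAMPLE, tex);
    /* 8 samples per pixel; whether the driver stores 8 full samples
       (roughly 8x the memory) or uses a compressed layout is up to
       the implementation. */
    glTexImage2DMultisample(GL_TEXTURE_2D_MULTISAMPLE, 8, GL_RGBA8,
                            1600, 1200, GL_TRUE);

    /* Renderbuffer alternative, if you only need to render into it: */
    GLuint rbo;
    glGenRenderbuffers(1, &rbo);
    glBindRenderbuffer(GL_RENDERBUFFER, rbo);
    glRenderbufferStorageMultisample(GL_RENDERBUFFER, 8, GL_RGBA8, 1600, 1200);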

Related

Is glClear(GL_COLOR_BUFFER_BIT) preferred when the whole framebuffer is overwritten?

I've seen different opinions.
For now, I'm only concerned about color data.
In Chapter 28. Graphics Pipeline Performance, it says:
Avoid extraneous color-buffer clears. If every pixel is guaranteed to
be overwritten in the frame buffer by your application, then avoid
clearing color, because it costs precious bandwidth.
In How does glClear() improve performance?, it quotes from Apple's Technical Q&A on addressing flickering (QA1650):
You must provide a color to every pixel on the screen. At the
beginning of your drawing code, it is a good idea to use glClear() to
initialize the color buffer. A full-screen clear of each of your
color, depth, and stencil buffers (if you're using them) at the start
of a frame can also generally improve your application's performance.
And one answer in that post:
By issuing a glClear command, you are telling the hardware that you do
not need previous buffer content, thus it does not need to copy the
color/depth/whatever from the framebuffer to the smaller tile memory.
To that answer, my question is:
If there is no blending, why do we need to read color data from the framebuffer?
(For now, I'm only concerned about color data.)
But anyway, in general, do I need to call glClear(GL_COLOR_BUFFER_BIT)?
There are a lot of different kinds of hardware. On hardware that was prevalent when GPU Gems #1 was printed, the advice quoted above (avoid extraneous color-buffer clears) was sound. Nowadays it no longer is.
Once upon a time, clearing buffers actually meant that the hardware would go to each pixel and write the clear value. This process obviously took a non-trivial amount of GPU time, so high-performance application developers did their best to avoid incurring the wrath of the clear operation.
Nowadays (and by which, I mean pretty much any GPU made in the last 8-10 years at least), graphics chips are smarter about clears. Instead of doing a clear, they play games with the framebuffer's caches.
The value a framebuffer image is cleared to matters when doing read/modify/write operations. This includes blending and such, but it also includes any form of depth or stencil testing. In order to do an RMW operation, you must first read the value that's there.
This is where the cleverness comes in. When you "clear" a framebuffer image, nothing gets written. Instead, the framebuffer image's address space is invalidated. When a read operation happens to an invalidated address, it simply returns the clear value. This costs zero bandwidth. Indeed, it saves bandwidth, because the read operation doesn't actually have to read memory. It just fetches a clear value.
Depending on how the cache works, this may even be faster when doing pure write operations. But that rather depends on the particular hardware.
For mobile hardware that uses tile-based rendering, this matters even more. Before a tile can begin processing, it has to read the current values of the framebuffer images. If the images are cleared, it doesn't need to read anything; it simply sets the tile memory to the clear color.
This case matters a lot even if you're not blending to the framebuffer. Why? Because neither the GPU nor the API knows that you won't be blending. It only knows that you're going to perform some number of rendering operations to that image. So it must assume the worst and read the image into the tiles. Unless you cleared it beforehand, of course.
In short, when using those images for framebuffers, clearing the images first is generally no slower than not clearing the images.
The above all assumes that you clear the entire image. If you're only clearing a sub-region of the image, then such optimizations are less likely to happen. Though it may still be possible, at least for the optimizations that are based on cache behavior.
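
To make the takeaway concrete, here is a minimal per-frame sketch (plain OpenGL, nothing application-specific assumed): clear every attachment you use once at the start of the frame, even if every pixel will be overwritten.

    /* Start of frame: a full-screen clear of all buffers in use.
       On modern hardware this is typically a cheap metadata operation,
       and on tilers it avoids reading the old framebuffer into tile memory. */
    glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
    glClearDepth(1.0);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

    /* ... issue the frame's draw calls ... */

    /* Swap buffers via your windowing library (GLFW, SDL, etc.). */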

Why does allocating a large number of VBOs cause performance issues?

I have an application that allocates ~300 VBOs. However, only 40 of these are used for draw commands each frame. I've verified this with an OpenGL profiler.
I notice that if I decrease the number of VBOs, performance is much improved. However, given that most of the VBOs are unused most of the time, I'm surprised this is a problem. I'd assume that most of the VBOs don't have memory allocated to them, since I haven't even called glBufferData on the unused VBOs.
Does anyone know why having extra unused VBOs would cause a performance hit? I'm guessing it's probably driver-dependent (I have an Nvidia GTX 460).
Also, I'd be interested in ways to combine a bunch of particle systems (most of which are unused during any given frame) into a single VBO so that I don't run into this issue.
EDIT: It turns out the performance issue wasn't related to the VBOs. However, I learned a lot about streaming data into VBOs while investigating. This article was very interesting: http://onrendering.blogspot.com/2011/10/buffer-object-streaming-in-opengl.html.
It turns out that the number of VBOs was not the cause of the performance bottleneck in my case. In fact, it seems most OpenGL implementations handle large numbers of VBOs pretty well. I tested on a 2009 MacBook Air and an Nvidia GTX 460.
Tangentially related: if you're using many VBOs, there's usually a way to avoid that and gain some efficiency. In my case, I used a single streaming VBO to render particles from multiple different particle systems, instead of dedicating a VBO to each particle system. This reduced the number of batches/draw calls, and freed up some CPU cycles.
Here's more information on VBO streaming:
http://onrendering.blogspot.com/2011/10/buffer-object-streaming-in-opengl.html
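
As a rough illustration of that approach, here is a hypothetical per-frame sketch of a single streaming VBO shared by all particle systems; `Vertex`, `MAX_VERTS`, `vertexScratch`, and `gather_particles()` are made-up placeholders, and the orphaning pattern follows the article linked above:

    typedef struct { float pos[3]; float color[4]; } Vertex;  /* placeholder layout */
    #define MAX_VERTS 65536

    /* Each frame: orphan the old storage so the driver can hand back fresh
       memory instead of stalling on data still in use by the previous frame. */
    glBindBuffer(GL_ARRAY_BUFFER, streamVbo);
    glBufferData(GL_ARRAY_BUFFER, MAX_VERTS * sizeof(Vertex), NULL, GL_STREAM_DRAW);

    /* Gather live particles from every system into one CPU-side scratch array,
       then upload them all with a single call. */
    size_t count = gather_particles(vertexScratch, MAX_VERTS);
    glBufferSubData(GL_ARRAY_BUFFER, 0, count * sizeof(Vertex), vertexScratch);

    glDrawArrays(GL_POINTS, 0, (GLsizei)count);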

OpenGL - Power Of Two Textures

OpenGL uses power-of-two textures.
This is because some GPUs only accept power-of-two textures, due to mipmapping. Using these power-of-two textures causes problems when drawing a texture at a larger size than its actual dimensions.
One workaround I had thought of is to use the power-of-two ratios only when making the texture smaller than it actually is, and a 1:1 ratio when making it bigger, but will this create compatibility issues with some GPUs?
If anybody knows whether issues would occur (I cannot check this, as my GPU accepts NPOT textures), or knows a better workaround, I would be grateful.
Your information is outdated. Arbitrary-dimension textures have been supported since OpenGL 2.0, which was released in 2004. All contemporary GPUs support NPOT textures very well, and without any significant performance penalty.
There's no need for any workarounds.

Hardware support for non-power-of-two textures

I have been hearing controversial opinions on whether it is safe to use non-power-of-two textures in OpenGL applications. Some say all modern hardware supports NPOT textures perfectly, others say it doesn't or that there is a big performance hit.
The reason I'm asking is because I want to render something to a frame buffer the size of the screen (which may not be a power of two) and use it as a texture. I want to understand what is going to happen to performance and portability in this case.
Arbitrary texture sizes have been a core part of OpenGL ever since OpenGL 2.0, which was a long time ago (2004). All GPUs designed since then support NPOT textures just fine. The only question is how good the performance is.
However, ever since GPUs became programmable, optimizations based on the predictable access patterns of fixed-function texture fetches became more or less obsolete; GPUs now have caches optimized for general data locality, so performance is not much of an issue here either. In fact, with power-of-two textures you may need to upscale the data to match the format, which increases the required memory bandwidth, and memory bandwidth is the #1 bottleneck of modern GPUs. So using a slightly smaller NPOT texture may actually improve performance.
In short: you can use NPOT textures safely, and performance is not a big issue either.
All modern APIs (except some versions of OpenGL ES, I believe) on modern graphics hardware (the last 10 or so generations from ATi/AMD/Nvidia and the last couple from Intel) support NPOT textures just fine. They've been in use, particularly for post-processing, for quite some time.
However, that's not to say they're as convenient as power-of-2 textures. One major case is memory packing; drivers can often pack textures into memory far better when they are powers of two. If you look at a texture with mipmaps, the base and all mips can be packed into an area 150% the original width and 100% the original height. It's also possible that certain texture sizes will line up memory pages with stride (texture row size, in bytes), which would provide an optimal memory access situation. NP2 makes this sort of optimization harder to perform, and so memory usage and addressing may be a hair less efficient. Whether you'll notice any effect is very much driver and application-dependent.
Offscreen effects are perhaps the most common use case for NPOT textures, especially screen-sized textures. Almost every game on the market now that performs any kind of post-processing or deferred rendering has 1-15 offscreen buffers, many of which are the same size as the screen (for some effects, half or quarter size are useful). These are generally well supported, even with mipmaps.
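
For illustration, a rough sketch of such a screen-sized (and therefore usually non-power-of-two) offscreen color target; `screenW` and `screenH` stand in for whatever your window size happens to be:

    GLuint colorTex, fbo;

    glGenTextures(1, &colorTex);
    glBindTexture(GL_TEXTURE_2D, colorTex);
    /* NPOT dimensions are fine here on any GL 2.0+ implementation. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, screenW, screenH, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, colorTex, 0);
    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
        /* handle an incomplete framebuffer */
    }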
Because NPOT textures are widely supported and almost a sure bet on desktops and consoles, using them should work just fine. If you're worried about platforms or hardware where they may not be supported, easy fallbacks include using the nearest power-of-two size (which may cause slightly lower quality, but will work) or dropping the effect entirely (with obvious consequences).
I have a lot of experience making games (4+ years) and using texture atlases for iOS & Android through cross-platform development using OpenGL 2.0.
Stick with POT textures with a maximum size of 2048x2048, because some devices (especially the cheap ones with cheap hardware) still don't support non-power-of-two texture sizes; I know this from real-life testers and from seeing it first hand. There are so many devices out there now, you never know what sort of GPU you'll be facing.
Your iOS devices will also show black squares and artefacts if you are not using POT textures.
Just a tip.
Even if arbitrary texture sizes are required by a given OpenGL version, certain video cards are still not fully compliant with OpenGL. I had a friend with an Intel card who had problems with NPOT textures (I assume current Intel cards are fully compliant).
Do you have a reason to use NPOT textures? Then do it, but remember that some old hardware may not support them, and you'll probably need a software fallback that makes your textures POT.
Don't have a reason to use NPOT textures? Then just use POT textures. (Certain compressed formats still require POT textures.)
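
If you do need the fallback mentioned above, a hypothetical sketch of padding an NPOT image into a POT texture might look like this (the function names are made up; the caller scales its texture coordinates by the returned factors so only the real image is sampled):

    static int next_pot(int v) { int p = 1; while (p < v) p <<= 1; return p; }

    /* Upload an imgW x imgH RGBA image into the lower-left corner of a
       power-of-two texture bound to GL_TEXTURE_2D. */
    static void upload_npot_as_pot(const void *pixels, int imgW, int imgH,
                                   float *uScale, float *vScale)
    {
        int potW = next_pot(imgW), potH = next_pot(imgH);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, potW, potH, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, NULL);   /* padding left undefined */
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, imgW, imgH,
                        GL_RGBA, GL_UNSIGNED_BYTE, pixels);
        *uScale = (float)imgW / potW;   /* max U coordinate of the real image */
        *vScale = (float)imgH / potH;   /* max V coordinate of the real image */
    }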

Storing many small textures in OpenGL

I'm building an OpenGL app with many small textures. I estimate that I will have a few hundred textures on the screen at any given moment.
Can anyone recommend best practices for storing all these textures in memory so as to avoid potential performance issues?
I'm also interested in understanding how OpenGL manages textures. Will OpenGL try to store them into GPU memory? If so, how much GPU memory can I count on? If not, how often does OpenGL pass the textures from application memory to the GPU, and should I be worried about latency when this happens?
I'm working with OpenGL 3.3. I intend to use only modern features, i.e. no immediate mode stuff.
If you have a large number of small textures, you would be best off combining them into a single large texture with each of the small textures occupying known sub-regions (a technique sometimes called a "texture atlas"). Switching which texture is bound can be expensive, in that it will limit how much of your drawing you can batch together. By combining into one you can minimize the number of times you have to rebind. Alternatively, if your textures are very similarly sized, you might look into using an array texture (introduction here).
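
For example, here is a minimal sketch of the array-texture route in OpenGL 3.3 (assuming every small texture is the same tileW x tileH size; `images[i]` is a placeholder for your pixel data):

    GLuint arrayTex;
    glGenTextures(1, &arrayTex);
    glBindTexture(GL_TEXTURE_2D_ARRAY, arrayTex);
    /* Allocate all layers at once, then fill them one by one. */
    glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8,
                 tileW, tileH, numLayers, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    for (int i = 0; i < numLayers; ++i) {
        glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0,
                        0, 0, i,             /* x, y offset and layer index */
                        tileW, tileH, 1,     /* one layer at a time */
                        GL_RGBA, GL_UNSIGNED_BYTE, images[i]);
    }
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    /* In GLSL, declare a sampler2DArray and sample with
       texture(tex, vec3(uv, layer)). */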
OpenGL does try to store your textures in GPU memory insofar as possible, but I do not believe it is guaranteed that they will actually reside on the graphics card.
The amount of GPU memory you have available will depend on the hardware you run on and the other demands on the system at the time you run. What exactly "GPU memory" means will also vary across machines; it can be memory that is discrete and used only by the GPU, memory shared with main memory, or some combination of the two.
Assuming your application is not constantly modifying the textures you should not need to be particularly concerned about latency issues. You will provide OpenGL with the textures once and from that point forward it will manage their location in memory. Assuming you don't need more texture data than can easily fit in GPU memory every frame, it shouldn't be cause for concern. If you do need to use a large amount of texture data, try to ensure that you batch all use of a certain texture together to minimize the number of round trips the data has to make. You can also look into the built-in texture compression facilities, supplying something like GL_COMPRESSED_RGBA to your call to glTexImage2D, see the man page for more details.
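
For instance, driver-side compression can be requested simply by choosing a generic compressed internal format (a sketch; the exact on-GPU format and quality are up to the implementation, so for predictable results you would precompress offline):

    glTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    /* Optionally check what the driver actually chose: */
    GLint compressed = GL_FALSE, internalFmt = 0;
    glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_COMPRESSED, &compressed);
    glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_INTERNAL_FORMAT, &internalFmt);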
Of course, as always, your best bet will be to test these things yourself in a situation close to your expected use case. OpenGL provides a good number of guarantees, but much will vary depending on the particular implementation.