Why does allocating a large number of VBOs cause performance issues? - opengl

I have an application that allocates ~300 VBOs. However, only 40 of these are used for draw commands each frame. I've verified this with an OpenGL profiler.
I notice that if I decrease the number of VBOs, performance is much improved. However, given that most of the VBOs are unused most of the time, I'm surprised this is a problem. I'd assume that most of the VBOs don't have memory allocated to them, since I haven't even called glBufferData on the unused VBOs.
Does anyone know why having extra unused VBOs would cause a performance hit? I'm guessing it's probably driver-dependent (I have an Nvidia GTX 460).
Also, I'd be interested in ways to combine a bunch of particle systems (most of which are unused during any given frame) into a single VBO so that I don't run into this issue.
EDIT: It turns out that the performance issue wasn't related to the VBOs. However, I learned a lot about streaming data into VBOs while investigating. This article was very interesting: http://onrendering.blogspot.com/2011/10/buffer-object-streaming-in-opengl.html.

It turns out that the number of VBOs was not the cause of the performance bottleneck in my case. In fact, it seems most OpenGL implementations handle large numbers of VBOs pretty well. I tested on a 2009 MacBook Air and an Nvidia GTX 460.
Tangentially related: if you're using many VBOs, there's usually a way to avoid that and gain some efficiency. In my case, I used a single streaming VBO to render particles from multiple different particle systems, instead of dedicating a VBO to each particle system. This reduced the number of batches/draw calls, and freed up some CPU cycles.
Here's more information on VBO streaming:
http://onrendering.blogspot.com/2011/10/buffer-object-streaming-in-opengl.html
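As a rough illustration of the single-streaming-VBO idea described above, here is a minimal CPU-side sketch. It assumes each particle system can hand over its vertices as a plain array; the names ParticleVertex, ParticleSystem, DrawRange and packActiveSystems are hypothetical and not taken from the original application. It only shows how active systems could be packed into one staging array while remembering where each one landed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout; a real one would match the shader inputs.
struct ParticleVertex {
    float x, y, z;
};

// Hypothetical particle system: only active systems contribute vertices.
struct ParticleSystem {
    bool active;
    std::vector<ParticleVertex> vertices;
};

// Where one system landed inside the shared buffer, counted in vertices.
struct DrawRange {
    std::size_t first;
    std::size_t count;
};

// Appends the vertices of every active system to one staging array and
// records the range each system occupies. Inactive systems get count == 0.
std::vector<DrawRange> packActiveSystems(const std::vector<ParticleSystem>& systems,
                                         std::vector<ParticleVertex>& staging) {
    staging.clear();
    std::vector<DrawRange> ranges;
    ranges.reserve(systems.size());
    for (const ParticleSystem& s : systems) {
        DrawRange r{staging.size(), 0};
        if (s.active) {
            staging.insert(staging.end(), s.vertices.begin(), s.vertices.end());
            r.count = s.vertices.size();
        }
        ranges.push_back(r);
    }
    assert(ranges.size() == systems.size());
    return ranges;
}
```

Each frame, the staging array would then be uploaded into the one VBO (for example with glBufferData or glBufferSubData) and drawn either with a single glDrawArrays call, if all systems share the same state, or with one call per non-empty range.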

Related

How well do opengl drivers handle large texture arrays in limited VRAM

My game engine tries to allocate large texture arrays so that it can batch the majority (if not all) of its draw calls together. The array may become large enough that it fails to allocate, at which point I'd (repeatedly) split the texture array in half.
Is it bad design to push the boundaries until receiving a glGetError:Out of memory and scale back from there?
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory? As in, is it less ideal for the graphics driver to be manipulating a few large texture arrays rather than many individual textures when dealing with other OS operations?
It is hard to evaluate how well drivers handle large texture arrays. Behaviour of different drivers may vary a lot.
While using a texture array can improve performance by reducing the number of draw calls, that should not be the main goal. Reducing draw calls is somewhat important on mobile platforms, and even there, several dozen draw calls are not a problem. I'm not sure what your concerns are or what exactly you are trying to optimise, but I would recommend using the GPU vendor's profiling tools before doing any optimisation.
Is it bad design to push the boundaries until receiving a glGetError:Out of memory and scale back from there?
This is what is typically done when data is loaded onto the GPU dynamically. Once the error is received, old data should be unloaded to make room for the new data.
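The "push until it fails, then scale back" approach from the question can be sketched as a simple halving loop. This is only an outline under assumptions: tryAllocate is a hypothetical callback that would attempt the real allocation (for instance a glTexImage3D call followed by a glGetError check for GL_OUT_OF_MEMORY) and report whether it succeeded.

```cpp
#include <cassert>
#include <functional>

// Tries progressively smaller layer counts until the caller-supplied
// allocation attempt succeeds. Returns the layer count that worked,
// or 0 if even a single layer could not be allocated.
int allocateWithBackoff(int requestedLayers,
                        const std::function<bool(int)>& tryAllocate) {
    assert(requestedLayers > 0);
    for (int layers = requestedLayers; layers > 0; layers /= 2) {
        if (tryAllocate(layers)) {
            return layers;
        }
    }
    return 0;
}
```

Keeping the allocation attempt behind a callback means the back-off policy can be exercised with a fake allocator, without a GL context.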
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory?
There is no way to check whether data was swapped to GTT (if the driver supports GTT at all). The driver handles this on its own, and the OpenGL API offers no access to it. You may need to use profiling tools such as Nsight if you are using an NVIDIA GPU.
However, if you are planning to have one giant texture array, it must fit into VRAM as a whole; it cannot be partially in VRAM and partially in GTT. I would not recommend relying on GTT at all.
It must fit into VRAM because, when you bind it, the driver cannot know beforehand which layers will be used and which won't, since the selection happens in the shader.
Although texture arrays and 3D textures are conceptually different, at the hardware level they work very similarly; the difference is that the former filters in two dimensions and the latter in three.
I experimented with large 3D textures for a while on a GeForce 1070 (it has 6GB), and it handles ~1GB textures very well. The largest texture I managed to load was around 3GB (2048x2048x7**), but allocation at that size often fails with an error. Even when there should be enough free VRAM to fit the texture, allocating such a big chunk may fail for various reasons. So I would not recommend allocating textures that are comparable to the total size of VRAM unless it is absolutely necessary.

OpenGL program with Intel HD and NVidia GPU usage

I am new to OpenGL and I would like somebody to explain to me how a program uses the GPU.
I have an array of triangles (a class that contains 3 points). Here is the code that draws them (I know these functions are deprecated).
glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    glVertex3d((GLdouble)trarr[i].p1().x(), (GLdouble)trarr[i].p1().y(), (GLdouble)trarr[i].p1().z());
    glVertex3d((GLdouble)trarr[i].p2().x(), (GLdouble)trarr[i].p2().y(), (GLdouble)trarr[i].p2().z());
    glVertex3d((GLdouble)trarr[i].p3().x(), (GLdouble)trarr[i].p3().y(), (GLdouble)trarr[i].p3().z());
}
glEnd();
I also use deprecated functions for rotating, transforming, etc.
When the size of the array is bigger than 50k, the program runs very slowly.
I tried using only the Intel HD GPU or only the NVIDIA GTX 860M (the default NVIDIA program lets you choose the GPU), but both run very slowly. The Intel HD may even be a bit faster.
So why is there no difference between these two GPUs?
And will the program run faster if I use shaders?
The probable bottleneck is looping over the vertices, accessing the array, and pulling out the vertex data 50,000 times per render before sending it to the GPU.
Using a VBO would indeed be faster: it reduces the cost of extracting the data and sending it to the GPU to a one-time cost at initialization.
Even using a user memory buffer (a client-side vertex array) would speed it up, because you won't be making 50k function calls; the driver can just memcpy the relevant data.
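To make the "extract once at initialization" step concrete, here is a small sketch that flattens a triangle array into one contiguous float array. Point3 and Triangle are hypothetical stand-ins for the asker's classes (which expose p1().x() style accessors), and flattenTriangles is an invented helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical stand-ins for the point and triangle classes in the question.
struct Point3 { double x, y, z; };
struct Triangle { Point3 p1, p2, p3; };

// Copies every vertex into one contiguous float array (x, y, z per vertex),
// converting from double to float once instead of on every frame.
std::vector<float> flattenTriangles(const std::vector<Triangle>& tris) {
    std::vector<float> out;
    out.reserve(tris.size() * 9);  // 3 vertices * 3 components per triangle
    for (const Triangle& t : tris) {
        for (const Point3* p : {&t.p1, &t.p2, &t.p3}) {
            out.push_back(static_cast<float>(p->x));
            out.push_back(static_cast<float>(p->y));
            out.push_back(static_cast<float>(p->z));
        }
    }
    assert(out.size() == tris.size() * 9);
    return out;
}
```

The resulting array is the kind of data that could be handed to glBufferData once (or to glVertexPointer as a client-side array) and then drawn with a single glDrawArrays call per frame.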
When the size of the array is bigger than 50k, the program runs very slowly.
The major bottleneck when drawing in immediate mode is that all your vertices have to be transferred from your program's memory to GPU memory every frame. The bus between the CPU and the GPU can only transfer a limited amount of data, so the best guess is that 50k triangles are simply more than the bus can transport. Another problem is that the driver has to process all the commands you send it on the CPU, which can also be a big overhead.
So why is there no difference between these two GPUs?
There is (in general) a huge performance difference between an Intel HD GPU and an NVIDIA card, but the bus feeding them might be the same.
And will the program run faster if I use shaders?
It will not benefit directly from the use of shaders, but it will definitely benefit from storing the vertices once in GPU memory (see VBO/VAO). The second improvement is that you can render the whole VBO with a single draw call, which decreases the number of instructions the CPU has to handle.
Seeing the same performance with two GPUs that have substantially different performance potential certainly suggests that your code is CPU limited. But I very much question some of the theories about the performance bottleneck in the other answers/comments.
Some simple calculations suggest that memory bandwidth should not come into play at all. With 50,000 triangles, with 3 vertices each, and 24 bytes per vertex, you're looking at 3,600,000 bytes of vertex data per frame. Say you're targeting 60 frames/second, this is a little over 200 MBytes/second. That's less than 1% of the memory bandwidth of a modern PC.
The most practical implementation of immediate mode on a modern GPU is for drivers to collect all the data into buffers, and then submit it all at once when a buffer fills up. So there's no need for a lot of kernel calls, and the data for each vertex is certainly not sent to the GPU separately.
Driver overhead is most likely the main culprit. With 50,000 triangles, and 3 API calls per triangle, this is 150,000 API calls per frame, or 9 million API calls/second if you're targeting 60 frames/second. That's a lot! For each of these calls, you'll have:
Looping and array accesses in your own code.
The actual function call.
Argument passing.
State management and logic in the driver code.
etc.
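The back-of-the-envelope numbers in this answer (vertex data per second and API calls per second) can be reproduced with a few lines of arithmetic. The struct and function names below are made up for illustration; the inputs are the figures quoted above (50,000 triangles, 24 bytes per vertex, 60 frames/second, one glVertex call per vertex).

```cpp
#include <cassert>
#include <cstdint>

struct FrameCost {
    std::uint64_t bytesPerFrame;
    std::uint64_t bytesPerSecond;
    std::uint64_t callsPerFrame;
    std::uint64_t callsPerSecond;
};

// Estimates immediate-mode traffic assuming one API call per vertex
// and three vertices per triangle.
FrameCost estimateImmediateModeCost(std::uint64_t triangles,
                                    std::uint64_t bytesPerVertex,
                                    std::uint64_t framesPerSecond) {
    assert(bytesPerVertex > 0 && framesPerSecond > 0);
    const std::uint64_t vertices = triangles * 3;
    FrameCost c;
    c.bytesPerFrame = vertices * bytesPerVertex;
    c.bytesPerSecond = c.bytesPerFrame * framesPerSecond;
    c.callsPerFrame = vertices;
    c.callsPerSecond = c.callsPerFrame * framesPerSecond;
    return c;
}
```

With the inputs 50000, 24 and 60 this gives 3,600,000 bytes per frame, 216,000,000 bytes per second, 150,000 calls per frame and 9,000,000 calls per second, matching the estimates in the text.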
One important aspect that makes this much worse than it needs to be: You're using double values for your coordinates. That doubles the amount of data that needs to be passed around compared to using float values. And since the OpenGL vertex pipeline operates in single precision (*), the driver will have to convert all the values to float.
I suspect that you could get a significant performance improvement even with using the deprecated immediate mode calls if you started using float for all your coordinates (both your own storage, and for passing them to OpenGL). You could also use the version of the glVertex*() call that takes a single argument with a pointer to the vector, instead of 3 separate arguments. This would be glVertex3fv() for float vectors.
Making the move to VBOs is of course the real solution. It will reduce the number of API calls by orders of magnitude, and avoid any data copying as long as the vertex data does not change over time.
(*) OpenGL 4.1 adds support for double vertex attributes, but they require the use of specific API functions, and only make sense if single precision floats really are not precise enough.

Should I make my raytracer with GLSL or OpenCL, and how do I get a large 1GB buffer?

Right now, I have implemented a GLSL raytracer that uses a buffer texture to access the acceleration structure used for ray tracing.
I'm traversing the texture with a while loop, and it's very costly, but I think there's hope for making it faster. But there seems to be a wall that I'm going to hit that I can't seem to fix. Buffer textures have a limited size, on my GPU it was around 200mb, I forget exactly what it was.
I need my data structure to be around 1gb.
Someone recommended OpenCL to me to solve the problem, so I studied OpenCL and now I'm familiar with the API. However, I discovered that OpenCL has a similar problem with maximum buffer sizes. Most GPUs will only give you access to 1/4 of total VRAM in a single buffer. Most GPUs have 1 or 2 GB of VRAM, so creating one buffer for my structure will not work.
It seems like the only way to get my data structure onto the GPU is to split it up into multiple buffers. My question is: what is the most effective and fastest way to do this, and would it be wise to continue in OpenCL or GLSL? I know branching buffer/texture reads can be costly, and it seems like that's what I would have to do if I split it up. You could avoid the branch if you put the buffers in an array and indexed into it, but I have found indexing in GLSL to be EXTREMELY slow, even when it's just indexing a local array (why is this?). I wonder whether the same slowness would occur if you grouped buffers into an array in a kernel, if that's even possible.
Current devices with updated drivers can access more than that. AMD has an envvar that lets you set it even higher.
OpenCL could be a good solution for you.
Also, OpenGL 4.3 added Compute Shaders which are extremely OpenCL-like and perfect for folks with OpenGL experience and an existing OpenGL application.
Regarding performance, looping in your kernel can be a problem due to work group divergence, and if you don't have many work items active, it can reduce device occupancy.
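The buffer-splitting idea raised in the question comes down to mapping a global element index to a (buffer, local offset) pair. The sketch below shows only that index arithmetic on the host side; it assumes every buffer holds the same power-of-two number of elements so the mapping is a shift and a mask. ChunkAddress, locate and buffersNeeded are hypothetical names, and whether a kernel or shader can select among several buffers without a costly branch depends on the device and API.

```cpp
#include <cassert>
#include <cstdint>

// Which buffer an element lives in, and its index inside that buffer.
struct ChunkAddress {
    std::uint32_t buffer;
    std::uint64_t offset;
};

// Maps a global element index to a buffer number and local index,
// assuming each buffer holds 2^log2ElementsPerBuffer elements.
ChunkAddress locate(std::uint64_t globalIndex, unsigned log2ElementsPerBuffer) {
    assert(log2ElementsPerBuffer < 64);
    const std::uint64_t mask = (std::uint64_t(1) << log2ElementsPerBuffer) - 1;
    ChunkAddress a;
    a.buffer = static_cast<std::uint32_t>(globalIndex >> log2ElementsPerBuffer);
    a.offset = globalIndex & mask;
    return a;
}

// Number of equally sized buffers needed to hold totalElements (rounded up).
std::uint64_t buffersNeeded(std::uint64_t totalElements, unsigned log2ElementsPerBuffer) {
    assert(log2ElementsPerBuffer < 64);
    const std::uint64_t perBuffer = std::uint64_t(1) << log2ElementsPerBuffer;
    return (totalElements + perBuffer - 1) / perBuffer;
}
```

For example, a 1 GiB structure made of 16-byte nodes has 2^26 elements; with 256 MiB buffers (2^24 elements each) it would need four buffers, and element 2^24 + 5 would be element 5 of buffer 1.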

Hardware support for non-power-of-two textures

I have been hearing conflicting opinions on whether it is safe to use non-power-of-two textures in OpenGL applications. Some say all modern hardware supports NPOT textures perfectly; others say it doesn't, or that there is a big performance hit.
The reason I'm asking is because I want to render something to a frame buffer the size of the screen (which may not be a power of two) and use it as a texture. I want to understand what is going to happen to performance and portability in this case.
Arbitrary texture sizes have been part of core OpenGL since OpenGL 2.0, which was released a long time ago (2004). All GPUs designed since then support NP2 textures just fine. The only question is how good the performance is.
However, ever since GPUs became programmable, optimizations based on the predictable access patterns of fixed-function texture fetches have become largely obsolete. GPUs now have caches optimized for general data locality, so performance is not much of an issue either. In fact, with P2 textures you may need to upscale the data to match the format, which increases the required memory bandwidth, and memory bandwidth is the #1 bottleneck of modern GPUs. So using a slightly smaller NP2 texture may actually improve performance.
In short: you can use NP2 textures safely, and performance is not much of an issue either.
All modern APIs (except some versions of OpenGL ES, I believe) on modern graphics hardware (the last 10 or so generations from ATi/AMD/nVidia and the last couple from Intel) support NP2 texture just fine. They've been in use, particularly for post-processing, for quite some time.
However, that's not to say they're as convenient as power-of-2 textures. One major case is memory packing; drivers can often pack textures into memory far better when they are powers of two. If you look at a texture with mipmaps, the base and all mips can be packed into an area 150% the original width and 100% the original height. It's also possible that certain texture sizes will line up memory pages with stride (texture row size, in bytes), which would provide an optimal memory access situation. NP2 makes this sort of optimization harder to perform, and so memory usage and addressing may be a hair less efficient. Whether you'll notice any effect is very much driver and application-dependent.
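The claim that a power-of-two base image plus its mips fits in an area 150% of the width and 100% of the height can be checked with a small layout sketch. This shows one possible arrangement (base level on the left, all smaller levels stacked in a column to its right); it is not meant to describe what any particular driver actually does, and MipRect and packMipChain are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Placement of one mip level inside the packed area, in texels.
struct MipRect {
    int level;
    int x, y;
    int width, height;
};

// Places level 0 at the origin and stacks levels 1..n vertically
// in a column immediately to the right of the base level.
std::vector<MipRect> packMipChain(int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<MipRect> rects;
    rects.push_back(MipRect{0, 0, 0, width, height});
    int w = width, h = height, y = 0, level = 0;
    while (w > 1 || h > 1) {
        w = std::max(1, w / 2);
        h = std::max(1, h / 2);
        ++level;
        rects.push_back(MipRect{level, width, y, w, h});
        y += h;
    }
    return rects;
}
```

For a 256x256 base this produces nine levels; the widest mip is 128 texels, so everything stays within 384 texels horizontally, and the stacked mip heights sum to 255, which stays within the base height of 256.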
Offscreen effects are perhaps the most common use case for NP2 textures, especially screen-sized textures. Almost every game on the market now that performs any kind of post-processing or deferred rendering has 1-15 offscreen buffers, many of which are the same size as the screen (for some effects, half or quarter-size are useful). These are generally well-supported, even with mipmaps.
Because NP2 textures are widely supported and almost a sure bet on desktops and consoles, using them should work just fine. If you're worried about platforms or hardware where they may not be supported, easy fallbacks include using the nearest power-of-2 size (may cause slightly lower quality, but will work) or dropping the effect entirely (with obvious consequences).
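The power-of-2 fallback mentioned above needs a way to pick the fallback size. Here is a minimal sketch that rounds a dimension up to the next power of two; rounding up is an assumption made for this example (rounding down to the lower power of two is the other option and trades resolution for memory). The function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// True if v is an exact power of two (1, 2, 4, 8, ...).
bool isPowerOfTwo(std::uint32_t v) {
    return v != 0 && (v & (v - 1)) == 0;
}

// Smallest power of two that is greater than or equal to v.
std::uint32_t nextPowerOfTwo(std::uint32_t v) {
    assert(v > 0 && v <= 0x80000000u);  // result must fit in 32 bits
    std::uint32_t p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}
```

A 1920x1080 screen-sized buffer would become 2048x2048 under this scheme, with only part of the texture actually used, so texture coordinates would need to be scaled accordingly.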
I have a lot of experience (4+ years) making games and using texture atlases for iOS and Android through cross-platform development using OpenGL 2.0.
Stick with PoT textures with a maximum size of 2048x2048, because some devices (especially the cheap ones with cheap hardware) still don't support dynamic texture sizes. I know this from real-life testers and from seeing it first hand. There are so many devices out there now that you never know what sort of GPU you'll be facing.
Your iOS devices will also show black squares and artefacts if you are not using PoT textures.
Just a tip.
Even if arbitrary texture sizes are required by OpenGL X, certain video cards are still not fully compliant with OpenGL. I had a friend with an Intel card who had problems with NPOT textures (I assume Intel cards are fully compliant by now).
Do you have a reason to use NPOT textures? Then do it, but remember that some old hardware may not support them, and you'll probably need a software fallback that converts your textures to POT.
Don't have a reason to use NPOT textures? Then just use POT textures. (Certain compressed formats still require POT textures.)

Storing many small textures in OpenGL

I'm building an OpenGL app with many small textures. I estimate that I will have a few hundred textures on the screen at any given moment.
Can anyone recommend best practices for storing all these textures in memory so as to avoid potential performance issues?
I'm also interested in understanding how OpenGL manages textures. Will OpenGL try to store them into GPU memory? If so, how much GPU memory can I count on? If not, how often does OpenGL pass the textures from application memory to the GPU, and should I be worried about latency when this happens?
I'm working with OpenGL 3.3. I intend to use only modern features, i.e. no immediate mode stuff.
If you have a large number of small textures, you would be best off combining them into a single large texture with each of the small textures occupying a known sub-region (a technique sometimes called a "texture atlas"). Switching which texture is bound can be expensive, in that it will limit how much of your drawing you can batch together. By combining them into one you can minimize the number of times you have to rebind. Alternatively, if your textures are very similarly sized, you might look into using an array texture.
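For the texture atlas approach, each small texture needs the texture coordinates of its sub-region. The sketch below assumes the simplest possible layout, a uniform grid of equally sized tiles numbered row by row; UvRect and atlasTileUv are hypothetical names, and a real atlas with differently sized images would store a rectangle per image instead.

```cpp
#include <cassert>

// Normalized texture coordinates of one tile inside the atlas.
struct UvRect {
    float u0, v0;
    float u1, v1;
};

// Returns the UV rectangle of a tile in a uniform grid atlas,
// with tiles numbered left to right, then row by row.
UvRect atlasTileUv(int tileIndex, int tilesPerRow, int tilesPerColumn) {
    assert(tilesPerRow > 0 && tilesPerColumn > 0);
    assert(tileIndex >= 0 && tileIndex < tilesPerRow * tilesPerColumn);
    const int col = tileIndex % tilesPerRow;
    const int row = tileIndex / tilesPerRow;
    const float du = 1.0f / static_cast<float>(tilesPerRow);
    const float dv = 1.0f / static_cast<float>(tilesPerColumn);
    return UvRect{col * du, row * dv, (col + 1) * du, (row + 1) * dv};
}
```

In practice a little padding around each tile is often added to avoid neighbouring tiles bleeding into each other when filtering or mipmapping; that detail is left out here.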
OpenGL does try to store your textures in GPU memory insofar as possible, but I do not believe they are guaranteed to actually reside on the graphics card.
The amount of GPU memory you have available will depend on the hardware you run on and the other demands on the system at the time you run. What exactly "GPU memory" means will vary across machines: it can be discrete and used only by the GPU, shared with main memory, or some combination of the two.
Assuming your application is not constantly modifying the textures you should not need to be particularly concerned about latency issues. You will provide OpenGL with the textures once and from that point forward it will manage their location in memory. Assuming you don't need more texture data than can easily fit in GPU memory every frame, it shouldn't be cause for concern. If you do need to use a large amount of texture data, try to ensure that you batch all use of a certain texture together to minimize the number of round trips the data has to make. You can also look into the built-in texture compression facilities, supplying something like GL_COMPRESSED_RGBA to your call to glTexImage2D, see the man page for more details.
Of course, as always, your best bet will be to test these things yourself in a situation close to your expected use case. OpenGL provides a good number of guarantees, but much will vary depending on the particular implementation.