Imagine I have a vertex array and an index array. Suppose it is possible to rearrange the elements in the index array so that a single call to glDrawElements with GL_TRIANGLE_STRIP draws the desired figure. Another possibility would be to call glDrawElements with GL_TRIANGLES, but this would make the index array longer.
Does it really matter in terms of efficiency (I mean real efficiency, not micro-optimizations) which way I choose, or is the underlying routine the same anyway?
A side note: the reason I am reluctant to rearrange my elements to use GL_TRIANGLE_STRIP is that I think the triangles in the strip will have alternating winding. Am I wrong?
There's no real performance difference between a call with GL_TRIANGLE_STRIP and one with GL_TRIANGLES.
Now, if you can rearrange your indices to maximize use of the post-transform vertex cache, you can see huge performance gains. I did this years ago for rendering large terrain patches and observed 10 to 15 times FPS speedups (using some kind of Hilbert curve scheme).
I think GL_TRIANGLE_STRIP and GL_TRIANGLES are about equally efficient, provided your indices are ordered to maximize use of the post-transform vertex cache. Of course, the GL_TRIANGLES index array will take more memory to store.
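To make the size difference concrete, here is a minimal sketch drawing the same quad both ways (the vertex numbering is invented, and a vertex array is assumed to be set up already). It also bears on the side note in the question: the strip does not produce inconsistently wound faces, because OpenGL internally reverses the vertex order of every second triangle in a strip, so face culling behaves as expected.

    /* Quad as two triangles.  Vertex layout:  0---1
                                               |   |
                                               2---3   */
    GLushort stripIndices[] = { 0, 1, 2, 3 };        /* 4 indices */
    GLushort listIndices[]  = { 0, 1, 2,  2, 1, 3 }; /* 6 indices */

    /* Same picture, two ways: */
    glDrawElements(GL_TRIANGLE_STRIP, 4, GL_UNSIGNED_SHORT, stripIndices);
    glDrawElements(GL_TRIANGLES,      6, GL_UNSIGNED_SHORT, listIndices);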
There's probably not much of a performance difference. But it's still recommended that you not use either method in the main render loop. Instead use display lists. I've converted some of my OpenGL code to display lists, and it's way faster because you end up cutting out a ton of CPU->GPU communication.
There's a tutorial here: http://www.lighthouse3d.com/opengl/displaylists/
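For reference, a minimal sketch of the display-list pattern the tutorial describes (indexCount and indices are placeholders, and a vertex array is assumed to be bound; note that display lists are legacy OpenGL and were removed from the core profile):

    /* At load time: record the draw commands once. */
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
    glEndList();

    /* Each frame: replay everything with one cheap call. */
    glCallList(list);

    /* At shutdown: */
    glDeleteLists(list, 1);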
It all depends on what your driver has to do with your vertex data. If it has to do some processing (turning QUADS into TRIANGLES, as it usually does), then it will not be optimal. The only way to see what is optimal for your driver is to measure. Find an OpenGL benchmark and see which vertex primitives are optimal for your driver.
Since you want to compare GL_TRIANGLE_STRIP and GL_TRIANGLES, most likely you will find the performance loss to be minimal in this case.
I am currently learning OpenGL for 3D rendering and I can't quite wrap my head around some things regarding shaders and VBOs. I get that all VBOs share one index, and therefore you need to duplicate some data.
But when you create more VBOs, there are nearly no faces whose vertices share the same position, normal and texture coordinates, so the indices are, at least from my point of view, pretty useless; the index buffer is basically just an array of consecutive numbers.
Is there an aspect of index buffers I don't see?
The utility of index buffers is, as with the utility of all vertex specification features, dependent on your mesh data.
Most of the meshes that get used in high-performance graphics, particularly those with significant polygon density, are smooth. Normals across such meshes are primarily smooth, since the modeller is usually approximating a primarily curved surface. Oh yes, there can be some sharp edges here and there, but for the most part, each position in such models has a single normal.
Texture coordinates usually vary smoothly across meshes too. There are certainly texture coordinate edges; well-optimized UV unwrapping often produces these kinds of things. But if you have a mesh of real size (10K+ vertices), most positions have a single texture coordinate. And tangents/bitangents are based on the changes in texture coordinates, so those will match the texture topology.
Are there meshes where the normal topology is highly disjoint from the position topology? Yes. Cubes are the obvious example. But there are oftentimes needs for highly faceted geometry, either to achieve a specific look or for low-polygon uses. In these cases, ordinary indexed rendering may not be of benefit to you.
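To put numbers on the cube case, a back-of-the-envelope sketch: an index selects a complete vertex (position, normal and UV together), so each corner must be duplicated once per face that meets it.

    /* 8 corner positions, but 3 faces (3 different normals) meet at
       each corner, so indexed rendering still needs 8 * 3 = 24 unique
       vertices, referenced by 12 triangles * 3 = 36 indices. The index
       buffer saves far less here than it does on a smooth mesh.      */
    enum { CUBE_POSITIONS = 8, FACES_PER_CORNER = 3 };
    enum { CUBE_VERTICES  = CUBE_POSITIONS * FACES_PER_CORNER };  /* 24 */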
But that does not change the fact that these cases are the exception, generally speaking, rather than the rule. Even if your code always involves these cases, that simply isn't true for the majority of high-performance graphics applications out there.
In a closed mesh, every vertex is shared by at least two faces. (The only time a vertex will be used fewer than three times is in a double-sided mesh, where two faces have the same vertices but opposite normals and winding order.) Not using indices and just duplicating vertices is not only inefficient; at minimum, it doubles the amount of vertex data required.
There's also potential for cache thrashing that could otherwise be avoided, along with the related pipeline stalls and other insanity.
Indices are your friend. Get to know them.
Update
Typically, normals, etc. are stored in a normal map, or interpolated between vertices.
If you just want a faceted or "flat shaded" render, use the cross product of dFdx() and dFdy() (or, in HLSL, ddx() and ddy()) to generate the per-pixel normal in your fragment shader. Duplicating data is bad, and only necessary under very special and unusual circumstances. Nothing you've mentioned leads me to believe that this is necessary for your use case.
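A minimal sketch of that trick as a GLSL fragment shader stored in a C string (the varying name eyePos is hypothetical; any interpolated surface position works): dFdx() and dFdy() of the position give two vectors lying in the face, so their cross product is the flat per-face normal, and no normals need to be stored or duplicated in the vertex data.

    /* Hypothetical fragment shader source for flat shading. */
    static const char *flatShadedFS =
        "#version 330 core\n"
        "in vec3 eyePos;       /* interpolated eye-space position */\n"
        "out vec4 fragColor;\n"
        "void main() {\n"
        "    /* Screen-space derivatives span the face; their cross\n"
        "       product is the flat per-face normal. */\n"
        "    vec3 n = normalize(cross(dFdx(eyePos), dFdy(eyePos)));\n"
        "    fragColor = vec4(n * 0.5 + 0.5, 1.0); /* visualize it */\n"
        "}\n";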
In this tutorial for OpenGL ES, techniques for optimizing models are explained, and one of them is to use triangle strips to define your mesh, using "degenerate" triangles to end one strip and begin another without ending the primitive: http://www.learnopengles.com/tag/degenerate-triangles/
However, this guide is very specific to mobile platforms, and I wanted to know whether this technique holds up on modern desktop hardware. Specifically, would it hurt? Would it cause graphical artifacts or degrade performance (as opposed to splitting the strips into separate primitives)?
If it causes no artifacts and performs at least as well, I aim to use it solely because it makes organizing vertices in a certain mesh I want to draw easier.
Degenerate triangles work pretty well on all platforms. I'm aware of an old fixed-function console that struggled with degenerate triangles, but anything vaguely modern will be fine. Reducing the number of draw calls is always good and I would certainly use degenerates rather than multiple calls to glDrawArrays.
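As a concrete sketch (the index values are invented), two strips are joined in one call by repeating the last index of the first strip and the first index of the second; the four zero-area triangles this creates are rejected by the GPU. If the first strip has an odd number of indices, add one extra duplicate so the second strip keeps the correct winding parity.

    /* Strip A: 0 1 2 3    Strip B: 10 11 12 13 */
    GLushort indices[] = {
        0, 1, 2, 3,      /* strip A                       */
        3, 10,           /* degenerate bridge (zero area) */
        10, 11, 12, 13   /* strip B                       */
    };
    glDrawElements(GL_TRIANGLE_STRIP,
                   sizeof indices / sizeof indices[0],
                   GL_UNSIGNED_SHORT, indices);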
However, an alternative that usually performs better is an indexed draw of a triangle list. With a triangle list you have a lot of flexibility to reorder the triangles to take maximum advantage of the post-transform cache. The post-transform cache is a hardware cache of the last few vertices that went through the vertex shader; the GPU can spot that you've re-issued the same vertex and skip the entire vertex shader for that vertex.
In addition to the above answers (no, it shouldn't hurt at all unless you do something mad in terms of the ratio of real triangles to degenerates), note that newer versions of the OpenGL and OpenGL ES APIs (3.x or higher) support a way to insert breaks into index lists without needing an actual degenerate triangle, called primitive restart.
https://www.khronos.org/opengles/sdk/docs/man3/html/glEnable.xhtml
When it is enabled, you encode the maximum value for your index type (e.g. 0xFFFF for GLushort indices), and when the GPU detects that sentinel it restarts the primitive, building a new triangle strip from the next index value.
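A minimal sketch using the same invented indices as above: on desktop GL (3.1+) you pick the sentinel yourself with glPrimitiveRestartIndex, while OpenGL ES 3.0 uses glEnable(GL_PRIMITIVE_RESTART_FIXED_INDEX) with the sentinel fixed at the maximum index value.

    glEnable(GL_PRIMITIVE_RESTART);
    glPrimitiveRestartIndex(0xFFFF);   /* max value of GLushort */

    GLushort indices[] = {
        0, 1, 2, 3,      /* first strip                            */
        0xFFFF,          /* restart: next index starts a new strip */
        10, 11, 12, 13   /* second strip                           */
    };
    glDrawElements(GL_TRIANGLE_STRIP, 9, GL_UNSIGNED_SHORT, indices);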
It will not cause artifacts. As to "degrading performance"... relative to what? Relative to a random assortment of triangles with no indexing? Yes, it will be faster than that.
But there are plenty of other things one can do. For example, primitive restarting, which removes the need for degenerate triangles. Then there's using ordered lists of triangles for improved cache coherency. Will triangle strips be faster than that?
It rather depends on what you're rendering, how expensive your vertex shaders are, and various other things.
But at the end of the day, if you care about maximum performance on particular platforms, then you should profile for each platform and pick the vertex data based on what platform you're running on. If performance is really that important to you, then you're going to have to put forth some effort.
I know in DirectX 11 you can use the awesome tessellation feature for LODs, but knowing that DirectX 9 doesn't have this feature, how would I go about creating LODs for the models in my 3D application/game to speed it up?
I heard that back in the old days, before DirectX 10 or 11 came out, people used to create several versions of the same model, just with different polycounts (i.e. one with a very low polycount for far-away objects and one with a high polycount for very near objects).
But doing this would mean doubling or even tripling the size of the models in the game, right? Are there any other approaches to achieving LODs in DirectX 9? Or is this really the best solution when it comes to DirectX 9? Can someone at least point me in the right direction, so I can go away and do more research about it?
Thanks
Generating multiple LOD meshes using mesh simplification algorithms (or by hand) might not be as bad as you think in terms of memory consumption. As with mipmaps, since your simplified meshes have far fewer vertices, they shouldn't triple the size of your in-game models. And you don't have to keep the high-resolution meshes in video memory if you're not going to use them for a while.
An alternative that saves memory is to simplify meshes by discarding vertices only. This way, you can use a single vertex buffer and have a different index buffer for each LOD. You might get slightly lower-quality LOD meshes, but the memory overhead of keeping them all in memory will be much smaller.
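A sketch of that layout, in OpenGL terms since the idea is API-agnostic (in D3D9 you would keep one vertex buffer bound via SetStreamSource and swap index buffers with SetIndices before calling DrawIndexedPrimitive); the buffer names and the pickLOD() helper are hypothetical:

    /* One shared vertex buffer; one index buffer per LOD level. */
    glBindBuffer(GL_ARRAY_BUFFER, sharedVertexBuffer);
    /* ... vertex attribute pointers are set up once here ... */

    int lod = pickLOD(distanceToCamera);          /* 0 = full detail */
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, lodIndexBuffer[lod]);
    glDrawElements(GL_TRIANGLES, lodIndexCount[lod], GL_UNSIGNED_INT, 0);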
If I'm not mistaken, tessellation is for subdivision, so it wouldn't help you anyway if you want a coarser mesh (though it can probably help interpolate between LODs).
I am quickly finding that one of the organisational considerations you must make when preparing rendering in OpenGL is the type of topology and the arrangement of vertices.
Now there are some interesting methods out there for organising vertices into very long arrays, with nice uses of interleaved arrays, indexes, etc, so that you can pour a lot of geometry into one OpenGL call.
But it's much easier in some cases to simply iterate and perform multiple calls with smaller vertex arrays.
While I agree with the notion that premature optimization is somewhat wasteful, just how important a consideration should it be to minimize OpenGL calls, especially if multiple calls would actually involve far fewer vertices per call?
I can see that this is one of those decisions that is important early in the development process, since it shapes a lot of the structure of how vertices get created and organized.
There is an overhead for each command you send to the GPU. Batching the vertices minimizes that overhead and also allows the driver to make small optimizations to your data before sending it to the hardware. It can make quite a difference, and it is the reason glBegin and glEnd were completely removed from newer iterations of OpenGL.
You should try to avoid making many driver state changes and many drawing calls.
EDIT: Consider using degenerate triangles in your triangle strips (this also helps minimize the number of vertices processed) so that you can use just one drawing call and render all your geometry (unless you need to change some driver state between parts of it).
You can find a balance for your specific needs, but there are many variables in the equation, and there's no simple solution (like "always render the scene as one big batch!"). TraxNet gave you good advice, though: always try to minimize API calls (whether draw calls or state changes). But it doesn't have to be just a few calls: on a modern PC it could be thousands per frame; on a not-so-modern mobile phone, maybe just a few hundred.
Also, TraxNet mentioned degenerate triangles (which help form strips). Though they're still triangles (and so add to the 'total' triangle count rendered), they cost almost nothing while helping to minimize the number of draw calls.
I am experimenting with several ways to draw a lot of sprites (e.g. for a particle system) and I have some inconclusive results. This is what I tried and what I got:
This is done drawing 25k sprites:
Using regular glBegin/glEnd and using trig to calculate vertex points - 17-18fps.
Using regular glBegin/glEnd, but using glRotate, glTranslate and glScale to transform the sprite - 14-15fps.
Using vertex arrays instead of glBegin and glEnd, but still using trig to calculate vertex point position - 10-11fps.
Using vertex arrays instead of glBegin and glEnd, but using glRotate, glTranslate and glScale to transform the sprite - 10-11fps.
So my question is: why is using vertex arrays slower than using glBegin/glEnd, when I have read (here, even) that it should be faster?
And why is using my own trigonometry (which in my case is 5 cos, 5 sin, more than 5 divisions, 15 multiplications and about 10 additions/subtractions) faster than using 5 functions (glPushMatrix(), glTranslated(), glRotated(), glScaled(), glPopMatrix())? I thought these were done on the GPU, so it should be much, much faster.
I do get more promising results when drawing fewer sprites: when I draw 10k sprites, vertex arrays can be about 5 fps faster, but still inconsistently. Also note that the overall fps could be higher, because I have other calculations going on, so I am not really looking at the fps itself but at the difference between the approaches. If vertex arrays plus GL transforms were 5-10 fps faster than glBegin/glEnd with manual trig, I would be happy, but for now it just doesn't seem worth the hassle. They would help with porting to GLES (as it doesn't have glBegin/glEnd), but I guess I will make a separate implementation for that.
So is there any way to speed this up without using geometry shaders? I don't really understand them (maybe there's a great tutorial somewhere?), and they could break compatibility with older hardware, so I want to squeeze out all the juice I can without using shaders.
So my question is: why is using vertex arrays slower than using glBegin/glEnd, when I have read (here, even) that it should be faster?
Who says that they are slower?
All you can say is that, for your particular hardware, for your current driver, glBegin/glEnd are slower. Have you verified this on other hardware?
More importantly, there is the question of how you are drawing these. Do you draw a single sprite from the vertex array, then draw another, then draw another? Or do you draw all of them with a single glDrawArrays or glDrawElements call?
If you're not drawing all of them in one go (or at least large groups of them at once), then you're not going as fast as you should be.
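For instance, a minimal sketch of the one-call version (MAX_SPRITES and emitSpriteCorners() are invented names; the latter stands in for your own trig writing the six corner vertices of one sprite): fill a single client-side array on the CPU, then draw everything at once.

    enum { MAX_SPRITES = 25000 };
    /* 6 vertices (two triangles) * (x, y) per sprite; static so this
       large scratch buffer stays off the stack. */
    static GLfloat verts[MAX_SPRITES * 6 * 2];

    int floats = 0;
    for (int i = 0; i < spriteCount; ++i)
        floats += emitSpriteCorners(&verts[floats], &sprites[i]);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, verts);
    glDrawArrays(GL_TRIANGLES, 0, floats / 2); /* one call, all sprites */
    glDisableClientState(GL_VERTEX_ARRAY);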
And why is using my own trigonometry (which in my case is 5 cos, 5 sin, more than 5 divisions, 15 multiplications and about 10 additions/subtractions) faster than using 5 functions (glPushMatrix(), glTranslated(), glRotated(), glScaled(), glPopMatrix())? I thought these were done on the GPU, so it should be much, much faster.
Well, let's think about this. glPushMatrix costs nothing. glTranslated creates a double-precision floating-point matrix and then does a matrix multiply. glRotated does at least one sin and one cos, does some additions and subtractions to compute a matrix (all in double precision), and then does a matrix multiply. glScaled computes a matrix and does a matrix multiply.
Each "does a matrix multiply" consists of 16 floating-point multiplies and 12 floating-point adds. And since you asked for double-precision math, you can forget about SSE vector math or whatever; this is doing standard math. And you're doing 3 of these for every sprite.
What happens on the GPU is the multiplication of that matrix with the vertex positions. And since you're only passing 4 positions before changing the matrix, it's not particularly surprising that this is slower.
Have you considered using point sprites (GL_POINTS) instead? This is kind of what they were designed for, depending on which version of OpenGL you are supporting.
Have you tried VBOs instead? They're the current standard, so most cards are optimized in their favor.
Also:
you should use your own math calculations
consider offloading as much calculation as possible to a shader
The fps numbers you posted are contrary to what one might expect; you are probably doing something wrong. Can you paste some of your rendering code?
Do you have a specific reason to use double precision matrix functions? They are usually a lot slower than single precision ones.