TriangleList versus TriangleStrip (OpenGL)

Between a TriangleStrip and a TriangleList, which one performs faster?
Something interesting I've just read says: "My method using triangle list got about 780fps, and the one with triangle strip only got 70fps". I don't have details as to what exactly he is making, but according to this he's getting about 10 times the frame rate using a TriangleList. I find this counter-intuitive because the list contains more vertex data.
Does anyone know a technical reason why the TriangleList might be so much faster than a Strip?

Triangle strips are a memory optimization, not a speed optimization. At some point in the past, when bus bandwidth between system memory and video memory was the main bottleneck in data-intensive applications, strips did save time as well, but that is very rarely the case anymore. Also, the transform cache was very small on old hardware, so an ordinary strip would cache better than a badly optimized indexed list.
The reason a triangle list can be equally or more efficient than a triangle strip is indices. Indices let the hardware transform and cache vertices in a very predictable fashion, provided you optimize your geometry and triangle order correctly. Also, in a very complex mesh requiring a lot of degenerate triangles, strips will be both slower and take more memory than an indexed list.
I must say I'm a little surprised that your example shows an order of magnitude difference though.

A triangle list can be much faster than a strip because it saves draw calls by batching the vertex data together easily. Draw calls are expensive so the memory you save by using a strip is sometimes not worth the decreased performance.

Indexed triangle lists will generally win.
Here's a simple rule: count the number of vertices you will be uploading to the graphics card. If the triangle list (an indexed triangle list, to be precise) has fewer vertices than the same data as a triangle strip, then it will likely run faster.
If the vertex counts are very close in both cases, then the strip may run faster because it doesn't have the overhead of the index list, but I expect that is also driver-specific.
Non-indexed triangle lists are almost always the worst case (3 verts per triangle, no sharing), unless you are just dealing with disjoint quads, which also cost 6 verts per quad with degenerate stripping. In that case an indexed triangle list gets you each quad for 4 verts, so it probably wins again, but you'd want to test on your target hardware.
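To make the vertex-counting rule concrete, here is a minimal sketch for a single quad (buffer creation and attribute setup are assumed done elsewhere; all names are illustrative):

// Four shared vertices (x, y), uploaded to a VBO elsewhere -- this is the
// count the rule above compares.
const float quadVerts[] = {
    -1.0f, -1.0f,  // 0: bottom-left
     1.0f, -1.0f,  // 1: bottom-right
    -1.0f,  1.0f,  // 2: top-left
     1.0f,  1.0f,  // 3: top-right
};

// Indexed triangle list: 4 vertices plus 6 small indices for 2 triangles.
const unsigned short quadIndices[] = { 0, 1, 2,   2, 1, 3 };
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, nullptr);

// Triangle strip: also 4 vertices for one quad, with no index buffer at all.
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

// Stitching two *disjoint* quads into one strip, however, needs a degenerate
// pair (0,1,2,3, 3,4, 4,5,6,7 = 10 entries), while the indexed list stays at
// 8 unique vertices plus 12 indices -- which is why the list usually pulls
// ahead as meshes get more complex.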

Related

Is there a faster alternative to geometry shaders that can render points as a specific number of triangles?

I'm currently using OpenGL with a geometry shader to take points and convert them to triangles during rendering.
I have n lists of points, where each point in the first list becomes one triangle, each point in the second becomes two triangles, and so on. I've tried swapping in a separate geometry shader for each of these lists, with max_vertices set to the minimum needed for that list. With OpenGL I seemingly have no control over how the geometry shader is ultimately implemented on the GPU, and some drivers handle it very slowly while others are very fast.
Is there any way to perform this specific task optimally, ideally taking advantage of the fact that I know the exact number of desired output triangles per element and in total? I would be happy to use some alternative to geometry shaders for this if possible. I would also be happy to try Vulkan if it can do the trick.
What you want is arbitrary amplification of geometry: taking one point primitive and producing arbitrarily many entirely separate primitives from it. And the tool GPUs have for that is geometry shaders (or just using a compute shader to generate your vertex data manually, but that's probably not faster and definitely consumes more memory).
While GS's are not known for performance, there is one way you might be able to speed up what you're doing. Since all of the primitives in a particular call will generate a specific number of primitives, you can eschew having each GS output more than one primitive by employing vertex instanced rendering.
Here, you use glDrawArraysInstanced. Your VS needs to pass gl_InstanceID to the GS, which can use it to figure out which triangle to generate from the vertex. That is, instead of having a loop over n to generate n triangles, the GS only generates one triangle; but it gets called instanceCount times, and each invocation generates the gl_InstanceID-th triangle.
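A hedged GLSL sketch of that setup (the triangle placement math is just a placeholder; only the instancing plumbing matters):

// Draw with: glDrawArraysInstanced(GL_POINTS, 0, pointCount, trianglesPerPoint);

// ---- vertex shader ----
#version 330 core
layout(location = 0) in vec3 inPos;
flat out int vInstance;
void main() {
    gl_Position = vec4(inPos, 1.0);
    vInstance = gl_InstanceID; // which of this point's triangles to emit
}

// ---- geometry shader: emits exactly one triangle per invocation ----
#version 330 core
layout(points) in;
layout(triangle_strip, max_vertices = 3) out;
flat in int vInstance[];
void main() {
    float k = float(vInstance[0]);
    // Placeholder geometry: a small triangle offset by the instance index.
    vec4 base = gl_in[0].gl_Position + vec4(0.1 * k, 0.0, 0.0, 0.0);
    gl_Position = base;                             EmitVertex();
    gl_Position = base + vec4(0.05, 0.0, 0.0, 0.0); EmitVertex();
    gl_Position = base + vec4(0.0, 0.05, 0.0, 0.0); EmitVertex();
    EndPrimitive();
}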
Now, one downside of this is that the order of the generated triangles will be different. In your original GS code, where each GS invocation generates all of the triangles from a point, all of the triangles from one point are rendered before any triangles from another point. With vertex instancing, you get one triangle from all of the points, then another triangle from all of the points, and so on. If rendering order matters to you, then this won't work.
If that's important, then you can try geometry shader instancing instead. This works similarly to vertex instancing, except that the instance count is part of the GS itself. Each GS invocation is only responsible for a single triangle, and you use gl_InvocationID to decide which triangle it generates. This will ensure that all primitives from one set of GS instances are rendered before any primitives from a different set of GS instances.
The downside is what I said: the instance count is part of the GS. Unlike instanced rendering, the number of instances is baked into the GS code itself. So you will need a separate program for every count of triangles you work with. SPIR-V specialization constants make it a bit easier on you to build those programs, but you still need to maintain (and swap) multiple programs.
Also, while instanced rendering has no limit on the number of instances, GS's do have a limit. And that limit can be as small as 32 (which is a very popular number).
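For comparison, a hedged sketch of the instanced-GS variant (requires GLSL 4.00 or ARB_gpu_shader5; the per-point triangle count, 8 here, is an arbitrary example and is baked into the shader):

// Draw with a plain glDrawArrays(GL_POINTS, 0, pointCount).
#version 400 core
layout(points, invocations = 8) in;  // the baked-in instance count
layout(triangle_strip, max_vertices = 3) out;
void main() {
    float k = float(gl_InvocationID); // which triangle this GS instance emits
    // Same placeholder geometry as in the previous sketch.
    vec4 base = gl_in[0].gl_Position + vec4(0.1 * k, 0.0, 0.0, 0.0);
    gl_Position = base;                             EmitVertex();
    gl_Position = base + vec4(0.05, 0.0, 0.0, 0.0); EmitVertex();
    gl_Position = base + vec4(0.0, 0.05, 0.0, 0.0); EmitVertex();
    EndPrimitive();
}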

How to efficiently align vertices in a VBO for indexed rendering?

I have a VBO of 1,050,625 vertices representing a height map. I draw the mesh with GL_TRIANGLE_STRIP in frustum-culled chunks of 32*32 cells, using indexed rendering.
Should I care about how my vertices are laid out in the VBO in terms of performance? I mean, is there any information on how the distance between referenced elements affects performance, e.g. indices like [100,101,102] versus [10,1017,2078]?
The distance between indices determines which memory positions are read from, and the cost comes down to caching: if a position is not in the cache, it must be fetched from main memory.
At least theoretically. In practice it depends on the hardware and driver implementation; cache size and bus speed both have an influence.
As a starting point, anything below a few MB in size should be quick.
Anyhow, when performance matters, the only true way of knowing is to benchmark the different options, on different hardware if possible.
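To make "keep referenced vertices close together" concrete, here is a hypothetical index generator (all names are made up) for one 32*32-cell chunk of a row-major grid; consecutive strip indices never reference vertices more than one grid row apart:

#include <cstdint>
#include <vector>

std::vector<uint32_t> buildChunkStrip(uint32_t chunkX, uint32_t chunkY,
                                      uint32_t gridWidth) { // vertices per grid row
    const uint32_t cells = 32;
    std::vector<uint32_t> indices;
    for (uint32_t row = 0; row < cells; ++row) {
        uint32_t top    = (chunkY * cells + row) * gridWidth + chunkX * cells;
        uint32_t bottom = top + gridWidth;
        for (uint32_t col = 0; col <= cells; ++col) {
            indices.push_back(top + col);    // each pair is exactly gridWidth apart,
            indices.push_back(bottom + col); // never a random jump across the VBO
        }
        if (row + 1 < cells) {               // degenerate pair to stitch rows
            indices.push_back(bottom + cells);
            indices.push_back(bottom);       // == next row's first index
        }
    }
    return indices;
}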

Is it possible to construct vertices on the GPU from a non-XYZ vertex buffer?

I'm writing a particle simulation where the logic is updated using Intel AVX. I'm using an SoA approach to maximize my "SIMD-friendliness", but I shuffle the particle position components into XYZ format when updating the vertex buffer.
Is it possible to skip the shuffle step, simply pass the vertex data in XXYYZZ format, and construct each vertex in a shader stage?
My first thought was to use three vertex buffers with the x, y and z components separated, and to construct each vertex by using the same subscript index to access the x, y and z components.
I'm aware that this is not the conventional way, but I would like to emphasize that this is just an experiment. Does anyone have knowledge about this approach (if it is even possible), and/or could you point me in the right direction? Perhaps there is a name for it as well?
Thank you!
There is no restriction on how you feed the GPU your vertices. You can customize the input layout to read values from any number of vertex buffers; in your example, you would have at least three attributes. In the vertex shader, you receive the three attributes as three scalars and swizzle them back together. The only real limitation is that the values must sit at the same index in each buffer.
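A hedged sketch of that layout (buffer and attribute names are illustrative):

GLuint xBuf, yBuf, zBuf; // assumed already filled with the X, Y and Z arrays

// One scalar float attribute per buffer, all advancing at the same index.
glBindBuffer(GL_ARRAY_BUFFER, xBuf);
glVertexAttribPointer(0, 1, GL_FLOAT, GL_FALSE, 0, nullptr);
glEnableVertexAttribArray(0);
glBindBuffer(GL_ARRAY_BUFFER, yBuf);
glVertexAttribPointer(1, 1, GL_FLOAT, GL_FALSE, 0, nullptr);
glEnableVertexAttribArray(1);
glBindBuffer(GL_ARRAY_BUFFER, zBuf);
glVertexAttribPointer(2, 1, GL_FLOAT, GL_FALSE, 0, nullptr);
glEnableVertexAttribArray(2);

// Matching vertex shader: receive three scalars, swizzle them back together.
const char* vs = R"(
    #version 330 core
    layout(location = 0) in float inX;
    layout(location = 1) in float inY;
    layout(location = 2) in float inZ;
    void main() { gl_Position = vec4(inX, inY, inZ, 1.0); }
)";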
In regards to performance: unless you are chasing the last 1% of GPU performance, you will see no difference compared to a well-interleaved vertex layout. This mostly influences bandwidth and L2 cache misses, so unless you have crazy millions of particles, it is unlikely to matter. And if you do, you can use a compute shader to interleave the data as a pre-process.

Are sparse AABB trees made with pointers?

I'm using an octree of axis-aligned bounding boxes to segment the space in my scene where I do a physics simulation. The problem is, the scene is very large (space) and I need to detect collisions between large objects at large distances as well as small objects at close distances. The thing is, there are only a few objects in the scene, but they are kilometers apart, which means a lot of empty space. So basically I'm wasting 2 gigs of RAM storing bounding boxes for empty sectors. I'd like to allocate memory only for the sectors that actually contain something (i.e. have them be pointers to AABBs), but that would mean thousands of allocations each frame to re-create the octree. If I use a pool to counter the slowdown from allocations, it would still mean allocating 2 gigs of RAM for my application. Is there any other way to achieve this?
Look into loose octrees (for dealing with many objects), or a more adaptive system such as AABB trees built around each object rather than one for the entire space. You can perform coarse distance/collision tests using the overall AABB (the root) and get finer collisions using the tree under each object (and eventually a ray-triangle intersection test, if you need that fine a resolution). The only disadvantage of AABB trees is that if the object rotates you need to rebuild the tree (though you can adaptively scale and translate the AABB tree).
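A minimal sketch of that per-object scheme, assuming internal nodes always have exactly two children (all type and function names here are illustrative):

#include <memory>

struct AABB {
    float min[3], max[3];
    bool overlaps(const AABB& o) const {
        for (int i = 0; i < 3; ++i)
            if (max[i] < o.min[i] || o.max[i] < min[i]) return false;
        return true;
    }
};

struct AABBTreeNode {
    AABB box;
    std::unique_ptr<AABBTreeNode> left, right; // both null on leaves
    // Leaves would reference triangles for the final precise test.
};

// Broad phase: with only a few objects, testing root boxes pairwise is cheap,
// and no memory is spent on the kilometers of empty space between them.
bool mayCollide(const AABBTreeNode& a, const AABBTreeNode& b) {
    if (!a.box.overlaps(b.box)) return false;
    if (!a.left && !b.left) return true; // two leaves: hand off to narrow phase
    if (a.left) return mayCollide(*a.left, b) || mayCollide(*a.right, b);
    return mayCollide(a, *b.left) || mayCollide(a, *b.right);
}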

How does interleaved vertex submission help performance?

I have read and seen other questions that all generally point to the suggestion to interleave vertex positions, colors, etc. into one array, as this minimizes the data that gets sent from the CPU to the GPU.
What I'm not clear on is how OpenGL does this when, even with an interleaved array, you must still make separate GL calls for the position and color pointers. If both pointers use the same array, just set to start at different points within it, does the draw call not copy the array twice, since it was the target of two different pointers?
This is mostly about cache. For example, imagine we have 4 vertices and 4 colors. You can provide the information this way (excuse me, but I don't remember the exact function names):
glVertexPointer(..., vertex);
glColorPointer(..., colors);
What it internally does is read vertex[0], then apply colors[0], then again vertex[1] with colors[1]. As you can see, if the vertex array is, for example, 20 megs long, vertex[0] and colors[0] will be, to say the least, 20 megabytes apart from each other.
Now, on the other hand, if you provide a structure like { vertex0, color0, vertex1, color1, etc.} there will be a lot of cache hits because, well, vertex0 and color0 are together, and so are vertex1 and color1.
Hope this helps answer the question
edit: on second read, I may not have answered the question. You might be wondering how OpenGL knows which values to read from that structure. Like I said before, with a structure such as { vertex, color, vertex, color } you tell OpenGL that vertex data starts at position 0 with a stride of 2 (so the next one will be at position 2, then 4, etc.), and that color data starts at position 1, also with a stride of 2 (so position 1, then 3, etc.).
addition: In case you want a more practical example, look at this link http://www.lwjgl.org/wiki/index.php?title=Using_Vertex_Buffer_Objects_(VBO). You can see there how it only provides the buffer once and then uses offsets to render efficiently.
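And a hedged sketch of the same idea in modern OpenGL terms (the Vertex struct and attribute locations are illustrative; vbo is assumed created and filled elsewhere):

#include <cstddef> // offsetof

struct Vertex { float pos[3]; float color[3]; };

// Both attributes point into the same buffer; nothing is copied twice -- the
// stride/offset pair just tells the GPU where each attribute lives.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, pos));   // starts at byte 0
glEnableVertexAttribArray(0);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, color)); // starts 12 bytes in
glEnableVertexAttribArray(1);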
I suggest reading: Vertex_Specification_Best_Practices
h4lc0n provided quite a nice explanation, but I would like to add some additional info:
Interleaved data can actually hurt performance when your data changes often. For instance, when you move point sprites you update POS, but COLOR and TEXCOORD usually stay the same; with interleaved data you must "touch" that additional data on every upload. In that case it would be better to have one VBO for POS only (or, in general, for data that changes often) and a second VBO for the data that is constant (see the sketch below).
It is not easy to give strict rules about VBO layout, since it is very vendor/driver specific, and your usage can differ from others'. In general you need to run some benchmarks for your particular test cases.
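A hedged sketch of that split (names and counts are illustrative):

// Dynamic positions in their own VBO, re-uploaded every frame.
glBindBuffer(GL_ARRAY_BUFFER, posVbo);
glBufferData(GL_ARRAY_BUFFER, count * 3 * sizeof(float), positions,
             GL_DYNAMIC_DRAW);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
glEnableVertexAttribArray(0);

// Constant color + texcoord interleaved in a second VBO, uploaded once.
glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
glBufferData(GL_ARRAY_BUFFER, count * 5 * sizeof(float), colorsAndUVs,
             GL_STATIC_DRAW);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 5 * sizeof(float), (void*)0);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 5 * sizeof(float),
                      (void*)(3 * sizeof(float)));
glEnableVertexAttribArray(1);
glEnableVertexAttribArray(2);

// Per frame: only the position data is touched.
glBindBuffer(GL_ARRAY_BUFFER, posVbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, count * 3 * sizeof(float), positions);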
You could also make an argument for separating different attributes. Assuming a GPU does not process one vertex after another but rather a bunch of them (e.g. 16) in parallel, you would get something like this while executing a vertex shader:
read attribute A for all 16 vertices
perform some computations
read attribute B for all 16 vertices
perform some more computations
....
So you read one attribute for many vertices at once. From this reasoning it would seem that interleaving the attributes actually hurts performance. Of course, this would only be visible if you are either bandwidth constrained or if the memory latency cannot be hidden for some reason (e.g. a complex shader that requires many registers will reduce the number of vertices that can be in flight at a given time).