Cache Friendly Vertex Definition - c++

I am writing an OpenGL application, and for vertices, normals, and colors I am using separate buffers as follows:
GLuint vertex_buffer, normal_buffer, color_buffer;
My supervisor tells me that if I define a struct like:
struct vertex {
glm::vec3 pos;
glm::vec3 normal;
glm::vec3 color;
};
GLuint vertex_buffer;
and then define a buffer of these vertices, my application will get much faster, because when the position is cached the normal and color will be in the same cache line.
I think that defining such a struct will not have much effect on performance: with the struct, fewer vertices fit in a cache line, while with separate buffers there will be three different cache lines in the cache, one each for positions, normals and colors. So nothing has changed. Is that true?

First of all, using separate buffers for different vertex attributes may not be a good technique.
A very important factor here is the GPU architecture. Most (especially modern) GPUs have multiple caches (for Input Assembler data, uniforms, textures), but fetching input attributes from multiple VBOs can be inefficient anyway (always profile!). Defining them in an interleaved format can help improve performance.
And that's what you would get if you used such a struct.
However, that's not always true (again, always profile!): although interleaved data is more GPU-friendly, it needs to be properly aligned and can take significantly more space in memory.
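To make the interleaved layout concrete, here is a minimal sketch of the struct from the question together with the stride and byte offsets one would hand to the attribute setup calls. `Vec3` is a stand-in for `glm::vec3` so the snippet compiles without GLM, and the 64-byte cache line is purely an assumption for illustration; real GPUs differ.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Stand-in for glm::vec3 so the sketch compiles without GLM (assumption).
struct Vec3 { float x, y, z; };

// Interleaved layout from the question: one struct per vertex.
struct Vertex {
    Vec3 pos;
    Vec3 normal;
    Vec3 color;
};

// Stride and byte offsets of each attribute inside one interleaved vertex.
constexpr std::size_t kStride       = sizeof(Vertex);
constexpr std::size_t kPosOffset    = offsetof(Vertex, pos);
constexpr std::size_t kNormalOffset = offsetof(Vertex, normal);
constexpr std::size_t kColorOffset  = offsetof(Vertex, color);

// Hypothetical cache line size, used only to reason about the layout.
constexpr std::size_t kAssumedLineSize = 64;

// Three tightly packed vec3s of 4-byte floats: 36 bytes on typical platforms.
static_assert(kStride == 36, "unexpected padding in Vertex");
// 36 does not divide 64, so some vertices would span two assumed lines.
static_assert(kAssumedLineSize % kStride != 0, "stride is not line-aligned");
```

Both layouts move the same 36 bytes per vertex in total; what changes is whether those bytes sit next to each other or in three distant places, which is why measuring on the target hardware matters.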
But, in general:
Interleaved data formats:
Cause less GPU cache pressure, because the vertex coordinate and attributes of a single vertex aren't scattered all over memory. They fit consecutively into a few cache lines, whereas scattered attributes could cause more cache updates and therefore evictions. The worst-case scenario could be one (attribute) element per cache line at a time because of distant memory locations, while vertices get pulled in a non-deterministic/non-contiguous manner, where possibly no prediction and prefetching kicks in. GPUs are very similar to CPUs in this matter.
Are also very useful for various external formats, which satisfy the deprecated interleaved formats, where datasets of compatible data sources can be read straight into mapped GPU memory. I ended up re-implementing these interleaved formats with the current API for exactly those reasons.
Should be laid out in an alignment-friendly way, just like simple arrays. Mixing various data types with different size/alignment requirements may need padding to be GPU and CPU friendly. This is the only downside I know of, apart from the more difficult implementation.
Do not prevent you from pointing to single attribute arrays in them for sharing.
Source
Further reads:
Best Practices for Working with Vertex Data
Vertex Specification Best Practices

Depends on the GPU architecture.
Most GPUs will have multiple caches (some for uniforms, others for vertex attributes, others for texture sampling).
Also, when the vertex shader is nearly done, the GPU can prefetch the next set of attributes into the cache, so that by the time the vertex shader is done the next attributes are right there, ready to be loaded into the registers.
tl;dr: don't bother with these "rules of thumb" unless you actually profile or know the actual architecture of the GPU.

Tell your supervisor "premature optimization is the root of all evil" (Donald E. Knuth). But don't forget how the quote continues: we should not pass up our opportunities in the critical 3%, i.e. that doesn't mean we shouldn't optimize hot spots.
So did you actually profile the differences?
Anyway, the layout of your vertex data is not critical for caching efficiency on modern GPUs. It used to be on old GPUs (ca. 2000), which is why there were functions for interleaving vertex data. But these days it's pretty much a non-issue.
That has to do with the way modern GPUs access memory; in fact, modern GPUs' cache lines are not indexed by memory address but by access pattern (i.e. the first distinct memory access in a shader gets the first cache line, the second one the second cache line, and so on).

Related

Why does OpenGL not support multiple index buffering?

Why does OpenGL not support multiple index buffers for vertex attributes (yet)?
To me it seems very useful, since you could reuse attributes and you would have a lot more control over the rendering of your geometry.
Is there a reason why all attribute arrays have to take the same index or could this feature be available in the near future?
OpenGL (and D3D. And Metal. And Mantle. And Vulkan) doesn't support this because hardware doesn't support this. Hardware doesn't support this because, for the vast majority of mesh data, this would not help. This is primarily useful for meshes that are predominantly not smooth (vertices sharing positions but not normals and so forth). And most meshes are smooth.
Furthermore, it will frequently be a memory-vs-performance tradeoff. Accessing your vertex data will likely be slower. The GPU has to fetch from two distinct locations in memory, compared to the case of a single interleaved fetch. And while caching helps, the cache coherency of multi-indexed accesses is much harder to control than for single-indexed accesses.
Hardware is unlikely to support this for that reason. But it also is unlikely to support it because you can do it yourself. Whether through buffer textures, image load/store or SSBOs, you can get your vertex data however you want nowadays. And since you can, there's really no reason for hardware makers to develop special hardware to help you.
Also, there are questions as to whether you'd really be making your vertex data smaller at all. In multi-indexed rendering, each vertex is defined by a set of indices. Well, each index takes up space. If you have more than 64K of attributes in a model (hardly an unreasonable number in many cases), then you'll need 4 bytes per index.
A normal can be provided in 4 bytes, using GL_INT_2_10_10_10_REV and normalization. A 2D texture coordinate can be stored in 4 bytes too, as a pair of shorts. Colors can be stored in 4 bytes. So unless multiple attributes share the same index (normals and texture coordinate edges happen at the same place, as might happen on a cube), you will actually make your data bigger by doing this in many cases.
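As an illustration of the 4-byte normal mentioned above, here is a minimal sketch of packing three floats into one 32-bit word in the 2_10_10_10 layout. The function names are hypothetical helpers, and the bit placement (x in the lowest 10 bits, then y, then z, with the 2-bit w field left at zero) reflects my understanding of the `_REV` packed formats; check it against the specification before relying on it.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: convert a float in [-1, 1] to a signed, normalized
// 10-bit integer stored in the low 10 bits (two's complement).
inline std::uint32_t to_snorm10(float v) {
    if (v < -1.0f) v = -1.0f;
    if (v >  1.0f) v =  1.0f;
    const int q = static_cast<int>(std::lround(v * 511.0f));
    return static_cast<std::uint32_t>(q) & 0x3FFu;
}

// Hypothetical helper: pack x, y, z into bits 0-9, 10-19 and 20-29.
// The top two bits (w) are left at zero in this sketch.
inline std::uint32_t pack_normal_2_10_10_10(float x, float y, float z) {
    return to_snorm10(x) | (to_snorm10(y) << 10) | (to_snorm10(z) << 20);
}
```

With this packing a normal costs the same 4 bytes as a single 32-bit index, which is the point of the size comparison above.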

Comparing the multiDrawArrays, using primitive restart and multiDrawElements in terms of performance?

I want to draw a mass of branches with different shapes, each of which consisting of 4 triangle strips. (Using OpenGL)
So now I'm considering using one of those method calls (multiDrawArrays, using primitive restart and multiDrawElements).
I was wondering which one is more efficient. Is the method multiDrawArrays() equivalent to several drawArrays() in terms of speed?
Does VAO store the vertex info in the RAM while the SSBO store those in the VRAM? If so, is it better to use SSBO rather than VAO considering the performance?
As @derhass already pointed out in a comment, some of your terminology is mixed up. A VAO (Vertex Array Object) contains state that defines how vertex data is associated with vertex attributes. It's the VBO (Vertex Buffer Object) that contains the actual vertex data.
I doubt that using a SSBO for vertex data would be beneficial. In general, the buffer types primarily define how the data is used. The graphics pipeline is tailored towards fetching vertex data from VBOs, and many GPUs have dedicated fixed function hardware to pull data from VBOs and feed it into the vertex shader. I can't see how using explicit code in the vertex shader to pull the vertex data from a SSBO instead would be more efficient.
Whether the data is stored in VRAM or system RAM is a different consideration. The only control you have over that is with the last argument to glBufferData(). It provides a hint on how you plan to use the data. For example, if you specify GL_STATIC_DRAW, you're telling the driver that you're not planning to modify the data, which suggests that placing it in VRAM might be a good idea. Whether it will actually be in VRAM is then up to the driver, and it may decide that based on various criteria.
Functionally, glMultiDrawArrays() is equivalent to multiple calls to glDrawArrays(). But it can certainly be more efficient. If nothing else, it saves the overhead of making multiple API calls. Each API call has a certain amount of overhead, for example:
It might pass through a couple of software layers, resulting in additional function calls under the hood.
It needs to get the current context from thread local storage.
It needs to do error checking.
It may need some form of locking to deal with access from multiple contexts in multiple threads (might not be needed for a draw call).
It needs to check for pending state changes.
The MultiDraw calls were introduced to cut down on the number of API calls needed.
Now, whether glMultiDrawArrays() or glDrawElements() with primitive restart is more efficient, that's impossible to say in general. If you're not already using an index buffer, I would be a bit hesitant to introduce one just so that you can use primitive restart. So my instinct would be:
Use glDrawElements() with primitive restart if you're using an index buffer anyway.
Use glMultiDrawArrays() if you're not using an index buffer.
The real answer, as always, can only be obtained by benchmarking. And it can of course be platform/hardware dependent. My prediction is that you will not see a significant difference in most cases, since both of these allow you to draw a lot of your geometry with a single API call, which should avoid bottlenecks in this area.
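To show what the two options need on the CPU side, here is a rough sketch that prepares the inputs for both from a list of strip lengths. It assumes the strips are stored back to back in the vertex buffer; the struct and function names are made up for illustration, and plain fixed-width integers stand in for `GLint`/`GLsizei`/`GLuint` so the snippet builds without GL headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-strip arguments for a multi-draw call (stand-ins for GLint / GLsizei).
struct MultiDrawArgs {
    std::vector<std::int32_t> first;
    std::vector<std::int32_t> count;
};

// One (first, count) pair per strip, assuming strips are stored back to back.
inline MultiDrawArgs make_multi_draw_args(const std::vector<std::int32_t>& strip_lengths) {
    MultiDrawArgs args;
    std::int32_t offset = 0;
    for (std::int32_t len : strip_lengths) {
        args.first.push_back(offset);
        args.count.push_back(len);
        offset += len;
    }
    return args;
}

// Index value reserved as the restart marker in this sketch.
constexpr std::uint32_t kRestartIndex = 0xFFFFFFFFu;

// One index buffer for all strips, with a restart marker between strips.
inline std::vector<std::uint32_t> make_restart_indices(const std::vector<std::int32_t>& strip_lengths) {
    std::vector<std::uint32_t> indices;
    std::uint32_t next = 0;
    for (std::size_t s = 0; s < strip_lengths.size(); ++s) {
        if (s != 0) indices.push_back(kRestartIndex);
        for (std::int32_t i = 0; i < strip_lengths[s]; ++i) indices.push_back(next++);
    }
    return indices;
}
```

The first result would feed `glMultiDrawArrays(GL_TRIANGLE_STRIP, ...)`; the second would be uploaded as an index buffer and drawn with a single `glDrawElements()` after enabling primitive restart with the same marker value. Note that the restart variant adds an index buffer that the multi-draw variant does not need.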

Should I omit vertex normals when there is no lighting calculations?

I have an OpenGL program that doesn't use lighting or shading of any kind; the illusion of shadow is done completely through textures, since the meshes are low-poly. Faces are not back-face culled, and I wouldn't use normal mapping, of course.
My question is, should I define the vertex normals anyway? Would excluding them use fewer resources and speed rendering, or would excluding them negatively impact the performance/visuals in some way?
My question is, should I define the vertex normals anyway?
There is no need to, if they are not used.
Would excluding them use fewer resources and speed rendering, or would excluding them negatively impact the performance/visuals in some way?
It definitely wouldn't impact the visuals if they are not used.
You do not mention whether you use the old fixed-function pipeline or the modern programmable pipeline. In the old fixed-function pipeline, the normals are only used for the lighting calculation. They have nothing to do with face culling. The front/back sides are determined solely by the primitive winding order in screen space.
If you use the programmable pipeline, the normals are used for whatever you use them. The GL itself will not care at all about it.
So excluding them should result in less memory needed to store the object. Whether rendering actually gets faster is hard to predict. If the normals aren't used, they shouldn't even be fetched, no matter whether they are provided or not. But caching will also have an impact here, so the improvement from not fetching them might not be noticeable at all.
Only if you are using immediate mode (glBegin()/glEnd()) to specify geometry (which you really should never ever do) will excluding the normals save you one GL function call per vertex, and this should give a significant improvement (but it will still be orders of magnitude slower than using vertex arrays).
If normals are not used for lighting, you don't need them (they are not used for back-face culling either).
The performance impact is more about how this changes your vertex layout and the resulting effect on the pre-transform cache (assuming you have an interleaved vertex format). Like CPUs, GPUs fetch data in cache lines, and if you get better alignment with cache lines without (or with) normals, it can have an impact on performance. For example, if your vertex size is 32 bytes and removing the normal gets it down to 20 bytes, the GPU will have to fetch 2 cache lines for some vertices, while with the 32-byte vertex format it always fetches only one cache line. However, if your vertex size is 44 bytes and removing the normal gets it down to 32 bytes, then it's an improvement for sure (better alignment and less data).
However, this is quite a fine-grained optimization in the end and is unlikely to have any significant impact either way, unless you are really pushing a huge amount of geometry through the pipeline with very lightweight vertex/pixel shaders (e.g. a shadow pass).
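The alignment argument above can be checked with a bit of arithmetic. The sketch below counts how many vertices of a tightly packed interleaved buffer span two cache lines, assuming the buffer starts on a line boundary; the function name and the 64-byte line size used in the example are assumptions, not properties of any particular GPU.

```cpp
#include <cassert>
#include <cstddef>

// Count the vertices whose bytes cross a cache line boundary, for a tightly
// packed buffer that starts at a line-aligned address (assumption).
inline std::size_t count_straddling(std::size_t vertex_size,
                                    std::size_t line_size,
                                    std::size_t vertex_count) {
    assert(vertex_size > 0 && line_size > 0);
    std::size_t straddling = 0;
    for (std::size_t i = 0; i < vertex_count; ++i) {
        const std::size_t first_byte = i * vertex_size;
        const std::size_t last_byte  = first_byte + vertex_size - 1;
        // Different line index for first and last byte means two fetches.
        if (first_byte / line_size != last_byte / line_size) ++straddling;
    }
    return straddling;
}
```

With an assumed 64-byte line, 16 consecutive vertices give 0 straddlers at 32 bytes per vertex, 4 at 20 bytes and 10 at 44 bytes, which matches the examples in the answer.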

A triangle with 3 varyings of same value.. does GPU interpolate / waste performance?

I have a simple question about which I was unable to find solid facts: how does the GPU behave when 3 vertices have the same varying output from the vertex shader?
Does the GPU notice that case, or does it try to interpolate when it's not even needed?
This might be interesting, as there are quite a few cases where you want a more or less constant varying available in the fragment shader per triangle. Please don't just guess; try to bring up references or at least reasons why you think it's one way or the other.
The GPU does the interpolation, no matter if it's needed or not.
The reason is quite simple: checking if the varying variable has already been changed is very expensive.
Shaders are small programs, that are executed concurrently on different GPU cores. So if you would like to avoid that two different cores are computing the same value, you would have to "reserve" the output variable. So you need an additional data structure (like a flag or mutex) that every core can read. In your case this would mean, that three different cores have to read the same flag, and the first of them has to reserve it if it's not already reserved.
This has to happen atomically, meaning that the reserving core has to be the only one setting the flag at a time. To do this, all other cores would, for example, have to be stopped for a tick. As you don't know which cores are computing the vertex shader, you would have to stop ALL other cores (on a GTX Titan this would be 2687 others).
Additionally, when the variable is set and a new frame is rendered, all the flags would have to be reset, so the race for the flag can begin again.
To conclude: you would need additional hardware in your GPU, that is expensive and slows down the rendering pipeline.
It is the programmer's job to avoid multiple shaders producing the same output. So if you are doing your job right, this does not happen, or you know that avoiding it (on the CPU) would cost more than ignoring it.
An example would be the stitching between different levels of detail (like on a height map), where most methods create some fragments twice. This has a very small impact on rendering performance but would require a lot of CPU time to avoid.
If the behavior isn't mandated in the OpenGL specification, then the answer is that it's up to the implementation.
The comments and other answers are almost certainly spot on that there is no optimization path for identical values because there would be little to no benefit from the added complexity to make such a path.

Vertex Buffers - indexed or direct, interlaced or separate

What are some common guidelines in choosing vertex buffer type? When should we use interlaced buffers for vertex data, and when separate ones? When should we use an index array and when direct vertex data?
I'm searching for some common guidelines. I know some cases where one or the other fits better, but not all cases are easily solvable. What should one have in mind when choosing the vertex buffer format and aiming for performance?
Links to web resources on the topic are also welcome.
First of all, you can find some useful information on the OpenGL wiki. Second of all, if in doubt, profile. There are some rules of thumb about this, but experience can vary based on the data set, hardware, drivers, etc.
Indexed versus direct rendering
I would almost always by default use the indexed method for vertex buffers. The main reason for this is the so called post-transform cache. It's a cache kept after the vertex processing stage of your graphics pipeline. Essentially it means that if you use a vertex multiple times you have a good chance of hitting this cache and being able to skip the vertex computation. There is one condition to even hit this cache and that is that you need to use indexed buffers, it won't work without them as the index is a part of this cache's key.
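To get a feel for what the post-transform cache buys you, here is a toy simulation that counts how often the vertex shader would have to run for a given index list. It assumes a simple FIFO cache keyed by index with a caller-chosen size; real hardware may use a different size and replacement policy, and the function name is made up for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Count vertex shader invocations for an index stream, assuming a FIFO
// post-transform cache holding `cache_size` entries keyed by vertex index.
inline std::size_t count_vertex_shader_runs(const std::vector<std::uint32_t>& indices,
                                            std::size_t cache_size) {
    std::deque<std::uint32_t> cache;
    std::size_t runs = 0;
    for (std::uint32_t idx : indices) {
        if (std::find(cache.begin(), cache.end(), idx) != cache.end()) {
            continue;  // cache hit: the transformed vertex is reused
        }
        ++runs;  // cache miss: the vertex shader has to run
        cache.push_back(idx);
        if (cache.size() > cache_size) cache.pop_front();  // evict the oldest entry
    }
    return runs;
}
```

For two triangles sharing an edge, `{0, 1, 2, 2, 1, 3}`, a 16-entry cache gives 4 runs, whereas the same geometry drawn without indices would transform all 6 vertices.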
Also, you will likely save storage: an index can be small (1 or 2 bytes) and lets you reuse a full vertex specification. Suppose that a vertex and all its attributes total about 30 bytes of data and you share this vertex between, let's say, 2 polygons. With indexed rendering (2-byte indices) this will cost you 2*index_size+attribute_size = 34 bytes. With non-indexed rendering this will cost you 60 bytes. Often your vertices will be shared more than twice.
Is index-based rendering always better? No, there might be scenarios where it's worse. For very simple applications it might not be worth the code overhead to set up an index-based data model. Also, when your attributes are not shared over polygons (e.g. normal per-polygon instead of per-vertex) there is likely no vertex-sharing at all and IBO's won't give a benefit, only overhead.
Next to that, while it enables the post-transform cache, it does make generic memory cache performance worse. Because you access the attributes relatively randomly, you might have quite a few more cache misses, and memory prefetching (if this is done on the GPU) won't work decently. So it might be (but measure) that, if you have enough memory and your vertex shader is extremely simple, the non-indexed version outperforms the indexed version.
Interleaving vs non-interleaving vs buffer per-attribute
This story is a bit more subtle and I think it comes down to weighing some properties of your attributes.
Interleaved might be better because all attributes will be close together and will likely be in a few memory cache lines (maybe even a single one). Obviously, this can mean better performance. However, combined with index-based rendering your memory access is quite random anyway, and the benefit might be smaller than you'd expect.
Know which attributes are static and which are dynamic. If you have 5 attributes of which 2 are completely static, 1 changes every 15 minutes and 2 every 10 seconds, consider putting them in 2 or 3 separate buffers. You don't want to re-upload all 5 attributes every time those 2 most frequent change.
Consider that attributes should be aligned on 4 bytes. So you might want to take interleaving even one step further from time to time. Suppose you have a vec3 attribute with 1-byte components and a scalar 1-byte attribute; naively this will need 8 bytes. You might gain a lot by putting them together in a single vec4, which should reduce usage to 4 bytes.
Play with buffer size, a too large buffer or too many small buffers may impact performance. But this is likely very dependent on the hardware, driver and OpenGL implementation.
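The 4-byte alignment point above can be sketched with two structs: one that pads each 1-byte-component attribute to its own 4-byte slot, and one that folds the scalar into the unused fourth component. The struct and member names are hypothetical, chosen only for this example.

```cpp
#include <cstdint>

// Naive layout: each attribute padded out to a 4-byte boundary on its own.
struct UnpackedAttribs {
    std::uint8_t rgb[3];   // vec3 with 1-byte components
    std::uint8_t pad0;     // padding to reach 4 bytes
    std::uint8_t scalar;   // separate 1-byte scalar attribute
    std::uint8_t pad1[3];  // padding to reach 4 bytes again
};

// Packed layout: the scalar rides along as the fourth component.
struct PackedAttribs {
    std::uint8_t rgb[3];
    std::uint8_t scalar;
};

static_assert(sizeof(UnpackedAttribs) == 8, "two padded 4-byte slots");
static_assert(sizeof(PackedAttribs) == 4, "one 4-byte slot");
```

On the shader side the packed version would be declared as a single four-component attribute, with the scalar read from the last component.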
Indexed vs Direct
Let's see what you get by indexing. Every repeating vertex, that is, a vertex with "smooth" break will cost you less. Every singular "edge" vertex will cost you more. For data that's based on real world and is relatively dense, one vertex will belong to many triangles, and thus indexes will speed it up. For procedurally generated arbitrary data, direct mode will usually be better.
Indexed buffers also add additional complications to the code.
Interleaved vs Separate
The main difference here is actually based on a question "will I want to update only one component?". If the answer is yes, then you shouldn't interleave, because any update will be extremely costly. If it's no, using interleaved buffers should improve locality of reference and generally be faster on most of the hardware.
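A back-of-the-envelope sketch of the update cost: it assumes positions change every frame while normals and colors stay static, and that an update re-uploads the whole buffer the changed attribute lives in. The struct and function names are hypothetical, and real drivers may offer cheaper partial updates.

```cpp
#include <cstddef>

// Attributes that never change after loading (assumption for this example).
struct StaticAttribs  { float normal[3]; float color[3]; };
// Attributes that are rewritten on every update.
struct DynamicAttribs { float pos[3]; };

// Fully interleaved: the dynamic data shares a buffer with the static data,
// so re-uploading that buffer moves everything.
constexpr std::size_t bytes_reuploaded_interleaved(std::size_t vertex_count) {
    return vertex_count * (sizeof(StaticAttribs) + sizeof(DynamicAttribs));
}

// Split: only the buffer holding the dynamic attributes is re-uploaded.
constexpr std::size_t bytes_reuploaded_split(std::size_t vertex_count) {
    return vertex_count * sizeof(DynamicAttribs);
}
```

For 1000 vertices this gives 36000 bytes per update in the interleaved case against 12000 bytes in the split case, at the price of the locality benefit that interleaving offers.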