I have an OpenGL program that doesn't use lighting or shading of any kind; the illusion of shadow is done completely through textures, since the meshes are low-poly. Faces are not back-face culled, and I wouldn't use normal mapping, of course.
My question is, should I define the vertex normals anyway? Would excluding them use fewer resources and speed rendering, or would excluding them negatively impact the performance/visuals in some way?
My question is, should I define the vertex normals anyway?
There is no need to, if they are not used.
Would excluding them use fewer resources and speed rendering, or would excluding them negatively impact the performance/visuals in some way?
It definitely wouldn't impact the visuals if they are not used.
You do not mention whether you use the old fixed-function pipeline or the modern programmable pipeline. In the old fixed-function pipeline, the normals are only used for the lighting calculation. They have nothing to do with face culling. The front/back sides are determined solely by the primitive winding order in screen space.
If you use the programmable pipeline, the normals are used for whatever you use them for. The GL itself will not care about them at all.
So excluding them should result in less memory needed to store the object. Whether rendering actually gets faster is hard to predict. If the normals aren't used, they shouldn't even be fetched, regardless of whether they are provided or not. But caching will also have an impact here, so the improvement from not fetching them might not be noticeable at all.
Only if you are using immediate mode (glBegin()/glEnd()) to specify geometry (which you really should never do) will excluding the normals save you one GL function call per vertex, and that can give a significant improvement (but it will still be orders of magnitude slower than using vertex arrays).
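For concreteness, here is a minimal sketch of the vertex-array route with the normal attribute simply left out; the struct, function name, attribute locations and the assumption of a GL 3.3+ context with a loader already set up are mine, not from the question:

#include <cstddef>   // for offsetof
// GL loader headers (e.g. GLAD/GLEW) and context creation are assumed.

// Each vertex carries only a position and a texture coordinate; there is no normal at all.
struct TexturedVertex {
    float pos[3];
    float uv[2];
};

void uploadMeshWithoutNormals(const TexturedVertex* vertices, GLsizei vertexCount)
{
    GLuint vao = 0, vbo = 0;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexCount * sizeof(TexturedVertex), vertices, GL_STATIC_DRAW);

    // Position goes to location 0, texcoord to location 1; no normal attribute is ever set up,
    // so nothing related to normals is stored or fetched.
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(TexturedVertex),
                          (void*)offsetof(TexturedVertex, pos));
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(TexturedVertex),
                          (void*)offsetof(TexturedVertex, uv));
}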
If normals are not used for lighting, you don't need them (they are not used for back-face culling either).
The impact on performance is more about how this changes your vertex layout and the resulting impact on the pre-transform cache (assuming you have an interleaved vertex format). Like CPUs, GPUs fetch data in cache lines, and if without (or with) normals you get better alignment with cache lines, it can have an impact on performance. For example, if your vertex size is 32 bytes and removing the normal gets it down to 20 bytes, the GPU will have to fetch two cache lines for some vertices, while with the 32-byte vertex format it always fetches exactly one cache line. However, if your vertex size is 44 bytes and removing the normal gets it down to 32 bytes, then it is for sure an improvement (better alignment and less data).
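To make the arithmetic concrete, here is a sketch of the two layouts mentioned above; the exact sizes assume default, non-SIMD-aligned glm types and a 64-byte cache line:

#include <glm/glm.hpp>

// Hypothetical 44-byte vertex: with 64-byte cache lines, some vertices straddle a line boundary.
struct FatVertex {
    glm::vec3 pos;      // 12 bytes
    glm::vec3 normal;   // 12 bytes
    glm::vec2 uv;       //  8 bytes
    glm::vec3 color;    // 12 bytes
};                      // 44 bytes total

// Dropping the normal gives a 32-byte vertex: exactly two vertices per 64-byte line, never straddling.
struct SlimVertex {
    glm::vec3 pos;      // 12 bytes
    glm::vec2 uv;       //  8 bytes
    glm::vec3 color;    // 12 bytes
};                      // 32 bytes total

static_assert(sizeof(FatVertex) == 44 && sizeof(SlimVertex) == 32,
              "sizes assume default (non-aligned) glm types");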
However, this is quite a fine-grained optimization in the end and is unlikely to have any significant impact either way, unless you are really pushing a huge amount of geometry through the pipeline with very lightweight vertex/pixel shaders (e.g. a shadow pass).
I am currently learning OpenGL for 3D rendering and I can't quite wrap my head around some things regarding shaders and VBOs. I get that all VBOs share one index and therefore you need to duplicate some data,
but when you create more VBOs there are nearly no faces whose vertices share the same position, normal and texture coordinates, so the indices are, at least from my point of view, pretty useless; it is basically just an array of consecutive numbers.
Is there an aspect of index buffers I don't see?
The utility of index buffers is, as with the utility of all vertex specification features, dependent on your mesh data.
Most of the meshes that get used in high-performance graphics, particularly those with significant polygon density, are smooth. Normals across such meshes are primarily smooth, since the modeller is usually approximating a primarily curved surface. Oh yes, there can be some sharp edges here and there, but for the most part, each position in such models has a single normal.
Texture coordinates usually vary smoothly across meshes too. There are certainly texture coordinate edges; well-optimized UV unwrapping often produces these kinds of things. But if you have a mesh of real size (10K+ vertices), most positions have a single texture coordinate. And tangents/bitangents are based on the changes in texture coordinates, so those will match the texture topology.
Are there meshes where the normal topology is highly disjoint from the position topology? Yes. Cubes being the obvious example. But there are oftentimes needs for highly faceted geometry, either to achieve a specific look or for low-polygon uses. In these cases, regular indexed rendering may not be of benefit to you.
But that does not change the fact that these cases are the exception, generally speaking, rather than the rule. Even if your code always involves these cases, that simply isn't true for the majority of high-performance graphics applications out there.
In a closed mesh, every vertex is shared by at least three faces. (The only time a vertex will be used fewer than three times is in a double-sided mesh, where two faces have the same vertices but opposite normals and winding order.) Not using indices and just duplicating vertices is not only inefficient but, at minimum, doubles the amount of vertex data required.
There's also potential for cache thrashing that could otherwise be avoided, along with the related pipeline stalls and other insanity.
Indices are your friend. Get to know them.
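As a tiny illustration of the sharing, here is a sketch of a quad drawn as two indexed triangles; the position VBO/VAO setup and the ebo handle are assumed to exist already:

// Four unique vertices instead of six; vertices 0 and 2 are referenced by both triangles.
const float quadPositions[] = {
    -0.5f, -0.5f, 0.0f,   // 0: bottom-left
     0.5f, -0.5f, 0.0f,   // 1: bottom-right
     0.5f,  0.5f, 0.0f,   // 2: top-right
    -0.5f,  0.5f, 0.0f,   // 3: top-left
};
const unsigned short quadIndices[] = { 0, 1, 2,   2, 3, 0 };

void drawIndexedQuad(GLuint ebo)
{
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(quadIndices), quadIndices, GL_STATIC_DRAW);
    // Six indices are drawn, but only four vertices had to be stored and transformed.
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, nullptr);
}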
Update
Typically, normals, etc. are stored in a normal map, or interpolated between vertices.
If you just want a faceted or "flat shaded" render, use the cross product of dFdx() and dFdy() (or, in HLSL, ddx() and ddy()) to generate the per-pixel normal in your fragment shader. Duplicating data is bad, and only necessary under very special and unusual circumstances. Nothing you've mentioned leads me to believe that this is necessary for your use case.
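A minimal sketch of that derivative trick, written here as a GLSL fragment shader stored in a C++ string; the worldPos varying is an assumption and would be written by your vertex shader:

// The fragment shader reconstructs a per-face normal from the world-space position,
// so the mesh needs no normal attribute at all.
const char* flatNormalFrag = R"(
#version 330 core
in vec3 worldPos;          // interpolated world-space position from the vertex shader
out vec4 fragColor;

void main() {
    // dFdx/dFdy give the change of worldPos between neighboring fragments;
    // their cross product is the (faceted) surface normal of the triangle.
    vec3 n = normalize(cross(dFdx(worldPos), dFdy(worldPos)));
    fragColor = vec4(n * 0.5 + 0.5, 1.0);   // visualize the normal for debugging
}
)";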
Why does OpenGL not support multiple index buffers for vertex attributes (yet)?
To me it seems very useful, since you could reuse attributes and you would have a lot more control over the rendering of your geometry.
Is there a reason why all attribute arrays have to use the same index, or could this feature become available in the near future?
OpenGL (and D3D. And Metal. And Mantle. And Vulkan) doesn't support this because hardware doesn't support this. Hardware doesn't support this because, for the vast majority of mesh data, this would not help. This is primarily useful for meshes that are predominantly not smooth (vertices sharing positions but not normals and so forth). And most meshes are smooth.
Furthermore, it will frequently be a memory-vs-performance tradeoff. Accessing your vertex data will likely be slower. The GPU has to fetch from two distinct locations in memory, compared to the case of a single interleaved fetch. And while caching helps, the cache coherency of multi-indexed accesses is much harder to control than for single-indexed accesses.
Hardware is unlikely to support this for that reason. But it also is unlikely to support it because you can do it yourself. Whether through buffer textures, image load/store or SSBOs, you can get your vertex data however you want nowadays. And since you can, there's really no reason for hardware makers to develop special hardware to help you.
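As a sketch of the "do it yourself" approach (GL 4.3+, SSBOs): the vertex shader below pulls attributes manually with gl_VertexID, so positions and normals can use independent index streams. The buffer binding points, the index-pair layout and the mvp uniform are all assumptions for illustration:

const char* multiIndexVert = R"(
#version 430 core
// vec4 is used instead of vec3 to avoid std430's 16-byte array-stride surprises.
layout(std430, binding = 0) readonly buffer Positions  { vec4 positions[];   };
layout(std430, binding = 1) readonly buffer Normals    { vec4 normals[];     };
layout(std430, binding = 2) readonly buffer MultiIndex { ivec2 indexPairs[]; }; // .x = position index, .y = normal index

uniform mat4 mvp;
out vec3 vNormal;

void main() {
    ivec2 idx   = indexPairs[gl_VertexID];   // one index pair per vertex
    vNormal     = normals[idx.y].xyz;
    gl_Position = mvp * vec4(positions[idx.x].xyz, 1.0);
}
)";
// Drawn with a plain glDrawArrays(GL_TRIANGLES, 0, vertexCount); no element buffer is bound.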
Also, there are questions as to whether you'd really be making your vertex data smaller at all. In multi-indexed rendering, each vertex is defined by a set of indices. Well, each index takes up space. If you have more than 64K of attributes in a model (hardly an unreasonable number in many cases), then you'll need 4 bytes per index.
A normal can be provided in 4 bytes, using GL_INT_2_10_10_10_REV and normalization. A 2D texture coordinate can be stored in 4 bytes too, as a pair of shorts. Colors can be stored in 4 bytes. So unless multiple attributes share the same index (normals and texture coordinate edges happen at the same place, as might happen on a cube), you will actually make your data bigger by doing this in many cases.
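For illustration, a sketch of how compact a single-indexed, packed vertex can already be with those formats; the attribute locations are assumptions, and the shorts are normalized so they map to [0, 1]:

#include <cstdint>
#include <cstddef>   // for offsetof

struct PackedVertex {
    float    position[3];   // 12 bytes
    uint32_t packedNormal;  //  4 bytes: GL_INT_2_10_10_10_REV (3 x 10-bit signed + 2 spare bits)
    uint16_t uv[2];         //  4 bytes: normalized unsigned shorts
};                          // 20 bytes per vertex, one shared index stream

void setupPackedAttributes()
{
    const GLsizei stride = sizeof(PackedVertex);
    glVertexAttribPointer(0, 3, GL_FLOAT,              GL_FALSE, stride, (void*)offsetof(PackedVertex, position));
    glVertexAttribPointer(1, 4, GL_INT_2_10_10_10_REV, GL_TRUE,  stride, (void*)offsetof(PackedVertex, packedNormal));
    glVertexAttribPointer(2, 2, GL_UNSIGNED_SHORT,     GL_TRUE,  stride, (void*)offsetof(PackedVertex, uv));
    glEnableVertexAttribArray(0);
    glEnableVertexAttribArray(1);
    glEnableVertexAttribArray(2);
}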
In this tutorial for OpenGL ES, techniques for optimizing models are explained and one of those is to use triangle strips to define your mesh, using "degenerate" triangles to end one strip and begin another without ending the primitive. http://www.learnopengles.com/tag/degenerate-triangles/
However, this guide is very specific to mobile platforms, and I wanted to know if this technique holds up on modern desktop hardware. Specifically, would it hurt? Would it cause graphical artifacts or degrade performance (as opposed to splitting the strips into separate primitives)?
If it causes no artifacts and performs at least as well, I aim to use it solely because it makes organizing vertices in a certain mesh I want to draw easier.
Degenerate triangles work pretty well on all platforms. I'm aware of an old fixed-function console that struggled with degenerate triangles, but anything vaguely modern will be fine. Reducing the number of draw calls is always good and I would certainly use degenerates rather than multiple calls to glDrawArrays.
However, an alternative that usually performs better is indexed draws of triangle lists. With a triangle list you have a lot of flexibility to reorder the triangles to take maximum advantage of the post-transform cache. The post-transform cache is a hardware cache of the last few vertices that went through the vertex shader; the GPU can spot if you've re-issued the same vertex and skip the entire vertex shader for that vertex.
In addition to the above answers (no, it shouldn't hurt at all unless you do something mad in terms of the ratio of real triangles to degenerates), also note that newer versions of the OpenGL and OpenGL ES (3.x or higher) APIs support a way to insert breaks into index lists without needing an actual degenerate triangle, which is called primitive restart.
https://www.khronos.org/opengles/sdk/docs/man3/html/glEnable.xhtml
When enabled, you can encode the maximum value for the index type in the index list (e.g. 0xFFFF for GL_UNSIGNED_SHORT, 0xFFFFFFFF for GL_UNSIGNED_INT), and when the GPU detects that value it starts building a new tristrip from the next index value.
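A sketch of what that looks like on desktop GL; on GLES 3 you would instead enable GL_PRIMITIVE_RESTART_FIXED_INDEX, where the restart value is always the all-ones value for the index type:

// Two strips in a single draw call, separated by the restart index instead of degenerate triangles.
const GLushort kRestart = 0xFFFF;
const GLushort stripIndices[] = {
    0, 1, 2, 3,      // first strip
    kRestart,        // cut here; no degenerate triangles emitted
    4, 5, 6, 7       // second strip starts fresh
};

void drawStrips()
{
    glEnable(GL_PRIMITIVE_RESTART);
    glPrimitiveRestartIndex(kRestart);
    // An element buffer containing stripIndices is assumed to be bound already.
    glDrawElements(GL_TRIANGLE_STRIP, 9, GL_UNSIGNED_SHORT, nullptr);
}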
It will not cause artifacts. As to "degrading performance"... relative to what? Relative to a random assortment of triangles with no indexing? Yes, it will be faster than that.
But there are plenty of other things one can do. For example, primitive restarting, which removes the need for degenerate triangles. Then there's using ordered lists of triangles for improved cache coherency. Will triangle strips be faster than that?
It rather depends on what you're rendering, how expensive your vertex shaders are, and various other things.
But at the end of the day, if you care about maximum performance on particular platforms, then you should profile for each platform and pick the vertex data based on what platform you're running on. If performance is really that important to you, then you're going to have to put forth some effort.
I am writing an OpenGL application, and for vertices, normals, and colors I am using separate buffers as follows:
GLuint vertex_buffer, normal_buffer, color_buffer;
My supervisor tells me that if I define a struct like:
struct vertex {
    glm::vec3 pos;
    glm::vec3 normal;
    glm::vec3 color;
};
GLuint vertex_buffer;
and then define a buffer of these vertices, my application will get much faster, because when the position is fetched into the cache, the normals and colors will be in the same cache line.
What I think is that defining such a struct does not have that much effect on performance, because with the struct layout fewer vertices fit in a cache line, while with separate buffers there are three different cache lines in the cache, one each for positions, normals, and colors. So nothing has really changed. Is that true?
First of all, using separate buffers for different vertex attributes may not be a good technique.
A very important factor here is the GPU architecture. Most (especially modern) GPUs have multiple cache lines (data for the Input Assembler stage, uniforms, textures), but fetching input attributes from multiple VBOs can be inefficient anyway (always profile!). Defining them in an interleaved format can help improve performance, as sketched below:
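A hedged sketch of what that interleaved setup could look like with the struct from the question; the attribute locations 0/1/2, the std::vector<vertex> container and default (non-aligned) glm types are assumptions:

#include <vector>
#include <cstddef>   // for offsetof
#include <glm/glm.hpp>

struct vertex {
    glm::vec3 pos;
    glm::vec3 normal;
    glm::vec3 color;
};

void uploadInterleaved(GLuint vertex_buffer, const std::vector<vertex>& vertices)
{
    glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer);
    glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(vertex), vertices.data(), GL_STATIC_DRAW);

    // One buffer, one stride, three attributes at different offsets.
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(vertex), (void*)offsetof(vertex, pos));
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(vertex), (void*)offsetof(vertex, normal));
    glEnableVertexAttribArray(2);
    glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, sizeof(vertex), (void*)offsetof(vertex, color));
}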
And that's what you would get if you used such a struct.
However, that's not always true (again, always profile!) - although interleaved data is more GPU-friendly, it needs to be properly aligned and can take significantly more space in memory.
But, in general:
Interleaved data formats:
- Cause less GPU cache pressure, because the vertex coordinate and attributes of a single vertex aren't scattered all over in memory. They fit consecutively into a few cache lines, whereas scattered attributes could cause more cache updates and therefore evictions. The worst-case scenario could be one (attribute) element per cache line at a time because of distant memory locations, while vertices get pulled in a non-deterministic/non-contiguous manner, where possibly no prediction and prefetching kicks in. GPUs are very similar to CPUs in this matter.
- Are also very useful for various external formats which satisfy the deprecated interleaved formats, where datasets of compatible data sources can be read straight into mapped GPU memory. I ended up re-implementing these interleaved formats with the current API for exactly those reasons.
- Should be laid out alignment-friendly, just like simple arrays. Mixing various data types with different size/alignment requirements may need padding to be GPU- and CPU-friendly. This is the only downside I know of, apart from the more difficult implementation.
- Do not prevent you from pointing to single attribute arrays in them for sharing.
Source
Further reading:
Best Practices for Working with Vertex Data
Vertex Specification Best Practices
Depends on the GPU architecture.
Most GPUs will have multiple cache lines (some for uniforms, others for vertex attributes, others for texture sampling)
Also, while the vertex shader is nearly done, the GPU can prefetch the next set of attributes into the cache, so that by the time the vertex shader is done the next attributes are right there, ready to be loaded into the registers.
tl;dr: don't bother with these "rules of thumb" unless you actually profile or know the actual architecture of the GPU.
Tell your supervisor "premature optimization is the root of all evil" – Donald E. Knuth. But don't forget the next sentence "but that doesn't mean we shouldn't optimize hot spots".
So did you actually profile the differences?
Anyway, the layout of your vertex data is not critical for caching efficiency on modern GPUs. It used to be on old GPUs (ca. 2000), which is why there were functions for interleaving vertex data. But these days it's pretty much a non-issue.
That has to do with the way modern GPUs access memory; in fact, modern GPUs' cache lines are not indexed by memory address but by access pattern (i.e. the first distinct memory access in a shader gets the first cache line, the second one the second cache line, and so on).
I've been reading about reconstructing a fragment's position in world space from a depth buffer, but I was thinking about storing position in a high-precision three channel position buffer. Would doing this be faster than unpacking the position from a depth buffer? What is the cost of reconstructing position from depth?
This question is essentially unanswerable for two reasons:
There are several ways of "reconstructing position from depth", with different performance characteristics.
It is very hardware-dependent.
The last point is important. You're essentially comparing the performance of a texture fetch from a GL_RGBA16F texture (at a minimum) to the performance of a GL_DEPTH24_STENCIL8 fetch followed by some ALU computations. Basically, you're asking whether the cost of fetching an additional 32 bits per fragment (the difference between the 24x8 fetch and the RGBA16F fetch) is equivalent to the cost of the ALU computations.
That's going to change with various things. The performance of fetching memory, texture cache sizes, and so forth will all have an effect on texture fetch performance. And the speed of ALUs depends on how many are going to be in-flight at once (ie: number of shading units), as well as clock speeds and so forth.
In short, there are far too many variables here to know an answer a priori.
That being said, consider history.
In the earliest days of shaders, back in the GeForce 3 days, people would need to re-normalize a normal passed from the vertex shader. They did this by using a cubemap, not by doing math computations on the normal. Why? Because it was faster.
Today, there's pretty much no common programmable GPU hardware, in the desktop or mobile spaces, where a cubemap texture fetch is faster than a dot-product, reciprocal square-root, and a vector multiply. Computational performance in the long-run outstrips memory access performance.
So I'd suggest going with history and finding a quick means of computing it in your shader.
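For reference, one common way of doing that reconstruction (a sketch only, not the only method; the invViewProj and depthTex uniforms, the full-screen uv varying, and the default [0, 1] depth range are all assumptions):

const char* reconstructFrag = R"(
#version 330 core
uniform sampler2D depthTex;     // the depth buffer, bound as a texture
uniform mat4 invViewProj;       // inverse(projection * view)
in  vec2 uv;                    // full-screen quad texture coordinate in [0, 1]
out vec4 fragColor;

void main() {
    float depth   = texture(depthTex, uv).r;                 // window-space depth in [0, 1]
    vec4  ndc     = vec4(vec3(uv, depth) * 2.0 - 1.0, 1.0);  // back to [-1, 1] NDC
    vec4  world   = invViewProj * ndc;                       // undo projection and view
    vec3 worldPos = world.xyz / world.w;                     // the perspective divide is part of the ALU cost discussed above
    fragColor = vec4(worldPos, 1.0);
}
)";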