I've been reading about reconstructing a fragment's position in world space from a depth buffer, but I was thinking about storing position in a high-precision three channel position buffer. Would doing this be faster than unpacking the position from a depth buffer? What is the cost of reconstructing position from depth?
This question is essentially unanswerable for two reasons:
There are several ways of "reconstructing position from depth", with different performance characteristics.
It is very hardware-dependent.
The last point is important. You're essentially comparing the performance of a texture fetch of a GL_RGBA16F (at a minimum) to the performance of a GL_DEPTH24_STENCIL8 fetch followed by some ALU computations. Basically, you're asking if the cost of fetching an addition 32-bits per fragment (the difference between the 24x8 fetch and the RGBA16F fetch) is equivalent to the ALU computations.
That's going to change with various things. The performance of fetching memory, texture cache sizes, and so forth will all have an effect on texture fetch performance. And the speed of ALUs depends on how many are going to be in-flight at once (ie: number of shading units), as well as clock speeds and so forth.
In short, there are far too many variables here to know an answer a priori.
That being said, consider history.
In the earliest days of shaders, back in the GeForce 3 days, people would need to re-normalize a normal passed from the vertex shader. They did this by using a cubemap, not by doing math computations on the normal. Why? Because it was faster.
Today, there's pretty much no common programmable GPU hardware, in the desktop or mobile spaces, where a cubemap texture fetch is faster than a dot-product, reciprocal square-root, and a vector multiply. Computational performance in the long-run outstrips memory access performance.
So I'd suggest going with history and finding a quick means of computing it in your shader.
Related
I was recently looking for ray tracing via opengl tutorials. Most of tutorials prefer compute shaders. I wonder why don't they just render to texture, then render the texture to screen as quad.
What is the advantages and disadvantages of compute shader method over screen quad?
Short answer: because compute shaders give you more effective tools to perform complex computations.
Long answer:
Perhaps the biggest advantage that they afford (in the case of tracing) is the ability to control exactly how work is executed on the GPU. This is important when you're tracing a complex scene. If your scene is trivial (e.g., Cornell Box), then the difference is negligible. Trace some spheres in your fragment shader all day long. Check http://shadertoy.com/ to witness the madness that can be achieved with modern GPUs and fragment shaders.
But. If your scene and shading are quite complex, you need to control how work is done. Rendering a quad and doing the tracing in a frag shader is going to, at best, make your application hang while the driver cries, changes its legal name, and moves to the other side of the world...and at worst, crash the driver. Many drivers will abort if a single operation takes too long (which virtually never happens under standard usage, but will happen awfully quickly when you start trying to trace 1M poly scenes).
So you're doing too much work in the frag shader...next logical though? Ok, limit the workload. Draw smaller quads to control how much of the screen you're tracing at once. Or use glScissor. Make the workload smaller and smaller until your driver can handle it.
Guess what we've just re-invented? Compute shader work groups! Work groups are compute shader's mechanism for controlling job size, and they're a far better abstraction for doing so than fragment-level hackery (when we're dealing with this kind of complex task). Now we can very naturally control how many rays we dispatch, and we can do so without being tightly-coupled to screen-space. For a simple tracer, that adds unnecessary complexity. For a 'real' one, it means that we can easily do sub-pixel raycasting on a jittered grid for AA, huge numbers of raycasts per pixel for pathtracing if we so desire, etc.
Other features of compute shaders that are useful for performant, industrial-strength tracers:
Shared Memory between thread groups (allows, for example, packet tracing, wherein an entire packet of spatially-coherent rays are traced at the same time to exploit memory coherence & the ability to communicate with nearby rays)
Scatter Writes allow compute shaders to write to arbitrary image locations (note: image and texture are different in subtle ways, but the advantage remains relevant); you no longer have to trace directly from a known pixel location
In general, the architecture of modern GPUs are designed to support this kind of task more naturally using compute. Personally, I have written a real-time progressive path tracer using MLT, kd-tree acceleration, and a number of other computationally-expensive techniques (PT is already extremely expensive). I tried to remain in a fragment shader / full-screen quad as long as I could. Once my scene was complex enough to require an acceleration structure, my driver started choking no matter what hackery I pulled. I re-implemented in CUDA (not quite the same as compute, but leveraging the same fundamental GPU architectural advances), and all was well with the world.
If you really want to dig in, have a glance at section 3.1 here: https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2007/guenther_07_BVHonGPU/Guenter_et_al._-_Realtime_Ray_Tracing_on_GPU_with_BVH-based_Packet_Traversal.pdf. Frankly the best answer to this question would be an extensive discussion of GPU micro-architecture, and I'm not at all qualified to give that. Looking at modern GPU tracing papers like the one above will give you a sense of how deep the performance considerations go.
One last note: any performance advantage of compute over frag in the context of raytracing a complex scene has absolutely nothing to do with rasterization / vertex shader overhead / blending operation overhead, etc. For a complex scene with complex shading, bottlenecks are entirely in the tracing computations, which, as discussed, compute shaders have tools for implementing more efficiently.
I am going to complete Josh Parnell information.
One problem with both fragment shader and compute shader is that they both lack recursivity.
A ray tracer is recursive by nature (yeah I know it is always possible to transform a recursive algorithm in a non recursive one, but is is not always that easy to do it).
So another way to see the problem could be the following :
Instead to have "one thread" per pixel, one idea could be to have one thread per path (a path is a part of your ray (between 2 bounces)).
Going that way, you are dispatching on your "bunch" of rays instead on your "pixel grid". Doing so simplify the potential recursivity of the ray tracer, and avoid divergence in complex materials :
More information here :
http://research.nvidia.com/publication/megakernels-considered-harmful-wavefront-path-tracing-gpus
Why does OpenGL not support multiple index buffers for vertex attributes (yet)?
To me it seems very useful, since you could reuse attributes and you would have a lot more control over the rendering of your geometry.
Is there a reason why all attribute arrays have to take the same index or could this feature be available in the near future?
OpenGL (and D3D. And Metal. And Mantle. And Vulkan) doesn't support this because hardware doesn't support this. Hardware doesn't support this because, for the vast majority of mesh data, this would not help. This is primarily useful for meshes that are predominantly not smooth (vertices sharing positions but not normals and so forth). And most meshes are smooth.
Furthermore, it will frequently be a memory-vs-performance tradeoff. Accessing your vertex data will likely be slower. The GPU has to fetch from two distinct locations in memory, compared to the case of a single interleaved fetch. And while caching helps, the cache coherency of multi-indexed accesses is much harder to control than for single-indexed accesses.
Hardware is unlikely to support this for that reason. But it also is unlikely to support it because you can do it yourself. Whether through buffer textures, image load/store or SSBOs, you can get your vertex data however you want nowadays. And since you can, there's really no reason for hardware makers to develop special hardware to help you.
Also, there are questions as to whether you'd really be making your vertex data smaller at all. In multi-indexed rendering, each vertex is defined by a set of indices. Well, each index takes up space. If you have more than 64K of attributes in a model (hardly an unreasonable number in many cases), then you'll need 4 bytes per index.
A normal can be provided in 4 bytes, using GL_INT_2_10_10_10_REV and normalization. A 2D texture coordinate can be stored in 4 bytes too, as a pair of shorts. Colors can be stored in 4 bytes. So unless multiple attributes share the same index (normals and texture coordinate edges happen at the same place, as might happen on a cube), you will actually make your data bigger by doing this in many cases.
I have an openGL program that doesn't use lighting or shading of any kind; the illusion of shadow is done completely through textures, since the meshes are low-poly. Faces are not backculled, and I wouldn't use normal-mapping of course.
My question is, should I define the vertex normals anyway? Would excluding them use fewer resources and speed rendering, or would excluding them negatively impact the performance/visuals in some way?
My question is, should I define the vertex normals anyway?
There is no need to, if they are not used.
Would excluding them use fewer resources and speed rendering, or would excluding them negatively impact the performance/visuals in some way?
It definitively wouldn't impact the visuals if there are not used.
You do not mention if you use old fixed-function pipeline or the modern programmable pipeline. In the old fixed-function pipeline, the normals are only used for the lighting calculation. The have nothing to do with the face culling. The front/back sides are determined solely by the primitive winding order in screen space.
If you use the programmable pipeline, the normals are used for whatever you use them. The GL itself will not care at all about it.
So excluding them should result in less memory needed for the object to be stored. If rendereing actually gets faster is hard to predict. If the normals aren't used, they shouldn't even be fetched, no matter if they are provided or not. But caching will also have an impact here, so the improvement of not fetching them might not be noticeable at all.
Only if you are using immediate mode(glBegin()/glEnd()) to specify geometry (which you really should never ever do), excluding the normals will save you one gl function call per vertex, and this should give a significant improvement (but still will be orders of magnitude slower than using vertex arrays).
If normals are not used for lighting, you don't need them (they are not used for back-face culling either).
The impact of performance is more about how this changes your vertex layout and resulting impact on pre-transform cache (assuming you have interleaved vertex format). Like on CPU's, GPU's fetch data in cache lines, and if without (or with) normals you get better alignment with cache lines, it can have an impact on the performance. For example if your vertex size is 32 bytes and removal of the normal gets it down to 20 bytes this will cause GPU fetching 2 cache lines for some vertices, while with 32 byte vertex format it's always fetches only one cache line. However, if your vertex size is 44 bytes and removal of normal gets it down to 32 bytes, then for sure it's an improvement (better alignment and less data).
However, this is quite a fine level optimization in the end and unlikely have any significant impact either way unless you are really pushing huge amount of geometry through the pipeline with very lightweight vertex/pixel shaders (e.g. shadow pass).
Should I prefer to calculate matrices on the CPU or GPU?
Let's say I have the following matrices P * V * M , should I calculate them on the CPU so that I can send the final matrix to the GPU (GLSL) or should I send those three matrices separately to the GPU so that GLSL can calculate the final matrix?
I mean in this case GLSL would have to calculate the MVP matrix for every vertex, so it is probably faster to precompute it on the CPU.
But let's say that GLSL only has to calculate he MVP matrix once, would the GPU calculate the final matrix faster than the CPU?
General rule: If you can pass it to a shader in form of a uniform, always precalculate on the CPU; no exceptions. Calculations on the shader side make sense only for values that vary between vertices and fragments. Everything that's constant among a whole batch of vertices is most efficiently dealt with on the CPU.
GPUs are not magic "can do faster everything" machines. There are certain tasks where a CPU can easily outperform a GPU, even for very large datasets. So a very simple guideline is: If you can move it to the CPU without spending more CPU time doing the calculation than it takes for the GPU in total overhead to process it, then do it on the CPU. The calculation of a single matrix is among those tasks.
Like most situations with OpenGL, it depends.
In most cases, a single calculation can be done faster on the CPU than on the GPU. The GPU's advantage is that it can do lots of calculations in parallel.
On the other hand, it also depends where your bottlenecks are. If your CPU is doing lots of other work, but your shaders are not a bottleneck yet on the lowest-powered target system, then you could easily see some performance improvement by moving some matrix multiplications to the vertex shader.
Generally, you should avoid any work in the fragment shader that could also be done in the vertex shader or on the CPU, but beyond that, it depends on the situation. Unless you are running into performance issues, just do it whatever way is easiest for you, and if you are having performance issues, do it both ways and profile the performance to see which works better.
Should Particle systems be updated entirely in the geometry shader or should the geometry shader be passed updated data for positions and life ect. At the moment i update everything in the geometry but iam not sure if this is the best idea incase some of the data is needed in the C++.
It's possible to almost everything in shaders ( especially if you're going for SM4+ ). I don't recommend going for anything over SM3 if you want any sort of market penetration. I still regret we didn't provide a SM2 fallback for our latest game, because quite a few people still use old crappy SM2 cards.
More on to the question. You can use RTT and never to do a round trip back to the main memory ( this is slow as hell, minimize transfer from graphics memory to main memory ), but the down side is that you need to use some rather elaborate tricks to compute AABBs ( which you'll want on the CPU side of things ) if you go pure GPU.
Instead we do everything which requires changing the state of a particle on the CPU side. We then have a tight memory representation of that data which gets updated to GPU. The vertex shader is rather meaty ( but that's totally fine, do as much as you possibly can in the vertex shader! ), it extracts this compressed representation of a particle, transforms it, and passes the uncompressed data on to the pixel shader. An important observation here is that you can, and should, split per vertex & per particle data. This implies using instancing ( which is just a way of saying: use frequency dividers ). We represent particle rotation with a normal + rotation about that normal.
Another reason for doing the state changes of a particle CPU side is that it's a heck of a lot easier to composite behavior CPU side. Any at least half decent particle system needs quite a bit of knobs to turn to be able to create interesting particle effects.
EDIT: And if you have anything resembling Particle::Update that can't be inlined you've failed, minimize per particle function calls, especially virtual ones, and keep the memory representation of a particle tightly packed!
That depends on what kind of particle system you have. In most cases you have a software representation in C++ and a hardware representation for your shader. The geometry data for the shader is computed from the software representation and should be as little as possible. Because in most cases not the computational power is the limiting resource but the transfer rate to the graphics card.
If you can decrease transfers even further with your method, you can still keep a software representation in memory for further usage. Even if this implies computing data twice, it can be faster than the transfer process.