Cost of Branching on uniforms on modern GPUs

Cost of Branching on uniforms on modern GPUs - opengl

When using GLSL on modern (GL3.3+) GPUs, what is the likely cost of branching on a uniform?
In my engine I'm getting to the point where I have a lot of shaders. And I have several different quality presets for a lot of those. As it stands, I'm using uniforms with if() in the shaders to choose different quality presets. I'm however worried that I might achieve better performance by recompiling the shaders and using #ifdef. The problem with that is the need to worry about tracking and resetting other uniforms when I recompile a shader.
Basically what I want to know is if my fears are unfounded. Is branching on a uniform cheap on modern GPUs? I have done a few tests myself and found very little difference either way, but I've only tested on an nVidia 680.

I will admit that I'm not an expert, but perhaps my speculation is better than nothing.
I would think that branching on uniforms is indeed fairly cheap. It's clearly much different from branching on texture or attribute data, since all the ALUs in the SIMD will follow the same code path from the shader, so it is a "real" branch rather than an execution mask. I'm not too sure how shader processors suffer from branch bubbles in their pipeline, but the pipeline is certainly bound to be more shallow than in general-purpose CPUs (particularly given the much lower clock-speeds they typically run at).
I wish I could be more helpful and I'd also appreciate if someone else can answer more authoritatively. I, for one, wouldn't worry too much about branching on uniforms, however. But as always, if you have the possibility, do profile your shader and see if it makes any noticeable difference.

Related

Does any graphics API allow efficient per-primitive branching?

When writing fragment shaders in OpenGL, one can branch either on compile-time constants, on uniform variables or on varying variables.
How performant that branching is depends on the hardware and driver implementation, but generally branching on a compile time constant is usually free and branching on a uniform is faster than on a varying.
In the case of a varying, the rasterizer still has to interpolate the variable for each fragment and the branch has to be decided on each family execution, even if the value of the varying is the same for each fragment in the current primitive.
What I wonder is whether any graphics api or extension allows some fragment shader branching that is executed only once per rasterized primitive (or in the case of tiled rendering once per primitive per bin)?

Dynamic branching is only expensive when it causes divergence of instances executing at the same time. The cost of interpolating a "varying" is trivial.
Furthermore, different GPUs handle primitive rasterization differently. Some GPUs ensure that wavefronts for fragment shaders only contain instances that are executing on the same primitive. On these GPUs, branching based on values that that don't change per-primitive will be fast.
However, other GPUs will pack instances from different primitives into the same wavefronts. On these GPUs, divergence will happen if the value is different for different primitives. How much divergence? It rather depends on how often you get multiple instances in a primitive. If many of your primitives are small in rasterized space, then you'll get a lot more divergence than if you have a lot of large primitives.
GPUs that pack instances from different primitives into a wavefront are trying to maximize how much their cores get utilized. It's a tradeoff: you're minimizing the overall number of wavefronts you have to execute, but a particular cause of divergence (data that is constant within a primitive but not between them) will be penalized.
In any case, try to avoid divergence when you can. But if your algorithm requires it... then your algorithm requires it, and the performance you get is the performance you get. The best you can do is let the GPU know that the "varying" will be constant per-primitive by using flat interpolation.

glColorPointer performance, better with floats or bytes?

Doing some maintenance on an old project and was asked by the client to see if it was possible to improve performance. I've done the parts I know and can easily test but then I tested
glColorPointer(4,GL_UNSIGNED_BYTE,...,...)
vs
glColorPointer(4,GL_FLOAT,...,...)
I could see literally no difference on the handful of machines I could test it on. Obviously it means thats not a bottleneck but since it's the first time I've been in a situation where I have access to both color formats it's also the first time I can wonder if there's a speed difference between the 2.
I'm expecting the answer is internally opengl adapters use float colors so it would be preferable to use float when available but anyone have a more definitive answer then that?
edit: the client has a few dozen machines that are ~10 year old and the project is used on those machines if it makes a difference

There's really no generally valid answer. You did the right thing by testing.
At least on desktop GPUs, it's fairly safe to assume that they will internally operate with 32-bit floats. On mobile GPUs, lower precision formats are more common, and you have some control over it using precision qualifiers in the shader code.
Assuming that 32-bit floats are used internally, there are two competing considerations:
If you specify the colors in a different format, like GL_UNSIGNED_BYTE, a conversion is needed while fetching the vertex data.
If you specify the colors in a more compact format, the vertex data uses less memory. This also has the effect that less memory bandwidth is consumed for fetching the data, with fewer cache misses, and potentially less cache pollution.
Which of these is more relevant really depends on the exact hardware, and the overall workload. The format conversion for item 1 can potentially be almost free if the hardware supports the byte format as part of fixed function vertex fetching hardware. Otherwise, it can add a little overhead.
Saving memory bandwidth is always a good thing. So by default, I would think that using the most compact representation is more likely to be beneficial. But testing and measuring is the only conclusive way to decide.
In reality, it's fairly rare that fetching vertex data is a major bottleneck in the pipeline. It does happen, but it's just not very common. So it's not surprising that you couldn't measure a difference.
For example, in a lot of use cases, texture data is overall much bigger than vertex data. If that is the case, the bandwidth consumed by texture sampling is often much more significant than the one used by vertex fetching. Also, related to this, there are mostly many more fragments than vertices, so anything related to fragment processing is much more performance critical than vertex processing.
On top of this, many applications make too many OpenGL API calls, or use the API in inefficient ways, and end up being limited by CPU overhead, particularly on very high performance GPUs. If you're optimizing performance for an existing app, that is pretty much the first thing you should check: Find out if you're CPU or GPU limited.

A triangle with 3 varyings of same value.. does GPU interpolate / waste performance?

I have a simple question of which I was unable to find solid facts about GPUs behaviour in case of 3 vertexes having the same varying output from vertex shader.
Does the GPU notice that case or does it try to interpolate when its not even needed ?
This might be interesting as there are quite some cases where you want a constantish varying available in fragment shader per triangle. Please don't just guess, try to bring up references or atleast reasons why you think its the one way or another.

The GPU does the interpolation, no matter if it's needed or not.
The reason is quite simple: checking if the varying variable has already been changed is very expensive.
Shaders are small programs, that are executed concurrently on different GPU cores. So if you would like to avoid that two different cores are computing the same value, you would have to "reserve" the output variable. So you need an additional data structure (like a flag or mutex) that every core can read. In your case this would mean, that three different cores have to read the same flag, and the first of them has to reserve it if it's not already reserved.
This has to happen atomically, meaning that the reserving core has to be the only one who is setting the flag at a time. To do this all other cores would e.g. have to be stopped for a tick. As you don't know the which cores are computing the vertex shader you would have to stop ALL other cores (on a GTX Titan this would be 2687 others).
Additionally, when the variable is set and a new frame is rendered, all the flags would have to be reset, so the race for the flag can begin again.
To conclude: you would need additional hardware in your GPU, that is expensive and slows down the rendering pipeline.
It is the programmers job to avoid that multiple shaders are producing the same output. So if you are doing your job right this does not happen or you know, that avoiding it (on the CPU) would cost more than ignoring it.
An example would be the stiching for different levels of detail (like on a height map), where most methods are creating some fragments twice. This is a very small impact on the rendering performance but would require a lot of CPU time to avoid.

If the behavior isn't mandated in the OpenGL specificiation then the answer is that it's up to the implementation.
The comments and other answers are almost certainly spot on that there is no optimization path for identical values because there would be little to no benefit from the added complexity to make such a path.

Performance of WebGL and OpenGL

For the past month I've been messing with WebGL, and found that if I create and draw a large vertex buffer it causes low FPS. Does anyone know if it be the same if I used OpenGL with C++?
Is that a bottleneck with the language used (JavaScript in the case of WebGL) or the GPU?
WebGL examples like this show that you can draw 150,000 cubes using one buffer with good performance but anything more than this, I get FPS drops. Would that be the same with OpenGL, or would it be able to handle a larger buffer?
Basically, I've got to make a decision to continue using WebGL and try to optimise by code or - if you tell me OpenGL would perform better and it's a language speed bottleneck, switch to C++ and use OpenGL.

If you only have a single drawArrays call, there should not be much of a difference between OpenGL and WebGL for the call itself. However, setting up the data in Javascript might be a lot slower, so it really depends on your problem. If the bulk of your data is static (landscape, rooms), WebGL might work well for you. Otherwise, setting up the data in JS might be too slow for your purpose. It really depends on your problem.
p.s. If you include more details of what you are trying to do, you'll probably get more detailed / specific answers.

Anecdotally, I wrote a tile-based game in the early 2000's using the old glVertex() style API that ran perfectly smoothly. I recently started port it to WebGL and glDrawArrays() and now on my modern PC that is at least 10 times faster it gets terrible performance.
The reason seems to be that I was faking a call go glBegin(GL_QUADS); glVertex()*4; glEnd(); by using glDrawArrays(). Using glDrawArrays() to draw one polygon is much much slower in WebGL than doing the same with glVertex() was in C++.
I don't know why this is. Maybe it is because javascript is dog slow. Maybe it is because of some context switching issues in javascript. Anyway I can only do around 500 one-polygon glDrawArray() calls while still getting 60 FPS.
Everybody seems to work around this by doing as much on the GPU as possible, and doing as few glDrawArray() calls per frame as possible. Whether you can do this depends on what you are trying to draw. In the example of cubes you linked they can do everything on the GPU, including moving the cubes, which is why it is fast. Essentially they cheated - typically WebGL apps won't be like that.
Google had a talk where they explained this technique (they also unrealistically calculate the object motion on the GPU): https://www.youtube.com/watch?v=rfQ8rKGTVlg

OpenGL is more flexible and more optimized because of the newer versions of the api used.
It is true if you say that OpenGL is faster and more capable, but it also depends on your needs.
If you need one cube mesh with texture, webGL would be sufficient. However, if you intend building large-scale projects with lots of vertices, post-processing effects and different rendering techniques (and kind of displacement, parallax mapping, per-vertex or maybe tessellation) then OpenGL might be a better and wiser choice actually.
Optimizing buffers to a single call, optimizing update of those can be done, but it has its limits, of course, and yes, OpenGL would most likely perform better anyway.
To answer, it is not a language bottleneck, but an api-version-used one.
WebGL is based upon OpenGL ES, which has some pros but also runs a bit slower and it has more abstraction levels for code handling than pure OpenGL has, and that is reason for lowering performance - more code needs to be evaluated.
If your project doesn't require web-based solution, and doesn't care which devices are supported, then OpenGL would be a better and smarter choice.
Hope this helps.

WebGL is much slower on the same hardware compared to equivalent OpenGL, because of the high overheard for each WebGL call.
On desktop OpenGL, this overhead is at least limited, if still relatively expensive.
But in browsers like Chrome, WebGL requires that not only do we cross the FFI barrier to access those native OpenGL API calls (which still incur the same overhead), but we also have the cost of security checks to prevent the GPU being hijacked for computation.
If you are looking at something like glDraw* calls, which are called per frame, this means we are talking about perhaps (an) order(s) of magnitude fewer calls. All the more reason to opt for something like instancing, where the number of calls is drastically reduced.

Which OpenGL functions are not GPU-accelerated?

I was shocked when I read this (from the OpenGL wiki):
glTranslate, glRotate, glScale
Are these hardware accelerated?
No, there are no known GPUs that
execute this. The driver computes the
matrix on the CPU and uploads it to
the GPU.
All the other matrix operations are
done on the CPU as well :
glPushMatrix, glPopMatrix,
glLoadIdentity, glFrustum, glOrtho.
This is the reason why these functions
are considered deprecated in GL 3.0.
You should have your own math library,
build your own matrix, upload your
matrix to the shader.
For a very, very long time I thought most of the OpenGL functions use the GPU to do computation. I'm not sure if this is a common misconception, but after a while of thinking, this makes sense. Old OpenGL functions (2.x and older) are really not suitable for real-world applications, due to too many state switches.
This makes me realise that, possibly, many OpenGL functions do not use the GPU at all.
So, the question is:
Which OpenGL function calls don't use the GPU?
I believe knowing the answer to the above question would help me become a better programmer with OpenGL. Please do share some of your insights.
Edit:
I know this question easily leads to optimisation level. It's good, but it's not the intention of this question.
If anyone knows a set of GL functions on a certain popular implementation (as AshleysBrain suggested, nVidia/ATI, and possibly OS-dependent) that don't use the GPU, that's what I'm after!
Plausible optimisation guides come later. Let's focus on the functions, for this topic.
Edit2:
This topic isn't about how matrix transformations work. There are other topics for that.

Boy, is this a big subject.
First, I'll start with the obvious: Since you're calling the function (any function) from the CPU, it has to run at least partly on the CPU. So the question really is, how much of the work is done on the CPU and how much on the GPU.
Second, in order for the GPU to get to execute some command, the CPU has to prepare a command description to pass down. The minimal set here is a command token describing what to do, as well as the data for the operation to be executed. How the CPU triggers the GPU to do the command is also somewhat important. Since most of the time, this is expensive, the CPU does not do it often, but rather batches commands in command buffers, and simply sends a whole buffer for the GPU to handle.
All this to say that passing work down to the GPU is not a free exercise. That cost has to be pitted against just running the function on the CPU (no matter what we're talking about).
Taking a step back, you have to ask yourself why you need a GPU at all. The fact is, a pure CPU implementation does the job (as AshleysBrain mentions). The power of the GPU comes from its design to handle:
specialized tasks (rasterization, blending, texture filtering, blitting, ...)
heavily parallel workloads (DeadMG is pointing to that in his answer), when a CPU is more designed to handle single-threaded work.
And those are the guiding principles to follow in order to decide what goes in the chip. Anything that can benefit from those ought to run on the GPU. Anything else ought to be on the CPU.
It's interesting, by the way. Some functionality of the GL (prior to deprecation, mostly) are really not clearly delineated. Display lists are probably the best example of such a feature. Each driver is free to push as much as it wants from the display list stream to the GPU (typically in some command buffer form) for later execution, as long as the semantics of the GL display lists are kept (and that is somewhat hard in general). So some implementations only choose to push a limited subset of the calls in a display list to a computed format, and choose to simply replay the rest of the command stream on the CPU.
Selection is another one where it's unclear whether there is value to executing on the GPU.
Lastly, I have to say that in general, there is little correlation between the API calls and the amount of work on either the CPU or the GPU. A state setting API tends to only modify a structure somewhere in the driver data. It's effect is only visible when a Draw, or some such, is called.
A lot of the GL API works like that. At that point, asking whether glEnable(GL_BLEND) is executed on the CPU or GPU is rather meaningless. What matters is whether the blending will happen on the GPU when Draw is called. So, in that sense, Most GL entry points are not accelerated at all.
I could also expand a bit on data transfer but Danvil touched on it.
I'll finish with the little "s/w path". Historically, GL had to work to spec no matter what the hardware special cases were. Which meant that if the h/w was not handling a specific GL feature, then it had to emulate it, or implement it fully in software. There are numerous cases of this, but one that struck a lot of people is when GLSL started to show up.
Since there was no practical way to estimate the code size of a GLSL shader, it was decided that the GL was supposed to take any shader length as valid. The implication was fairly clear: either implement h/w that could take arbitrary length shaders -not realistic at the time-, or implement a s/w shader emulation (or, as some vendors chose to, simply fail to be compliant). So, if you triggered this condition on a fragment shader, chances were the whole of your GL ended up being executed on the CPU, even when you had a GPU siting idle, at least for that draw.

The question should perhaps be "What functions eat an unexpectedly high amount of CPU time?"
Keeping a matrix stack for projection and view is not a thing the GPU can handle better than a CPU would (on the contrary ...). Another example would be shader compilation. Why should this run on the GPU? There is a parser, a compiler, ..., which are just normal CPU programs like the C++ compiler.
Potentially "dangerous" function calls are for example glReadPixels, because data can be copied from host (=CPU) memory to device (=GPU) memory over the limited bus. In this category are also functions like glTexImage_D or glBufferData.
So generally speaking, if you want to know how much CPU time an OpenGL call eats, try to understand its functionality. And beware of all functions, which copy data from host to device and back!

Typically, if an operation is per-something, it will occur on the GPU. An example is the actual transformation - this is done once per vertex. On the other hand, if it occurs only once per large operation, it'll be on the CPU - such as creating the transformation matrix, which is only done once for each time the object's state changes, or once per frame.
That's just a general answer and some functionality will occur the other way around - as well as being implementation dependent. However, typically, it shouldn't matter to you, the programmer. As long as you allow the GPU plenty of time to do it's work while you're off doing the game sim or whatever, or have a solid threading model, you shouldn't need to worry about it that much.
#sending data to GPU: As far as I know (only used Direct3D) it's all done in-shader, that's what shaders are for.

glTranslate, glRotate and glScale change the current active transformation matrix. This is of course a CPU operation. The model view and projection matrices just describes how the GPU should transforms vertices when issue a rendering command.
So e.g. by calling glTranslate nothing is translated at all yet. Before rendering the current projection and model view matrices are multiplied (MVP = projection * modelview) then this single matrix is copied to the GPU and then the GPU does the matrix * vertex multiplications ("T&L") for each vertex. So the translation/scaling/projection of the vertices is done by the GPU.
Also you really should not be worried about the performance if you don't use these functions in an inner loop somewhere. glTranslate results in three additions. glScale and glRotate are a bit more complex.
My advice is that you should learn a bit more about linear algebra. This is essential for working with 3D APIs.

There are software rendered implementations of OpenGL, so it's possible that no OpenGL functions run on the GPU. There's also hardware that doesn't support certain render states in hardware, so if you set a certain state, switch to software rendering, and again, nothing will run on the GPU (even though there's one there). So I don't think there's any clear distinction between 'GPU-accelerated functions' and 'non-GPU accelerated functions'.
To be on the safe side, keep things as simple as possible. The straightforward rendering-with-vertices and basic features like Z buffering are most likely to be hardware accelerated, so if you can stick to that with the minimum state changing, you'll be most likely to keep things hardware accelerated. This is also the way to maximize performance of hardware-accelerated rendering - graphics cards like to stay in one state and just crunch a bunch of vertices.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js