I'm making a graphics program that generates models. When the user performs certain actions, the behavior of the shader needs to change. These actions don't just affect numeric constants or input data; they affect the number, order, and type of a series of computation steps.
To solve this problem, two solutions came to mind:
Generate the shader code at run time and compile it then. This is very CPU heavy, since the compilation can take some time, but it is very GPU friendly.
Use some kind of bytecode that a single shader interprets at run time. This removes the need to recompile the shader, but now the GPU has to do a lot of bookkeeping.
I developed prototypes for both approaches and the results are more extreme than I expected.
The compilation time depends heavily on the rest of the shader (I guess there is a lot of function inlining). I think I could refactor the shader to do less work per thread and improve the compilation time, but I don't know yet whether that will be enough, and I don't much like the idea of run-time recompilation (it is very platform dependent, harder to debug, and more complex).
On the other hand, the bytecode approach runs about 25 times slower (not counting the compilation time of the first approach).
I knew that the bytecode approach was going to be slower, but I didn't expect this, particularly after optimizing it.
The interpreter works by reading bytecode from a uniform buffer object. Below is a simplification of it; I placed a "..." where the useful (non-bookkeeping) code goes. That part is the same as in the other approach (except that there it is obviously not inside a loop with a big if/else to select the proper instruction):
layout (std140, binding=7) uniform shader_data{
    uvec4 code[256];
};

float interpreter(vec3 init){
    float d[4];
    vec3 positions[3];
    int dDepth=0;
    positions[0]=init;
    for (int i=0; i<code[128].x; i+=3){
        const uint instruction=code[i].x;
        const uint ldi=code[i].y;
        const uint sti=code[i].z;
        if (instruction==MIX){
            ...
        }else{
            if (instruction<=BOX){
                if (instruction<=TRANSLATION){
                    if (instruction==PARA){
                        ...
                    }else{//TRANSLATION
                        ...
                    }
                }else{
                    if (instruction==EZROT){
                        ...
                    }else{//BOX
                        ...
                    }
                }
            }else{
                if (instruction<=ELLI){
                    if (instruction==CYL){
                        ...
                    }else{//ELLI
                        ...
                    }
                }else{
                    if (instruction==REPETITION){
                        ...
                    }else{//MIRRORING
                        ...
                    }
                }
            }
        }
    }
    return d[0];
}
My question is: do you know why it is so much slower (I don't see that much bookkeeping in the interpreter)? Can you guess what the main performance problems of this interpreter are?
GPUs don't like conditional branching at the best of times. Byte code interpretation is thus one of the worst things you could possibly do on a GPU.
Granted, the principal problem of branching is not so bad in your case, because your "byte code" is all in uniform memory. Even so, it's going to run excessively slowly due to all of the branches.
It would be much better to have a better handle on the possibilities of your shader at a high level, then use a very small number of branches to decide what your entire shader's behavior will be. These wouldn't be at the level of byte code. They'd be more like "compute positions with matrix skinning" or "compute lighting with this BRDF" or "use a shadow map".
This is the so-called "ubershader" approach: one shader, with a number of large and distinct codepaths that are determined by a few uniform settings.
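As a hypothetical sketch of what that looks like (the GLSL below is purely illustrative, not your actual shader; the uniform name and the three codepaths are made up):

// Shader source as a C++ string literal; one coarse uniform-driven branch per
// invocation selects an entire codepath, instead of one branch per interpreted
// byte-code instruction.
const char* ubershaderFragmentSrc = R"GLSL(
#version 430 core
uniform int shadingMode;      // 0 = unlit, 1 = lit, 2 = lit + shadow map
in vec3 normal;
out vec4 fragColor;

vec3 shadeUnlit()     { return vec3(1.0); }
vec3 shadeLit()       { return vec3(max(dot(normalize(normal), vec3(0.0, 1.0, 0.0)), 0.0)); }
vec3 shadeLitShadow() { return shadeLit() * 0.5; } // stand-in for a shadow-map lookup

void main() {
    if      (shadingMode == 0) fragColor = vec4(shadeUnlit(), 1.0);
    else if (shadingMode == 1) fragColor = vec4(shadeLit(), 1.0);
    else                       fragColor = vec4(shadeLitShadow(), 1.0);
}
)GLSL";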
If you can't do that, then there's really not much you can do outside of recompiling when needed. And that's going to hurt on the CPU; you cannot expect to use the shader the frame you start recompiling it (or for several frames thereafter, in all likelihood). SPIR-V shaders might help recompilation performance, but probably not that much.
although having a small delay (~100ms) is not that bad since it is not a game
I say measure the time it takes to do shader compilation. If it's less than 100ms (or whatever you consider to be sufficiently interactive), go with it.
However, be advised that many OpenGL implementations recompile shaders on a separate thread. So by the time glLinkProgram has finished, the shader may not be done. To accurately profile this process, you need to force the recompilation to have happened. Getting the GL_LINK_STATUS should do the trick.
One more performance trick: do not use glCompileShader and glLinkProgram; use glCreateShaderProgramv instead. It makes a separable program (containing just one shader stage), but that process will likely be faster than compiling and linking as separate actions.
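A rough sketch of how you might time this (a GL loader such as glad or GLEW is assumed to be initialized already; the function name is just illustrative):

#include <chrono>
#include <cstdio>
// #include your GL loader header (glad, GLEW, ...) before this

// glCreateShaderProgramv builds a separable one-stage program in a single
// call, and querying GL_LINK_STATUS forces any deferred/background
// compilation to actually finish before we stop the clock.
GLuint timedRecompile(const char* fragmentSrc, double* outMilliseconds) {
    auto t0 = std::chrono::steady_clock::now();

    GLuint program = glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &fragmentSrc);

    GLint linked = GL_FALSE;
    glGetProgramiv(program, GL_LINK_STATUS, &linked);   // blocks until the link is done

    auto t1 = std::chrono::steady_clock::now();
    *outMilliseconds = std::chrono::duration<double, std::milli>(t1 - t0).count();

    if (!linked) {
        std::fprintf(stderr, "shader failed to link\n");
    }
    return program;
}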
I recently wrote a compile-time ray tracer using constexpr functions with C++17. The full source code can be seen here. The relevant code for this question looks like this:
constexpr auto image = []() {
    StaticImage<image_width, image_height> image;
    Camera camera{Pointf{0.0f, 0.0f, 500.0f},
                  Vectorf{0.0f},
                  Vectorf{0.0f, 1.0f, 0.0f},
                  500.0f};
    std::array<Shapes, 1> shapes_list{Sphere{Pointf{0.0f}, 150.0f}};
    std::array<Materials, 1> materials_list{DefaultMaterial{}};
    ShapeContainer<decltype(shapes_list)> shapes{std::move(shapes_list)};
    MaterialContainer<decltype(materials_list)> materials{
        std::move(materials_list)};
    SphereScene scene;
    scene.set_camera(camera);
    Renderer::render(scene, image, shapes, materials);
    return image;
}();
Where each of the classes shown here (StaticImage, Camera, Shapes, Materials, ShapeContainer, MaterialContainer, and SphereScene) consist entirely of constexpr functions. Renderer::render is also constexpr and is in charge of looping over every pixel in the image, shooting rays into the scene, and setting the corresponding colour.
With this current setup and an image of 512x512, using MSVC 16.9.2 in Release mode, the compiler takes approximately 35 minutes to finish generating the image. During this process, its memory usage rises to the point where it ends up using almost 64GB of RAM.
So, my question is: why are the compilation time and memory usage so high?
My theory was that part of the reason for the long compilation time was the complexity of the call stacks (i.e. lots of templates, CRTP, and depth), so I tried simplifying the call stack a bit by removing several templates (the Vector class is no longer templated, for example) and managed to reduce the compilation time to 32 minutes and the memory usage to 61GB. Better, but still very high. The thing is that I can't quite figure out why it's so slow. I do understand that evaluating all of the constexpr functions is a very involved process (since the compiler has to check for UB, do type deduction, etc.), but I wasn't expecting it to be quite this slow. I'm also really confused by the high memory usage. The image array itself uses no more than 4MB of memory (512 * 512 * 3 * sizeof(float)), so where is the extra memory coming from?
Compile-time execution is going to be much less efficient than runtime execution. The compiler has to do more work to execute the same code. The point of compile-time execution is to do computations that you can't do at runtime, and sometimes to cache simpler computations at compile time.
Writing a whole, non-trivial application that exists only at compile-time is not going to be a fast thing to get done.
As for the particulars, the principal reason for the cost increase is that compile-time execution has to detect all undefined behavior. This means that a lot of operations which at runtime would just be offsetting a pointer have to be more complicated. Stack variables can't just be an offset from the stack pointer; the evaluator has to track the lifetime of each object explicitly. And so forth.
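A minimal sketch of the kind of check the constant evaluator has to make (illustrative code, not from your ray tracer):

#include <array>

constexpr int sum_with_off_by_one() {
    std::array<int, 4> a{1, 2, 3, 4};
    int sum = 0;
    for (int i = 0; i <= 4; ++i)   // deliberate off-by-one: reads a[4]
        sum += a[i];
    return sum;
}

int main() {
    int at_runtime = sum_with_off_by_one();                 // compiles; the UB is only hit when run
    // constexpr int at_compile_time = sum_with_off_by_one();  // ill-formed: the constant
    // evaluator detects the out-of-bounds read and rejects the program
    return at_runtime;
}

Detecting that out-of-bounds read requires per-access bookkeeping that a runtime build simply doesn't do, which is part of why constexpr evaluation is so much slower.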
Compile-time execution is basically interpreted C++. And there's not much reason to make it a particularly fast interpreter. Most compile-time operations are dealing with computations based on types and simple values, not with complex data structures. So that's what compilers are primarily optimized for.
I recall that some noise had been made recently to improve Clang's constexpr execution via better interpretation. But I don't know how much came of it.
I'm using OpenGL, and in my code I have some unreadable and annoying lines like:
newChild->dposition = dvec_4(dvec_4(newChild->boundingVerts[2].position, 1) + newChild->parent->dposition);
The idea was to keep positions in vec3s; with many objects in the scene this could amount to a good saving in storage, and even more importantly reduce the size of the buffers sent to the graphics card. But it leads to really hard-to-read code that casts back and forth, plus all the casts, I imagine, do cost something. So is it better to keep vec4s to avoid the casting?
Without having access to all of the code, it is hard to say.
However, I would rather say that using vec4s might bring performance/code quality benefits:
Guessing that the data is likely used on the GPU, it is probably more efficient to load/store a vec4 than a vec3. I am not entirely sure, but I don't think there is a single instruction to load a vec3; I believe it gets broken into loading a vec2 and a float.
Later, you could easily store some additional data in that spare float.
Less casting, making the code more readable.
Depending on the memory layout/member types of your struct, it may be aligned to 16 bytes anyway, undoing your "memory optimization" (see the sketch below).
If something is wrong, please correct me.
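To illustrate the alignment point, here is a hedged sketch with hypothetical struct names (not code from your program); in a std140 uniform/storage block a vec3 is aligned to 16 bytes anyway, and on the C++ side, as soon as a vec3 member is followed by one more float, the struct is 16 bytes too:

#include <iostream>

struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };

struct NodeVec3 {
    Vec3  position;  // 12 bytes
    float weight;    // 4 more bytes -> 16 total
};

struct NodeVec4 {
    Vec4  position;  // 16 bytes, w is free to carry data such as 'weight'
};

int main() {
    std::cout << sizeof(NodeVec3) << ' ' << sizeof(NodeVec4) << '\n';  // typically "16 16"
}

So the vec4 version often costs nothing extra while leaving a spare component for later use.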
I am currently working on a typesafe shader system and it was surprisingly easy to implement until now.
I just realized that uniforms and shader inputs are getting optimized away if they are not used in the shader. This is not hard to solve but I wonder how I would handle such a case.
Now to my question:
Is there any reason why someone would want an unused shader-input/uniform in their code?
...
in vec2 pos;
in vec2 uv;
...
For example this would be of type Shader<vec2,vec2>, but if uv is not used it would really be Shader<vec2>.
Now the question is whether I should just generate Shader<vec2>, or generate Shader<vec2,vec2> but then throw a compilation error because the number of active uniforms (in this case 1) would be smaller than the length of the type list.
At the moment I would just generate Shader<vec2>, because that is how OpenGL handles it, but I am wondering why.
Edit:
Okay, this is going to be way more complicated than I thought. It seems that every compiler "optimizes away" unused variables differently. I thought I could just use http://docs.gl/gl4/glGetActiveUniform, but then the generated types would not be deterministic, which is definitely not what I want.
I probably would have to write my own parser for this. I might even decide to abandon the project now.
I have a question about implementing image interpolation (bicubic and bilinear methods) in C++. My main concern is speed. Based on my understanding of the problem, the following strategies can be adopted to make the interpolation program fast and efficient:
Fast image interpolation using Streaming SIMD Extensions (SSE)
Image interpolation with multi-threading or GPU
Fast image interpolation algorithms
C++ implementation tricks
Here, I am more interested in the last strategy. I set up a class for interpolation:
/**
 * This class is used to perform interpolation for a certain point in
 * the image grid.
 */
class Sampling
{
public:
    // samples[0] *-------------* samples[1]
    //              --------------
    //              --------------
    // samples[2] *-------------* samples[3]
    inline void sampling_linear(unsigned char *samples, unsigned char &res)
    {
        unsigned char res_temp[2];
        sampling_linear_1D(samples, res_temp[0]);
        sampling_linear_1D(samples + 2, res_temp[1]);
        sampling_linear_1D(res_temp, res);
    }

private:
    inline void sampling_linear_1D(unsigned char *samples, unsigned char &res)
    {
        // 1D linear interpolation between samples[0] and samples[1]
        // (body omitted here)
    }
};
Here I only give an example of bilinear interpolation. To make the program run faster, the functions are declared inline. My question is whether this implementation scheme is efficient. Additionally, suppose that during the interpolation procedure I give the user the option of choosing between different interpolation methods. Then I have two choices:
Depending on the interpolation method, invoke a function that performs that interpolation over the whole image.
For each output pixel, first determine its position in the input image, and then, according to the interpolation method setting, pick the interpolation function.
The first method means more code in the program, while the second one may lead to inefficiency. How should I choose between these two schemes? Thanks!
Fast image interpolation using Streaming SIMD Extensions (SSE)
This may not provide the desired result, because I expect that your algorithm will be memory-bound rather than FLOP/s-bound.
I mean, it will definitely be an improvement, but perhaps not worth the implementation cost.
And by the way, modern compilers can perform auto-vectorization (i.e. use SSE and later extensions): GCC starting from 4.0, MSVC starting from 2012; see the MSVC Auto-Vectorization video lectures.
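For instance, a trivial loop like the following sketch is typically auto-vectorized without any hand-written intrinsics (with GCC/Clang at -O3, or -O2 plus -ftree-vectorize, and MSVC 2012+ in Release):

void scale(const float* in, float* out, int n, float k) {
    // The compiler will normally emit SSE/AVX code for this loop on its own.
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * k;
}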
Image interpolation with multi-threading or GPU
A multi-threaded version should give a good speedup, because it allows you to exploit all the available memory throughput.
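A rough sketch of such a row-parallel scheme (forEachRowParallel and processRow are hypothetical names, not from your code):

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Output rows are distributed round-robin over the hardware threads, so the
// memory-bound interpolation can use the full available bandwidth.
// processRow stands in for a per-row bilinear/bicubic routine.
void forEachRowParallel(int height, const std::function<void(int)>& processRow) {
    const unsigned threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&, t] {
            for (int y = static_cast<int>(t); y < height; y += static_cast<int>(threads))
                processRow(y);
        });
    }
    for (auto& w : workers)
        w.join();
}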
If you do not plan to process the data several times, or use it in some other way on the GPU, then GPGPU may not give the desired result. Yes, it will produce the result faster (mostly due to higher memory speed), but this effect will be cancelled out by the slow transfer between main RAM and the GPU's RAM.
Just for example, approximate modern throughputs:
CPU RAM ~ 20GiB/s
GPU RAM ~ 150GiB/s
Transferring between CPU RAM <-> GPU RAM ~ 3-5 GiB/s
For single-pass memory-bound algorithms like this one, in most cases the third item makes using the GPU impractical.
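As a rough worked example with those numbers: a 1920x1080 RGBA8 frame is about 8 MiB, so moving it over a 4 GiB/s link costs roughly 2 ms each way, while a single memory-bound pass over it in CPU RAM at 20 GiB/s costs roughly 0.4 ms; the transfer alone already dwarfs the work you are trying to accelerate.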
In order to make the program run faster, the inline function is employed
Member functions defined inside the class definition are implicitly "inline". Be aware that the main purpose of "inline" is not actually inlining, but helping to prevent One Definition Rule violations when your functions are defined in headers.
There are compiler-specific "force inline" features, for instance MSVC has __forceinline, or Boost's compiler-agnostic BOOST_FORCEINLINE macro.
Anyway, trust your compiler unless you can prove otherwise (with the help of the generated assembly, for example). What matters most is that the compiler can see the function definitions; then it can decide to inline them itself, even if a function is not marked inline.
My question is whether this implementation scheme is efficient.
As I understand it, as a pre-step you gather the samples into a 2x2 matrix. It may be better to pass two pointers to two-element arrays within the image directly, or one pointer plus the image width (to compute the second pointer automatically). However, this is not a big issue; most likely your temporary 2x2 matrix will be optimized away.
What is really important is how you traverse your image.
Let's say that for given x and y the index is calculated as:
i = width*y + x;
Then your traversal loop should be:
for (int y = /*...*/)
    for (int x = /*...*/)
    {
        // loop body
    }
Because if you chose the other order (x in the outer loop, y in the inner), it would not be cache-friendly, and the resulting performance drop can be up to 64x (depending on your pixel size). You may check it just out of interest.
The first method means more codes in the program while the second one may lead to inefficiency. Then, how could I choose between these two schemes? Thanks!
In this case, you can use compile-time polymorphism, for instance based on templates, to reduce the amount of code in the first version.
Just look at std::accumulate: it is written once, and yet it works with different types of iterators and different binary operations (functions or functors), without implying any runtime penalty for its polymorphism.
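A minimal sketch of that idea applied to your case (Image, BilinearKernel, and resample are hypothetical illustrations, not your classes):

#include <cstddef>
#include <vector>

struct Image {
    int width = 0, height = 0;
    std::vector<unsigned char> pixels;   // width * height, single channel
};

struct BilinearKernel {
    unsigned char operator()(const Image& src, float x, float y) const {
        // ... bilinear sampling of src at (x, y) goes here ...
        (void)src; (void)x; (void)y;
        return 0;
    }
};

// The interpolation kernel is a template parameter, so "which method?" is
// decided once at compile time instead of per pixel.
template <typename Kernel>
void resample(const Image& src, Image& dst, Kernel kernel) {
    for (int y = 0; y < dst.height; ++y)         // y outer, x inner: cache-friendly
        for (int x = 0; x < dst.width; ++x) {
            const float sx = x * float(src.width) / float(dst.width);
            const float sy = y * float(src.height) / float(dst.height);
            dst.pixels[static_cast<std::size_t>(y) * dst.width + x] = kernel(src, sx, sy);
        }
}

// Usage: resample(src, dst, BilinearKernel{}); a BicubicKernel would be a second
// instantiation, just like std::accumulate with a different functor.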
Alexander Stepanov says:
For many years, I tried to achieve relative efficiency in more advanced languages (e.g., Ada and Scheme) but failed. My generic versions of even simple algorithms were not able to compete with built-in primitives. But in C++ I was finally able to not only accomplish relative efficiency but come very close to the more ambitious goal of absolute efficiency. To verify this, I spent countless hours looking at the assembly code generated by different compilers on different architectures.
Check Boost's Generic Image Library; it has a good tutorial, and there is a video presentation from its author.
I don't know how uniforms are represented in memory.
Uniforms seem like they could take up valuable register space, but they're ultimately passed in/through/out into global memory, right?
Does the situation change when the uniforms are unused? Can the compiler optimize them away? I have gotten -1 (invalid) as a uniform location when this is the case, so I assume yes.
Uniforms are represented in whatever manner the GLSL compiler and OpenGL implementation deem fit. Some implementations make certain uniforms actual constants within the generated assembly, such that changing a uniform actually patches the assembly in place. Some have special memory for uniforms.
It's all hardware dependent.
Compilers are allowed to optimize uniforms out; this is where the term "active uniform" comes from. The OpenGL API for querying uniform information only works for active uniforms: those that are actually detected to be in-use by the compiler.
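For illustration, enumerating them looks roughly like this (a sketch; a GL loader is assumed to be initialized, and the function name is made up):

#include <cstdio>
// #include your GL loader header (glad, GLEW, ...) before this

// Lists only the *active* uniforms of a linked program; anything the compiler
// optimized away simply does not show up here, and glGetUniformLocation for
// it returns -1.
void listActiveUniforms(GLuint program) {
    GLint count = 0;
    glGetProgramiv(program, GL_ACTIVE_UNIFORMS, &count);
    for (GLint i = 0; i < count; ++i) {
        char name[256];
        GLint size = 0;
        GLenum type = 0;
        glGetActiveUniform(program, GLuint(i), sizeof(name), nullptr, &size, &type, name);
        std::printf("uniform %d: %s (type 0x%x, array size %d), location %d\n",
                    i, name, type, size, glGetUniformLocation(program, name));
    }
}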
First of all, the GLSL specification doesn't say anything about the actual implementation of its concepts, so the following is of course to be read as "it could be done any way, but nowadays it's usually done this way".
To my (maybe limited) knowledge of graphics hardware, uniforms usually live in so-called constant memory, which is part of the global device memory (and on newer hardware should even be cached), since they cannot be changed by the shader program anyway and are global to all invocations of the program (which could and should run on different multiprocessors in parallel). So they don't take up any per-multiprocessor register space themselves.
You are also right that the GLSL compiler can (and usually will) optimize away any unused uniforms (and also attributes), but of course only if they are not used in any possible branch of execution. So what you experienced with getting a uniform location of -1 is perfectly valid (and usually desired) behaviour.