I recently encountered some confusion while using a GLSL function which modified (and copied out) one of its input parameters. Let's suppose this is the function:
float v(inout uint idx) {
return 3.14 * idx++;
}
Now let's use that function in a potentially ambiguous way:
uint idx = 0;
const vec4 values = vec4(v(idx), v(idx), v(idx), v(idx));
We might reasonably assume that after the call to the vec4 constructor returns, our vector values should equal {0.00, 3.14, 6.28, 9.42} and idx should equal 4. However, it occurred to me to wonder whether the order of evaluation of function arguments in GLSL is well defined, and if so, whether the ordering assumed above is correct. Alternatively, could this result in (implementation-dependent) undefined behavior?
So of course I consulted the GLSL spec (v4.6, rev3, §6.1.1, p116, "Function Calling Conventions"), which has the following to say:
All arguments are evaluated at call time, exactly once, in order, from left to right.
So far so good. But then farther down the page:
The order in which output parameters are copied back to the caller is undefined.
I'm not entirely clear on the significance of this second statement.
Does it mean that for the function float doWork(inout uint v1, inout uint v2) {...} that the order in which v1 and v2 are copied back is undefined? This would matter if you did something like passing the same local variable in place of both parameters.
Alternatively, does it instead mean that in the earlier example, the order in which the variable idx is updated is undefined, and as such the ordering of values is also undefined?
Or perhaps both of these cases are undefined? That is, perhaps all copy-back operations on the entire line of code happen in an unordered manner?
It goes without saying that using multiple variables to hold the values prior to the vec4 constructor call would trivially avoid this question entirely, but that's not the point. Rather, I'd like to know how this part of the standard was meant to be interpreted and whether or not my first example would result in idx containing an undefined value.
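For what it's worth, here is a rough C++ model of GLSL's copy-in/copy-out parameter convention, making the "same variable bound to both parameters" case concrete; doWorkBody and the values it writes are made up purely for illustration:
#include <cstdint>
#include <cstdio>

// Stand-in for the body of: float doWork(inout uint v1, inout uint v2)
// It operates on private copies, like GLSL formal parameters.
static float doWorkBody(std::uint32_t& v1, std::uint32_t& v2) {
    v1 = 1; // arbitrary writes, purely for illustration
    v2 = 2;
    return 0.0f;
}

int main() {
    std::uint32_t idx = 0;

    // Copy-in: each formal parameter receives its own copy of idx.
    std::uint32_t p1 = idx, p2 = idx;
    doWorkBody(p1, p2);

    // Copy-out: the spec leaves this order undefined, so when the same
    // variable is bound to both inout parameters, idx could end up as
    // either 1 or 2, depending on which copy-back happens last.
    idx = p1;
    idx = p2; // swap these two statements for the other outcome
    std::printf("idx = %u\n", (unsigned)idx);
    return 0;
}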
Related
This is something I've always wondered: is it easier for the compiler to optimise functions where existing variables are re-used, where new (ideally const) intermediate variables are created, or where creating variables is avoided in favour of directly using expressions?
For example, consider the functions below:
// 1. Use expression as and when needed, no new variables
void MyFunction1(int a, int b)
{
SubFunction1(a + b);
SubFunction2(a + b);
SubFunction3(a + b);
}
// 2. Re-use existing function parameter variable to compute
// result once, and use result multiple times.
// (I've seen this approach most in old-school C code)
void MyFunction2(int a, int b)
{
a += b;
SubFunction1(a);
SubFunction2(a);
SubFunction3(a);
}
// 3. Use a new variable to compute result once,
// and use result multiple times.
void MyFunction3(int a, int b)
{
int sum = a + b;
SubFunction1(sum);
SubFunction2(sum);
SubFunction3(sum);
}
// 4. Use a new const variable to compute result once,
// and use result multiple times.
void MyFunction4(int a, int b)
{
const int sum = a + b;
SubFunction1(sum);
SubFunction2(sum);
SubFunction3(sum);
}
My intuition is that:
In this particular situation, function 4 is easiest to optimise because it explicitly states the intention for the use of the data. It is telling the compiler: "We are summing the two input arguments, the result of which will not be modified, and we are passing on the result in an identical way to each subsequent function call." I expect that the value of the sum variable will just be put into a register, and no actual underlying memory access will occur.
Function 1 is the next easiest to optimise, though it requires more inference on the part of the compiler. The compiler must spot that a + b is used in an identical way for each function call, and it must know that the result of a + b is identical each time that expression is used. I would still expect the result of a + b to be put into a register rather than committed to memory. However, if the input arguments were more complicated than plain ints, I can see this being more difficult to optimise (rules on temporaries would apply for C++).
Function 3 is the next easiest after that: the result is not put into a const variable, but the compiler can see that sum is not modified anywhere in the function (assuming that the subsequent functions do not take a mutable reference to it), so it can just store the value in a register similarly to before. This is less likely than in function 4's case, though.
Function 2 gives the least assistance for optimisations, since it directly modifies an incoming function argument. I'm not 100% sure what the compiler would do here: I don't think it's unreasonable to expect it to be intelligent enough to spot that a is not used anywhere else in the function (similarly to sum in function 3), but I wouldn't guarantee it. This could require modifying stack memory depending on how the function arguments are passed in (I'm not too familiar with the ins and outs of how function calls work at that level of detail).
Are my assumptions here correct? Are there more factors to take into account?
EDIT: A couple of clarifications in response to comments:
If C and C++ compilers approach the above examples in different ways, I'd be interested to know why. I can understand that C++ might optimise things differently depending on what constraints there are on whichever objects might be inputs to these functions, but for primitive types like int I would expect them to use identical heuristics.
Yes, I could compile with optimisations and look at the assembly output, but I don't know assembly, hence I'm asking here instead.
Good modern compilers generally do not “care” about the names you use to store values. They perform lifetime analyses of the values and generate code based on that. For example, given:
int x = complicated expression 0;
... code using x
x = complicated expression 1;
... code using x
the compiler will see that complicated expression 0 is used in the first section of code and complicated expression 1 is used in the second section of code, and the name x is irrelevant. The result will be the same as if the code used different names:
int x0 = complicated expression 0;
... code using x0
int x1 = complicated expression 1;
... code using x1
So there is no point in reusing a variable for a different purpose; it will not help the compiler save memory or otherwise optimize.
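If you want to check this claim without being able to read assembly fluently, one low-effort approach is to compile two variants and diff the output; a small sketch (function names made up):
// compare.cpp -- compile with: g++ -O2 -S compare.cpp, then diff the bodies
int reuse(int a, int b) {
    a += b;                // reuses the parameter variable
    return a * a;
}

int fresh(int a, int b) {
    const int sum = a + b; // fresh const variable instead
    return sum * sum;
}
// With a typical optimizer, the two bodies come out identical.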
Even if the code were in a loop, such as:
int x;
while (some condition)
{
x = complicated expression;
... code using x
}
the compiler will see that the value of complicated expression is born at the beginning of the loop body and dies by the end of the loop body.
What this means is you do not have to worry about what the compiler will do with the code. Instead, your decisions should be guided mostly by what is clearer to write and more likely to avoid bugs:
Avoid reusing a variable for more than one purpose. For example, if somebody is later updating your function to add a new feature, they might miss the fact you have changed the function parameter with a += b; and use a later in the code as if it still contained the original parameter.
Do freely create new variables to hold repeated expressions. int sum = a + b; is fine; it expresses the intent and makes it clearer to readers when the same expression is used in multiple places.
Limit the scope of variables (and identifiers generally). Declare them only in the innermost scope where they are needed, such as inside a loop rather than outside. This avoids a variable accidentally being used where it is no longer appropriate.
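A minimal sketch of that last point (processAll and the printf are just for illustration):
#include <cstdio>

void processAll(const int* a, const int* b, int n) {
    for (int i = 0; i < n; ++i) {
        const int sum = a[i] + b[i]; // declared in the innermost scope
        std::printf("%d\n", sum);
    }
    // `sum` no longer exists here, so accidental reuse is a compile error.
}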
GCC can suggest functions for attribute pure and attribute const with the flags -Wsuggest-attribute=pure and -Wsuggest-attribute=const.
The GCC documentation says:
Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure.
But what can happen if you attach __attribute__((__pure__)) to a function that doesn't match the above description, and does have side effects? Is it simply the possibility that the function will be called fewer times than you would want it to be, or is it possible to create undefined behaviour or other kinds of serious problems?
Similarly for __attribute__((__const__)) which is stricter again - the documentation states:
Basically this is just slightly more strict class than the pure attribute below, since function is not allowed to read global memory.
But what can actually happen if you attach __attribute__((__const__)) to a function that does access global memory?
I would prefer technical answers with explanations of actual possible scenarios within the scope of GCC / G++, rather than the usual "nasal demons" handwaving that appears whenever undefined behaviour gets mentioned.
But what can happen if you attach __attribute__((__pure__))
to a function that doesn't match the above description,
and does have side effects?
Exactly. Here's a short example:
extern __attribute__((pure)) int mypure(const char *p);
int call_pure() {
int x = mypure("Hello");
int y = mypure("Hello");
return x + y;
}
My version of GCC (4.8.4) is clever enough to remove the second call to mypure (the result becomes 2*mypure()). Now imagine if mypure were printf: the side effect of printing the string "Hello" would be lost.
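In other words, after common subexpression elimination the function behaves roughly like this (an illustrative rewrite, not GCC's actual output):
extern __attribute__((pure)) int mypure(const char *p);

int call_pure() {
    int x = mypure("Hello"); // evaluated once...
    return x + x;            // ...and the second call is folded away
}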
Note that if I replace call_pure with
char s[1];
int call_pure() {
int x = mypure("Hello");
s[0] = 1;
int y = mypure("Hello");
return x + y;
}
both calls will be emitted (because the assignment to s[0] may change the return value of mypure).
Is it simply the possibility that the function will be called fewer times
than you would want it to be, or is it possible to create
undefined behaviour or other kinds of serious problems?
Well, it can cause UB indirectly. E.g. here
extern __attribute__((pure)) int get_index();
char a[1];
int i;
void foo() {
i = get_index();      // Returns -1
a[get_index()] = 'x'; // Returns 0
}
The compiler will most likely drop the second call to get_index() and reuse the first returned value, -1, which will result in a buffer overflow (well, technically an underflow).
But what can actually happen if you attach __attribute__((__const__))
to a function that does access global memory?
Let's again take the above example with
int call_pure() {
int x = mypure("Hello");
s[0] = 1;
int y = mypure("Hello");
return x + y;
}
If mypure were annotated with __attribute__((const)), the compiler would again drop the second call and optimize the return to 2*mypure(...). If mypure actually reads s, this produces a wrong result.
EDIT
I know you asked to avoid hand-waving, but here's some general explanation. By default, a function call blocks a lot of optimizations inside the compiler, as it has to be treated as a black box which may have arbitrary side effects (modify any global variable, etc.). Annotating a function with const or pure instead allows the compiler to treat it more like an expression, which enables more aggressive optimization.
Examples are really too numerous to list. The one I gave above is common subexpression elimination, but we could just as easily demonstrate benefits for loop invariants, dead-code elimination, alias analysis, etc.
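For instance, here is a hedged sketch of the loop-invariant case; limit is a made-up function name:
extern __attribute__((pure)) int limit(void);

int sum(const int *a) {
    int s = 0;
    // Because limit() is declared pure, the compiler may evaluate it
    // once before the loop instead of on every iteration.
    for (int i = 0; i < limit(); i++)
        s += a[i];
    return s;
}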
I ran into this yesterday; I will try to give clear and simple examples which fail for me with MSVC12 (VS2013, 120) and MSVC14 (VS2015, 140). Everything is implicitly /arch:SSE+ with x64.
I will reduce the issue to a simple matrix transpose example using the _MM_TRANSPOSE4_PS macro for illustration purposes. It is implemented in terms of shuffles, rather than moving low/high 8-byte blocks around.
float4x4 Transpose(const float4x4& m) {
matrix4x4 n = LoadMatrix(m);
_MM_TRANSPOSE4_PS(n.row[0], n.row[1], n.row[2], n.row[3]);
return StoreMatrix(n);
}
The matrix4x4 type is merely a POD struct containing four __m128 members; everything is tidily aligned on a 16-byte boundary, even though that is somewhat implicit:
__declspec(align(16)) struct matrix4x4 {
__m128 row[4];
};
All of this fails on /O1, /O2 and /Ox:
// Doesn't work.
float4x4 resultsPlx = Transpose( GiveMeATemporary() );
// Changing Transpose to take float4x4, or copy a temporary
float4x4 Transpose(float4x4 m) { ... }
// Trying again, doesn't work.
float4x4 resultsPlx = Transpose( GiveMeATemporary() );
Curiously enough, this works:
// A constant reference to an rvalue, a temporary
const float4x4& temporary = GiveMeATemporary();
float4x4 resultsPlx = Transpose(temporary);
Same goes for pointer-based transfers, which is logical as the underlying mechanisms are the same. The relevant part of the C++11 specification is §12.2/5:
The second context is when a reference is bound to a temporary. The
temporary to which the reference is bound or the temporary that is the
complete object to a subobject of which the temporary is bound
persists for the lifetime of the reference except as specified below.
A temporary bound to a reference member in a constructor’s
ctor-initializer (§12.6.2 [class.base.init]) persists until the
constructor exits. A temporary bound to a reference parameter in a
function call (§5.2.2 [expr.call]) persists until the completion of
the full expression containing the call.
This implies it should survive until the calling environment goes out of scope, which is far after the function returns. So, what gives? In all other cases, the variables get "optimized away", with the following exception:
Access violation reading location 0xFFFFFFFFFFFFFFFF
While the obvious solution is to prevent the user from passing temporaries directly, by using pointer-based transfers like some other libraries do, I had hoped to make the interface a little more elegant, without &s clogging the view.
You can add (non-virtual) member functions to a struct without really affecting the layout. So add a destructor that prints "I'm here %p" when the structure is destroyed, and print "I'm there" in your function. (Include the this address so you can make sense of other temporary copies being used.)
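A minimal sketch of that tracing aid, assuming the matrix4x4 from the question (with the caveat that a user-declared destructor technically makes the type non-POD, which is worth keeping in mind when interpreting the results):
#include <cstdio>
#include <xmmintrin.h>

__declspec(align(16)) struct matrix4x4 {
    __m128 row[4];
    // Logs where each instance dies; pair it with a
    // printf("I'm there %p\n", &m) inside Transpose().
    ~matrix4x4() { std::printf("I'm here %p\n", (void*)this); }
};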
Then you can observe the lifetimes in the optimized code and see whether that is really your problem. I am suspicious of whether a bad lifetime actually explains anything here, because the place where the temporary lived is still a valid address in your stack frame.
Furthermore, changing the bits where a float is supposed to live might at worst give you a not-a-number or some other special value, and vector processing does not throw or fault in that case; it just produces a flag value as the result for that bad element. There is no pointer involved, so why is it dereferencing -1?
I think the misfire is caused by something more interesting.
Run it in the debugger and see what instruction causes that.
I'm trying to understand how passing parameters is implemented in shader languages.
I've read several articles and documentation, but I still have some doubts. In particular, I'm trying to understand the differences from a C++ function call, with a particular emphasis on performance.
There are slight differences between HLSL, Cg and GLSL, but I guess the underlying implementation is quite similar.
What I've understood so far:
Unless otherwise specified, a function parameter is always passed by value (is this true even for matrices?)
Passing by value in this context doesn't have the same implications as in C++. No recursion is supported, so the stack isn't used; most functions are inlined and arguments are put directly into registers.
Functions are often inlined by default (HLSL), or at least the inline keyword is always respected by the compiler (Cg).
Are the considerations above right?
Now 2 specific question:
Passing a matrix as function parameter
inline float4 DoSomething(in Mat4x4 mat, in float3 vec)
{
...
}
Considering the function above: in C++ that would be nasty, and it would definitely be better to use a reference: const Mat4x4&.
What about shaders? Is this a bad approach? I read that, for example, the inout qualifier could be used to pass a matrix by reference, but it also implies that the matrix may be modified by the called function.
Does the number (and type) of arguments have any implications? For example, is it better to use functions with a limited set of arguments, or to avoid passing matrices?
Is the inout modifier a valid way to improve performance here? If so, does anyone know how a typical compiler implements this?
Are there any differences between HLSL and GLSL on this?
Does anyone have hints on this?
According to the spec, values are always copied. For in parameters, they are copied at call time; for out parameters, at return time; and for inout parameters, at both call and return time.
In the language of the spec (GLSL 4.50, section 6.1.1 "Function Calling Conventions"):
All arguments are evaluated at call time, exactly once, in order, from left to right. Evaluation of an in parameter results in a value that is copied to the formal parameter. Evaluation of an out parameter results in an l-value that is used to copy out a value when the function returns. Evaluation of an inout parameter results in both a value and an l-value; the value is copied to the formal parameter at call time and the lvalue is used to copy out a value when the function returns.
An implementation is of course free to optimize anything it wants, as long as the result is the same as it would be with the documented behavior. But I don't think you can expect it to work in any specific way.
For example, it wouldn't be safe to pass all inout parameters by reference. Say you had this code:
vec4 Foo(inout mat4 mat1, inout mat4 mat2) {
mat1 = mat4(0.0);
mat2 = mat4(1.0);
return mat1 * vec4(1.0);
}
mat4 myMat;
vec4 res = Foo(myMat, myMat);
The correct result for this is a vector containing all 0.0 components.
If the arguments were passed by reference, mat1 and mat2 inside Foo() would alias the same matrix. This means that the assignment to mat2 also changes the value of mat1, and the result is a vector with all 1.0 components. Which would be wrong.
This is of course a very artificial example, but the optimization has to be selective to work correctly in all cases.
Your first bullet point does not work when you consider arguments qualified using inout.
The real issue is what you do with the parameter inside the function: if you modify a parameter qualified with in, then it cannot be "passed by reference" and a copy will have to be made. On modern hardware this probably is not a big deal, but Shader Model 2.0 was pretty limited in its number of temporary registers, and I ran into these kinds of issues more than once when GLSL and Cg first came out.
For reference, consider the following GLSL code:
vec4 DoSomething (mat4 mat, vec3 vec)
{
// Pretty straightforward, no temporary registers are required to pass arguments.
return vec4 (mat [0] + vec4 (vec, 0.0));
}
vec4 DoSomethingCopy (mat4 mat, vec3 vec)
{
mat [0][0] = 0.0; // This requires the compiler to make a local copy of mat
return vec4 (mat [0] + vec4 (vec, 0.0));
}
vec4 DoSomethingInOut (inout mat4 mat, in vec3 vec)
{
mat [0][0] = 0.0; // No copy required, but the original mat is modified
return vec4 (mat [0] + vec4 (vec, 0.0));
}
I cannot really comment on performance; my only bad experiences had to do with hitting actual hardware limits on older GPUs. Of course, you should assume that any time something has to be copied, it is going to negatively impact performance.
All shader functions are inlined (recursive functions are forbidden). The concept of a reference or pointer does not apply here either. The only case where some code will be generated is when you write to an input parameter. However, if the original registers aren't used any more, the compiler will probably reuse the same registers, and the copy (a mov operation) won't be needed.
Bottom line: function invocation is free.
Is there some kind of subtle difference between those:
void a1(float &b) {
b=1;
};
a1(b);
and
void a1(float *b) {
(*b)=1;
};
a1(&b);
?
They both do the same thing (or so it seems from main()), but the first one is obviously shorter; however, most of the code I see uses the second notation. Is there a difference? Maybe in the case where it's some object instead of a float?
Both do the same, but one uses references and one uses pointers.
See my answer here for a comprehensive list of all the differences.
Yes. The * notation says that what's being passed on the stack is a pointer, i.e., the address of something. The & says it's a reference. The effect is similar but not identical:
Let's take two cases:
void examP(int* ip);
void examR(int& i);
int i;
If I call examP, I write
examP(&i);
which takes the address of the item and passes it on the stack. If I call examR,
examR(i);
I don't need the explicit address-of; the compiler "somehow" passes a reference, which in practice means it takes and passes the address of i. On the code side, then:
void examP(int* ip){
*ip += 1;
}
I have to make sure to dereference the pointer. ip += 1 does something very different.
void examR(int& i){
i += 1;
}
always updates the value of i.
For more to think about, read up on "call by reference" versus "call by value". The & notation gives C++ call by reference.
In the first example with references, you know that b can't be NULL. With the pointer example, b might be the NULL pointer.
However, note that it is possible to pass a NULL object through a reference, but it's awkward and the called procedure can assume it's an error to have done so:
a1(*(float *)NULL);
In the second example the caller has to prefix the variable name with '&' to pass the address of the variable.
This may be an advantage - the caller cannot inadvertently modify a variable by passing it as a reference when they thought they were passing by value.
Aside from syntactic sugar, the only real difference is the ability for a function parameter that is a pointer to be null. So the pointer version can be more expressive if it handles the null case properly. The null case can also have some special meaning attached to it. The reference version can only operate on values of the type specified without a null capability.
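A small sketch of the null case (the function name is made up):
#include <cstdio>

// A pointer parameter can treat null as "the caller opted out".
void storeResult(float* out) {
    if (out == nullptr) {
        std::printf("no output requested\n");
        return;
    }
    *out = 1.0f;
}

int main() {
    float x = 0.0f;
    storeResult(&x);      // writes 1.0f into x
    storeResult(nullptr); // the null carries its own meaning
    return 0;
}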
Functionally in your example, both versions do the same.
The first has the advantage that it's transparent on the call-side. Imagine how it would look for an operator:
cin >> &x;
And how it looks ugly for a swap invocation
swap(&a, &b);
You want to swap a and b, and swap(a, b) looks much better than having to take the addresses first. Incidentally, Bjarne Stroustrup writes that the major reason for adding references was the transparency they give at the call side, especially for operators. Also see how it's no longer obvious whether the following
&a + 10
would add 10 to the contents of a, calling its operator+, or whether it adds 10 to a temporary pointer to a. Add to that the fact that you cannot overload operators when all operands are built-in types (like a pointer and an integer). References make this crystal clear.
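As a sketch of how references keep operator syntax natural (Point is a made-up type):
#include <iostream>

struct Point { int x, y; };

// Both the stream and the point are taken by reference, so the call
// site reads `in >> p` rather than something like operator>>(&in, &p).
std::istream& operator>>(std::istream& in, Point& p) {
    return in >> p.x >> p.y;
}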
Pointers are useful if you want to be able to pass a "null":
a1(0);
Then in a1 the method can compare the pointer with 0 and see whether the pointer points to any object.
One big difference worth noting is what's going on at the call site; you either have:
a1(something);
or:
a1(&something);
I like to pass arguments by reference (always a const one :) ) when they are not modified in the function/method (and then you can also pass automatic/temporary objects in), and by pointer to alert the user/reader of the calling code that the argument may be, and probably is, intentionally modified inside.
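A small sketch of that convention (Config and the function names are made up):
struct Config { int level; };

// Read-only access: const reference, so temporaries are accepted too.
int readLevel(const Config& cfg) { return cfg.level; }

// Intentional modification: pointer, so the call site shows a visible &.
void bumpLevel(Config* cfg) { if (cfg) cfg->level += 1; }

int main() {
    Config c{1};
    int before = readLevel(c); // clearly cannot modify c
    bumpLevel(&c);             // the & signals that c may change
    return readLevel(c) - before; // 1
}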