My assumption is that the GLSL compiler simply inlines all function calls, making them inexpensive. However, if function calls in GLSL involved stack frames and the like, they could be quite expensive. Does anyone know whether GLSL function calls are expensive at all?
Generally, function calls should be inexpensive even when not inlined, as no such thing as a stack frame exists (there is no recursion in GLSL!). A function call therefore shouldn't be a prohibitive overhead on any architecture (maybe 1-2 cycles).
However, function calls often happen in the context of a conditional branch, such as if(foo) bar(); else baz();, and branches are very expensive on GPUs when they diverge within a workgroup (that is, when not all threads take exactly the same path).
If only a single thread takes, or could take, a different path within a workgroup, the GPU must either execute both paths followed by a conditional move (the usual case on previous-generation hardware), or insert an implicit sync point (on the newest-generation hardware). In the latter case, each thread evaluates only the path it actually takes (which arguably saves some power), but all threads effectively run in lockstep, and the short path takes exactly as long as the long path. Put differently, all pixels (or vertices, or work items) in a workgroup are processed only as fast as the slowest one in the group.
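To make the divergence cost concrete, here is a minimal sketch in CUDA (the same lockstep mechanics apply to GLSL workgroups; the kernel and the cheap_path/expensive_path helpers are hypothetical names for illustration):

// Sketch of branch divergence; both helpers are made-up stand-ins.
__device__ float cheap_path(float x)
{
    return x * 2.0f;                   // one multiply
}

__device__ float expensive_path(float x)
{
    for (int k = 0; k < 1000; ++k)     // long dependent chain
        x = x * 0.999f + 0.001f;
    return x;
}

__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // If even one thread in the warp takes the long branch, the whole warp
    // pays for it: the other threads either execute both paths under
    // predication or wait at an implicit reconvergence point.
    if (in[i] > 0.5f)
        out[i] = expensive_path(in[i]);
    else
        out[i] = cheap_path(in[i]);
}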
Function calls can be expensive, depending on your device. I suggest you take a look at the GLSL optimizer by Aras (from Unity):
https://github.com/aras-p/glsl-optimizer
Scenario: You are writing a complex algorithm using SIMD. A handful of constants and/or infrequently changing values are used. Ultimately, the algorithm ends up needing more than the 16 available ymm registers, resulting in spills to the stack (e.g. the generated code contains vaddps ymm0,ymm1,ymmword ptr [...] instead of vaddps ymm0,ymm1,ymm7).
In order to make the algorithm fit into the available registers, the constants can be "inlined". For example:
const auto pi256{ _mm256_set1_ps(PI) };
for (/* outer condition */)
{
    ...
    const auto radius_squared{ _mm256_mul_ps(radius, radius) };
    ...
    for (/* inner condition */)
    {
        ...
        const auto area{ _mm256_mul_ps(radius_squared, pi256) };
        ...
    }
}
... becomes ...
for (/* outer condition */)
{
    ...
    for (/* inner condition */)
    {
        ...
        const auto area{ _mm256_mul_ps(_mm256_mul_ps(radius, radius), _mm256_set1_ps(PI)) };
        ...
    }
}
Whether the disposable variable in question is a constant or is calculated infrequently (in the outer loop), how can one determine which approach achieves the best throughput? Is it governed by some simple rule like "a memory operand adds 2 cycles of latency"? Or is it nondeterministic, differing case by case, so that it can only be fully optimized through trial and error plus profiling?
A good optimizing compiler should generate the same machine code for both versions. Just define your vector constants as locals, or use them anonymously for maximum readability; let the compiler worry about register allocation and pick the least expensive way to deal with running out of registers if that happens.
Your best bet for helping the compiler is to use fewer distinct constants if possible. E.g. instead of _mm_and_si128 with both set1_epi16(0x00FF) and set1_epi16(0xFF00), use _mm_andnot_si128 to mask the other way with the same constant. You usually can't do anything to influence which values it chooses to keep in registers, but fortunately compilers are pretty good at this because it's also essential for scalar code.
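For instance, a minimal sketch of that trick (the helper names low_bytes/high_bytes are made up):

#include <emmintrin.h>

// Reuse one mask constant both ways: pand keeps the low byte of each 16-bit
// lane, pandn (~mask & v) keeps the high byte, so the compiler only has to
// materialize a single vector constant.
__m128i low_bytes(__m128i v)  { return _mm_and_si128(v, _mm_set1_epi16(0x00FF)); }
__m128i high_bytes(__m128i v) { return _mm_andnot_si128(_mm_set1_epi16(0x00FF), v); }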
A compiler will hoist constants out of the loop (even when inlining a helper function that contains constants), or, if a constant is only used on one side of a branch, sink its setup into that side of the branch.
The source code computes exactly the same thing with no difference in visible side-effects, so the as-if rule allows the compiler the freedom to do this.
I think compilers normally do register allocation and choose what to spill/reload (or just use a read-only vector constant) after doing CSE (common subexpression elimination) and identifying loop invariants and constants that can be hoisted.
When it finds it doesn't have enough registers to keep all variables and constants in regs inside the loop, the first choice for something to not keep in a register would normally be a loop-invariant vector, either a compile-time constant or something computed before the loop.
An extra load that hits in L1d cache is cheaper than storing (aka spilling) / reloading a variable inside the loop. Thus, compilers will choose to load constants from memory regardless of where you put the definition in the source code.
Part of the point of writing in C++ is that you have a compiler to make this decision for you. Since it's allowed to do the same thing for both sources, doing different things would be a missed optimization for at least one of the cases. (The best choice in any particular case depends on surrounding code, but normally using vector constants as memory source operands is fine when the compiler runs low on registers.)
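As a quick check, the two styles below typically compile to the same inner loop, with the compiler deciding whether the constant lives in a register or stays a ymmword ptr memory operand (a sketch assuming AVX; the function names are made up, compare the asm at -O2):

#include <immintrin.h>

// Both versions compute area[i] = pi * r[i]^2, 8 floats at a time. With
// optimization enabled, the compiler hoists the set1 out of the loop in the
// second version too, then decides for itself where the constant lives.
void areas_hoisted(const float *r, float *area, int n)
{
    const __m256 pi = _mm256_set1_ps(3.14159265f);
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 rad = _mm256_loadu_ps(r + i);
        _mm256_storeu_ps(area + i, _mm256_mul_ps(_mm256_mul_ps(rad, rad), pi));
    }
}

void areas_inline(const float *r, float *area, int n)
{
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 rad = _mm256_loadu_ps(r + i);
        _mm256_storeu_ps(area + i,
            _mm256_mul_ps(_mm256_mul_ps(rad, rad), _mm256_set1_ps(3.14159265f)));
    }
}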
Is it governed by some simple rule like "a memory operand adds 2 cycles of latency"?
Micro-fusion of a memory source operand doesn't lengthen the critical path from the non-constant input to the output. The load uop can start as soon as the address is ready, and for vector constants it's usually either a RIP-relative or [rsp+constant] addressing mode. So usually the load is ready to execute as soon as it's issued into the out-of-order part of the core. Assuming an L1d cache hit (since it will stay hot in cache if loaded every loop iteration), this is only ~5 cycles, so it will easily be ready in time if there's a dependency-chain bottleneck on the vector register input.
It doesn't even hurt front-end throughput, thanks to micro-fusion. Unless you're bottlenecked on load-port throughput (2 loads per clock on modern x86 CPUs), it typically makes no measurable difference, even with highly accurate measurement techniques.
I got a profiling result stating that the overhead of calling a function is very large.
It is currently the bottleneck of my program.
The function is in a template class:
template<class U> class CustomArray {
    ....
public:
    U& operator[](int n) {           // <-- 2.8%
        ... some cheap assertion ... // <-- 0.2%
        return database()[n];        // <-- 0.3% (just adds an offset to allocated memory)
    }                                // <-- 2.7%
};
(The snippet was edited a little to protect me from my boss.)
Question
Is this possible? Is the profiler wrong?
If the overhead is real, how can I optimize it?
I have tried the inline keyword (no difference). This function should already be inlined, shouldn't it?
I am using Visual Studio 2015's profiler (on an optimized /O2 build).
The result is very inconsistent with How much overhead is there in calling a function in C++?.
Edit: I confirm that Profiling Collection = Sampling (not Instrumentation).
Let's assume you are using the default sampling method of profiling in Visual Studio.
Such profilers usually work at the assembly level, for example by sampling the current instruction pointer periodically. They then use debug data to try to map those samples back to source lines. For heavily optimized and inlined code, this mapping isn't always reliable (indeed, some instructions may not originate from any single line, or may effectively be shared among several).
In addition to making profiling tricky, this also means claims like "a function call has 10x the overhead of a normal statement" aren't really meaningful: there is no "typical" function call, and there certainly is no typical "normal statement". Function calls can range from totally free (when inlined or even eliminated) to somewhat expensive (mispredicted virtual calls1), and statements span an even greater range, from free to almost unlimited in cost (a common example being a cache miss costing hundreds of cycles).
On top of that, sampling methods often have inherent error or skew. For example, an expensive instruction may tend to spread its samples out among subsequent instructions rather than being assigned all the samples itself. This leads to additional error at the instruction level.
All this adds up to mean that while sampling results may be quite accurate for broad-stroke profiling (i.e., identifying features on the order of hundreds of cycles), you shouldn't read too much into very fine-grained results such as your one-line function above.
If you do want to read into those results, the first step is to check whether the profiler offers an assembly-level view, and to use that view, since it removes the assembly-to-source mapping issue entirely.
1 Is there anything worse that could reasonably be considered a "function call" in C++?
I'm coding a physics simulation and I'm now feeling the need to optimize it. I'm thinking about improving one point: one of the methods of one of my classes (which I call a billion times in several cases) defines a probability distribution every time it is called. Here is the code:
void myClass::myMethod() { // called billions of times in several cases
    std::uniform_real_distribution<> probd(0, 1);
    std::uniform_int_distribution<> probh(1, h - 2);
    std::uniform_int_distribution<> probv(1, v - 2);
    // rest of the code
}
Could I make the distributions members of the class so that I won't have to define them every time, initializing them in the constructor and redefining them only when h and v change? Would that be a worthwhile optimization? And one last question: is this something the compiler (g++ in my case) already takes care of when compiling with -O2 or -O3?
Thank you in advance!
Update: I coded it and timed both versions: the program actually turned out a bit slower (by a few percent), so I'm back where I started: creating the probability distributions in each call.
Answer A: I shouldn't think so: for a uniform distribution, construction just copies the parameter values into place, maybe with a small amount of arithmetic, and that will be well optimized.
However, I believe distribution objects can have state. They can use part of the random data from a call to the generator and are permitted to save the rest of the randomness for the next time the distribution is used, in order to reduce the total number of calls to the generator. So when you destroy a distribution object, you might be discarding some possibly costly random data.
Answer B: Stop guessing and test it.
Time your code, then add static to the definition of probd and time it again.
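A minimal sketch of that experiment (thread_local stands in for plain static in case myMethod can run on several threads; h and v are the members from the question):

#include <random>

void myClass::myMethod() { // called billions of times
    // Constructed once per thread instead of once per call; the (0,1)
    // parameters never change, so the object is safe to reuse.
    thread_local std::uniform_real_distribution<> probd(0, 1);
    // probh and probv depend on h and v, so they stay as cheap locals here.
    std::uniform_int_distribution<> probh(1, h - 2);
    std::uniform_int_distribution<> probv(1, v - 2);
    // ... rest of the code ...
}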
Yes
Yes
Well, there may be some advantage, but AFAIK those objects aren't really heavyweight or expensive to construct. Also, with locals you may gain something in data locality and in the assumptions the optimizer can make.
I don't think they are automatically promoted to class members (especially if your class is POD; in that case I doubt the compiler will dare to modify its layout); most probably they are instead optimized away completely, and only the code of the called methods (operator() in particular) remains, referring directly to h and v. But this must be checked by looking at the generated assembly.
Incidentally, if you have a performance problem, besides optimizing the obvious points (non-optimal algorithms in inner loops, continuous memory allocations, useless copies of big objects, ...), you should really use a profiler to find the real "hot spots" in your code and concentrate on optimizing them, instead of going randomly through all the code.
uniform_real_distribution maintains state of type param_type, which holds two double values (with the default template parameters). The constructor assigns to these and is otherwise trivial; the destructor is trivial.
Therefore, constructing a temporary within your function has the overhead of storing two double values, compared to initializing one pointer (or reference) or going through an indirection via this. In theory the member version might therefore be faster (though what looks like it should be faster isn't necessarily faster in practice). Since it's not much work, it's certainly worth trying and timing, even if it is a micro-optimization.
Some 3-4 extra cycles are normally negligible, but since you're saying "billions of times", it may very well make a measurable difference. 3 cycles times one billion is 1 second on a 3 GHz machine.
Of course, optimization without profiling is always somewhat... awkward. You might very well find that a different part of your code that's called billions of times saves a lot more cycles.
EDIT:
Since you're not going to modify it, and since the first distribution is initialized with literal values, you might actually hoist it out entirely (e.g. as a static object at namespace scope; note that operator() is non-const, so the object itself can't be declared const). That should, regardless of the other two, allow the compiler to generate the most efficient code for that one in any case.
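A sketch of that hoisting (the names gen, probd01, and next_prob are made up; thread_local guards against concurrent use):

#include <random>

namespace {
    thread_local std::mt19937 gen(std::random_device{}());
    // Hoisted once (per thread); operator() is non-const, so the
    // distribution object itself cannot be const.
    thread_local std::uniform_real_distribution<> probd01(0.0, 1.0);
}

double next_prob() { return probd01(gen); }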
Assuming that we have lots of threads that will access global memory sequentially, which option performs faster overall? I'm in doubt because __threadfence() takes into account all shared and global memory writes, but the writes are coalesced. On the other hand, atomicExch() takes into account just the important memory addresses, but I don't know whether its writes are coalesced or not.
In code:
array[threadIdx.x] = value;
Or
atomicExch(&array[threadIdx.x], value);
Thanks.
On Kepler GPUs, I would bet on atomicExch since atomics are very fast on Kepler. On Fermi, it may be a wash, but given that you have no collisions, atomicExch could still perform well.
Please make an experiment and report the results.
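A minimal sketch of such an experiment (kernel names and sizes are illustrative; timing uses cudaEvent):

#include <cstdio>

__global__ void plain_store(int *array, int value)
{
    array[blockIdx.x * blockDim.x + threadIdx.x] = value;             // coalesced store
}

__global__ void atomic_store(int *array, int value)
{
    atomicExch(&array[blockIdx.x * blockDim.x + threadIdx.x], value); // one address per thread, no collisions
}

int main()
{
    const int n = 1 << 24, block = 256;
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    plain_store<<<n / block, block>>>(d, 1);   // warm-up
    cudaEventRecord(t0);
    plain_store<<<n / block, block>>>(d, 1);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_plain;  cudaEventElapsedTime(&ms_plain, t0, t1);

    atomic_store<<<n / block, block>>>(d, 2);  // warm-up
    cudaEventRecord(t0);
    atomic_store<<<n / block, block>>>(d, 2);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_atomic; cudaEventElapsedTime(&ms_atomic, t0, t1);

    printf("plain store: %.3f ms, atomicExch: %.3f ms\n", ms_plain, ms_atomic);
    cudaFree(d);
    return 0;
}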
Those two do very different things.
atomicExch ensures that no two threads try to modify a given cell at the same time. If such a conflict occurs, one or more threads may be stalled. If you know beforehand that no two threads access the same cell, there is no point in using any atomic...() function.
__threadfence() delays the current thread (and only the current thread!) as needed to ensure that writes it made before the fence become visible to other threads no later than writes it makes after the fence.
As such, __threadfence() on its own, without any follow-up code, is not very interesting.
For that reason, I don't think there is any point in comparing the efficiency of those two. If you could show a more concrete use case, I could say more...
Note that neither of those actually gives you any guarantees on the actual order of execution of the threads.
In C++, what is a good heuristic for estimating the compute-time benefit of inlining a function, particularly when the function is called very frequently and accounts for >= 10% of the program's execution time (e.g. the evaluation function of a brute-force or stochastic optimization process)? Even though inlining may ultimately be beyond my control, I am still curious.
There is no general answer. It depends on the hardware, the number and type of its arguments, and what is done in the function. And how often it is called, and where. On a Sparc, for example, arguments (and the return value) are passed in registers, and each function gets 16 new registers: if the function is complex enough, those new registers may avoid spilling that would occur if the function were inlined, and the non-inline version may end up faster than the inlined one. On an Intel, which is register poor and passes arguments in registers, just the opposite might be true, for the same function in the same program. More generally, inlining may increase program size, reducing locality. Or for very simple functions, it may reduce program size; but that again depends on the architecture. The only possible way to know is to try both, measuring the time. And even then you'll only know for that particular program, on that particular hardware.
A function call and return take as few as one instruction each on some architectures (although they're generally not RISC-like single-cycle instructions). In general, you can compare that against the number of instructions in the body of the function. A simple property access might be only a single instruction, so putting it into a non-inlined function triples the number of instructions needed to execute it: obviously a great candidate for inlining. On the other hand, a function that formats a string for printing might represent hundreds of instructions, so two more aren't going to make any difference at all.
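For instance, a sketch of the two extremes (the Particle class and describe function are made up):

#include <cstdio>
#include <string>

struct Particle {
    double mass_;
    // The body is ~1 instruction, so call/return overhead would roughly
    // triple the cost: a textbook inlining candidate.
    double mass() const { return mass_; }
};

// Hundreds of instructions of formatting work: two extra instructions of
// call/return overhead are lost in the noise, so inlining buys nothing.
std::string describe(const Particle &p) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "Particle(mass=%g)", p.mass());
    return std::string(buf);
}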
If your bottleneck is in a recursive function, and assuming the recursion is not shallow (i.e. the average recursion depth is not just a few levels), you are better off working on the algorithm in the function than on inlining.
Try, if possible, to transform the recursion into a loop or into tail recursion (which the compiler can implicitly transform into a loop), or try to determine where in the function the cost is being spent. Try to minimize the impact of the internal operations (maybe you are dynamically allocating memory that could have automatic storage duration, or maybe a common operation could be factored out, performed once outside the function in a wrapper, and passed in as an extra argument, ...).
*EDIT after the comment that recursion was not intended, but rather iteration*
If the compiler has access to the definition of the function, it will make the right decision for you in most cases. If it does not, just move the code around so that it does see it. Maybe make the function static to provide an extra hint that it won't be used anywhere else, or even mark it inline (knowing that this will not force inlining), but avoid special attributes that force inlining: the compiler probably does this better than any simple heuristic produced without looking at the code.
All inlining saves you is the entry/exit cost of the function, so it's only worth considering if the function does almost nothing.
Certainly if the function itself contains a function call, it's probably not worth considering.
Even if the function does very little, it has to be called so often that it owns the program counter for a significant percentage of the time before any speedup of the function would be noticeable.
The behaviour here is somewhat compiler-dependent. With a recursive function, inlining could in theory go on forever. The inline keyword is only a hint to the compiler, which it can choose to ignore if it can't do anything useful with it. Some compilers will inline a recursive function to a certain depth.
As for "how much will this speed things up": unfortunately we can't give any general answer, because it depends on how much work the function does versus the overhead of the call mechanism itself. Why don't you set up a test and see?
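A minimal sketch of such a test (work_noinline/work_inline are stand-ins for your own function; __attribute__((noinline)) is GCC/Clang syntax, MSVC uses __declspec(noinline) instead):

#include <chrono>
#include <cstdio>

__attribute__((noinline)) int work_noinline(int x) { return x * 2 + 1; }
inline                    int work_inline(int x)   { return x * 2 + 1; }

int main()
{
    using clock = std::chrono::steady_clock;
    volatile int sink = 0; // volatile keeps the loops from being optimized out

    auto t0 = clock::now();
    for (int i = 0; i < 100000000; ++i) sink = work_noinline(sink);
    auto t1 = clock::now();
    for (int i = 0; i < 100000000; ++i) sink = work_inline(sink);
    auto t2 = clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("noinline: %lld ms, inline: %lld ms\n",
                (long long)ms(t0, t1), (long long)ms(t1, t2));
    return 0;
}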
Our experience, from 20+ years of writing computationally intensive C++, is that inlining is no silver bullet. You really do need to profile your code to see whether inlining will increase performance. For us, except for low-level 2D and 3D point and vector manipulations, inlining is a waste of time. You are far better off working out a better algorithm than trying to micromanage clock ticks.