Performance of passing arguments by value - C++

In the process of refactoring some code, I want to change a function like this:
bool A::function() {
    return this->a == this->b || this->c == this->d || this->e == this->f || this->g == this->h;
}
to something like this:
bool A::function(int a, int b, int c, int d, int e, int g) {
    return a == b || c == d || e == this->f || g == this->h;
}
This function is supposed to be called on each iteration of a main loop, which would have at most 10M elements.
The people I'm working with are reluctant to use the second version because of the performance cost of passing 6 ints.
I'm pretty sure that this is negligible, considering that each iteration of the loop goes through a LOT of code, and it takes roughly ~1 minute to process the 10M elements.
Is the cost of passing 6 ints by value each time really so high? If not, how can I make them change their mind?
Edit:
About inlining: I told them that the penalty would be zero if the function was inlined, but their answer was basically "we can't know for sure if it will be inlined", which I seem to recall is true (it's up to the compiler).

I suspect that you won't see any big difference between these two variants in reasonably optimised code. However, the proof of that would be to actually change the code and compare the timings. (All the more so because, if 10M entries are being processed in a minute, that's 6 microseconds per item, so around 30,000-200,000 instructions on a modern processor - adding 6 argument passes won't budge it one way or the other, I'd say - unless this function is called many times in the loop, of course.)
And yes, if the function is inlined, the result would be identical code for the two alternatives - but, as your colleagues say, you can't know for sure whether it is inlined or not: the only way to really determine that is to look at the generated machine code (compile with -S, or use objdump or similar).
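To back up the "measure it" advice, here is a minimal timing sketch along those lines. The member names come from the question, but the struct layout, the values, and the loop count are my assumptions - a real comparison should be run on the actual code with optimizations enabled:

#include <chrono>
#include <cstdio>

// A sketch, not a definitive benchmark: member names mirror the question.
struct A {
    int a, b, c, d, e, f, g, h;
    bool viaMembers() const {
        return a == b || c == d || e == f || g == h;
    }
    bool viaArgs(int a_, int b_, int c_, int d_, int e_, int g_) const {
        return a_ == b_ || c_ == d_ || e_ == f || g_ == h;
    }
};

int main() {
    A obj{1, 2, 3, 4, 5, 6, 7, 8};
    const int N = 10000000;      // ~10M iterations, as in the question
    volatile bool sink = false;  // keeps the calls from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) sink = obj.viaMembers();
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) sink = obj.viaArgs(i, 2, 3, 4, 5, 7);
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("members: %lld us, args: %lld us\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<us>(t2 - t1).count());
    return 0;
}

Note that the compiler may still inline and simplify both calls, so inspect the generated code (as suggested above) before trusting any numbers.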

In terms of performance, I would suggest you profile your code, and see if there is a difference that matters. Passing ints around is usually very cheap and open to automatic optimization, so I doubt you would see a measurable performance hit.
It's also worth pointing out that the two functions are different. The second doesn't necessarily use the member variables, while the first does. If you're always comparing member variables, why pass them as parameters? Extra unnecessary parameters mean more source code and a greater scope for bugs.

Write the code and, as Shane says, profile it - though I prefer to grab a few stack samples, because then you can see exactly what's going on.
If you find the program counter in the instructions that pass those int arguments, on more than one sample, then they are costing a significant fraction of time, and you should do something about it.
On the other hand, the samples might tell you something else is the main time-taker, and maybe you should fix that first.
Then the program will be faster, and if you do the whole process again, it may well come back to your original question.

Related

Performance of operator '==' with boolean variable?

I guess every programmer has come across this situation where we can use the comparison operator '=='. In my case the situation is like this (a C++ program):
Code 1 (this is used in all files except the constructor):
if (a == 10)
{
    // do something
}
But I can do the same as above in the following way: I set a bool variable to true when variable a becomes 10, in the constructor itself, i.e.
constructor_name()
{
    bool variable_name = true; // when a == 10
}
Then I use the following code in all my files instead of code 1.
Code 2:
if (variable_name)
{
    // do the same as in code 1
}
Which is better for performance, code 1 or code 2? I hope I have illustrated my situation so that you can understand. Please help me. Thanks in advance.
You shouldn't micro-optimize. You will hardly notice any performance difference between your two versions (maybe you will save one CPU cycle), and it is not worth the time and effort, especially because CPUs nowadays are really fast.
Only optimize if you profile and find a bottleneck in your code.
Look at it this way: if you store the boolean variable in the class, it uses memory (1 byte) to maybe save 1 CPU cycle. Depending on how often you create the class, that can scale up (even though the amount would still be ridiculously small). You maybe saved 1 cycle, but you lost 1 byte.
If you wrote this in production code, I am sure that others would find it confusing (I would), and wonder why you put an isTen boolean in the class instead of just comparing the value using operator==.
Also, there may be a bug: if you change a to 10 outside of the constructor, isTen would still be false, but a is 10!
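To make that bug concrete, here is a minimal sketch (the class, isTen, and setA are hypothetical names, not the asker's code):

#include <iostream>

struct Foo {
    int a;
    bool isTen;                        // cached result of (a == 10)
    Foo(int v) : a(v), isTen(v == 10) {}
    void setA(int v) { a = v; }        // forgets to refresh isTen
};

int main() {
    Foo foo(5);
    foo.setA(10);                      // a is now 10...
    std::cout << foo.isTen << '\n';    // ...but this prints 0: the flag is stale
}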
I found that the following would make a difference: consider that variable a is an integer and takes 4 bytes (assuming 4 bytes for an int); the compiler then has to compare 4 bytes of memory, whereas a bool variable takes 1 byte. I guess this makes a difference in performance.

Code reordering due to optimization

I've heard so many times that an optimizer may reorder your code that I'm starting to believe it.
Are there any examples or typical cases where this might happen, and how can I avoid such a thing (e.g., I want a benchmark to be impervious to this)?
There are LOTS of different kinds of "code-motion" (moving code around), and it's caused by lots of different parts of the optimisation process:
Move these instructions around, because it's a waste of time to wait for a memory read to complete without putting at least one or two instructions between the read and the operation that uses the value we got from memory.
Move things out of loops, because they only need to happen once (if you call x = sin(y) once or 1000 times without changing y, x will have the same value, so there is no point in doing that inside a loop; the compiler moves it out).
Move code around based on "the compiler expects this code to be hit more often than the other bit, so we get a better cache-hit ratio this way" - for example, error handling being moved away from the source of the error, because it's unlikely that you get an error [compilers often understand commonly used functions and that they typically succeed].
Inlining - code is moved from the actual function into the calling function. This often leads to OTHER effects, such as a reduction in pushing/popping registers on the stack, and arguments can be kept where they are rather than having to be moved to the "right place for arguments".
I'm sure I've missed some cases in the above list, but this is certainly some of the most common.
The compiler is perfectly within its rights to do this, as long as it doesn't cause any "observable difference" (other than the time it takes to run and the number of instructions used - those "don't count" as observable differences when it comes to compilers).
There is very little you can do to prevent the compiler from reordering your code - but you can write code that ensures the order to some degree. For example, we can have code like this:
{
    int sum = 0;
    for (int i = 0; i < large_number; i++)
        sum += i;
}
Now, since sum isn't being used, the compiler can remove it. Adding some code that prints the sum would ensure that it's "used" as far as the compiler is concerned.
Likewise:
for (int i = 0; i < large_number; i++)
{
    do_stuff();
}
If the compiler can figure out that do_stuff doesn't actually change any global value, or similar, it will move the code around to form this:
do_stuff();
for (int i = 0; i < large_number; i++)
{
}
The compiler may also remove - in fact, almost certainly will remove - the now-empty loop so that it doesn't exist at all. [As mentioned in the comments: if do_stuff doesn't actually change anything outside itself, it may also be removed, but the example I had in mind is one where do_stuff produces a result, but the result is the same each time.]
(The above happens if you remove the printout of results in the Dhrystone benchmark, for example, since some of the loops calculate values that are never used except in the printout - this can lead to benchmark results that exceed the highest theoretical throughput of the processor by a factor of 10 or so, because the benchmark assumes the instructions necessary for the loop were actually there, and credits X nominal operations to each iteration.)
There is no easy way to ensure this doesn't happen, aside from ensuring that do_stuff either updates some variable outside the function, or returns a value that is "used" (e.g. summing up, or something).
Another example of removing/omitting code is where you store values to the same variable multiple times:
int x;
for (int i = 0; i < large_number; i++)
    x = i * i;
can be replaced with:
x = (large_number-1) * (large_number-1);
Sometimes you can use volatile to ensure that something REALLY happens, but in a benchmark that CAN be detrimental, since the compiler then also can't optimise code that it SHOULD optimise (if you are not careful with how you use volatile).
If you have some SPECIFIC code that you care particularly about, it would be best to post it (and compile it with several state-of-the-art compilers to see what they actually do with it).
[Note that moving code around is definitely not a BAD thing in general - I do want my compiler (whether it is the one I'm writing myself, or one written by someone else that I'm using) to optimise by moving code, because, as long as it does so correctly, it will produce faster/better code that way!]
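As a rough illustration of "making the result used" (a sketch with made-up names; note the compiler is still free to simplify the arithmetic, just not to discard the result):

#include <cstdio>

// Same result on every call; without the volatile store below, the whole
// loop could legally be hoisted, folded, or deleted outright.
static int do_stuff(int x) { return x * x; }

volatile int sink;  // stores to a volatile count as observable behaviour

int main(int argc, char **argv) {
    (void)argv;
    int total = 0;
    for (int i = 0; i < 1000000; i++)
        total += do_stuff(argc);  // argc is unknown at compile time
    sink = total;                 // the result is now "used"
    std::printf("%d\n", sink);
    return 0;
}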
Most of the time, reordering is only allowed in situations where the observable effects of the program are the same - this means you shouldn't be able to tell.
Counterexamples do exist: for example, the order of evaluation of operands is unspecified, and an optimizer is free to rearrange things. You can't predict the order of these two function calls, for example:
int a = foo() + bar();
Read up on sequence points to see what guarantees are made.
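A tiny sketch of that unspecified order (both output orderings are conforming):

#include <cstdio>

int foo() { std::puts("foo ran"); return 1; }
int bar() { std::puts("bar ran"); return 2; }

int main() {
    // The compiler may evaluate foo() or bar() first; even C++17 leaves
    // the evaluation order of the two operands of + unspecified.
    int a = foo() + bar();
    std::printf("a = %d\n", a);
    return 0;
}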

Removing code portions does not match the profiler's data

I am doing a little proof of concept profile and optimize type of example. However, I ran into something that I can't quite explain, and I'm hoping that someone here can clear this up.
I wrote a very short snippet of code:
#include <stdio.h>

/* globals shared by main and both callbacks */
int a, b, c;

void callbackWasterOne(void);
void callbackWasterTwo(void);

int main(void)
{
    for (int j = 0; j < 1000; j++)
    {
        a = 1;
        b = 2;
        c = 3;
        for (int i = 0; i < 100000; i++)
        {
            callbackWasterOne();
            callbackWasterTwo();
        }
        printf("Two a: %d, b: %d, c: %d.", a, b, c);
    }
    return 0;
}

void callbackWasterOne(void)
{
    a = b * c;
}

void callbackWasterTwo(void)
{
    b = a * c;
}
All it does is call two very basic functions that just multiply numbers together. Since the code is identical, I expect the profiler (OProfile) to report roughly the same numbers for both.
I ran this code 10 times per profile, and I got the following values for how much time is spent in each function:
main: average = 5.60%, stdev = 0.10%
callbackWasterOne = 43.78%, stdev = 1.04%
callbackWasterTwo = 50.24%, stdev = 0.98%
rest is in miscellaneous things like printf and no-vmlinux
The difference between the times for callbackWasterOne and callbackWasterTwo is significant enough (to me at least), given that they have the same code, that I switched their order in my code and reran the profiler, with the following results:
main: average = 5.45%, stdev = 0.40%
callbackWasterOne = 50.69%, stdev = 0.49%
callbackWasterTwo = 43.54%, stdev = 0.18%
rest is in miscellaneous things like printf and no-vmlinux
So evidently the profiler samples one more than the other based on execution order. Not good. Disregarding this, I decided to see the effects of removing some code, and I got these execution times (averages):
Nothing removed: 0.5295s
call to callbackWasterOne() removed from for loop: 0.2075s
call to callbackWasterTwo() removed from for loop: 0.2042s
remove both calls from for loop: 0.1903s
remove both calls and the for loop: 0.0025s
remove contents of callbackWasterOne: 0.379s
remove contents of callbackWasterTwo: 0.378s
remove contents of both: 0.382s
So here is what I'm having trouble understanding:
When I remove just one of the calls from the for loop, the execution time drops by ~60%, which is more than the time spent by that one function plus main in the first place! How is this possible?
Why is the effect of removing both calls from the loop so small compared to removing just one? I can't figure out this non-linearity. I understand that the for loop is expensive, but in that case (if most of the remaining time can be attributed to the for loop that performs the function calls), why would removing just one of the calls cause such a large improvement in the first place?
I looked at the disassembly and the two functions are the same in code. The calls to them are the same, and removing the call simply deletes the one call line.
Other info that might be relevant
I'm using Ubuntu 14.04LTS
The code is compiled by Eclipse with no optimization (-O0)
I time the code by running it in terminal using "time"
I use OProfile with count = 10000 and 10 repetitions.
Here are the results from when I do this with -O1 optimization:
main: avg = 5.89%, stdev = 0.14%
callbackWasterOne: avg = 44.28%, stdev = 2.64%
callbackWasterTwo: avg = 49.66%, stdev = 2.54% (greater than before)
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.522s
Remove callbackWasterOne call: 0.149s (71.47% decrease)
Remove callbackWasterTwo call: 0.123s (76.45% decrease)
Remove both calls: 0.0365s (93.01% decrease) (what I would expect given the profile data just above)
So removing one call now is much better than before, and removing both still carries a benefit (probably because the optimizer understands that nothing happens in the loop). Still, removing one is much more beneficial than I would have anticipated.
Results of the two functions using different variables:
I defined 3 more variables for callbackWasterTwo() to use instead of reusing the same ones. Now the results are what I would have expected.
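Here is a sketch of that change as I understand it (the new variable names are my invention):

#include <cstdio>

// Original shared globals, still used by the first callback.
int a, b, c;
// Three new, independent globals so callbackWasterTwo no longer
// reads what callbackWasterOne just wrote.
int d, e, f2;

void callbackWasterOne(void) { a = b * c; }
void callbackWasterTwo(void) { d = e * f2; }

int main(void) {
    for (int j = 0; j < 1000; j++) {
        a = 1; b = 2; c = 3;
        d = 1; e = 2; f2 = 3;
        for (int i = 0; i < 100000; i++) {
            callbackWasterOne();
            callbackWasterTwo();
        }
    }
    std::printf("a: %d, d: %d.\n", a, d);
    return 0;
}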
main: avg = 10.87%, stdev = 0.19% (average is greater, but maybe due to those new variables)
callbackWasterOne: avg = 46.08%, stdev = 0.53%
callbackWasterTwo: avg = 42.82%, stdev = 0.62%
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.520s
Remove callbackWasterOne call: 0.292s (43.83% decrease)
Remove callbackWasterTwo call: 0.291s (44.07% decrease)
Remove both calls: 0.065s (87.55% decrease)
So now removing both calls is pretty much equivalent (within stdev) to removing one call + the other.
Since the result of removing either function is pretty much the same (43.83% vs 44.07%), I am going to go out on a limb and say that perhaps the profiler data (46% vs 42%) is still skewed. Perhaps it is the way it samples (I'm going to vary the counter value next and see what happens).
It appears that the success of optimization relates pretty strongly to the code reuse fraction. The only way to achieve "exactly" (you know what I mean) the speedup noted by the profiler is to optimize on completely independent code. Anyway this is all interesting.
I am still looking for some explanation ideas for the 70% decrease in the -O1 case though...
I did this with 10 functions (different formulas in each, but using some combination of 6 different variables, 3 at a time, all multiplication):
These results are disappointing, to say the least. I know the functions are identical, and yet the profiler indicates that some take significantly longer. No matter which one I remove (a "fast" or a "slow" one), the results are the same ;) So this leaves me wondering: how many people are relying on the profiler and being pointed at the wrong areas of code to fix? If I unknowingly saw these results, what could possibly tell me to go fix the 5% function rather than the 20% one (even though they are exactly the same)? What if the 5% one was much easier to fix, with a large potential benefit? And of course, this profiler might just not be very good, but it is popular! People use it!
Here is a screenshot. I don't feel like typing it in again:
My conclusion: I am overall rather disappointed with OProfile. I decided to try out Callgrind (Valgrind) from the command line on the same program, and it gave me far more reasonable results. In fact, the results were very reasonable (all functions spent roughly the same amount of time executing). I think Callgrind samples far more than OProfile ever did.
Callgrind will still not explain the difference in improvement when a function is removed, but at least it gives the correct baseline information...
Ah, I see you did look at the assembly. This question is indeed interesting in its own right, but in general there's no point profiling unoptimized code, since there's so much boilerplate that could easily be reduced even at -O1.
If it's really only the call that's missing, then that could explain the timing differences - there's lots of boilerplate from the -O0 stack manipulation code (any caller-saved registers have to be pushed onto the stack, and any arguments too; afterwards, any return value has to be handled and the opposite stack manipulation has to be done) which contributes to the time it takes to call the functions, but is not necessarily attributed to the functions themselves by OProfile, since that code is executed before/after the function is actually called.
I suspect the reason the second function seems to always take less time is that there's less (or no) stack juggling that needs to be done -- the parameter values are already on the stack thanks to the previous function call, and so, as you've seen, only the call to the function has to be executed, without any other extra work.

C++ using boolean evaluations for array positions (jump table)

I have a C++ if statement which looks like this (pseudocode - all variables are ints):
if (x < y) {
    c += d;
}
else {
    c += f;
}
and I am thinking of trying to remove the if statement and instead loading the values d and f into a two-element array:
array[0] = d
array[1] = f
and then I would like to be able to refer to array elements '0' or '1' based on the underlying value of the boolean (at least in C: 0 or 1). Is there any way to do this? So my code would change to something like:
c += array[(x < y)] - if this is true, c increments by f; otherwise, if it's false, c increments by d.
Can I do this, using the boolean result to look up the array index?
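For reference, a minimal sketch of the idea (a bool converts to 0 for false and 1 for true, so the lookup compiles; note that to match the original if statement, index 1 - the true case - must hold d, not f):

#include <cstdio>

int main() {
    int x = 3, y = 5, c = 0, d = 10, f = 20;
    int lut[2] = { f, d };       // lut[0]: x >= y case, lut[1]: x < y case
    c += lut[x < y];             // (x < y) converts to 0 or 1
    std::printf("c = %d\n", c);  // prints 10, since 3 < 5 selects d
    return 0;
}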
Of course you can do it. However, chances are that you are only going to make things worse. If you think that you are removing a branch in this case, you are mistaken. Assuming a production-quality compiler and the x86_64 architecture, your first version will result in a nice conditional move (i.e. cmovge). The second version, however, will result in an extra level of indirection and a memory read (i.e. mov eax,DWORD PTR [rax*4+0x4005d0]).
If you accept suggestions: I have a very bad feeling that you are on a very, very wrong path right now. When you are optimizing your program, you first have to measure/profile to find the bottleneck. Only when you know what the bottlenecks are can you start optimizing them. After optimizing, you have to measure/profile again to see whether there is an improvement or not. What you seem to be doing is not trusting your compiler, guessing, and doing false optimization. I recommend you stop right there, or else it will go downhill from here, trust me.
You could replace the if statement with the following if you want more compact code.
c += (x < y) ? d : f;
Yes, that will work, although it will make your code harder to understand, and modern compilers will eliminate the if statement anyway (when translating to assembly).

Why does this constexpr code cause GCC to eat all my RAM?

The following program will call fun 2 ^ (MAXD + 1) times. The maximum recursion depth should never go above MAXD, though (if my thinking is correct). Thus it may take some time to compile, but it should not eat my RAM.
#include <iostream>

const int MAXD = 20;

constexpr int fun(int x, int depth = 0) {
    return depth == MAXD ? x : fun(fun(x + 1, depth + 1) + 1, depth + 1);
}

int main() {
    constexpr int i = fun(1);
    std::cout << i << std::endl;
}
The problem is that eating my RAM is exactly what it does. When I turn MAXD up to 30, my laptop starts to swap after GCC 4.7.2 quickly allocates 3 GB or so. I have not yet tried it with Clang 3.1, as I don't have access to it right now.
My only guess is that this has something to do with GCC trying to be too clever and memoizing the function calls, like it does with templates. If so, doesn't it seem strange that there is no limit on how much memoization is done, like the size of an MRU cache table or something? I have not found a switch to disable it.
Why would I do this?
I am toying with the idea of making an advanced compile time library, like genetic programming or something. Since the compilers do not have compile time tail call optimization, I am worried that anything that loops will need recursion and (even if I turn up the maximum recursion depth parameter, which seems slightly ugly to require) will quickly allocate all my RAM and fill it with pointless stack frames. Thus I came up with the above solution for getting arbitrarily many function calls without a deep stack. Such a function could be used for folding/looping or trampolining.
EDIT:
Now I have tried it in Clang 3.1, and it does not leak memory at all, no matter how long I make it work (i.e. how high I make MAXD). CPU usage is almost 100% and memory usage is almost 0%, just as expected. Perhaps this is just a bug in GCC, then.
This may not be the definitive document regarding constexpr, but it's the primary doc linked to from the gcc constexpr wiki.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2235.pdf
... and it says...
We (still) prohibit recursion in all its form in constant expressions.
That is not strictly necessary because an implementation limit on
recursion depth in constant expression evaluation would save us from
the possibility of the compiler recursing forever. However, until we
see a convincing use case for recursion, we don’t propose to allow it.
So I expect you're bumping up against a language boundary and the way that GCC has chosen to implement constexpr (maybe attempting to generate the entire function inline, then evaluating/executing it).
Your answer is in your comment "by running the function runtime and observing that while I can make it run for a long time", which is caused by your innermost recursive call, fun(x + 1, depth + 1).
When you changed it to a runtime function rather than a compile-time function by removing constexpr and observed that it ran for a long time, that was an indicator that it recurses very deeply.
When the function is evaluated by the compiler, it still has to recurse deeply, but it doesn't use the machine stack for the recursion, since it isn't actually generating and executing machine code.
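For what it's worth, a runtime sketch of the question's function with constexpr removed and a call counter added (the counter is my addition): for MAXD = 20 it reports 2^21 - 1 = 2097151 calls, while the recursion depth never exceeds MAXD.

#include <cstdio>

const int MAXD = 20;
long long calls = 0;

int fun(int x, int depth = 0) {
    ++calls;  // count every invocation
    return depth == MAXD ? x : fun(fun(x + 1, depth + 1) + 1, depth + 1);
}

int main() {
    int result = fun(1);
    std::printf("result = %d, calls = %lld\n", result, calls);
    return 0;
}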