Why is assignment slower when there's an implicit conversion? - c++

If there were similar questions please direct me there; I searched for quite some time but didn't find anything.
Background:
I was just playing around and found some behavior I can't completely explain...
For primitive types, it looks like the assignment operator = takes longer when there's an implicit conversion than when the conversion is explicit.
int iTest = 0;
long lMax = std::numeric_limits<long>::max();
for (int i=0; i< 100000; ++i)
{
// I had 3 such loops, each running 1 of the below lines.
iTest = lMax;
iTest = (int)lMax;
iTest = static_cast<int>(lMax);
}
The result is that the C-style cast and the C++-style static_cast perform the same on average (the numbers differ each run, but with no visible difference), and they both outperform the implicit assignment.
Result:
iTest=-1, lMax=9223372036854775807
(iTest = lMax) used 276 microseconds
iTest=-1, lMax=9223372036854775807
(iTest = (int)lMax) used 191 microseconds
iTest=-1, lMax=9223372036854775807
(iTest = static_cast<int>(lMax)) used 187 microseconds
Question:
Why does the implicit conversion result in larger latency? I can guess that the assignment has to detect that the int overflows, so the value is adjusted to -1. But what exactly is going on in the assignment?
Thanks!

If you want to know why something is happening under the covers, the best place to look is ... wait for it ... under the covers :-)
That means examining the assembler language that is produced by your compiler.
A C++ environment is best thought of as an abstract machine for running C++ code. The standard (mostly) dictates behaviour rather than implementation details. Once you leave the bounds of the standard and start thinking about what happens underneath, the C++ source code is of little help anymore - you need to examine the actual code that the computer is running, the stuff output by the compiler (usually machine code).
It may be that the compiler is throwing away the loop because it's calculating the same thing every time so only needs do it once. It may be that it throws away the code altogether if it can determine you don't use the result.
There was a time many moons ago, when the VAX Fortran compiler (I did say many moons) outperformed its competitors by several orders of magnitude in a given benchmark.
That was for that exact reason. It had determined the results of the loop weren't used so had optimised the entire loop out of existence.
The other thing you might want to watch out for is the measuring tools themselves. When you're talking about durations of 1/10,000th of a second, your results can be swamped by the slightest bit of noise.
There are ways to alleviate these effects such as ensuring the thing you're measuring is substantial (over ten seconds for example), or using statistical methods to smooth out any noise.
But the bottom line is, it may be the measuring methodology causing the results you're seeing.
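The two risks described above (the optimizer discarding the loop, and timings too short to trust) can be reduced with a harness along the following lines. This is a minimal sketch, not the questioner's code: the names g_source, g_sink, time_implicit_ns and time_static_cast_ns are hypothetical, and the value stored in g_sink assumes the usual two's-complement narrowing.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical names for illustration; none of them come from the question.
// volatile forces a real load and a real store on every iteration, so the
// optimizer may not hoist the conversion or delete the loop.
volatile long g_source = std::numeric_limits<long>::max();
volatile int g_sink = 0;

// Runs `iterations` implicit long-to-int conversions; returns elapsed ns.
std::int64_t time_implicit_ns(long iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        const long value = g_source;  // volatile read
        g_sink = value;               // implicit narrowing conversion
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Same loop, but the conversion is spelled out with static_cast.
std::int64_t time_static_cast_ns(long iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        const long value = g_source;      // volatile read
        g_sink = static_cast<int>(value); // explicit narrowing conversion
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling each function with a large iteration count (large enough that one call runs for seconds rather than microseconds) and repeating the pair several times gives numbers that are less likely to be dominated by noise.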

#include <limits>
int iTest = 0;
long lMax = std::numeric_limits<long>::max();
void foo1()
{
iTest = lMax;
}
void foo2()
{
iTest = (int)lMax;
}
void foo3()
{
iTest = static_cast<int>(lMax);
}
Compiled with GCC 5 using -O3 yields:
__Z4foo1v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
__Z4foo2v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
__Z4foo3v:
movq _lMax(%rip), %rax
movl %eax, _iTest(%rip)
ret
They are all exactly the same.
Since you didn't provide a complete example I can only guess that the difference is due to something you aren't showing us.


Function pointer performance; slower on a single call than multiple calls?

I am interested in the execution speed of a function called through a pointer. I found initially that calling a function pointer through a pointer passed in as a parameter is slower than calling a locally declared function pointer. Please see the following code; you can see I have two function calls, both of which ultimately execute a lambda through a function pointer.
#include <chrono>
#include <iostream>
using namespace std;
__attribute__((noinline)) int plus_one(int x) {
return x + 1;
}
typedef int (*FUNC)(int);
#define OUTPUT_TIME(msg) std::cout << "Execution time (ns) of " << msg << ": " << std::chrono::duration_cast<chrono::nanoseconds>(t_end - t_start).count() << std::endl;
#define START_TIMING() auto const t_start = std::chrono::high_resolution_clock::now();
#define END_TIMING(msg) auto const t_end = std::chrono::high_resolution_clock::now(); OUTPUT_TIME(msg);
auto constexpr g_count = 1000000;
__attribute__((noinline)) int speed_test_no_param() {
int r;
auto local_lambda = [](int a) {
return plus_one(a);
};
FUNC f = local_lambda;
START_TIMING();
for (auto i = 0; i < g_count; ++i)
r = f(100);
END_TIMING("speed_test_no_param");
return r;
}
__attribute__((noinline)) int speed_test_with_param(FUNC &f) {
int r;
START_TIMING();
for (auto i = 0; i < g_count; ++i)
r = f(100);
END_TIMING("speed_test_with_param");
return r;
}
int main() {
int ret = 0;
auto main_lambda = [](int a) {
return plus_one(a);
};
ret += speed_test_no_param();
FUNC fp = main_lambda;
ret += speed_test_with_param(fp);
return ret;
}
Built on Ubuntu 20.04 with:
g++ -ggdb -ffunction-sections -O3 -std=c++17 -DNDEBUG=1 -DRELEASE=1 -c speed_test.cpp -o speed_test.o && g++ -o speed_test -Wl,-gc-sections -Wl,--start-group speed_test.o -Wl,--rpath='$ORIGIN' -Wl,--end-group
The results were not surprising; for any given number of runs, we see that the version without the parameter is clearly the fastest. Here is just one run; all of the many times I have run, this yields the same result:
Execution time (ns) of speed_test_no_param: 74
Execution time (ns) of speed_test_with_param: 1173849
When I dig into the assembly, I found what I believe is the reason for this. The code for speed_test_no_param() is:
0x000055555555534b call 0x555555555310 <plus_one(int)>
... whereas the code for speed_test_with_param is more complicated; a fetch of the address of the lambda, then a jump to the plus_one function:
0x000055555555544e call QWORD PTR [rbx]
...
0x0000555555555324 jmp 0x555555555310 <plus_one(int)>
(On compiler explorer at https://godbolt.org/z/b4hqYx7Eo. Different compiler but similar assembly; timing code commented out.)
What I didn't expect though is that when I reduce the number of calls down to 1 from 1000000 (auto constexpr g_count = 1), the results are flipped with the parameter version being the fastest:
Execution time (ns) of speed_test_no_param: 61
Execution time (ns) of speed_test_with_param: 31
I have also run this many times; the parameter version is always the fastest.
I do not understand why this is; I don't now believe a call through a parameter is slower than a local variable due to this conflicting evidence, but looking at the assembly suggests it really should be.
Can someone please explain?
UPDATE
As per the comment below, ordering matters. When I call speed_test_with_param() first, speed_test_no_param() is the fastest of the two! Yet when I call speed_test_no_param() first, speed_test_with_param() is the fastest! Any explanation to this would be greatly appreciated!
With multiple loop iterations in the C++ source, the fast version is only doing one in asm, because you gave the optimizer enough visibility to prove that's equivalent.
Why ordering matters with just one iteration: probably warm-up effects in the library code for std::chrono. See "Idiomatic way of performance evaluation?"
Can you confirm that my suspicion that the call without the parameter technically should be the fastest, because with the parameter involves a memory read to find the location to call?
Much more significant is whether the compiler can constant-propagate the function pointer and see what function is being called; notice how speed_test_with_param has an actual loop that calls g_count times, but speed_test_no_param can see it's calling plus_one. Clang sees through the local lambda and the noinline to notice it has no side-effects, so it only calls it once.
It doesn't inline, but it still does inter-procedural optimization. With GCC, you could block that by using __attribute__((noipa)). GCC's noclone attribute can also stop it from making a copy of the function with constant-propagation into it, but noipa is I think stronger. noinline isn't sufficient for benchmarking stuff that becomes trivial to optimize when the compiler can see everything. But I don't think clang has anything like that.
You can make functions opaque to the optimizer by putting them in separate source files and not using -flto or another whole-program option like gcc -fwhole-program.
The only reason store/reload is involved with the function pointer is because you passed it by reference for no reason, even though it's just a single pointer. If you pass it by value (https://godbolt.org/z/WEvvsvoxb) you can see call rbx in the loop.
Apparently clang couldn't hoist the load because it wasn't sure the caller's function-pointer wouldn't be modified by the call, because it was making a stand-alone version of speed_test_with_param that would work with any caller and any arg, not just the one main passes. So constprop didn't happen.
An indirect call can mispredict more easily, and yes store/reload adds a few cycles more latency before the prediction can be checked.
So yes, in general you'd expect it to be slower when the function to be called is a function-pointer arg, not a compile-time-constant fptr initialized within the calling function where the compiler can see the definition of what it's calling even if you artificially limit it.
If it becomes a call some_name instead of call rbx, that's still faster even if it does still have to loop like you were trying to make it.
(Microbenchmarking is hard, especially when you're trying to benchmark a C++ concept which can optimize differently depending on context; you have to know enough about compilers, optimization, and assembly to realize what makes the difference and what you're actually measuring. There isn't a meaningful answer to some questions, like "how fast or slow is the + operator?", even if you limit it to integers, because it can optimize away with constants, or vectorize, or not depending on how it's used.)
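The by-reference versus by-value point made above can be sketched as follows. The names Func, sum_calls_by_ref and sum_calls_by_value are hypothetical, and whether a particular compiler actually reloads the pointer inside the loop is something to confirm in its assembly output rather than assume.

```cpp
using Func = int (*)(int);  // hypothetical alias, same shape as FUNC in the question

int plus_one(int x) { return x + 1; }

// By reference: the callee only holds the address of the caller's pointer.
// Unless the compiler can prove that calling f never modifies that pointer,
// it has to reload it from memory before each indirect call.
int sum_calls_by_ref(Func& f, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += f(i);
    }
    return total;
}

// By value: the callee owns a private copy that nothing else can change,
// so the pointer can stay in a register for the whole loop.
int sum_calls_by_value(Func f, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += f(i);
    }
    return total;
}
```

Both functions compute the same sum; only the code the compiler is allowed to generate for the loop differs.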
You're benchmarking a single iteration, which subjects you to cache effects and other warmup costs. The entire reason we normally run benchmarks several times is to amortize out these kinds of effects.
Caching refers to the memory hierarchy: your actual RAM is significantly slower than your CPU (and disk even more so), so to speed things up your CPU has a cache (often, multiple caches) which stores the most recently accessed bits of memory. The first time you start your program, it will need to be loaded from disk into RAM; thereafter, it will need to be loaded from RAM into the CPU caches. Uncached memory accesses can be orders of magnitude slower than cached memory accesses. As your program runs, various bits of code and data will be loaded from RAM and cached; hence, subsequent executions of the same bit of code will often be faster than the first execution.
Other effects can include things like lazy dynamic linking and lazy initializations, wherein certain functions will perform extra work the first time they're called (for example, resolving dynamic library loads or initializing static data). These can all contribute to the first iteration being slower than subsequent iterations.
To address these issues, always make sure to run your benchmarks multiple times - and when possible, run your entire benchmark suite a few times in one process and take the lowest (fastest) run.
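The advice above (warm up first, repeat, keep the fastest run) can be sketched as a small helper. The name best_time_ns and its parameters are hypothetical; this is an illustration of the idea, not a replacement for a real benchmarking library.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: runs `work` once untimed as a warm-up, then times it
// `repeats` more times and returns the fastest run in nanoseconds.
template <typename Work>
std::int64_t best_time_ns(Work&& work, int repeats) {
    work();  // warm-up: pays for cold caches, lazy linking, first-use init

    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        best = std::min(best, ns);  // keep the least disturbed measurement
    }
    return best;
}
```

Taking the minimum rather than the mean is a design choice: noise from the system only ever makes a run slower, so the fastest run is usually the closest to the undisturbed cost.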

C or C++ : for loop variable

My question is a very basic one. In C or C++:
Let's say the for loop is as follows,
for(int i=0; i<someArray[a+b]; i++) {
....
do operations;
}
My question is whether the calculation a+b, is performed for each for loop or it is computed only once at the beginning of the loop?
For my requirements, the value a+b is constant. If a+b is computed and the value someArray[a+b]is accessed each time in the loop, I would use a temporary variable for someArray[a+b]to get better performance.
You can find out, when you look at the generated code
g++ -S file.cpp
and
g++ -O2 -S file.cpp
Look at the output file.s and compare the two versions. If someArray[a+b] can be reduced to a constant value for all loop cycles, the optimizer will usually do so and pull it out into a temporary variable or register.
It will behave as if it was computed each time. If the compiler is optimising and is capable of proving that the result does not change, it is allowed to move the computation out of the loop. Otherwise, it will be recomputed each time.
If you're certain the result is constant, and speed is important, use a variable to cache it.
is performed for each for loop or it is computed only once at the beginning of the loop?
If the compiler is not optimizing this code then it will be computed each time. It is safer to use a temporary variable; it should not cost much.
First, the C and C++ standards do not specify how an implementation must evaluate i<someArray[a+b], just that the result must be as if it were performed each iteration (provided the program conforms to other language requirements).
Second, any C and C++ implementation of modest quality will have the goal of avoiding repeated evaluation of expressions whose value does not change, unless optimization is disabled.
Third, several things can interfere with that goal, including:
If a, b, or someArray are declared with scope visible outside the function (e.g., are declared at file scope) and the code in the loop calls other functions, the C or C++ implementation may be unable to determine whether a, b, or someArray are altered during the loop.
If the address of a, b, or someArray or its elements is taken, the C or C++ implementation may be unable to determine whether that address is used to alter those objects. This includes the possibility that someArray is an array passed into the function, so its address is known to other entities outside the function.
If a, b, or the elements of someArray are volatile, the C or C++ implementation must assume they can be changed at any time.
Consider this code:
void foo(int *someArray, int *otherArray)
{
int a = 3, b = 4;
for(int i = 0; i < someArray[a+b]; i++)
{
… various operations …
otherArray[i] = something;
}
}
In this code, the C or C++ implementation generally cannot know whether otherArray points to the same array (or an overlapping part) as someArray. Therefore, it must assume that otherArray[i] = something; may change someArray[a+b].
Note that I have answered regarding the larger expression someArray[a+b] rather than just the part you asked about, a+b. If you are only concerned about a+b, then only the factors that affect a and b are relevant, obviously.
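The aliasing concern above is not only a performance matter: caching the bound by hand changes what the loop does when the two pointers overlap. The following sketch uses hypothetical function names and a simplified loop bound (bound[0] instead of someArray[a+b]) to make the difference observable.

```cpp
// Re-reads bound[0] on every iteration, as the loop in the question is written.
// Returns how many iterations actually ran.
int count_rereading(int* bound, int* out) {
    int iterations = 0;
    for (int i = 0; i < bound[0]; ++i) {
        out[i] = 0;  // overwrites bound[0] when out aliases bound
        ++iterations;
    }
    return iterations;
}

// Reads bound[0] once into a local, so later writes through out
// cannot change the trip count.
int count_cached(int* bound, int* out) {
    const int limit = bound[0];
    int iterations = 0;
    for (int i = 0; i < limit; ++i) {
        out[i] = 0;
        ++iterations;
    }
    return iterations;
}
```

With distinct arrays both functions run the same number of iterations. When both arguments point at the same array starting with 3, the re-reading version stops after one iteration (it has just zeroed its own bound) while the cached version runs three. This is why the compiler cannot hoist the load on its own unless it can rule out the overlap.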
Depends on how good the compiler is, what optimization levels you use and how a and b are declared.
For example, if a and/or b has volatile qualifier then compiler has to read it/them everytime. In that case, compiler can't choose to optimize it with the value of a+b. Otherwise, look at the code generated by the compiler to understand what your compiler does.
There's no standard behaviour for how this is calculated in either C or C++.
I will bet that if a and b do not change over the loop it is optimized. Moreover, if someArray[a+b] is not touched it is also optimized. This is actually more important, since fetching operations are quite expensive.
That is with any half-decent compiler with most basic optimizations. I will also go as far as saying that people who say it does always evaluate are plain wrong. It is not always for certain, and it is most probably optimized whenever possible.
The calculation is performed each for loop. Although the optimizer can be smart and optimize it out, you would be better off with something like this:
// C++ lets you create a const reference; you cannot do it in C, though
const some_array_type &last(someArray[a+b]);
for(int i=0; i<last; i++) {
...
}
It calculates every time. Use a variable :)
You can compile it and check the assembly code just to make sure.
But I think most compilers are clever enough to optimize this kind of stuff. (If you are using some optimization flag)
It might be calculated every time or it might be optimised. It will depend on whether a and b exist in a scope where the compiler can guarantee that no external function can change their values. That is, if they are in a global context, the compiler cannot guarantee that a function you call in the loop will not modify them (unless you don't call any functions). If they are only in local context, then the compiler can attempt to optimise that calculation away.
Generating both optimised and unoptimised assembly code is the easiest way to check. However, the best thing to do is not care because the cost of that sum is so incredibly cheap. Modern processors are very very fast and the thing that is slow is pulling in data from RAM to the cache. If you want to optimise your code, profile it; don't guess.
The calculation a+b would be carried out every iteration of the loop, and then the lookup into someArray is carried out every iteration of the loop, so you could probably save a lot of processor time by having a temporary variable set outside the loop, for example(if the array is an array of ints say):
int myLoopVar = someArray[a+b];
for(int i=0; i<myLoopVar; i++)
{
....
do operations;
}
Very simplified explanation:
If the value at array position a+b were a mere 5 for example, that would be 5 calculations and 5 lookups, so 10 operations, which would be replaced by 8 by using a variable outside the loop (5 accesses (1 per iteration of the loop), 1 calculation of a+b, 1 lookup and 1 assignment to the new variable), not so great a saving. If however you are dealing with larger values, for example the value stored in the array at a+b is 100, you would potentially be doing 100 calculations and 100 lookups, versus 103 operations if you have a variable outside the loop (100 accesses (1 per iteration of the loop), 1 calculation of a+b, 1 lookup and 1 assignment to the new variable).
The majority of the above however is dependent on the compiler: depending upon which switches you utilise, what optimisations the compiler can apply automatically etc., the code may well be optimised without you having to make any changes to your code. The best thing to do is weigh up the pros and cons of each approach specifically for your current implementation, as what may suit a large number of iterations may not be most efficient for a small number, or perhaps memory may be an issue which would dictate a differing style for your program . . . Well you get the idea :)
Let me know if you need any more info:)
for the following code:
int a = 10, b = 10;
for(int i=0; i< (a+b); i++) {} // a and b do not change in the body of loop
you get the following assembly:
L3:
addl $1, 12(%esp) ;increment i
L2:
movl 4(%esp), %eax ;move a to reg AX
movl 8(%esp), %edx ;move b to reg DX
addl %edx, %eax ;AX = AX + DX, i.e. AX = a + b
cmpl 12(%esp), %eax ;compare AX with i
jg L3 ;if AX > i, jump to L3 label
if you apply the compiler optimization, you get the following assembly:
movl $20, %eax ;move 20 (a+b) to AX
L3:
subl $1, %eax ;decrement AX
jne L3 ;jump if condition met
movl $0, %eax ;move 0 to AX
Basically, in this case, with my compiler (MinGW 4.8.0) and no optimization, the loop will do "the calculation" on every iteration regardless of whether you're changing the conditional variables within the loop or not (I haven't posted assembly for this, but take my word for it, or even better, don't and disassemble the code yourself).
When you apply the optimization, the compiler will do some magic and churn out a set of instructions that are completely unrecognizable.
If you don't feel like optimizing your loop through a compiler action (-On), then declaring one variable and assigning it a+b will reduce your assembly by an instruction or two.
int a = 10, b = 10;
const int c = a + b;
for(int i=0; i< c; i++) {}
assembly:
L3:
addl $1, 12(%esp)
L2:
movl 12(%esp), %eax
cmpl (%esp), %eax
jl L3
movl $0, %eax
Keep in mind, the assembly code I posted here is only the relevant snippet; there's a bit more, but it's not relevant as far as the question goes.

How efficient is an if statement compared to a test that doesn't use an if? (C++)

I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"
int a, b, low;
if (a < b) low = a;
else low = b;
is more or less efficient than this:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
(or the variation of putting int delta = a - b at the top and replacing instances of a - b with that).
I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)
One advantage of eliminating an if statement is that you avoid branch prediction penalties.
Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like
a = random() % 10
if (a < 5)
print "Less"
else
print "Greater"
then this branch is not easily predicted, and will frequently incur the misprediction penalty associated with flushing the pipeline and rolling back instructions that were executed in the wrong part of the branch.
One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.
So
int a, b, low;
if (a < b) low = a;
else low = b;
becomes
int a, b, low;
low = (a < b) ? a : b;
and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.
Of course, this is a micro-optimization which is unlikely to have significant impact on your code.
Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.
Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.
Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:
int foo1(int a, int b)
{
int low;
if (a < b) low = a;
else low = b;
return low;
}
int foo2(int a, int b)
{
int low;
low = b + ((a - b) & ((a - b) >> 31));
return low;
}
I get:
foo1:
cmpl %edi, %esi
cmovle %esi, %edi
movl %edi, %eax
ret
foo2:
subl %esi, %edi
movl %edi, %eax
sarl $31, %eax
andl %edi, %eax
addl %esi, %eax
ret
...which I'm pretty sure won't count for branch predictions, since the code doesn't jump. Also, the non-if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do its job.
Like with any low-level optimization, test it on the target CPU/board setup.
On my compiler (gcc 4.5.1 on x86_64), the first example becomes
cmpl %ebx, %eax
cmovle %eax, %esi
The second example becomes
subl %eax, %ebx
movl %ebx, %edx
sarl $31, %edx
andl %ebx, %edx
leal (%rdx,%rax), %esi
Not sure if the first one is faster in all cases, but I would bet it is.
The biggest problem is that your second example won't work on 64-bit machines.
However, even neglecting that, modern compilers are smart enough to consider branchless code in every case possible, and compare the estimated speeds. So, your second example will most likely actually be slower.
There will be no difference between the if statement and using a ternary operator, as even most dumb compilers are smart enough to recognize this special case.
[Edit] Because I think this is such an interesting topic, I've written a blog post on it.
Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.
I would profile the application and concentrate your optimization efforts on something more worthwhile.
Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.
For simple statements like this, I find the ternary operator very intuitive:
low = (a < b) ? a : b;
Clear and concise.
For something as simple as this, why not just experiment and try it out?
Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.
I wrote a simple program that compares both techniques, passing in random numbers (so that we don't see perfect branch prediction), with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iterations? Less than 50ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.
One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure
int foo (int a, int b) {
return ((a < b) ? a : b);
}
is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes
int bar = foo (x, x+3);
After inlining, the compiler will recognize that 3 is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get
int bar = x;
It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.
One thing I will point out that I haven't seen mentioned is that an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.
I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.
Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can utilize multiple ALUs to evaluate parts of the expression in parallel. More non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions which allow the first code to be made branch-less too.)
On embedded CPUs branching is often less expensive (relative to everything else), and they don't have many spare ALUs to evaluate operations out of order (that's if they support out-of-order execution at all). Less code/data is better - caches are small too. (I have even seen uses of bubble sort in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of information.)
Important: do not forget about the compiler optimizations. Using many tricks, the compilers sometimes can remove the branching themselves: inlining, constant propagation, refactoring, etc.
But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.
The way things go on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL capable.
Why low = a; in the if and low = a; in the else? And, why 31? If 31 has anything to do with CPU word size, what if the code is to be run on a CPU of different size?
The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.
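On the "why 31?" question: the shift amount is the number of value bits in an int, so it can be derived from the type instead of hard-coded. The sketch below uses hypothetical names and rests on two stated assumptions: a - b does not overflow, and right-shifting a negative int is an arithmetic shift (guaranteed from C++20, implementation-defined but common before that).

```cpp
#include <algorithm>
#include <limits>

// Branchless minimum with the shift derived from the width of int.
// Assumptions: a - b does not overflow, and >> on a negative int is an
// arithmetic (sign-propagating) shift.
int low_branchless(int a, int b) {
    // digits is the number of non-sign bits: 31 for a 32-bit int.
    constexpr int sign_shift = std::numeric_limits<int>::digits;
    const int delta = a - b;
    // delta >> sign_shift is all ones when a < b, and zero otherwise.
    return b + (delta & (delta >> sign_shift));
}

// The readable alternative, with no overflow or shift caveats.
int low_readable(int a, int b) {
    return std::min(a, b);
}
```

The two functions agree whenever the subtraction does not overflow; low_readable stays correct for every pair of inputs, which is one more argument for the plain version.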
profile results with gcc -o foo -g -p -O0, Solaris 9 v240
%Time  Seconds  Cumsecs    #Calls  msec/call  Name
 36.8     0.21     0.21   8424829     0.0000  foo2
 28.1     0.16     0.37         1   160.      main
 17.5     0.10     0.47  16850667     0.0000  _mcount
 17.5     0.10     0.57   8424829     0.0000  foo1
  0.0     0.00     0.57         4     0.      atexit
  0.0     0.00     0.57         1     0.      _fpsetsticky
  0.0     0.00     0.57         1     0.      _exithandle
  0.0     0.00     0.57         1     0.      _profil
  0.0     0.00     0.57      1000     0.000   rand
  0.0     0.00     0.57         1     0.      exit
code:
int
foo1 (int a, int b, int low)
{
if (a < b)
low = a;
else
low = b;
return low;
}
int
foo2 (int a, int b, int low)
{
low = (a < b) ? a : b;
return low;
}
int main()
{
int low=0;
int a=0;
int b=0;
int i=500;
while (i--)
{
for(a=rand(), b=rand(); a; a--)
{
low=foo1(a,b,low);
low=foo2(a,b,low);
}
}
return 0;
}
Based on this data, in the above environment, several beliefs stated here were not found to be true. Note the 'in this environment': the if construct was faster than the ternary ?: construct.
I wrote a ternary logic simulator not so long ago, and this question was relevant to me, as it directly affects my interpreter's execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.
In a binary-coded-ternary system one trit is packed in two bits. The most significant bit means negative one and the least significant means positive one. Case "11" should not occur, but it must be handled properly and treated as 0.
Consider the inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as a regular integer -1, 0 or 1. As I observed, there are 4 approaches: I called them "cond", "mod", "math" and "lut". Let's investigate them.
First is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because it relies on the branch predictor. Even worse, it varies, because it is not known a priori whether there will be one branch or two. Here is an example:
inline int bct_decoder_cond( unsigned bctData ) {
unsigned lsB = bctData & 1;
unsigned msB = bctData >> 1;
return
( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
( lsB > msB ) ? 1 : -1;
}
This is the slowest version; it could involve 2 branches in the worst case, and this is something where binary logic fails. On my 3770k it produces around 200 MIPS on average on random data. (Here and after, each test is the average of 1000 tries on a randomly filled 2 MB dataset.)
The next one relies on the modulo operator; its speed is somewhere between the first and the third, but it is definitely faster than the first - 600 MIPS:
inline int bct_decoder_mod( unsigned bctData ) {
return ( int )( ( bctData + 1 ) % 3 ) - 1;
}
The next one is a branchless approach which involves only maths, thus math; it does not use jump instructions at all:
inline int bct_decoder_math( unsigned bctData ) {
return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}
This does what it should, and behaves really well. To compare, the performance estimate is 1000 MIPS, which is 5x faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is quite a good version in itself.
If this is not enough then we can go further, with something special. Next is the lookup table approach:
inline int bct_decoder_lut( unsigned bctData ) {
static const int decoderLUT[] = { 0, 1, -1, 0 };
return decoderLUT[ bctData & 0x3 ];
}
In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single AND instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it.
So I think I answered your question about what branched and branchless code could look like, with detailed samples, a real-world application, and real performance measurements.
Updated answer taking the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.
TLDR summary: avoiding ifs can help with vectorization.
Because SIMD would be too complex to allow branching on some elements but not others, any code containing an if statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that do this as an integrated part of the vectorization pass (Clang does some of this independently, but not specifically to help vectorization, AFAIK).
Using the OP's provided example:
int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));
Many compilers can vectorize this to be something approximately equivalent to:
__m128i low128i(__m128i a, __m128i b){
__m128i diff, tmp;
diff = _mm_sub_epi32(a,b);
tmp = _mm_srai_epi32(diff, 31);
tmp = _mm_and_si128(tmp,diff);
return _mm_add_epi32(tmp,b);
}
This optimization would require the data to be laid out in a fashion that allows for it, but it could be extended to __m256i with AVX2 or __m512i with AVX-512 (and even unroll loops further to take advantage of additional registers), or to other SIMD instructions on other architectures. Another plus is that these instructions are all low-latency, high-throughput instructions (latencies of ~1 and reciprocal throughputs in the range of 0.33 to 0.5, so really fast relative to non-vectorized code).
I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput, and other architectures like ARM may lack it entirely), but it could be done by doing something like:
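The scalar form of the trick can be checked against std::min. This is a minimal sketch: branchless_min and check_branchless_min are hypothetical names, and it assumes that a - b does not overflow and that >> on a negative value is an arithmetic shift (guaranteed since C++20, implementation-defined but near-universal before that).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Branchless minimum from the question, wrapped in a function.
// Assumes a - b does not overflow and that >> on a negative int32_t
// is an arithmetic (sign-extending) shift.
inline std::int32_t branchless_min(std::int32_t a, std::int32_t b) {
    std::int32_t diff = a - b;
    // diff >> 31 is all ones when a < b and zero otherwise, so the mask
    // either keeps diff (giving b + (a - b) == a) or clears it (giving b).
    return b + (diff & (diff >> 31));
}

// Hypothetical self-check: compares against std::min on a few small values.
inline void check_branchless_min() {
    const std::int32_t samples[] = { -1000, -1, 0, 1, 7, 1000 };
    for (std::int32_t a : samples)
        for (std::int32_t b : samples)
            assert(branchless_min(a, b) == std::min(a, b));
}
```

The samples are deliberately small; values near INT32_MIN or INT32_MAX would make a - b overflow, which is exactly the case the trick does not handle.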
void lowhi128i(__m128i *a, __m128i *b){ // does both low and high
__m128i _a=*a, _b=*b;
// lanes where a > b must be swapped so that a ends up low and b high
__m128i swapmask = _mm_cmpgt_epi32(_a,_b);
_mm_maskmoveu_si128(_b,swapmask,(char*)a);
_mm_maskmoveu_si128(_a,swapmask,(char*)b);
}
However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.
Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.
My simple thought though is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.
If it is for GNU C++, try this (note that the <? extension only exists in older versions; it was removed in GCC 4.2):
int min = i <? j;
I have not profiled it but I think it is definitely the one to beat.

C++ Declaring int in the for loop

I haven't used C++ in a while. I've been depending on my Java compiler to do optimization.
What is the most optimized way to write a for loop in C++? Or is it all the same now with modern compilers? In the 'old days' there was a difference.
for (int i=1; i<=100; i++)
OR
int i;
for (i=1; i<=100; i++)
OR
int i = 1;
for ( ; i<=100; i++)
Is it the same in C?
EDIT:
Okay, so the overwhelming consensus is to use the first case and let the compiler optimize it if it wants to.
I'd say that trivial things like this are probably optimized by the compiler, and you shouldn't worry about them. The first option is the most readable, so you should use that.
EDIT: Adding what other answers said, there is also the difference that if you declare the variable in the loop initializer, it ceases to exist after the loop ends.
The difference is scope.
for(int i = 1; i <= 100; ++i)
is generally preferable because then the scope of i is restricted to the for loop. If you declare it before the for loop, then it continues to exist after the for loop has finished and could clash with other variables. If you're only using it in the for loop, there's no reason to let it exist longer than that.
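A small sketch of what the restricted scope buys you: because each counter dies with its loop, the name i can be reused right afterwards, even with a different type. sum_two_ranges is a hypothetical function written only for this illustration.

```cpp
#include <cassert>

// Hypothetical example: two loops in one function, each with its own i.
inline int sum_two_ranges() {
    int sum = 0;
    for (int i = 1; i <= 100; ++i)   // this i exists only inside the loop
        sum += i;
    // i is out of scope here, so the name is free again.
    for (long i = 1; i <= 10; ++i)   // a different variable, different type
        sum += static_cast<int>(i);
    return sum;                      // 5050 + 55
}
```

Had i been declared before the first loop, the second declaration would be a redefinition error in the same scope.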
Let's say the original poster had a loop they really wanted optimized - every instruction counted. How can we figure out - empirically - the answer to his question?
gcc at least has a useful, if uncommonly used switch, '-S'. It dumps the assembly code version of the .c file and can be used to answer questions like the OP poses. I wrote a simple program:
int main( )
{
int sum = 0;
for(int i=1;i<=10;++i)
{
sum = sum + i;
}
return sum;
}
And ran: gcc -O0 -std=c99 -S main.c, creating the assembly version of the main program. Here's the contents of main.s (with some of the fluff removed):
movl $0, -8(%rbp)
movl $1, -4(%rbp)
jmp .L2
.L3:
movl -4(%rbp), %eax
addl %eax, -8(%rbp)
addl $1, -4(%rbp)
.L2:
cmpl $10, -4(%rbp)
jle .L3
You don't need to be an assembly expert to figure out what's going on. movl moves values, addl adds things, cmpl compares and jle stands for 'jump if less than or equal', $ is for constants. It's loading 0 into something - that must be 'sum', 1 into something else - ah, 'i'! A jump to L2 where we do the compare to 10, jump to L3 to do the add. Fall through to L2 for the compare again. Neat! A for loop.
Change the program to:
int main( )
{
int sum = 0;
int i=1;
for( ;i<=10;++i)
{
sum = sum + i;
}
return sum;
}
Rerun gcc and the resultant assembly will be very similar. There's some stuff going on with recording line numbers, so they won't be identical, but the assembly ends up being the same. Same result with the last case. So, even without optimization, the code's just about the same.
For fun, rerun gcc with '-O3' instead of '-O0' to enable optimization and look at the .s file.
main:
movl $55, %eax
ret
gcc not only figured out we were doing a for loop, but also realized it would run a constant number of times, did the loop for us at compile time, chucked out 'i' and 'sum', and hard coded the answer - 55! That's FAST - albeit a bit contrived.
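What gcc did here is an optimizer decision, not something the language promises. If you want the same result computed at compile time regardless of optimization level, one option (a different technique from the one shown above, and requiring C++14 or later) is a constexpr function. The name sum_to is hypothetical.

```cpp
#include <cassert>

// Hypothetical constexpr version of the loop (needs C++14 for the loop body).
constexpr int sum_to(int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i)
        sum = sum + i;
    return sum;
}

// Evaluated by the compiler: the build fails if the sum is not 55.
static_assert(sum_to(10) == 55, "1 + 2 + ... + 10 should be 55");
```

This only shows that the loop can be folded to a constant at compile time; it is not what gcc -O3 does internally.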
Moral of the story? Spend your time on ensuring your code is clean and well designed. Code for readability and maintainability. The guys that live on mountain dew and cheetos are way smarter than us and have taken care of most of these simple optimization problems for us. Have fun!
It's the same. The compiler will optimize these to the same thing.
Even if they weren't the same, the difference compared to the actual body of your loop would be negligible. You shouldn't worry about micro-optimizations like this. And you shouldn't make micro-optimizations unless you are performance profiling to see if it actually makes a difference.
It's the same in terms of speed. The compiler will optimize if you do not have a later use of i.
In terms of style - I'd put the definition in the loop construct, as it reduces the risk that you'll conflict if you define another i later.
Don't worry about micro-optimizations, let the compiler do it. Pick whichever is most readable. Note that declaring a variable within a for initialization statement scopes the variable to the for statement (C++03 § 6.5.3 1), though the exact behavior of compilers may vary (some let you pick). If code outside the loop uses the variable, declare it outside the loop. If the variable is truly local to the loop, declare it in the initializer.
It has already been mentioned that the main difference between the two is scope. Make sure you understand how your compiler handles the scope of an int declared as
for (int i = 1; ...;...)
I know that when using MSVC++6, i is still in scope outside the loop, just as if it were declared before the loop. This behavior is different from VS2005 and, though I'd have to check, I think from the last version of gcc that I used. In both of those compilers, the variable was only in scope inside the loop.
for(int i = 1; i <= 100; ++i)
This is easiest to read, except for ANSI C / C89 where it is invalid.
A C++ for loop is essentially a packaged while loop.
for (int i=1; i<=100; i++)
{
some foobar ;
}
To the compiler, the above code is equivalent to the code below (as long as the body contains no continue, which in the for loop would still run i++).
{
int i=1 ;
while (i<=100){
some foobar ;
i++ ;
}
}
Note the int i=1 ; is contained within a dedicated scope that encloses only it and the while loop.
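The equivalence can be written out as two small functions, assuming the body contains no continue. The names sum_with_for and sum_with_while are hypothetical, and summing stands in for "some foobar".

```cpp
#include <cassert>

// The for loop form.
inline int sum_with_for() {
    int sum = 0;
    for (int i = 1; i <= 100; i++) {
        sum += i;
    }
    return sum;
}

// The same loop spelled as a while loop inside its own scope.
inline int sum_with_while() {
    int sum = 0;
    {
        int i = 1;            // visible only inside these braces
        while (i <= 100) {
            sum += i;
            i++;
        }
    }
    return sum;
}
```

Both return the same value, and in both the counter i is gone once the enclosing block ends.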
It's all the same.

Can C++ compilers optimize "if" statements inside "for" loops?

Consider an example like this:
if (flag)
for (condition)
do_something();
else
for (condition)
do_something_else();
If flag doesn't change inside the for loops, this should be semantically equivalent to:
for (condition)
if (flag)
do_something();
else
do_something_else();
Only in the first case, the code might be much longer (e.g. if several for loops are used or if do_something() is a code block that is mostly identical to do_something_else()), while in the second case, the flag gets checked many times.
I'm curious whether current C++ compilers (most importantly, g++) would be able to optimize the second example to get rid of the repeated tests inside the for loop. If so, under what conditions is this possible?
Yes, if it is determined that flag doesn't change and can't be changed by do_something or do_something_else, it can be pulled outside the loop. I've heard of this called loop hoisting, but Wikipedia has an entry called "loop invariant code motion".
If flag is a local variable, the compiler should be able to do this optimization since it's guaranteed to have no effect on the behavior of the generated code.
If flag is a global variable and you call functions inside your loop, it might not perform the optimization - it may not be able to determine whether those functions modify the global.
This can also be affected by the sort of optimization you do - optimizing for size would favor the non-hoisted version while optimizing for speed would probably favor the hoisted version.
In general, this isn't the sort of thing that you should worry about, unless profiling tells you that the function is a hotspot and you see that less than efficient code is actually being generated by going over the assembly the compiler outputs. Micro-optimizations like this you should always just leave to the compiler unless you absolutely have to.
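If profiling does show the test matters, the hoisting can be done by hand. The sketch below uses two hypothetical functions: one tests the flag on every iteration, the other tests it once and duplicates the loop. Whether a given compiler performs this transformation itself is not guaranteed.

```cpp
#include <cassert>

// Flag tested inside the loop on every iteration.
inline int sum_flag_inside(bool flag, int count) {
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        if (flag)
            sum += i;
        else
            sum -= i;
    }
    return sum;
}

// Flag hoisted by hand: tested once, at the price of two copies of the loop.
inline int sum_flag_hoisted(bool flag, int count) {
    int sum = 0;
    if (flag) {
        for (int i = 0; i < count; ++i)
            sum += i;
    } else {
        for (int i = 0; i < count; ++i)
            sum -= i;
    }
    return sum;
}
```

The two functions return the same result for any flag and count; the second trades code size for one fewer test per iteration.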
Tried with GCC and -O3:
void foo();
void bar();
int main()
{
bool doesnt_change = true;
for (int i = 0; i != 3; ++i) {
if (doesnt_change) {
foo();
}
else {
bar();
}
}
}
Result for main:
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
call ___main
call __Z3foov
call __Z3foov
call __Z3foov
xorl %eax, %eax
leave
ret
So it does optimize away the choice (and unrolls smaller loops).
This optimization is not done if doesnt_change is global.
I'm sure if the compiler can determine that the flag will remain constant, it can do some shuffling:
const bool flag = /* ... */;
for (..;..;..)
{
if (flag)
{
// ...
}
else
{
// ...
}
}
If the flag is not const, the compiler cannot necessarily optimize the loop, because it can't be sure flag won't change. It can if it does static analysis, but not all compilers do, I think. const is the sure-fire way of telling the compiler the flag won't change, after that it's up to the compiler.
As usual, profile and find out if it's really a problem.
I would be wary to say that it will. Can it guarantee that the value won't be modified by this, or another thread?
That said, the second version of the code is generally more readable and it would probably be the last thing to optimize in a block of code.
As many have said: it depends.
If you want to be sure, you should try to force a compile-time decision. Templates often come in handy for this:
for (condition)
do_it<flag>();
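Note that do_it<flag>() only compiles when flag is a compile-time constant. For a flag known only at run time, a common pattern is to test it once and dispatch to the matching instantiation. This is a minimal sketch; accumulate_signed and accumulate_dispatch are hypothetical names.

```cpp
#include <cassert>

// Flag is a template parameter, so each instantiation contains only one
// side of the branch once the compiler folds the constant condition.
template <bool Flag>
inline int accumulate_signed(int count) {
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        if (Flag)          // compile-time constant inside each instantiation
            sum += i;
        else
            sum -= i;
    }
    return sum;
}

// Runtime flag: tested once, outside the loop, to choose the instantiation.
inline int accumulate_dispatch(bool flag, int count) {
    return flag ? accumulate_signed<true>(count)
                : accumulate_signed<false>(count);
}
```

The cost is two instantiations of the loop in the binary, which is the same size trade-off as splitting the loop by hand.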
Generally, yes. But there is no guarantee, and the places where the compiler will do it are probably rare.
What most compilers do without a problem is hoisting immutable evaluations out of the loop, e.g. if your condition is
if (a<b) ....
when a and b are not affected by the loop, the comparison will be made once before the loop.
This means that if the compiler can determine the condition does not change, the test is cheap and the jump well predicted. This in turn means the test itself costs one cycle or no cycle at all (really).
In which cases would splitting the loop be beneficial?
a) a very tight loop where the 1 cycle is a significant cost
b) the entire loop with both parts does not fit the code cache
Now, the compiler can only make assumptions about the code cache, and usually can order the code in a way that one branch will fit the cache.
Without any testing, I'd expect a) to be the only case where such an optimization would be applied, because it's not always the better choice:
In which cases would splitting the loop be bad?
When splitting the loop increases code size beyond the code cache, you will take a significant hit. Now, that only affects you if the loop itself is called within another loop, but that's something the compiler usually can't determine.
[edit]
I couldn't get VC9 to split the following loop (one of the few cases where it might actually be beneficial)
extern volatile int vflag = 0;
int foo(int count)
{
int sum = 0;
int flag = vflag;
for(int i=0; i<count; ++i)
{
if (flag)
sum += i;
else
sum -= i;
}
return sum;
}
[edit 2]
note that with int flag = true; the second branch does get optimized away. (and no, const doesn't make a difference here ;))
What does that mean? Either it doesn't support that, it doesn't matter, or my analysis is wrong ;-)
Generally, I'd assume this is an optimization that is valuable only in very few cases, and can be done by hand easily in most scenarios.
It's called a loop invariant and the optimization is called loop invariant code motion and also code hoisting. The fact that it's in a conditional will definitely make the code analysis more complex and the compiler may or may not invert the loop and the conditional depending on how clever the optimizer is.
There is a general answer for any specific case of this kind of question, and that's to compile your program and look at the generated code.