Does optimization remove unnecessary `cend()` calls? [duplicate] - c++

I've been working through "Exceptional C++" by Herb Sutter lately, and I have serious doubts about a particular recommendation he gives in Item 6 - Temporary Objects.
He asks the reader to find the unnecessary temporary objects in the following code:
string FindAddr(list<Employee> emps, string name)
{
    for (list<Employee>::iterator i = emps.begin(); i != emps.end(); i++)
    {
        if( *i == name )
        {
            return i->addr;
        }
    }
    return "";
}
As one of the examples, he recommends precomputing the value of emps.end() before the loop, since a temporary object is created on every iteration:
For most containers (including list), calling end() returns a temporary object that must be constructed and destroyed. Because the value will not change, recomputing (and reconstructing and redestroying) it on every loop iteration is both needlessly inefficient and unaesthetic. The value should be computed only once, stored in a local object, and reused.
And he suggests replacing the loop with the following:
list<Employee>::const_iterator end(emps.end());
for (list<Employee>::const_iterator i = emps.begin(); i != end; ++i)
For me, this is an unnecessary complication. Even if one replaces the ugly type declarations with compact auto, one still gets two lines of code instead of one, and the end variable leaks into the outer scope.
I was sure modern compilers would optimize this piece of code anyway, because I'm actually using const_iterator here and it is easy to check whether the loop body accesses the container at all. Compilers have gotten smarter over the last 13 years, right?
Anyway, I will prefer the first version with i != emps.end() in most cases, where I'm not that worried about performance. But I want to know for sure: is this the kind of construction I can rely on a compiler to optimize?
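For concreteness, here is the two-line variant I have in mind with auto; this is my own sketch (C++11, using cbegin/cend as in the title), not code from the book, and the Employee stand-in is just the minimum needed to compile:

#include <list>
#include <string>
using namespace std;

struct Employee : public string { string addr; };  // minimal stand-in for the book's class

string FindAddr(const list<Employee>& emps, const string& name)
{
    const auto end = emps.cend();          // hoisted once, as Sutter suggests
    for (auto i = emps.cbegin(); i != end; ++i)
        if (*i == name)
            return i->addr;
    return "";
}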
Update
Thanks for your suggestions on how to make this useless code better. Please note, my question is about the compiler, not about programming techniques. The only relevant answers so far are from NPE and Ellioh.

UPD: The book you are speaking about was published in 1999, unless I'm mistaken. That's 14 years ago, and in modern programming 14 years is a long time. Many recommendations that were good and reliable in 1999 may be completely obsolete by now. Though my answer is about a single compiler and a single platform, there is also a more general idea.
Caring about extra variables, reusing the return value of trivial methods, and similar tricks of old C++ is a step back towards the C++ of the 1990s. Trivial methods like end() should be inlined quite well, and the result of inlining should be optimized as part of the code it is called from. 99% of situations do not require manual actions such as creating an end variable at all. Such things should be done only if:
You KNOW that on some of the compilers/platforms you have to support, the code is not optimized well.
It has become a bottleneck in your program ("avoid premature optimization").
I've looked at what is generated by 64-bit g++:
gcc version 4.6.3 20120918 (prerelease) (Ubuntu/Linaro 4.6.3-10ubuntu1)
Initially I thought that with optimizations on it should be OK and there should be no difference between the two versions. But it looks like things are strange: the version you considered non-optimal is actually better. I think the moral is: there is no reason to try to be smarter than the compiler. Let's see both versions.
#include <list>
using namespace std;
int main() {
    list<char> l;
    l.push_back('a');
    for(list<char>::iterator i=l.begin(); i != l.end(); i++)
        ;
    return 0;
}
int main1() {
    list<char> l;
    l.push_back('a');
    list<char>::iterator e=l.end();
    for(list<char>::iterator i=l.begin(); i != e; i++)
        ;
    return 0;
}
Then we compile this with optimizations on (I use 64-bit g++; you may try your compiler) and disassemble main and main1:
For main:
(gdb) disas main
Dump of assembler code for function main():
0x0000000000400650 <+0>: push %rbx
0x0000000000400651 <+1>: mov $0x18,%edi
0x0000000000400656 <+6>: sub $0x20,%rsp
0x000000000040065a <+10>: lea 0x10(%rsp),%rbx
0x000000000040065f <+15>: mov %rbx,0x10(%rsp)
0x0000000000400664 <+20>: mov %rbx,0x18(%rsp)
0x0000000000400669 <+25>: callq 0x400630 <_Znwm@plt>
0x000000000040066e <+30>: cmp $0xfffffffffffffff0,%rax
0x0000000000400672 <+34>: je 0x400678 <main()+40>
0x0000000000400674 <+36>: movb $0x61,0x10(%rax)
0x0000000000400678 <+40>: mov %rax,%rdi
0x000000000040067b <+43>: mov %rbx,%rsi
0x000000000040067e <+46>: callq 0x400610 <_ZNSt8__detail15_List_node_base7_M_hookEPS0_@plt>
0x0000000000400683 <+51>: mov 0x10(%rsp),%rax
0x0000000000400688 <+56>: cmp %rbx,%rax
0x000000000040068b <+59>: je 0x400698 <main()+72>
0x000000000040068d <+61>: nopl (%rax)
0x0000000000400690 <+64>: mov (%rax),%rax
0x0000000000400693 <+67>: cmp %rbx,%rax
0x0000000000400696 <+70>: jne 0x400690 <main()+64>
0x0000000000400698 <+72>: mov %rbx,%rdi
0x000000000040069b <+75>: callq 0x400840 <std::list<char, std::allocator<char> >::~list()>
0x00000000004006a0 <+80>: add $0x20,%rsp
0x00000000004006a4 <+84>: xor %eax,%eax
0x00000000004006a6 <+86>: pop %rbx
0x00000000004006a7 <+87>: retq
Look at the instructions at 0x0000000000400690-0x0000000000400696 (after the loop setup at 0x0000000000400683). That's the loop, and it seems to be perfectly optimized:
0x0000000000400690 <+64>: mov (%rax),%rax
0x0000000000400693 <+67>: cmp %rbx,%rax
0x0000000000400696 <+70>: jne 0x400690 <main()+64>
For main1:
(gdb) disas main1
Dump of assembler code for function main1():
0x00000000004007b0 <+0>: push %rbp
0x00000000004007b1 <+1>: mov $0x18,%edi
0x00000000004007b6 <+6>: push %rbx
0x00000000004007b7 <+7>: sub $0x18,%rsp
0x00000000004007bb <+11>: mov %rsp,%rbx
0x00000000004007be <+14>: mov %rsp,(%rsp)
0x00000000004007c2 <+18>: mov %rsp,0x8(%rsp)
0x00000000004007c7 <+23>: callq 0x400630 <_Znwm@plt>
0x00000000004007cc <+28>: cmp $0xfffffffffffffff0,%rax
0x00000000004007d0 <+32>: je 0x4007d6 <main1()+38>
0x00000000004007d2 <+34>: movb $0x61,0x10(%rax)
0x00000000004007d6 <+38>: mov %rax,%rdi
0x00000000004007d9 <+41>: mov %rsp,%rsi
0x00000000004007dc <+44>: callq 0x400610 <_ZNSt8__detail15_List_node_base7_M_hookEPS0_@plt>
0x00000000004007e1 <+49>: mov (%rsp),%rdi
0x00000000004007e5 <+53>: cmp %rbx,%rdi
0x00000000004007e8 <+56>: je 0x400818 <main1()+104>
0x00000000004007ea <+58>: mov %rdi,%rax
0x00000000004007ed <+61>: nopl (%rax)
0x00000000004007f0 <+64>: mov (%rax),%rax
0x00000000004007f3 <+67>: cmp %rbx,%rax
0x00000000004007f6 <+70>: jne 0x4007f0 <main1()+64>
0x00000000004007f8 <+72>: mov (%rdi),%rbp
0x00000000004007fb <+75>: callq 0x4005f0 <_ZdlPv@plt>
0x0000000000400800 <+80>: cmp %rbx,%rbp
0x0000000000400803 <+83>: je 0x400818 <main1()+104>
0x0000000000400805 <+85>: nopl (%rax)
0x0000000000400808 <+88>: mov %rbp,%rdi
0x000000000040080b <+91>: mov (%rdi),%rbp
0x000000000040080e <+94>: callq 0x4005f0 <_ZdlPv@plt>
0x0000000000400813 <+99>: cmp %rbx,%rbp
0x0000000000400816 <+102>: jne 0x400808 <main1()+88>
0x0000000000400818 <+104>: add $0x18,%rsp
0x000000000040081c <+108>: xor %eax,%eax
0x000000000040081e <+110>: pop %rbx
0x000000000040081f <+111>: pop %rbp
0x0000000000400820 <+112>: retq
The code for the loop itself is similar:
0x00000000004007f0 <+64>: mov (%rax),%rax
0x00000000004007f3 <+67>: cmp %rbx,%rax
0x00000000004007f6 <+70>: jne 0x4007f0 <main1()+64>
But there is a lot of extra stuff around the loop. Apparently, the extra code has made things WORSE.

I've compiled the following slightly hacky code using g++ 4.7.2 with -O3 -std=c++11, and got identical assembly for both functions:
#include <list>
#include <string>
using namespace std;
struct Employee: public string { string addr; };
string FindAddr1(list<Employee> emps, string name)
{
    for (list<Employee>::const_iterator i = emps.begin(); i != emps.end(); i++)
    {
        if( *i == name )
        {
            return i->addr;
        }
    }
    return "";
}
string FindAddr2(list<Employee> emps, string name)
{
    list<Employee>::const_iterator end(emps.end());
    for (list<Employee>::const_iterator i = emps.begin(); i != end; i++)
    {
        if( *i == name )
        {
            return i->addr;
        }
    }
    return "";
}
In any event, I think the choice between the two versions should be made primarily on grounds of readability. Without profiling data, micro-optimizations like this look premature to me.

Contrary to popular belief, I don't see any difference between VC++ and gcc in this respect. I did a quick check with both g++ 4.7.2 and MS C++ 17 (aka VC++ 2012).
In both cases I compared the code generated with the code as in the question (with headers and such added to let it compile), to the following code:
string FindAddr(list<Employee> emps, string name)
{
    auto end = emps.end();
    for (list<Employee>::iterator i = emps.begin(); i != end; i++)
    {
        if( *i == name )
        {
            return i->addr;
        }
    }
    return "";
}
In both cases the result was essentially identical for the two pieces of code. VC++ includes line-number comments in the code, which changed because of the extra line, but that was the only difference. With g++ the output files were identical.
Doing the same with std::vector instead of std::list, gave pretty much the same result -- no significant difference. For some reason, g++ did switch the order of operands for one instruction, from cmp esi, DWORD PTR [eax+4] to cmp DWORD PTR [eax+4], esi, but (again) this is utterly irrelevant.
Bottom line: no, you're not likely to gain anything from manually hoisting the code out of the loop with a modern compiler (at least with optimization enabled -- I was using /O2b2 with VC++ and -O3 with g++; comparing with optimization turned off seems pretty pointless to me).

A couple of things... the first is that in general the cost of constructing an iterator (in release mode, with unchecked iterators) is minimal. Iterators are usually thin wrappers around a pointer. With checked iterators (the default in VS) you might pay some cost, but if you really need the performance, rebuild with unchecked iterators after testing.
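For completeness, a sketch of how one would do that rebuild with MSVC; the macro is my addition (the standard knob since VS2010), not something the answer spells out:

// Define before including any standard header (MSVC-specific).
// 0 = unchecked iterators, 2 = full debug checks (the VS debug default).
#define _ITERATOR_DEBUG_LEVEL 0
#include <list>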
The code need not be as ugly as what you posted:
for (list<Employee>::const_iterator it=emps.begin(), end=emps.end();
it != end; ++it )
The main decision on whether to use one approach or the other should be based on what operations are being applied to the container. If the container might change size, you might want to recompute the end iterator on each iteration. If not, you can precompute it once and reuse it, as in the code above.
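A sketch of the mutating case (my example, not from the answer): with a vector, erase() invalidates the end iterator, so a hoisted copy would go stale, and the loop must re-evaluate end() on every pass:

#include <string>
#include <vector>

struct Employee { std::string name, addr; };

void DropEmptyAddrs(std::vector<Employee>& v)
{
    for (auto it = v.begin(); it != v.end(); /* no ++ here */) {
        if (it->addr.empty())
            it = v.erase(it);   // erase returns an iterator to the next element
        else
            ++it;
    }
}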

If you really need the performance, you let your shiny new C++11 compiler write it for you:
for (const auto &i : emps) {
/* ... */
}
Yes, this is tongue-in-cheek (sort of). Herb's example here is now out of date. But since your compiler doesn't support range-based for yet, let's get to the real question:
Is this a kind of construction I could rely on a compiler to optimize?
My rule of thumb is that the compiler writers are way smarter than I am. I can't rely on a compiler to optimize any one piece of code, because it might choose to optimize something else that has a bigger impact. The only way to know for sure is to try out both approaches on your compiler on your system and see what happens. Check your profiler results. If the call to .end() sticks out, save it in a separate variable. Otherwise, don't worry about it.
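In that spirit, here is a minimal harness for the "try both approaches on your system" step; this is my own sketch (names and sizes are arbitrary), not something from the answer:

#include <chrono>
#include <cstdio>
#include <list>

int main() {
    std::list<int> l(1000000, 1);
    long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (auto i = l.begin(); i != l.end(); ++i) sum += *i;        // end() in the loop
    auto t1 = std::chrono::steady_clock::now();
    for (auto i = l.begin(), e = l.end(); i != e; ++i) sum += *i; // hoisted end()
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("in-loop end(): %lld us, hoisted: %lld us (sum=%ld)\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<us>(t2 - t1).count(),
                sum);
}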

For containers like vector, end() simply returns a copy of an internally stored end pointer, and that is easily optimized. If you've written a container that does lookups or other work in its end() call, consider writing
for (list<Employee>::const_iterator i = emps.begin(), end = emps.end(); i != end; ++i)
{
...
}
for speed
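To make the point concrete, here is a hypothetical container of my own invention (not a real library type) whose end() is not a trivial pointer copy; with such a type, the hoisted form pays the extra work once instead of on every iteration:

#include <vector>

struct AuditedVec {
    std::vector<int> data;
    mutable long end_calls = 0;   // imagine a consistency check or lookup instead

    std::vector<int>::const_iterator begin() const { return data.begin(); }
    std::vector<int>::const_iterator end() const {
        ++end_calls;              // non-trivial work performed on every call
        return data.end();
    }
};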

Use std algorithms
He's right, of course; calling end() can instantiate and destroy a temporary object, which is generally bad.
Of course, the compiler can optimise this away in a lot of cases.
There is a better and more robust solution: encapsulate your loops.
The example you gave is in fact std::find, give or take the return value. Many other loops also have std algorithm equivalents, or at least something similar enough to adapt - my utility library has a transform_if implementation, for example.
So, hide loops in a function and take a const& to end. Same fix as in your example, but much, much cleaner.
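For the record, a sketch of the question's loop rewritten as std::find (my code, not the OP's; it leans on the same Employee == string comparison the original relies on):

#include <algorithm>
#include <list>
#include <string>
using namespace std;

struct Employee : public string { string addr; };

string FindAddr(const list<Employee>& emps, const string& name)
{
    // end() is evaluated exactly once, inside the algorithm
    auto it = find(emps.begin(), emps.end(), name);
    return it != emps.end() ? it->addr : "";
}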

Related

Are arguments loaded into the cache for empty functions?

I know that C++ compilers optimize away empty (static) functions.
Based on that knowledge, I wrote a piece of code that should get optimized away whenever some identifier is defined (using the -D option of the compiler).
Consider the following dummy example:
#include <iostream>
#ifdef NO_INC
struct T {
    static inline void inc(int& v, int i) {}
};
#else
struct T {
    static inline void inc(int& v, int i) {
        v += i;
    }
};
#endif
int main(int argc, char* argv[]) {
    int a = 42;
    for (int i = 0; i < argc; ++i)
        T::inc(a, i);
    std::cout << a;
}
The desired behavior would be the following:
Whenever the NO_INC identifier is defined (using -DNO_INC when compiling), all calls to T::inc(...) should be optimized away (due to the empty function body). Otherwise, the call to T::inc(...) should trigger an increment by some given value i.
I have two questions regarding this:
Is my assumption correct that calls to T::inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
I wonder if the variables (a and i) are still loaded into the cache when T::inc(a, i) is called (assuming they are not there yet) although the function body is empty.
Thanks for any advice!
Compiler Explorer is a very useful tool for looking at the assembly of your generated program; there is no other way to figure out for sure whether the compiler optimized something. Demo.
With actually incrementing, your main looks like:
main: # #main
push rax
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea esi, [rcx + rdi]
add esi, 41
jmp .LBB0_3
.LBB0_1:
mov esi, 42
.LBB0_3:
mov edi, offset std::cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
As you can see, the compiler completely inlined the call to T::inc and does the incrementing directly.
For an empty T::inc you get:
main: # #main
push rax
mov edi, offset std::cout
mov esi, 42
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
The compiler optimized away the entire loop!
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
Yes.
If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
No, for some definition of "complex". Compilers use heuristics to determine whether it's worth it to inline a function or not, and base their decision on that and on nothing else.
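As an aside, if you must override those heuristics, GCC and Clang provide an attribute for it; this is my addition (compiler-specific), not part of the answer:

struct T {
    // always_inline forces inlining even when the cost model would decline;
    // GCC emits an error if it cannot comply.
    __attribute__((always_inline)) static inline void inc(int& v, int i) {
        v += i;
    }
};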
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
No, as demonstrated above, the loop doesn't even exist.
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized? If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
You are right. I have modified your example (i.e. removed cout which clutters the assembly) in compiler explorer to make it more obvious what happens.
The compiler optimizes everything away and outputs
main: # #main
movl $42, %eax
retq
Only 42 is loaded into eax and returned.
For the more complex case, however, more instructions are needed to compute the return value. See here
main: # #main
testl %edi, %edi
jle .LBB0_1
leal -1(%rdi), %eax
leal -2(%rdi), %ecx
imulq %rax, %rcx
shrq %rcx
leal (%rcx,%rdi), %eax
addl $41, %eax
retq
.LBB0_1:
movl $42, %eax
retq
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
They are only loaded when the compiler cannot prove that they are unused. See the second example on compiler explorer.
By the way: you do not need to make an instance of T (i.e. T t;) in order to call a static member function; that defeats the purpose. Call it as T::inc(...) rather than t.inc(...).
Because the inline keyword is used, you can safely assume 1: using these functions shouldn't negatively affect performance.
Running your code through
g++ -c -Os -g
objdump -S
confirms this; an extract:
int main(int argc, char* argv[]) {
T t;
int a = 42;
1020: b8 2a 00 00 00 mov $0x2a,%eax
for (int i = 0; i < argc; ++i)
1025: 31 d2 xor %edx,%edx
1027: 39 fa cmp %edi,%edx
1029: 7d 06 jge 1031 <main+0x11>
v += i;
102b: 01 d0 add %edx,%eax
for (int i = 0; i < argc; ++i)
102d: ff c2 inc %edx
102f: eb f6 jmp 1027 <main+0x7>
t.inc(a, i);
return a;
}
1031: c3 retq
(I replaced the cout with return for better readability)

Why does an inline function have lower efficiency than an in-built function?

I was trying a question on arrays on InterviewBit. In this question I made an inline function returning the absolute value of an integer, but on submitting it I was told that my algorithm was not efficient. When I changed to using abs() from the C++ library, it got a correct-answer verdict.
Here is my function that got an inefficient verdict -
inline int abs(int x){return x>0 ? x : -x;}
int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
    int l = X.size();
    int i = 0;
    int ans = 0;
    while (i<l-1){
        ans = ans + max(abs(X[i]-X[i+1]), abs(Y[i]-Y[i+1]));
        i++;
    }
    return ans;
}
Here's the one that got the correct answer -
int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
    int l = X.size();
    int i = 0;
    int ans = 0;
    while (i<l-1){
        ans = ans + max(abs(X[i]-X[i+1]), abs(Y[i]-Y[i+1]));
        i++;
    }
    return ans;
}
Why did this happen? I thought inline functions were fastest, since no call is made. Or is the site in error? And if the site is correct, what does the C++ abs() use that is faster than my inline abs()?
I don't agree with their verdict. They are clearly wrong.
On current optimizing compilers, both solutions produce exactly the same output. And even if they didn't produce exactly the same code, the result would be as efficient as the library one (it could be a little surprising that everything matches: the algorithm, the registers used. Maybe the actual library implementation is the same as the OP's?).
No sane optimizing compiler will create a branch in your abs() code (if it can be done without a branch), as the other answer suggests. If the compiler is not optimizing, then it may not inline the library abs(), so it won't be fast either.
Optimizing abs() is one of the easiest things for a compiler to do (just add an entry for it in the peephole optimizer, and done).
Furthermore, I've seen library implementations in the past where abs() was implemented as a non-inline library function (that was a long time ago, though).
Proof that both implementations are the same:
GCC:
myabs:
mov edx, edi ; argument passed in EDI by System V AMD64 calling convention
mov eax, edi
sar edx, 31
xor eax, edx
sub eax, edx
ret
libabs:
mov edx, edi ; argument passed in EDI by System V AMD64 calling convention
mov eax, edi
sar edx, 31
xor eax, edx
sub eax, edx
ret
Clang:
myabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
neg eax
cmovl eax, edi
ret
libabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
neg eax
cmovl eax, edi
ret
Visual Studio (MSVC):
libabs:
mov eax, ecx ; argument passed in ECX by Windows 64-bit calling convention
cdq
xor eax, edx
sub eax, edx
ret 0
myabs:
mov eax, ecx ; argument passed in ECX by Windows 64-bit calling convention
cdq
xor eax, edx
sub eax, edx
ret 0
ICC:
myabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
cdq
xor edi, edx
sub edi, edx
mov eax, edi
ret
libabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
cdq
xor edi, edx
sub edi, edx
mov eax, edi
ret
See for yourself on Godbolt Compiler Explorer, where you can inspect the machine code generated by various compilers. (Link kindly provided by Peter Cordes.)
Your abs performs branching based on a condition, while the built-in variant just removes the sign bit from the integer, most likely using just a couple of instructions. A possible assembly example (taken from here):
cdq
xor eax, edx
sub eax, edx
The cdq copies the sign of the register eax to register edx. For example, if it is a positive number, edx will be zero; otherwise, edx will be 0xFFFFFFFF, which denotes -1. The xor operation with the original number changes nothing if it is a positive number (any number xor 0 is unchanged). However, when eax is negative, eax xor 0xFFFFFFFF yields (not eax). The final step is to subtract edx from eax. Again, if eax is positive, edx is zero, and the final value is still the same. For negative values, (~eax) - (-1) = -eax, which is the value wanted.
As you can see, this approach uses only three simple arithmetic instructions and no conditional branching at all.
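The same trick written out in C++, as a sketch of my own (not from the answer); it assumes 32-bit int, and note that right-shifting a negative int is implementation-defined before C++20, though all mainstream compilers shift arithmetically:

inline int branchless_abs(int x) {
    int mask = x >> 31;        // 0 for x >= 0, -1 for x < 0 (plays the role of cdq)
    return (x ^ mask) - mask;  // flip the bits for negatives, then add 1 via the subtract
}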
Edit: After some research it turned out that many built-in implementations of abs use the same approach, return __x >= 0 ? __x : -__x;, and such a pattern is an obvious target for compiler optimization to avoid unnecessary branching.
However, that does not justify the use of a custom abs implementation, as it violates the DRY principle and no one can guarantee that your implementation will be just as good for more sophisticated scenarios and/or unusual platforms. Typically one should think about rewriting library functions only when there is a definite performance problem or some other defect in the existing implementation.
Edit2: Just switching from int to float shows considerable performance degradation:
float libfoo(float x)
{
    return ::std::fabs(x);
}
andps xmm0, xmmword ptr [rip + .LCPI0_0]
And a custom version:
inline float my_fabs(float x)
{
    return x>0.0f?x:-x;
}
float myfoo(float x)
{
    return my_fabs(x);
}
movaps xmm1, xmmword ptr [rip + .LCPI1_0] # xmm1 = [-0.000000e+00,-0.000000e+00,-0.000000e+00,-0.000000e+00]
xorps xmm1, xmm0
xorps xmm2, xmm2
cmpltss xmm2, xmm0
andps xmm0, xmm2
andnps xmm2, xmm1
orps xmm0, xmm2
online compiler
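What that single andps amounts to, expressed as a C++ sketch of my own (assuming 32-bit IEEE-754 floats; memcpy is used to avoid aliasing UB):

#include <cstdint>
#include <cstring>

inline float fabs_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0x7fffffffu;               // clear the sign bit, like andps with a mask
    std::memcpy(&x, &u, sizeof x);
    return x;
}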
Your solution might arguably be "cleaner" by the textbook if it used the standard library version, but I think the evaluation is wrong. There is no truly good, justifiable reason for your code being rejected.
This is one of those cases where someone is formally correct (by the textbook), but insists on the one "correct" solution in a plainly stupid way rather than accepting an alternate solution and saying "...but this here would be best practice, you know".
Technically, it's a correct, practical approach to say "use the standard library, that's what it is for, and it's likely optimized as much as can be". Even though the "optimized as much as can be" part can, in some situations, very well be wrong due to constraints that the standard puts on certain algorithms and/or containers.
Now, opinions, best practice, and religion aside. Factually, if you compare the two approaches...
int main(int argc, char**)
{
40f360: 53 push %rbx
40f361: 48 83 ec 20 sub $0x20,%rsp
40f365: 89 cb mov %ecx,%ebx
40f367: e8 a4 be ff ff callq 40b210 <__main>
return std::abs(argc);
40f36c: 89 da mov %ebx,%edx
40f36e: 89 d8 mov %ebx,%eax
40f370: c1 fa 1f sar $0x1f,%edx
40f373: 31 d0 xor %edx,%eax
40f375: 29 d0 sub %edx,%eax
//}
int main(int argc, char**)
{
40f360: 53 push %rbx
40f361: 48 83 ec 20 sub $0x20,%rsp
40f365: 89 cb mov %ecx,%ebx
40f367: e8 a4 be ff ff callq 40b210 <__main>
return (argc > 0) ? argc : -argc;
40f36c: 89 da mov %ebx,%edx
40f36e: 89 d8 mov %ebx,%eax
40f370: c1 fa 1f sar $0x1f,%edx
40f373: 31 d0 xor %edx,%eax
40f375: 29 d0 sub %edx,%eax
//}
... they result in exactly the same, identical instructions.
But even if the compiler did use a compare followed by a conditional move (which it may do in more complicated "branching assignments" and which it will do for example in the case of min/max), that's maybe one CPU cycle or so slower than the bit hacks, so unless you do this several million times, the statement "not efficient" is kinda doubtful anyway.
One cache miss, and you have a hundred times the penalty of a conditional move.
There are valid arguments for and against either approach, which I won't discuss in length. My point is, turning down the OP's solution as "totally wrong" because of such a petty, unimportant detail is rather narrow-minded.
EDIT:
(Fun trivia)
I just tried, for fun and no profit, on my Linux Mint box which uses a somewhat older version of GCC (5.4 as compared to 7.1 above).
Due to me including <cmath> without much of a thought (hey, a function like abs very clearly belongs to math, doesn't it!) rather than <cstdlib> which hosts the integer overload, the result was, well... surprising. Calling the library function was much inferior to the single-expression wrapper.
Now, in defense of the standard library, if you include <cstdlib>, then, again, the produced output is exactly identical in either case.
For reference, the test code looked like:
#ifdef DRY
#include <cmath>
int main(int argc, char**)
{
    return std::abs(argc);
}
#else
int abs(int v) noexcept { return (v >= 0) ? v : -v; }
int main(int argc, char**)
{
    return abs(argc);
}
#endif
...resulting in
4004f0: 89 fa mov %edi,%edx
4004f2: 89 f8 mov %edi,%eax
4004f4: c1 fa 1f sar $0x1f,%edx
4004f7: 31 d0 xor %edx,%eax
4004f9: 29 d0 sub %edx,%eax
4004fb: c3 retq
Now, it is apparently quite easy to fall into the trap of unwittingly using the wrong standard library function (I demonstrated how easy it is myself!). And all that without the slightest warning from the compiler, such as "hey, you know, you're using a double overload on an integer value" (well, obviously there's no warning; it's a valid conversion).
With that in mind, there may be yet another "justification" why the OP providing his own one-liner wasn't all that terribly bad and wrong. After all, he could have made that same mistake.

Unnecessary pop instructions in functions with early if statement

While playing around with godbolt.org, I noticed that gcc (6.2, 7.0 snapshot), clang (3.9) and icc (17), when compiling something close to
int a(int* a, int* b) {
    if (b - a < 2) return *a = ~*a;
    // register intensive code here e.g. sorting network
}
compile (-O2/-O3) this into something like this:
push r15
mov rax, rcx
push r14
sub rax, rdx
push r13
push r12
push rbp
push rbx
sub rsp, 184
mov QWORD PTR [rsp], rdx
cmp rax, 7
jg .L95
not DWORD PTR [rdx]
.L162:
add rsp, 184
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
which obviously has huge overhead in the case of b - a < 2. With -Os, gcc compiles it to:
mov rax, rcx
sub rax, rdx
cmp rax, 7
jg .L74
not DWORD PTR [rdx]
ret
.L74:
This leads me to believe that there is nothing keeping the compiler from emitting the shorter code.
Is there a reason why compilers do this? Is there a way to get them to compile to the shorter version without compiling for size?
Here's an example on Godbolt that reproduces this. It seems to have something to do with the complex part being recursive.
This is a known compiler limitation, see my comments on the question. IDK why it exists; maybe it's hard for compilers to decide what they can do without spilling when they haven't finished saving regs yet.
Pulling the early-out check into a wrapper is often useful when it's small enough to inline.
Looks like modern gcc can actually sidestep this compiler limitation sometimes.
Using your example on the Godbolt compiler explorer, adding a second caller is enough to get even gcc6.1 -O2 to split the function for you, so it can inline the early-out into the second caller and into the externally visible square() (which ends with jmp square(int*, int*) [clone .part.3] if the early-out return path isn't taken).
code on Godbolt; note I added -std=gnu++14, which is required for clang to compile your code.
void square_inlinewrapper(int* a, int* b) {
//if (b - a < 16) return; // gcc inlines this part for us, and calls a private clone of the function!
return square(a, b);
}
# gcc6.1 -O2 (default / generic -march= and -mtune=)
mov rax, rsi
sub rax, rdi
cmp rax, 63
jg .L9
rep ret
.L9:
jmp square(int*, int*) [clone .part.3]
square() itself compiles to the same thing, calling the private clone which has the bulk of the code. The recursive calls from inside the clone call the wrapper function, so they don't do the extra push/pop work when it's not needed.
Even gcc7 doesn't do this when there's no other caller, even at -O3. It does still transform one of the recursive calls into a loop, but the other one just calls the big function again.
Clang 3.9 and icc17 don't clone the function, either, so you should write the inlineable wrapper manually (and change the main body of the function to use it for recursive calls, if the check is needed there).
You might want to name the wrapper square, and rename just the main body to a private name (like static void square_impl).
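A sketch of that manual arrangement (names are illustrative, following the suggestion above rather than any compiler output):

static void square_impl(int* a, int* b) {
    // ... the register-intensive body; recursive calls go through square()
    // below, so they also benefit from the inlined early-out ...
}

inline void square(int* a, int* b) {
    if (b - a < 2) { *a = ~*a; return; }  // cheap early-out, inlined at call sites
    square_impl(a, b);                    // only the heavy path pays the full prologue
}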

x86, C++, gcc and memory alignment

I have this simple C++ code:
int testFunction(int* input, long length) {
    int sum = 0;
    for (long i = 0; i < length; ++i) {
        sum += input[i];
    }
    return sum;
}
#include <stdlib.h>
#include <iostream>
using namespace std;
int main()
{
    union{
        int* input;
        char* cinput;
    };
    size_t length = 1024;
    input = new int[length];
    //cinput++;
    cout<<testFunction(input, length-1);
}
If I compile it with g++ 4.9.2 with -O3, it runs fine. I expected that if I uncommented the penultimate line it would run slower; instead, it outright crashes with SIGSEGV.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400754 in main ()
(gdb) disassemble
Dump of assembler code for function main:
0x00000000004006e0 <+0>: sub $0x8,%rsp
0x00000000004006e4 <+4>: movabs $0x100000000,%rdi
0x00000000004006ee <+14>: callq 0x400690 <_Znam@plt>
0x00000000004006f3 <+19>: lea 0x1(%rax),%rdx
0x00000000004006f7 <+23>: and $0xf,%edx
0x00000000004006fa <+26>: shr $0x2,%rdx
0x00000000004006fe <+30>: neg %rdx
0x0000000000400701 <+33>: and $0x3,%edx
0x0000000000400704 <+36>: je 0x4007cc <main+236>
0x000000000040070a <+42>: cmp $0x1,%rdx
0x000000000040070e <+46>: mov 0x1(%rax),%esi
0x0000000000400711 <+49>: je 0x4007f1 <main+273>
0x0000000000400717 <+55>: add 0x5(%rax),%esi
0x000000000040071a <+58>: cmp $0x3,%rdx
0x000000000040071e <+62>: jne 0x4007e1 <main+257>
0x0000000000400724 <+68>: add 0x9(%rax),%esi
0x0000000000400727 <+71>: mov $0x3ffffffc,%r9d
0x000000000040072d <+77>: mov $0x3,%edi
0x0000000000400732 <+82>: mov $0x3fffffff,%r8d
0x0000000000400738 <+88>: sub %rdx,%r8
0x000000000040073b <+91>: pxor %xmm0,%xmm0
0x000000000040073f <+95>: lea 0x1(%rax,%rdx,4),%rcx
0x0000000000400744 <+100>: xor %edx,%edx
0x0000000000400746 <+102>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000400750 <+112>: add $0x1,%rdx
=> 0x0000000000400754 <+116>: paddd (%rcx),%xmm0
0x0000000000400758 <+120>: add $0x10,%rcx
0x000000000040075c <+124>: cmp $0xffffffe,%rdx
0x0000000000400763 <+131>: jbe 0x400750 <main+112>
0x0000000000400765 <+133>: movdqa %xmm0,%xmm1
0x0000000000400769 <+137>: lea -0x3ffffffc(%r9),%rcx
---Type <return> to continue, or q <return> to quit---
Why does it crash? Is it a compiler bug? Am I causing some undefined behavior? Does the compiler expect that ints are always 4-byte-aligned?
I also tested it on clang and there's no crash.
Here's g++'s assembly output: http://pastebin.com/CJdCDCs4
The code input = new int[length]; cinput++; causes undefined behaviour because the second statement is reading from a union member that is not active.
Even ignoring that, testFunction(input, length-1) would again have undefined behaviour for the same reason.
Even ignoring that, the sum loop accesses an object through a glvalue of the wrong type, which has undefined behaviour.
Even ignoring that, reading from an uninitialized object, as your sum loop does, would again have undefined behaviour.
gcc has vectorized the loop with SSE instructions. paddd (like most SSE instructions) requires 16 byte alignment. I haven't looked at the code previous to paddd in detail but I expect that it assumes 4 byte alignment initially, iterates with scalar code (where misalignment only incurs a performance penalty, not a crash) until it can assume 16 byte alignment, then enters the SIMD loop, processing 4 ints at a time. By adding an offset of 1 byte you are breaking the precondition of 4 byte alignment for the array of ints, and after that all bets are off. If you're going to be doing nasty stuff with misaligned data (and I highly recommend you don't) then you should disable automatic vectorization (gcc -fno-tree-vectorize).
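As a small illustration (mine, not from the answer), the precondition the vectorized loop depends on can be spelled out like this; paddd with a memory operand requires the 16-byte case:

#include <cstddef>
#include <cstdint>

inline bool aligned_to(const void* p, std::size_t n) {
    // true when p sits on an n-byte boundary (n a power of two)
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}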
The instruction that crashed is paddd (you highlighted it). The name is short for "packed add doubleword" (see e.g. here) - it is a part of the SSE instruction set. These instructions require aligned pointers; for example, the link above has a description of exceptions that paddd may cause:
GP(0)
...(128-bit operations only)
If a memory operand is not aligned on a 16-byte boundary, regardless of segment.
This is exactly your case. The compiler arranged the code in such a way that it could use these fast 128-bit operations like paddd, and you subverted it with your union trick.
I would guess that the code generated by clang doesn't use SSE, so it's not sensitive to alignment. If so, it's also probably much slower (but you won't notice it with just 1024 iterations).

Performance of pIter != cont.end() in for loop
