x86, C++, gcc and memory alignment

I have this simple C++ code:
int testFunction(int* input, long length) {
    int sum = 0;
    for (long i = 0; i < length; ++i) {
        sum += input[i];
    }
    return sum;
}
#include <stdlib.h>
#include <iostream>
using namespace std;
int main()
{
    union {
        int* input;
        char* cinput;
    };
    size_t length = 1024;
    input = new int[length];
    //cinput++;
    cout << testFunction(input, length - 1);
}
If I compile it with g++ 4.9.2 with -O3, it runs fine. I expected that uncommenting the penultimate line would make it run slower; instead, it outright crashes with SIGSEGV.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400754 in main ()
(gdb) disassemble
Dump of assembler code for function main:
0x00000000004006e0 <+0>: sub $0x8,%rsp
0x00000000004006e4 <+4>: movabs $0x100000000,%rdi
0x00000000004006ee <+14>: callq 0x400690 <_Znam@plt>
0x00000000004006f3 <+19>: lea 0x1(%rax),%rdx
0x00000000004006f7 <+23>: and $0xf,%edx
0x00000000004006fa <+26>: shr $0x2,%rdx
0x00000000004006fe <+30>: neg %rdx
0x0000000000400701 <+33>: and $0x3,%edx
0x0000000000400704 <+36>: je 0x4007cc <main+236>
0x000000000040070a <+42>: cmp $0x1,%rdx
0x000000000040070e <+46>: mov 0x1(%rax),%esi
0x0000000000400711 <+49>: je 0x4007f1 <main+273>
0x0000000000400717 <+55>: add 0x5(%rax),%esi
0x000000000040071a <+58>: cmp $0x3,%rdx
0x000000000040071e <+62>: jne 0x4007e1 <main+257>
0x0000000000400724 <+68>: add 0x9(%rax),%esi
0x0000000000400727 <+71>: mov $0x3ffffffc,%r9d
0x000000000040072d <+77>: mov $0x3,%edi
0x0000000000400732 <+82>: mov $0x3fffffff,%r8d
0x0000000000400738 <+88>: sub %rdx,%r8
0x000000000040073b <+91>: pxor %xmm0,%xmm0
0x000000000040073f <+95>: lea 0x1(%rax,%rdx,4),%rcx
0x0000000000400744 <+100>: xor %edx,%edx
0x0000000000400746 <+102>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000400750 <+112>: add $0x1,%rdx
=> 0x0000000000400754 <+116>: paddd (%rcx),%xmm0
0x0000000000400758 <+120>: add $0x10,%rcx
0x000000000040075c <+124>: cmp $0xffffffe,%rdx
0x0000000000400763 <+131>: jbe 0x400750 <main+112>
0x0000000000400765 <+133>: movdqa %xmm0,%xmm1
0x0000000000400769 <+137>: lea -0x3ffffffc(%r9),%rcx
---Type <return> to continue, or q <return> to quit---
Why does it crash? Is it a compiler bug? Am I causing some undefined behavior? Does the compiler expect that ints are always 4-byte-aligned?
I also tested it on clang and there's no crash.
Here's g++'s assembly output: http://pastebin.com/CJdCDCs4

The code input = new int[length]; cinput++; causes undefined behaviour because the second statement is reading from a union member that is not active.
Even ignoring that, testFunction(input, length-1) would again have undefined behaviour for the same reason.
Even ignoring that, the sum loop accesses an object through a glvalue of the wrong type, which has undefined behaviour.
Even ignoring that, reading from an uninitialized object, as your sum loop does, would again have undefined behaviour.
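To illustrate (a sketch of mine, not the question's code): even without the union you can form the same misaligned pointer via reinterpret_cast, and the final dereference is still undefined behaviour, which is this answer's point:
int demo() {
    int* input = new int[1024];
    char* cinput = reinterpret_cast<char*>(input);        // the cast itself is well-defined
    int* misaligned = reinterpret_cast<int*>(cinput + 1); // one byte past an int boundary
    return *misaligned;  // still UB: a misaligned access that doesn't name an int object
}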

gcc has vectorized the loop with SSE instructions. paddd (like most SSE instructions) requires 16 byte alignment. I haven't looked at the code previous to paddd in detail but I expect that it assumes 4 byte alignment initially, iterates with scalar code (where misalignment only incurs a performance penalty, not a crash) until it can assume 16 byte alignment, then enters the SIMD loop, processing 4 ints at a time. By adding an offset of 1 byte you are breaking the precondition of 4 byte alignment for the array of ints, and after that all bets are off. If you're going to be doing nasty stuff with misaligned data (and I highly recommend you don't) then you should disable automatic vectorization (gcc -fno-tree-vectorize).
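If you genuinely need to sum ints starting at a misaligned address, a well-defined sketch (my code, not from the question) is to copy each element out with memcpy; compilers typically lower this to a single unaligned load (e.g. movdqu rather than an aligned paddd memory operand once vectorized):
#include <cstring>

int sumMisaligned(const char* bytes, long count) {
    int sum = 0;
    for (long i = 0; i < count; ++i) {
        int v;
        std::memcpy(&v, bytes + i * sizeof(int), sizeof v);  // unaligned-safe load
        sum += v;
    }
    return sum;
}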

The instruction that crashed is paddd (you highlighted it). The name is short for "packed add doubleword" (see e.g. here) - it is a part of the SSE instruction set. These instructions require aligned pointers; for example, the link above has a description of exceptions that paddd may cause:
GP(0)
...(128-bit operations only)
If a memory operand is not aligned on a 16-byte boundary, regardless of segment.
This is exactly your case. The compiler arranged the code in such a way that it could use these fast 128-bit operations like paddd, and you subverted it with your union trick.
I'd guess the code generated by clang doesn't use SSE, so it's not sensitive to alignment. If so, it's probably also much slower (but you won't notice that with just 1024 iterations).
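For reference, SSE also has unaligned loads. A minimal intrinsics sketch (my code, assuming SSE2) of an alignment-tolerant pattern:
#include <emmintrin.h>  // SSE2 intrinsics

__m128i sum4(const int* p) {
    // _mm_loadu_si128 tolerates any alignment; _mm_load_si128 (and paddd with
    // a memory operand) faults on an address that isn't 16-byte aligned.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    return _mm_add_epi32(v, v);  // paddd on registers: no alignment requirement
}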

Related

Which is faster, a struct, or a primitive variable containing the same bytes?

Here is an example piece of code:
#include <stdint.h>
#include <iostream>

typedef struct {
    uint16_t low;
    uint16_t high;
} __attribute__((packed)) A;

typedef uint32_t B;

int main() {
    // simply to make the answer unknowable at compile time
    uint16_t input;
    std::cin >> input;

    A a = {15, input};
    B b = 0x000f0000 + input;
    // a equals b

    int resultA = a.low - a.high;
    int resultB = b&0xffff - (b>>16)&0xffff;

    // use the variables so the optimiser doesn't get rid of everything
    return resultA + resultB;
}
Both resultA and resultB calculate the exact same thing - but which is faster (assuming you don't know the answer at compile time).
I tried using Compiler Explorer to look at the output, and I got something, but with any optimisation enabled it outsmarted me no matter what I tried and optimised the whole calculation away (at first it optimised everything away, since the results weren't used). I tried using cin to make the answer unknowable at runtime, but then I couldn't even figure out how it was getting the answer at all (I think it still managed to work it out at compile time?).
Here is the output of Compiler Explorer with no optimisation flag:
push rbp
mov rbp, rsp
sub rsp, 32
mov dword ptr [rbp - 4], 0
movabs rdi, offset std::cin
lea rsi, [rbp - 6]
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
mov word ptr [rbp - 16], 15
mov ax, word ptr [rbp - 6]
mov word ptr [rbp - 14], ax
movzx eax, word ptr [rbp - 6]
add eax, 983040
mov dword ptr [rbp - 20], eax
Begin calculating result A
movzx eax, word ptr [rbp - 16]
movzx ecx, word ptr [rbp - 14]
sub eax, ecx
mov dword ptr [rbp - 24], eax
End of calculation
Begin calculating result B
mov eax, dword ptr [rbp - 20]
mov edx, dword ptr [rbp - 20]
shr edx, 16
mov ecx, 65535
sub ecx, edx
and eax, ecx
and eax, 65535
mov dword ptr [rbp - 28], eax
End of calculation
mov eax, dword ptr [rbp - 24]
add eax, dword ptr [rbp - 28]
add rsp, 32
pop rbp
ret
I will also post the -O1 output, but I can't make any sense of it (I'm quite new to low level assembly stuff).
main: # #main
push rax
lea rsi, [rsp + 6]
mov edi, offset std::cin
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
movzx ecx, word ptr [rsp + 6]
mov eax, ecx
and eax, -16
sub eax, ecx
add eax, 15
pop rcx
ret
Something to consider. While doing operations with the integer is slightly harder, simply accessing the value as one integer is easier than with the struct (which you'd have to reassemble with bit shifts, I think?). Does this make a difference?
This originally came up in the context of memory, where I saw someone map a memory address to a struct with a field for the low bits and the high bits. I thought this couldn't possibly be faster than simply using an integer of the right size and bitshifting if you need the low or high bits. In this specific situation - which is faster?
[Why did I add C to the tag list? While the example code I used is in C++, the concept of struct vs variable is very applicable to C too]
Other than the fact that some ABIs require that structs be passed differently than integers, there won't be a difference.
Now, there are important semantic differences between two 16 bit ints and one 32 bit int. If you add to the lower 16 bit int, it will not "overflow" into the higher one, while if you add to the lower 16 bits of a 32 bit int, it will. This difference in possible behavior (even if you, yourself, "know" it could not happen in your code) could change what assembly code is generated by your compiler, and impact performance.
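A minimal sketch of that semantic difference (my example, not from the question):
#include <stdint.h>

void carryDemo() {
    uint32_t b = 0x0000FFFF;
    b += 1;     // b == 0x00010000: the carry propagates into the high half

    struct { uint16_t low, high; } a = {0xFFFF, 0};
    a.low += 1; // a.low wraps to 0; a.high stays 0
}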
Which of the two would be faster is not knowable without actually testing, or without a full description of the exact problem. So it is a toss-up.
Which means the only real concern is the ABI one. This means that, without whole-program optimization, a function taking a struct and a function taking an int with the same binary layout will make different assumptions about where the data is.
This only matters for by-value single arguments however.
The 90/10 rule applies; 90% of your code runs for less than 10% of the time. The odds are this will have no impact on your critical path.
When trying to answer questions of performance, examining unoptimized code is largely irrelevant.
As a matter of fact, even examining the results of -O1 optimization is not particularly useful, because it does not give you the best that the compiler can achieve. You should try at least -O2.
Regardless of the above, the sample code you provided is unsuitable for examination, because you should be making sure that the values of a and b are separately unknowable by the compiler. As the code stands, the compiler does not know what the value of input is, but it does know that a and b will have the same value, so it optimizes the code in ways that make it impossible to derive any useful conclusions from it.
As a general rule, compilers tend to do an exceptionally good job when dealing with structs that fit within machine words, to the point where generally, there is absolutely no performance difference between the two scenarios you are considering, and between any of the special cases you are pondering about.
Using GCC on Compiler Explorer, the version with the struct produces fewer instructions at -O3.
Code:
#include <stdint.h>

typedef struct {
    uint16_t low;
    uint16_t high;
} __attribute__((packed)) A;

typedef uint32_t B;

int f1(A a)
{
    return a.low - a.high;
}

int f2(B b)
{
    return b&0xffff - (b>>16)&0xffff;
}
Assembly:
_Z2f11A:
movzwl %di, %eax
shrl $16, %edi
subl %edi, %eax
ret
_Z2f2j:
movl %edi, %edx
movl $65535, %eax
shrl $16, %edx
subl %edx, %eax
andl %edi, %eax
ret
But this might be because the two functions don't do the same thing, since - has a higher precedence than &. When comparing a version of the B case that does the same thing as A, the exact same assembly is produced.
Code:
int f3(B b)
{
    return (b&0xffff) - ((b>>16)&0xffff);
}
Assembly:
_Z2f3j:
movzwl %di, %eax
shrl $16, %edi
subl %edi, %eax
ret
Note that the only way to find out if something is faster is to benchmark it in a real world use case.
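Along those lines, a minimal benchmarking sketch (timeIt and the volatile sink are my scaffolding, not from the question; real measurements also need warm-up and repeated runs):
#include <chrono>

template <class F>
long long timeIt(F f) {
    volatile int sink = 0;  // defeats dead-code elimination
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i)
        sink = sink + f(i);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

// usage, e.g.: timeIt([](int i) { return f3(static_cast<B>(i)); });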

Why does MSVC generate nop instructions for atomic loads on x64?

If you compile code such as
#include <atomic>

int load(std::atomic<int> *p) {
    return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
you see that MSVC generates NOP padding after each memory load:
int load(std::atomic<int> *) PROC
mov edx, DWORD PTR [rcx]
npad 1
mov eax, DWORD PTR [rcx]
npad 1
add eax, edx
ret 0
Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?
p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.
According to this: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997
the nops get inserted because of the flag /volatileMetadata, which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated. It'll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.
And compiling with /volatileMetadata- does indeed remove the npad.
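A way to compare the two modes yourself (the file name is hypothetical; /Fa is MSVC's assembly-listing flag, and /volatileMetadata- is the spelling from the linked issue):
cl /O2 /Fa load.cpp                      -> mov + npad 1 after each atomic load
cl /O2 /volatileMetadata- /Fa load.cpp   -> plain movs, no npad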

Understanding volatile asm vs volatile variable

We consider the following program, that is just timing a loop:
#include <cstdlib>

std::size_t count(std::size_t n)
{
#ifdef VOLATILEVAR
    volatile std::size_t i = 0;
#else
    std::size_t i = 0;
#endif
    while (i < n) {
#ifdef VOLATILEASM
        asm volatile("" : : : "memory");
#endif
        ++i;
    }
    return i;
}

int main(int argc, char* argv[])
{
    return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
For readability, the version with both volatile variable and volatile asm reads as follow:
#include <cstdlib>

std::size_t count(std::size_t n)
{
    volatile std::size_t i = 0;
    while (i < n) {
        asm volatile("" : : : "memory");
        ++i;
    }
    return i;
}

int main(int argc, char* argv[])
{
    return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
Compilation under g++ 8 with g++ -Wall -Wextra -g -std=c++11 -O3 loop.cpp -o loop gives roughly the following timings:
default: 0m0.001s
-DVOLATILEASM: 0m1.171s
-DVOLATILEVAR: 0m5.954s
-DVOLATILEVAR -DVOLATILEASM: 0m5.965s
The question I have is: why is that? The default version makes sense, since the loop is optimized away by the compiler. But I have a harder time understanding why -DVOLATILEVAR is so much slower than -DVOLATILEASM, since both should force the loop to run.
Compiler explorer gives the following count function for -DVOLATILEASM:
count(unsigned long):
mov rax, rdi
test rdi, rdi
je .L2
xor edx, edx
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
.L2:
ret
and for -DVOLATILEVAR (and the combined -DVOLATILEASM -DVOLATILEVAR):
count(unsigned long):
mov QWORD PTR [rsp-8], 0
mov rax, QWORD PTR [rsp-8]
cmp rdi, rax
jbe .L2
.L3:
mov rax, QWORD PTR [rsp-8]
add rax, 1
mov QWORD PTR [rsp-8], rax
mov rax, QWORD PTR [rsp-8]
cmp rax, rdi
jb .L3
.L2:
mov rax, QWORD PTR [rsp-8]
ret
What is the exact reason for that? Why does the volatile qualification of the variable prevent the compiler from generating the same loop as the one with asm volatile?
When you make i volatile you tell the compiler that something it doesn't know about can change its value. That means it is forced to load its value every time you use it, and it has to store it every time you write to it. When i is not volatile, the compiler can optimize that synchronization away.
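A tiny sketch of what that forces (my example):
volatile int v;

int twice() {
    return v + v;  // must load v from memory twice; a non-volatile int
                   // could be loaded once and kept in a register
}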
-DVOLATILEVAR forces the compiler to keep the loop counter in memory, so the loop bottlenecks on the latency of store/reload (store forwarding), ~5 cycles, plus the 1-cycle latency of an add.
Every assignment to and read from volatile int i is considered an observable side-effect of the program that the optimizer has to make happen in memory, not just a register. This is what volatile means.
There's also a reload for the compare, but that's only a throughput issue, not latency. The ~6 cycle loop carried data dependency means your CPU doesn't bottleneck on any throughput limits.
This is similar to what you'd get from -O0 compiler output, so have a look at my answer on Adding a redundant assignment speeds up code when compiled without optimization for more about loops like that, and x86 store-forwarding.
With only VOLATILEASM, the empty asm template ("") has to run the right number of times. Being empty, it doesn't add any instructions to the loop, so you're left with a 2-uop add / cmp+jne loop that can run at 1 iteration per clock on modern x86 CPUs.
Critically, the loop counter can stay in a register, despite the compiler memory barrier. A "memory" clobber is treated like a call to a non-inline function: it might read or modify any object that it might possibly have a reference to, but that does not include local variables that have never had their address escape the function (i.e. we never called sscanf("0", "%d", &i) or posix_memalign(&i, 64, 1234)). But if we had, then the "memory" barrier would have to spill / reload it, because an external function could have saved a pointer to the object.
i.e. a "memory" clobber is only a full compiler barrier for objects that could possibly be visible outside the current function. This is really only an issue when messing around and looking at compiler output to see what barriers do what, because a barrier can only matter for multi-threading correctness for variables that other threads could possibly have a pointer to.
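A sketch of that distinction (sink is a hypothetical non-inline function):
void sink(int*);  // hypothetical: might stash the pointer somewhere

void f() {
    int hidden = 0;                 // address never escapes: can stay in a register
    int escaped = 0;
    sink(&escaped);                 // address escapes here
    asm volatile("" ::: "memory");  // must assume 'escaped' changed; 'hidden' is unaffected
    ++hidden;
    ++escaped;                      // forces a reload/store of 'escaped'
}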
And BTW, your asm statement is already implicitly volatile because it has no output operands. (See Extended-Asm#Volatile in the gcc manual).
You can add a dummy output to make a non-volatile asm statement the compiler can optimize away, but unfortunately gcc still keeps the empty loop after eliminating a non-volatile asm statement from it. If i's address has escaped the function, removing the asm statement entirely turns the loop into a single compare-and-jump over a store, right before the function returns. I think it would be legal to simply return without ever storing to that local, because there's no way a correct program could know that it managed to read i from another thread before i went out of scope.
But anyway, here's the source I used. As I said, note that there's always an asm statement here, and I'm controlling whether it's volatile or not.
#include <stdlib.h>
#include <stdio.h>

#ifndef VOLATILEVAR   // compile with -DVOLATILEVAR=volatile to apply that
#define VOLATILEVAR
#endif
#ifndef VOLATILEASM   // Different from your def; yours drops the whole asm statement
#define VOLATILEASM
#endif

// note I ported this to also be valid C, but I didn't try -xc to compile as C.
size_t count(size_t n)
{
    int dummy;  // asm with no outputs is implicitly volatile
    VOLATILEVAR size_t i = 0;
    sscanf("0", "%zd", &i);
    while (i < n) {
        asm VOLATILEASM ("nop # operand = %0" : "=r"(dummy) : : "memory");
        ++i;
    }
    return i;
}
compiles (with gcc4.9 and newer -O3, neither VOLATILE enabled) to this weird asm.
(Godbolt compiler explorer with gcc and clang):
# gcc8.1 -O3 with sscanf(.., &i) but non-volatile asm
# the asm nop doesn't appear anywhere, but gcc is making clunky code.
.L8:
mov rdx, rax # i, <retval>
.L3: # first iter entry point
lea rax, [rdx+1] # <retval>,
cmp rax, rbx # <retval>, n
jb .L8 #,
Nice job, gcc.... gcc4.8 -O3 avoids pulling an extra mov inside the loop:
# gcc4.8 -O3 with sscanf(.., &i) but non-volatile asm
.L3:
add rdx, 1 # i,
cmp rbx, rdx # n, i
ja .L3 #,
mov rax, rdx # i.0, i # outside the loop
Anyway, without the dummy output operand, or with volatile, gcc8.1 gives us:
# gcc8.1 with sscanf(&i) and asm volatile("nop" ::: "memory")
.L3:
nop # operand = eax # dummy
mov rax, QWORD PTR [rsp+8] # tmp96, i
add rax, 1 # <retval>,
mov QWORD PTR [rsp+8], rax # i, <retval>
cmp rax, rbx # <retval>, n
jb .L3 #,
So we see the same store/reload of the loop counter, only difference from volatile i being the cmp doesn't need to reload it.
I used nop instead of just a comment because Godbolt hides comment-only lines by default, and I wanted to see it. For gcc, it's purely a text substitution: we're looking at the compiler's asm output with operands substituted into the template before it's sent to the assembler. For clang, there might be some effect because the asm has to be valid (i.e. actually assemble correctly).
If we comment out the scanf and remove the dummy output operand, we get a register-only loop with the nop in it. But keep the dummy output operand and the nop doesn't appear anywhere.

Does optimization remove unnecessary `cend()` calls? [duplicate]

I was getting through "Exceptional C++" by Herb Sutter lately, and I have serious doubts about a particular recommendation he gives in Item 6 - Temporary Objects.
The item challenges the reader to find unnecessary temporary objects in the following code:
string FindAddr(list<Employee> emps, string name)
{
    for (list<Employee>::iterator i = emps.begin(); i != emps.end(); i++)
    {
        if (*i == name)
        {
            return i->addr;
        }
    }
    return "";
}
As one of the examples, he recommends precomputing the value of emps.end() before the loop, since a temporary object is created on every iteration:
For most containers (including list), calling end() returns a
temporary object that must be constructed and destroyed. Because the
value will not change, recomputing (and reconstructing and
redestroying) it on every loop iteration is both needlessly
inefficient and unaesthetic. The value should be computed only once,
stored in a local object, and reused.
And he suggests replacing it with the following:
list<Employee>::const_iterator end(emps.end());
for (list<Employee>::const_iterator i = emps.begin(); i != end; ++i)
For me, this is unnecessary complication. Even if one replaces the ugly type declarations with a compact auto, one still gets two lines of code instead of one. Worse, the end variable now lives in the outer scope.
I was sure modern compilers would optimize this anyway, since I'm using const_iterator here and it is easy to check whether the loop body accesses the container somehow. Compilers have got smarter in the last 13 years, right?
Anyway, I will prefer the first version with i != emps.end() in most cases, where I'm not so worried about performance. But I want to know for sure: is this a kind of construction I can rely on a compiler to optimize?
Update
Thanks for your suggestions on how to make this useless code better. Please note, my question is about compiler, not programming techniques. The only relevant answers for now are from NPE and Ellioh.
UPD: The book you are speaking about was published in 1999, unless I'm mistaken. That's 14 years ago, and in modern programming 14 years is a lot of time. Many recommendations that were good and reliable in 1999 may be completely obsolete by now. Though my answer is about a single compiler and a single platform, there is also a more general idea.
Caring about extra variables, reusing the return value of trivial methods, and similar tricks of old C++ are a step back towards the C++ of the 1990s. Trivial methods like end() should inline well, and the result of inlining should be optimized as part of the code it is called from. 99% of situations do not require manual actions such as creating an end variable at all. Such things should be done only if:
You KNOW that on some of the compilers/platforms you should run on the code is not optimized well.
It has become a bottleneck in your program ("avoid premature optimization").
I've looked at what is generated by 64-bit g++:
gcc version 4.6.3 20120918 (prerelease) (Ubuntu/Linaro 4.6.3-10ubuntu1)
Initially I thought that with optimizations on it should be fine and there should be no difference between the two versions. But it looks like things are strange: the version you considered non-optimal is actually better. I think the moral is: there is no reason to try to be smarter than the compiler. Let's see both versions.
#include <list>
using namespace std;

int main() {
    list<char> l;
    l.push_back('a');
    for (list<char>::iterator i = l.begin(); i != l.end(); i++)
        ;
    return 0;
}

int main1() {
    list<char> l;
    l.push_back('a');
    list<char>::iterator e = l.end();
    for (list<char>::iterator i = l.begin(); i != e; i++)
        ;
    return 0;
}
Then we should compile this with optimizations on (I use 64-bit g++, you may try your compiler) and disassemble main and main1:
For main:
(gdb) disas main
Dump of assembler code for function main():
0x0000000000400650 <+0>: push %rbx
0x0000000000400651 <+1>: mov $0x18,%edi
0x0000000000400656 <+6>: sub $0x20,%rsp
0x000000000040065a <+10>: lea 0x10(%rsp),%rbx
0x000000000040065f <+15>: mov %rbx,0x10(%rsp)
0x0000000000400664 <+20>: mov %rbx,0x18(%rsp)
0x0000000000400669 <+25>: callq 0x400630 <_Znwm@plt>
0x000000000040066e <+30>: cmp $0xfffffffffffffff0,%rax
0x0000000000400672 <+34>: je 0x400678 <main()+40>
0x0000000000400674 <+36>: movb $0x61,0x10(%rax)
0x0000000000400678 <+40>: mov %rax,%rdi
0x000000000040067b <+43>: mov %rbx,%rsi
0x000000000040067e <+46>: callq 0x400610 <_ZNSt8__detail15_List_node_base7_M_hookEPS0_@plt>
0x0000000000400683 <+51>: mov 0x10(%rsp),%rax
0x0000000000400688 <+56>: cmp %rbx,%rax
0x000000000040068b <+59>: je 0x400698 <main()+72>
0x000000000040068d <+61>: nopl (%rax)
0x0000000000400690 <+64>: mov (%rax),%rax
0x0000000000400693 <+67>: cmp %rbx,%rax
0x0000000000400696 <+70>: jne 0x400690 <main()+64>
0x0000000000400698 <+72>: mov %rbx,%rdi
0x000000000040069b <+75>: callq 0x400840 <std::list<char, std::allocator<char> >::~list()>
0x00000000004006a0 <+80>: add $0x20,%rsp
0x00000000004006a4 <+84>: xor %eax,%eax
0x00000000004006a6 <+86>: pop %rbx
0x00000000004006a7 <+87>: retq
Look at the instructions at 0x0000000000400690-0x0000000000400696. That's the loop body, and it seems to be perfectly optimized:
0x0000000000400690 <+64>: mov (%rax),%rax
0x0000000000400693 <+67>: cmp %rbx,%rax
0x0000000000400696 <+70>: jne 0x400690 <main()+64>
For main1:
(gdb) disas main1
Dump of assembler code for function main1():
0x00000000004007b0 <+0>: push %rbp
0x00000000004007b1 <+1>: mov $0x18,%edi
0x00000000004007b6 <+6>: push %rbx
0x00000000004007b7 <+7>: sub $0x18,%rsp
0x00000000004007bb <+11>: mov %rsp,%rbx
0x00000000004007be <+14>: mov %rsp,(%rsp)
0x00000000004007c2 <+18>: mov %rsp,0x8(%rsp)
0x00000000004007c7 <+23>: callq 0x400630 <_Znwm@plt>
0x00000000004007cc <+28>: cmp $0xfffffffffffffff0,%rax
0x00000000004007d0 <+32>: je 0x4007d6 <main1()+38>
0x00000000004007d2 <+34>: movb $0x61,0x10(%rax)
0x00000000004007d6 <+38>: mov %rax,%rdi
0x00000000004007d9 <+41>: mov %rsp,%rsi
0x00000000004007dc <+44>: callq 0x400610 <_ZNSt8__detail15_List_node_base7_M_hookEPS0_@plt>
0x00000000004007e1 <+49>: mov (%rsp),%rdi
0x00000000004007e5 <+53>: cmp %rbx,%rdi
0x00000000004007e8 <+56>: je 0x400818 <main1()+104>
0x00000000004007ea <+58>: mov %rdi,%rax
0x00000000004007ed <+61>: nopl (%rax)
0x00000000004007f0 <+64>: mov (%rax),%rax
0x00000000004007f3 <+67>: cmp %rbx,%rax
0x00000000004007f6 <+70>: jne 0x4007f0 <main1()+64>
0x00000000004007f8 <+72>: mov (%rdi),%rbp
0x00000000004007fb <+75>: callq 0x4005f0 <_ZdlPv@plt>
0x0000000000400800 <+80>: cmp %rbx,%rbp
0x0000000000400803 <+83>: je 0x400818 <main1()+104>
0x0000000000400805 <+85>: nopl (%rax)
0x0000000000400808 <+88>: mov %rbp,%rdi
0x000000000040080b <+91>: mov (%rdi),%rbp
0x000000000040080e <+94>: callq 0x4005f0 <_ZdlPv@plt>
0x0000000000400813 <+99>: cmp %rbx,%rbp
0x0000000000400816 <+102>: jne 0x400808 <main1()+88>
0x0000000000400818 <+104>: add $0x18,%rsp
0x000000000040081c <+108>: xor %eax,%eax
0x000000000040081e <+110>: pop %rbx
0x000000000040081f <+111>: pop %rbp
0x0000000000400820 <+112>: retq
The code for the loop is similar, it is:
0x00000000004007f0 <+64>: mov (%rax),%rax
0x00000000004007f3 <+67>: cmp %rbx,%rax
0x00000000004007f6 <+70>: jne 0x4007f0 <main1()+64>
But there is a lot of extra stuff around the loop. Apparently, the extra code has made things WORSE.
I've compiled the following slightly hacky code using g++ 4.7.2 with -O3 -std=c++11, and got identical assembly for both functions:
#include <list>
#include <string>
using namespace std;

struct Employee : public string { string addr; };

string FindAddr1(list<Employee> emps, string name)
{
    for (list<Employee>::const_iterator i = emps.begin(); i != emps.end(); i++)
    {
        if (*i == name)
        {
            return i->addr;
        }
    }
    return "";
}

string FindAddr2(list<Employee> emps, string name)
{
    list<Employee>::const_iterator end(emps.end());
    for (list<Employee>::const_iterator i = emps.begin(); i != end; i++)
    {
        if (*i == name)
        {
            return i->addr;
        }
    }
    return "";
}
In any event, I think the choice between the two versions should be primarily based on grounds of readability. Without profiling data, micro-optimizations like this to me look premature.
Contrary to popular belief, I don't see any difference between VC++ and gcc in this respect. I did a quick check with both g++ 4.7.2 and MS C++ 17 (aka VC++ 2012).
In both cases I compared the code generated with the code as in the question (with headers and such added to let it compile), to the following code:
string FindAddr(list<Employee> emps, string name)
{
    auto end = emps.end();
    for (list<Employee>::iterator i = emps.begin(); i != end; i++)
    {
        if (*i == name)
        {
            return i->addr;
        }
    }
    return "";
}
In both cases the result was essentially identical for the two pieces of code. VC++ includes line-number comments in the code, which changed because of the extra line, but that was the only difference. With g++ the output files were identical.
Doing the same with std::vector instead of std::list, gave pretty much the same result -- no significant difference. For some reason, g++ did switch the order of operands for one instruction, from cmp esi, DWORD PTR [eax+4] to cmp DWORD PTR [eax+4], esi, but (again) this is utterly irrelevant.
Bottom line: no, you're not likely to gain anything from manually hoisting the code out of the loop with a modern compiler (at least with optimization enabled -- I was using /O2b2 with VC++ and /O3 with g++; comparing optimization with optimization turned off seems pretty pointless to me).
A couple of things... the first is that in general the cost of building an iterator (in release mode, with unchecked iterators) is minimal. They are usually wrappers around a pointer. With checked iterators (the default in VS) you might have some cost, but if you really need the performance, after testing, rebuild with unchecked iterators.
The code need not be as ugly as what you posted:
for (list<Employee>::const_iterator it=emps.begin(), end=emps.end();
it != end; ++it )
The main decision on whether to use one approach or the other should be based on what operations are being applied to the container. If the container might change its size, then you might want to recompute the end iterator on each iteration. If not, you can precompute it once and reuse it, as in the code above.
If you really need the performance, you let your shiny new C++11 compiler write it for you:
for (const auto &i : emps) {
/* ... */
}
Yes, this is tongue-in-cheek (sort of). Herb's example here is now out of date. But since your compiler doesn't support it yet, let's get to the real question:
Is this a kind of construction I could rely on a compiler to optimize?
My rule of thumb is that the compiler writers are way smarter than I am. I can't rely on a compiler to optimize any one piece of code, because it might choose to optimize something else that has a bigger impact. The only way to know for sure is to try out both approaches on your compiler on your system and see what happens. Check your profiler results. If the call to .end() sticks out, save it in a separate variable. Otherwise, don't worry about it.
Containers like vector return, from the end() call, a value that just stores a pointer to the end; that is easily optimized. If you've written a container that does some lookups etc. in its end() call, consider writing
for (list<Employee>::const_iterator i = emps.begin(), end = emps.end(); i != end; ++i)
{
...
}
for speed
Use std algorithms
He's right of course; calling end can instantiate and destroy a temporary object, which is generally bad.
Of course, the compiler can optimise this away in a lot of cases.
There is a better and more robust solution: encapsulate your loops.
The example you gave is in fact std::find, give or take the return value. Many other loops also have std algorithms, or at least something similar enough that you can adapt - my utility library has a transform_if implementation, for example.
So, hide loops in a function and take a const& to end. Same fix as your example, but much much cleaner.
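For example, a sketch of that std::find version (assuming, as in the test code above, that Employee compares equal to a string; taking the arguments by const reference is a further improvement over the original signature):
#include <algorithm>

string FindAddr(const list<Employee>& emps, const string& name)
{
    auto it = std::find(emps.begin(), emps.end(), name);
    return it != emps.end() ? it->addr : string();
}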
