int main(int argc, char *argv[]) {
int i = 0;
for (i = 0; i < 50; i++)
if (false) break;
}
Compiled and executed with VS 2010 (same issue in VS 2008). I put a breakpoint on the last line (the closing bracket) and inspected the variable i in the debugger. This code leaves i at 0. Why?
int main(int argc, char *argv[]) {
int i = 0;
for (i = 0; i < 50; i++)
if (false) break;;
}
After this - please notice the second semicolon after break - i is 50 as expected.
Can someone please explain this strange behaviour to me?
Looking at the generated assembly code with objdump -d -S we can see a possible reason that GDB jumps over the loop:
0000000000400584 <main>:
int main()
{
400584: 55 push %rbp
400585: 48 89 e5 mov %rsp,%rbp
volatile int i;
for (i = 0; i < 5; i++)
400588: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
40058f: eb 09 jmp 40059a <main+0x16>
400591: 8b 45 fc mov -0x4(%rbp),%eax
400594: 83 c0 01 add $0x1,%eax
400597: 89 45 fc mov %eax,-0x4(%rbp)
40059a: 8b 45 fc mov -0x4(%rbp),%eax
40059d: 83 f8 04 cmp $0x4,%eax
4005a0: 0f 9e c0 setle %al
4005a3: 84 c0 test %al,%al
4005a5: 75 ea jne 400591 <main+0xd>
4005a7: b8 00 00 00 00 mov $0x0,%eax
if (false)
break;
}
4005ac: c9 leaveq
4005ad: c3 retq
4005ae: 90 nop
4005af: 90 nop
Even though it was compiled with optimizations turned off (the -O0 flag to g++), no code is actually generated for the loop body. This might mean that GDB sees the loop as a single statement and will not step through the loop properly.
I used GCC version 4.4.5, and GDB version 7.0.1.
As originally posted (with void main), this code is invalid C++ and doesn't compile on a conforming compiler.
That said, the resulting value of i is irrelevant: the compiler can do whatever it wants with it.
The reason is that i is never read outside the loop (which itself provably has no effect) and not declared volatile so the compiler is trivially able to prove that there is no observable side-effect, no matter the value of i.
In MSVC10 this is reproducible. I checked the disassembly; it seems to be a problem in the PDB file generation. The jump instruction that goes back to the beginning of the loop is associated with the next source line, that's all.
If you step to the next line, the debugger jumps back from the return statement to the beginning of the for loop and executes the loop 50 times as expected.
I'm checking the Release build of my project built with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to compile the following code snippet:
//ncbSzBuffDataUsed of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
pDst[i] = pSrc[i];
}
as such:
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
Edit: First up, there's an intrinsic for rep movsb which Peter Cordes tells us would be much faster here and I believe him (I guess I already did). If you want to force the compiler to do things this way, see: __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.
As to why the compiler didn't do this for you, in the absence of any other ideas the answer might be register pressure. To use rep movsb, the compiler would have to:
set up rsi (= source address)
set up rdi (= destination address)
set up rcx (= count)
issue the rep movsb
So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function it will, and (on initial entry to the method, at least) rcx holds the this pointer.
Also, with the code that we see the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)
What also makes a difference is how you structure your loops. For instance:
Assuming the following struct definitions:
#define PCALLBACK ULONG64
#pragma pack(push)
#pragma pack(1)
typedef struct {
ULONG64 ui0;
USHORT w0;
USHORT w1;
//Followed by:
// PCALLBACK[] 'array' - variable size array
}DPE;
#pragma pack(pop)
(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
{
pDstClbks[i] = info.callbackFuncs[i];
}
As was mentioned elsewhere on this page, it is clear that the compiler was starved of registers when it produced the following monstrosity (see how it reuses rax for the loop end limit, or the movzx eax,word ptr [r13] instruction that could clearly have been hoisted out of the loop).
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30 add rcx,30h
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00 cmp bx,word ptr [r13]
00007FF7029327D8 73 1F jae 07FF7029327F9h
00007FF7029327DA 4C 8B C1 mov r8,rcx
00007FF7029327DD 4C 2B F1 sub r14,rcx
{
pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08 mov rax,qword ptr [r14+r8+8]
00007FF7029327E5 48 FF C3 inc rbx
00007FF7029327E8 49 89 00 mov qword ptr [r8],rax
00007FF7029327EB 4D 8D 40 08 lea r8,[r8+8]
00007FF7029327EF 41 0F B7 45 00 movzx eax,word ptr [r13]
00007FF7029327F4 48 3B D8 cmp rbx,rax
00007FF7029327F7 72 E7 jb 07FF7029327E0h
}
00007FF7029327F9 45 0F B7 C7 movzx r8d,r15w
(2) So if I re-write it into a less familiar C pattern:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs;
pDstClbks < pEndDstClbks;
pScrClbks++, pDstClbks++)
{
*pDstClbks = *pScrClbks;
}
this produces a more sensible machine code (on the same compiler, in the same function, in the same project):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30 add rcx,30h
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx eax,word ptr [rsi+88h]
00007FF71D7E27CD 48 8D 14 C1 lea rdx,[rcx+rax*8]
for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA cmp rcx,rdx
00007FF71D7E27D4 76 14 jbe 07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1 sub rsi,rcx
{
*pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08 mov rax,qword ptr [rsi+rcx+8]
00007FF71D7E27DE 48 89 01 mov qword ptr [rcx],rax
00007FF71D7E27E1 48 83 C1 08 add rcx,8
00007FF71D7E27E5 48 3B CA cmp rcx,rdx
00007FF71D7E27E8 77 EF jb 07FF71D7E27D9h
}
00007FF71D7E27EA 45 0F B7 C6 movzx r8d,r14w
I have been experimenting with improving the performance of strcmp under certain conditions. However, I unfortunately cannot even get an implementation of plain vanilla strcmp to perform as well as the library implementation.
I saw a similar question, but the answers say the difference was from the compiler optimizing away the comparison on string literals. My test does not use string literals.
Here's the implementation (comparisons.cpp)
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
if (*a == '\0') return 0;
a++;
b++;
}
return *b - *a;
}
And here's the test driver (driver.cpp):
#include "comparisons.h"
#include <array>
#include <chrono>
#include <iostream>
void init_string(char* str, int nChars) {
// 10% of strings will be equal, and 90% of strings will have one char different.
// This way, many strings will share long prefixes so strcmp has to exercise a bit.
// Using random strings still shows the custom implementation as slower (just less so).
str[nChars - 1] = '\0';
for (int i = 0; i < nChars - 1; i++)
str[i] = (i % 94) + 32;
if (rand() % 10 != 0)
str[rand() % (nChars - 1)] = 'x';
}
int main(int argc, char** argv) {
srand(1234);
// Pre-generate some strings to compare.
const int kSampleSize = 100;
std::array<char[1024], kSampleSize> strings;
for (int i = 0; i < kSampleSize; i++)
init_string(strings[i], kSampleSize);
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < kSampleSize; i++)
for (int j = 0; j < kSampleSize; j++)
strcmp(strings[i], strings[j]);
auto end = std::chrono::high_resolution_clock::now();
std::cout << "strcmp - " << (end - start).count() << std::endl;
start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < kSampleSize; i++)
for (int j = 0; j < kSampleSize; j++)
strcmp_custom(strings[i], strings[j]);
end = std::chrono::high_resolution_clock::now();
std::cout << "strcmp_custom - " << (end - start).count() << std::endl;
}
And my makefile:
CC=clang++
test: driver.o comparisons.o
$(CC) -o test driver.o comparisons.o
# Compile the test driver with optimizations off.
driver.o: driver.cpp comparisons.h
$(CC) -c -o driver.o -std=c++11 -O0 driver.cpp
# Compile the code being tested separately with optimizations on.
comparisons.o: comparisons.cpp comparisons.h
$(CC) -c -o comparisons.o -std=c++11 -O3 comparisons.cpp
clean:
rm comparisons.o driver.o test
On the advice of this answer, I compiled my comparison function in a separate compilation unit with optimizations and compiled the driver with optimizations turned off, but I still get a slowdown of about 5x.
strcmp - 154519
strcmp_custom - 506282
I also tried copying the FreeBSD implementation but got similar results.
I'm wondering if my performance measurement is overlooking something. Or is the standard library implementation doing something fancier?
I don't know which standard library you have, but just to give you an idea of how serious C library maintainers are about optimizing the string primitives, the default strcmp used by GNU libc on x86-64 is two thousand lines of hand-optimized assembly language, as of version 2.24. There are separate, also hand-optimized, versions for when the SSSE3 and SSE4.2 instruction set extensions are available. (A fair bit of the complexity in that file appears to be because the same source code is used to generate several other functions; the machine code winds up being "only" 1120 instructions.) 2.24 was released roughly a year ago, and even more work has gone into it since.
They go to this much trouble because it's common for one of the string primitives to be the single hottest function in a profile.
Excerpts from my disassembly of glibc v2.2.5, x86_64 linux:
0000000000089cd0 <strcmp@@GLIBC_2.2.5>:
89cd0: 48 8b 15 99 a1 33 00 mov 0x33a199(%rip),%rdx # 3c3e70 <_IO_file_jumps@@GLIBC_2.2.5+0x790>
89cd7: 48 8d 05 92 58 01 00 lea 0x15892(%rip),%rax # 9f570 <strerror_l@@GLIBC_2.6+0x200>
89cde: f7 82 b0 00 00 00 10 testl $0x10,0xb0(%rdx)
89ce5: 00 00 00
89ce8: 75 1a jne 89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cea: 48 8d 05 9f 48 0c 00 lea 0xc489f(%rip),%rax # 14e590 <__nss_passwd_lookup@@GLIBC_2.2.5+0x9c30>
89cf1: f7 82 80 00 00 00 00 testl $0x200,0x80(%rdx)
89cf8: 02 00 00
89cfb: 75 07 jne 89d04 <strcmp@@GLIBC_2.2.5+0x34>
89cfd: 48 8d 05 0c 00 00 00 lea 0xc(%rip),%rax # 89d10 <strcmp@@GLIBC_2.2.5+0x40>
89d04: c3 retq
89d05: 90 nop
89d06: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
89d0d: 00 00 00
89d10: 89 f1 mov %esi,%ecx
89d12: 89 f8 mov %edi,%eax
89d14: 48 83 e1 3f and $0x3f,%rcx
89d18: 48 83 e0 3f and $0x3f,%rax
89d1c: 83 f9 30 cmp $0x30,%ecx
89d1f: 77 3f ja 89d60 <strcmp##GLIBC_2.2.5+0x90>
89d21: 83 f8 30 cmp $0x30,%eax
89d24: 77 3a ja 89d60 <strcmp##GLIBC_2.2.5+0x90>
89d26: 66 0f 12 0f movlpd (%rdi),%xmm1
89d2a: 66 0f 12 16 movlpd (%rsi),%xmm2
89d2e: 66 0f 16 4f 08 movhpd 0x8(%rdi),%xmm1
89d33: 66 0f 16 56 08 movhpd 0x8(%rsi),%xmm2
89d38: 66 0f ef c0 pxor %xmm0,%xmm0
89d3c: 66 0f 74 c1 pcmpeqb %xmm1,%xmm0
89d40: 66 0f 74 ca pcmpeqb %xmm2,%xmm1
89d44: 66 0f f8 c8 psubb %xmm0,%xmm1
89d48: 66 0f d7 d1 pmovmskb %xmm1,%edx
89d4c: 81 ea ff ff 00 00 sub $0xffff,%edx
...
The real thing is 1183 lines of assembly, with lots of potential cleverness about detecting system features and vectorized instructions. libc maintainers know that they can get an edge by just optimizing some of the functions called thousands of times by applications.
For comparison, your version at -O3:
comparisons.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <_Z13strcmp_customPKcS0_>:
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
0: 8a 0e mov (%rsi),%cl
2: 8a 07 mov (%rdi),%al
4: 38 c1 cmp %al,%cl
6: 75 1e jne 26 <_Z13strcmp_customPKcS0_+0x26>
if (*a == '\0') return 0;
8: 48 ff c6 inc %rsi
b: 48 ff c7 inc %rdi
e: 66 90 xchg %ax,%ax
10: 31 c0 xor %eax,%eax
12: 84 c9 test %cl,%cl
14: 74 18 je 2e <_Z13strcmp_customPKcS0_+0x2e>
int strcmp_custom(const char* a, const char* b) {
while (*b == *a) {
16: 0f b6 0e movzbl (%rsi),%ecx
19: 0f b6 07 movzbl (%rdi),%eax
1c: 48 ff c6 inc %rsi
1f: 48 ff c7 inc %rdi
22: 38 c1 cmp %al,%cl
24: 74 ea je 10 <_Z13strcmp_customPKcS0_+0x10>
26: 0f be d0 movsbl %al,%edx
29: 0f be c1 movsbl %cl,%eax
if (*a == '\0') return 0;
a++;
b++;
}
return *b - *a;
2c: 29 d0 sub %edx,%eax
}
2e: c3 retq
I have a very long (in number of iterations) for loop, and I like to make it possible to personalize some of its parts. The code looks as following:
void expensive_loop( void (*do_true)(int), void (*do_false)(int)){
for(int i=0; i<VeryLargeN; i++){
int element=elements[i];
// long computation that produces a boolean condition
if (condition){
do_true(element);
}else{
do_false(element);
}
}
}
Now, the problem is that every time do_true and do_false are called, there is call overhead (the push/pop on the stack) that ruins the high performance of the code.
To solve this I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. That would make the code impossible to maintain.
So, how does someone make the internal part of an iteration customizable while still maintaining high performance?
Note that the function accepts pointers to functions, so those get called through a pointer. The optimizer may inline those calls through the function pointers if the definitions of expensive_loop and those functions are available and the compiler inlining limits have not been breached.
Another option is to make this algorithm a function template that accepts callable objects (function pointers, objects with a call operator, lambdas), just like standard algorithms do. This way the compiler may have more optimization opportunities. E.g.:
template<class DoTrue, class DoFalse>
void expensive_loop(DoTrue do_true, DoFalse do_false) {
// Original function body here.
}
There is -Winline compiler switch for g++:
-Winline
Warn if a function can not be inlined and it was declared as inline. Even with this option, the compiler will not warn about failures to inline functions declared in system headers.
The compiler uses a variety of heuristics to determine whether or not to inline a function. For example, the compiler takes into account the size of the function being inlined and the amount of inlining that has already been done in the current function. Therefore, seemingly insignificant changes in the source program can cause the warnings produced by -Winline to appear or disappear.
It probably does not warn about a function not being inlined when it is called through a pointer.
The problem is that the function address (what actually gets set in do_true and do_false) is not resolved until link time, when there are not many opportunities for optimization.
If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can declare your function with C++ templates, so that the compiler knows exactly which functions you want to call at that time.
struct function_one {
void operator()( int element ) {
}
};
extern int elements[];
extern bool condition();
template < typename DoTrue, typename DoFalse >
void expensive_loop(){
DoTrue do_true;
DoFalse do_false;
for(int i=0; i<50; i++){
int element=elements[i];
// long computation that produce a boolean condition
if (condition()){
do_true(element); // call DoTrue's operator()
}else{
do_false(element); // call DoFalse's operator()
}
}
}
int main( int argc, char* argv[] ) {
expensive_loop<function_one,function_one>();
return 0;
}
The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.
For the example shown, note how the function object is empty.
The compiler just strips away the function call and leaves the loop:
main:
push rbx
mov ebx, 50
.L2:
call condition()
sub ebx, 1
jne .L2
xor eax, eax
pop rbx
ret
See example in https://godbolt.org/g/hV52Nn
Using function pointers as in your example may not inline the function calls. This is the assembler produced for expensive_loop in a program where expensive_loop
// File A.cpp
void foo( int arg );
void bar( int arg );
extern bool condition();
extern int elements[];
void expensive_loop( void (*do_true)(int), void (*do_false)(int)){
for(int i=0; i<50; i++){
int element=elements[i];
// long computation that produce a boolean condition
if (condition()){
do_true(element);
}else{
do_false(element);
}
}
}
int main( int argc, char* argv[] ) {
expensive_loop( foo, bar );
return 0;
}
and the functions passed by argument
// File B.cpp
#include <math.h>
int elements[50];
bool condition() {
return elements[0] == 1;
}
void foo( int arg ) {
elements[1] = arg % 3;
}
void bar( int arg ) {
elements[2] = 1234 % arg;
}
are defined in different translation units.
0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
400620: 41 55 push %r13
400622: 49 89 fd mov %rdi,%r13
400625: 41 54 push %r12
400627: 49 89 f4 mov %rsi,%r12
40062a: 55 push %rbp
40062b: 53 push %rbx
40062c: bb 60 10 60 00 mov $0x601060,%ebx
400631: 48 83 ec 08 sub $0x8,%rsp
400635: eb 19 jmp 400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
400637: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40063e: 00 00
400640: 48 83 c3 04 add $0x4,%rbx
400644: 41 ff d5 callq *%r13
400647: 48 81 fb 28 11 60 00 cmp $0x601128,%rbx
40064e: 74 1d je 40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
400650: 8b 2b mov (%rbx),%ebp
400652: e8 79 ff ff ff callq 4005d0 <condition()>
400657: 84 c0 test %al,%al
400659: 89 ef mov %ebp,%edi
40065b: 75 e3 jne 400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
40065d: 48 83 c3 04 add $0x4,%rbx
400661: 41 ff d4 callq *%r12
400664: 48 81 fb 28 11 60 00 cmp $0x601128,%rbx
40066b: 75 e3 jne 400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
40066d: 48 83 c4 08 add $0x8,%rsp
400671: 5b pop %rbx
400672: 5d pop %rbp
400673: 41 5c pop %r12
400675: 41 5d pop %r13
400677: c3 retq
400678: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
40067f: 00
You can see how the calls are still performed even when using -O3 optimization level:
400644: 41 ff d5 callq *%r13
This question already has answers here:
Can we change the value of an object defined with const through pointers?
(11 answers)
Closed 6 years ago.
Not a Duplicate. Please read Full question.
#include<iostream>
using namespace std;
int main()
{
const int a = 5;
const int *ptr1 = &a;
int *ptr = (int *)ptr1;
*ptr = 10;
cout<<ptr<<" = "<<*ptr<<endl;
cout<<ptr1<<" = "<<*ptr1<<endl;
cout<<&a<<" = "<<a;
return 0;
}
Output:
0x7ffe13455fb4 = 10
0x7ffe13455fb4 = 10
0x7ffe13455fb4 = 5
How is this possible?
You shouldn't rely on undefined behaviour. Look at what the compiler does with your code, particularly the last part:
cout<<&a<<" = "<<a;
b6: 48 8d 45 ac lea -0x54(%rbp),%rax
ba: 48 89 c2 mov %rax,%rdx
bd: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # c4 <main+0xc4>
c4: e8 00 00 00 00 callq c9 <main+0xc9>
c9: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx # d0 <main+0xd0>
d0: 48 89 c1 mov %rax,%rcx
d3: e8 00 00 00 00 callq d8 <main+0xd8>
d8: ba 05 00 00 00 mov $0x5,%edx <=== direct insert of 5 in the register to display 5
dd: 48 89 c1 mov %rax,%rcx
e0: e8 00 00 00 00 callq e5 <main+0xe5>
return 0;
e5: b8 00 00 00 00 mov $0x0,%eax
ea: 90 nop
eb: 48 83 c4 48 add $0x48,%rsp
ef: 5b pop %rbx
f0: 5d pop %rbp
f1: c3 retq
When the compiler sees a constant expression, it can decide (it's implementation-dependent) to replace it with the actual value.
In this particular case, g++ did that without even the -O1 option!
When you invoke undefined behavior anything is possible.
In this case, you are casting the constness away with this line:
int *ptr = (int *)ptr1;
And you're lucky enough that there is an address on the stack to be changed, that explains why the first two prints output a 10.
The third print outputs a 5 because the compiler optimized it by hardcoding a 5, on the assumption that a wouldn't be changed.
It is certainly undefined behavior, but I am a strong proponent of understanding the symptoms of undefined behavior for the benefit of spotting it. The observed results can be explained in the following manner:
const int a = 5;
defines an integer constant. The compiler now assumes that the value will never be modified for the duration of the whole function, so when it sees
cout<<&a<<" = "<<a;
it doesn't generate code to reload the current value of a; instead it just uses the number it was initialized with, which is much faster than loading from memory.
This is a very common optimization technique - when a certain condition can only happen when the program exhibits undefined behavior, optimizers assume that condition never happens.
I have a vector:
vector<Body*> Bodies;
And it contains pointers to Body objects that I have defined.
I also have an unsigned int const that contains the number of Body objects I wish to have in Bodies.
unsigned int const NumParticles = 1000;
I have populated Bodies with NumParticles Body objects.
Now if I wish to iterate through the vector, for example invoking each Body's Update() function, I have two choices on what I can do:
First:
for (unsigned int i = 0; i < NumParticles; i++)
{
Bodies.at(i)->Update();
}
Or second:
for (unsigned int i = 0; i < Bodies.size(); i++)
{
Bodies.at(i)->Update();
}
There are pros and cons to each. I would like to know which one (if either) would be the better practice, in terms of safety, readability and convention.
I expect that, given that the compiler (at least in this case) can inline all the relevant std::vector code, the generated code will be identical [aside from 1000 being a true constant literal in the machine code, while Bodies.size() will be a "variable" value].
Short summary of findings:
The compiler doesn't call a function for size() of a vector for every iteration, it calculates that in the beginning of the loop, and uses it as a "constant value".
Actual code IN the loop is identical, only the preparation of the loop is different.
As always: If performance is highly important, measure on your system with your data and your compiler. Otherwise, write the code that makes the most sense for your design (I prefer using for(auto i : vec), as that is easy and straightforward [and works for all the containers])
Supporting evidence:
After fetching coffee, I wrote this code:
class X
{
public:
void Update() { x++; }
operator int() { return x; }
private:
int x = rand();
};
extern std::vector<X*> vec;
const size_t vec_size = 1000;
void Process1()
{
for(auto i : vec)
{
i->Update();
}
}
void Process2()
{
for(size_t i = 0; i < vec.size(); i++)
{
vec[i]->Update();
}
}
void Process3()
{
for(size_t i = 0; i < vec_size; i++)
{
vec[i]->Update();
}
}
(along with a main function that fills the array and calls Process1(), Process2() and Process3() - the main is in a separate file to avoid the compiler deciding to inline everything and making it hard to tell what is what)
Here's the code generated by g++ 4.9.2:
0000000000401940 <_Z8Process1v>:
401940: 48 8b 0d a1 18 20 00 mov 0x2018a1(%rip),%rcx # 6031e8 <vec+0x8>
401947: 48 8b 05 92 18 20 00 mov 0x201892(%rip),%rax # 6031e0 <vec>
40194e: 48 39 c1 cmp %rax,%rcx
401951: 74 14 je 401967 <_Z8Process1v+0x27>
401953: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
401958: 48 8b 10 mov (%rax),%rdx
40195b: 48 83 c0 08 add $0x8,%rax
40195f: 83 02 01 addl $0x1,(%rdx)
401962: 48 39 c1 cmp %rax,%rcx
401965: 75 f1 jne 401958 <_Z8Process1v+0x18>
401967: f3 c3 repz retq
0000000000401970 <_Z8Process2v>:
401970: 48 8b 35 69 18 20 00 mov 0x201869(%rip),%rsi # 6031e0 <vec>
401977: 48 8b 0d 6a 18 20 00 mov 0x20186a(%rip),%rcx # 6031e8 <vec+0x8>
40197e: 31 c0 xor %eax,%eax
401980: 48 29 f1 sub %rsi,%rcx
401983: 48 c1 f9 03 sar $0x3,%rcx
401987: 48 85 c9 test %rcx,%rcx
40198a: 74 14 je 4019a0 <_Z8Process2v+0x30>
40198c: 0f 1f 40 00 nopl 0x0(%rax)
401990: 48 8b 14 c6 mov (%rsi,%rax,8),%rdx
401994: 48 83 c0 01 add $0x1,%rax
401998: 83 02 01 addl $0x1,(%rdx)
40199b: 48 39 c8 cmp %rcx,%rax
40199e: 75 f0 jne 401990 <_Z8Process2v+0x20>
4019a0: f3 c3 repz retq
00000000004019b0 <_Z8Process3v>:
4019b0: 48 8b 05 29 18 20 00 mov 0x201829(%rip),%rax # 6031e0 <vec>
4019b7: 48 8d 88 40 1f 00 00 lea 0x1f40(%rax),%rcx
4019be: 66 90 xchg %ax,%ax
4019c0: 48 8b 10 mov (%rax),%rdx
4019c3: 48 83 c0 08 add $0x8,%rax
4019c7: 83 02 01 addl $0x1,(%rdx)
4019ca: 48 39 c8 cmp %rcx,%rax
4019cd: 75 f1 jne 4019c0 <_Z8Process3v+0x10>
4019cf: f3 c3 repz retq
Whilst the assembly code looks slightly different for each of those cases, in practice I'd say you'd be hard pushed to measure the difference between those loops; in fact, a run of perf on the code shows that it's "the same time for all loops" [this is with 100000 elements and 100 calls to Process1, Process2 and Process3 in a loop; otherwise the time was dominated by new X in main]:
31.29% a.out a.out [.] Process1
31.28% a.out a.out [.] Process3
31.13% a.out a.out [.] Process2
Unless you think 1/10th of a percent is significant (and it may be for something that takes a week to run, but this is only a few tenths of a second [0.163 seconds on my machine], and probably more measurement error than anything else), the shorter time is actually the one that in theory should be slowest: Process2, using vec.size(). I did another run with a higher loop count, and now the measurement for each of the loops is within 0.01% of the others - in other words, identical in time spent.
Of course, if you look carefully, you will see that the actual loop content for all three variants is essentially identical, except for the early part of Process3, which is simpler because the compiler knows that we will do at least one iteration - Process1 and Process2 have to check whether the vector is empty before the first iteration. This would make a difference for VERY short vector lengths.
I would vote for for range:
for (auto* body : Bodies)
{
body->Update();
}
NumParticles is not a property of the vector; it is a constant external to it. I would prefer to use the vector's own size() property. In this case the code is safer and clearer for the reader.
Using some constant instead of size() usually signals to the reader that, in general, the constant may be unequal to the size().
Thus if you want to tell the reader that you are going to process all elements of the vector, it is better to use size(). Otherwise use the constant.
Of course there are exceptions to this implicit rule, when the emphasis is on the constant; in that case it is better to use the constant. But it depends on the context.
I would suggest using the .size() function instead of defining a new constant.
Why?
Safety : Since .size() does not throw any exceptions, it is perfectly safe to use .size().
Readability : IMHO, Bodies.size() conveys the size of the vector Bodies more clearly than NumParticles.
Convention : According to conventions too, it is better to use .size() as it is a property of the vector, instead of the variable NumParticles.
Performance: .size() is a constant complexity member function, so there is no significant performance difference between using a const int and .size().
I prefer this form:
for (auto const& it : Bodies)
{
it->Update();
}