This question already has answers here: Can we change the value of an object defined with const through pointers? (11 answers). Closed 6 years ago.
Not a duplicate. Please read the full question.
#include <iostream>
using namespace std;

int main()
{
    const int a = 5;
    const int *ptr1 = &a;
    int *ptr = (int *)ptr1;
    *ptr = 10;
    cout << ptr << " = " << *ptr << endl;
    cout << ptr1 << " = " << *ptr1 << endl;
    cout << &a << " = " << a;
    return 0;
}
Output:
0x7ffe13455fb4 = 10
0x7ffe13455fb4 = 10
0x7ffe13455fb4 = 5
How is this possible?
You shouldn't rely on undefined behaviour. Look at what the compiler does with your code, particularly the last part:
cout<<&a<<" = "<<a;
b6: 48 8d 45 ac lea -0x54(%rbp),%rax
ba: 48 89 c2 mov %rax,%rdx
bd: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # c4 <main+0xc4>
c4: e8 00 00 00 00 callq c9 <main+0xc9>
c9: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx # d0 <main+0xd0>
d0: 48 89 c1 mov %rax,%rcx
d3: e8 00 00 00 00 callq d8 <main+0xd8>
d8: ba 05 00 00 00 mov $0x5,%edx <=== direct insert of 5 in the register to display 5
dd: 48 89 c1 mov %rax,%rcx
e0: e8 00 00 00 00 callq e5 <main+0xe5>
return 0;
e5: b8 00 00 00 00 mov $0x0,%eax
ea: 90 nop
eb: 48 83 c4 48 add $0x48,%rsp
ef: 5b pop %rbx
f0: 5d pop %rbp
f1: c3 retq
When the compiler sees a constant expression, it can decide (it is implementation-dependent) to replace it with the actual value.
In this particular case, g++ did that without even the -O1 option!
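For reference, a listing like the one above can be produced with something along these lines (assuming the source file is named main.cpp):

g++ -c main.cpp && objdump -d -C main.o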
When you invoke undefined behavior, anything is possible.
In this case, you are casting the constness away with this line:
int *ptr = (int *)ptr1;
And you're lucky enough that there is an address on the stack to be changed, which explains why the first two prints output a 10.
The third print outputs a 5 because the compiler optimized it by hardcoding the 5, on the assumption that a wouldn't be changed.
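For what it's worth, here is a minimal sketch of how to defeat that particular optimization (modifying a const object remains undefined behavior, so this only illustrates the mechanism): declaring a as volatile forces the compiler to reload it on every access instead of hardcoding the 5.

volatile const int a = 5;        // volatile: every read must touch memory
volatile const int *ptr1 = &a;
int *ptr = (int *)ptr1;          // the C-style cast strips const and volatile
*ptr = 10;                       // still undefined behavior!
cout << &a << " = " << a;        // in practice now prints 10, not 5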
It is certainly undefined behavior, but I am a strong proponent of understanding the symptoms of undefined behavior for the benefit of spotting it. The results observed can be explained in the following manner:
const int a = 5;
defines an integer constant. The compiler now assumes that its value will never be modified for the duration of the whole function, so when it sees
cout<<&a<<" = "<<a;
it doesn't generate code to reload the current value of a; instead it just uses the number it was initialized with, which is much faster than loading from memory.
This is a very common optimization technique: when a certain condition can only happen when the program exhibits undefined behavior, optimizers assume that condition never happens.
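A classic standalone example of the same principle, easy to verify with g++ -O2 (a sketch, not taken from the question's code): signed integer overflow is undefined, so the optimizer is allowed to assume it never occurs and fold the whole comparison away.

// Signed overflow is UB, so the compiler may assume x + 1 > x always holds.
bool next_is_greater(int x) {
    return x + 1 > x;   // g++ -O2 typically compiles this to: return true;
}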
Related
Consider this simple function:
struct Foo {
    int a;
    int b;
    int c;
    int d;
    int e;
    int f;
};

Foo foo() {
    Foo f;
    f.a = 1;
    f.b = 2;
    f.c = 3;
    f.d = 4;
    f.e = 5;
    f.f = 6;
    return f;
}
It generates the following assembly:
0000000000400500 <foo()>:
400500: 48 ba 01 00 00 00 02 movabs rdx,0x200000001
400507: 00 00 00
40050a: 48 b9 03 00 00 00 04 movabs rcx,0x400000003
400511: 00 00 00
400514: 48 be 05 00 00 00 06 movabs rsi,0x600000005
40051b: 00 00 00
40051e: 48 89 17 mov QWORD PTR [rdi],rdx
400521: 48 89 4f 08 mov QWORD PTR [rdi+0x8],rcx
400525: 48 89 77 10 mov QWORD PTR [rdi+0x10],rsi
400529: 48 89 f8 mov rax,rdi
40052c: c3 ret
40052d: 0f 1f 00 nop DWORD PTR [rax]
Based on the assembly, I understand that the caller created space for Foo on its stack, and passed that information in rdi to the callee.
I am trying to find documentation for this convention. The calling convention on Linux states that rdi contains the first integer argument, but in this case foo doesn't have any arguments.
Moreover, if I make foo take one integer argument, that argument is now passed in rsi (the register for the second argument), with rdi used for the address of the return object.
Can anyone provide some documentation and clarity on how rdi is used in system V ABI?
See section 3.2.3 Parameter Passing in the ABI docs, which says:
If the type has class MEMORY, then the caller provides space for the
return value and passes the address of this storage in %rdi as if it
were the first argument to the function. In effect, this address
becomes a "hidden" first argument.
On return %rax will contain the address that has been passed in by the
caller in %rdi.
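In other words, the compiler lowers the function roughly as if it had been written with an explicit pointer parameter. The following is only an illustrative sketch (hypothetical name foo_lowered, reusing the question's Foo; this is not real generated code):

// What `Foo foo()` effectively becomes when Foo is of class MEMORY:
// the caller allocates the Foo and passes its address in %rdi.
Foo* foo_lowered(Foo* hidden_ret) {
    hidden_ret->a = 1;
    hidden_ret->b = 2;
    hidden_ret->c = 3;
    hidden_ret->d = 4;
    hidden_ret->e = 5;
    hidden_ret->f = 6;
    return hidden_ret;   // the same address comes back in %rax
}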
I wrote code with two functions, returnValues and returnValuesVoid. One returns a tuple of two values; the other takes its arguments by reference and modifies them.
#include <iostream>
#include <tuple>
std::tuple<int, int> returnValues(const int a, const int b) {
    return std::tuple(a, b);
}

void returnValuesVoid(int &a, int &b) {
    a += 100;
    b += 100;
}

int main() {
    auto [x, y] = returnValues(10, 20);
    std::cout << x;
    std::cout << y;

    int a = 10, b = 20;
    returnValuesVoid(a, b);
    std::cout << a;
    std::cout << b;
}
I read about structured bindings (http://en.cppreference.com/w/cpp/language/structured_binding), which can destructure a tuple into auto [x,y] variables.
Is auto [x,y] = returnValues(10,20); better than passing by reference? As far as I know it's slower, because it has to return a tuple object, whereas the reference version works directly on the original variables passed to the function, so there's no reason to use it except cleaner code.
Since auto [x,y] only exists from C++17 on, do people use it in production? It looks cleaner than returnValuesVoid, which returns void, but does it have any other advantages over passing by reference?
Look at the disassembly (compiled with GCC -O3).
It takes more instructions to implement the tuple call:
0000000000000000 <returnValues(int, int)>:
0: 83 c2 64 add $0x64,%edx
3: 83 c6 64 add $0x64,%esi
6: 48 89 f8 mov %rdi,%rax
9: 89 17 mov %edx,(%rdi)
b: 89 77 04 mov %esi,0x4(%rdi)
e: c3 retq
f: 90 nop
0000000000000010 <returnValuesVoid(int&, int&)>:
10: 83 07 64 addl $0x64,(%rdi)
13: 83 06 64 addl $0x64,(%rsi)
16: c3 retq
But fewer instructions for the tuple caller:
0000000000000000 <callTuple()>:
0: 48 83 ec 18 sub $0x18,%rsp
4: ba 14 00 00 00 mov $0x14,%edx
9: be 0a 00 00 00 mov $0xa,%esi
e: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
13: e8 00 00 00 00 callq 18 <callTuple()+0x18> // call returnValues
18: 8b 74 24 0c mov 0xc(%rsp),%esi
1c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
23: e8 00 00 00 00 callq 28 <callTuple()+0x28> // std::cout::operator<<
28: 8b 74 24 08 mov 0x8(%rsp),%esi
2c: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
33: e8 00 00 00 00 callq 38 <callTuple()+0x38> // std::cout::operator<<
38: 48 83 c4 18 add $0x18,%rsp
3c: c3 retq
3d: 0f 1f 00 nopl (%rax)
0000000000000040 <callRef()>:
40: 48 83 ec 18 sub $0x18,%rsp
44: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
49: 48 8d 7c 24 08 lea 0x8(%rsp),%rdi
4e: c7 44 24 08 0a 00 00 movl $0xa,0x8(%rsp)
55: 00
56: c7 44 24 0c 14 00 00 movl $0x14,0xc(%rsp)
5d: 00
5e: e8 00 00 00 00 callq 63 <callRef()+0x23> // call returnValuesVoid
63: 8b 74 24 08 mov 0x8(%rsp),%esi
67: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
6e: e8 00 00 00 00 callq 73 <callRef()+0x33> // std::cout::operator<<
73: 8b 74 24 0c mov 0xc(%rsp),%esi
77: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7e: e8 00 00 00 00 callq 83 <callRef()+0x43> // std::cout::operator<<
83: 48 83 c4 18 add $0x18,%rsp
87: c3 retq
I don't think there is any considerable performance difference, but the tuple one is clearer and more readable.
I also tried inlined calls; there is absolutely no difference at all. Both of them generate exactly the same assembly code.
0000000000000000 <callTuple()>:
0: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
7: 48 83 ec 08 sub $0x8,%rsp
b: be 6e 00 00 00 mov $0x6e,%esi
10: e8 00 00 00 00 callq 15 <callTuple()+0x15>
15: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
1c: be 78 00 00 00 mov $0x78,%esi
21: 48 83 c4 08 add $0x8,%rsp
25: e9 00 00 00 00 jmpq 2a <callTuple()+0x2a> // TCO, optimized way to call a function and also return
2a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000030 <callRef()>:
30: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
37: 48 83 ec 08 sub $0x8,%rsp
3b: be 6e 00 00 00 mov $0x6e,%esi
40: e8 00 00 00 00 callq 45 <callRef()+0x15>
45: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi
4c: be 78 00 00 00 mov $0x78,%esi
51: 48 83 c4 08 add $0x8,%rsp
55: e9 00 00 00 00 jmpq 5a <callRef()+0x2a> // TCO, optimized way to call a function and also return
Focus on what's more readable and which approach gives the reader better intuition, and please keep whatever performance issues you think might arise in the background.
A function that returns a tuple (or a pair, a struct, etc.) tells the reader plainly that it returns something, and that something almost always has a meaning the caller must take into account.
A function that hands back its results in variables passed by reference may slip past the attention of a tired reader.
So, in general, prefer to return results in a tuple.
Mike van Dyke pointed to this link:
F.21: To return multiple "out" values, prefer returning a tuple or struct
Reason
A return value is self-documenting as an "output-only"
value. Note that C++ does have multiple return values, by convention
of using a tuple (including pair), possibly with the extra convenience
of tie at the call site.
[...]
Exception
Sometimes, we need to pass an object to a function to manipulate its state. In such cases, passing the object by reference T& is usually the right technique.
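Following that guideline, a small named struct is often even clearer than a tuple, because the results get names at both ends. A minimal sketch (hypothetical MinMax type, not from the question's code):

#include <algorithm>

struct MinMax { int min; int max; };

MinMax minmax(int a, int b) {
    return {std::min(a, b), std::max(a, b)};
}

// At the call site, structured bindings keep it terse:
//   auto [lo, hi] = minmax(20, 10);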
Using another compiler (VS 2017), the resulting code shows no difference, as the function calls are simply optimized away.
int main() {
00007FF6A9C51E50 sub rsp,28h
auto [x,y] = returnValues(10,20);
std::cout << x ;
00007FF6A9C51E54 mov edx,0Ah
00007FF6A9C51E59 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << y ;
00007FF6A9C51E5E mov edx,14h
00007FF6A9C51E63 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
int a = 10, b = 20;
returnValuesVoid(a, b);
std::cout << a ;
00007FF6A9C51E68 mov edx,6Eh
00007FF6A9C51E6D call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
std::cout << b ;
00007FF6A9C51E72 mov edx,78h
00007FF6A9C51E77 call std::basic_ostream<char,std::char_traits<char> >::operator<< (07FF6A9C51F60h)
}
00007FF6A9C51E7C xor eax,eax
00007FF6A9C51E7E add rsp,28h
00007FF6A9C51E82 ret
So using clearer code seems to be the obvious choice.
What Zang said is true but not quite on point. I ran the code provided in the question with chrono to measure time, and I think the answer needs to be amended after observing what happened.
For 1M iterations, the time taken by the call via references was 3 ms, while the time taken by the call via std::tie combined with std::tuple was about 94 ms.
Though the difference seems small in practice, the tuple version will still perform slightly slower. Hence, for performance-intensive systems, I suggest using call by reference.
My code:
#include <iostream>
#include <tuple>
#include <chrono>
std::tuple<int, int> returnValues(const int a, const int b)
{
    return std::tuple<int, int>(a, b);
}

void returnValuesVoid(int &a, int &b)
{
    a += 100;
    b += 100;
}

int main()
{
    int a = 10, b = 20;

    auto begin = std::chrono::high_resolution_clock::now();
    int x, y;
    for (int i = 0; i < 1000000; i++)
    {
        std::tie(x, y) = returnValues(a, b);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count()) << '\n';

    a = 10;
    b = 20;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000; i++)
    {
        returnValuesVoid(a, b);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << double(std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()) << '\n';
}
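Note that micro-benchmarks like this are very sensitive to the optimization level; as the earlier answers showed, with inlining enabled the two versions can compile to identical code. For reference, a typical build command (hypothetical file name) would be:

g++ -std=c++17 -O2 bench.cpp -o bench && ./bench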
This page recommends "loop unrolling" as an optimization:
Loop overhead can be reduced by reducing the number of iterations and
replicating the body of the loop.
Example:
In the code fragment below, the body of the loop can be replicated
once and the number of iterations can be reduced from 100 to 50.
for (i = 0; i < 100; i++)
    g ();

Below is the code fragment after loop unrolling.

for (i = 0; i < 100; i += 2)
{
    g ();
    g ();
}
With GCC 5.2, loop unrolling isn't enabled unless you use -funroll-loops (it's not enabled in either -O2 or -O3). I've inspected the assembly to see if there's a significant difference.
g++ -std=c++14 -O3 -funroll-loops -c -Wall -pedantic -pthread main.cpp && objdump -d main.o
Version 1:
0: ba 64 00 00 00 mov $0x64,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
# ... etc ...
a1: 83 c1 01 add $0x1,%ecx
a4: 83 ea 0a sub $0xa,%edx
a7: 89 0d 00 00 00 00 mov %ecx,0x0(%rip) # ad <main+0xad>
ad: 0f 85 55 ff ff ff jne 8 <main+0x8>
b3: 31 c0 xor %eax,%eax
b5: c3 retq
Version 2:
0: ba 32 00 00 00 mov $0x32,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
11: 89 05 00 00 00 00 mov %eax,0x0(%rip) # 17 <main+0x17>
17: 8b 0d 00 00 00 00 mov 0x0(%rip),%ecx # 1d <main+0x1d>
1d: 83 c1 01 add $0x1,%ecx
# ... etc ...
143: 83 c7 01 add $0x1,%edi
146: 83 ea 0a sub $0xa,%edx
149: 89 3d 00 00 00 00 mov %edi,0x0(%rip) # 14f <main+0x14f>
14f: 0f 85 b3 fe ff ff jne 8 <main+0x8>
155: 31 c0 xor %eax,%eax
157: c3 retq
Version 2 produces more iterations. What am I missing?
Yes, there are cases where loop unrolling will make the code more efficient.
The theory is to reduce loop overhead (the branch back to the top of the loop and the increment of the loop counter).
Most processors hate branch instructions; they love data processing instructions. Every iteration costs at least one branch instruction. By "duplicating" a set of code, the number of branches is reduced and the number of data processing instructions between branches is increased.
Many modern compilers have optimization settings to perform loop unrolling.
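With GCC, for instance, -funroll-loops enables it globally, and GCC 8 and later also accept a per-loop hint; a small sketch (the pragma is a real GCC extension, the loop body is the question's illustrative fragment):

// Ask GCC (8+) to unroll this specific loop by a factor of 4.
#pragma GCC unroll 4
for (i = 0; i < 100; i++)
    g ();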
It doesn’t produce more iterations; you’ll notice that the loop that calls g() twice runs half as many times. (What if you have to call g() an odd number of times? Look up Duff’s Device.)
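For the odd-count case, the usual trick (short of full Duff's Device) is to peel off the leftover iteration before entering the unrolled loop; a minimal sketch, assuming some function g exists:

void g();

// Call g() exactly n times, with the loop body unrolled by 2.
void call_n_times(int n)
{
    int i = 0;
    if (n % 2 != 0) {    // peel the odd iteration first
        g();
        i = 1;
    }
    for (; i < n; i += 2) {
        g();
        g();
    }
}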
In your listings, you'll notice that the assembly-language instruction jne 8 <main+0x8> appears once in both. This tells the processor to go back to the start of the loop. In the original loop, this instruction will run 99 times. In the unrolled loop, it will only run 49 times. Imagine if the body of the loop is something very short, just one or two instructions. These jumps might be a third or even half of the instructions in the most performance-critical part of your program! (And there is even a useful loop with zero instructions: BogoMIPS. But the article about optimizing that was a joke.)
So, unrolling the loop trades code size for speed, right? Not so fast. Maybe you've made your unrolled loop so big that the code at the top of the loop is no longer in the cache, and the CPU needs to fetch it again. In the real world, the only way to know whether it helps is to profile your program.
I have the following code:
#include <stdio.h>
#include <string.h>

void function(char *str)
{
    int i;
    char buffer[strlen(str) + 1];
    strcpy(buffer, str);
    buffer[strlen(str)] = '\0';
    printf("Buffer: %s\n", buffer);
}
I would expect this code to throw a compile time error, as the 'buffer' being allocated on the stack has a runtime dependent length (based on strlen()). However in GCC the compilation passes. How does this work? Is the buffer dynamically allocated, or if it is still stack local, what is the size allocated?
C99 allows variable length arrays. Even if you do not compile your code as C99, it will not give any error, because GCC also allows variable length arrays as an extension.
6.19 Arrays of Variable Length:
Variable-length automatic arrays are allowed in ISO C99, and as an extension GCC accepts them in C90 mode and in C++.
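(Conversely, if you want GCC to flag such arrays instead of silently accepting them, the -Wvla option warns on every variable length array; the file name here is hypothetical:)

gcc -std=c99 -Wvla -c function.c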
By disassembling your function you could easily verify this:
$ objdump -S <yourprogram>
...
void function(char *str)
{
4011a0: 55 push %ebp
4011a1: 89 e5 mov %esp,%ebp
4011a3: 53 push %ebx
4011a4: 83 ec 24 sub $0x24,%esp
4011a7: 89 e0 mov %esp,%eax
4011a9: 89 c3 mov %eax,%ebx
int i;
char buffer[strlen(str) + 1];
4011ab: 8b 45 08 mov 0x8(%ebp),%eax
4011ae: 89 04 24 mov %eax,(%esp)
4011b1: e8 42 01 00 00 call 4012f8 <_strlen>
4011b6: 83 c0 01 add $0x1,%eax
4011b9: 89 c2 mov %eax,%edx
4011bb: 83 ea 01 sub $0x1,%edx
4011be: 89 55 f4 mov %edx,-0xc(%ebp)
4011c1: ba 10 00 00 00 mov $0x10,%edx
4011c6: 83 ea 01 sub $0x1,%edx
4011c9: 01 d0 add %edx,%eax
4011cb: b9 10 00 00 00 mov $0x10,%ecx
4011d0: ba 00 00 00 00 mov $0x0,%edx
4011d5: f7 f1 div %ecx
4011d7: 6b c0 10 imul $0x10,%eax,%eax
4011da: e8 6d 00 00 00 call 40124c <___chkstk_ms>
4011df: 29 c4 sub %eax,%esp
4011e1: 8d 44 24 08 lea 0x8(%esp),%eax
4011e5: 83 c0 00 add $0x0,%eax
4011e8: 89 45 f0 mov %eax,-0x10(%ebp)
....
The relevant piece of assembly here is the sub %eax,%esp. It shows that the stack was expanded, based on whatever strlen returned earlier, to make space for your buffer.
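You can also observe the runtime-dependent size from C itself: sizeof applied to a variable length array is evaluated at run time. A minimal self-contained sketch (hypothetical driver around your function):

#include <stdio.h>
#include <string.h>

void function(char *str)
{
    char buffer[strlen(str) + 1];
    /* sizeof on a VLA is computed at run time, not at compile time */
    printf("sizeof buffer = %zu\n", sizeof buffer);
}

int main(void)
{
    function("hello");   /* prints: sizeof buffer = 6 */
    return 0;
}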
Is there any performance difference between a #define, a static const variable, and an enum constant? There might not be any, because of the inlining of #define statements.
I understand that the answer may be compiler-dependent; let's assume GCC then.
There already are similar questions about C and about C++, but they are more about usage aspects.
The compiler would treat them all the same, given basic optimization.
It's fairly easy to check; consider the following C code:
#include <stdio.h>

#define a 1
static const int b = 2;
typedef enum {FOUR = 4} enum_t;

int main() {
    enum_t c = FOUR;
    printf("%d\n", a);
    printf("%d\n", b);
    printf("%d\n", c);
    return 0;
}
compiled with gcc -O3:
0000000000400410 <main>:
400410: 48 83 ec 08 sub $0x8,%rsp
400414: be 01 00 00 00 mov $0x1,%esi
400419: bf 2c 06 40 00 mov $0x40062c,%edi
40041e: 31 c0 xor %eax,%eax
400420: e8 cb ff ff ff callq 4003f0 <printf@plt>
400425: be 02 00 00 00 mov $0x2,%esi
40042a: bf 2c 06 40 00 mov $0x40062c,%edi
40042f: 31 c0 xor %eax,%eax
400431: e8 ba ff ff ff callq 4003f0 <printf@plt>
400436: be 04 00 00 00 mov $0x4,%esi
40043b: bf 2c 06 40 00 mov $0x40062c,%edi
400440: 31 c0 xor %eax,%eax
400442: e8 a9 ff ff ff callq 4003f0 <printf@plt>
Absolutely identical assembly code, hence the exact same performance and memory usage.
Edit: As Damon stated in the comments, there may be some corner cases such as complicated non literals, but that goes a bit beyond the question.
When used as a constant expression, there will be no difference in performance. If used as an lvalue, the static const will need to be defined (memory) and accessed (CPU).
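For example (a short sketch building on the code above): taking the address of the static const uses it as an lvalue, so the compiler must give it real storage, whereas the #define and the enumerator cannot be address-taken at all.

static const int b = 2;

const int *p = &b;   /* b is address-taken here, so it needs memory;   */
                     /* note that &a or &FOUR would not even compile   */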