Casting from double always returns zero - c++

Why does the following code return zero when compiled with clang?
#include <stdint.h>
uint64_t square() {
return ((uint32_t)((double)4294967296.0)) - 10;
}
The assembly it produces is:
push rbp
mov rbp, rsp
xor eax, eax
pop rbp
ret
I would have expected that double to become zero (as an integer) and that the subtraction would wrap it around. In other words, why doesn't it matter what number is being subtracted, since the result is always zero? Note that gcc does produce different numbers, as expected:
push rbp
mov rbp, rsp
mov eax, 4294967285
pop rbp
ret
I assume casting 4294967296.0 to uint32_t is undefined behaviour, but even then I would expect it to produce different results for different subtrahends.

The behaviour of casting an out-of-range double to an unsigned type is indeed undefined.
Wrap-around does not apply, even for an unsigned type.
This is an oft-forgotten rule.
Once program control reaches an undefined construct, the behaviour of the entire program is undefined, including, somewhat paradoxically, statements that have already run.
Reference: https://timsong-cpp.github.io/cppwp/conv.fpint#1
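For contrast, here is a minimal hedged sketch (the helper name is invented) of one way to get well-defined wrap-around semantics: go through a sufficiently wide unsigned integer type, where both conversions are defined.
#include <stdint.h>

uint32_t to_u32_wrapping(double d) {
    // Assumes d is non-negative and below 2^64, so double -> uint64_t is defined;
    // uint64_t -> uint32_t then wraps modulo 2^32, as unsigned conversions always do.
    return (uint32_t)((uint64_t)d);
}

// to_u32_wrapping(4294967296.0) is 0, and to_u32_wrapping(4294967296.0) - 10
// wraps to 4294967286 under ordinary unsigned arithmetic.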

g++ 5.4.0 gives 4294967285 (-11 unsigned), so it is up to the compiler what it wants to do with undefined behavior.
__Z6squarev:
LFB1023:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
movl $-11, %eax
movl $0, %edx
popl %ebp
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc

Related

Which is faster, a struct, or a primitive variable containing the same bytes?

Here is an example piece of code:
#include <stdint.h>
#include <iostream>
typedef struct {
    uint16_t low;
    uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;
int main() {
    // simply to make the answer unknowable at compile time
    uint16_t input;
    std::cin >> input;
    A a = {15, input};
    B b = 0x000f0000 + input;
    // a equals b
    int resultA = a.low - a.high;
    int resultB = b&0xffff - (b>>16)&0xffff;
    // use the variables so the optimiser doesn't get rid of everything
    return resultA + resultB;
}
Both resultA and resultB calculate the exact same thing - but which is faster (assuming the answer isn't knowable at compile time)?
I tried using Compiler Explorer to look at the output, and I got something - but with any optimisation enabled, no matter what I tried, it outsmarted me and optimised the whole calculation away (at first it optimised everything away since the results aren't used). I tried using cin to make the answer unknowable at compile time, but then I couldn't even figure out how it was getting the answer at all (I think it managed to still figure it out at compile time?).
Here is the output of Compiler Explorer with no optimisation flag:
push rbp
mov rbp, rsp
sub rsp, 32
mov dword ptr [rbp - 4], 0
movabs rdi, offset std::cin
lea rsi, [rbp - 6]
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
mov word ptr [rbp - 16], 15
mov ax, word ptr [rbp - 6]
mov word ptr [rbp - 14], ax
movzx eax, word ptr [rbp - 6]
add eax, 983040
mov dword ptr [rbp - 20], eax
# Begin calculating result A
movzx eax, word ptr [rbp - 16]
movzx ecx, word ptr [rbp - 14]
sub eax, ecx
mov dword ptr [rbp - 24], eax
# End of calculation
# Begin calculating result B
mov eax, dword ptr [rbp - 20]
mov edx, dword ptr [rbp - 20]
shr edx, 16
mov ecx, 65535
sub ecx, edx
and eax, ecx
and eax, 65535
mov dword ptr [rbp - 28], eax
# End of calculation
mov eax, dword ptr [rbp - 24]
add eax, dword ptr [rbp - 28]
add rsp, 32
pop rbp
ret
I will also post the -O1 output, but I can't make any sense of it (I'm quite new to low level assembly stuff).
main: # #main
push rax
lea rsi, [rsp + 6]
mov edi, offset std::cin
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
movzx ecx, word ptr [rsp + 6]
mov eax, ecx
and eax, -16
sub eax, ecx
add eax, 15
pop rcx
ret
Something to consider: while doing operations on the two halves is slightly harder with the integer, simply accessing the whole thing as an integer is easier than with the struct (which you'd have to reassemble with bitshifts, I think?). Does this make a difference?
This originally came up in the context of memory, where I saw someone map a memory address to a struct with a field for the low bits and the high bits. I thought this couldn't possibly be faster than simply using an integer of the right size and bitshifting if you need the low or high bits. In this specific situation - which is faster?
[Why did I add C to the tag list? While the example code I used is in C++, the concept of struct vs variable is very applicable to C too]
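To make the access-style contrast above concrete, here is a small hedged sketch (the helper names are invented for illustration) of reading the high half and the whole value in both representations:
#include <stdint.h>

typedef struct {
    uint16_t low;
    uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;

uint16_t high_via_struct(A a)  { return a.high; }
uint16_t high_via_shift(B b)   { return (uint16_t)(b >> 16); }
uint32_t whole_via_struct(A a) { return ((uint32_t)a.high << 16) | a.low; }
uint32_t whole_via_int(B b)    { return b; }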
Other than the fact that some ABIs require that structs be passed differently than integers, there won't be a difference.
Now, there are important semantic differences between two 16 bit ints and one 32 bit int. If you add to the lower 16 bit int, it will not "overflow" into the higher one, while if you add to the lower 16 bits of a 32 bit int, it will. This difference in possible behavior (even if you, yourself, "know" it could not happen in your code) could change what assembly code is generated by your compiler, and impact performance.
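A hedged sketch of that semantic difference (the type and variable names are invented):
#include <stdint.h>
#include <stdio.h>

struct Pair { uint16_t low; uint16_t high; };

int main() {
    struct Pair p = {0xFFFF, 0};   // low = 0xFFFF, high = 0
    uint32_t    v = 0x0000FFFFu;   // same bit pattern on a little-endian machine

    p.low += 1;   // wraps to 0; p.high stays 0 -- no carry into the high half
    v     += 1;   // becomes 0x00010000 -- the carry propagates into the high 16 bits

    printf("%04x%04x vs %08x\n", (unsigned)p.high, (unsigned)p.low, (unsigned)v);
    return 0;
}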
Which of the two would be faster is not knowable without actually testing, or without a full description of the actual, exact problem. So it is a toss-up there.
Which means the only real concern is the ABI one. That means, without whole-program optimization, a function taking a struct and a function taking an int with the same binary layout will make different assumptions about where the data is.
This only matters for by-value single arguments however.
The 90/10 rule applies; 90% of your code runs for less than 10% of the time. The odds are this will have no impact on your critical path.
When trying to answer questions of performance, examining unoptimized code is largely irrelevant.
As a matter of fact, even examining the results of -O1 optimization is not particularly useful, because it does not give you the best that the compiler can achieve. You should try at least -O2.
Regardless of the above, the sample code you provided is unsuitable for examination, because you should be making sure that the values of a and b are separately unknowable by the compiler. As the code stands, the compiler does not know what the value of input is, but it does know that a and b will have the same value, so it optimizes the code in ways that make it impossible to derive any useful conclusions from it.
As a general rule, compilers tend to do an exceptionally good job when dealing with structs that fit within machine words, to the point where generally, there is absolutely no performance difference between the two scenarios you are considering, and between any of the special cases you are pondering about.
Using GCC on Compiler Explorer, the version with the struct produces fewer instructions at -O3.
Code:
#include <stdint.h>
typedef struct {
uint16_t low;
uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;
int f1(A a)
{
return a.low - a.high;
}
int f2(B b)
{
return b&0xffff - (b>>16)&0xffff;
}
Assembly:
_Z2f11A:
movzwl %di, %eax
shrl $16, %edi
subl %edi, %eax
ret
_Z2f2j:
movl %edi, %edx
movl $65535, %eax
shrl $16, %edx
subl %edx, %eax
andl %edi, %eax
ret
But this might be because the two functions don't do the same thing, as - has a higher precedence than &. When the B case is changed to do the same thing as A, the exact same assembly is produced.
Code:
int f3(B b)
{
return (b&0xffff) - ((b>>16)&0xffff);
}
Assembly:
_Z2f3j:
movzwl %di, %eax
shrl $16, %edi
subl %edi, %eax
ret
Note that the only way to find out if something is faster is to benchmark it in a real world use case.
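As a rough starting point, here is a hedged sketch of a crude timing harness (a real benchmark should use a proper framework such as Google Benchmark and guard against the optimiser deleting the work):
#include <chrono>
#include <cstdint>
#include <cstdio>

int main() {
    volatile uint32_t sink = 0;   // volatile so the loop body is not optimised away
    auto start = std::chrono::steady_clock::now();
    for (uint32_t b = 0; b < 100000000u; ++b)
        sink = (b & 0xffff) - ((b >> 16) & 0xffff);
    auto stop = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%lld ns total\n", (long long)ns);
    return (int)sink;
}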

C/C++ : is using the result of comparison as int really branchless?

I have seen in many SO answers that kind of code:
template <typename T>
inline T imax (T a, T b)
{
return (a > b) * a + (a <= b) * b;
}
Where the authors say that this is branchless.
But is this really branchless on current architectures? (x86, ARM...)
And is there a real standard guarantee that this is branchless?
x86 has the SETcc family of instructions which set a byte register to 1 or 0 depending on the value of a flag. This is commonly used by compilers to implement this kind of code without branches.
If you use the “naïve” approach
int imax(int a, int b) {
return a > b ? a : b;
}
The compiler would generate even more efficient branch-less code using the CMOVcc (conditional move) family of instructions.
ARM has the ability to conditionally execute every instruction, which allows the compiler to compile both your version and the naïve implementation efficiently, with the naïve implementation being faster.
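For completeness, here is a hedged sketch of another hand-written branch-free selection using a mask (the helper name is invented); whether any of these variants actually compiles to branch-free code depends on the compiler and target, so inspect the generated assembly:
#include <cstdint>

int32_t imax_mask(int32_t a, int32_t b) {
    int32_t mask = -static_cast<int32_t>(a > b);  // all ones if a > b, otherwise zero
    return (a & mask) | (b & ~mask);              // selects a when mask is all ones, else b
}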
I stumbled upon this SO question because I was asking myself the same thing… it turns out it's not always. For instance, the following code…
const struct op {
    const char *foo;
    int bar;
    int flags;
} ops[] = {
    { "foo", 5, 16 },
    { "bar", 9, 16 },
    { "baz", 13, 0 },
    { 0, 0, 0 }
};
extern int foo(const struct op *, int);
int
bar(void *a, void *b, int c, const struct op *d)
{
    c |= (a == b) && (d->flags & 16);
    return foo(d, c) + 1;
}
… compiles to branching code with both gcc 3.4.6 (i386) and 8.3.0 (amd64, i386) at all optimisation levels. The output from 3.4.6 is more legible, so I'll demonstrate with gcc -O2 -S -masm=intel x.c; less x.s:
[…]
.text
.p2align 2,,3
.globl bar
.type bar , #function
bar:
push %ebp
mov %ebp, %esp
push %ebx
push %eax
mov %eax, DWORD PTR [%ebp+12]
xor %ecx, %ecx
cmp DWORD PTR [%ebp+8], %eax
mov %edx, DWORD PTR [%ebp+16]
mov %ebx, DWORD PTR [%ebp+20]
je .L4
.L2:
sub %esp, 8
or %edx, %ecx
push %edx
push %ebx
call foo
inc %eax
mov %ebx, DWORD PTR [%ebp-4]
leave
ret
.p2align 2,,3
.L4:
test BYTE PTR [%ebx+8], 16
je .L2
mov %cl, 1
jmp .L2
.size bar , . - bar
It turns out the pointer comparison results in an actual compare-and-branch, and even a separate code path to insert the 1.
Not even using !!(a == b) makes a difference here.
tl;dr
Check the actual compiler output (assembly with -S or disassembly with objdump -d -Mintel x.o; drop the -Mintel if not on x86, it merely makes the assembly more legible) of the actual compilation; compilers are unpredictable beasts.

How are non-POD static values initialized? [duplicate]

This question already has answers here:
What is the lifetime of a static variable in a C++ function?
(5 answers)
Closed 8 years ago.
C++, unlike some other languages, allows static data to be of any arbitrary type, not just plain-old-data. Plain-old-data is trivial to initialize (the compiler just writes the value at the appropriate address in the data segment), but the other, more complex types, are not.
How is initialization of non-POD types typically implemented in C++? In particular, what exactly happens when the function foo is executed for the first time? What mechanisms are used to keep track of whether str has already been initialized or not?
#include <string>
void foo() {
static std::string str("Hello, Stack Overflow!");
}
C++11 requires the initialization of function local static variables to be thread-safe. So at least in compilers that are compliant, there'll typically be some sort of synchronization primitive in use that'll need to be checked each time the function is entered.
For example, here's the assembly listing for the code from this program:
#include <string>
void foo() {
static std::string str("Hello, Stack Overflow!");
}
int main() {}
.LC0:
.string "Hello, Stack Overflow!"
foo():
cmpb $0, guard variable for foo()::str(%rip)
je .L14
ret
.L14:
pushq %rbx
movl guard variable for foo()::str, %edi
subq $16, %rsp
call __cxa_guard_acquire
testl %eax, %eax
jne .L15
.L1:
addq $16, %rsp
popq %rbx
ret
.L15:
leaq 15(%rsp), %rdx
movl $.LC0, %esi
movl foo()::str, %edi
call std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)
movl guard variable for foo()::str, %edi
call __cxa_guard_release
movl $__dso_handle, %edx
movl foo()::str, %esi
movl std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string(), %edi
call __cxa_atexit
jmp .L1
movq %rax, %rbx
movl guard variable for foo()::str, %edi
call __cxa_guard_abort
movq %rbx, %rdi
call _Unwind_Resume
main:
xorl %eax, %eax
ret
The __cxa_guard_acquire, __cxa_guard_release etc. are guarding initialization of the static variable.
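This guarantee is what makes the common "Meyers singleton" idiom safe; a minimal hedged sketch (the class is invented for illustration):
#include <string>

class Config {
public:
    static Config& instance() {
        // Constructed lazily, exactly once, using the guard machinery shown above.
        static Config cfg("Hello, Stack Overflow!");
        return cfg;
    }
private:
    explicit Config(std::string name) : name_(std::move(name)) {}
    std::string name_;
};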
The implementation that I've seen uses a hidden boolean variable to check if the variable is initialized. Modern compilers will do this thread-safely, but IIRC some older compilers did not, and if the function was called from several threads at the same time you could get the constructor called twice.
Something along the lines of:
static bool __str_initialized = false;
static char __mem_for_str[...]; // storage for: std::string str("Hello, Stack Overflow!");

void foo() {
    if (!__str_initialized)
    {
        lock();
        if (!__str_initialized)    // re-check: another thread may have won the race
        {
            new (__mem_for_str) std::string("Hello, Stack Overflow!");
            __str_initialized = true;    // set only after construction has finished
        }
        unlock();
    }
}
Then, in the finalization code of the program:
if (__str_initialized)
((std::string&)__mem_for_str).~std::string();
It's implementation specific.
Typically, there'll be a flag (statically initialised to zero) to indicate whether it's initialised, and (in C++11, or earlier thread-safe implementations) some kind of mutex, also statically initialisable, to protect against multiple threads trying to initialise it.
The generated code would typically behave something along the lines of
static __atomic_flag_type __initialised = false;
static __mutex_type __mutex = __MUTEX_INITIALISER;

if (!__initialised) {
    __lock_type __lock(__mutex);
    if (!__initialised) {
        __initialise(str);
        __initialised = true;
    }
}
You can check what your compiler does by generating an assembler listing.
MSVC2008 in debug mode generates this code (excluding exception handling prolog/epilog etc):
mov eax, DWORD PTR ?$S1#?1??foo##YA_NXZ#4IA
and eax, 1
jne SHORT $LN1#foo
mov eax, DWORD PTR ?$S1#?1??foo##YA_NXZ#4IA
or eax, 1
mov DWORD PTR ?$S1#?1??foo##YA_NXZ#4IA, eax
mov DWORD PTR __$EHRec$[ebp+8], 0
mov esi, esp
push OFFSET ??_C#_0BH#ENJCLPMJ#Hello?0?5Stack?5Overflow?$CB?$AA#
mov ecx, OFFSET ?str#?1??foo##YA_NXZ#4V?$basic_string#DU?$char_traits#D#std##V?$allocator#D#2##std##A
call DWORD PTR __imp_??0?$basic_string#DU?$char_traits#D#std##V?$allocator#D#2##std##QAE#PBD#Z
cmp esi, esp
call __RTC_CheckEsp
push OFFSET ??__Fstr#?1??foo##YA_NXZ#YAXXZ ; `foo'::`2'::`dynamic atexit destructor for 'str''
call _atexit
add esp, 4
mov DWORD PTR __$EHRec$[ebp+8], -1
$LN1#foo:
i.e. there is a static flag variable referenced by ?$S1#?1??foo##YA_NXZ#4IA; this is checked to see whether its low bit (& 1) is zero. If not, it branches to the label $LN1#foo:. Otherwise it ORs 1 into the flag, constructs the string at a known location, and then registers a call to its destructor at program exit using atexit. Then the function continues as normal.

Why does this loop produce "warning: iteration 3u invokes undefined behavior" and output more than 4 lines?

Compiling this:
#include <iostream>
int main()
{
for (int i = 0; i < 4; ++i)
std::cout << i*1000000000 << std::endl;
}
and gcc produces the following warning:
warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
I understand there is a signed integer overflow.
What I cannot get is why the value of i is broken by that overflow operation?
I've read the answers to Why does integer overflow on x86 with GCC cause an infinite loop?, but I'm still not clear on why this happens - I get that "undefined" means "anything can happen", but what's the underlying cause of this specific behavior?
Online: http://ideone.com/dMrRKR
Compiler: gcc (4.8)
Signed integer overflow (as strictly speaking, there is no such thing as "unsigned integer overflow") means undefined behaviour. And this means anything can happen, and discussing why it happens under the rules of C++ doesn't make sense.
C++11 draft N3337: §5/4 1
If during the evaluation of an expression, the result is not mathematically defined or not in the range of
representable values for its type, the behavior is undefined. [ Note: most existing implementations of C++
ignore integer overflows. Treatment of division by zero, forming a remainder using a zero divisor, and all
floating point exceptions vary among machines, and is usually adjustable by a library function. —end note ]
Your code compiled with g++ -O3 emits a warning (even without -Wall):
a.cpp: In function 'int main()':
a.cpp:11:18: warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
a.cpp:9:2: note: containing loop
for (int i = 0; i < 4; ++i)
^
The only way we can analyze what the program is doing is by reading the generated assembly code.
Here is the full assembly listing:
.file "a.cpp"
.section .text$_ZNKSt5ctypeIcE8do_widenEc,"x"
.linkonce discard
.align 2
LCOLDB0:
LHOTB0:
.align 2
.p2align 4,,15
.globl __ZNKSt5ctypeIcE8do_widenEc
.def __ZNKSt5ctypeIcE8do_widenEc; .scl 2; .type 32; .endef
__ZNKSt5ctypeIcE8do_widenEc:
LFB860:
.cfi_startproc
movzbl 4(%esp), %eax
ret $4
.cfi_endproc
LFE860:
LCOLDE0:
LHOTE0:
.section .text.unlikely,"x"
LCOLDB1:
.text
LHOTB1:
.p2align 4,,15
.def ___tcf_0; .scl 3; .type 32; .endef
___tcf_0:
LFB1091:
.cfi_startproc
movl $__ZStL8__ioinit, %ecx
jmp __ZNSt8ios_base4InitD1Ev
.cfi_endproc
LFE1091:
.section .text.unlikely,"x"
LCOLDE1:
.text
LHOTE1:
.def ___main; .scl 2; .type 32; .endef
.section .text.unlikely,"x"
LCOLDB2:
.section .text.startup,"x"
LHOTB2:
.p2align 4,,15
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB1084:
.cfi_startproc
leal 4(%esp), %ecx
.cfi_def_cfa 1, 0
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
.cfi_escape 0x10,0x5,0x2,0x75,0
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
pushl %ecx
.cfi_escape 0xf,0x3,0x75,0x70,0x6
.cfi_escape 0x10,0x7,0x2,0x75,0x7c
.cfi_escape 0x10,0x6,0x2,0x75,0x78
.cfi_escape 0x10,0x3,0x2,0x75,0x74
xorl %edi, %edi
subl $24, %esp
call ___main
L4:
movl %edi, (%esp)
movl $__ZSt4cout, %ecx
call __ZNSolsEi
movl %eax, %esi
movl (%eax), %eax
subl $4, %esp
movl -12(%eax), %eax
movl 124(%esi,%eax), %ebx
testl %ebx, %ebx
je L15
cmpb $0, 28(%ebx)
je L5
movsbl 39(%ebx), %eax
L6:
movl %esi, %ecx
movl %eax, (%esp)
addl $1000000000, %edi
call __ZNSo3putEc
subl $4, %esp
movl %eax, %ecx
call __ZNSo5flushEv
jmp L4
.p2align 4,,10
L5:
movl %ebx, %ecx
call __ZNKSt5ctypeIcE13_M_widen_initEv
movl (%ebx), %eax
movl 24(%eax), %edx
movl $10, %eax
cmpl $__ZNKSt5ctypeIcE8do_widenEc, %edx
je L6
movl $10, (%esp)
movl %ebx, %ecx
call *%edx
movsbl %al, %eax
pushl %edx
jmp L6
L15:
call __ZSt16__throw_bad_castv
.cfi_endproc
LFE1084:
.section .text.unlikely,"x"
LCOLDE2:
.section .text.startup,"x"
LHOTE2:
.section .text.unlikely,"x"
LCOLDB3:
.section .text.startup,"x"
LHOTB3:
.p2align 4,,15
.def __GLOBAL__sub_I_main; .scl 3; .type 32; .endef
__GLOBAL__sub_I_main:
LFB1092:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
movl $__ZStL8__ioinit, %ecx
call __ZNSt8ios_base4InitC1Ev
movl $___tcf_0, (%esp)
call _atexit
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
LFE1092:
.section .text.unlikely,"x"
LCOLDE3:
.section .text.startup,"x"
LHOTE3:
.section .ctors,"w"
.align 4
.long __GLOBAL__sub_I_main
.lcomm __ZStL8__ioinit,1,1
.ident "GCC: (i686-posix-dwarf-rev1, Built by MinGW-W64 project) 4.9.0"
.def __ZNSt8ios_base4InitD1Ev; .scl 2; .type 32; .endef
.def __ZNSolsEi; .scl 2; .type 32; .endef
.def __ZNSo3putEc; .scl 2; .type 32; .endef
.def __ZNSo5flushEv; .scl 2; .type 32; .endef
.def __ZNKSt5ctypeIcE13_M_widen_initEv; .scl 2; .type 32; .endef
.def __ZSt16__throw_bad_castv; .scl 2; .type 32; .endef
.def __ZNSt8ios_base4InitC1Ev; .scl 2; .type 32; .endef
.def _atexit; .scl 2; .type 32; .endef
I can barely even read assembly, but even I can see the addl $1000000000, %edi line.
The resulting code looks more like
for(int i = 0; /* nothing, that is - infinite loop */; i += 1000000000)
std::cout << i << std::endl;
This comment by @T.C.:
I suspect that it's something like: (1) because every iteration with i of any value larger than 2 has undefined behavior -> (2) we can assume that i <= 2 for optimization purposes -> (3) the loop condition is always true -> (4) it's optimized away into an infinite loop.
gave me the idea to compare the assembly of the OP's code with the assembly of the following code, which has no undefined behaviour.
#include <iostream>
int main()
{
    // changed the termination condition
    for (int i = 0; i < 3; ++i)
        std::cout << i*1000000000 << std::endl;
}
And, in fact, the correct code has a termination condition.
; ...snip...
L6:
mov ecx, edi
mov DWORD PTR [esp], eax
add esi, 1000000000
call __ZNSo3putEc
sub esp, 4
mov ecx, eax
call __ZNSo5flushEv
cmp esi, -1294967296 // here it is
jne L7
lea esp, [ebp-16]
xor eax, eax
pop ecx
; ...snip...
Unfortunately, these are the consequences of writing buggy code.
Fortunately you can make use of better diagnostics and better debugging tools - that's what they are for:
enable all warnings
-Wall is the gcc option that enables all useful warnings with no false positives. This is a bare minimum that you should always use.
gcc has many other warning options, however, they are not enabled with -Wall as they may warn on false positives
Visual C++ unfortunately is lagging behind with the ability to give useful warnings. At least the IDE enables some by default.
use debug flags for debugging
for integer overflow -ftrapv traps the program on overflow,
Clang compiler is excellent for this: -fcatch-undefined-behavior catches a lot of instances of undefined behaviour (note: "a lot of" != "all of them")
I have a spaghetti mess of a program not written by me that needs to be shipped tomorrow! HELP!!!!!!111oneone
Use gcc's -fwrapv
This option instructs the compiler to assume that signed arithmetic overflow of addition, subtraction and multiplication wraps around using twos-complement representation.
1 - this rule does not apply to "unsigned integer overflow", as §3.9.1/4 says that
Unsigned integers, declared unsigned, shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer.
and, e.g., the result of UINT_MAX + 1 is mathematically defined - by the rules of arithmetic modulo 2^n.
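A hedged illustration of that distinction:
#include <climits>
#include <cstdio>

int main() {
    unsigned int u = UINT_MAX;
    u += 1;                  // well defined: wraps to 0 (arithmetic modulo 2^n)
    std::printf("%u\n", u);  // prints 0

    // int s = INT_MAX;
    // s += 1;               // undefined behaviour: signed overflow
}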
Short answer: gcc has specifically documented this problem; we can see that in the gcc 4.8 release notes, which say (emphasis mine going forward):
GCC now uses a more aggressive analysis to derive an upper bound for
the number of iterations of loops using constraints imposed by
language standards. This may cause non-conforming programs to no
longer work as expected, such as SPEC CPU 2006 464.h264ref and
416.gamess. A new option, -fno-aggressive-loop-optimizations, was added to disable this aggressive analysis. In some loops that have
known constant number of iterations, but undefined behavior is known
to occur in the loop before reaching or during the last iteration, GCC
will warn about the undefined behavior in the loop instead of deriving
lower upper bound of the number of iterations for the loop. The
warning can be disabled with -Wno-aggressive-loop-optimizations.
and indeed if we use -fno-aggressive-loop-optimizations the infinite loop behavior should cease and it does in all the cases I have tested.
The long answer starts with knowing that signed integer overflow is undefined behavior by looking at the draft C++ standard section 5 Expressions paragraph 4 which says:
If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values for
its type, the behavior is undefined. [ Note: most existing
implementations of C++ ignore integer overflows. Treatment of division
by zero, forming a remainder using a zero divisor, and all floating
point exceptions vary among machines, and is usually adjustable by a
library function. —end note
We know that the standard says undefined behavior is unpredictable from the note that comes with the definition, which says:
[ Note: Undefined behavior may be expected when this International
Standard omits any explicit definition of behavior or when a program
uses an erroneous construct or erroneous data. Permissible undefined
behavior ranges from ignoring the situation completely with
unpredictable results, to behaving during translation or program
execution in a documented manner characteristic of the environment
(with or without the issuance of a diagnostic message), to terminating
a translation or execution (with the issuance of a diagnostic
message). Many erroneous program constructs do not engender undefined
behavior; they are required to be diagnosed. —end note ]
But what in the world can the gcc optimizer be doing to turn this into an infinite loop? It sounds completely wacky. But thankfully gcc gives us a clue to figuring it out in the warning:
warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
The clue is -Waggressive-loop-optimizations; what does that mean? Fortunately for us this is not the first time this optimization has broken code in this way, and we are lucky because John Regehr has documented a case in the article GCC pre-4.8 Breaks Broken SPEC 2006 Benchmarks, which shows the following code:
int d[16];
int SATD (void)
{
    int satd = 0, dd, k;
    for (dd=d[k=0]; k<16; dd=d[++k]) {
        satd += (dd < 0 ? -dd : dd);
    }
    return satd;
}
the article says:
The undefined behavior is accessing d[16] just before exiting the
loop. In C99 it is legal to create a pointer to an element one
position past the end of the array, but that pointer must not be
dereferenced.
and later on says:
In detail, here is what’s going on. A C compiler, upon seeing d[++k],
is permitted to assume that the incremented value of k is within the
array bounds, since otherwise undefined behavior occurs. For the code
here, GCC can infer that k is in the range 0..15. A bit later, when
GCC sees k<16, it says to itself: “Aha– that expression is always
true, so we have an infinite loop.” The situation here, where the
compiler uses the assumption of well-definedness to infer a useful
dataflow fact,
So what the compiler must be doing in some cases is assuming that, since signed integer overflow is undefined behavior, i must always be less than 4, and thus we have an infinite loop.
He explains this is very similar to the infamous Linux kernel null pointer check removal where in seeing this code:
struct foo *s = ...;
int x = s->f;
if (!s) return ERROR;
gcc inferred that since s was dereferenced in s->f, and since dereferencing a null pointer is undefined behavior, s must not be null, and therefore optimized away the if (!s) check on the next line.
The lesson here is that modern optimizers are very aggressive about exploiting undefined behavior and will most likely only get more aggressive. Clearly, with just a few examples, we can see the optimizer does things that seem completely unreasonable to a programmer but that, in retrospect, make sense from the optimizer's perspective.
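For reference, a hedged sketch of two ways to write the original loop without the undefined behaviour: do the multiplication in a wider signed type, or in an unsigned type where wrap-around is defined.
#include <cstdint>
#include <iostream>

int main() {
    // Wider signed type: 3 * 1000000000 fits comfortably in 64 bits, no overflow.
    for (int i = 0; i < 4; ++i)
        std::cout << static_cast<std::int64_t>(i) * 1000000000 << std::endl;

    // Unsigned arithmetic: wrap-around, if it happened, would be well defined.
    for (unsigned int i = 0; i < 4; ++i)
        std::cout << i * 1000000000u << std::endl;
}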
tl;dr The code generates a test that integer + positive integer == negative integer. Usually the optimizer does not optimize this out, but in the specific case of std::endl being used next, the compiler does optimize this test out. I haven't figured out what's special about endl yet.
From the assembly code at -O1 and higher levels, it is clear that gcc refactors the loop to:
i = 0;
do {
    cout << i << endl;
    i += NUMBER;
} while (i != NUMBER * 4);
The biggest value that works correctly is 715827882, i.e. floor(INT_MAX/3). The assembly snippet at -O1 is:
L4:
movsbl %al, %eax
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
addl $715827882, %esi
cmpl $-1431655768, %esi
jne L6
// fallthrough to "return" code
Note, the -1431655768 is 4 * 715827882 in 2's complement.
Hitting -O2 optimizes that to the following:
L4:
movsbl %al, %eax
addl $715827882, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
cmpl $-1431655768, %esi
jne L6
leal -8(%ebp), %esp
jne L6
// fallthrough to "return" code
So the optimization that has been made is merely that the addl was moved higher up.
If we recompile with 715827883 instead then the -O1 version is identical apart from the changed number and test value. However, -O2 then makes a change:
L4:
movsbl %al, %eax
addl $715827883, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
jmp L2
Where there was cmpl $-1431655764, %esi at -O1, that line has been removed for -O2. The optimizer must have decided that adding 715827883 to %esi can never equal -1431655764.
This is pretty puzzling. Adding that to INT_MIN+1 does generate the expected result, so the optimizer must have decided that %esi can never be INT_MIN+1 and I'm not sure why it would decide that.
In the working example it seems it'd be equally valid to conclude that adding 715827882 to a number cannot equal INT_MIN + 715827882 - 2 ! (this is only possible if wraparound does actually occur), yet it does not optimize the line out in that example.
The code I was using is:
#include <iostream>
#include <cstdio>
int main()
{
    for (int i = 0; i < 4; ++i)
    {
        //volatile int j = i*715827883;
        volatile int j = i*715827882;
        printf("%d\n", j);
        std::endl(std::cout);
    }
}
If the std::endl(std::cout) is removed then the optimization no longer occurs. In fact replacing it with std::cout.put('\n'); std::flush(std::cout); also causes the optimization to not happen, even though std::endl is inlined.
The inlining of std::endl seems to affect the earlier part of the loop structure (I don't quite understand what it is doing, but I'll post it here in case someone else does):
With original code and -O2:
L2:
movl %esi, 28(%esp)
movl 28(%esp), %eax
movl $LC0, (%esp)
movl %eax, 4(%esp)
call _printf
movl __ZSt4cout, %eax
movl -12(%eax), %eax
movl __ZSt4cout+124(%eax), %ebx
testl %ebx, %ebx
je L10
cmpb $0, 28(%ebx)
je L3
movzbl 39(%ebx), %eax
L4:
movsbl %al, %eax
addl $715827883, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
jmp L2 // no test
With my manual inlining of std::endl, -O2:
L3:
movl %ebx, 28(%esp)
movl 28(%esp), %eax
addl $715827883, %ebx
movl $LC0, (%esp)
movl %eax, 4(%esp)
call _printf
movl $10, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl $__ZSt4cout, (%esp)
call __ZNSo5flushEv
cmpl $-1431655764, %ebx
jne L3
xorl %eax, %eax
One difference between these two is that %esi is used in the original, and %ebx in the second version; is there any difference in semantics defined between %esi and %ebx in general? (I don't know much about x86 assembly.)
Another example of this error being reported by gcc is when you have a loop that executes for a constant number of iterations, but you use the counter variable as an index into an array that has fewer than that number of items, such as:
int a[50], x, i;
for (i = 0; i < 1000; i++) x = a[i];
The compiler can determine that this loop will try to access memory outside of the array 'a'. The compiler complains about this with this rather cryptic message:
iteration xxu invokes undefined behavior [-Werror=aggressive-loop-optimizations]
What I cannot get is why the value of i is broken by that overflow operation?
It seems that integer overflow occurs in the 4th iteration (for i = 3).
Signed integer overflow invokes undefined behavior. In this case nothing can be predicted. The loop may iterate only 4 times, or it may run forever, or anything else!
The result may vary from compiler to compiler, or even between different versions of the same compiler.
C++11: §1.3.24 undefined behavior:
behavior for which this International Standard imposes no requirements
[ Note: Undefined behavior may be expected when this International Standard omits any explicit definition of behavior or when a program uses an erroneous construct or erroneous data. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). Many erroneous program constructs do not engender undefined behavior; they are required to be diagnosed.
—end note ]

Does the evil cast get trumped by the evil compiler?

This is not academic code or a hypothetical question. The original problem was converting code from HP11 to HP1123 Itanium. Basically it boils down to a compile error on HP1123 Itanium. It had me really scratching my head when reproducing it on Windows for study. I have stripped all but the most basic aspects... You may have to press Control-D to exit a console window if you run it as is:
#include "stdafx.h"
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
    char blah[6];
    const int IAMCONST = 3;
    int *pTOCONST;
    pTOCONST = (int *) &IAMCONST;
    (*pTOCONST) = 7;
    printf("IAMCONST %d \n", IAMCONST);
    printf("WHATISPOINTEDAT %d \n", (*pTOCONST));
    printf("Address of IAMCONST %x pTOCONST %x\n", &IAMCONST, (pTOCONST));
    cin >> blah;
    return 0;
}
Here is the output
IAMCONST 3
WHATISPOINTEDAT 7
Address of IAMCONST 35f9f0 pTOCONST 35f9f0
All I can say is what the heck? Is it undefined to do this? It is the most counterintuitive thing I have seen for such a simple example.
Update:
Indeed, after searching for a while, the menu Debug >> Windows >> Disassembly showed exactly the optimization that was described below.
printf("IAMCONST %d \n",IAMCONST);
0024360E mov esi,esp
00243610 push 3
00243612 push offset string "IAMCONST %d \n" (2458D0h)
00243617 call dword ptr [__imp__printf (248338h)]
0024361D add esp,8
00243620 cmp esi,esp
00243622 call #ILT+325(__RTC_CheckEsp) (24114Ah)
Thank you all!
Looks like the compiler is optimizing
printf("IAMCONST %d \n",IAMCONST);
into
printf("IAMCONST %d \n",3);
since you said that IAMCONST is a const int.
But since you're taking the address of IAMCONST, it has to actually be located on the stack somewhere, and the constness can't be enforced, so the memory at that location (*pTOCONST) is mutable after all.
In short: you cast away the constness; don't do that. Poor, defenseless C...
Addendum
Using GCC for x86, with -O0 (no optimizations), the generated assembly
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $36, %esp
movl $3, -12(%ebp)
leal -12(%ebp), %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
movl $7, (%eax)
movl -12(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC0, (%esp)
call printf
movl -8(%ebp), %eax
movl (%eax), %eax
movl %eax, 4(%esp)
movl $.LC1, (%esp)
call printf
copies from *(bp-12) on the stack to printf's arguments. However, using -O1 (as well as -Os, -O2, -O3, and other optimization levels),
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $20, %esp
movl $3, 4(%esp)
movl $.LC0, (%esp)
call printf
movl $7, 4(%esp)
movl $.LC1, (%esp)
call printf
you can clearly see that the constant 3 is used instead.
If you are using Visual Studio's CL.EXE, /Od disables optimization. This varies from compiler to compiler.
Be warned that the C specification allows the C compiler to assume that the target of any int * pointer never overlaps the memory location of a const int, so you really shouldn't be doing this at all if you want predictable behavior.
The constant value IAMCONST is being inlined into the printf call.
What you're doing is at best wrong and in all likelihood is undefined by the C++ standard. My guess is that the C++ standard leaves the compiler free to inline a const primitive which is local to a function declaration. The reason being that the value should not be able to change.
Then again, it's C++ where should and can are very different words.
You are lucky the compiler is doing the optimization. An alternative treatment would be to place the const integer into read-only memory, whereupon trying to modify the value would cause a core dump.
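A hedged illustration of that alternative (the variable name is invented): a namespace-scope const is commonly placed in a read-only segment, so the same write typically crashes instead of appearing to work; it is undefined behaviour either way.
const int GLOBALCONST = 3;

int main() {
    int *p = (int *) &GLOBALCONST;
    *p = 7;        // undefined behaviour; commonly a segmentation fault here
    return 0;
}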
Writing to a const object through a cast that removes the const is undefined behavior - so at the point where you do this:
(*pTOCONST) = 7;
all bets are off.
From the C++ standard 7.1.5.1 (The cv-qualifiers):
Except that any class member declared mutable (7.1.1) can be modified, any attempt to modify a const
object during its lifetime (3.8) results in undefined behavior.
Because of this, the compiler is free to assume that the value of IAMCONST will not change, so it can optimize away the access to the actual storage. In fact, if the address of the const object is never taken, the compiler may eliminate the storage for the object altogether.
Also note that (again in 7.1.5.1):
A variable of non-volatile const-qualified integral or enumeration type initialized by an integral constant expression can be used in integral constant expressions (5.19).
Which means IAMCONST can be used in compile-time constant expressions (i.e., to provide a value for an enumeration or the size of an array). What would it even mean to change that at runtime?
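For contrast, a hedged sketch of the case where casting away const is actually permitted: when the underlying object was never declared const in the first place.
#include <cstdio>

int main() {
    int x = 3;
    const int *cp = &x;
    *const_cast<int *>(cp) = 7;    // OK: the object x itself is not const
    std::printf("%d\n", x);        // prints 7

    const int c = 3;
    const int *cc = &c;
    // *const_cast<int *>(cc) = 7; // undefined behaviour: c really is const
    return 0;
}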
It doesn't matter if the compiler optimizes or not. You asked for trouble and you're lucky you got the trouble yourself instead of waiting for customers to report it to you.
"All I can say is what the heck? Is it undefined to do this? It is the most counterintuitive thing I have seen for such a simple example."
If you really believe that then you need to switch to a language you can understand, or change professions. For the sake of yourself and your customers, stop using C or C++ or C#.
const int IAMCONST = 3;
You said it.
int *pTOCONST;
pTOCONST = (int *) &IAMCONST;
Guess why the compiler would complain if you omitted your evil cast. The compiler might have been telling the truth before you lied to it.
"Does the evil cast get trumped by the evil compiler?"
No. The evil cast gets trumped by itself. Whether or not your compiler tried to tell you the truth, the compiler was not evil.