Is there a buffer overflow "hello world" for C++?

I tried the code provided by this question, but it doesn't work.
How can I contrive an overflow that I can actually observe, so I can wrap my head around the idea?
Update:
.file "hw.cpp"
.section .rdata,"dr"
LC0:
.ascii "Oh shit really bad~!\15\12\0"
.text
.align 2
.globl __Z3badv
.def __Z3badv; .scl 2; .type 32; .endef
__Z3badv:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $LC0, (%esp)
call _printf
leave
ret
.section .rdata,"dr"
LC1:
.ascii "WOW\0"
.text
.align 2
.globl __Z3foov
.def __Z3foov; .scl 2; .type 32; .endef
__Z3foov:
pushl %ebp
movl %esp, %ebp
subl $4, %esp
movl LC1, %eax
movl %eax, -4(%ebp)
movl $__Z3badv, 4(%ebp)
leave
ret
.def ___main; .scl 2; .type 32; .endef
.align 2
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -4(%ebp)
movl -4(%ebp), %eax
call __alloca
call ___main
call __Z3foov
movl $0, %eax
leave
ret
.def _printf; .scl 2; .type 32; .endef

It would help to compile the example in the other question to assembly so you can get a feel for how the stack is laid out for your given compiler and processor. The +8 in the example may not be the correct number for your environment. What you need to determine is where the return address is stored on the stack relative to the array stored on the stack.
By the way, the example worked for me. I compiled on Win XP with Cygwin, gcc version 4.3.4. When I say it "worked", I mean that it ran code in the bad() function, even though that function was never called by the code.
$ gcc -Wall -Wextra buffer-overflow.c && ./a.exe
Oh shit really bad~!
Segmentation fault (core dumped)
The code really isn't an example of a buffer overflow, it's an example of what bad things can happen when a buffer overflow is exploited.
I'm not great with x86 assembly, but here's my interpretation of how this exploit works.
$ gcc -S buffer-overflow.c && cat buffer-overflow.s
_foo:
pushl %ebp ;2
movl %esp, %ebp ;3
subl $16, %esp ;4
movl LC1, %eax ;5
movl %eax, -4(%ebp) ;6
leal -4(%ebp), %eax ;7
leal 8(%eax), %edx ;8
movl $_bad, %eax ;9
movl %eax, (%edx) ;10
leave
ret
_main:
...
call _foo ;1
...
When main calls foo (1), the call instruction pushes onto the stack the address within main to return to once the call to foo completes. Pushing onto the stack involves decrementing ESP and storing a value there.
Once in foo, the old base pointer value is also pushed onto the stack (2). This will be restored when foo returns. The stack pointer is saved as the base pointer for this stack frame (3). The stack pointer is decremented by 16 (4), which creates space on this stack frame for local variables.
Instructions (5,6) copy the 4-byte literal "WOW\0" into the local variable overme on the stack -- note that LC1 has no $ prefix, so it is the four characters themselves that are loaded into EAX and stored, not their address. Either way, overme sits 4 bytes below the current base pointer. So the stack contains this value, then the old base pointer, then the return address.
The address of overme is put into EAX (7) and an integer pointer is created 8 bytes beyond that address (8). The address of the bad function is put into EAX (9) and then that address is stored in memory pointed to by the integer pointer (10).
The stack looks like this:
// 4 bytes on each row
ESP: (unused)
: (unused)
: (unused)
: &"WOW\0"
: old EBP from main
: return PC, overwritten with &bad
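For reference, here is roughly what the C source behind this assembly looks like (my reconstruction of the example from the other question; the +8 offset and the pointer-to-int cast are specific to a 32-bit build with this particular stack layout, so treat it as a sketch rather than portable code):
#include <stdio.h>
void bad(void)
{
printf("Oh shit really bad~!\r\n");
}
void foo(void)
{
char overme[4] = "WOW"; /* 4-byte buffer on the stack */
*(int *)(overme + 8) = (int)bad; /* write past it, onto the saved return address */
}
int main(void)
{
foo(); /* when foo "returns", execution continues in bad() */
return 0;
}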
When you compile with optimization, all the interesting stuff gets optimized away as "useless code" (which it is).
$ gcc -S -O2 buffer-overflow.c && cat buffer-overflow.s
_foo:
pushl %ebp
movl %esp, %ebp
popl %ebp
ret

You can use the C example you posted. It works the same in C as C++.
The smallest readable answer I can think of:
int main() {
return ""[1]; // Undefined behaviour (reading past '\0' in string)
}

Something like this?
int main()
{
char arr[1];
arr[1000000] = 'a';
}

A simple buffer overflow would be something like this:
#include <stdio.h>
#include <string.h>
int main() {
char a[4] = {0};
char b[32] = {0};
printf("before: b == \"%s\"\n", b);
strcpy(a, "Putting too many characters in array a");
printf("after: b == \"%s\"\n", b);
}
A possible output:
before: b == ""
after: b == " characters in array a"
The actual behavior of the program is undefined, so the buffer overflow might also cause different output, crashes or no observable effect at all.
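If you'd rather have the runtime point at the bug than rely on a lucky output, AddressSanitizer (available in reasonably recent gcc and clang) flags the overflowing strcpy the moment it happens. An illustrative run (the file name is made up and the report is abridged; exact wording varies by version):
$ gcc -g -fsanitize=address buffer-overflow.c && ./a.out
before: b == ""
==<pid>==ERROR: AddressSanitizer: stack-buffer-overflow ... (the report points at the strcpy call)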

#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <string.h> /* for strcpy */
void process_msg(const char *pSrc)
{
char cBuff[5];
strcpy(cBuff, pSrc);
}
void main()
{
char szInput[] = "hello world!";
process_msg(szInput);
}
Running this program at Visual Studio 2008 in Debug mode gives this message:
Run-Time Check Failure #2 - Stack around the variable 'cBuff' was corrupted.
The cBuff char array is allocated on the stack in this example and is 5 bytes in size. Copying the data behind the given pointer (pSrc) into that char array (cBuff) overwrites the stack frame's data beyond the buffer, which is what makes an exploit possible.
This technique is used by attackers: they send a specially crafted array of chars that overwrites the return address stored on the stack and changes it to a location of their choosing in memory.
So, for example, they could point that return address at system or program code that opens a port or establishes a connection, and then they get into your PC with the application's privileges (which often means root/administrator).
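The usual first-line mitigation is simply to bound the copy. A minimal sketch of a safer process_msg (one option among several; snprintf or strcpy_s would also work):
#include <string.h>
void process_msg(const char *pSrc)
{
char cBuff[5];
/* copy at most 4 characters; strncpy does not NUL-terminate on truncation,
so terminate explicitly */
strncpy(cBuff, pSrc, sizeof(cBuff) - 1);
cBuff[sizeof(cBuff) - 1] = '\0';
}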
Read more at http://en.wikipedia.org/wiki/Buffer_overflow .

In addition to the excellent article pointed out by Eric, you might also check out the following reading materials:
Writing buffer overflow exploits - a tutorial for beginners
A step-by-step on the buffer overflow vulnerability
The following article focuses more on heap overflows:
w00w00 on Heap Overflows
This was copied from my answer here.

Strictly, the term buffer overflow means accessing past the end of a buffer (and, used loosely, it also covers buffer underflow). But there is a whole class of memory safety problems related to buffer overflows, pointers, arrays, allocation and deallocation, all of which can produce crashes and/or exploit opportunities in code. See this C example of another memory safety problem (and our way to detect it).

Related

Hidden parameter in C++ function call [duplicate]

The return value of a function is usually stored on the stack or in a register. But for a large structure, it has to be on the stack. How much copying has to happen in a real compiler for this code? Or is it optimized away?
For example:
struct Data {
unsigned values[256];
};
Data createData()
{
Data data;
// initialize data values...
return data;
}
(Assuming the function cannot be inlined..)
None; no copies are done.
The address of the caller's Data return value is actually passed as a hidden argument to the function, and the createData function simply writes into the caller's stack frame.
This is known as the named return value optimisation. Also see the c++ faq on this topic.
commercial-grade C++ compilers implement return-by-value in a way that lets them eliminate the overhead, at least in simple cases
...
When yourCode() calls rbv(), the compiler secretly passes a pointer to the location where rbv() is supposed to construct the "returned" object.
You can demonstrate that this has been done by adding a destructor with a printf to your struct. The destructor should only be called once if this return-by-value optimisation is in operation, otherwise twice.
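A minimal sketch of that destructor experiment (the printf in the destructor just counts destroyed objects; nothing else about Data matters here):
#include <cstdio>
struct Data {
unsigned values[256];
~Data() { std::printf("~Data()\n"); } // one line per object destroyed
};
Data createData()
{
Data data;
data.values[5] = 6;
return data; // with NRVO, 'data' is constructed directly in the caller's storage
}
int main()
{
Data d = createData();
return d.values[5] != 6;
} // expect a single "~Data()" line when the optimisation is in effect
On gcc you can also compile with -fno-elide-constructors to switch the optimisation off and watch the extra destructor call appear.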
Also you can check the assembly to see that this happens:
Data createData()
{
Data data;
// initialize data values...
data.values[5] = 6;
return data;
}
here's the assembly:
__Z10createDatav:
LFB2:
pushl %ebp
LCFI0:
movl %esp, %ebp
LCFI1:
subl $1032, %esp
LCFI2:
movl 8(%ebp), %eax
movl $6, 20(%eax)
leave
ret $4
LFE2:
Curiously, it allocated enough space on the stack for the data item (subl $1032, %esp), but note that it takes the first argument on the stack, 8(%ebp), as the base address of the object, and then stores 6 into element 5 of that object (offset 20 = 5 * 4 bytes). Since we didn't pass any arguments to createData, this is curious until you realise it is the secret hidden pointer to the caller's copy of Data.
But for a large structure, it has to be on the stack.
Indeed so! A large structure declared as a local variable is allocated on the stack. Glad to have that cleared up.
As for avoiding copying, as others have noted:
Most calling conventions deal with "function returning struct" by passing an additional parameter that points the location in the caller's stack frame in which the struct should be placed. This is definitely a matter for the calling convention and not the language.
With this calling convention, it becomes possible for even a relatively simple compiler to notice when a code path is definitely going to return a struct, and for it to fix assignments to that struct's members so that they go directly into the caller's frame and don't have to be copied. The key is for the compiler to notice that all terminating code paths through the function return the same struct variable. If that's the case, the compiler can safely use the space in the caller's frame, eliminating the need for a copy at the point of return.
There are many examples given, but basically:
This question does not have a single definite answer; it will depend on the compiler.
C does not specify how large structs are returned from a function.
Here's some tests for one particular compiler, gcc 4.1.2 on x86 RHEL 5.4
gcc trivial case, no copying
[00:05:21 1 ~] $ gcc -O2 -S -c t.c
[00:05:23 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
movl $1, 24(%eax)
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc more realistic case , allocate on stack, memcpy to caller
#include <stdlib.h>
struct Data {
unsigned values[256];
};
struct Data createData()
{
struct Data data;
int i;
for(i = 0; i < 256 ; i++)
data.values[i] = rand();
return data;
}
[00:06:08 1 ~] $ gcc -O2 -S -c t.c
[00:06:10 1 ~] $ cat t.s
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
gcc 4.4.2 has grown a lot, and does not copy for the above non-trivial case.
.file "t.c"
.text
.p2align 4,,15
.globl createData
.type createData, #function
createData:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
movl $1, %ebx
subl $1036, %esp
movl 8(%ebp), %edi
leal -1036(%ebp), %esi
.p2align 4,,7
.L2:
call rand
movl %eax, -4(%esi,%ebx,4)
addl $1, %ebx
cmpl $257, %ebx
jne .L2
movl %esi, 4(%esp)
movl %edi, (%esp)
movl $1024, 8(%esp)
call memcpy
addl $1036, %esp
movl %edi, %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret $4
.size createData, .-createData
.ident "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
.section .note.GNU-stack,"",#progbits
In addition, VS2008 (compiling the above as C) reserves struct Data on the stack of createData() and does a rep movsd loop to copy it back to the caller in Debug mode; in Release mode it moves the return value of rand() (%eax) directly back to the caller.
typedef struct {
unsigned value[256];
} Data;
Data createData(void) {
Data r;
calculate(&r);
return r;
}
Data d = createData();
msvc(6,8,9) and gcc mingw(3.4.5,4.4.0) will generate code like the following pseudocode
void createData(Data* r) {
calculate(r);
}
Data d;
createData(&d);
gcc on linux will issue a memcpy() to copy the struct back on the stack of the caller. If the function has internal linkage, more optimizations become available though.

What is the mechanism of calling nonvirtual member function in C++?

The C++ object model is such that it does not contain any table for non virtual member functions. When there is a call of such a function
a.my_function();
with name mangling it becomes something like
my_function__5AclassKd(&a)
The object contains only data members; there is no table for non-virtual functions. So in such circumstances, how does the calling mechanism find out which function to call?
What's going on under the hood?
Formally the standard doesn't require them to work in any specific way, but usually they work exactly like plain functions, but with an extra invisible parameter: a pointer to the object instance they're called on.
Of course a compiler might be able to optimize that, e.g. don't pass the pointer if the member function doesn't use this or any member variables or member functions requiring this.
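In other words, a non-virtual member function is compiled essentially as an ordinary function whose first parameter is the object's address; the mangled name in your example is just the linker-visible spelling of that function. An illustrative sketch (the lowered name here is made up, not a real mangling):
struct Aclass {
int d;
int my_function(); // non-virtual member
};
// what the compiler effectively emits for Aclass::my_function:
int Aclass_my_function(Aclass* this_) // hidden 'this' made explicit
{
return this_->d; // member access goes through the pointer
}
// and a call site a.my_function(); is lowered to Aclass_my_function(&a);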
The compiler's job is to lay out the data and code the program needs into memory addresses. Each non-virtual function - whether member or non-member - gets a fixed virtual memory address at which it can be called. Calling machine code then hardcodes an absolute (or with position independent code a calling-address-relative offset) address of the function to call.
For example, say your compiler is compiling a non-virtual member function that takes 20 bytes of machine code, and it's putting the executable code at virtual addresses from offset 0x1000 and has already generated 10 bytes of executable code for other functions, then it will start the code of this function at virtual address 0x100A. Code that wants to call the function then generates machine code for "call 0x100A" after pushing any function call arguments (including a this pointer to the object to be operated upon) onto the stack.
You can easily see all this happening:
~/dev > cat example.cc
#include <cstdio>
struct X
{
int f(int n) { return n + 3; }
};
int main()
{
X x;
printf("%d\n", x.f(7));
}
~/dev > g++ example.cc -S; c++filt < example.s
.file "example.cc"
.section .text._ZN1X1fEi,"axG",#progbits,X::f(int),comdat
.align 2
.weak X::f(int)
.type X::f(int), #function
X::f(int): // code to execute X::f(int) starts at label .LFB0
.LFB0: // when this assembly is converted to machine code
.cfi_startproc // it's given a virtual address
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -8(%rbp)
movl %esi, -12(%rbp)
movl -12(%rbp), %eax
addl $3, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size X::f(int), .-X::f(int)
.section .rodata
.LC0:
.string "%d\n"
.text
.globl main
.type main, #function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
leaq -9(%rbp), %rax
movl $7, %esi
movq %rax, %rdi
call X::f(int) // call the non-virtual member function;
// the machine code hardcodes its address
movl %eax, %esi
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf#PLT
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L5
call __stack_chk_fail#PLT
.L5:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu 7.2.0-8ubuntu3) 7.2.0"
.section .note.GNU-stack,"",#progbits
If you compile a program then look at the disassembly it'll usually show the actual virtual address offsets too.
With non-virtual functions, there is no need to determine at runtime which function to call; so the resulting machine code will typically look the same as a normal function call, just with an extra argument for this as indicated in your example. (Though it's not always identical - for example, I think MSVC compiling 32-bit programs, in at least some versions, passes this in the ECX register instead of on the stack as for usual function parameters.)
Thus, the determination of which function to call is made by the compiler at compile time. At that time, it has the information determined from parsing class declarations that it can use, for example to do method overload resolution, and from there to either calculate or look up the mangled name to put into assembly code.
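You can also see the statically chosen symbol without reading assembly. For the X::f example above, a session like this (illustrative; the address and the symbol flag depend on your toolchain, and the W here just marks a weak/COMDAT symbol because the function was defined inside the class):
$ g++ -c example.cc && nm -C example.o | grep 'X::f'
0000000000000000 W X::f(int)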

Why does this loop produce "warning: iteration 3u invokes undefined behavior" and output more than 4 lines?

Compiling this:
#include <iostream>
int main()
{
for (int i = 0; i < 4; ++i)
std::cout << i*1000000000 << std::endl;
}
and gcc produces the following warning:
warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
I understand there is a signed integer overflow.
What I cannot get is why the value of i is broken by that overflow operation?
I've read the answers to Why does integer overflow on x86 with GCC cause an infinite loop?, but I'm still not clear on why this happens - I get that "undefined" means "anything can happen", but what's the underlying cause of this specific behavior?
Online: http://ideone.com/dMrRKR
Compiler: gcc (4.8)
Signed integer overflow (strictly speaking, there is no such thing as "unsigned integer overflow") means undefined behaviour. And this means anything can happen, and discussing why it happens under the rules of C++ doesn't make sense.
C++11 draft N3337: §5.4:1
If during the evaluation of an expression, the result is not mathematically defined or not in the range of
representable values for its type, the behavior is undefined. [ Note: most existing implementations of C++
ignore integer overflows. Treatment of division by zero, forming a remainder using a zero divisor, and all
floating point exceptions vary among machines, and is usually adjustable by a library function. —end note ]
Your code compiled with g++ -O3 emits a warning (even without -Wall):
a.cpp: In function 'int main()':
a.cpp:11:18: warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
a.cpp:9:2: note: containing loop
for (int i = 0; i < 4; ++i)
^
The only way we can analyze what the program is doing, is by reading the generated assembly code.
Here is the full assembly listing:
.file "a.cpp"
.section .text$_ZNKSt5ctypeIcE8do_widenEc,"x"
.linkonce discard
.align 2
LCOLDB0:
LHOTB0:
.align 2
.p2align 4,,15
.globl __ZNKSt5ctypeIcE8do_widenEc
.def __ZNKSt5ctypeIcE8do_widenEc; .scl 2; .type 32; .endef
__ZNKSt5ctypeIcE8do_widenEc:
LFB860:
.cfi_startproc
movzbl 4(%esp), %eax
ret $4
.cfi_endproc
LFE860:
LCOLDE0:
LHOTE0:
.section .text.unlikely,"x"
LCOLDB1:
.text
LHOTB1:
.p2align 4,,15
.def ___tcf_0; .scl 3; .type 32; .endef
___tcf_0:
LFB1091:
.cfi_startproc
movl $__ZStL8__ioinit, %ecx
jmp __ZNSt8ios_base4InitD1Ev
.cfi_endproc
LFE1091:
.section .text.unlikely,"x"
LCOLDE1:
.text
LHOTE1:
.def ___main; .scl 2; .type 32; .endef
.section .text.unlikely,"x"
LCOLDB2:
.section .text.startup,"x"
LHOTB2:
.p2align 4,,15
.globl _main
.def _main; .scl 2; .type 32; .endef
_main:
LFB1084:
.cfi_startproc
leal 4(%esp), %ecx
.cfi_def_cfa 1, 0
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
.cfi_escape 0x10,0x5,0x2,0x75,0
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
pushl %ecx
.cfi_escape 0xf,0x3,0x75,0x70,0x6
.cfi_escape 0x10,0x7,0x2,0x75,0x7c
.cfi_escape 0x10,0x6,0x2,0x75,0x78
.cfi_escape 0x10,0x3,0x2,0x75,0x74
xorl %edi, %edi
subl $24, %esp
call ___main
L4:
movl %edi, (%esp)
movl $__ZSt4cout, %ecx
call __ZNSolsEi
movl %eax, %esi
movl (%eax), %eax
subl $4, %esp
movl -12(%eax), %eax
movl 124(%esi,%eax), %ebx
testl %ebx, %ebx
je L15
cmpb $0, 28(%ebx)
je L5
movsbl 39(%ebx), %eax
L6:
movl %esi, %ecx
movl %eax, (%esp)
addl $1000000000, %edi
call __ZNSo3putEc
subl $4, %esp
movl %eax, %ecx
call __ZNSo5flushEv
jmp L4
.p2align 4,,10
L5:
movl %ebx, %ecx
call __ZNKSt5ctypeIcE13_M_widen_initEv
movl (%ebx), %eax
movl 24(%eax), %edx
movl $10, %eax
cmpl $__ZNKSt5ctypeIcE8do_widenEc, %edx
je L6
movl $10, (%esp)
movl %ebx, %ecx
call *%edx
movsbl %al, %eax
pushl %edx
jmp L6
L15:
call __ZSt16__throw_bad_castv
.cfi_endproc
LFE1084:
.section .text.unlikely,"x"
LCOLDE2:
.section .text.startup,"x"
LHOTE2:
.section .text.unlikely,"x"
LCOLDB3:
.section .text.startup,"x"
LHOTB3:
.p2align 4,,15
.def __GLOBAL__sub_I_main; .scl 3; .type 32; .endef
__GLOBAL__sub_I_main:
LFB1092:
.cfi_startproc
subl $28, %esp
.cfi_def_cfa_offset 32
movl $__ZStL8__ioinit, %ecx
call __ZNSt8ios_base4InitC1Ev
movl $___tcf_0, (%esp)
call _atexit
addl $28, %esp
.cfi_def_cfa_offset 4
ret
.cfi_endproc
LFE1092:
.section .text.unlikely,"x"
LCOLDE3:
.section .text.startup,"x"
LHOTE3:
.section .ctors,"w"
.align 4
.long __GLOBAL__sub_I_main
.lcomm __ZStL8__ioinit,1,1
.ident "GCC: (i686-posix-dwarf-rev1, Built by MinGW-W64 project) 4.9.0"
.def __ZNSt8ios_base4InitD1Ev; .scl 2; .type 32; .endef
.def __ZNSolsEi; .scl 2; .type 32; .endef
.def __ZNSo3putEc; .scl 2; .type 32; .endef
.def __ZNSo5flushEv; .scl 2; .type 32; .endef
.def __ZNKSt5ctypeIcE13_M_widen_initEv; .scl 2; .type 32; .endef
.def __ZSt16__throw_bad_castv; .scl 2; .type 32; .endef
.def __ZNSt8ios_base4InitC1Ev; .scl 2; .type 32; .endef
.def _atexit; .scl 2; .type 32; .endef
I can barely even read assembly, but even I can see the addl $1000000000, %edi line.
The resulting code looks more like
for(int i = 0; /* nothing, that is - infinite loop */; i += 1000000000)
std::cout << i << std::endl;
This comment from @T.C.:
I suspect that it's something like: (1) because every iteration with i of any value larger than 2 has undefined behavior -> (2) we can assume that i <= 2 for optimization purposes -> (3) the loop condition is always true -> (4) it's optimized away into an infinite loop.
gave me the idea to compare the assembly code of the OP's code with the assembly code of the following code, which has no undefined behaviour.
#include <iostream>
int main()
{
// changed the termination condition
for (int i = 0; i < 3; ++i)
std::cout << i*1000000000 << std::endl;
}
And, in fact, the corrected code does have a termination condition.
; ...snip...
L6:
mov ecx, edi
mov DWORD PTR [esp], eax
add esi, 1000000000
call __ZNSo3putEc
sub esp, 4
mov ecx, eax
call __ZNSo5flushEv
cmp esi, -1294967296 // here it is
jne L7
lea esp, [ebp-16]
xor eax, eax
pop ecx
; ...snip...
Unfortunately, these are the consequences of writing buggy code.
Fortunately you can make use of better diagnostics and better debugging tools - that's what they are for:
enable all warnings
-Wall is the gcc option that enables all useful warnings with no false positives. This is a bare minimum that you should always use.
gcc has many other warning options; however, they are not enabled by -Wall because they may produce false positives
Visual C++ unfortunately is lagging behind with the ability to give useful warnings. At least the IDE enables some by default.
use debug flags for debugging
for integer overflow, -ftrapv traps the program on overflow,
the Clang compiler is excellent for this: -fcatch-undefined-behavior (spelled -fsanitize=undefined in current releases) catches a lot of instances of undefined behaviour (note: "a lot of" != "all of them"); see the example after this list
I have a spaghetti mess of a program not written by me that needs to be shipped tomorrow! HELP!!!!!!111oneone
Use gcc's -fwrapv
This option instructs the compiler to assume that signed arithmetic overflow of addition, subtraction and multiplication wraps around using twos-complement representation.
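For instance, an illustrative run of the original program under the undefined-behaviour sanitizer (the file name, line and column are placeholders; the exact message depends on your compiler version):
$ g++ -fsanitize=undefined overflow.cpp && ./a.out
...
overflow.cpp:<line>:<col>: runtime error: signed integer overflow: 3 * 1000000000 cannot be represented in type 'int'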
1 - this rule does not apply to "unsigned integer overflow", as §3.9.1.4 says that
Unsigned integers, declared unsigned, shall obey the laws of arithmetic modulo 2^n where n is the number
of bits in the value representation of that particular size of integer.
and e.g. the result of UINT_MAX + 1 is mathematically defined - by the rules of arithmetic modulo 2^n
Short answer, gcc specifically has documented this problem, we can see that in the gcc 4.8 release notes which says (emphasis mine going forward):
GCC now uses a more aggressive analysis to derive an upper bound for
the number of iterations of loops using constraints imposed by
language standards. This may cause non-conforming programs to no
longer work as expected, such as SPEC CPU 2006 464.h264ref and
416.gamess. A new option, -fno-aggressive-loop-optimizations, was added to disable this aggressive analysis. In some loops that have
known constant number of iterations, but undefined behavior is known
to occur in the loop before reaching or during the last iteration, GCC
will warn about the undefined behavior in the loop instead of deriving
lower upper bound of the number of iterations for the loop. The
warning can be disabled with -Wno-aggressive-loop-optimizations.
and indeed if we use -fno-aggressive-loop-optimizations the infinite loop behavior should cease and it does in all the cases I have tested.
The long answer starts with knowing that signed integer overflow is undefined behavior by looking at the draft C++ standard section 5 Expressions paragraph 4 which says:
If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values for
its type, the behavior is undefined. [ Note: most existing
implementations of C++ ignore integer overflows. Treatment of division
by zero, forming a remainder using a zero divisor, and all floating
point exceptions vary among machines, and is usually adjustable by a
library function. —end note
We know that the standard says undefined behavior is unpredictable from the note that come with the definition which says:
[ Note: Undefined behavior may be expected when this International
Standard omits any explicit definition of behavior or when a program
uses an erroneous construct or erroneous data. Permissible undefined
behavior ranges from ignoring the situation completely with
unpredictable results, to behaving during translation or program
execution in a documented manner characteristic of the environment
(with or without the issuance of a diagnostic message), to terminating
a translation or execution (with the issuance of a diagnostic
message). Many erroneous program constructs do not engender undefined
behavior; they are required to be diagnosed. —end note ]
But what in the world can the gcc optimizer be doing to turn this into an infinite loop? It sounds completely wacky. But thankfully gcc gives us a clue to figuring it out in the warning:
warning: iteration 3u invokes undefined behavior [-Waggressive-loop-optimizations]
std::cout << i*1000000000 << std::endl;
^
The clue is -Waggressive-loop-optimizations: what does that mean? Fortunately for us, this is not the first time this optimization has broken code in this way; we are lucky that John Regehr has documented a case in the article GCC pre-4.8 Breaks Broken SPEC 2006 Benchmarks, which shows the following code:
int d[16];
int SATD (void)
{
int satd = 0, dd, k;
for (dd=d[k=0]; k<16; dd=d[++k]) {
satd += (dd < 0 ? -dd : dd);
}
return satd;
}
the article says:
The undefined behavior is accessing d[16] just before exiting the
loop. In C99 it is legal to create a pointer to an element one
position past the end of the array, but that pointer must not be
dereferenced.
and later on says:
In detail, here is what’s going on. A C compiler, upon seeing d[++k],
is permitted to assume that the incremented value of k is within the
array bounds, since otherwise undefined behavior occurs. For the code
here, GCC can infer that k is in the range 0..15. A bit later, when
GCC sees k<16, it says to itself: “Aha– that expression is always
true, so we have an infinite loop.” The situation here, where the
compiler uses the assumption of well-definedness to infer a useful
dataflow fact,
So, in some cases, the compiler must be assuming that since signed integer overflow is undefined behavior, i must always be less than 4, and thus we have an infinite loop.
He explains this is very similar to the infamous Linux kernel null pointer check removal where in seeing this code:
struct foo *s = ...;
int x = s->f;
if (!s) return ERROR;
gcc inferred that since s was dereferenced in s->f;, and since dereferencing a null pointer is undefined behavior, s must not be null, and it therefore optimized away the if (!s) check on the next line.
The lesson here is that modern optimizers are very aggressive about exploiting undefined behavior, and will most likely only get more aggressive. Clearly, with just a few examples, we can see that the optimizer does things that seem completely unreasonable to a programmer but, in retrospect, make sense from the optimizer's perspective.
tl;dr The code generates a test that integer + positive integer == negative integer. Usually the optimizer does not optimize this out, but in the specific case of std::endl being used next, the compiler does optimize this test out. I haven't figured out what's special about endl yet.
From the assembly code at -O1 and higher levels, it is clear that gcc refactors the loop to:
i = 0;
do {
cout << i << endl;
i += NUMBER;
}
while (i != NUMBER * 4)
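(Worked through for the original constant: NUMBER * 4 = 4000000000, which wraps to -294967296 as a 32-bit signed value, so that is the constant an exit test would have to compare against; whether such a test is actually emitted is exactly what the experiments below look at.)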
The biggest value that works correctly is 715827882, i.e. floor(INT_MAX/3). The assembly snippet at -O1 is:
L4:
movsbl %al, %eax
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
addl $715827882, %esi
cmpl $-1431655768, %esi
jne L6
// fallthrough to "return" code
Note, the -1431655768 is 4 * 715827882 in 2's complement.
Hitting -O2 optimizes that to the following:
L4:
movsbl %al, %eax
addl $715827882, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
cmpl $-1431655768, %esi
jne L6
leal -8(%ebp), %esp
jne L6
// fallthrough to "return" code
So the optimization that has been made is merely that the addl was moved higher up.
If we recompile with 715827883 instead then the -O1 version is identical apart from the changed number and test value. However, -O2 then makes a change:
L4:
movsbl %al, %eax
addl $715827883, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
jmp L2
Where there was cmpl $-1431655764, %esi at -O1, that line has been removed for -O2. The optimizer must have decided that adding 715827883 to %esi can never equal -1431655764.
This is pretty puzzling. Adding that to INT_MIN+1 does generate the expected result, so the optimizer must have decided that %esi can never be INT_MIN+1 and I'm not sure why it would decide that.
In the working example it seems it'd be equally valid to conclude that adding 715827882 to a number cannot equal INT_MIN + 715827882 - 2 ! (this is only possible if wraparound does actually occur), yet it does not optimize the line out in that example.
The code I was using is:
#include <iostream>
#include <cstdio>
int main()
{
for (int i = 0; i < 4; ++i)
{
//volatile int j = i*715827883;
volatile int j = i*715827882;
printf("%d\n", j);
std::endl(std::cout);
}
}
If the std::endl(std::cout) is removed then the optimization no longer occurs. In fact replacing it with std::cout.put('\n'); std::flush(std::cout); also causes the optimization to not happen, even though std::endl is inlined.
The inlining of std::endl seems to affect the earlier part of the loop structure (I don't quite understand what that part is doing, but I'll post it here in case someone else does):
With original code and -O2:
L2:
movl %esi, 28(%esp)
movl 28(%esp), %eax
movl $LC0, (%esp)
movl %eax, 4(%esp)
call _printf
movl __ZSt4cout, %eax
movl -12(%eax), %eax
movl __ZSt4cout+124(%eax), %ebx
testl %ebx, %ebx
je L10
cmpb $0, 28(%ebx)
je L3
movzbl 39(%ebx), %eax
L4:
movsbl %al, %eax
addl $715827883, %esi
movl %eax, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl %eax, (%esp)
call __ZNSo5flushEv
jmp L2 // no test
With my manual inlining of std::endl, -O2:
L3:
movl %ebx, 28(%esp)
movl 28(%esp), %eax
addl $715827883, %ebx
movl $LC0, (%esp)
movl %eax, 4(%esp)
call _printf
movl $10, 4(%esp)
movl $__ZSt4cout, (%esp)
call __ZNSo3putEc
movl $__ZSt4cout, (%esp)
call __ZNSo5flushEv
cmpl $-1431655764, %ebx
jne L3
xorl %eax, %eax
One difference between these two is that %esi is used in the original, and %ebx in the second version; is there any difference in semantics defined between %esi and %ebx in general? (I don't know much about x86 assembly.)
Another example of this error being reported by gcc is when you have a loop that executes for a constant number of iterations, but you are using the counter variable as an index into an array that has fewer than that number of items, such as:
int a[50], x, i;
for (i = 0; i < 1000; i++) x = a[i];
The compiler can determine that this loop will try to access memory outside of the array 'a'. The compiler complains about this with this rather cryptic message:
iteration xxu invokes undefined behavior [-Werror=aggressive-loop-optimizations]
What I cannot get is why the value of i is broken by that overflow operation?
It seems that integer overflow occurs in the 4th iteration (for i = 3).
Signed integer overflow invokes undefined behavior. In this case nothing can be predicted: the loop may iterate only 4 times, it may run forever, or anything else may happen!
The result may vary from compiler to compiler, or even between different versions of the same compiler.
C++11 (N3337) §1.3.24, undefined behavior:
behavior for which this International Standard imposes no requirements
[ Note: Undefined behavior may be expected when this International Standard omits any explicit definition of behavior or when a program uses an erroneous construct or erroneous data. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). Many erroneous program constructs do not engender undefined behavior; they are required to be diagnosed.
—end note ]

Slow XOR operator

EDIT: Indeed, I had a weird error in my timing code leading to these results. When I fixed my error, the smart version ended up faster as expected. My timing code looked like this:
bool x = false;
before = now();
for (int i=0; i<N; ++i) {
x ^= smart_xor(A[i],B[i]);
}
after = now();
I had done the ^= to discourage my compiler from optimizing the for-loop away. But I think that the ^= somehow interacts strangely with the two xor functions. I changed my timing code to simply fill out an array of the xor results, and then do computation with that array outside of the timed code. And that fixed things.
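For the record, the revised harness looked roughly like this (same placeholder names as above; a sketch, not the exact code):
bool results[N]; // store every result instead of folding with ^=
before = now();
for (int i = 0; i < N; ++i)
results[i] = smart_xor(A[i], B[i]);
after = now();
// consume 'results' afterwards (sum it, print an element, ...) so the
// compiler cannot discard the timed loop as dead code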
Should I delete this question?
END EDIT
I defined two C++ functions as follows:
bool smart_xor(bool a, bool b) {
return a^b;
}
bool dumb_xor(bool a, bool b) {
return a?!b:b;
}
My timing tests indicate that dumb_xor() is slightly faster (1.31ns vs 1.90ns when inlined, 1.92ns vs 2.21ns when not inlined). This puzzles me, as the ^ operator should be a single machine operation. I'm wondering if anyone has an explanation.
The assembly looks like this (when not inlined):
.file "xor.cpp"
.text
.p2align 4,,15
.globl _Z9smart_xorbb
.type _Z9smart_xorbb, #function
_Z9smart_xorbb:
.LFB0:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %eax
xorl %edi, %eax
ret
.cfi_endproc
.LFE0:
.size _Z9smart_xorbb, .-_Z9smart_xorbb
.p2align 4,,15
.globl _Z8dumb_xorbb
.type _Z8dumb_xorbb, #function
_Z8dumb_xorbb:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
ret
.cfi_endproc
.LFE1:
.size _Z8dumb_xorbb, .-_Z8dumb_xorbb
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
I'm using g++ 4.4.3-4ubuntu5 on an Intel Xeon X5570. I compiled with -O3.
I don't think you benchmarked your code correctly.
We can see in the generated assembly that your smart_xor function is:
movl %esi, %eax
xorl %edi, %eax
while your dumb_xor function is:
movl %esi, %edx
movl %esi, %eax
xorl $1, %edx
testb %dil, %dil
cmovne %edx, %eax
So obviously, the first one will be faster.
If not, then you have benchmarking issues.
So you may want to tune your benchmarking code... And remember you'll need to run a lot of calls to have a good and meaningful average.
Given that your "dumb XOR" code is significantly longer (and most instructions are dependent on a previous one, so it won't run in parallel), I suspect that you have some sort of measurement error in your results.
The compiler needs to produce two instructions for the out-of-line version of "smart XOR" because the registers the data arrives in (EDI and ESI) are not the register the result must be returned in (EAX), so the data has to be moved. In an inlined version, the code can use whatever register the data is already in before the call and, if the surrounding code allows it, the result stays in that register.
Calling a function out-of-line probably takes at least as long to execute as the actual code inside the function.
It would also help if you showed the test harness that you use for benchmarking...

gcc stack optimization

Hi, I have a question about possible stack optimization by gcc (or g++).
Sample code under FreeBSD (does UNIX variance matter here?):
void main() {
char bing[100];
..
string buffer = ....;
..
}
What I found in gdb from a core dump of this program is that the address of bing is actually lower than that of buffer (namely, &bing[0] < &buffer). I think this is exactly the contrary of what I was told in the textbook. Could there be some compiler optimization that re-organizes the stack layout in such a way?
This seems to be the only possible explanation, but I'm not sure.
In case you're interested, the core dump is due to a buffer overflow from bing into buffer (but that also confirms &bing[0] < &buffer).
Thanks!
Compilers are free to organise stack frames (assuming they even use stacks) any way they wish.
They may do it for alignment reasons, or for performance reasons, or for no reason at all. You would be unwise to assume any specific order.
If you hadn't invoked undefined behavior by overflowing the buffer, you probably never would have known, and that's the way it should be.
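If you are curious what layout your particular build chose, printing the addresses makes it visible (illustrative only; the ordering can change with compiler, flags, or even unrelated edits to the function):
#include <cstdio>
#include <string>
int main()
{
char bing[100];
std::string buffer = "....";
std::printf("bing at %p\n", static_cast<void*>(bing));
std::printf("buffer at %p\n", static_cast<void*>(&buffer));
}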
A compiler can not only re-organise your variables, it can optimise them out of existence if it can establish they're not used. With the code:
#include <stdio.h>
int main (void) {
char bing[71];
int x = 7;
bing[0] = 11;
return 0;
}
Compare the normal assembler output:
main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $80, %esp
movl %gs:20, %eax
movl %eax, 76(%esp)
xorl %eax, %eax
movl $7, (%esp)
movb $11, 5(%esp)
movl $0, %eax
movl 76(%esp), %edx
xorl %gs:20, %edx
je .L3
call __stack_chk_fail
.L3:
leave
ret
with the insanely optimised:
main:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
popl %ebp
ret
Notice anything missing from the latter? Yes, there are no stack manipulations to create space for either bing or x. They don't exist. In fact, the entire code sequence boils down to:
set return code to 0.
return.
A compiler is free to layout local variables on the stack (or keep them in register or do something else with them) however it sees fit: the C and C++ language standards don't say anything about these implementation details, and neither does POSIX or UNIX. I doubt that your textbook told you otherwise, and if it did, I would look for a new textbook.