Coding for loop with an assembler in c++ function - c++

I have always been interested in assembler, but so far I haven't had a real chance to get to grips with it. Now that I have some time, I've begun writing small programs using assembler inside C++, but only trivial ones: define x, store it somewhere, and so forth. I wanted to implement a for loop in assembler but couldn't get it to work, so I'd like to ask whether anyone here has done this and could share an example. The loop I have in mind is
for(i=0;i<10;i++) { std::cout<< "A"; }
Does anyone have an idea how to implement this in assembler?
edit2: ISA x86

Here's the unoptimized output [1] of GCC for this code:
void some_function(void);
int main()
{
for (int i = 0; i < 137; ++i) { some_function(); }
}
movl $0, 12(%esp) // i = 0; i is stored at %esp + 12
jmp .L2
.L3:
call some_function // some_function()
addl $1, 12(%esp) // ++i
.L2:
cmpl $136, 12(%esp) // compare i to 136 ...
jle .L3 // ... and repeat loop if less-or-equal
movl $0, %eax // return 0
leave // --"--
With optimization -O3, the addition+comparison is turned into subtraction:
pushl %ebx // save %ebx
movl $137, %ebx // set %ebx to 137
// some unrelated parts
.L2:
call some_function // some_function()
subl $1, %ebx // subtract 1 from %ebx
jne .L2 // if not equal to 0, repeat loop
[1] The generated assembly can be examined by invoking GCC with the -S flag.

Try to rewrite the for loop in C++ using a goto and an if statement and you will have the basics for the assembly version.
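As a minimal sketch of that suggestion (the names print_with_goto and repeat_a and the stream parameter are made up for illustration, they are not from the question), the loop can be written with nothing but labels, an if, and goto. The comments show roughly which kind of instruction each line corresponds to in the unoptimized compiler output:

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes "A" `count` times using only if + goto,
// mirroring the "jump to the test, run the body, jump back" shape of the
// compiler-generated loop.
void print_with_goto(std::ostream& out, int count)
{
    int i = 0;                  // store 0 into i
    goto check;                 // unconditional jump to the loop test
body:
    out << "A";                 // loop body (a call in the assembly)
    ++i;                        // add 1 to i
check:
    if (i < count) goto body;   // compare, then conditional jump back
}

// Convenience wrapper so the result can be inspected as a string.
std::string repeat_a(int count)
{
    std::ostringstream buffer;
    print_with_goto(buffer, count);
    return buffer.str();
}
```

Calling print_with_goto(std::cout, 10) behaves like the for loop in the question. Once the control flow is in this form, translating each line into a mov, add, cmp or conditional jump is mostly mechanical.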

You could try the reverse: write the program in C++ or C and look at the disassembled code:
for ( int i = 0 ; i < 10 ; i++ )
00E714EE mov dword ptr [i],0
00E714F5 jmp wmain+30h (0E71500h)
00E714F7 mov eax,dword ptr [i]
00E714FA add eax,1
00E714FD mov dword ptr [i],eax
00E71500 cmp dword ptr [i],0Ah
00E71504 jge wmain+4Bh (0E7151Bh)
cout << "A";
00E71506 push offset string "A" (0E76800h)
00E7150B mov eax,dword ptr [__imp_std::cout (0E792ECh)]
00E71510 push eax
00E71511 call std::operator<<<std::char_traits<char> > (0E71159h)
00E71516 add esp,8
00E71519 jmp wmain+27h (0E714F7h)
then try to make sense of it.

Related

Why do I get SIGSEGV when trying to pass a char to my function and how to fix it?

I'm overloading operator= in my class and I need to determine whether my string contains digits or not. Unfortunately, I need to use C arrays (char*) mainly, for 2 reasons:
Training
Using std::string will require me to change 100+ lines of code.
When I try to pass a value from a char* array to my function, I get a SEGFAULT and I'm not sure why. I'm probably not using my pointer correctly.
I searched for hours but wasn't able to find a solution to this problem. Anyway, here's the code. If you need further information, please let me know in the comments.
#include <limits.h>
#include <iostream>
#include <string.h>
#include <exception>
struct notdig : public std::exception {
virtual const char* what() const throw() override{
return "int_huge: isdig(*(o + i)) returned false. \nYour declaration of an int_huge integer should contain decimal digits.";
}
};
class int_huge {
private:
char* buffera;
char* bufferb;
int numDig(unsigned long long int number){ //Gets the number of digits
int i = 0;
while(number){
number /= 10;
i++;
}
return i;
}
inline bool isdig( char character ) { //Checks whether character is digit or not
return ( '0' <= character && character <= '9' );
}
public:
int_huge(){
this->buffera = "0";
this->bufferb = "0";
}
void operator=(char* operand){
for (int i = 0; i < strlen(operand); i++){
if (!isdig(operand[i])){
throw notdig();
}
}
if (strlen(operand) >= numDig(ULLONG_MAX)){
if (strlen(operand) - numDig(ULLONG_MAX)){ //Equivalent to if (strlen(operand) != numDig(ULLONG_MAX))
int i = 0;
while (i < strlen(operand)-numDig(ULLONG_MAX)){
this->buffera[i] = operand[i];
i++;
}
this->bufferb = operand + i;
} else {
this->buffera[0] = operand[0];
this->bufferb = operand + 1;
}
} else {
this->buffera = "0";
this->bufferb = operand;
}
}
};
int main() {
int_huge object;
try {
object = "90";
} catch (std::exception &e) {
std::cout << e.what();
}
}
Disassembler results:
0x4019b4 push %ebp
0x4019b5 mov %esp,%ebp
0x4019b7 push %ebx
0x4019b8 sub $0x24,%esp
0x4019bb mov 0xc(%ebp),%eax
0x4019be mov %eax,(%esp)
0x4019c1 call 0x401350 <strlen>
0x4019c6 mov %eax,%ebx
0x4019c8 movl $0xffffffff,0x4(%esp)
0x4019d0 movl $0xffffffff,0x8(%esp)
0x4019d8 mov 0x8(%ebp),%eax
0x4019db mov %eax,(%esp)
0x4019de call 0x40195c <int_huge::numDig(unsigned long long)>
0x4019e3 cmp %eax,%ebx
0x4019e5 setae %al
0x4019e8 test %al,%al
0x4019ea je 0x401aa8 <int_huge::int_huge(char*)+244>
0x4019f0 mov 0xc(%ebp),%eax
0x4019f3 mov %eax,(%esp)
0x4019f6 call 0x401350 <strlen>
0x4019fb mov %eax,%ebx
0x4019fd movl $0xffffffff,0x4(%esp)
0x401a05 movl $0xffffffff,0x8(%esp)
0x401a0d mov 0x8(%ebp),%eax
0x401a10 mov %eax,(%esp)
0x401a13 call 0x40195c <int_huge::numDig(unsigned long long)>
0x401a18 cmp %eax,%ebx
0x401a1a setne %al
0x401a1d test %al,%al
0x401a1f je 0x401a8d <int_huge::int_huge(char*)+217>
0x401a21 movl $0x0,-0xc(%ebp)
0x401a28 mov 0xc(%ebp),%eax
0x401a2b mov %eax,(%esp)
0x401a2e call 0x401350 <strlen>
0x401a33 mov %eax,%ebx
0x401a35 movl $0xffffffff,0x4(%esp)
0x401a3d movl $0xffffffff,0x8(%esp)
0x401a45 mov 0x8(%ebp),%eax
0x401a48 mov %eax,(%esp)
0x401a4b call 0x40195c <int_huge::numDig(unsigned long long)>
0x401a50 sub %eax,%ebx
0x401a52 mov %ebx,%edx
0x401a54 mov -0xc(%ebp),%eax
0x401a57 cmp %eax,%edx
0x401a59 seta %al
0x401a5c test %al,%al
0x401a5e je 0x401a7d <int_huge::int_huge(char*)+201>
0x401a60 mov 0x8(%ebp),%eax
0x401a63 mov (%eax),%edx
0x401a65 mov -0xc(%ebp),%eax
0x401a68 add %eax,%edx
0x401a6a mov -0xc(%ebp),%ecx
0x401a6d mov 0xc(%ebp),%eax
0x401a70 add %ecx,%eax
0x401a72 movzbl (%eax),%eax
0x401a75 mov %al,(%edx) ;This is where the debugger stops, probably due to SIGSEGV.
0x401a77 addl $0x1,-0xc(%ebp)
0x401a7b jmp 0x401a28 <int_huge::int_huge(char*)+116>
0x401a7d mov -0xc(%ebp),%edx
0x401a80 mov 0xc(%ebp),%eax
0x401a83 add %eax,%edx
0x401a85 mov 0x8(%ebp),%eax
0x401a88 mov %edx,0x4(%eax)
0x401a8b jmp 0x401aba <int_huge::int_huge(char*)+262>
0x401a8d mov 0x8(%ebp),%eax
0x401a90 mov (%eax),%eax
0x401a92 mov 0xc(%ebp),%edx
0x401a95 movzbl (%edx),%edx
0x401a98 mov %dl,(%eax)
0x401a9a mov 0xc(%ebp),%eax
0x401a9d lea 0x1(%eax),%edx
0x401aa0 mov 0x8(%ebp),%eax
0x401aa3 mov %edx,0x4(%eax)
0x401aa6 jmp 0x401aba <int_huge::int_huge(char*)+262>
0x401aa8 mov 0x8(%ebp),%eax
0x401aab movl $0x4030d2,(%eax)
0x401ab1 mov 0x8(%ebp),%eax
0x401ab4 mov 0xc(%ebp),%edx
0x401ab7 mov %edx,0x4(%eax)
0x401aba nop
0x401abb add $0x24,%esp
0x401abe pop %ebx
0x401abf pop %ebp
0x401ac0 ret
In
int_huge(){
this->buffera = "0";
this->bufferb = "0";
}
and in main at
object = "90";
string literals, which may not be in writable storage, are assigned to non-const pointers to char. The compiler should warn you about this, or refuse to compile outright if it conforms to C++11 or a newer standard.
Where this blows up is in operator= here:
this->buffera[i] = operand[i];
and here:
this->buffera[0] = operand[0];
as the program attempts to write to storage that cannot be written.
Solution:
Use std::string. If that is off the table, and it sounds like it is, don't use string literals. Allocate a buffer in writable memory with new, copy the literal into the buffer, and then assign the buffer. And also remember that the buffers are of a fixed size. Trying to place a larger string in the buffer will end in disaster.
This will be a memory management headache and a potential Rule of Three horror show, so be careful.
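To make the "allocate, copy, then assign" advice concrete, here is a minimal sketch of a class that owns one writable C string. OwnedCString and its member names are hypothetical and are not the asker's int_huge class; the point is only to show new[], the copy, and the three special member functions that the Rule of Three asks for:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical owner of one writable, heap-allocated C string.
class OwnedCString {
public:
    // Copies the characters (a string literal is fine as the source).
    explicit OwnedCString(const char* text) : data_(duplicate(text)) {}

    // Rule of Three: copy constructor, copy assignment, destructor.
    OwnedCString(const OwnedCString& other) : data_(duplicate(other.data_)) {}

    OwnedCString& operator=(const OwnedCString& other)
    {
        if (this != &other) {
            char* fresh = duplicate(other.data_); // allocate before releasing
            delete[] data_;
            data_ = fresh;
        }
        return *this;
    }

    ~OwnedCString() { delete[] data_; }

    // Replaces the contents with a fresh copy sized for `text`.
    void assign(const char* text)
    {
        char* fresh = duplicate(text);
        delete[] data_;
        data_ = fresh;
    }

    char* data() { return data_; }              // writable access
    const char* c_str() const { return data_; }

private:
    // Allocates strlen(text) + 1 bytes and copies the terminator as well.
    static char* duplicate(const char* text)
    {
        assert(text != nullptr);
        const std::size_t length = std::strlen(text);
        char* copy = new char[length + 1];
        std::memcpy(copy, text, length + 1);
        return copy;
    }

    char* data_;
};
```

Because every assign() reallocates to fit the new text, writes through data() stay inside the buffer as long as they do not go past the current length. Writing buffera[i] for an i beyond that length would still be out of bounds, which is the fixed-size caveat mentioned above.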

Warning when comparing references to bool and int with MSVC 2015

The following code produces a warning with MSVC (2015 Update 3) - with /W4:
const bool& a = true;
const int& b = 1;
if(a == b)
C4805: '==': unsafe mix of type 'const bool' and type 'const int' in operation
but without the references it compiles cleanly.
const bool a = true;
const int b = 1;
if(a == b)
WHY?
EDIT:
just tested without const as well
bool a = true;
int b = 1;
if(a == b)
and the warning reappeared...
EDIT 2:
Compiling in Debug... I did have to silence C4127: conditional expression is constant in the const noref case though...
EDIT 3:
here are the disassemblies for the 3 cases:
const ref
0113BA92 in al,dx
0113BA93 sub esp,24h
0113BA96 mov eax,0CCCCCCCCh
0113BA9B mov dword ptr [ebp-24h],eax
0113BA9E mov dword ptr [ebp-20h],eax
0113BAA1 mov dword ptr [ebp-1Ch],eax
0113BAA4 mov dword ptr [ebp-18h],eax
0113BAA7 mov dword ptr [b],eax
0113BAAA mov dword ptr [ebp-10h],eax
0113BAAD mov dword ptr [ebp-0Ch],eax
0113BAB0 mov dword ptr [ebp-8],eax
0113BAB3 mov dword ptr [a],eax
const bool& a = true;
0113BAB6 mov byte ptr [ebp-9],1
0113BABA lea eax,[ebp-9]
0113BABD mov dword ptr [a],eax
const int& b = 1;
0113BAC0 mov dword ptr [ebp-1Ch],1
0113BAC7 lea ecx,[ebp-1Ch]
0113BACA mov dword ptr [b],ecx
if(a == b)
0113BACD mov edx,dword ptr [a]
0113BAD0 movzx eax,byte ptr [edx]
0113BAD3 mov ecx,dword ptr [b]
0113BAD6 cmp eax,dword ptr [ecx]
0113BAD8 jne DOCTEST_ANON_FUNC_2+5Fh (0113BAEFh)
throw 5;
0113BADA mov dword ptr [ebp-24h],5
0113BAE1 push offset __TI1H (0117318Ch)
0113BAE6 lea edx,[ebp-24h]
0113BAE9 push edx
0113BAEA call __CxxThrowException@8 (01164B04h)
const only
0137BA92 in al,dx
0137BA93 sub esp,0Ch
0137BA96 mov dword ptr [ebp-0Ch],0CCCCCCCCh
0137BA9D mov dword ptr [b],0CCCCCCCCh
0137BAA4 mov dword ptr [ebp-4],0CCCCCCCCh
const bool a = true;
0137BAAB mov byte ptr [a],1
const int b = 1;
0137BAAF mov dword ptr [b],1
if(a == b)
0137BAB6 mov eax,1
0137BABB test eax,eax
0137BABD je DOCTEST_ANON_FUNC_2+44h (0137BAD4h)
throw 5;
0137BABF mov dword ptr [ebp-0Ch],5
0137BAC6 push offset __TI1H (013B318Ch)
0137BACB lea ecx,[ebp-0Ch]
0137BACE push ecx
0137BACF call __CxxThrowException@8 (013A4B04h)
no const no ref
0012BA92 in al,dx
0012BA93 sub esp,0Ch
0012BA96 mov dword ptr [ebp-0Ch],0CCCCCCCCh
0012BA9D mov dword ptr [b],0CCCCCCCCh
0012BAA4 mov dword ptr [ebp-4],0CCCCCCCCh
bool a = true;
0012BAAB mov byte ptr [a],1
int b = 1;
0012BAAF mov dword ptr [b],1
if(a == b)
0012BAB6 movzx eax,byte ptr [a]
0012BABA cmp eax,dword ptr [b]
0012BABD jne DOCTEST_ANON_FUNC_2+44h (012BAD4h)
throw 5;
0012BABF mov dword ptr [ebp-0Ch],5
0012BAC6 push offset __TI1H (016318Ch)
0012BACB lea ecx,[ebp-0Ch]
0012BACE push ecx
0012BACF call __CxxThrowException@8 (0154B04h)
So it seems there is always a jump for the if statement, so it isn't optimized away, and yet no warning is diagnosed in the const case (debug config)...
This is what happens with GCC 6.1 for the following code:
int ref(int num) {
const bool& a = true;
const int& b = 1;
return a == b;
}
int noref(int num) {
const bool a = true;
const int b = 1;
return a == b;
}
int noref_noconst(int num) {
bool a = true;
int b = 1;
return a == b;
}
Assembly output:
ref(int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -36(%rbp)
movl $1, %eax
movb %al, -21(%rbp)
leaq -21(%rbp), %rax
movq %rax, -8(%rbp)
movl $1, %eax
movl %eax, -20(%rbp)
leaq -20(%rbp), %rax
movq %rax, -16(%rbp)
movq -8(%rbp), %rax
movzbl (%rax), %eax
movzbl %al, %edx
movq -16(%rbp), %rax
movl (%rax), %eax
cmpl %eax, %edx
sete %al
movzbl %al, %eax
popq %rbp
ret
noref(int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movb $1, -1(%rbp)
movl $1, -8(%rbp)
movl $1, %eax
popq %rbp
ret
noref_noconst(int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -20(%rbp)
movb $1, -1(%rbp)
movl $1, -8(%rbp)
movzbl -1(%rbp), %eax
cmpl -8(%rbp), %eax
sete %al
movzbl %al, %eax
popq %rbp
ret
The easiest thing to note here about the assembly is that in the case of noref there isn't even a jump; it's just a simple return 1.
The same thing is probably happening in MSVC, which then slips past the detection mechanism for warnings, and you just don't get one.
EDIT:
Let's look at your assembly, and the const only version in particular:
if(a == b)
0137BAB6 mov eax,1
0137BABB test eax,eax
0137BABD je DOCTEST_ANON_FUNC_2+44h (0137BAD4h)
While it is true that the if statement (je) is there, we can see that it doesn't actually compare the variables--which is the precise operation that triggers the warning. It simply puts 1 in a register and then compares the register with itself.
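For background on why mixing the two types gets flagged at all: in a == b the bool operand is promoted to int (0 or 1) before the comparison, so an int that is nonzero but not 1 compares unequal to true. The sketch below uses made-up function names and only illustrates that language rule; it makes no claim about which spellings a particular MSVC version warns on.

```cpp
// Compares after the bool -> int promotion, which is what `a == b` does
// implicitly when a is bool and b is int.
bool equal_after_promotion(bool flag, int value)
{
    return static_cast<int>(flag) == value;
}

// Compares truth values instead: any nonzero int counts as true.
bool equal_as_truth_values(bool flag, int value)
{
    return flag == (value != 0);
}
```

With flag == true and value == 2, the first function returns false and the second returns true, which is the kind of surprise the "unsafe mix" wording points at.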

OSX 64 bit C++ Disassembly line by line

I have been reading through the following series of articles: http://www.altdevblogaday.com/2011/11/09/a-low-level-curriculum-for-c-and-c
The disassembled code shown and the disassembled code I am managing to produce whilst running the same code vary quite significantly and I lack the understanding to explain the differences.
Is there anyone who can step through it line by line and explain what it's doing at each step? From the searching I have done, I get the feeling that the first few lines have something to do with frame pointers. There also seem to be a few extra lines in my disassembled code that ensure registers are empty before new values are placed into them (absent from the code in the article).
I am running this on OSX (the original author is using Windows) using the g++ compiler from within XCode 4. I am really clueless as to whether these variances are due to the OS, the architecture (32 bit vs 64 bit maybe?) or the compiler itself. It could even be the code, I guess: mine is wrapped inside the main function declaration whereas the original code makes no mention of this.
My code:
int main(int argc, const char * argv[])
{
int x = 1;
int y = 2;
int z = 0;
z = x + y;
}
My disassembled code:
0x100000f40: pushq %rbp
0x100000f41: movq %rsp, %rbp
0x100000f44: movl $0, %eax
0x100000f49: movl %edi, -4(%rbp)
0x100000f4c: movq %rsi, -16(%rbp)
0x100000f50: movl $1, -20(%rbp)
0x100000f57: movl $2, -24(%rbp)
0x100000f5e: movl $0, -28(%rbp)
0x100000f65: movl -20(%rbp), %edi
0x100000f68: addl -24(%rbp), %edi
0x100000f6b: movl %edi, -28(%rbp)
0x100000f6e: popq %rbp
0x100000f6f: ret
The disassembled code from the original article:
mov dword ptr [ebp-8],1
mov dword ptr [ebp-14h],2
mov dword ptr [ebp-20h],0
mov eax, dword ptr [ebp-8]
add eax, dword ptr [ebp-14h]
mov dword ptr [ebp-20h],eax
A full line by line breakdown would be extremely enlightening but any help in understanding this would be appreciated.
All of the code from the original article is in your code, there's just some extra stuff around it. This:
0x100000f50: movl $1, -20(%rbp)
0x100000f57: movl $2, -24(%rbp)
0x100000f5e: movl $0, -28(%rbp)
0x100000f65: movl -20(%rbp), %edi
0x100000f68: addl -24(%rbp), %edi
0x100000f6b: movl %edi, -28(%rbp)
Corresponds directly to the 6 instructions talked about in the article.
There are two major differences between your disassembled code and the article's code.
One is that the article is using the Intel assembler syntax, while your disassembled code is using the traditional Unix/AT&T assembler syntax. Some differences between the two are documented on Wikipedia.
The other difference is that the article omits the function prologue, which sets up the stack frame, and the function epilogue, which destroys the stack frame and returns to the caller. The program he's disassembling has to contain instructions to do those things, but his disassembler isn't showing them. (Actually the stack frame could and probably would be omitted if the optimizer were enabled, but it's clearly not enabled.)
There are also some minor differences: your code is using a slightly different layout for local variables, and your code is computing the sum in a different register.
On the Mac, g++ doesn't support emitting Intel mnemonics, but clang does:
:; clang -S -mllvm --x86-asm-syntax=intel t.c
:; cat t.s
.section __TEXT,__text,regular,pure_instructions
.globl _main
.align 4, 0x90
_main: ## @main
.cfi_startproc
## BB#0:
push RBP
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset rbp, -16
mov RBP, RSP
Ltmp4:
.cfi_def_cfa_register rbp
mov EAX, 0
mov DWORD PTR [RBP - 4], EDI
mov QWORD PTR [RBP - 16], RSI
mov DWORD PTR [RBP - 20], 1
mov DWORD PTR [RBP - 24], 2
mov DWORD PTR [RBP - 28], 0
mov EDI, DWORD PTR [RBP - 20]
add EDI, DWORD PTR [RBP - 24]
mov DWORD PTR [RBP - 28], EDI
pop RBP
ret
.cfi_endproc
.subsections_via_symbols
If you add the -g flag, the compiler will add debug information including source filenames and line numbers. It's too big to put here in its entirety, but this is the relevant part:
.loc 1 4 14 prologue_end ## t.c:4:14
Ltmp5:
mov DWORD PTR [RBP - 20], 1
.loc 1 5 14 ## t.c:5:14
mov DWORD PTR [RBP - 24], 2
.loc 1 6 14 ## t.c:6:14
mov DWORD PTR [RBP - 28], 0
.loc 1 8 5 ## t.c:8:5
mov EDI, DWORD PTR [RBP - 20]
add EDI, DWORD PTR [RBP - 24]
mov DWORD PTR [RBP - 28], EDI
First of all, the assembler listed as "from original article" uses Intel syntax, whereas the disassembled output in your post uses AT&T syntax. This explains why the order of arguments to instructions is "back to front" [let's not argue about which is right or wrong, ok?], why register names are prefixed by %, and why constants are prefixed by $. There is also a difference in how memory operands are written: dword ptr [reg+offs] in Intel syntax becomes an l suffix on the instruction plus offs(%reg) in AT&T syntax.
The 32-bit vs. 64-bit difference renames some of the registers: %rbp is the 64-bit counterpart of ebp in the article's code.
The actual offsets (e.g. -20) are different partly because the registers are bigger in 64-bit mode, but also because you have argc and argv as function arguments, which are stored on the stack at the start of the function. I have a feeling the original article is actually disassembling a different function than main.

Will a shorthand IF provide an efficiency boost compared to the default IF?

If I have a large data file containing integers of arbitrary length that needs to be sorted by its second field:
1 3 4 5
1 4 5 7
-1 34 56 7
124 58394 1384 -1938
1948 3848089 -14850 0
1048 01840 1039 888
//consider this is a LARGE file, the data goes on for quite some time
and I call upon qsort as my weapon of choice, will using the shorthand IF inside my sort function provide a significant performance boost to the overall time it takes to sort the data? Or is the shorthand IF only a convenience for organizing code?
num2 = atoi(Str);
num1 = atoi(Str2);
LoggNum = (num2 > num1) ? num2 : num1; //faster?
num2 = atoi(Str);
num1 = atoi(Str2);
if(num2 > num1) //or the same?
LoggNum = num2;
else
LoggNum = num1;
Any modern compiler will build identical code in these two cases; the difference is one of style and convenience only.
There is no answer to this question... the compiler is free to generate whatever code it feels suitable. That said, only a particularly stupid compiler would produce significantly different code for these cases. You should write whatever you feel best expresses how your program works... to me the ternary operator version is more concise and readable, but people not used to C and/or C++ may find the if/else version easier to understand at first.
If you're interested in things like this and want to get a better feel for code generation, you should learn how to invoke your compiler asking it to produce assembly language output showing the CPU instructions for the program. For example, GNU compilers accept a -S flag that produces .s assembly language files.
The only way to know is to profile. However, the ternary operator allows you to do an initialization of an object:
SomeType z = ( x < y) ? something : somethingElse; // copy initialization
With if-else, you would have to use an extra assignment to get the equivalent behaviour.
SomeType z; // default construction
if ( x < y) {
z = something; // assignment
} else {
z = somethingElse; // assignment
}
This could have an impact if the overhead of assigning to a SomeType is large.
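One way to see that difference is to count which special member functions run. The following is only a sketch: Counted and the two pick_ functions are hypothetical names introduced here, and the counters exist purely for illustration.

```cpp
// Hypothetical type that records which special member functions are called.
struct Counted {
    static int default_constructions;
    static int copy_constructions;
    static int copy_assignments;

    int value;

    Counted() : value(0) { ++default_constructions; }
    explicit Counted(int v) : value(v) {}
    Counted(const Counted& other) : value(other.value) { ++copy_constructions; }
    Counted& operator=(const Counted& other)
    {
        value = other.value;
        ++copy_assignments;
        return *this;
    }
};

int Counted::default_constructions = 0;
int Counted::copy_constructions = 0;
int Counted::copy_assignments = 0;

// Ternary form: z is initialized directly from the selected object.
int pick_with_ternary(const Counted& a, const Counted& b, bool first)
{
    Counted z = first ? a : b;   // one copy construction
    return z.value;
}

// if/else form: z is default-constructed first, then assigned.
int pick_with_if_else(const Counted& a, const Counted& b, bool first)
{
    Counted z;                   // default construction
    if (first) {
        z = a;                   // copy assignment
    } else {
        z = b;                   // copy assignment
    }
    return z.value;
}
```

After one call to each function, the ternary version has bumped only copy_constructions, while the if/else version has bumped default_constructions and copy_assignments. For an int the difference is irrelevant; for a type with an expensive default constructor or assignment operator it may not be.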
We tested these two statements in VC 2010:
The generated assembly for ?: is:
01207AE1 cmp byte ptr [ebp-101h],0
01207AE8 jne CTestSOFDlg::OnBnClickedButton1+47h (1207AF7h)
01207AEA push offset (1207BE5h)
01207AEF call @ILT+840(__RTC_UninitUse) (120134Dh)
01207AF4 add esp,4
01207AF7 cmp byte ptr [ebp-0F5h],0
01207AFE jne CTestSOFDlg::OnBnClickedButton1+5Dh (1207B0Dh)
01207B00 push offset (1207BE0h)
01207B05 call @ILT+840(__RTC_UninitUse) (120134Dh)
01207B0A add esp,4
01207B0D mov eax,dword ptr [num2]
01207B10 cmp eax,dword ptr [num1]
01207B13 jle CTestSOFDlg::OnBnClickedButton1+86h (1207B36h)
01207B15 cmp byte ptr [ebp-101h],0
01207B1C jne CTestSOFDlg::OnBnClickedButton1+7Bh (1207B2Bh)
01207B1E push offset (1207BE5h)
01207B23 call @ILT+840(__RTC_UninitUse) (120134Dh)
01207B28 add esp,4
01207B2B mov ecx,dword ptr [num2]
01207B2E mov dword ptr [ebp-10Ch],ecx
01207B34 jmp CTestSOFDlg::OnBnClickedButton1+0A5h (1207B55h)
01207B36 cmp byte ptr [ebp-0F5h],0
01207B3D jne CTestSOFDlg::OnBnClickedButton1+9Ch (1207B4Ch)
01207B3F push offset (1207BE0h)
01207B44 call @ILT+840(__RTC_UninitUse) (120134Dh)
01207B49 add esp,4
01207B4C mov edx,dword ptr [num1]
01207B4F mov dword ptr [ebp-10Ch],edx
01207B55 mov eax,dword ptr [ebp-10Ch]
01207B5B mov dword ptr [LoggNum],eax
and for the if/else statement:
01207B5E cmp byte ptr [ebp-101h],0
01207B65 jne CTestSOFDlg::OnBnClickedButton1+0C4h (1207B74h)
01207B67 push offset (1207BE5h)
01207B6C call @ILT+840(__RTC_UninitUse) (120134Dh)
01207B71 add esp,4
01207B74 cmp byte ptr [ebp-0F5h],0
01207B7B jne CTestSOFDlg::OnBnClickedButton1+0DAh (1207B8Ah)
01207B7D push offset (1207BE0h)
01207B82 call @ILT+840(__RTC_UninitUse) (120134Dh)
01207B87 add esp,4
01207B8A mov eax,dword ptr [num2]
01207B8D cmp eax,dword ptr [num1]
01207B90 jle CTestSOFDlg::OnBnClickedButton1+100h (1207BB0h)
01207B92 cmp byte ptr [ebp-101h],0
01207B99 jne CTestSOFDlg::OnBnClickedButton1+0F8h (1207BA8h)
01207B9B push offset (1207BE5h)
01207BA0 call @ILT+840(__RTC_UninitUse) (120134Dh)
01207BA5 add esp,4
01207BA8 mov eax,dword ptr [num2]
01207BAB mov dword ptr [LoggNum],eax
01207BB0 cmp byte ptr [ebp-0F5h],0
01207BB7 jne CTestSOFDlg::OnBnClickedButton1+116h (1207BC6h)
01207BB9 push offset (1207BE0h)
01207BBE call @ILT+840(__RTC_UninitUse) (120134Dh)
01207BC3 add esp,4
01207BC6 mov eax,dword ptr [num1]
01207BC9 mov dword ptr [LoggNum],eax
As you can see, the ?: version has two more instructions:
01207B4C mov edx,dword ptr [num1]
01207B4F mov dword ptr [ebp-10Ch],edx
which take more CPU ticks.
For a loop of 97000000 iterations, if/else is faster.
This particular optimization has bitten me in a real code base where changing from one form to the other in a locking function saved 10% execution time in a macro benchmark.
Let's test this with gcc 4.2.1 on MacOS:
int global;
void
foo(int x, int y)
{
if (x < y)
global = x;
else
global = y;
}
void
bar(int x, int y)
{
global = (x < y) ? x : y;
}
We get (cleaned up):
_foo:
cmpl %esi, %edi
jge L2
movq _global@GOTPCREL(%rip), %rax
movl %edi, (%rax)
ret
L2:
movq _global@GOTPCREL(%rip), %rax
movl %esi, (%rax)
ret
_bar:
cmpl %edi, %esi
cmovle %esi, %edi
movq _global@GOTPCREL(%rip), %rax
movl %edi, (%rax)
ret
Considering that the cmov instructions were added specifically to improve the performance of this kind of operation (by avoiding pipeline stalls), it is clear that "?:" in this particular compiler is faster.
Should the compiler generate the same code in both cases? Yes. Will it? No, of course not; no compiler is perfect. If you really care about performance (in most cases you shouldn't), test and see.
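If you do want to "test and see", a rough timing harness could look like the sketch below. Everything in it (larger_if, larger_ternary, time_picks, the choice of inputs) is an assumption made for illustration, and a toy loop like this says little about a real qsort comparator; treat it as a starting point rather than a benchmark methodology.

```cpp
#include <chrono>
#include <cstdint>

int larger_if(int a, int b)
{
    int result;
    if (a > b)
        result = a;
    else
        result = b;
    return result;
}

int larger_ternary(int a, int b)
{
    return (a > b) ? a : b;
}

// Calls `pick` `iterations` times on varying inputs and returns the elapsed
// time in nanoseconds. The running sum is handed back through `sink` so the
// result of the loop is actually used by the caller.
template <typename Pick>
std::int64_t time_picks(Pick pick, int iterations, std::int64_t& sink)
{
    const auto start = std::chrono::steady_clock::now();
    std::int64_t sum = 0;
    for (int i = 0; i < iterations; ++i) {
        sum += pick(i, iterations - i);
    }
    const auto stop = std::chrono::steady_clock::now();
    sink = sum;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Run each variant several times at the optimization level you actually ship with, and compare the generated assembly as well; if both functions compile to the same instructions, any timing difference is noise.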

GCC optimization trick, does it really work?

While looking at some questions on optimization, this accepted answer for the question on coding practices for most effective use of the optimizer piqued my curiosity. The assertion is that local variables should be used for computations in a function, not output arguments. It was suggested this would allow the compiler to make additional optimizations otherwise not possible.
So I wrote a simple bit of code for the example Foo class and compiled the code fragments with g++ v4.4 and -O2 to get assembler output (using -S). Only the loop portion of each assembler listing is shown below. On examination of the output, the loop seems nearly identical for both, differing only in one address: a pointer to the output argument in the first example, or to the local variable in the second.
There seems to be no change in the actual effect whether the local variable is used or not. So the question breaks down into 3 parts:
a) is GCC not doing additional optimization, even given the hint suggested;
b) is GCC successfully optimizing in both cases, but should not be;
c) is GCC successfully optimizing in both cases, and is producing compliant output as defined by the C++ standard?
Here is the original function, without the suggested change:
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo; i++)
{
barOut.munge(foo1, foo2[i]);
}
}
And corresponding assembly:
.L3:
movl (%esi), %eax
addl $1, %ebx
addl $4, %esi
movl %eax, 8(%esp)
movl (%edi), %eax
movl %eax, 4(%esp)
movl 20(%ebp), %eax ; Note address is that of the output argument
movl %eax, (%esp)
call _ZN3Foo5mungeES_S_
cmpl %ebx, 16(%ebp)
jg .L3
Here is the re-written function:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo; i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
And here is the compiler output for the function using a local variable:
.L3:
movl (%esi), %eax ; Load foo2[i] pointer into EAX
addl $1, %ebx ; increment i
addl $4, %esi ; increment foo2[i] (32-bit system, 8 on 64-bit systems)
movl %eax, 8(%esp) ; PUSH foo2[i] onto stack (careful! from EAX, not ESI)
movl (%edi), %eax ; Load foo1 pointer into EAX
movl %eax, 4(%esp) ; PUSH foo1
leal -28(%ebp), %eax ; Load barTemp pointer into EAX
movl %eax, (%esp) ; PUSH the this pointer for barTemp
call _ZN3Foo5mungeES_S_ ; munge()!
cmpl %ebx, 16(%ebp) ; i < numFoo
jg .L3 ; recall incrementing i by one coming into the loop
; so test if greater
The example given in that answer was not a very good one because of the call to an unknown function the compiler cannot reason much about. Here's a better example:
void FillOneA(int *array, int length, int& startIndex)
{
for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}
void FillOneB(int *array, int length, int& startIndex)
{
int localIndex = startIndex;
for (int i = 0; i < length; i++) array[localIndex + i] = 1;
}
The first version optimizes poorly because it needs to protect against the possibility that somebody called it as
int array[10] = { 0 };
FillOneA(array, 5, array[1]);
resulting in {1, 1, 0, 1, 1, 1, 0, 0, 0, 0 } since the iteration with i=1 modifies the startIndex parameter.
The second one doesn't need to worry about the possibility that the array[localIndex + i] = 1 will modify localIndex because localIndex is a local variable whose address has never been taken.
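The aliasing case can be reproduced directly. In this sketch the two Fill functions are copied from above, while run_aliased_a and run_aliased_b are hypothetical drivers added only to make the effect observable:

```cpp
#include <array>

void FillOneA(int* array, int length, int& startIndex)
{
    for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}

void FillOneB(int* array, int length, int& startIndex)
{
    int localIndex = startIndex;
    for (int i = 0; i < length; i++) array[localIndex + i] = 1;
}

// Both drivers pass values[1] as startIndex, so the index refers to an
// element that the loop itself overwrites.
std::array<int, 10> run_aliased_a()
{
    std::array<int, 10> values = {};
    FillOneA(values.data(), 5, values[1]);
    return values;
}

std::array<int, 10> run_aliased_b()
{
    std::array<int, 10> values = {};
    FillOneB(values.data(), 5, values[1]);
    return values;
}
```

run_aliased_a() yields {1, 1, 0, 1, 1, 1, 0, 0, 0, 0}, because the write at i=1 changes startIndex from 0 to 1, while run_aliased_b() yields {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}. The two functions are therefore not interchangeable in general, which is exactly why the compiler cannot turn the first into the second on its own.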
In assembly (Intel notation, because that's what I use):
FillOneA:
mov edx, [esp+8]
xor eax, eax
test edx, edx
jle $b
push esi
mov esi, [esp+16]
push edi
mov edi, [esp+12]
$a: mov ecx, [esi]
add ecx, eax
inc eax
mov [edi+ecx*4], 1
cmp eax, edx
jl $a
pop edi
pop esi
$b: ret
FillOneB:
mov ecx, [esp+8]
mov eax, [esp+12]
mov edx, [eax]
test ecx, ecx
jle $a
mov eax, [esp+4]
push edi
lea edi, [eax+edx*4]
mov eax, 1
rep stosd
pop edi
$a: ret
ADDED: Here's an example where the compiler's insight is into Bar, and not munge:
class Bar
{
public:
float getValue() const
{
return valueBase * boost;
}
private:
float valueBase;
float boost;
};
class Foo
{
public:
void munge(float adjustment);
};
void Adjust10A(Foo& foo, const Bar& bar)
{
for (int i = 0; i < 10; i++)
foo.munge(bar.getValue());
}
void Adjust10B(Foo& foo, const Bar& bar)
{
Bar localBar = bar;
for (int i = 0; i < 10; i++)
foo.munge(localBar.getValue());
}
The resulting code is
Adjust10A:
push ecx
push ebx
mov ebx, [esp+12] ;; foo
push esi
mov esi, [esp+20] ;; bar
push edi
mov edi, 10
$a: fld [esi+4] ;; bar.boost
push ecx
fmul [esi] ;; valueBase * boost
mov ecx, ebx
fstp [esp+16]
fld [esp+16]
fstp [esp]
call Foo::munge
dec edi
jne $a
pop edi
pop esi
pop ebx
pop ecx
ret 0
Adjust10B:
sub esp, 8
mov ecx, [esp+16] ;; bar
mov eax, [ecx] ;; bar.valueBase
mov [esp], eax ;; localBar.valueBase
fld [esp] ;; localBar.valueBase
mov eax, [ecx+4] ;; bar.boost
mov [esp+4], eax ;; localBar.boost
fmul [esp+4] ;; localBar.getValue()
push esi
push edi
mov edi, [esp+20] ;; foo
fstp [esp+24]
fld [esp+24] ;; cache localBar.getValue()
mov esi, 10 ;; loop counter
$a: push ecx
mov ecx, edi ;; foo
fstp [esp] ;; use cached value
call Foo::munge
fld [esp]
dec esi
jne $a ;; loop
pop edi
fstp ST(0)
pop esi
add esp, 8
ret 0
Observe that the inner loop in Adjust10A must recalculate the value since it must protect against the possibility that foo.munge changed bar.
That said, this style of optimization is not a slam dunk. (For example, we could've gotten the same effect by manually caching bar.getValue() into localValue.) It tends to be most helpful for vectorized operations, since those can be parallelized.
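For completeness, the manual-caching variant mentioned in the parenthesis might look like this. Adjust10C is a name made up here, Bar gets a constructor so it can be instantiated, and Foo::munge is given a stand-in body (the original only declares it) so the sketch is self-contained:

```cpp
class Bar
{
public:
    Bar(float base, float boostFactor) : valueBase(base), boost(boostFactor) {}
    float getValue() const
    {
        return valueBase * boost;
    }
private:
    float valueBase;
    float boost;
};

// Stand-in definition: accumulates the adjustments and counts the calls.
class Foo
{
public:
    void munge(float adjustment) { total += adjustment; ++calls; }
    float total = 0.0f;
    int calls = 0;
};

void Adjust10C(Foo& foo, const Bar& bar)
{
    const float localValue = bar.getValue(); // computed once, before the loop
    for (int i = 0; i < 10; i++)
        foo.munge(localValue);
}
```

Like Adjust10B, this is only equivalent to Adjust10A if munge never modifies bar. The programmer is making the promise that the compiler could not make by itself.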
First, I'm going to assume munge() cannot be inlined - that is, its definition is not in the same translation unit; you haven't provided complete source, so I can't be entirely sure, but it would explain these results.
Since foo1 is passed to munge as a reference, at the implementation level, the compiler just passes a pointer. If we just forward our argument, this is nice and fast - any aliasing issues are munge()'s problem - and have to be, since munge() can't assume anything about its arguments, and we can't assume anything about what munge() might do with them (since munge()'s definition is not available).
However, if we copy to a local variable, we must copy to a local variable and pass a pointer to the local variable. This is because munge() can observe a difference in behavior - if it takes the pointer to its first argument, it can see it's not equal to &foo1. Since munge()'s implementation is not in scope, the compiler can't assume it won't do this.
This local-variable-copy trick here thus ends up pessimizing, rather than optimizing - the optimizations it tries to help are not possible, because munge() cannot be inlined; for the same reason, the local variable actively hurts performance.
It would be instructive to try this again, making sure munge() is non-virtual and available as an inlinable function.
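A possible setup for that experiment is sketched below. The question does not show the Foo class, so this one is hypothetical: the by-value munge(Foo, Foo) signature is what the mangled name _ZN3Foo5mungeES_S_ decodes to, but the single int member and the body of munge are invented purely so that the compiler has something visible to inline.

```cpp
// Hypothetical Foo. munge is defined inside the class body, so it is
// non-virtual and its definition is visible at every call site.
struct Foo {
    int value;
    void munge(Foo a, Foo b) { value += a.value * b.value; }
};

void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i = 0; i < numFoo; i++)
    {
        barOut.munge(foo1, foo2[i]);
    }
}

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;   // work on a local copy
    for (int i = 0; i < numFoo; i++)
    {
        barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;       // publish the result once
}
```

Compiling this with -O2 -S and comparing the two loops should show whether the local copy lets the compiler keep the accumulator in a register instead of storing through barOut on every iteration. Note that the two functions are only equivalent when barOut does not alias foo1 or any element of foo2.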