GCC optimization trick, does it really work? - c++

While looking at some questions on optimization, this accepted answer for the question on coding practices for most effective use of the optimizer piqued my curiosity. The assertion is that local variables should be used for computations in a function, not output arguments. It was suggested this would allow the compiler to make additional optimizations otherwise not possible.
So I wrote a simple bit of code for the example Foo class and compiled the fragments with g++ v4.4 and -O2, producing assembler output (via -S). Just the loop portions of the listings are shown below. On examination, the two loops are nearly identical, differing only in one address: a pointer to the output argument in the first example, and to the local variable in the second.
There seems to be no change in the actual effect whether the local variable is used or not. So the question breaks down into three parts:
a) is GCC not doing additional optimization, even given the hint suggested;
b) is GCC successfully optimizing in both cases, but should not be;
c) is GCC successfully optimizing in both cases, and is producing compliant output as defined by the C++ standard?
Here is the unoptimized function:
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    for (int i = 0; i < numFoo; i++)
    {
        barOut.munge(foo1, foo2[i]);
    }
}
And corresponding assembly:
.L3:
movl (%esi), %eax
addl $1, %ebx
addl $4, %esi
movl %eax, 8(%esp)
movl (%edi), %eax
movl %eax, 4(%esp)
movl 20(%ebp), %eax ; Note address is that of the output argument
movl %eax, (%esp)
call _ZN3Foo5mungeES_S_
cmpl %ebx, 16(%ebp)
jg .L3
Here is the re-written function:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut;
    for (int i = 0; i < numFoo; i++)
    {
        barTemp.munge(foo1, foo2[i]);
    }
    barOut = barTemp;
}
And here is the compiler output for the function using a local variable:
.L3:
movl (%esi), %eax ; Load foo2[i] pointer into EAX
addl $1, %ebx ; increment i
addl $4, %esi ; advance to the next foo2 element (4 bytes here; 8 on 64-bit systems)
movl %eax, 8(%esp) ; PUSH foo2[i] onto stack (careful! from EAX, not ESI)
movl (%edi), %eax ; Load foo1 pointer into EAX
movl %eax, 4(%esp) ; PUSH foo1
leal -28(%ebp), %eax ; Load barTemp pointer into EAX
movl %eax, (%esp) ; PUSH the this pointer for barTemp
call _ZN3Foo5mungeES_S_ ; munge()!
cmpl %ebx, 16(%ebp) ; i < numFoo
jg .L3 ; recall incrementing i by one coming into the loop
; so test if greater

The example given in that answer was not a very good one because of the call to an unknown function the compiler cannot reason much about. Here's a better example:
void FillOneA(int *array, int length, int& startIndex)
{
    for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}

void FillOneB(int *array, int length, int& startIndex)
{
    int localIndex = startIndex;
    for (int i = 0; i < length; i++) array[localIndex + i] = 1;
}
The first version optimizes poorly because it needs to protect against the possibility that somebody called it as
int array[10] = { 0 };
FillOneA(array, 5, array[1]);
resulting in {1, 1, 0, 1, 1, 1, 0, 0, 0, 0}, since startIndex aliases array[1]: the write at i=1 changes startIndex from 0 to 1, shifting where the remaining iterations store.
The second one doesn't need to worry about the possibility that the array[localIndex + i] = 1 will modify localIndex because localIndex is a local variable whose address has never been taken.
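As an aside, GCC also offers a non-standard way to make the same no-aliasing promise without the local copy: the __restrict__ extension. A minimal sketch (FillOneC is a hypothetical name; the portable local-copy version above achieves the same effect without relying on an extension):

void FillOneC(int* __restrict__ array, int length, int& __restrict__ startIndex)
{
    // __restrict__ promises the compiler that startIndex never aliases array[]
    for (int i = 0; i < length; i++) array[startIndex + i] = 1;
}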
In assembly (Intel notation, because that's what I use):
FillOneA:
mov edx, [esp+8]
xor eax, eax
test edx, edx
jle $b
push esi
mov esi, [esp+16]
push edi
mov edi, [esp+12]
$a: mov ecx, [esi]
add ecx, eax
inc eax
mov [edi+ecx*4], 1
cmp eax, edx
jl $a
pop edi
pop esi
$b: ret
FillOneB:
mov ecx, [esp+8]
mov eax, [esp+12]
mov edx, [eax]
test ecx, ecx
jle $a
mov eax, [esp+4]
push edi
lea edi, [eax+edx*4]
mov eax, 1
rep stosd
pop edi
$a: ret
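Note the rep stosd in FillOneB: with aliasing ruled out, the whole loop collapses into a block fill. Conceptually (a sketch for illustration, not decompiler output), the optimizer has rewritten the body into something like:

#include <algorithm>

void FillOneB_equivalent(int *array, int length, int& startIndex)
{
    int localIndex = startIndex;
    if (length > 0)
        std::fill_n(array + localIndex, length, 1); // a block fill the compiler can emit as rep stosd
}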
ADDED: Here's an example where the compiler's insight is into Bar, and not munge:
class Bar
{
public:
    float getValue() const
    {
        return valueBase * boost;
    }
private:
    float valueBase;
    float boost;
};

class Foo
{
public:
    void munge(float adjustment);
};

void Adjust10A(Foo& foo, const Bar& bar)
{
    for (int i = 0; i < 10; i++)
        foo.munge(bar.getValue());
}

void Adjust10B(Foo& foo, const Bar& bar)
{
    Bar localBar = bar;
    for (int i = 0; i < 10; i++)
        foo.munge(localBar.getValue());
}
The resulting code is
Adjust10A:
push ecx
push ebx
mov ebx, [esp+12] ;; foo
push esi
mov esi, [esp+20] ;; bar
push edi
mov edi, 10
$a: fld [esi+4] ;; bar.boost
push ecx
fmul [esi] ;; valueBase * boost
mov ecx, ebx
fstp [esp+16]
fld [esp+16]
fstp [esp]
call Foo::munge
dec edi
jne $a
pop edi
pop esi
pop ebx
pop ecx
ret 0
Adjust10B:
sub esp, 8
mov ecx, [esp+16] ;; bar
mov eax, [ecx] ;; bar.valueBase
mov [esp], eax ;; localBar.valueBase
fld [esp] ;; localBar.valueBase
mov eax, [ecx+4] ;; bar.boost
mov [esp+4], eax ;; localBar.boost
fmul [esp+4] ;; localBar.getValue()
push esi
push edi
mov edi, [esp+20] ;; foo
fstp [esp+24]
fld [esp+24] ;; cache localBar.getValue()
mov esi, 10 ;; loop counter
$a: push ecx
mov ecx, edi ;; foo
fstp [esp] ;; use cached value
call Foo::munge
fld [esp]
dec esi
jne $a ;; loop
pop edi
fstp ST(0)
pop esi
add esp, 8
ret 0
Observe that the inner loop in Adjust10A must recalculate the value since it must protect against the possibility that foo.munge changed bar.
That said, this style of optimization is not a slam dunk. (For example, we could've gotten the same effect by manually caching bar.getValue() into a localValue variable, as sketched below.) It tends to be most helpful for vectorized operations, since those can be parallelized.
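For reference, a sketch of that manual-caching alternative (Adjust10C is a hypothetical name; Foo and Bar as defined above):

void Adjust10C(Foo& foo, const Bar& bar)
{
    const float localValue = bar.getValue(); // hoisted by hand, computed once
    for (int i = 0; i < 10; i++)
        foo.munge(localValue);
}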

First, I'm going to assume munge() cannot be inlined - that is, its definition is not in the same translation unit; you haven't provided complete source, so I can't be entirely sure, but it would explain these results.
Since foo1 is passed to munge() as a reference, at the implementation level the compiler just passes a pointer. If we simply forward our argument, this is nice and fast: any aliasing issues are munge()'s problem, and they have to be, since munge() can't assume anything about its arguments, and we can't assume anything about what munge() might do with them (its definition not being available).
However, if we copy to a local variable, we must actually perform the copy and then pass a pointer to the local. This is because munge() can observe a difference in behavior: if it takes the pointer to its first argument, it can see that it's not equal to &foo1. Since munge()'s implementation is not in scope, the compiler can't assume it won't do this.
This local-variable-copy trick here thus ends up pessimizing, rather than optimizing - the optimizations it tries to help are not possible, because munge() cannot be inlined; for the same reason, the local variable actively hurts performance.
It would be instructive to try this again, making sure munge() is non-virtual and available as an inlinable function.
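A minimal sketch of that experiment, assuming a hypothetical Foo whose munge() body is visible in the same translation unit; with the definition in view, the compiler can verify that munge() never stores its arguments' addresses, so barTemp can live in registers:

struct Foo {
    int state = 0;
    void munge(const Foo& a, const Foo& b) { state += a.state ^ b.state; } // inlinable: body visible
};

void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
    Foo barTemp = barOut; // now a genuine candidate for a register
    for (int i = 0; i < numFoo; i++)
        barTemp.munge(foo1, foo2[i]);
    barOut = barTemp;
}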

Related

check palindrome through recursion c++

bool palindrome(char arr[], int size) {
    if (size <= 1) {
        return true;
    }
    if (*(arr) == *(arr + size - 1)) {
        bool small_ans = palindrome(arr + 1, size - 2);
        return small_ans;
    }
    return false;
}
How efficient is this code for checking a palindrome?
There is a compiler optimization called tail-call optimization, which applies to tail recursion.
In your quite simple case the compiler spotted the opportunity to use it, and as a result silently turned your code into an iterative version:
https://godbolt.org/z/rsjaYhde6
palindrome(char*, int):
cmp esi, 1
jle .L4
movsx rax, esi
sub esi, 2
shr esi
lea rax, [rdi-1+rax]
lea edx, [rsi+1]
add rdx, rdi
jmp .L3
.L8:
add rdi, 1
sub rax, 1
cmp rdi, rdx
je .L4
.L3:
movzx ecx, BYTE PTR [rax]
cmp BYTE PTR [rdi], cl
je .L8
xor eax, eax
ret
.L4:
mov eax, 1
ret
Note:
there is no call instruction in the generated code, even though the source actually uses recursion
label .L8 implements the loop that replaced the recursion
Remember there is the "as-if rule": the compiler may transform your code in many ways to make it faster, as long as the observable behavior stays the same.
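Conceptually, the loop the compiler produced corresponds to this hand-written iterative form (a sketch for illustration, not decompiled output):

bool palindrome_iterative(char arr[], int size) {
    while (size > 1) {                 // the .L3/.L8 loop above
        if (arr[0] != arr[size - 1])   // cmp BYTE PTR [rdi], cl
            return false;              // xor eax, eax; ret
        arr += 1;                      // add rdi, 1
        size -= 2;                     // mirrors palindrome(arr+1, size-2)
    }
    return true;                       // mov eax, 1; ret
}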
In general, the recursive solution is often more elegant than iteration, but it usually needs more CPU time and memory, because the CPU has to push data onto the stack for every recursive call.
Especially in this case, iteration seems more efficient in both time and memory.
Try something like this:
bool palindrome(char arr[], int size)
{
    for (int i = 0; i < size; ++i) {
        if (arr[i] != arr[size-1-i])
            return false;
    }
    return true;
}

Successfully enabling -fno-finite-math-only on NaN removal method

In finding a bug which made everything turn into NaNs when running the optimized version of my code (compiling in g++ 4.8.2 and 4.9.3), I identified that the problem was the -Ofast option, specifically, the -ffinite-math-only flag it includes.
One part of the code involves reading floats from a FILE* using fscanf, and then replacing all NaNs with a numeric value. As could be expected, however, -ffinite-math-only kicks in, and removes these checks, thus leaving the NaNs.
In trying to solve this problem, I stumbled upon this, which suggested adding -fno-finite-math-only as a method attribute to disable the optimization on the specific method. The following illustrates the problem and the attempted fix (which doesn't actually fix it):
#include <cstdio>
#include <cmath>

__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}

int main(void){
    const size_t cnt = 10;
    float val[cnt];
    for(int i = 0; i < cnt; i++) scanf("%f", val + i);
    replaceNaN(val, cnt, -1.0f);
    for(int i = 0; i < cnt; i++) printf("%f ", val[i]);
    return 0;
}
The code does not act as desired if compiled/run using echo 1 2 3 4 5 6 7 8 nan 10 | (g++ -ffinite-math-only test.cpp -o test && ./test); specifically, it outputs a nan (which should have been replaced by a -1.0f). It behaves fine if the -ffinite-math-only flag is omitted. Shouldn't this work? Am I missing something with the syntax for attributes in gcc, or is this one of the aforementioned cases of "there being some trouble with some version of GCC related to this" (from the linked SO question)?
A few solutions I'm aware of, though I'd rather have something a bit cleaner/more portable:
Compile the code with -fno-finite-math-only (my interim solution): I suspect this optimization may be rather useful to the remainder of the program;
Manually look for the string "nan" in the input stream and replace the value there (the input reader is in an unrelated part of the library, so embedding this test there would be poor design).
Assume a specific floating point architecture and make my own isNaN: I may do this, but it's a bit hackish and non-portable.
Prefilter the data using a separately compiled program without the -ffinite-math-only flag, and then feed that into the main program: The added complexity of maintaining two binaries and getting them to talk to each other just isn't worth it.
Edit: As put in the accepted answer, it would seem this is a compiler "bug" in older versions of g++, such as 4.8.2 and 4.9.3, that is fixed in newer versions, such as 5.1 and 6.1.1.
If for some reason updating the compiler is not a reasonably easy option (e.g.: no root access), or adding this attribute to a single function still doesn't completely solve the NaN check problem, an alternate solution, if you can be certain that the code will always run in an IEEE754 floating point environment, is to manually check the bits of the float for a NaN signature.
The accepted answer suggests doing this using a bit field, however, the order in which the compiler places the elements in a bit field is non-standard, and in fact, changes between the older and newer versions of g++, even refusing to adhere to the desired positioning in older versions (4.8.2 and 4.9.3, always placing the mantissa first), regardless of the order in which they appear in the code.
A solution using bit manipulation, however, is guaranteed to work on all IEEE754-compliant compilers. Below is the implementation I ultimately used to solve my problem. It checks for IEEE754 compliance, and I've extended it to allow for doubles, as well as other more routine floating-point bit manipulations.
#include <cstdint>     // uint32_t, uint64_t
#include <limits>      // IEEE754 compliance test
#include <type_traits> // enable_if

template<
    typename T,
    typename = typename std::enable_if<std::is_floating_point<T>::value>::type,
    typename = typename std::enable_if<std::numeric_limits<T>::is_iec559>::type,
    typename u_t = typename std::conditional<std::is_same<T, float>::value, uint32_t, uint64_t>::type
>
struct IEEE754 {
    enum class WIDTH : size_t {
        SIGN = 1,
        EXPONENT = std::is_same<T, float>::value ? 8 : 11,
        MANTISSA = std::is_same<T, float>::value ? 23 : 52
    };
    enum class MASK : u_t {
        SIGN = (u_t)1 << (sizeof(u_t) * 8 - 1),
        EXPONENT = ((~(u_t)0) << (size_t)WIDTH::MANTISSA) ^ (u_t)MASK::SIGN,
        MANTISSA = (~(u_t)0) >> ((size_t)WIDTH::SIGN + (size_t)WIDTH::EXPONENT)
    };
    union {
        T f;
        u_t u;
    };
    IEEE754(T f) : f(f) {}
    // Note: the mask must be applied before the shift; >> binds tighter
    // than &, so the parentheses around (u & MASK) are required.
    inline u_t sign() const { return (u & (u_t)MASK::SIGN) >> ((size_t)WIDTH::EXPONENT + (size_t)WIDTH::MANTISSA); }
    inline u_t exponent() const { return (u & (u_t)MASK::EXPONENT) >> (size_t)WIDTH::MANTISSA; }
    inline u_t mantissa() const { return u & (u_t)MASK::MANTISSA; }
    inline bool isNan() const {
        return (mantissa() != 0) && ((u & (u_t)MASK::EXPONENT) == (u_t)MASK::EXPONENT);
    }
};

template<typename T>
inline IEEE754<T> toIEEE754(T val) { return IEEE754<T>(val); }
And the replaceNaN function now becomes:
void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++)
        if (toIEEE754(arr[i]).isNan()) arr[i] = newValue;
}
An inspection of the assembly of these functions reveals that, as expected, all masks become compile-time constants, leading to the following (seemingly) efficient code:
# In loop of replaceNaN
movl (%rcx), %eax # eax = arr[i]
testl $8388607, %eax # Check if mantissa is empty
je .L3 # Mantissa is zero: cannot be a NaN, continue loop
andl $2139095040, %eax # Mask leaves only exponent
cmpl $2139095040, %eax # Test if exponent is all 1s
jne .L3 # If it isn't, it's not a nan, so continue loop
This is one instruction less than with a working bit field solution (no shift), and the same number of registers are used (although it's tempting to say this alone makes it more efficient, there are other concerns such as pipelining which may make one solution more or less efficient than the other one).
Looks like a compiler bug to me. Up through GCC 4.9.2, the attribute is completely ignored. GCC 5.1 and later pay attention to it. Perhaps it's time to upgrade your compiler?
__attribute__((optimize("-fno-finite-math-only")))
void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
Compiled with -ffinite-math-only on GCC 4.9.2:
replaceNaN(float*, int, float):
rep ret
But with the exact same settings on GCC 5.1:
replaceNaN(float*, int, float):
test esi, esi
jle .L26
sub rsp, 8
call std::isnan(float) [clone .isra.0]
test al, al
je .L2
mov rax, rdi
and eax, 15
shr rax, 2
neg rax
and eax, 3
cmp eax, esi
cmova eax, esi
cmp esi, 6
jg .L28
mov eax, esi
.L5:
cmp eax, 1
movss DWORD PTR [rdi], xmm0
je .L16
cmp eax, 2
movss DWORD PTR [rdi+4], xmm0
je .L17
cmp eax, 3
movss DWORD PTR [rdi+8], xmm0
je .L18
cmp eax, 4
movss DWORD PTR [rdi+12], xmm0
je .L19
cmp eax, 5
movss DWORD PTR [rdi+16], xmm0
je .L20
movss DWORD PTR [rdi+20], xmm0
mov edx, 6
.L7:
cmp esi, eax
je .L2
.L6:
mov r9d, esi
lea r8d, [rsi-1]
mov r11d, eax
sub r9d, eax
lea ecx, [r9-4]
sub r8d, eax
shr ecx, 2
add ecx, 1
cmp r8d, 2
lea r10d, [0+rcx*4]
jbe .L9
movaps xmm1, xmm0
lea r8, [rdi+r11*4]
xor eax, eax
shufps xmm1, xmm1, 0
.L11:
add eax, 1
add r8, 16
movaps XMMWORD PTR [r8-16], xmm1
cmp ecx, eax
ja .L11
add edx, r10d
cmp r9d, r10d
je .L2
.L9:
movsx rax, edx
movss DWORD PTR [rdi+rax*4], xmm0
lea eax, [rdx+1]
cmp eax, esi
jge .L2
add edx, 2
cdqe
cmp esi, edx
movss DWORD PTR [rdi+rax*4], xmm0
jle .L2
movsx rdx, edx
movss DWORD PTR [rdi+rdx*4], xmm0
.L2:
add rsp, 8
.L26:
rep ret
.L28:
test eax, eax
jne .L5
xor edx, edx
jmp .L6
.L20:
mov edx, 5
jmp .L7
.L19:
mov edx, 4
jmp .L7
.L18:
mov edx, 3
jmp .L7
.L17:
mov edx, 2
jmp .L7
.L16:
mov edx, 1
jmp .L7
The output is similar, although not quite identical, on GCC 6.1.
Replacing the attribute with
#pragma GCC push_options
#pragma GCC optimize ("-fno-finite-math-only")
void replaceNaN(float * arr, int size, float newValue){
    for(int i = 0; i < size; i++) if (std::isnan(arr[i])) arr[i] = newValue;
}
#pragma GCC pop_options
makes absolutely no difference, so it is not simply a matter of the attribute being ignored. These older versions of the compiler clearly do not support controlling the floating-point optimization behavior at function-level granularity.
Note, however, that the generated code on GCC 5.1 and later is still significantly worse than compiling the function without the -ffinite-math-only switch:
replaceNaN(float*, int, float):
test esi, esi
jle .L1
lea eax, [rsi-1]
lea rax, [rdi+4+rax*4]
.L5:
movss xmm1, DWORD PTR [rdi]
ucomiss xmm1, xmm1
jnp .L6
movss DWORD PTR [rdi], xmm0
.L6:
add rdi, 4
cmp rdi, rax
jne .L5
rep ret
.L1:
rep ret
I have no idea why there is such a discrepancy. Something is badly throwing the compiler off its game; this is even worse code than you get with optimizations completely disabled. If I had to guess, I'd speculate it was the implementation of std::isnan. If this replaceNaN method is not speed-critical, then it probably doesn't matter. If you need to repeatedly parse values from a file, you might prefer to have a reasonably efficient implementation.
Personally, I would write my own non-portable implementation of std::isnan. The IEEE 754 formats are all quite well-documented, and assuming you thoroughly test and comment the code, I can't see the harm in this, unless you absolutely need the code to be portable to all different architectures. It will drive purists up the wall, but so should using non-standard options like -ffinite-math-only. For a single-precision float, something like:
#include <cstdint> // uint32_t

bool my_isnan(float value)
{
    union IEEE754_Single
    {
        float f;
        struct
        {
#if BIG_ENDIAN
            uint32_t sign     : 1;
            uint32_t exponent : 8;
            uint32_t mantissa : 23;
#else
            uint32_t mantissa : 23;
            uint32_t exponent : 8;
            uint32_t sign     : 1;
#endif
        } bits;
    } u = { value };

    // In the IEEE 754 representation, a float is NaN when
    // the mantissa is non-zero, and the exponent is all ones
    // (2^8 - 1 == 255).
    return (u.bits.mantissa != 0) && (u.bits.exponent == 255);
}
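A sketch of the drop-in use (the same replaceNaN as before, with the attribute removed):

void replaceNaN(float *arr, int size, float newValue)
{
    for (int i = 0; i < size; i++)
        if (my_isnan(arr[i])) arr[i] = newValue; // my_isnan survives -ffinite-math-only
}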
Now there is no need for annotations; just use my_isnan instead of std::isnan. This produces the following object code when compiled with -ffinite-math-only enabled:
replaceNaN(float*, int, float):
test esi, esi
jle .L6
lea eax, [rsi-1]
lea rdx, [rdi+4+rax*4]
.L13:
mov eax, DWORD PTR [rdi] ; get original floating-point value
test eax, 8388607 ; test if mantissa != 0
je .L9
shr eax, 16 ; test if exponent has all bits set
and ax, 32640
cmp ax, 32640
jne .L9
movss DWORD PTR [rdi], xmm0 ; set newValue if original was NaN
.L9:
add rdi, 4
cmp rdx, rdi
jne .L13
rep ret
.L6:
rep ret
The NaN check is slightly more complicated than a simple ucomiss followed by a test of the parity flag, but is guaranteed to be correct as long as your compiler adheres to the IEEE 754 standard. This works on all versions of GCC, and any other compiler.

Would the compiler optimize this expression into a temporary constant rather than resolve it every iteration?

I have the following loop:
for (unique_ptr<Surface>& child : Children)
{
    child->Gather(hit);
    if (hit && HitTestMode == HitTestMode::Content && child->MouseOver && !mouseOver)
    {
        mouseOver = true;
    }
}
I wonder if the compiler (I use Visual Studio 2013, targeting x64 on Win7 upwards) would optimize the expression
hit && HitTestMode == HitTestMode::Content
into a temporary constant and use that rather than resolve the expression every iteration, similar to me doing something like this:
bool contentMode = hit && HitTestMode == HitTestMode::Content;
for (unique_ptr<Surface>& child : Children)
{
    child->Gather(hit);
    if (contentMode && child->MouseOver && !mouseOver)
    {
        mouseOver = true;
    }
}
Bonus question:
Is checking for !mouseOver worth it (in order to skip the conditional mouseOver = true; if it has already been set)? Or is it faster to simply set it again regardless?
The answer to whether that optimization could even take place would depend on what hit, HitTestMode and HitTestMode::Content are and whether it's possible that they could be changed by the call to child->Gather().
If those identifiers are constants or local variables that the compiler can prove aren't modified, then it's entirely possible that the sub-expression hit && HitTestMode == HitTestMode::Content will be hoisted.
For example, consider the following compilable version of your example:
#include <memory>
#include <vector>

using namespace std;

class Surface
{
public:
    void Gather(bool hit);
    bool MouseOver;
};

enum class HitTestMode
{
    Content = 1,
    Foo = 3,
    Bar = 4,
};

extern HitTestMode hittestmode;

bool anyMiceOver( vector<unique_ptr<Surface> > & Children, bool hit)
{
    bool mouseOver = false;
    for (unique_ptr<Surface>& child : Children)
    {
        child->Gather(hit);
        if (hit && hittestmode == HitTestMode::Content && child->MouseOver && !mouseOver)
        {
            mouseOver = true;
        }
    }
    return mouseOver;
}
When compiled using g++ 4.8.1 (mingw) with the -O3 optimization option, you get the following snippet of code for the loop (annotations added):
mov rbx, QWORD PTR [rcx] ; Children.begin()
mov rsi, QWORD PTR 8[rcx] ; Children.end()
cmp rbx, rsi
je .L8 ; early exit if Children is empty
test dl, dl ; hit == 0?
movzx edi, dl
je .L5 ; then goto loop L5
xor ebp, ebp
mov r12d, 1
jmp .L7
.p2align 4,,10
.L6:
add rbx, 8
cmp rsi, rbx ; check for end of Children
je .L2
.L7:
mov rcx, QWORD PTR [rbx]
mov edx, edi
call _ZN7Surface6GatherEb ; call child->Gather(hit)
cmp DWORD PTR hittestmode[rip], 1 ; check hittestmode
jne .L6
mov rax, QWORD PTR [rbx] ; check child->MouseOver
cmp BYTE PTR [rax], 0
cmovne ebp, r12d ; set mouseOver appropriately
jmp .L6
.p2align 4,,10
.L5: ; loop L5 is run only when hit == 0
mov rcx, QWORD PTR [rbx] ; get next element in Children
mov edx, edi
add rbx, 8
call _ZN7Surface6GatherEb ; call child->Gather(hit)
cmp rsi, rbx
jne .L5
.L8:
xor ebp, ebp
.L2:
mov eax, ebp
add rsp, 32
pop rbx
pop rsi
pop rdi
pop rbp
pop r12
ret
You'll note that the check for hit has been hoisted out of the loop - and when it's false, a loop that does nothing but call child->Gather() is run instead.
If hittestmode is changed to a variable that's passed as a function argument, so that it's no longer subject to being changed by the call to child->Gather(hit), then the compiler will also hoist the check of hittestmode out of the loop and jump to the loop that does nothing but call child->Gather().
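A sketch of that variant (same declarations as above; passing hittestmode by value means nothing called inside the loop can modify it):

bool anyMiceOver( vector<unique_ptr<Surface> > & Children, bool hit, HitTestMode hittestmode)
{
    bool mouseOver = false;
    for (unique_ptr<Surface>& child : Children)
    {
        child->Gather(hit);
        if (hit && hittestmode == HitTestMode::Content && child->MouseOver && !mouseOver)
        {
            mouseOver = true; // hit and hittestmode checks can now both be hoisted
        }
    }
    return mouseOver;
}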
With a local hittestmode, -O2 will calculate hit && hittestmode == HitTestMode::Content before the loop and stash the result in a register, but it will still test that register on every iteration instead of splitting into separate loops that skip the test entirely.
Since you specifically asked about the VS2013 compiler (using the /Ox and /Ot options): it doesn't seem to hoist or optimize either of the checks (for hit or hittestmode) out of the loop - all it seems to do is keep the values of those variables in registers.

Will a shorthand IF provide an efficiency boost compared to the default IF?

If I have a large data file containing integers of arbitrary length that needs to be sorted by its second field:
1 3 4 5
1 4 5 7
-1 34 56 7
124 58394 1384 -1938
1948 3848089 -14850 0
1048 01840 1039 888
//consider this is a LARGE file, the data goes on for quite some time
and I call upon qsort to be my weapon of choice, will using the shorthand IF (the ternary operator) inside my sort function provide a significant performance boost to the overall time it takes to sort the data? Or is the shorthand IF only a convenience tool for organizing code?
num2 = atoi(Str);
num1 = atoi(Str2);
LoggNum = (num2 > num1) ? num2 : num1; //faster?

num2 = atoi(Str);
num1 = atoi(Str2);
if (num2 > num1) //or the same?
    LoggNum = num2;
else
    LoggNum = num1;
Any modern compiler will build identical code in these two cases; the difference is one of style and convenience only.
There is no answer to this question... the compiler is free to generate whatever code it feels suitable. That said, only a particularly stupid compiler would produce significantly different code for these cases. You should write whatever you feel best expresses how your program works... to me the ternary operator version is more concise and readable, but people not used to C and/or C++ may find the if/else version easier to understand at first.
If you're interested in things like this and want to get a better feel for code generation, you should learn how to invoke your compiler asking it to produce assembly language output showing the CPU instructions for the program. For example, GNU compilers accept a -S flag that produces .s assembly language files.
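For example (hypothetical file name):

g++ -O2 -S ternary.cpp    # writes ternary.s containing the generated assembly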
The only way to know is to profile. However, the ternary operator allows you to do an initialization of an object:
SomeType z = (x < y) ? something : somethingElse; // copy initialization
With if-else, you would have to use an extra assignment to get the equivalent behaviour.
SomeType z; // default construction
if (x < y) {
    z = something; // assignment
} else {
    z = somethingElse; // assignment
}
This could have an impact if the overhead of assigning to a SomeType is large.
We tested these two statements in VC 2010.
The generated assembly for ?: is:
01207AE1 cmp byte ptr [ebp-101h],0
01207AE8 jne CTestSOFDlg::OnBnClickedButton1+47h (1207AF7h)
01207AEA push offset (1207BE5h)
01207AEF call #ILT+840(__RTC_UninitUse) (120134Dh)
01207AF4 add esp,4
01207AF7 cmp byte ptr [ebp-0F5h],0
01207AFE jne CTestSOFDlg::OnBnClickedButton1+5Dh (1207B0Dh)
01207B00 push offset (1207BE0h)
01207B05 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B0A add esp,4
01207B0D mov eax,dword ptr [num2]
01207B10 cmp eax,dword ptr [num1]
01207B13 jle CTestSOFDlg::OnBnClickedButton1+86h (1207B36h)
01207B15 cmp byte ptr [ebp-101h],0
01207B1C jne CTestSOFDlg::OnBnClickedButton1+7Bh (1207B2Bh)
01207B1E push offset (1207BE5h)
01207B23 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B28 add esp,4
01207B2B mov ecx,dword ptr [num2]
01207B2E mov dword ptr [ebp-10Ch],ecx
01207B34 jmp CTestSOFDlg::OnBnClickedButton1+0A5h (1207B55h)
01207B36 cmp byte ptr [ebp-0F5h],0
01207B3D jne CTestSOFDlg::OnBnClickedButton1+9Ch (1207B4Ch)
01207B3F push offset (1207BE0h)
01207B44 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B49 add esp,4
01207B4C mov edx,dword ptr [num1]
01207B4F mov dword ptr [ebp-10Ch],edx
01207B55 mov eax,dword ptr [ebp-10Ch]
01207B5B mov dword ptr [LoggNum],eax
And for the if/else version:
01207B5E cmp byte ptr [ebp-101h],0
01207B65 jne CTestSOFDlg::OnBnClickedButton1+0C4h (1207B74h)
01207B67 push offset (1207BE5h)
01207B6C call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B71 add esp,4
01207B74 cmp byte ptr [ebp-0F5h],0
01207B7B jne CTestSOFDlg::OnBnClickedButton1+0DAh (1207B8Ah)
01207B7D push offset (1207BE0h)
01207B82 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B87 add esp,4
01207B8A mov eax,dword ptr [num2]
01207B8D cmp eax,dword ptr [num1]
01207B90 jle CTestSOFDlg::OnBnClickedButton1+100h (1207BB0h)
01207B92 cmp byte ptr [ebp-101h],0
01207B99 jne CTestSOFDlg::OnBnClickedButton1+0F8h (1207BA8h)
01207B9B push offset (1207BE5h)
01207BA0 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207BA5 add esp,4
01207BA8 mov eax,dword ptr [num2]
01207BAB mov dword ptr [LoggNum],eax
01207BB0 cmp byte ptr [ebp-0F5h],0
01207BB7 jne CTestSOFDlg::OnBnClickedButton1+116h (1207BC6h)
01207BB9 push offset (1207BE0h)
01207BBE call #ILT+840(__RTC_UninitUse) (120134Dh)
01207BC3 add esp,4
01207BC6 mov eax,dword ptr [num1]
01207BC9 mov dword ptr [LoggNum],eax
As you can see, the ?: version has two more asm instructions:
01207B4C mov edx,dword ptr [num1]
01207B4F mov dword ptr [ebp-10Ch],edx
which take more CPU ticks.
For a loop of 97,000,000 iterations, if/else was faster.
This particular optimization has bitten me in a real code base where changing from one form to the other in a locking function saved 10% execution time in a macro benchmark.
Let's test this with gcc 4.2.1 on MacOS:
int global;

void
foo(int x, int y)
{
    if (x < y)
        global = x;
    else
        global = y;
}

void
bar(int x, int y)
{
    global = (x < y) ? x : y;
}
We get (cleaned up):
_foo:
cmpl %esi, %edi
jge L2
movq _global#GOTPCREL(%rip), %rax
movl %edi, (%rax)
ret
L2:
movq _global#GOTPCREL(%rip), %rax
movl %esi, (%rax)
ret
_bar:
cmpl %edi, %esi
cmovle %esi, %edi
movq _global#GOTPCREL(%rip), %rax
movl %edi, (%rax)
ret
Considering that the cmov instructions were added specifically to improve the performance of this kind of operation (by avoiding pipeline stalls), it is clear that "?:" in this particular compiler is faster.
Should the compiler generate the same code in both cases? Yes. Will it? No of course not, no compiler is perfect. If you really care about performance (in most cases you shouldn't), test and see.

Coding for loop with an assembler in c++ function

I was always interested in assembler; however, so far I never had a real chance to tackle it properly. Now that I have some time, I've begun coding some small programs using assembler in C++, but only small ones, i.e., define x, store it somewhere, and so on. I wanted to implement a for loop in assembler, but I couldn't manage it, so I would like to ask if anyone here has ever done it; it would be nice if you could share. An example of such a function would be
for(i=0;i<10;i++) { std::cout<< "A"; }
Does anyone have an idea how to implement this in assembler?
edit2: ISA x86
Here's the unoptimized output [1] of GCC for this code:
void some_function(void);
int main()
{
for (int i = 0; i < 137; ++i) { some_function(); }
}
movl $0, 12(%esp) // i = 0; i is stored at %esp + 12
jmp .L2
.L3:
call some_function // some_function()
addl $1, 12(%esp) // ++i
.L2:
cmpl $136, 12(%esp) // compare i to 136 ...
jle .L3 // ... and repeat loop less-or-equal
movl $0, %eax // return 0
leave // --"--
With optimization -O3, the addition+comparison is turned into subtraction:
pushl %ebx // save %ebx
movl $137, %ebx // set %ebx to 137
// some unrelated parts
.L2:
call some_function // some_function()
subl $1, %ebx // subtract 1 from %ebx
jne .L2 // if not equal to 0, repeat loop
[1] The generated assembly can be examined by invoking GCC with the -S flag.
Try to rewrite the for loop in C++ using a goto and an if statement and you will have the basics for the assembly version.
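For instance, a sketch of that rewrite (illustrative C++ mirroring the jump structure of the assembly above):

#include <iostream>

int main()
{
    int i = 0;
    goto test;                 // jump to the condition check first
body:
    std::cout << "A";          // loop body
    ++i;                       // increment the counter
test:
    if (i < 10) goto body;     // conditional backward jump
    return 0;
}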
You could try the reverse - write the program in C++ or C and look at the disassembled code:
for ( int i = 0 ; i < 10 ; i++ )
00E714EE mov dword ptr [i],0
00E714F5 jmp wmain+30h (0E71500h)
00E714F7 mov eax,dword ptr [i]
00E714FA add eax,1
00E714FD mov dword ptr [i],eax
00E71500 cmp dword ptr [i],0Ah
00E71504 jge wmain+4Bh (0E7151Bh)
cout << "A";
00E71506 push offset string "A" (0E76800h)
00E7150B mov eax,dword ptr [__imp_std::cout (0E792ECh)]
00E71510 push eax
00E71511 call std::operator<<<std::char_traits<char> > (0E71159h)
00E71516 add esp,8
00E71519 jmp wmain+27h (0E714F7h)
then try to make sense of it.