I was experimenting with GCC, trying to convince it to assume that certain portions of code are unreachable so that it can take the opportunity to optimize. One of my experiments produced somewhat strange code. Here's the source:
#include <iostream>
#define UNREACHABLE {char* null=0; *null=0; return {};}
double test(double x)
{
    if(x==-1) return -1;
    else if(x==1) return 1;
    UNREACHABLE;
}

int main()
{
    std::cout << "Enter a number. Only +/- 1 is supported, otherwise I dunno what'll happen: ";
    double x;
    std::cin >> x;
    std::cout << "Here's what I got: " << test(x) << "\n";
}
Here's how I compiled it:
g++ -std=c++11 test.cpp -O3 -march=native -S -masm=intel -Wall -Wextra
And the code of test function looks like this:
_Z4testd:
.LFB1397:
.cfi_startproc
fld QWORD PTR [esp+4]
fld1
fchs
fld st(0)
fxch st(2)
fucomi st, st(2)
fstp st(2)
jp .L10
je .L11
fstp st(0)
jmp .L7
.L10:
fstp st(0)
.p2align 4,,10
.p2align 3
.L7:
fld1
fld st(0)
fxch st(2)
fucomip st, st(2)
fstp st(1)
jp .L12
je .L6
fstp st(0)
jmp .L8
.L12:
fstp st(0)
.p2align 4,,10
.p2align 3
.L8:
mov BYTE PTR ds:0, 0
ud2 // This is redundant, isn't it?..
.p2align 4,,10
.p2align 3
.L11:
fstp st(1)
.L6:
rep; ret
What makes me wonder here is the code at .L8. Namely, it already writes to the zero address, which guarantees a segmentation fault unless ds has some non-default selector. So why the additional ud2? Isn't writing to the zero address already a guaranteed crash? Or does GCC not trust that ds holds the default selector, and try to make a sure-fire crash?
Your code writes to address zero (NULL), which is in itself undefined behaviour. Undefined behaviour covers anything, including, most importantly for this case, "doing what you might imagine it would do" (in other words, actually writing to address zero rather than crashing). The compiler then decides to TELL you that by adding a UD2 instruction. It's also possible that it is there to protect against continuing, with further undefined behaviour, after returning from a signal handler.
Yes, most machines, under most circumstances, will crash on NULL accesses. But it's not 100% guaranteed, and as I said above, one can catch the segfault in a signal handler and then try to continue. It's really not a good idea to continue after trying to write to NULL, so the compiler adds UD2 to ensure you don't go on. It uses two more bytes of memory; beyond that I don't see what harm it does. After all, it's undefined what happens: if the compiler wished to do so, it could email random pictures from your filesystem to the Queen of England. I think UD2 is the better choice...
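As an aside: if the goal is simply to promise GCC that a path is unreachable (which is what the question's UNREACHABLE macro simulates), GCC has a builtin for exactly that. A minimal sketch of the test function using it, with no null-write trick needed:

double test(double x)
{
    if (x == -1) return -1;
    else if (x == 1) return 1;
    __builtin_unreachable(); // promise to GCC: control never reaches here
}

With this, GCC emits no store and no ud2 at all for the impossible path; of course, if the promise is ever broken at run time, the behaviour is undefined.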
It is interesting to note that LLVM does this by itself. I have no special detection of NIL pointer accesses, but my Pascal compiler compiles this:
program p;

var
    ptr : ^integer;

begin
    ptr := NIL;
    ptr^ := 42;
end.
into:
0000000000400880 <__PascalMain>:
400880: 55 push %rbp
400881: 48 89 e5 mov %rsp,%rbp
400884: 48 c7 05 19 18 20 00 movq $0x0,0x201819(%rip) # 6020a8 <ptr>
40088b: 00 00 00 00
40088f: 0f 0b ud2
I'm still trying to figure out where in LLVM this happens and try to understand the purpose of the UD2 instruction itself.
I think the answer is here, in llvm/lib/Transforms/Utils/Local.cpp
void llvm::changeToUnreachable(Instruction *I, bool UseLLVMTrap) {
  BasicBlock *BB = I->getParent();
  // Loop over all of the successors, removing BB's entry from any PHI
  // nodes.
  for (succ_iterator SI = succ_begin(BB), SE = succ_end(BB); SI != SE; ++SI)
    (*SI)->removePredecessor(BB);

  // Insert a call to llvm.trap right before this. This turns the undefined
  // behavior into a hard fail instead of falling through into random code.
  if (UseLLVMTrap) {
    Function *TrapFn =
      Intrinsic::getDeclaration(BB->getParent()->getParent(), Intrinsic::trap);
    CallInst *CallTrap = CallInst::Create(TrapFn, "", I);
    CallTrap->setDebugLoc(I->getDebugLoc());
  }
  new UnreachableInst(I->getContext(), I);

  // All instructions after this are dead.
  BasicBlock::iterator BBI = I->getIterator(), BBE = BB->end();
  while (BBI != BBE) {
    if (!BBI->use_empty())
      BBI->replaceAllUsesWith(UndefValue::get(BBI->getType()));
    BB->getInstList().erase(BBI++);
  }
}
Note in particular the comment in the middle, where it says "instead of falling through into random code". In your code there is no code following the NULL access, but imagine this:
void func()
{
    if (answer == 42)
    {
#if DEBUG
        // Intentionally crash to avoid formatting hard disk for now
        char *ptr = NULL;
        *ptr = 0;
#endif
        // Format hard disk.
        ... some code to format hard disk ...
    }
    printf("We haven't found the answer yet\n");
    ...
}
So this SHOULD crash, but if it doesn't, the compiler ensures that you do not continue past it. It makes UB crashes a little more obvious (and in this case prevents the hard disk from being formatted...).
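Incidentally, you can request this hard stop directly from C or C++: __builtin_trap() is the GCC/Clang builtin corresponding to llvm.trap, and on x86 it compiles to ud2. A tiny sketch:

void never_continue_past_here()
{
    __builtin_trap(); // emits ud2 on x86: a guaranteed illegal-instruction fault
}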
I tried to find out when this was introduced; the function itself originates in 2007, but at the time it wasn't used for exactly this purpose, which makes it hard to figure out why it is used this way.
Related
I'm trying to benchmark the fast inverse square root. The full code is here:
#include <benchmark/benchmark.h>
#include <math.h>

float number = 30942;

static void BM_FastInverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    // from wikipedia:
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y = number;
    i = * ( long * ) &y;
    i = 0x5f3759df - ( i >> 1 );
    y = * ( float * ) &i;
    y = y * ( threehalfs - ( x2 * y * y ) );
    // y = y * ( threehalfs - ( x2 * y * y ) );
    float result = y;
    benchmark::DoNotOptimize(result);
  }
}

static void BM_InverseSqrRoot(benchmark::State &state) {
  for (auto _ : state) {
    float result = 1 / sqrt(number);
    benchmark::DoNotOptimize(result);
  }
}

BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);
and here is the code in quick-bench if you want to run it yourself.
Compiling with GCC 11.2 and -O3, the BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). This is a 10x speed difference.
Here is the relevant Assembly (taken from quick-bench).
With GCC:
push %rbp
mov %rdi,%rbp
push %rbx
sub $0x18,%rsp
cmpb $0x0,0x1a(%rdi)
je 408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
callq 40a770 <benchmark::State::StartKeepRunning()>
408c84 add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
nopw 0x0(%rax,%rax,1)
408c98 mov 0x10(%rdi),%rbx
callq 40a770 <benchmark::State::StartKeepRunning()>
test %rbx,%rbx
je 408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
movss 0x1b386(%rip),%xmm4 # 424034 <_IO_stdin_used+0x34>
movss 0x1b382(%rip),%xmm3 # 424038 <_IO_stdin_used+0x38>
mov $0x5f3759df,%edx
nopl 0x0(%rax,%rax,1)
408cc0 movss 0x237a8(%rip),%xmm0 # 42c470 <number>
mov %edx,%ecx
movaps %xmm3,%xmm1
2.91% movss %xmm0,0xc(%rsp)
mulss %xmm4,%xmm0
mov 0xc(%rsp),%rax
44.70% sar %rax
3.27% sub %eax,%ecx
3.24% movd %ecx,%xmm2
3.27% mulss %xmm2,%xmm0
9.58% mulss %xmm2,%xmm0
10.00% subss %xmm0,%xmm1
10.03% mulss %xmm2,%xmm1
9.64% movss %xmm1,0x8(%rsp)
3.33% sub $0x1,%rbx
jne 408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
408d0a jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
With Clang:
push %rbp
push %r14
push %rbx
sub $0x10,%rsp
mov %rdi,%r14
mov 0x1a(%rdi),%bpl
mov 0x10(%rdi),%rbx
call 213a80 <benchmark::State::StartKeepRunning()>
test %bpl,%bpl
jne 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
test %rbx,%rbx
je 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
movss -0xf12e(%rip),%xmm0 # 203cec <_IO_stdin_used+0x8>
movss -0xf13a(%rip),%xmm1 # 203ce8 <_IO_stdin_used+0x4>
cs nopw 0x0(%rax,%rax,1)
nopl 0x0(%rax)
212e30 2.46% movd 0x3c308(%rip),%xmm2 # 24f140 <number>
4.83% movd %xmm2,%eax
8.07% mulss %xmm0,%xmm2
12.35% shr %eax
2.60% mov $0x5f3759df,%ecx
5.15% sub %eax,%ecx
8.02% movd %ecx,%xmm3
11.53% mulss %xmm3,%xmm2
3.16% mulss %xmm3,%xmm2
5.71% addss %xmm1,%xmm2
8.19% mulss %xmm3,%xmm2
16.44% movss %xmm2,0xc(%rsp)
11.50% add $0xffffffffffffffff,%rbx
jne 212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
212e69 mov %r14,%rdi
call 213af0 <benchmark::State::FinishKeepRunning()>
add $0x10,%rsp
pop %rbx
pop %r14
pop %rbp
212e79 ret
They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that supposedly takes 46% of the time? (I think it's just mislabelled, and the mulss, mov and sar together take 46%.) Anyway, I'm not familiar enough with assembly to tell what is causing such a huge performance difference.
Anyone know?
Just FYI: Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64? - no, it's obsoleted by SSE1 rsqrtss, which you can use with or without a Newton iteration.
As people pointed out in comments, you're using 64-bit long (since this is x86-64 on a non-Windows system), pointing it at a 32-bit float. So as well as a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.
Your perf report output confirms it; GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store forwarding stall costing much extra latency. And a throughput penalty, since actually you're testing throughput, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)
The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to figure out one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just one after a slow instruction even without a data dependency, I forget.)
Clang chooses to assume that the upper 32 bits are zero, thus movd %xmm0, %eax to just copy the register with an ALU uop, and the shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi so that isn't Windows clang.)
Bugfixed version: GCC and clang make similar asm
Fixing the code on the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows GCC and clang compile similarly with -Ofast, although not identical. e.g. GCC loads number twice, once into an integer register, once into XMM0. Clang loads once and uses movd eax, xmm2 to get it.
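For reference, the fixed type punning might look like the following sketch, using a 32-bit integer and memcpy (std::bit_cast<int32_t> does the same job in C++20); this is my own illustration, not necessarily the exact code behind the Godbolt link:

#include <cstdint>
#include <cstring>  // std::memcpy

float fast_inv_sqrt(float x)
{
    const float threehalfs = 1.5F;
    float x2 = x * 0.5F;

    int32_t i;
    std::memcpy(&i, &x, sizeof i);   // well-defined bit copy: float -> int32_t
    i = 0x5f3759df - (i >> 1);
    float y;
    std::memcpy(&y, &i, sizeof y);   // and back: int32_t -> float

    return y * (threehalfs - x2 * y * y);  // one Newton-Raphson step
}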
On QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), now GCC's BM_FastInverseSqrRoot is faster by a factor of 2 than the naive version, without -ffast-math
And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ overload resolution picking the float overload of sqrt. It does check for the number being >= 0 every time, since quick-bench doesn't allow compiling with -fno-math-errno, which would omit the check (the check exists so the libm function can be called to set errno on negative inputs). But that branch predicts perfectly, so the loop should still easily bottleneck on port 0 throughput (the div/sqrt unit).
Quick-bench does allow -Ofast, which is equivalent to -O3 -ffast-math, and that uses rsqrtss and a Newton iteration. (It would be even faster with FMA available, but quick-bench doesn't allow -march=native or anything; I guess one could use __attribute__((target("avx,fma"))).)
Quick-bench is now giving "Error or timeout" whether I use that or not, with "Permission error mapping pages." and a suggestion to use a smaller -m/--mmap_pages, so I can't test on that system.
rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster than or similar to Quake's fast invsqrt, but with about 23 bits of precision.
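If you want the rsqrtss-plus-Newton version explicitly rather than relying on -Ofast, here's a sketch with SSE1 intrinsics (the refinement constants are the standard Newton step for 1/sqrt):

#include <xmmintrin.h>  // SSE1 intrinsics

float rsqrt_newton(float x)
{
    // rsqrtss: hardware estimate, roughly 12 bits of precision
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    // one Newton-Raphson iteration refines it to about 23 bits
    return r * (1.5f - 0.5f * x * r * r);
}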
Is there any difference between the following two code snippets? Which one is better to use? Is one of them faster?
case 1:

int f(int x)
{
    int a;
    if(x)
        a = 42;
    else
        a = 0;
    return a;
}

case 2:

int f(int x)
{
    int a;
    if(x)
        a = 42;
    return a;
}
Actually, the two snippets can return totally different results, so there is no "better" here...
In case 2 you can return the uninitialized variable a, which may yield a garbage value other than zero...
if you mean this:

int f(int x)
{
    int a = 0;
    if(x)
        a = 42;
    return a;
}

then I would say that it is better, since it is more compact (but you are only saving an else; there isn't much computation wasted either way).
The question is not "which one is better". The question is "will both work?"
And the answer is no, they will not both work. One is correct, the other is out of the question. So, performance is not even an issue.
The following results in a having either an "indeterminate value" or an "unspecified value", as described in the C99 standard, sections 3.17.2 and 3.17.3 (probably the latter, though it is not clear to me):

int a;
if(x)
    a = 42;
return a;

This in turn means that the function will return an unspecified value, so there are absolutely no guarantees as to what value you will get.
If you are unlucky, you might get zero, and thus proceed to use the above terrible piece of code without knowing that you are bound to have lots of trouble with it later.
If you are lucky, you will get something like 0x719Ab32d right away, so you will immediately know that you messed up.
Any decent C compiler will give you a warning if you try to compile this, so the fact that you are asking this question means you do not have enough warnings enabled. Do not try to write C code (or any code) without the maximum possible number of warnings enabled; it never leads to any good. Find out how to enable warnings on your C compiler, and enable as many of them as you can.
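With GCC, for instance, -Wall already enables -Wuninitialized and -Wmaybe-uninitialized, which flag exactly this pattern (the analysis is most reliable with optimisation turned on); the file name here is just illustrative:

gcc -Wall -Wextra -O2 -c f.c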
Note: I assume the uninitialized a in your second snippet is a typo and that it's meant to be int a = 0.
We can use gdb to check the difference:
(gdb) list f1
19 {
20 int a;
21 if (x)
22 a = 42;
23 else
24 a = 0;
25 return a;
26 }
(gdb) list f2
28 int f2(int x)
29 {
30 int a = 0;
31 if (x)
32 a = 42;
33 return a;
34 }
Now let's look at the assembler code with -O3:
(gdb) disassemble f1
Dump of assembler code for function f1:
0x00000000004007a0 <+0>: cmp $0x1,%edi
0x00000000004007a3 <+3>: sbb %eax,%eax
0x00000000004007a5 <+5>: not %eax
0x00000000004007a7 <+7>: and $0x2a,%eax
0x00000000004007aa <+10>: retq
End of assembler dump.
(gdb) disassemble f2
Dump of assembler code for function f2:
0x00000000004007b0 <+0>: cmp $0x1,%edi
0x00000000004007b3 <+3>: sbb %eax,%eax
0x00000000004007b5 <+5>: not %eax
0x00000000004007b7 <+7>: and $0x2a,%eax
0x00000000004007ba <+10>: retq
End of assembler dump.
As you can see, there is no difference. Let us disable the optimizations with -O0:
(gdb) disassemble f1
Dump of assembler code for function f1:
0x00000000004006cd <+0>: push %rbp
0x00000000004006ce <+1>: mov %rsp,%rbp
0x00000000004006d1 <+4>: mov %edi,-0x14(%rbp)
0x00000000004006d4 <+7>: cmpl $0x0,-0x14(%rbp)
0x00000000004006d8 <+11>: je 0x4006e3 <f1+22>
0x00000000004006da <+13>: movl $0x2a,-0x4(%rbp)
0x00000000004006e1 <+20>: jmp 0x4006ea <f1+29>
0x00000000004006e3 <+22>: movl $0x0,-0x4(%rbp)
0x00000000004006ea <+29>: mov -0x4(%rbp),%eax
0x00000000004006ed <+32>: pop %rbp
0x00000000004006ee <+33>: retq
End of assembler dump.
(gdb) disassemble f2
Dump of assembler code for function f2:
0x00000000004006ef <+0>: push %rbp
0x00000000004006f0 <+1>: mov %rsp,%rbp
0x00000000004006f3 <+4>: mov %edi,-0x14(%rbp)
0x00000000004006f6 <+7>: movl $0x0,-0x4(%rbp)
0x00000000004006fd <+14>: cmpl $0x0,-0x14(%rbp)
0x0000000000400701 <+18>: je 0x40070a <f2+27>
0x0000000000400703 <+20>: movl $0x2a,-0x4(%rbp)
0x000000000040070a <+27>: mov -0x4(%rbp),%eax
0x000000000040070d <+30>: pop %rbp
0x000000000040070e <+31>: retq
End of assembler dump.
Now there is a difference, and on average (for random arguments x) the first version will be faster, as it executes one mov fewer than the second one.
In case your second code is

int f(int x)
{
    int a = 0;
    if(x)
        a = 42;
    return a;
}

and not

int f(int x)
{
    int a;
    if(x)
        a = 42;
    return a;
}

it doesn't matter. The compiler will convert both to the same optimized code.
I would prefer this (your second snippet):

int f(int x) {
    int a = 0;
    if (x) {
        a = 42;
    }
    return a;
}

Everything should always have braces: even if the if block only has one line now, I may add more later.
I don't put the braces on their own lines because that's a pointless waste of space.
I rarely put the block on the same line as the conditional, for readability's sake.
You don't need the extra variable a in either case; you can do something like this:

int f(int x)
{
    if(x)
        return 42;
    else
        return 0;
}
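Or, equivalently, as a one-liner with the ternary operator (purely a matter of taste):

int f(int x)
{
    return x ? 42 : 0;
}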
BTW in your second function you have not initialised a.
I am still struggling with g++ inline assembler and trying to understand how to use it.
I've adapted a piece of code from here: http://asm.sourceforge.net/articles/linasm.html (Quoted from the "Assembler Instructions with C Expressions Operands" section in gcc info files)
static inline uint32_t sum0() {
    uint32_t foo = 1, bar = 2;
    uint32_t ret;
    __asm__ __volatile__ (
        "add %%ebx,%%eax"
        : "=eax"(ret)            // output
        : "eax"(foo), "ebx"(bar) // input
        : "eax"                  // modify
    );
    return ret;
}
I've compiled with optimisations disabled:
g++ -Og -O0 inline1.cpp -o test
The disassembled code puzzles me:
(gdb) disassemble sum0
Dump of assembler code for function sum0():
0x00000000000009de <+0>: push %rbp ;prologue...
0x00000000000009df <+1>: mov %rsp,%rbp ;prologue...
0x00000000000009e2 <+4>: movl $0x1,-0xc(%rbp) ;initialize foo
0x00000000000009e9 <+11>: movl $0x2,-0x8(%rbp) ;initialize bar
0x00000000000009f0 <+18>: mov -0xc(%rbp),%edx ;
0x00000000000009f3 <+21>: mov -0x8(%rbp),%ecx ;
0x00000000000009f6 <+24>: mov %edx,-0x14(%rbp) ; This is unexpected
0x00000000000009f9 <+27>: movd -0x14(%rbp),%xmm1 ; why moving variables
0x00000000000009fe <+32>: mov %ecx,-0x14(%rbp) ; to extended registers?
0x0000000000000a01 <+35>: movd -0x14(%rbp),%xmm2 ;
0x0000000000000a06 <+40>: add %ebx,%eax ; add (as expected)
0x0000000000000a08 <+42>: movd %xmm0,%edx ; copying the wrong result to ret
0x0000000000000a0c <+46>: mov %edx,-0x4(%rbp) ; " " " " " "
0x0000000000000a0f <+49>: mov -0x4(%rbp),%eax ; " " " " " "
0x0000000000000a12 <+52>: pop %rbp ;
0x0000000000000a13 <+53>: retq
End of assembler dump.
As expected, the sum0() function returns the wrong value.
Any thoughts? What is going on? How to get it right?
-- EDIT --
Based on #MarcGlisse comment, I tried:
static inline uint32_t sum0() {
    uint32_t foo = 1, bar = 2;
    uint32_t ret;
    __asm__ __volatile__ (
        "add %%ebx,%%eax"
        : "=a"(ret)          // output
        : "a"(foo), "b"(bar) // input
        : "eax"              // modify
    );
    return ret;
}
It seems that the tutorial I've been following is misleading: "eax" in the output/input field does not name the register itself, but the three separate constraints e, a and x from the constraint table.
Anyway, I still do not get it right. The code above results in a compilation error: 'asm' operand has impossible constraints.
I don't see why.
The Extended inline assembly constraints for x86 are listed in the official documentation.
The complete documentation is also worth reading.
As you can see, the constraints are all single letters.
The constraint "eax" fo foo specifies three constraints:
a
The a register.
x
Any SSE register.
e
32-bit signed integer constant, or ...
Since you are telling GCC that eax is clobbered, it cannot put the input operand there, so it picks xmm0.
When the compiler selects the registers to use to represent the input operands, it does not use any of the clobbered registers
The proper constraint is simply "a".
You need to remove eax from the clobbers (by the way, it should be rax, due to the zeroing of the upper bits) and add "cc".
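Putting that advice together, a corrected version of the function might look like this sketch:

static inline uint32_t sum0() {
    uint32_t foo = 1, bar = 2;
    uint32_t ret;
    __asm__ __volatile__ (
        "add %%ebx,%%eax"
        : "=a"(ret)          // output: eax (so no clobber entry needed for it)
        : "a"(foo), "b"(bar) // inputs: eax and ebx
        : "cc"               // add modifies the flags
    );
    return ret;
}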
If I have a large data file containing integers of arbitrary length that needs to be sorted by its second field:
1 3 4 5
1 4 5 7
-1 34 56 7
124 58394 1384 -1938
1948 3848089 -14850 0
1048 01840 1039 888
//consider this is a LARGE file, the data goes on for quite some time
and I call upon qsort to be my weapon of choice, then inside my sort function, will using the shorthand IF (the ternary operator) provide a significant performance boost to the overall time it takes to sort the data? Or is the shorthand IF only a convenience tool for organizing code?
num2 = atoi(Str);
num1 = atoi(Str2);
LoggNum = (num2 > num1) ? num2 : num1; //faster?

num2 = atoi(Str);
num1 = atoi(Str2);
if(num2 > num1) //or the same?
    LoggNum = num2;
else
    LoggNum = num1;
Any modern compiler will build identical code in these two cases, the difference is one of style and convenience only.
There is no answer to this question... the compiler is free to generate whatever code it deems suitable. That said, only a particularly stupid compiler would produce significantly different code for these cases. You should write whatever you feel best expresses how your program works... to me the ternary-operator version is more concise and readable, but people not used to C and/or C++ may find the if/else version easier to understand at first.
If you're interested in things like this and want to get a better feel for code generation, you should learn how to invoke your compiler asking it to produce assembly language output showing the CPU instructions for the program. For example, GNU compilers accept a -S flag that produces .s assembly language files.
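For example, with GCC (the file name is just illustrative):

g++ -O2 -S compare.cpp

This writes compare.s, containing the assembly the compiler generated for each function.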
The only way to know is to profile. However, the ternary operator allows you to do an initialization of an object:

SomeType z = (x < y) ? something : somethingElse; // copy initialization

With if-else, you would have to use an extra assignment to get the equivalent behaviour:

SomeType z; // default construction
if (x < y) {
    z = something; // assignment
} else {
    z = somethingElse; // assignment
}

This could have an impact if the overhead of assigning to a SomeType is large.
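To make that concrete, here is a sketch with std::string standing in for SomeType (my own illustration):

#include <string>

std::string something = "forty-two", somethingElse = "zero";

std::string make_ternary(bool cond)
{
    std::string z = cond ? something : somethingElse;  // one copy construction
    return z;
}

std::string make_ifelse(bool cond)
{
    std::string z;             // default construction first...
    if (cond) {
        z = something;         // ...then an assignment on top of it
    } else {
        z = somethingElse;
    }
    return z;
}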
We tested these two statements in VC 2010:
The generated assembly for the ?: version is:
01207AE1 cmp byte ptr [ebp-101h],0
01207AE8 jne CTestSOFDlg::OnBnClickedButton1+47h (1207AF7h)
01207AEA push offset (1207BE5h)
01207AEF call #ILT+840(__RTC_UninitUse) (120134Dh)
01207AF4 add esp,4
01207AF7 cmp byte ptr [ebp-0F5h],0
01207AFE jne CTestSOFDlg::OnBnClickedButton1+5Dh (1207B0Dh)
01207B00 push offset (1207BE0h)
01207B05 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B0A add esp,4
01207B0D mov eax,dword ptr [num2]
01207B10 cmp eax,dword ptr [num1]
01207B13 jle CTestSOFDlg::OnBnClickedButton1+86h (1207B36h)
01207B15 cmp byte ptr [ebp-101h],0
01207B1C jne CTestSOFDlg::OnBnClickedButton1+7Bh (1207B2Bh)
01207B1E push offset (1207BE5h)
01207B23 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B28 add esp,4
01207B2B mov ecx,dword ptr [num2]
01207B2E mov dword ptr [ebp-10Ch],ecx
01207B34 jmp CTestSOFDlg::OnBnClickedButton1+0A5h (1207B55h)
01207B36 cmp byte ptr [ebp-0F5h],0
01207B3D jne CTestSOFDlg::OnBnClickedButton1+9Ch (1207B4Ch)
01207B3F push offset (1207BE0h)
01207B44 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B49 add esp,4
01207B4C mov edx,dword ptr [num1]
01207B4F mov dword ptr [ebp-10Ch],edx
01207B55 mov eax,dword ptr [ebp-10Ch]
01207B5B mov dword ptr [LoggNum],eax
and for the if/else version:
01207B5E cmp byte ptr [ebp-101h],0
01207B65 jne CTestSOFDlg::OnBnClickedButton1+0C4h (1207B74h)
01207B67 push offset (1207BE5h)
01207B6C call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B71 add esp,4
01207B74 cmp byte ptr [ebp-0F5h],0
01207B7B jne CTestSOFDlg::OnBnClickedButton1+0DAh (1207B8Ah)
01207B7D push offset (1207BE0h)
01207B82 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207B87 add esp,4
01207B8A mov eax,dword ptr [num2]
01207B8D cmp eax,dword ptr [num1]
01207B90 jle CTestSOFDlg::OnBnClickedButton1+100h (1207BB0h)
01207B92 cmp byte ptr [ebp-101h],0
01207B99 jne CTestSOFDlg::OnBnClickedButton1+0F8h (1207BA8h)
01207B9B push offset (1207BE5h)
01207BA0 call #ILT+840(__RTC_UninitUse) (120134Dh)
01207BA5 add esp,4
01207BA8 mov eax,dword ptr [num2]
01207BAB mov dword ptr [LoggNum],eax
01207BB0 cmp byte ptr [ebp-0F5h],0
01207BB7 jne CTestSOFDlg::OnBnClickedButton1+116h (1207BC6h)
01207BB9 push offset (1207BE0h)
01207BBE call #ILT+840(__RTC_UninitUse) (120134Dh)
01207BC3 add esp,4
01207BC6 mov eax,dword ptr [num1]
01207BC9 mov dword ptr [LoggNum],eax
As you can see, the ?: version has two more asm instructions:
01207B4C mov edx,dword ptr [num1]
01207B4F mov dword ptr [ebp-10Ch],edx
which take more CPU ticks.
For a loop of 97,000,000 iterations, if/else is faster.
This particular optimization has bitten me in a real code base where changing from one form to the other in a locking function saved 10% execution time in a macro benchmark.
Let's test this with gcc 4.2.1 on MacOS:
int global;

void
foo(int x, int y)
{
    if (x < y)
        global = x;
    else
        global = y;
}

void
bar(int x, int y)
{
    global = (x < y) ? x : y;
}
We get (cleaned up):
_foo:
    cmpl   %esi, %edi
    jge    L2
    movq   _global@GOTPCREL(%rip), %rax
    movl   %edi, (%rax)
    ret
L2:
    movq   _global@GOTPCREL(%rip), %rax
    movl   %esi, (%rax)
    ret
_bar:
    cmpl   %edi, %esi
    cmovle %esi, %edi
    movq   _global@GOTPCREL(%rip), %rax
    movl   %edi, (%rax)
    ret
Considering that the cmov instructions were added specifically to improve the performance of this kind of operation (by avoiding pipeline stalls), it is clear that "?:" in this particular compiler is faster.
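For what it's worth, std::min typically compiles to the same cmov pattern as bar; a sketch:

#include <algorithm>

int global;

void baz(int x, int y)
{
    global = std::min(x, y);  // compilers generally emit a cmov here, as in _bar
}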
Should the compiler generate the same code in both cases? Yes. Will it? No of course not, no compiler is perfect. If you really care about performance (in most cases you shouldn't), test and see.
So recently I was thinking about strcpy, and back to K&R, where they show the implementation as
while (*dst++ = *src++) ;
However I mistakenly transcribed it as:
while (*dst = *src)
{
src++; //technically could be ++src on these lines
dst++;
}
In any case, that got me thinking about whether the compiler would actually produce different code for these two. My initial thought was that they should be near identical: since src and dst are incremented but never used afterwards, I figured the compiler would know not to actually preserve them as "variables" in the produced machine code.
Using Windows 7 with VS 2010 C++ SP1, building in 32-bit Release mode (/O2), I got the disassembly for both of the above incarnations. To prevent the function itself from referencing the input directly and being inlined, I made a DLL with each of the functions. I have omitted the prologue and epilogue of the produced ASM.
while (*dst++ = *src++)
6EBB1003 8B 55 08 mov edx,dword ptr [src]
6EBB1006 8B 45 0C mov eax,dword ptr [dst]
6EBB1009 2B D0 sub edx,eax //prepare edx so that edx + eax always points to src
6EBB100B EB 03 jmp docopy+10h (6EBB1010h)
6EBB100D 8D 49 00 lea ecx,[ecx] //looks like align padding, never hit this line
6EBB1010 8A 0C 02 mov cl,byte ptr [edx+eax] //ptr [edx+ eax] points to char in src :loop begin
6EBB1013 88 08 mov byte ptr [eax],cl //copy char to dst
6EBB1015 40 inc eax //inc dst ptr (src is addressed as edx+eax, so it advances too)
6EBB1016 84 C9 test cl,cl // check for 0 (null terminator)
6EBB1018 75 F6 jne docopy+10h (6EBB1010h) //if not goto :loop begin
;
Above I have annotated the code: essentially a single loop, with only one check for null and one memory copy per character.
Now let's look at my mistaken version:
while (*dst = *src)
6EBB1003 8B 55 08 mov edx,dword ptr [src]
6EBB1006 8A 0A mov cl,byte ptr [edx]
6EBB1008 8B 45 0C mov eax,dword ptr [dst]
6EBB100B 88 08 mov byte ptr [eax],cl //copy 0th char to dst
6EBB100D 84 C9 test cl,cl //check for 0
6EBB100F 74 0D je docopy+1Eh (6EBB101Eh) // return if we encounter null terminator
6EBB1011 2B D0 sub edx,eax
6EBB1013 8A 4C 02 01 mov cl,byte ptr [edx+eax+1] //get +1th char :loop begin
{
src++;
dst++;
6EBB1017 40 inc eax
6EBB1018 88 08 mov byte ptr [eax],cl //copy above char to dst
6EBB101A 84 C9 test cl,cl //check for 0
6EBB101C 75 F5 jne docopy+13h (6EBB1013h) // if not goto :loop begin
}
In my version, I see that it first copies the 0th char to the destination, then checks for null, and then finally enters the loop, where it checks for null again. So the loop remains largely the same, but now it handles the 0th character before the loop. This is of course going to be sub-optimal compared with the first case.
I am wondering if anyone knows why the compiler is prevented from making the same (or nearly the same) code as in the first example. Is this an MS-compiler-specific issue, or possibly one with my compiler/linker settings?
Here is the full code, two files (one function replaces the other).
// in first dll project
__declspec(dllexport) void docopy(const char* src, char* dst)
{
    while (*dst++ = *src++);
}

// the alternative version, which replaces the one above
__declspec(dllexport) void docopy(const char* src, char* dst)
{
    while (*dst = *src)
    {
        ++src;
        ++dst;
    }
}

// separate main.cpp file calls docopy
void docopy(const char* src, char* dst);

char* source = "source";
char destination[100];

int main()
{
    docopy(source, destination);
}
Because in the first example the post-increment always happens, even if src starts out pointing to a null character. In the same starting situation, the second example would not increment the pointers.
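You can see the difference with an empty source string; a small self-contained sketch (my own illustration of the point):

#include <cassert>

int main()
{
    const char* src = "";
    char dst[1];

    const char* s = src;
    char* d = dst;
    while ((*d++ = *s++)) ;              // copies '\0' and still increments
    assert(s == src + 1 && d == dst + 1);

    s = src;
    d = dst;
    while ((*d = *s)) { ++s; ++d; }      // condition is false immediately
    assert(s == src && d == dst);        // pointers never moved
}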
Of course the compiler has other options. The "copy the first byte, then enter the loop if it wasn't 0" pattern is what gcc-4.5.1 produces with -O1. With -O2 and -O3, it produces:
.LFB0:
.cfi_startproc
jmp .L6 // jump to copy
.p2align 4,,10
.p2align 3
.L4:
addq $1, %rdi // increment pointers
addq $1, %rsi
.L6: // copy
movzbl (%rdi), %eax // get source byte
testb %al, %al // check for 0
movb %al, (%rsi) // move to dest
jne .L4 // loop if nonzero
rep
ret
.cfi_endproc
which is quite similar to what it produces for the K&R loop. Whether that's actually better I can't say, but it looks nicer.
Apart from the jump into the loop, the instructions for the K&R loop are exactly the same, just ordered differently:
.LFB0:
.cfi_startproc
.p2align 4,,10
.p2align 3
.L2:
movzbl (%rdi), %eax // get source byte
addq $1, %rdi // increment source pointer
movb %al, (%rsi) // move byte to dest
addq $1, %rsi // increment dest pointer
testb %al, %al // check for 0
jne .L2 // loop if nonzero
rep
ret
.cfi_endproc
Your second code doesn't "check for null again". In your second version the loop body works with the characters at address edx+eax+1 (note the +1 part), which are characters number 1, 2, 3 and so on. The prologue code works with character number 0. That means the code never checks the same character twice, as you seem to believe. There's no "again" there.
The second code is a bit more convoluted (the first iteration of the loop is effectively pulled out of it) since, as has already been explained, its functionality is different: the final values of the pointers differ between your first and your second version.