Why is fastcall slower than stdcall? - c++

I found following question: Is fastcall really faster?
No clear answers for x86 were given so I decided to create benchmark.
Here is the code:
#include <time.h>
int __fastcall func(int i)
{
return i + 5;
}
int _stdcall func2(int i)
{
return i + 5;
}
int _tmain(int argc, _TCHAR* argv[])
{
int iter = 100;
int x = 0;
clock_t t = clock();
for (int j = 0; j <= iter;j++)
for (int i = 0; i <= 1000000;i++)
x = func(x & 0xFF);
printf("%d\n", clock() - t);
t = clock();
for (int j = 0; j <= iter;j++)
for (int i = 0; i <= 1000000;i++)
x = func2(x & 0xFF);
printf("%d\n", clock() - t);
printf("%d", x);
return 0;
}
In case of no optimization result in MSVC 10 is:
4671
4414
With max optimization fastcall is sometimes faster, but I guess it is multitasking noise. Here is average result (with iter = 5000)
6638
6487
stdcall looks faster!
Here are results for GCC: http://ideone.com/hHcfP
Again, fastcall lost race.
Here is part of disassembly in case of fastcall:
011917EF pop ecx
011917F0 mov dword ptr [ebp-8],ecx
return i + 5;
011917F3 mov eax,dword ptr [i]
011917F6 add eax,5
this is for stdcall:
return i + 5;
0119184E mov eax,dword ptr [i]
01191851 add eax,5
i is passed via ECX, instead of stack, but saved into stack in the body! So all the effect is neglected! this simple function can be calculated using only registers! And there is no real difference between them.
Can anyone explain what is reason for fastcall? Why doesn't it give speedup?
Edit: With optimization it turned out that both functions are inlined. When I turned inlining off they both are compiled to:
00B71000 add eax,5
00B71003 ret
This looks like great optimization, indeed, but it doesn't respect calling conventions at all, so test is not fair.

__fastcall was introduced a long time ago. At the time, Watcom C++ was beating Microsoft for optimization, and a number of reviewers picked out its register-based calling convention as one (possible) reason why.
Microsoft responded by adding __fastcall, and they've retained it ever since -- but I don't think they ever did much more than enough to be able to say "we have a register-based calling convention too..." Their preference (especially since the 32-bit migration) seems to be for __stdcall. They've put quite a bit of work into improving their code generation with it, but (apparently) not nearly so much with __fastcall. With on-chip caching, the gain from passing things in registers isn't nearly as great as it was then anyway.

Your micro-benchmark produces irrelevant results. __fastcall has specific uses with SSE instructions (see XNAMath) , clock() is not even remotely a suitable timer for benchmarking, and __fastcall exists for multiple platforms like Itanium and some others too, not just for x86, and in addition, your whole program can be effectively optimized to nothing except the printf statements, making the relative performance of __fastcall or __stdcall very, very irrelevant.
Finally, you've forgotten to realize the main reason that a lot of things are done the way they are- legacy. __fastcall may well have been significant before compiler inlining became as aggressive and effective as it is today, and no compiler will remove __fastcall as there will be programs that depend on it. That makes __fastcall a fact of life.

Several reasons
At least in most decent x86 implementations, register renaming is in effect -- the effort that looks like's being saved by using a register instead of memory might not be doing anything on the hardware level.
Sure, you save some stack movement effort with __fastcall, but you reduce the number of registers available for use in the function without modifying the stack.
Most of the time where __fastcall would be faster the function is simple enough to be inlined in any case, which means that it really doesn't matter in real software. (Which is one of the main reasons why __fastcall is not often used)
Side note: What was wrong with Anon's answer?

Fastcall is really only meaningful if you use full optimization (otherwise its effects will be buried by other artifacts), but as you note, with full optimization, the functions will be inlined and you won't see the effect of calling conventions at all.
So to actually test this, you need to make the functions extern declarations with the actual definitions in a separate source file that you compile separately and link with your main routine. When you do that, you'll see that __fastcall is consistently ~25% faster with small functions like this.
The upshot is that __fastcall is really only useful if you have a lot of calls to tiny functions that can't be inlined because they need to be separately compiled.
Edit
So with separate compilation and gcc -O3 -fomit-frame-pointer -m32 I see quite different code for the two functions:
func:
leal 5(%ecx), %eax
ret
func2:
movl 4(%esp), %eax
addl $5, %eax
ret
Running that with iter=5000 consistently gives me results close to
9990000
14160000
indicating that the fastcall version is a shade over 40% faster.

I compiled the two function with i686-w64-mingw32-gcc -O2 -fno-inline fastcall.c. This is the assembly generated for func and func2:
#func#4:
leal 5(%ecx), %eax
ret
_func2#4:
movl 4(%esp), %eax
addl $5, %eax
ret $4
__fastcall really looks faster to me. func2 needs to load the input parameter from the stack. func can simply perform a %eax := %ecx + 5 and then returns to the caller.
Furthermore, the output of your programming is typically like this on my system:
2560
3250
154
So __fastcall does not only look faster, it is faster.
Also note that on x86_64 (or x64 as Microsoft calls it), __fastcall is the default and the old non-fastcall convetion does not exist anymore.
http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions
By making __fastcall the default, x86_64 catches up with other architectures (such as ARM), where passing arguments in registers is also default.

Fastcall itself as a register based calling convention isn't great on x86 because there aren't that many named registers available and by using key registers for passing the values, all you're doing is potentially forcing the calling code to push other values onto the stack and forcing the called function if it is of sufficient complexity to do the same. Essentially from an assembly language perspective, you're increasing the pressure on those named registers and explicitly using stack operations to compensate. So even if the CPU has far more registers available for renaming, it isn't going to refactor the explicit stack operations that have to be inserted.
On the other hand, on more "register rich" architectures like x86-64, register based calling conventions (not exactly the same as fastcall of old, but same concept) are the norm and are used across the board. In other words, once we got out of a few named registers architecture like x86, to something with more register space, fastcall was back in a big way and became the default and really only way used today.

Note: even edited in May 2017 by the OP, this question and answers are likely to be way out of date and not relevant any more by 2019 (if not a few years ago earlier).
A) By at minimal MSVC 2017 (and 2019 released recently). most of the code is going to be inlined in optimized release builds anyhow. Probably the only function body you will see in the entire example now is "_tmain()".
That is unless you specifically do some tricks like declaring the functions as "volatile" and/or wrapping the test functions in pragmas that turn off some optimizations.
B) The latest generation of desktop CPUs (the assumption here) are much improved since the circa 2010 generation. They are much are better at caching the stack, memory alignment matters less, etc.
But don't take my word for it. Load up your executable in a dissembler (IDA Pro, MSVC debugger, etc.) and look for your self (a good way to learn).
Now it would be interesting to see what the performance would be over a large 32bit application. Example, take the last Open sourced DOOM game release and make builds with stdcall and _fastcall and look for framerate differences. And get metrics off of any built-in performance reporting features it has et al.

It does not appear that __fastcall actually indicates that it will be faster. Seems like all you're doing is moving the first fiew variables into registers before making the call to the function. This most likely makes your function call slower since it must move the variables into those registers first. Wikipedia had a pretty good write up about what exactly Fast Call is and how it is implemented.

Related

to how many subroutines can x86 processors call? [duplicate]

This question already has answers here:
C/C++ maximum stack size of program on mainstream OSes
(7 answers)
Closed 1 year ago.
i'm writing a small program to print a polygon with printf("\219") to just see if whatever i'm doing is right for my kernel. but it needs to call many functions and i don't know whether x86 processors can accept that many subroutines and i can't find results in google. so my question is will it accept so many function calls and what is the maximum. (i mean something like this:-)
function a() {b();}
function b() {c();}
function c() {d();}
...
i've used 5 such levels (you know what i mean, right?)
Your function depth is not limited by the processor, but by the size of your stack. Behind the scenes, calls to C++ functions usually translate to call instructions in x86, which push four (or eight, for x64 programs) bytes onto your program's stack for the return pointer. Sometimes calls are optimized and don't touch the stack at all. Functions might also push additional bytes (e.g. local function state) onto the stack.
To get the exact number of functions you can call, you need to disassemble your code to calculate the number of bytes each function pushes to the stack (minimum four/eight because of the return address, but likely many more), then find the maximum stack size and divide it by the function frame size.
call and ret instructions aren't special; the CPU doesn't know how deeply nested it is. (And doesn't know the difference between calling another function or being recursive.) As described in my answer here, functions are a high-level concept that asm gives you the tools to implement.
All the CPU knows is whether pushing a return address causes a page fault or not, if you run out of stack space. (A "stack overflow"). Usually the result of recursion going too deep, or a huge array in automatic storage (a C++ local var) or alloca.
(As mentioned in comments, not every C function call results in a call instruction; inlining can fully optimize away the fact that it's a separate function. You want small functions to inline, and the design of the template classes in the C++ standard library depends on this for efficiency. Also, a tailcall can be just a jmp, having the next function take over this stack space instead of getting new space.)
Linux typically uses 8 MiB stacks for user-space processes. (ulimit -s). Windows is typically 1 MiB. Inside a kernel, you often use smaller stacks. e.g. Linux kernel thread-stacks are currently 16 kiB for x86-64. Previously as small as 4 kiB (one page) for 32-bit x86, but some code has more local vars that take up stack space.
Related: How does the stack work in assembly language?. When ret executes, it just pops into the program counter. It's up to the programmer (or compiler) to create asm that runs ret when the stack pointer is pointing at somewhere you want to jump. (Typically a return address.)
In modern Linux systems, the smallest stack frame is 16 bytes, because the ABI specifies maintaining 16-byte stack alignment before a call. So best case you can have a call depth of 512k before you overflow the stack. (Unless you're in a thread that was started with an extra large thread-stack).
If you're in 32-bit mode with an old version of the i386 System V ABI that only required 4-byte stack alignment (like gcc -mpreferred-stack-boundary=2 instead of the default 4), with functions that just called without using any other stack space, 8 MiB of stack would give you 8 MiB / 4B = 2 Mi call depth.
In real life, some of that 8MiB space on the main thread's stack is used up by env vars and argv[] already on the stack at the process entry point (copied there by the kernel), and then a bit more as _start calls a function that calls main.
To make this able to actually return at the end, instead of just eventually faulting, you'd need either a huge chain, or some recursion with a termination condition, like
void recurse(int n) {
if (n == 1)
return;
recurse(n - 1);
}
and compile with some optimization but not enough to get the compiler to turn it into a do{}while(--n); loop or optimize away.
If you wanted all different functions, that's ok, the code-size limit is at least 2GiB, and call rel32 / ret takes a total of 6 bytes. (Un-optimized code would default to push ebp or push rbp as well, so you'd have to avoid that for 32-bit code if you wanted to meet that 4-byte stack-frame goal of having all your stack space full with just return addresses).
For example with GCC (see How to remove "noise" from GCC/clang assembly output? for the __attribute__((noipa)) option)
__attribute__((noipa)) // don't let other functions even notice that this is empty, let alone inline it
void foo(void){}
__attribute__((noipa))
void bar(){
foo();
}
void baz(){
bar();
}
compiles with GCC (Godbolt compiler explorer) to this 32-bit asm which just calls without using any stack space for anything else:
## GCC11.2 -O1 -m32 -mpreferred-stack-boundary=2
foo():
ret
bar():
call foo()
ret
baz():
call bar()
ret
g++ -O2 would optimize these call/ret functions into a tailcall like jmp foo, which doesn't push a return address. That's why I only used -O1 or -Og. Of course in real life you do want things to inline and optimize away; defeating that is just for this silly computer trick to achieve the longest finite call depth that actually would crash if you made it longer.
You can repeat this pattern indefinitely; GCC allows long symbol names, so it's not a problem to have many different unique function names. You can split across multiple files, with just a prototype for one of the functions in the other file.
If you reduce the -falign-functions tuning setting, you can probably get it down to either 6 bytes per function (no padding) or 8 (align by 8 thus 2 bytes of padding), down from the default of aligning each function label by 16, wasting 10 bytes per function.
Recursive:
I got GCC to make asm that recurses with no gaps between return address:
void recurse(int n){
if (!--n)
return;
recurse(n);
}
## G++11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
sub eax, 1 # --n
jne .L6 # skip over the ret if the result is non-zero
.L4:
ret
.L6:
call recurse(int) # push a 4-byte ret addr and jump to top
jmp .L4 # silly compiler, should put another ret here. But we told it not to optimize too much
Note the -mregparm=3 option, so the first up-to-3 args are passed in registers, instead of on the stack in the inefficient i386 System V calling convention. (It was designed a long time ago; the x86-64 SysV calling convention is much better.)
With -O2, this function optimizes away to just a ret. (It turns the call into a tailcall which becomes a loop, and then it can see it's a non-infinite loop with no side effects so it just removes it.)
Of course in real life you want optimizations like this. Recursion in asm sucks compared to loops. If you're worried about robust code and call depth, don't write recursive functions in the first place, unless you know recursion depth will be shallow. If a debug build doesn't convert your recursion to iteration, you don't want your kernel to crash.
Just for fun, I got GCC to tighten up the asm even at -O1 by convincing it to do the conditional branch over the call, to only have one ret instruction in the function so tail-duplication wouldn't be relevant anyway. And it means the fast path (recursion) involves a not-taken macro-fused conditional branch plus a call.
void recurse(int n){
if (__builtin_expect(!!--n, 1))
recurse(n);
}
Godbolt
## GCC 11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
sub eax, 1
je .L1
call recurse(int)
.L1:
ret

Does any of current C++ compilers ever emit "rep movsb/w/d"?

This question made me wonder, if current modern compilers ever emit REP MOVSB/W/D instruction.
Based on this discussion, it seems that using REP MOVSB/W/D could be beneficial on current CPUs.
But no matter how I tried, I cannot made any of the current compilers (GCC 8, Clang 7, MSVC 2017 and ICC 18) to emit this instruction.
For this simple code, it could be reasonable to emit REP MOVSB:
void fn(char *dst, const char *src, int l) {
for (int i=0; i<l; i++) {
dst[i] = src[i];
}
}
But compilers emit a non-optimized simple byte-copy loop, or a huge unrolled loop (basically an inlined memmove). Do any of the compilers use this instruction?
GCC has x86 tuning options to control string-ops strategy and when to inline vs. library call. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). -mmemcpy-strategy=strategy
takes alg:max_size:dest_align triplets, but the brute-force way is -mstringop-strategy=rep_byte
I had to use __restrict to get gcc to recognize the memcpy pattern, instead of just doing normal auto-vectorization after an overlap check / fallback to a dumb byte loop. (Fun fact: gcc -O3 auto-vectorizes even with -mno-sse, using the full width of an integer register. So you only get a dumb byte loop if you compile with -Os (optimize for size) or -O2 (less than full optimization)).
Note that if src and dst overlap with dst > src, the result is not memmove. Instead, you'll get a repeating pattern with length = dst-src. rep movsb has to correctly implement the exact byte-copy semantics even in case of overlap, so it would still be valid (but slow on current CPUs: I think microcode would just fall back to a byte loop).
gcc only gets to rep movsb via recognizing a memcpy pattern and then choosing to inline memcpy as rep movsb. It doesn't go directly from byte-copy loop to rep movsb, and that's why possible aliasing defeats the optimization. (It might be interesting for -Os to consider using rep movs directly, though, when alias analysis can't prove it's a memcpy or memmove, on CPUs with fast rep movsb.)
void fn(char *__restrict dst, const char *__restrict src, int l) {
for (int i=0; i<l; i++) {
dst[i] = src[i];
}
}
This probably shouldn't "count" because I would probably not recommend those tuning options for any use-case other than "make the compiler use rep movs", so it's not that different from an intrinsic. I didn't check all the -mtune=silvermont / -mtune=skylake / -mtune=bdver2 (Bulldozer version 2 = Piledriver) / etc. tuning options, but I doubt any of them enable that. So this is an unrealistic test because nobody using -march=native would get this code-gen.
But the above C compiles with gcc8.1 -xc -O3 -Wall -mstringop-strategy=rep_byte -minline-all-stringops on the Godbolt compiler explorer to this asm for x86-64 System V:
fn:
test edx, edx
jle .L1 # rep movs treats the counter as unsigned, but the source uses signed
sub edx, 1 # what the heck, gcc? mov ecx,edx would be too easy?
lea ecx, [rdx+1]
rep movsb # dst=rdi and src=rsi
.L1: # matching the calling convention
ret
Fun fact: the x86-64 SysV calling convention being optimized for inlining rep movs is not a coincidence (Why does Windows64 use a different calling convention from all other OSes on x86-64?). I think gcc favoured that when the calling convention was being designed, so it saved instructions.
rep_8byte does a bunch of setup to handle counts that aren't a multiple of 8, and maybe alignment, I didn't look carefully.
I also didn't check other compilers.
Inlining rep movsb would be a poor choice without an alignment guarantee, so it's good that compilers don't do it by default. (As long as they do something better.) Intel's optimization manual has a section on memcpy and memset with SIMD vectors vs. rep movs. See also http://agner.org/optimize/, and other performance links in the x86 tag wiki.
(I doubt that gcc would do anything differently if you did dst=__builtin_assume_aligned(dst, 64); or any other way of communicating alignment to the compiler, though. e.g. alignas(64) on some arrays.)
Intel's IceLake microarchitecture will have a "short rep" feature that presumably reduces startup overhead for rep movs / rep stos, making them much more useful for small counts. (Currently rep string microcode has significant startup overhead: What setup does REP do?)
memmove / memcpy strategies:
BTW, glibc's memcpy uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example.
Glibc's asm source has a nice comment describing the strategy: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19.
For large inputs, it uses SSE XMM registers, AVX YMM registers, or rep movsb (after checking an internal config variable that's set based on CPU-detection when glibc initializes itself). I'm not sure which CPUs it will actually use rep movsb on, if any, but support is there for using it for large copies.
rep movsb might well be a pretty reasonable choice for small code-size and non-terrible scaling with count for a byte loop like this, with safe handling for the unlikely case of overlap.
Microcode startup overhead is a big problem with using it for copies that are usually small, though, on current CPUs.
It's probably better than a byte loop if the average copy size is maybe 8 to 16 bytes on current CPUs, and/or different counts cause branch mispredicts a lot. It's not good, but it's less bad.
Some kind of last-ditch peephole optimization for turning a byte-loop into a rep movsb might be a good idea, if compiling without auto-vectorization. (Or for compilers like MSVC that make a byte loop even at full optimization.)
It would be neat if compilers knew about it more directly, and considered using it for -Os (optimize for code-size more than speed) when tuning for CPUs with the Enhanced Rep Movs/Stos Byte (ERMSB) feature. (See also Enhanced REP MOVSB for memcpy for lots of good stuff about x86 memory bandwidth single threaded vs. all cores, NT stores that avoid RFO, and rep movs using an RFO-avoiding cache protocol...).
On older CPUs, rep movsb wasn't as good for large copies, so the recommended strategy was rep movsd or movsq with special handling for the last few counts. (Assuming you're going to use rep movs at all, e.g. in kernel code where you can't touch SIMD vector registers.)
The -mno-sse auto-vectorization using integer registers is much worse than rep movs for medium sized copies that are hot in L1d or L2 cache, so gcc should definitely use rep movsb or rep movsq after checking for overlap, not a qword copy loop, unless it expects small inputs (like 64 bytes) to be common.
The only advantage of a byte loop is small code size; it's pretty much the bottom of the barrel; a smart strategy like glibc's would be much better for small but unknown copy sizes. But that's too much code to inline, and a function call does have some cost (spilling call-clobbered registers and clobbering the red zone, plus the actual cost of the call / ret instructions and dynamic linking indirection).
Especially in a "cold" function that doesn't run often (so you don't want to spend a lot of code size on it, increasing your program's I-cache footprint, TLB locality, pages to be loaded from disk, etc). If writing asm by hand, you'd usually know more about the expected size distribution and be able to inline a fast-path with a fallback to something else.
Remember that compilers will make their decisions on potentially many loops in one program, and most code in most programs is outside of hot loops. It shouldn't bloat them all. This is why gcc defaults to -fno-unroll-loops unless profile-guided optimization is enabled. (Auto-vectorization is enabled at -O3, though, and can create a huge amount of code for some small loops like this one. It's quite silly that gcc spends huge amounts of code-size on loop prologues/epilogues, but tiny amounts on the actual loop; for all it knows the loop will run millions of iterations for each one time the code outside runs.)
Unfortunately it's not like gcc's auto-vectorized code is very efficient or compact. It spends a lot of code size on the loop cleanup code for the 16-byte SSE case (fully unrolling 15 byte-copies). With 32-byte AVX vectors, we get a rolled-up byte loop to handle the leftover elements. (For a 17 byte copy, this is pretty terrible vs. 1 XMM vector + 1 byte or glibc style overlapping 16-byte copies). With gcc7 and earlier, it does the same full unrolling until an alignment boundary as a loop prologue so it's twice as bloated.
IDK if profile-guided optimization would optimize gcc's strategy here, e.g. favouring smaller / simpler code when the count is small on every call, so auto-vectorized code wouldn't be reached. Or change strategy if the code is "cold" and only runs once or not at all per run of the whole program. Or if the count is usually 16 or 24 or something, then scalar for the last n % 32 bytes is terrible so ideally PGO would get it to special case smaller counts. (But I'm not too optimistic.)
I might report a GCC missed-optimization bug for this, about detecting memcpy after an overlap check instead of leaving it purely up to the auto-vectorizer. And/or about using rep movs for -Os, maybe with -mtune=icelake if more info becomes available about that uarch.
A lot of software gets compiled with only -O2, so a peephole for rep movs other than the auto-vectorizer could make a difference. (But the question is whether it's a positive or negative difference)!

Profiling _mm_setzero_ps and {0.0f,0.0f,0.0f,0.0f}

EDIT: As Cody Gray pointed out in his comment, profiling with disabled optimization is complete waste of time. How then should i approach this test?
Microsoft in its XMVectorZero in case if defined _XM_SSE_INTRINSICS_ uses _mm_setzero_ps and {0.0f,0.0f,0.0f,0.0f} if don't. I decided to check how big is the win. So i used the following program in Release x86 and Configuration Properties>C/C++>Optimization>Optimization set to Disabled (/Od).
constexpr __int64 loops = 1e9;
inline void fooSSE() {
for (__int64 i = 0; i < loops; ++i) {
XMVECTOR zero1 = _mm_setzero_ps();
//XMVECTOR zero2 = _mm_setzero_ps();
//XMVECTOR zero3 = _mm_setzero_ps();
//XMVECTOR zero4 = _mm_setzero_ps();
}
}
inline void fooNoIntrinsic() {
for (__int64 i = 0; i < loops; ++i) {
XMVECTOR zero1 = { 0.f,0.f,0.f,0.f };
//XMVECTOR zero2 = { 0.f,0.f,0.f,0.f };
//XMVECTOR zero3 = { 0.f,0.f,0.f,0.f };
//XMVECTOR zero4 = { 0.f,0.f,0.f,0.f };
}
}
int main() {
fooNoIntrinsic();
fooSSE();
}
I ran the program twice first with only zero1 and second time with all lines uncommented. In the first case intrinsic loses, in the second intrinsic is clear winner. So, my questions are:
Why intrinsic does not always win?
Does the profiler i used is a proper tool for such measurements?
Profiling things with optimization disabled gives you meaningless results and is a complete waste of time. If you are disabling optimization because otherwise the optimizer notices that your benchmark actually does nothing useful and is removing it entirely, then welcome to the difficulties of microbenchmarking!
It is often very difficult to concoct a test case that actually does enough real work that it will not be removed by a sufficiently smart optimizer, yet the cost of that work does not overwhelm and render meaningless your results. For example, a lot of people's first instinct is to print out the incremental results using something like printf, but that's a non-starter because printf is incredibly slow and will absolutely ruin your benchmark. Making the variable that collects the intermediate values as volatile will sometimes work because it effectively disables load/store optimizations for that particular variable. Although this relies on ill-defined semantics, that's not important for a benchmark. Another option is to perform some pointless yet relatively cheap operation on the intermediate results, like add them together. This relies on the optimizer not outsmarting you, and in order to verify that your benchmark results are meaningful, you'll have to examine the object code emitted by the compiler and ensure that the code is actually doing the thing. There is no magic bullet for crafting a microbenchmark, unfortunately.
The best trick is usually to isolate the relevant portion of the code inside of a function, parameterize it on one or more unpredictable input values, arrange for the result to be returned, and then put this function in an external module such that the optimizer can't get its grubby paws on it.
Since you'll need to look at the disassembly anyway to confirm that your microbenchmark case is suitable, this is often a good place to start. If you are sufficiently competent in reading assembly language, and you have sufficiently distilled the code in question, this may even be enough for you to make a judgment about the efficiency of the code. If you can't make heads or tails of the code, then it is probably sufficiently complicated that you can go ahead and benchmark it.
This is a good example of when a cursory examination of the generated object code is sufficient to answer the question without even needing to craft a benchmark.
Following my advice above, let's write a simple function to test out the intrinsic. In this case, we don't have any input to parameterize upon because the code literally just sets a register to 0. So let's just return the zeroed structure from the function:
DirectX::XMVECTOR ZeroTest_Intrinsic()
{
return _mm_setzero_ps();
}
And here is the other candidate that performs the initialization the seemingly-naïve way:
DirectX::XMVECTOR ZeroTest_Naive()
{
return { 0.0f, 0.0f, 0.0f, 0.0f };
}
Here is the object code generated by the compiler for these two functions (it doesn't matter which version, whether you compile for x86-32 or x86-64, or whether you optimize for size or speed; the results are the same):
ZeroTest_Intrinsic
xorps xmm0, xmm0
ret
ZeroTest_Naive
xorps xmm0, xmm0
ret
(If AVX or AVX2 instructions are supported, then these will both be vxorps xmm0, xmm0, xmm0.)
That is pretty obvious, even to someone who cannot read assembly code. They are both identical! I'd say that pretty definitively answers the question of which one will be faster: they will be identical because the optimizer recognizes the seemingly-naïve initializer and translates it into a single, optimized assembly-language instruction for clearing a register.
Now, it is certainly possible that there are cases where this is embedded deep within various complicated code constructs, preventing the optimizer from recognizing it and performing its magic. In other words, the "your test function is too simple!" objection. And that is most likely why the library's implementer chose to explicitly use the intrinsic whenever it is available. Its use guarantees that the code-gen will emit the desired instruction, and therefore the code will be as optimized as possible.
Another possible benefit of explicitly using the intrinsic is to ensure that you get the desired instruction, even if the code is being compiled without SSE/SSE2 support. This isn't a particularly compelling use-case, as I imagine it, because you wouldn't be compiling without SSE/SSE2 support if it was acceptable to be using these instructions. And if you were explicitly trying to disable the generation of SSE/SSE2 instructions so that you could run on legacy systems, the intrinsic would ruin your day because it would force an xorps instruction to be emitted, and the legacy system would throw an invalid operation exception immediately upon hitting this instruction.
I did see one interesting case, though. xorps is the single-precision version of this instruction, and requires only SSE support. However, if I compile the functions shown above with only SSE support (no SSE2), I get the following:
ZeroTest_Intrinsic
xorps xmm0, xmm0
ret
ZeroTest_Naive
push ebp
mov ebp, esp
and esp, -16
sub esp, 16
mov DWORD PTR [esp], 0
mov DWORD PTR [esp+4], 0
mov DWORD PTR [esp+8], 0
mov DWORD PTR [esp+12], 0
movaps xmm0, XMMWORD PTR [esp]
mov esp, ebp
pop ebp
ret
Clearly, for some reason, the optimizer is unable to apply the optimization to the use of the initializer when SSE2 instruction support is not available, even though the xorps instruction that it would be using does not require SSE2 instruction support! This is arguably a bug in the optimizer, but explicit use of the intrinsic works around it.

Could this alternative way to loop be more effcient?

I was bored one rainy afternoon and came up with this:
int ia_array[5][5][5]; //interger array called array
{
int i = 0, j = 0, k = 0;//counters
while( i < 5 )//loop conditions
{
ia_array[i][j][k] = 0;//do something
__asm inc k;//++k;
if( k > 4)
{
__asm inc j; //++j;
__asm mov k,0;///k = 0;
}
if( j > 4)
{
__asm inc i; //++i;
__asm mov j,0;//j = 0;
}
}//end of while
}//i,j,k fall out of scope
its functionally equivalent to three nested for loops. However in a for loop you cannot use __asm statements. Also you have the option to not put the counters in a scope so you can reuse them for other loops. I have looked at the disassembly for both and my alternative has 15 opcodes and the nested for loops have 24. Therefore is it potentially faster? suppose I'm really asking is __asm inc i; faster then ++i;?
note: i don't intent to use this code in any projects, just out of curiosity. thanks for your time.
First off, your compiler will likely store the values of i, j and k in registers.
It's more efficient to do for (i = 4; i <=0; i--) than for(i = 0; i < 5; i++) as the cpu can determine if the result of the last operation it executed was zero for free - it doesn't have to explicitly compare to 4 (see the cmovz instruction).
It's the not the case for x86 that having to execute less instruction will lead to faster code. There are all sorts of issues to do with instruction pipelining that quickly get too much for a programmer to write by hand. Leave it to the compiler, they're sufficiently efficient these days (though definitely not optimal... but who wants to wait hours for their code to compile).
You can check it out yourself by running your function a few hundred thousand times with each implementation and check which is faster. Check if you can write asm instructions in for loops with
__asm {
inc j;
mov k, 0;
}
(it's been a while since I did this)
P.S. Have fun experimenting with asm, it can be very interesting and rewarding!
No, it won't be even remotely faster. Infact, it could quite easily be slower. Your compiler's optimizer is almost certainly more effective at this than you are.
This is going to be very compiler and compiler switch specific, but your code will have three tests per loop iteration where a traditional nested loop would only have one per inner-most loop iteration, so I think your approach would tend to be slower in general.
Several things:
You can't judge the speed of assembly code based on the number of opcodes in the output. Compilers can unroll loops to eliminate branches, and many modern compilers will attempt to vectorize a loop like the one above. The former could have more opcodes than naive code and be faster, and the latter could have fewer and be faster.
By putting __asm statements in your code, you're probably precluding any optimizations the compiler could do on the loop. So if you compiled this with something really fast like, say, the Intel compilers, then you will likely get worse performance with your code than with the compiler. This is especially true for something as simple as your code here, where the array sizes are known statically and the loop bounds are constant.
If you really want to get a sense of what compilers can/can't do, grab a book or take a course on optimizing compilers and vectorization. There are tons of different optimizations and understanding the performance of even a simple piece of code like this on a particular architecture can be subtle.
There are plenty of kernels and number crunching codes where compilers still can't do better than knowledgable humans, but without a lot of experience with architecture details you're not going to do much better than icc -fast or xlC -O5.
While it certainly is possible to beat a compiler at optimization, you're not going to do it this way. The bits you've written in assembly language are pretty obvious, mechanical types of translations that any half-way decent compiler (or even a pretty lousy one) can do easily.
If you want to beat the compiler, you need to go a lot further, such as rearranging instructions to allow more to execute in parallel (decidedly non-trivial) or finding a better sequence of instructions than the compiler can.
In this case, for example, you might at least stand a chance by noting that iarray[5][5][5] can (from an assembly language viewpoint) be treated as a single, flat array of 5*5*5 = 125 elements, and encode most of what's essentially a memset into a single instruction:
mov ecx, 125 // 125 elements
xor eax, eax // set them to zero
mov di, offset ia_array // where we're going to store them
rep stosd // and fill that memory.
Realistically, however, this probably isn't going to be a major (or probably even minor) improvement over what the compiler is likely to generate. It's more likely close to the minimum necessary to (at least nearly) keep up.
The next step would be to consider using non-temporal stores instead of a simple stosd. This won't actually speed up this loop (much, anyway), but it might gain some speed overall by avoiding this store polluting the cache if it's possible that other code already in the cache is more important immediately. You could also use some of the other SSE instructions to gain a little speed -- but even at best, you can't expect much better than a couple of percent out of this. The bottom line is that for zeroing some memory, the speed is limited primarily by the bus speed, not the instructions you use, so nothing you do is likely to help much.

What C++ code compiles down to the x86 REP instruction?

I'm copying elements from one array to another in C++. I found the rep movs instruction in x86 that seems to copy an array at ESI to an array at EDI of size ECX. However, neither the for nor while loops I tried compiled to a rep movs instruction in VS 2008 (on an Intel Xeon x64 processor). How can I write code that will get compiled to this instruction?
Honestly, you shouldn't. REP is sort of an obsolete holdover in the instruction set, and actually pretty slow since it has to call a microcoded subroutine inside the CPU, which has a ROM lookup latency and is nonpipelined as well.
In almost every implementation, you will find that the memcpy() compiler intrinsic both is easier to use and runs faster.
Under MSVC there are the __movsxxx & __stosxxx intrinsics that will generate a REP prefixed instruction.
there is also a 'hack' to force intrinsic memset aka REP STOS under vc9+, as the intrinsic no longer exits, due to the sse2 branching in the crt. this is better that __stosxxx due to the fact the compiler can optimize it for constants and order it correctly.
#define memset(mem,fill,size) memset((DWORD*)mem,((fill) << 24|(fill) << 16|(fill) << 8|(fill)),size)
__forceinline void memset(DWORD* pStart, unsigned long dwFill, size_t nSize)
{
//credits to Nepharius for finding this
DWORD* pLast = pStart + (nSize >> 2);
while(pStart < pLast)
*pStart++ = dwFill;
if((nSize &= 3) == 0)
return;
if(nSize == 3)
{
(((WORD*)pStart))[0] = WORD(dwFill);
(((BYTE*)pStart))[2] = BYTE(dwFill);
}
else if(nSize == 2)
(((WORD*)pStart))[0] = WORD(dwFill);
else
(((BYTE*)pStart))[0] = BYTE(dwFill);
}
of course REP isn't always the best thing to use, imo your way better off using memcpy, it'll branch to either sse2 or REPS MOV based on your system (under msvc), unless you feeling like writing custom assembly for 'hot' areas...
If you need exactly that instruction - use built-in assembler and write that instruction manually. You can't rely on the compiler to produce any specific machine code - even if it emits it in one compilation it can decide to emit some other equivalent during next compilation.
REP and friends was nice once upon a time, when the x86 CPU was a single-pipeline industrial CISC-processor.
But that has changed. Nowadays when the processor encounters any instruction, the first it does is translating it into an easier format (VLIW-like micro-ops) and schedules it for future execution (this is part of out-of-order-execution, part of scheduling between different logical CPU cores, it can be used to simplifying write-after-write-sequences into single-writes, et.c.). This machinery works well for instructions that translates into a few VLIW-like opcodes, but not machine-code that translates into loops. Loop-translated machine code will probably cause the execution pipeline to stall.
Rather than spending hundreds of thousands of transistors into building CPU-circuitry for handling looping portions of the micro-ops in the execution pipeline, they just handle it in some sort of crappy legacy-mode that stutterly stalls the pipeline, and ask modern programmers to write your own damn loops!
Therefore it is seldom used when machines write code. If you encounter REP in a binary executable, its probably a human assembly-muppet who didn't know better, or a cracker that really needed the few bytes it saved to use it instead of an actual loop, that wrote it.
(However. Take everything I just wrote with a grain of salt. Maybe this is not true anymore. I am not 100% up to date with the internals of x86 CPUs anymore, I got into other hobbies..)
I use the rep* prefix variants with cmps*, movs*, scas* and stos* instruction variants to generate inline code which minimizes the code size, avoids unnecessary calls/jumps and thereby keeps down the work done by the caches. The alternative is to set up parameters and call a memset or memcpy somewhere else which may overall be faster if I want to copy a hundred bytes or more but if it's just a matter of 10-20 bytes using rep is faster (or at least was the last time I measured).
Since my compiler allows specification and use of inline assembly functions and includes their register usage/modification in the optimization activities it is possible for me to use them when the circumstances are right.
On a historic note - not having any insight into the manufacturer's strategies - there was a time when the "rep movs*" (etc) instructions were very slow. I think it was around the time of the Pentium/Pentium MMX. A colleague of mine (who had more insight than I) said that the manufacturers had decreased the chip area (<=> fewer transistors/more microcode) allocated to the rep handling and used it to make other, more used instructions faster.
In the fifteen years or so since rep has become relatively speaking faster again which would suggest more transistors/less microcode.