Inline assembler causes freezing inside of another function [duplicate]

This question already has answers here:
Is inline assembly language slower than native C++ code?
(21 answers)
Closed 5 years ago.
I've noticed that my inline assembly code is either incredibly slow or hangs, whereas the equivalent C++ code finishes very quickly. I'm curious why this happens when I put the inline assembler in a separate function and call it, as opposed to writing the assembler directly at the call site. I tested both ways and found that my program did not freeze when I omitted the function.
__asm {
push dword ptr[rw] //rw is a C++ floating-point variable
fld[esp] // Using the stack as temporary storage in order to insert it into the FPU
add esp, 4 //preserving the memory
push dword ptr[lwB]
fld[esp]
add esp, 4
fsubp ST(1), ST(0) // Subtracting rw - lwB
push dword ptr[sp]
fld[esp]
add esp, 4
fdivp ST(1), ST(0) // Dividing previous resultant by span -> (rw - lwB) / sp
push dword ptr[dimen]
fld[esp]
add esp, 4
fmulp ST(1), ST(0) // Multiplying previous resultant by dimension > ((rw - lwB) / (sp)* dimen)
sub esp, 4 // Allocating space in order to save result temporarily to ram then to eax then to a C++ variable
fstp[esp]
pop eax
mov fCord, eax
}
return (int)fCord; //fCord is also a floating-point C++ variable
The much faster C++ Code:
return (int)(((rw - lwB) / (sp)* dimen));

Today's compilers are far more advanced than they used to be: they can optimize for branch prediction, reduce memory operations, and so on, often better than hand-coded assembly. This does not mean hand-coded assembly is always bad, but in most cases the compiler can do an equally good or better job of optimization when configured with the right flags. In your case, you have used a lot of stack operations, and every one of them leads to a memory load/store, which is expensive in terms of CPU cycles. This could be the reason for the slower performance. Look at the disassembly of your C++ implementation when compiled in release mode to compare your hand-coded assembly with the compiler-generated output.
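To compare against the hand-written version, the C++ expression from the question can be isolated in a small function and its release-mode disassembly inspected. This is only a sketch: the function name to_coord and the non-zero-span precondition are assumptions, while the parameter names follow the question.

```cpp
#include <cassert>

// Maps rw from the range that starts at lwB and has width sp onto dimen,
// then truncates toward zero, as the (int) cast in the question does.
int to_coord(float rw, float lwB, float sp, float dimen) {
    assert(sp != 0.0f);  // assumed precondition: the span is non-zero
    return static_cast<int>((rw - lwB) / sp * dimen);
}
```

Keeping the arithmetic in plain C++ leaves the compiler free to pick registers and instructions itself, which is the point the answer makes.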

Thanks all, I had a pretty strange issue but it might just be common. I had an inline assembler in a different function and called upon it for calculations. After moving this function into where it was called instead, I have fixed the issue. I'm sure there is a bigger lesson at hand though.
Obviously, the code is inefficient and the comments/answers are helpful in general, although my problem was a bit different.
For anybody wondering, here is the optimal assembly code that the compiler built:
float finCord;
__asm {
movss xmm0, dword ptr[rw]
subss xmm0, dword ptr[lwB]
divss xmm0, dword ptr[sp]
mulss xmm0, dword ptr[dimen]
movss dword ptr[finCord], xmm0 // store into finCord, the variable declared above and read below
}
int answer = (int)finCord;

Related

Strange uses of movzx by Clang and GCC

I know that movzx can be used for dependency breaking, but I stumbled on some movzx uses by both Clang and GCC that I really can't see what good they are for. Here's a simple example I tried on Godbolt compiler explorer:
#include <stdint.h>
int add2bytes(uint8_t* a, uint8_t* b) {
return uint8_t(*a + *b);
}
with GCC 12 -O3:
add2bytes(unsigned char*, unsigned char*):
movzx eax, BYTE PTR [rsi]
add al, BYTE PTR [rdi]
movzx eax, al
ret
If I understand correctly, the first movzx here breaks dependency on previous eax value, but what is the second movzx doing? I don't think there's any dependency it can break, and it shouldn't affect the result either.
with clang 14 -O3, it's even more weird:
add2bytes(unsigned char*, unsigned char*): # #add2bytes(unsigned char*, unsigned char*)
mov al, byte ptr [rsi]
add al, byte ptr [rdi]
movzx eax, al
ret
It uses mov where movzx seems more reasonable, and then zero extends al to eax, but wouldn't it be much better to do movzx at the start?
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1
GCC generates both sensible and strange movzx, and Clang's use of mov r8 m and movzx just makes no sense to me. I also tried adding -march=skylake to make sure this isn't a feature for really old architectures, but the generated assembly looks more or less the same.
The closest post I have found is https://stackoverflow.com/a/64915219/14730360 where they showed similar movzx uses that seem useless and/or out of place.
Do the compilers really use movzx poorly here, or am I missing something?
Edit: I have opened bug reports for Clang and GCC:
https://github.com/llvm/llvm-project/issues/56498
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277
Temporary workarounds using inline assembly:
https://godbolt.org/z/7qob8G3j7
#define addb(a, b) asm (\
"addb %1, %b0"\
: "+r"(a) : "mi"(b))
int add2bytes(uint8_t* a, uint8_t* b) {
int ret = *a;
addb(ret, *b);
return ret;
}
Now Clang -O3 produces:
add2bytes(unsigned char*, unsigned char*): # #add2bytes(unsigned char*, unsigned char*)
movzx eax, byte ptr [rdi]
add al, byte ptr [rsi]
ret

Both compilers are doing a poor job here, but clang's code is especially bad and has no real upside anywhere. And an easily avoidable downside on everything except Intel CPUs a decade old (which rename low-8 partial registers).
The optimal asm is what you suggest, movzx load, then byte add, leaving a uint8_t result in the low byte, correctly zero-extended to int as required by the C semantics. (Thanks for reporting it upstream: https://github.com/llvm/llvm-project/issues/56498 - I commented there about movzx being a good idea for byte loads in general, even when LLVM doesn't need the result zero-extended.)
A movzx is necessary somewhere, but it can be in the initial load. (A movzx is generally a good idea for a byte load anyway, to avoid a false dependency on the old RAX; clang's choice to save 1 byte is probably not a good one even when it doesn't end up needing a separate movzx right after.)
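The C semantics that force a zero-extension somewhere can be checked directly in C++. The sketch below restates the function from the question; the wrapper add2bytes_values is a hypothetical helper added only to make it easy to call with plain values.

```cpp
#include <cassert>
#include <cstdint>

int add2bytes(const std::uint8_t* a, const std::uint8_t* b) {
    assert(a != nullptr && b != nullptr);
    // *a + *b is computed as int (integer promotion). The cast wraps the sum
    // modulo 256, and returning it as int zero-extends the byte again.
    return static_cast<std::uint8_t>(*a + *b);
}

// Hypothetical convenience wrapper: takes the two bytes by value.
int add2bytes_values(std::uint8_t a, std::uint8_t b) {
    return add2bytes(&a, &b);
}
```

The result always lies in 0..255, so the upper 24 bits of the returned int must be zero regardless of which instruction does the extending.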
There are basically three relevant behaviours here, among x86-64 CPUs.
Core 2 / Nehalem (the 64-bit capable members of the P6 family): AL renamed separately from RAX if you write AL. A later read of EAX will stall the front-end for about 3 cycles while inserting a merge uop. Less bad than earlier P6-family, but still a significant penalty to avoid. But these CPUs are pretty obsolete, and not something GCC's -mtune=generic should put much weight on for the latest GCC. (Especially given that current nightly GCC's behaviour probably won't get baked into widely used binary packages for another year or more, by most stable-release distros.)
Returning an int when the last instruction wrote al will likely lead to a penalty when the caller reads EAX. But mov al, [rdi] can run without any false dependency or merging cost.
Sandybridge and maybe Ivy Bridge: AL still renamed separately, but a merging uop can be inserted without any stalling, in a cycle with other uops.
mov al, [rdi] still has no false dep or merging uop. But a later read of EAX that triggers a merging uop (to merge the add al result with the high bytes of RAX from movzx eax, [rdi]) will get inserted just as cheaply as if we'd put a movzx eax, al in the machine code. (If the upper bytes of RAX are all zero, merge or extend are equivalent.)
Haswell and later (and maybe IvB), and all other x86 vendors, and low-power CPUs from Intel like Silvermont-family: no partial register renaming at all. (Except for AH/BH/CH/DH on Intel SnB-family). The last CPU not in this category is nearly a decade old, and the last CPU with major penalties (P6-family) is over a decade old.
mov al, [rdi] sucks: false dependency and costs an ALU uop in the back-end to merge. So it's extra load latency in the critical path through whatever stored the memory operand.
Reading EAX after writing AL has zero penalty; that's not a special case at all; the merging happened when you wrote AL.
GCC's code is a sensible tradeoff between Core2 / Nehalem vs. modern CPUs: load with movzx to avoid a false dep writing a partial reg. And a final movzx to avoid a partial-register stall in the caller.
But if it's going to do that, it could hurt modern Intel less by picking EDX or ECX as the temporary, since Intel can do zero-latency mov-elimination on movzx r32, r8, but not within the same register. It still costs a front-end uop so it's not free for throughput, only latency and back-end ports. This is a persistent missed-optimization; I don't think GCC or clang know to look for that; they commonly zero-extend 32->64 with mov esi,esi on a function arg, for example.
movzx edx, byte ptr [rdi]
add dl, [rsi]
movzx eax, dl # mov-elimination possible on IvB and later (except Ice Lake with updated microcode which breaks mov-elim).
If optimizing specifically for Core2 / Nehalem, you'd do this:
xor eax, eax # off the critical path, avoids partial-reg stalls for later reads of EAX after writing AL
mov al, [rdi]
add al, [rsi]
That's not bad on later CPUs, although the mov al, [rdi] would still be a micro-fused load+ALU uop, so it has extra load latency and takes an extra slot in the scheduler and a cycle on a back-end execution port. So 3 back-end uops, up from 2 in IvB and later with eliminated movzx if you pick different registers.
GCC's choice to use movzx because of Core2/Nehalem is highly conservative at this point; probably -mtune=generic in GCC12 shouldn't care about P6-family partial-register stalls since those CPUs are well over a decade old. Especially in 64-bit code where the worst case is Core2/Nehalem, not the even longer stalls with no merging uop on earlier P6-family. (And 64-bit code is more likely to be run on newer CPUs; one of the use-cases for -m32 is to make code for old 32-bit-only CPUs.)
It might well be an intentional tuning choice that needs updating. It's definitely a missed optimization with -march / -mtune= k8 through znver3, or silvermont-family, or sandybridge or newer.
(Also note that some choices which should differ based on -mtune setting actually don't. GCC just has one way it always does some things, and adding hooks to make it differ based on a tuning flag hasn't been done. Clang is the same way. e.g. -mtune=core2 still doesn't know to avoid partial-register stalls!)
Clang normally lives dangerously writing partial registers and otherwise ignoring false dependencies when they're not visibly loop-carried within a single function (which can bite it in the ass). This can save a whole instruction when it skips xor-zeroing, but saving just 1 byte doesn't seem worth it in general. It's a false dependency and means the mov load decodes to load + ALU merge uops (to merge a new low byte into the existing 64-bit register).
Looks like clang just did its usual thing of loading 8-bit values into 8-bit registers ignoring movzx, then realized at the end it needed to zero-extend the result.
An optimization pass looking for a chance to fold zero-extension (after narrow math) into an earlier load would be useful. And/or otherwise look for ways to prove that values are already zero-extended, if it doesn't do that.
Probably in general better to start doing narrow loads with movzx so that's more normally the case.
You might want to report a missed-optimization bug, especially for clang. Their code-gen is already a huge middle finger toward P6-family most of the time with partial-register usage, so they'd probably be interested in trying to generate the 2-instruction version. https://github.com/llvm/llvm-project/issues
Also https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (use the keyword missed-optimization for GCC bugs. Feel free to link this stack overflow post, and/or quote any of my comments if you want, as well as a Godbolt link. GCC devs prefer AT&T syntax for x86 discussion / bugs.)
See also:
Why doesn't GCC use partial registers?
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
https://agner.org/optimize/ (especially his microarch guide re: partial-register details for P6-family CPUs. Last I looked, the guide incorrectly said Haswell doesn't have zero-latency movzx eax, dl, and that AH-merging was free; see my Q&A about HSW/SKL. But Agner's guide is accurate AFAIK for earlier CPUs.)
https://uops.info/ (front-end vs. back-end vs. latency costs for different instructions)
What is the best way to set a register to zero in x86 assembly: xor, mov or and? - including the part about avoiding partial-register stalls on P6, how xor eax,eax sets some kind of internal EAX=AL flag.
I have 2 more examples here: https://godbolt.org/z/z45xr4hq1 GCC generates both sensible and strange movzx, and Clang's use of mov and movzx just makes no sense to me.
clang's mov ecx, edx zero-extension from 32 to 64 instead of from 8 to 64 is because it depends on an unofficial extension to the x86-64 SysV calling convention, that narrow args are extended to 32-bit. AMD Zen CPUs can do mov-elimination on mov ecx, edx but not for movzx-byte, so this is actually more efficient, as well as saving code-size.
(GCC and clang both make callers that respect this unofficial calling-convention feature, but only clang makes callees that depend on it. ICC doesn't do either so is not ABI-compatible with clang.)
Extension to intptr_t is of course necessary for all narrower args if you're going to index an array with one. (In abstract C terms, this is just part of using the value for pointer math). High garbage is allowed in at least the high 32 bits of the 64-bit register.

The clang bit actually seems reasonable. You get a partial register stall if you write to al and then read from eax. Using movzx breaks this partial register stall.
The initial mov to al has no dependencies on existing values of eax (due to register renaming), so the dependencies are just the unavoidable dependencies (wait for [rsi], wait for [rdi], wait for add to complete before zero-extending).
In other words, the top 24 bits must be zeroed and the lower 8 bits must be calculated, but the two actions can be done in either order. clang just chooses to add first, zero later.
[EDIT]
As for GCC, it seems a particularly bad choice. If it had chosen bl as the temporary register, that last movzx would be zero-latency on Haswell/SkyLake, but move elimination does not work on al to eax.

The final MOVZX is mandated by the fact that the function returns an int, extended from a byte. It must be there in the clang version, but with gcc one is extra.

Why are x86-64 C/C++ compilers not generating more efficient assembly for this code?

Consider the following declaration of local variables:
bool a{false};
bool b{false};
bool c{false};
bool d{false};
bool e{false};
bool f{false};
bool g{false};
bool h{false};
in x86-64 architectures, I'd expect the optimizer to reduce the initialization of those variables to something like mov qword ptr [rsp], 0. But instead what I get with all the compilers (regardless of level of optimization) I've been able to try is some form of:
mov byte ptr [rsp + 7], 0
mov byte ptr [rsp + 6], 0
mov byte ptr [rsp + 5], 0
mov byte ptr [rsp + 4], 0
mov byte ptr [rsp + 3], 0
mov byte ptr [rsp + 2], 0
mov byte ptr [rsp + 1], 0
mov byte ptr [rsp], 0
Which seems like a waste of CPU cycles. Using copy-initialization, value-initialization or replacing braces with parentheses made no difference.
But wait, that's not all. Suppose that I have this instead:
struct
{
bool a{false};
bool b{false};
bool c{false};
bool d{false};
bool e{false};
bool f{false};
bool g{false};
bool h{false};
} bools;
Then the initialization of bools generates exactly what I'd expect: mov qword ptr [rsp], 0. What gives?
You can try the code above yourself in this Compiler Explorer link.
The behavior of the different compilers is so consistent that I am forced to think there is some reason for the above inefficiency, but I have not been able to find it. Do you know why?

Compilers are dumb, this is a missed-optimization. mov qword ptr [rsp], 0 would be optimal. Store forwarding from a qword store to a byte reload of any individual byte is efficient on modern CPUs. (https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/)
(Or even better, push 0 instead of sub rsp, 8 + mov, also a missed optimization because compilers don't bother looking for cases where that's possible.)
Presumably the optimization pass that looks for store merging runs before nailing down the locations of locals in the stack frame relative to each other. (Or before even deciding which locals can be kept in registers and which need memory addresses at all.)
Store merging aka coalescing was only recently reintroduced in GCC8 IIRC, after being dropped as a regression from GCC2.95 to GCC3, again IIRC. (I think other optimizations like assuming no strict-aliasing violations to keep more vars in registers more of the time, were more useful). So it's been missing for decades.
From one POV, you could say consider yourself lucky you're getting any store merging at all (with struct members, and array elements, that are known early to be adjacent). Of course, from another POV, compilers ideally should make good asm. But in practice missed optimizations are common. Fortunately we have beefy CPUs with wide superscalar out-of-order execution to usually chew through this crap to still see upcoming cache miss loads and stores pretty quickly, so wasted instructions sometimes have time to execute in the shadow of other bottlenecks. That's not always true, and clogging up space in the out-of-order execution window is never a good thing.
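The struct case that does get merged into a single store can be written out as a small, checkable example. This is a sketch under the assumption that sizeof(bool) is 1 and that false is stored as a zero byte, which holds on mainstream x86-64 ABIs; the names Flags, all_bytes_zero and demo_flags_are_zero are made up for illustration.

```cpp
#include <cassert>
#include <cstring>

struct Flags {
    bool a{false};
    bool b{false};
    bool c{false};
    bool d{false};
    bool e{false};
    bool f{false};
    bool g{false};
    bool h{false};
};

// Assumption: eight bools pack into exactly one 8-byte object.
static_assert(sizeof(Flags) == 8, "eight bools are assumed to occupy 8 bytes");

// Returns true when every byte of the object representation is zero,
// i.e. when a single 8-byte zero store would produce the same object.
bool all_bytes_zero(const Flags& f) {
    unsigned char raw[sizeof(Flags)];
    std::memcpy(raw, &f, sizeof raw);
    for (unsigned char byte : raw) {
        if (byte != 0) return false;
    }
    return true;
}

bool demo_flags_are_zero() {
    Flags f;  // the default member initializers set all eight members to false
    assert(!f.a && !f.h);
    return all_bytes_zero(f);
}
```

Because the members are known to be adjacent inside one object, the compiler can see early that one qword store is enough, which matches the behaviour described for structs and arrays.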
Related: In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values? covers the general case for constants other than 0, re: what the optimal asm would be. (The difference between array vs. separate locals was only discussed in comments there.)

Mixing C++ and assembly: can't pass multiple parameters from a C++ function to assembly

I've been frustrated by passing parameters from a c++ function to assembly. I couldn't find anything that helped on Google and would really like your help. I am using Visual Studio 2017 and masm to compile my assembly code.
This is a simplified version of my c++ file where I call the assembly procedure set_clock
int main()
{
TimeInfo localTime;
char clock[4] = { 0,0,0,0 };
set_clock(clock,&localTime);
system("pause");
return 0;
}
I run into problems in the assembly file. I can't figure out why the second parameter passed to the function turns out huge. I was going off my textbook, which shows similar code with PROC followed by parameters. I don't know why the first parameter is passed successfully and the second one isn't. Can someone tell me the correct way to pass multiple parameters?
.code
set_clock PROC,
array:qword,address:qword
mov rdx,array ; works fine memory address: 0x1052440000616
mov rdi,address ; value of rdi is 14757395258967641292
mov al, [rdx]
mov [rdi],al ; ERROR: cant access that memory location
ret
set_clock ENDP
END

MASM's high-level crap is biting you in the ass. x64 Windows passes the first 4 args in rcx, rdx, r8, r9 (for any of those 4 that are integer/pointer).
mov rdx,array
mov rdi,address
assembles to
mov rdx, rcx ; clobber 2nd arg with a copy of the 1st
mov rdi, rdx ; copy array again
Use a disassembler to check for yourself. It's always a good idea to check the real machine code by disassembling, or by using your debugger's disassembly view instead of source mode, if anything weird is happening with assembler macros.
I'm not sure why this would result in an inaccessible memory location. If both args really are pointers to locals, then it should just be loading and storing back into the same stack location. But if char clock[4] is a const in static storage, it might be in a read-only memory page which would explain the store failing.
Either way, use a debugger and find out.
BTW, rdi is a call-preserved (aka non-volatile) register in the x64 Windows convention. (https://msdn.microsoft.com/en-us/library/9z1stfyw.aspx). Use call-clobbered registers for scratch regs unless you run out and need to save/restore some call-preserved regs. See also Agner Fog's calling conventions doc (http://agner.org/optimize/), and other links in the x86 tag wiki.
It's call-clobbered in x86-64 System V, which also passes args in different registers. Maybe you were looking at a different example?
Hopefully-fixed version, using movzx to avoid a false dependency on RAX when loading a byte.
set_clock PROC,
array:qword,address:qword
movzx eax, byte ptr [array]
mov [address], al
ret
set_clock ENDP
I don't use MASM, but I think array:qword makes array an alias for rcx. Or you could skip declaring the parameters and just use rcx and rdx directly, and document it with comments. That would be easier for everyone to understand.
You definitely don't want useless mov reg,reg instructions cluttering your code; if you're writing in asm in the first place, wasted instructions would cut into any speedups you're getting.
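When debugging a routine like this, it can help to have a plain C++ reference that states what the assembly is supposed to do, so the two can be compared in a debugger. The sketch below is an assumption-heavy stand-in: the TimeInfo layout, the name set_clock_reference and the helper demo_first_byte are all hypothetical, since the question does not show the real definitions.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's TimeInfo; only its first byte is used.
struct TimeInfo {
    char first;
    char rest[3];
};

// Reference behaviour of the fixed routine: copy one byte from the first
// argument (RCX in the Windows x64 convention) to the address given by the
// second argument (RDX).
extern "C" void set_clock_reference(const char* array, TimeInfo* address) {
    assert(array != nullptr && address != nullptr);
    *reinterpret_cast<char*>(address) = array[0];
}

// Hypothetical helper: runs the reference once and reports the byte written.
char demo_first_byte(char value) {
    char clock[4] = { value, 0, 0, 0 };
    TimeInfo localTime{};
    set_clock_reference(clock, &localTime);
    return localTime.first;
}
```

Declaring the real assembly routine with extern "C" and the same two pointer parameters keeps the C++ side and the register assignment (first argument in rcx, second in rdx) easy to line up.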

Is Visual Studio emitting erroneous assembly code?

Disassembling the following function using VS2010
int __stdcall modulo(int a, int b)
{
return a % b;
}
gives you:
push ebp
mov ebp,esp
mov eax,dword ptr [ebp+8]
cdq
idiv eax,dword ptr [ebp+0Ch]
mov eax,edx
pop ebp
ret 8
which is pretty straightforward.
Now, trying the same assembly code as inline assembly fails with error C2414: illegal number of operands, pointing to idiv.
I read Intel's manual and it says that idiv accepts only 1 operand:
Divides the (signed) value in the AX, DX:AX, or EDX:EAX (dividend) by
the source operand (divisor) and stores the result in the AX (AH:AL),
DX:AX, or EDX:EAX registers
and sure enough, removing the extra eax compiles and the function returns the correct result.
So, what is going on here? Why is VS2010 emitting erroneous code? (By the way, VS2012 emits exactly the same assembly.)

The difference, most likely, is that the author of the disassembler intended for its output to be read by a human rather than by an assembler. Since it's a disassembly, one assumes the underlying code has already been assembled/compiled, so why worry about whether it can be re-assembled?
There are two major possibilities here:
The author didn't realize or didn't remember that the dividend is implicit for idiv (in which case this could be considered a bug).
They felt that making it explicit in the output would be helpful for a reader trying to understand the flow of the disassembly (in which case it's a feature). Without the explicit operand, it would be easy to overlook that idiv both depends on and modifies eax when scanning the disassembly.
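The implicit operands of idiv have a close analogue in standard C++: std::div returns the quotient and remainder together, much as idiv leaves the quotient in EAX and the remainder in EDX. A minimal sketch follows; the calling convention from the question is omitted and the non-zero-divisor check is an added assumption.

```cpp
#include <cassert>
#include <cstdlib>

// Remainder of a / b, truncating toward zero like the % operator and idiv.
int modulo(int a, int b) {
    assert(b != 0);  // assumed precondition: division by zero is not handled
    const std::div_t qr = std::div(a, b);
    return qr.rem;   // corresponds to the value idiv leaves in EDX
}

// Quotient of a / b, corresponding to the value idiv leaves in EAX.
int quotient(int a, int b) {
    assert(b != 0);
    const std::div_t qr = std::div(a, b);
    return qr.quot;
}
```

Seeing both results come out of one operation makes it easier to remember that idiv reads and overwrites EAX and EDX even though only the divisor is written in the source.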

How does memory barrier work?

Under Windows, there are three compiler-intrinsic functions to implement memory barrier:
1. _ReadBarrier;
2. _WriteBarrier;
3. _ReadWriteBarrier;
However, I found a weird problem: _ReadBarrier seems to be a dummy function that does nothing! The following is the assembly code generated by VC++ 2012.
My question is: How to implement a memory barrier function in assembly instructions?
int main()
{
013EEE10 push ebp
013EEE11 mov ebp,esp
013EEE13 sub esp,0CCh
013EEE19 push ebx
013EEE1A push esi
013EEE1B push edi
013EEE1C lea edi,[ebp-0CCh]
013EEE22 mov ecx,33h
013EEE27 mov eax,0CCCCCCCCh
013EEE2C rep stos dword ptr es:[edi]
int n = 0;
013EEE2E mov dword ptr [n],0
n = n + 1;
013EEE35 mov eax,dword ptr [n]
013EEE38 add eax,1
013EEE3B mov dword ptr [n],eax
_ReadBarrier();
n = n + 1;
013EEE3E mov eax,dword ptr [n]
013EEE41 add eax,1
013EEE44 mov dword ptr [n],eax
}
013EEE56 xor eax,eax
013EEE58 pop edi
013EEE59 pop esi
013EEE5A pop ebx
013EEE5B add esp,0CCh
013EEE61 cmp ebp,esp
013EEE63 call __RTC_CheckEsp (013EC3B0h)
013EEE68 mov esp,ebp
013EEE6A pop ebp
013EEE6B ret

_ReadBarrier, _WriteBarrier, and _ReadWriteBarrier are intrinsics that affect how the compiler can reorder code; they have absolutely nothing to do with CPU memory barriers and are only valid for specific kinds of memory (see "Affected Memory" here).
MemoryBarrier() is the intrinsic that you use to force a CPU memory barrier. However, the recommendation from Microsoft is to use std::atomic<T> going forward with VC++.
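As a sketch of the std::atomic route, the example below publishes a plain int from one thread to another using a release store and an acquire load. The function name handoff_demo and the value 42 are made up for illustration; this shows the general pattern, not anything specific to VC++.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int handoff_demo() {
    int payload = 0;                  // ordinary data, published via the flag
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });
    std::thread consumer([&] {
        // The acquire load pairs with the release store above, so once the
        // flag reads true the write to payload is visible here as well.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

The ordering requirement is stated on the atomic operations themselves, and the compiler and library pick whatever instructions the target needs, so no explicit barrier intrinsic appears in the code.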

Modern processors are capable of executing instructions quite a long way ahead of the point where they are actually "completing" instructions, so memory barriers are used to prevent the processor from running too far ahead when it comes to certain types of memory operations where strict ordering is required. For most things, it doesn't actually matter if you write to variable a before variable b, or b before a. But sometimes it does.
The x86 instruction set has lfence, sfence and mfence, which are instructions that "fence in" loads, stores and all memory operations respectively. The point of a "fence" or "barrier" instruction is to ensure that all the instructions that precede the barrier instruction have completed their loads, stores or both before the next instruction after the barrier can continue.
This is important if you are implementing, for example, semaphores, mutexes or similar constructs, since it's important to store the value saying "I've locked the semaphore" before you continue to read other data. Otherwise things can go wrong, let's say.
Note that unless you REALLY know what you are doing with memory barriers, it's probably best NOT to use them, and to rely on already existing code that solves the same problem; std::atomic is one place to find such code. I have written quite a bit of "tricky" code, but only once or twice have I needed a memory barrier in my code.
Several times, I've needed to make the compiler not spread the code around, which you can do with "no-op functions", and apparently there are even special intrinsic functions these days to do that.
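In standard C++, one way to express a compiler-only barrier is std::atomic_signal_fence, which constrains how the compiler may reorder memory accesses but is not a CPU fence. The sketch below mirrors the two increments from the question; the global g_counter and the function bump_twice are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

int g_counter = 0;  // hypothetical variable that other code could observe

int bump_twice() {
    g_counter = 0;
    g_counter = g_counter + 1;
    // Compiler-only barrier: memory accesses are not moved across this
    // point by the compiler, but no CPU fence instruction is requested.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    g_counter = g_counter + 1;
    assert(g_counter == 2);
    return g_counter;
}
```

When ordering between threads is required, std::atomic_thread_fence or atomic operations with explicit memory orders are the tools to reach for instead, since those also constrain the hardware.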

There are several important points to consider. Perhaps the first is that barriers only have an effect in multithreaded code, and most compilers require a special option to produce multithreaded code. And things like _ReadBarrier are almost certainly compiler built-ins, and should do nothing unless you've given the options for multithreaded code.
The second is that what the hardware requires, even in a multithreaded context, varies. On most of the machines I've worked on (over some forty years), the machine never required anything; barriers only become relevant if the machine has sophisticated read and write pipelines. (Most earlier machines didn't even have fence or barrier instructions, so the generated code would have to be empty.)
