Why use xor with a literal instead of inversion (bitwise not) - c++

I have come across this CRC32 code and was curious why the author would choose to use
crc = crc ^ ~0U;
instead of
crc = ~crc;
As far as I can tell, they are equivalent.
I have even disassembled the two versions in Visual Studio 2010.
Not optimized build:
crc = crc ^ ~0U;
009D13F4 mov eax,dword ptr [crc]
009D13F7 xor eax,0FFFFFFFFh
009D13FA mov dword ptr [crc],eax
crc = ~crc;
011C13F4 mov eax,dword ptr [crc]
011C13F7 not eax
011C13F9 mov dword ptr [crc],eax
I also cannot justify the code by thinking about the number of cycles that each instruction takes since both should be taking 1 cycle to complete. In fact, the xor might have a penalty by having to load the literal from somewhere, though I am not certain of this.
So I'm left thinking that it is possibly just a preferred way to describe the algorithm, rather than an optimization... Would that be correct?
Edit 1:
Since I just realized that the type of the crc variable is probably important to mention I am including the whole code (less the lookup table, way too big) here so you don't have to follow the link.
uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
const uint8_t *p;
p = buf;
crc = crc ^ ~0U;
while (size--)
{
crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
}
return crc ^ ~0U;
}
Edit 2:
Since someone has brought up the fact that an optimized build would be of interest, I have made one and included it below.
Optimized build:
Note that the whole function (shown in Edit 1 above) was inlined.
// crc = crc ^ ~0U;
zeroCrc = 0;
zeroCrc = crc32(zeroCrc, zeroBufferSmall, sizeof(zeroBufferSmall));
00971148 mov ecx,14h
0097114D lea edx,[ebp-40h]
00971150 or eax,0FFFFFFFFh
00971153 movzx esi,byte ptr [edx]
00971156 xor esi,eax
00971158 and esi,0FFh
0097115E shr eax,8
00971161 xor eax,dword ptr ___defaultmatherr+4 (973018h)[esi*4]
00971168 add edx,ebx
0097116A sub ecx,ebx
0097116C jne main+153h (971153h)
0097116E not eax
00971170 mov ebx,eax
// crc = ~crc;
zeroCrc = 0;
zeroCrc = crc32(zeroCrc, zeroBufferSmall, sizeof(zeroBufferSmall));
01251148 mov ecx,14h
0125114D lea edx,[ebp-40h]
01251150 or eax,0FFFFFFFFh
01251153 movzx esi,byte ptr [edx]
01251156 xor esi,eax
01251158 and esi,0FFh
0125115E shr eax,8
01251161 xor eax,dword ptr ___defaultmatherr+4 (1253018h)[esi*4]
01251168 add edx,ebx
0125116A sub ecx,ebx
0125116C jne main+153h (1251153h)
0125116E not eax
01251170 mov ebx,eax

Something nobody's mentioned yet: if this code is compiled on a machine with a 16-bit unsigned int, then these two code snippets are different.
crc is specified as a 32-bit unsigned integral type. ~crc will invert all bits, but if unsigned int is 16 bits wide then ~0U is 0xFFFF, so crc = crc ^ ~0U will only invert the lower 16 bits.
I don't know enough about the CRC algorithm to know whether this is intentional or a bug, perhaps hivert can clarify; although looking at the sample code posted by OP, it certainly does make a difference to the loop that follows.
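A minimal sketch of that difference. Most current toolchains have a 32-bit unsigned int, so the 16-bit case is simulated here with std::uint16_t; the function names are hypothetical and not part of the original CRC code.

```cpp
#include <cstdint>

// Hypothetical helper: what `crc ^ ~0U` would compute on a platform where
// unsigned int is 16 bits wide, because ~0U is only 0xFFFF there.
constexpr std::uint32_t invert_with_16bit_mask(std::uint32_t crc) {
    // The 16-bit mask is zero-extended before the XOR, so the upper
    // 16 bits of crc are left untouched.
    return crc ^ static_cast<std::uint16_t>(0xFFFFu);
}

// Width-independent alternative: spell out the full 32-bit mask.
constexpr std::uint32_t invert_all_32_bits(std::uint32_t crc) {
    return crc ^ UINT32_C(0xFFFFFFFF);
}

static_assert(invert_with_16bit_mask(0u) == 0x0000FFFFu, "only the low half flips");
static_assert(invert_all_32_bits(0u) == 0xFFFFFFFFu, "all 32 bits flip");
```

Writing the mask as a 32-bit constant (or using ~crc on a uint32_t) avoids depending on the width of unsigned int.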
NB. Sorry for posting this as an "answer" because it isn't an answer, but it's too big to just fit in a comment :)

The short answer is: because it allows a uniform algorithm for all CRCs.
The reason is the following: there are many variants of CRC. Each one depends on a polynomial over Z/2Z, which is used for a Euclidean division. Usually it is implemented using the algorithm described in this paper by Aram Perez. Depending on the polynomial you are using, there is a final XOR at the end of the algorithm, whose value depends on the polynomial and whose goal is to eliminate some corner cases. It happens that for CRC32 this is the same as a global NOT, but this is not true for all CRCs. As evidence, on this web page you can read (emphasis mine):
Consider a message that begins with some number of zero bits. The remainder will never contain anything other than zero until the first one in the message is shifted into it. That's a dangerous situation, since packets beginning with one or more zeros may be completely legitimate and a dropped or added zero would not be noticed by the CRC. (In some applications, even a packet of all zeros may be legitimate!) The simple way to eliminate this weakness is to start with a nonzero remainder. The parameter called initial remainder tells you what value to use for a particular CRC standard. And only one small change is required to the crcSlow() and crcFast() functions:
crc remainder = INITIAL_REMAINDER;
The final XOR value exists for a similar reason. To implement this capability, simply change the value that's returned by crcSlow() and crcFast() as follows:
return (remainder ^ FINAL_XOR_VALUE);
If the final XOR value consists of all ones (as it does in the CRC-32 standard), this extra step will have the same effect as complementing the final remainder. However, implementing it this way allows any possible value to be used in your specific application.
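To make the quoted idea concrete, here is a hedged sketch of a bit-at-a-time CRC in which the initial remainder and final XOR value are parameters. The struct, function and constant names are hypothetical; the field names simply follow the quoted article. It assumes a reflected (LSB-first) 32-bit CRC, which is what the table-driven code in the question implements.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical parameter set for a reflected 32-bit CRC.
struct CrcParams {
    std::uint32_t polynomial;         // reflected form of the generator polynomial
    std::uint32_t initial_remainder;  // value the remainder starts with
    std::uint32_t final_xor_value;    // value XORed into the result at the end
};

inline std::uint32_t crc_bitwise(const CrcParams& params, const void* buf, std::size_t size) {
    const auto* p = static_cast<const std::uint8_t*>(buf);
    std::uint32_t remainder = params.initial_remainder;
    while (size--) {
        remainder ^= *p++;
        // Process the byte one bit at a time, least significant bit first.
        for (int bit = 0; bit < 8; ++bit) {
            remainder = (remainder >> 1) ^ ((remainder & 1u) ? params.polynomial : 0u);
        }
    }
    return remainder ^ params.final_xor_value;
}

// CRC-32 as in the question: all-ones initial remainder and all-ones final XOR.
const CrcParams kCrc32 = {0xEDB88320u, 0xFFFFFFFFu, 0xFFFFFFFFu};
// Same polynomial and initial remainder, but no final XOR (catalogued as CRC-32/JAMCRC).
const CrcParams kJamCrc = {0xEDB88320u, 0xFFFFFFFFu, 0x00000000u};
```

With kCrc32 the last line behaves exactly like ~remainder; with kJamCrc it is a no-op. Writing it as an XOR lets one function body serve both.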

Just to add my own guess to the mix: x ^ 0x0001 flips the last bit and keeps the others; to turn off the last bit use x & 0xFFFE or x & ~0x0001; to turn on the last bit unconditionally use x | 0x0001. I.e., if you are doing lots of bit-twiddling, your fingers probably know those idioms and just roll them out without much thinking.
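The three masking idioms can be written out as small helpers; this is only an illustrative sketch and the helper names are made up. The last static_assert shows why an all-ones mask makes the XOR idiom coincide with bitwise NOT.

```cpp
#include <cstdint>

// XOR flips exactly the bits selected by the mask.
constexpr std::uint32_t toggle_bits(std::uint32_t x, std::uint32_t mask) { return x ^ mask; }
// AND with the inverted mask turns the selected bits off.
constexpr std::uint32_t clear_bits(std::uint32_t x, std::uint32_t mask) { return x & ~mask; }
// OR turns the selected bits on unconditionally.
constexpr std::uint32_t set_bits(std::uint32_t x, std::uint32_t mask) { return x | mask; }

static_assert(toggle_bits(0x6u, 0x1u) == 0x7u, "last bit flipped, others kept");
static_assert(toggle_bits(0x6u, 0xFFFFFFFFu) == ~std::uint32_t{0x6u},
              "XOR with all ones equals bitwise NOT");
```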

I doubt there's any deep reason. Maybe that's how the author thought about it ("I'll just xor with all ones"), or perhaps how it was expressed in the algorithm definition.

I think it is for the same reason that some write
const int zero = 0;
and others write
const int zero = 0x00000000;
Different people think different ways. Even about a fundamental operation.

Related

Inline assembly language - Accessing an array's elements in C++

I'm upgrading a Borland C++ Builder 6 project to the latest Embarcadero C++ Builder (11.1).
It's a 32-bit Windows app.
Some legacy code, not written by me, contains an array of unsigned 32-bit integers, declared outside of a class like this:
UINT32 CRC32_TABLE[] =
{
0x000000000,0x077073096,0x0EE0E612C,0x0990951BA,
0x0076DC419,0x0706AF48F,0x0E963A535,0x09E6495A3}; //the actual array is a lot bigger
Later on there is this function:
static UINT32 GenerateCRC(const void* p_data,
int nbytes)
{
UINT32 rvalue;
__asm
{
mov esi,p_data
mov ecx,nbytes
mov ebx,0FFFFFFFFH // Initialize CRC accumulator to -1
/*--------------------------------------*/
/* Accumulate the CRC in the specified */
/* range of memory. */
/*--------------------------------------*/
loop2: // FOR i = 1 TO nbytes DO
xor eax,eax
mov al,[esi] // Get a byte to be checked
inc esi // Bump index
xor al,bl // XOR NEW BYTE WITH LOW CRC
shl eax,2 // MAKE IT A DWORD INDEX
mov edi,eax //
shr ebx,8 // SHIFT OLD CRC RIGHT 8
xor ebx,CRC32_TABLE[edi] // XOR SHIFTED CRC WITH CONSTANT
loop loop2 // ENDFOR
not ebx
mov rvalue,ebx
}
return rvalue;
}
The current compiler won't accept xor ebx,CRC32_TABLE[edi], saying "cannot use a base register with variable reference".
It is decades since I've done any assembly language work, so any pointers to fixing this would be very much appreciated.
I'm very open to replacing the asm code with C++, btw. I have no idea why this code was written in assembly language. It's not used in a time critical section but to verify logins... I'd like to think there was a better reason than 'because I could' but, judging from the C++ code, this was written by someone (sadly, no longer with us) who liked to complicate things for no reason...
Andy
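Since the question is open to replacing the assembly, here is a hedged sketch of what a plain C++ equivalent might look like. It assumes the legacy CRC32_TABLE is the standard reflected CRC-32 table for polynomial 0xEDB88320 (the entries shown above match it); MakeCrc32Table and GenerateCrcPortable are hypothetical names, and in the real project the loop could index the existing table instead of building one.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumption: builds the standard reflected CRC-32 table, whose first
// entries (0x00000000, 0x77073096, 0xEE0E612C, ...) match CRC32_TABLE.
inline std::array<std::uint32_t, 256> MakeCrc32Table() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t value = i;
        for (int bit = 0; bit < 8; ++bit) {
            value = (value >> 1) ^ ((value & 1u) ? 0xEDB88320u : 0u);
        }
        table[i] = value;
    }
    return table;
}

// Hypothetical C++ counterpart of the __asm body in GenerateCRC.
inline std::uint32_t GenerateCrcPortable(const void* p_data, int nbytes) {
    static const std::array<std::uint32_t, 256> table = MakeCrc32Table();
    const auto* p = static_cast<const std::uint8_t*>(p_data);
    std::uint32_t crc = 0xFFFFFFFFu;                        // mov ebx,0FFFFFFFFH
    for (int i = 0; i < nbytes; ++i) {
        const std::uint32_t index = (crc ^ p[i]) & 0xFFu;   // xor al,bl
        crc = (crc >> 8) ^ table[index];                    // shr ebx,8 / xor ebx,table entry
    }
    return ~crc;                                            // not ebx
}
```

One behavioural difference to be aware of: with nbytes equal to 0 the assembly's loop instruction would wrap ECX and keep running, whereas this sketch simply processes nothing.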

What is the compiler doing here that allows comparison of many values to be done with few actual comparisons?

My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.
Given this enum:
enum MyEnum {
Entry1,
Entry2,
... // Entry3..27 are the same, omitted for size.
Entry28,
Entry29
};
And this function:
bool MyFunction(MyEnum e)
{
if (
e == MyEnum::Entry1 ||
e == MyEnum::Entry3 ||
e == MyEnum::Entry8 ||
e == MyEnum::Entry14 ||
e == MyEnum::Entry15 ||
e == MyEnum::Entry18 ||
e == MyEnum::Entry21 ||
e == MyEnum::Entry22 ||
e == MyEnum::Entry25)
{
return true;
}
return false;
}
For the function, MSVC generates this assembly when compiled with -Ox optimization flag (Godbolt):
bool MyFunction(MyEnum) PROC ; MyFunction
cmp ecx, 24
ja SHORT $LN5@MyFunction
mov eax, 20078725 ; 01326085H
bt eax, ecx
jae SHORT $LN5@MyFunction
mov al, 1
ret 0
$LN5@MyFunction:
xor al, al
ret 0
Clang generates similar (slightly better, one less jump) assembly when compiled with -O3 flag:
MyFunction(MyEnum): # @MyFunction(MyEnum)
cmp edi, 24
ja .LBB0_2
mov eax, 20078725
mov ecx, edi
shr eax, cl
and al, 1
ret
.LBB0_2:
xor eax, eax
ret
What is happening here? I see that even if I add more enum comparisons to the function, the assembly that is generated does not actually become "more", it's only this magic number (20078725) that changes. That number depends on how many enum comparisons are happening in the function. I do not understand what is happening here.
The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:
bool MyFunction2(MyEnum e)
{
if (
e == MyEnum::Entry1 |
e == MyEnum::Entry3 |
e == MyEnum::Entry8 |
e == MyEnum::Entry14 |
e == MyEnum::Entry15 |
e == MyEnum::Entry18 |
e == MyEnum::Entry21 |
e == MyEnum::Entry22 |
e == MyEnum::Entry25)
{
return true;
}
return false;
}
This results in this generated assembly with MSVC:
bool MyFunction2(MyEnum) PROC ; MyFunction2
xor edx, edx
mov r9d, 1
cmp ecx, 24
mov eax, edx
mov r8d, edx
sete r8b
cmp ecx, 21
sete al
or r8d, eax
mov eax, edx
cmp ecx, 20
cmove r8d, r9d
cmp ecx, 17
sete al
or r8d, eax
mov eax, edx
cmp ecx, 14
cmove r8d, r9d
cmp ecx, 13
sete al
or r8d, eax
cmp ecx, 7
cmove r8d, r9d
cmp ecx, 2
sete dl
or r8d, edx
test ecx, ecx
cmove r8d, r9d
test r8d, r8d
setne al
ret 0
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
Quite smart! The first comparison with 24 is to do a rough range check - if it's more than 24 or less than 0 it will bail out; this is important as the instructions that follow that operate on the magic number have a hard cap to [0, 31] for operand range.
For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.
>>> bin(20078725)
'0b1001100100110000010000101'
It's easy to spot the first and third bits (counting from 1 and from right) set, the 8th, 14th, 15th, ...
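The same constant can be rebuilt at compile time from the enumerator values, which may make the correspondence easier to see. This is a sketch with hypothetical names (kAcceptedValues, BuildMask); the values are the zero-based positions of the enumerators tested in MyFunction.

```cpp
#include <cstdint>

// Zero-based values of Entry1, Entry3, Entry8, Entry14, Entry15,
// Entry18, Entry21, Entry22 and Entry25.
constexpr unsigned kAcceptedValues[] = {0, 2, 7, 13, 14, 17, 20, 21, 24};

// Hypothetical helper: OR together one bit per accepted value.
constexpr std::uint32_t BuildMask(const unsigned* values, int count) {
    return count == 0
               ? 0u
               : (std::uint32_t{1} << values[count - 1]) | BuildMask(values, count - 1);
}

constexpr std::uint32_t kMask = BuildMask(kAcceptedValues, 9);
static_assert(kMask == 20078725u, "matches the constant in the compiler output");
```

Adding another comparison to the if statement just sets one more bit, which is why only the constant changes in the generated code.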
MSVC checks it "directly" using the BT (bit test) instruction and branching; clang instead shifts it by the appropriate amount (to get the relevant bit in the lowest-order position) and keeps just that bit by ANDing it with 1 (avoiding a branch).
The C code corresponding to the clang version would be something like:
bool MyFunction(MyEnum e) {
if(unsigned(e) > 24) return false;
return (20078725 >> e) & 1;
}
as for the MSVC version, it's more like
inline bool bit_test(unsigned val, int bit) {
return val & (1<<bit);
}
bool MyFunction(MyEnum e) {
if(unsigned(e) > 24) return false;
return bit_test(20078725, e);
}
(I kept the bit_test function separated to emphasize that it's actually a single instruction in assembly; that val & (1<<bit) thing has no correspondence to the original assembly.)
As for MyFunction2, the generated code is quite bad: it uses a lot of CMOVs and ORs the results together, which is both longer code and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).
Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.
Ok, doing a quick test with all the versions against all compilers, we can see that:
- the C translation of the Clang output above results in pretty much the same code (equal to the Clang output) in all compilers; similarly for the MSVC translation;
- the bitwise OR version is the same as the logical OR version (= good) in both Clang and gcc;
- in general, gcc does essentially the same thing as Clang except for the switch case;
- switch results are varied:
  - Clang does best, by generating the exact same code;
  - both gcc and MSVC generate jump-table based code, which in this case is less good; however:
    - gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
    - MSVC instead emits a table of BYTEs, paying for it in setup code size; I couldn't get gcc to emit similar code even changing -O3 to -Os (optimize for size).
Ah, the old immediate bitmap trick.
GCC does this too, at least for a switch.
x86 asm casetable implementation. Unfortunately GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3 ; GCC8 and earlier do a better job.
Another example of using it, this time for code-golf (fewest bytes of code, in this case x86 machine code) to detect certain letters: User Appreciation Challenge #1: Dennis ♦
The basic idea is to use the input as an index into a bitmap of true/false results.
First you have to range-check because the bitmap is fixed-width, and x86 shifts wrap the shift count. We don't want high inputs to alias into the range where there are some that should return true. That is what cmp edi, 24 / ja is doing.
(If the range between the lowest and highest true values was from 120 to 140, for example, it might start with a sub edi,120 to range-shift everything before the cmp.)
Then you use bitmap & (1<<e) (the bt instruction), or (bitmap >> e) & 1 (shr / and) to check the bit in the bitmap that tells you whether that e value should return true or false.
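A hedged sketch of the range-shifted variant described above, written in C++. The accepted set {120, 123, 131, 140} and all names are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// Hypothetical example set: accept 120, 123, 131 and 140.
constexpr unsigned kBase = 120;
constexpr unsigned kSpan = 140 - 120;  // highest bit index after range-shifting
constexpr std::uint32_t kBitmap =
    (1u << (120 - kBase)) | (1u << (123 - kBase)) |
    (1u << (131 - kBase)) | (1u << (140 - kBase));

constexpr bool InSet(unsigned x) {
    // x - kBase wraps to a huge unsigned value for x < kBase, so a single
    // unsigned compare rejects inputs on both sides of the range.
    // The shift only happens when the index is known to be at most kSpan.
    return (x - kBase) <= kSpan && ((kBitmap >> (x - kBase)) & 1u) != 0;
}
```

Without the range check, an input such as 152 would index bit 32, which on x86 wraps around to bit 0 and would wrongly report a match.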
There are many ways to implement that check, logically equivalent but with performance differences.
If the range was wider than 32, it would have to use 64-bit operand-size. If it was wider than 64, the compiler might not attempt this optimization at all. Or might still do it for some of the conditions that are in a narrow range.
Using an even larger bitmap (in .rodata memory) would be possible but probably not something most compilers will invent for you. Either with bt [mem],reg (inefficient) or manually indexing a dword and checking that the same way this code checks the immediate bitmap. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless...
Clang/LLVM has other tricks up its sleeve for efficiently comparing against multiple values (when it doesn't matter which one is hit), e.g. broadcast a value into a SIMD register and use a packed compare. That isn't dependent on the values being in a dense range. (Clang generates worse code for 7 comparisons than for 8 comparisons)
that optimizes the code way more than what I would think is possible.
These kinds of optimizations come from smart human compiler developers that notice common patterns in source code and think of clever ways to implement them. Then get compilers to recognize those patterns and transform their internal representation of the program logic to use the trick.
Turns out that switch and switch-like if() statements are common, and aggressive optimizations are common.
Compilers are far from perfect, but sometimes they do come close to living up to what people often claim; that compilers will optimize your code for you so you can write it in a human-readable way and still have it run near-optimally. This is sometimes true over the small scale.
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
The immediate bitmap is vastly more efficient. There's no data memory access in either one, so no cache-miss loads. The only "expensive" instruction is a variable-count shift (3 uops on mainstream Intel, because of x86's annoying FLAGS-setting semantics; BMI2 shrx is only 1 uop and avoids having to mov the number to ecx). See https://agner.org/optimize, and other performance analysis links in https://stackoverflow.com/tags/x86/info.
Each instruction in the cmp/cmov chain is at least 1 uop, and there's a pretty long dependency chain through each cmov because MSVC didn't bother to break it into 2 or more parallel chains. But regardless it's just a lot of uops, far more than the bitmap version, so worse for throughput (ability for out-of-order exec to overlap the work with surrounding code) as well as latency.
bt is also cheap: 1 uop on modern AMD and Intel. (bts, btr, btc are 2 on AMD, still 1 on Intel).
The branch in the immediate-bitmap version could have been a setna / and to make it branchless, but especially for this enum definition the compiler expected that it would be in range. It could have increased branch predictability by only requiring e <= 31, not e <= 24.
Since the enum only goes up to 29, and IIRC it's UB to have out-of-range enum values, it could actually optimize it away entirely.
Even if the e>24 branch doesn't predict very well, it's still probably better overall. Given current compilers, we only get a choice between the nasty chain of cmp/cmov or branch + bitmap. Unless we turn the asm logic back into C to hand-hold compilers into making the asm we want, then we can maybe get branchless with an AND or CMOV to make it always zero for out-of-range e.
But if we're lucky, profile-guided optimization might let some compilers make the bitmap range check branchless. (In asm the behaviour of shl reg, cl with cl > 31 or 63 is well-defined: on x86 it simply masks the count. In a C equivalent, you could use bitmap >> (e&31) which can still optimize to a shr; compilers know that x86 shr masks the count so they can optimize that away. But not for other ISAs that saturate the shift count...)
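One way to express that idea in source form, as a sketch only: mask the shift count so the shift is always well-defined in C++, and combine the result with the range check using AND rather than an early return. The names are hypothetical, and whether a given compiler actually emits branchless code for this is not guaranteed.

```cpp
#include <cstdint>

constexpr std::uint32_t kEnumBitmap = 20078725u;  // the immediate bitmap discussed above

// Hypothetical branchless formulation of the bitmap check.
inline bool MyFunctionBranchless(unsigned e) {
    // e & 31 keeps the shift count in range, mirroring what x86 shr does anyway.
    const std::uint32_t bit = (kEnumBitmap >> (e & 31u)) & 1u;
    // The range check becomes a 0/1 value instead of a branch.
    const std::uint32_t in_range = (e <= 24u) ? 1u : 0u;
    return (bit & in_range) != 0;
}
```

The in_range term is what stops inputs such as 32 or 34 from aliasing onto bits 0 and 2 after the count is masked.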
There are lots of ways to implement the bitmap check that are pretty much equivalent. e.g. you could even use the CF output of shr, set according to the last bit shifted out. At least if you make sure CF has a known state ahead of time for the cl=0 case.
When you want an integer bool result, right-shifting seems to make more sense than bt / setcc, but with shr costing 3 uops on Intel it might actually be best to use bt reg,reg / setc al. Especially if you only need a bool, and can use EAX as your bitmap destination so the previous value of EAX is definitely ready before setcc. (Avoiding a false dependency on some unrelated earlier dep chain.)
BTW, MSVC has other silliness: as What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains, xor al,al is totally stupid compared to xor eax,eax when you want to zero AL. If you don't need to leave the upper bytes of RAX unmodified, zero the full register with a zeroing idiom.
And of course branching just to return 0 or return 1 makes little sense, unless you expect it to be very predictable and want to break the data dependency. I'd expect that setc al would make more sense to read the CF result of bt.

How to measure the number of increments per second

I want to measure the speed in which my PC can increment a counter N times (e.g., for N = 10^9).
I tried the following code:
using namespace std;
auto start = chrono::steady_clock::now();
for (int i = 0; i < N; ++i)
{
}
auto end = chrono::steady_clock::now();
However, the compiler is smart enough to simply set i=N, and I get that start==end regardless of the value of N.
How can I change the code to measure the increment speed? (adding costly operations in the loop would dominate the runtime and would not allow the measurement to be correct).
I use Windows 10 and Visual Studio 15.9.7.
A bit of motivation: my code takes about 2 seconds for N=10^9. I'm wondering if there's any "meat" left in optimizing it further (e.g., could it possibly go down to 1 sec? or would the loop itself require more?)
This question doesn't really make sense in C or C++. The compiler aims to generate the fastest code that meets the constraints defined by your source code. In your question, you do not define a constraint that the compiler must do a loop at all. Because the loop has no effect, the optimizer will remove it.
Gabriel Staple's answer is probably the nearest thing you can get to a sensible answer to your question, but it is also not quite right because it defines too many constraints that limit the compiler's freedom to implement optimal code. Volatile often forces the compiler to write the result back to memory each time the variable is modified.
eg, this code:
void foo(int N) {
for (volatile int i = 0; i < N; ++i)
{
}
}
Becomes this assembly (on an x64 compiler I tried):
mov DWORD PTR [rsp-4], 0
mov eax, DWORD PTR [rsp-4]
cmp edi, eax
jle .L1
.L3:
mov eax, DWORD PTR [rsp-4] # Read i from mem
add eax, 1 # i++
mov DWORD PTR [rsp-4], eax # Write i to mem
mov eax, DWORD PTR [rsp-4] # Read it back again before
# evaluating the loop condition.
cmp eax, edi # Is i < N?
jl .L3 # Jump back to .L3 if so.
.L1:
It sounds like your real question is more like how fast is:
L1: add eax, 1
jmp L1
Even the answer to that is complex and requires an understanding of the internals of your CPU's pipelines.
I recommend playing with Godbolt to understand more about what the compiler is doing. eg https://godbolt.org/z/59XUSu
You can directly measure the speed of the "empty loop", but it is not easy to convince a C++ compiler to emit it. GCC and Clang can be tricked with asm volatile("") but MSVC inline assembly has always been different and is disabled completely for 64bit programs.
It is possible to use MASM to side-step that restriction:
.MODEL FLAT
.CODE
_testfun PROC
sub ecx, 1
jnz _testfun
ret
_testfun ENDP
END
Import it into your code with extern "C" void testfun(unsigned N);.
Try volatile int i = 0 in your for loop. The volatile keyword tells the compiler this variable could change at any time, due to outside events or threads, and therefore it can't make the same assumptions about what the variable might be in the future.
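A minimal sketch of the volatile approach combined with the timing code from the question. LoopTiming and TimeVolatileLoop are hypothetical names. As noted in the other answer, this measures a load, an add and a store per iteration rather than a bare register increment, so treat the result as an upper bound on the loop cost.

```cpp
#include <chrono>

// Hypothetical result type: the counter's final value and the elapsed time.
struct LoopTiming {
    int final_count;
    double seconds;
};

// Hypothetical helper: run n increments of a volatile counter and time them.
inline LoopTiming TimeVolatileLoop(int n) {
    const auto start = std::chrono::steady_clock::now();
    volatile int i = 0;
    while (i < n) {
        i = i + 1;  // each access to a volatile object must really be performed
    }
    const auto end = std::chrono::steady_clock::now();
    return {i, std::chrono::duration<double>(end - start).count()};
}
```

Dividing final_count by seconds gives increments per second for this particular (memory round-trip) form of the loop.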

Faster way of adding negative signed to unsigned

Assuming I have a: usize and a negative b:isize how do I achieve the following semantics - reduce a by absolute value of b in fastest manner possible?
I already thought of a - (b.abs() as usize), but I'm wondering if there is a faster way. Something with bit manipulation, perhaps?
Why do you assume this is slow? If that code is put in a function and compiled, on x86-64 linux, it generates the following:
_ZN6simple20h0f921f89f1d823aeeaaE:
mov rax, rsi
neg rax
cmovl rax, rsi
sub rdi, rax
mov rax, rdi
ret
That's assuming it doesn't get inlined, which I had to spend a few minutes preventing the optimiser from doing in order to get the above.
That's not to say it definitely couldn't be done faster, but I'm unconvinced it could be done faster by much.
If b is guaranteed to be negative, then you can just do a + b.
In Rust, we must first cast one of the operands to the same type as the other one, then we must use wrapping_add instead of simply using operator + as debug builds panic on overflow (an overflow occurs when using + on usize because negative numbers become very large positive numbers after the cast).
fn main() {
let a: usize = 5;
let b: isize = -2;
let c: usize = a.wrapping_add(b as usize);
println!("{}", c); // prints 3
}
With optimizations, wrapping_add compiles to a single add instruction.

What is faster than std::pow?

My program spends 90% of CPU time in the std::pow(double,int) function. Accuracy is not a primary concern here, so I was wondering if there were any faster alternatives. One thing I was thinking of trying is casting to float, performing the operation and then back to double (haven't tried this yet); I am concerned that this is not a portable way of improving performance (don't most CPUs operate on doubles intrinsically anyway?)
Cheers
It looks like Martin Ankerl has a few articles on this. Optimized Approximative pow() in C / C++ is one, and it has two fast versions; one is as follows:
inline double fastPow(double a, double b) {
union {
double d;
int x[2];
} u = { a };
u.x[1] = (int)(b * (u.x[1] - 1072632447) + 1072632447);
u.x[0] = 0;
return u.d;
}
which relies on type punning through a union, which is undefined behavior in C++. From the draft standard, section 9.5 [class.union]:
In a union, at most one of the non-static data members can be active at any time, that is, the value of at
most one of the non-static data members can be stored in a union at any time. [...]
but most compilers including gcc support this with well defined behavior:
The practice of reading from a different union member than the one most recently written to (called “type-punning”) is common. Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type
but this is not universal, as this article points out; and, as I point out in my answer here, using memcpy should generate identical code and does not invoke undefined behavior.
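A hedged sketch of what the memcpy-based variant could look like. fastPowMemcpy is a hypothetical name; the constant and the arithmetic are taken from the union version above. It assumes a 64-bit IEEE-754 double, and it works on the high 32 bits of the value directly rather than on x[1], so it does not depend on the little-endian layout the union version relies on.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes a 64-bit double");

// Hypothetical memcpy-based counterpart of fastPow (same approximation).
inline double fastPowMemcpy(double a, double b) {
    std::uint64_t bits;
    std::memcpy(&bits, &a, sizeof bits);  // well-defined way to read the representation
    // High 32 bits: sign, exponent and top of the mantissa (u.x[1] on little-endian).
    std::int32_t hi = static_cast<std::int32_t>(bits >> 32);
    hi = static_cast<std::int32_t>(b * (hi - 1072632447) + 1072632447);
    // Low 32 bits are cleared, as u.x[0] = 0 does in the union version.
    bits = static_cast<std::uint64_t>(static_cast<std::uint32_t>(hi)) << 32;
    std::memcpy(&a, &bits, sizeof a);
    return a;
}
```

Like the original, this is only a rough approximation: fastPowMemcpy(2.0, 3.0) lands within roughly 12% of 8 rather than on it.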
He also links to a second one Optimized pow() approximation for Java, C / C++, and C#.
The first article also links to his microbenchmarks here
Depending on what you need to do, operating in the log domain might work — that is, you replace all of your values with their logarithms; multiplication becomes addition, division becomes subtraction, and exponentiation becomes multiplication. But now addition and subtraction become expensive and somewhat error-prone operations.
How big are your integers? Are they known at compile time? It's far better to compute x^2 as x*x as opposed to pow(x,2). Note: Almost all applications of pow() to an integer power involve raising some number to the second or third power (or the multiplicative inverse in the case of negative exponents). Using pow() is overkill in such cases. Use a template for these small integer powers, or just use x*x.
If the integers are small, but not known at compile time, say between -12 and +12, multiplication will still beat pow() and won't lose accuracy. You don't need eleven multiplications to compute x^12. Four will do. Use the fact that x^(2n) = (x^n)^2 and x^(2n+1) = x*((x^n)^2). For example, x^12 is ((x*x*x)^2)^2. Two multiplications to compute x^3 (x*x*x), one more to compute x^6, and one final one to compute x^12.
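The squaring identities can be turned into a short loop. This is a sketch with a hypothetical name (IntPow) that uses the standard binary method; for x^12 it performs a couple more multiplications than the hand-derived chain of four, but the count still grows only logarithmically with the exponent.

```cpp
// Hypothetical helper: x^n for integer n by repeated squaring.
inline double IntPow(double x, int n) {
    // Work on the magnitude as unsigned so that negating INT_MIN cannot overflow.
    unsigned magnitude = n < 0 ? 0u - static_cast<unsigned>(n) : static_cast<unsigned>(n);
    double result = 1.0;
    while (magnitude != 0) {
        if (magnitude & 1u) {
            result *= x;  // this bit of the exponent contributes the current power
        }
        x *= x;           // x, x^2, x^4, x^8, ...
        magnitude >>= 1;
    }
    // Negative exponents give the multiplicative inverse.
    return n < 0 ? 1.0 / result : result;
}
```

When the exponent is a compile-time constant such as 2 or 3, writing x*x or x*x*x directly remains the simplest option.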
YES! Very fast if you only need 'y'/'n' as a long/int which allows you to avoid the slow FPU FSCALE function. This is Agner Fog's x86 hand-optimized version if you only need results with 'y'/'n' as an INT. I upgraded it to __fastcall/__declspec(naked) for speed/size, made use of ECX to pass 'n' (floats always are passed in stack for 32-bit MSVC++), so very minor tweaks on my part, it's mostly Agner's work. It was tested/debugged/compiled on MS Visual VC++ 2005 Express/Pro, so should be OK to slip in newer versions. Accuracy against the universal CRT pow() function is very good.
extern double __fastcall fs_power(double x, long n);
// Raise 'x' to the power 'n' (INT-only) in ASM by the great Agner Fog!
__declspec(naked) double __fastcall fs_power(double x, long n) { __asm {
MOV EAX, ECX ;// Move 'n' to eax
;// abs(n) is calculated by inverting all bits and adding 1 if n < 0:
CDQ ;// Get sign bit into all bits of edx
XOR EAX, EDX ;// Invert bits if negative
SUB EAX, EDX ;// Add 1 if negative. Now eax = abs(n)
JZ RETZERO ;// End if n = 0
FLD1 ;// ST(0) = 1.0 (FPU push1)
FLD QWORD PTR [ESP+4] ;// Load 'x' : ST(0) = 'x', ST(1) = 1.0 (FPU push2)
JMP L2 ;// Jump into loop
L1: ;// Top of loop
FMUL ST(0), ST(0) ;// Square x
L2: ;// Loop entered here
SHR EAX, 1 ;// Get each bit of n into carry flag
JNC L1 ;// No carry. Skip multiplication, goto next
FMUL ST(1), ST(0) ;// Multiply by x squared i times for bit # i
JNZ L1 ;// End of loop. Stop when nn = 0
FSTP ST(0) ;// Discard ST(0) (FPU Pop1)
TEST EDX, EDX ;// Test if 'n' was negative
JNS RETPOS ;// Finish if 'n' was positive
FLD1 ;// ST(0) = 1.0, ST(1) = x^abs(n)
FDIVR ;// Reciprocal
RETPOS: ;// Finish, success!
RET 4 ;//(FPU Pop2 occurs by compiler on assignment
RETZERO:
FLDZ ;// Ret 0.0, fail, if n was 0
RET 4
}}