My program spends 90% of CPU time in the std::pow(double,int) function. Accuracy is not a primary concern here, so I was wondering if there were any faster alternatives. One thing I was thinking of trying is casting to float, performing the operation and then back to double (haven't tried this yet); I am concerned that this is not a portable way of improving performance (don't most CPUs operate on doubles intrinsically anyway?)
Cheers
It looks like Martin Ankerl has a few articles on this; Optimized Approximative pow() in C / C++ is one, and it gives two fast versions, one of which is as follows:
inline double fastPow(double a, double b) {
    // Reinterpret the double's bits as two 32-bit ints and scale the high
    // half (sign, exponent, top of mantissa) linearly; assumes little-endian
    // IEEE 754 doubles.
    union {
        double d;
        int x[2];
    } u = { a };
    u.x[1] = (int)(b * (u.x[1] - 1072632447) + 1072632447);
    u.x[0] = 0;
    return u.d;
}
which relies on type punning through a union, which is undefined behavior in C++; from the draft standard section 9.5 [class.union]:
In a union, at most one of the non-static data members can be active at any time, that is, the value of at
most one of the non-static data members can be stored in a union at any time. [...]
but most compilers, including gcc, support this with well-defined behavior:
The practice of reading from a different union member than the one most recently written to (called “type-punning”) is common. Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type
but this is not universal, as this article points out; and as I point out in my answer here, using memcpy should generate identical code without invoking undefined behavior.
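For illustration, a minimal sketch of the memcpy equivalent (my naming; it assumes little-endian IEEE 754 doubles, just like the union version):
#include <cstdint>
#include <cstring>
inline double fastPowMemcpy(double a, double b) {
    std::int32_t x[2];
    std::memcpy(x, &a, sizeof x);  // read the bit pattern of 'a'
    x[1] = static_cast<std::int32_t>(b * (x[1] - 1072632447) + 1072632447);
    x[0] = 0;
    double d;
    std::memcpy(&d, x, sizeof d);  // write the adjusted bits back as a double
    return d;
}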
He also links to a second one, Optimized pow() approximation for Java, C / C++, and C#.
The first article also links to his microbenchmarks here.
Depending on what you need to do, operating in the log domain might work — that is, you replace all of your values with their logarithms; multiplication becomes addition, division becomes subtraction, and exponentiation becomes multiplication. But now addition and subtraction become expensive and somewhat error-prone operations.
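A minimal sketch of the idea, assuming all values are positive (the win only materializes if values stay in log form across many operations; the helper name is mine):
#include <cmath>
// Keep lx = log(x) instead of x; pow(x, n) is then a single multiply.
double pow_via_logs(double x, double n) {
    double lx = std::log(x);   // one-time conversion into the log domain
    return std::exp(n * lx);   // convert back only when a plain value is needed
}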
How big are your integers? Are they known at compile time? It's far better to compute x^2 as x*x as opposed to pow(x,2). Note: Almost all applications of pow() to an integer power involve raising some number to the second or third power (or the multiplicative inverse in the case of negative exponents). Using pow() is overkill in such cases. Use a template for these small integer powers, or just use x*x.
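For the compile-time case, a template along these lines (my own sketch) unrolls to plain multiplications:
// Power<3>::of(x) compiles down to x*x*x, with no call to pow().
template <int N>
struct Power {
    static double of(double x) { return x * Power<N - 1>::of(x); }
};
template <>
struct Power<0> {
    static double of(double) { return 1.0; }
};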
If the integers are small, but not known at compile time, say between -12 and +12, multiplication will still beat pow() and won't lose accuracy. You don't need eleven multiplications to compute x^12. Four will do. Use the fact that x^(2n) = (x^n)^2 and x^(2n+1) = x*((x^n)^2). For example, x^12 is ((x*x*x)^2)^2. Two multiplications to compute x^3 (x*x*x), one more to compute x^6, and one final one to compute x^12.
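A sketch of that squaring scheme as a runtime loop (my own illustration; negative exponents go through the reciprocal):
double ipow(double x, int n) {
    if (n < 0) return 1.0 / ipow(x, -n);  // x^-n = 1/x^n
    double result = 1.0;
    while (n > 0) {
        if (n & 1) result *= x;  // fold in the current power when the bit is set
        x *= x;                  // square: x, x^2, x^4, x^8, ...
        n >>= 1;
    }
    return result;
}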
YES! Very fast if you only need 'y'/'n' as a long/int, which lets you avoid the slow FPU FSCALE instruction. This is Agner Fog's x86 hand-optimized version for the case where 'n' is an INT. I upgraded it to __fastcall/__declspec(naked) for speed/size and used ECX to pass 'n' (floats are always passed on the stack in 32-bit MSVC++), so the tweaks on my part are very minor; it's mostly Agner's work. It was tested/debugged/compiled on MS Visual C++ 2005 Express/Pro, so it should be OK to slip into newer versions. Accuracy against the universal CRT pow() function is very good.
extern double __fastcall fs_power(double x, long n);
// Raise 'x' to the power 'n' (INT-only) in ASM by the great Agner Fog!
__declspec(naked) double __fastcall fs_power(double x, long n) { __asm {
MOV EAX, ECX ;// Move 'n' to eax
;// abs(n) is calculated by inverting all bits and adding 1 if n < 0:
CDQ ;// Get sign bit into all bits of edx
XOR EAX, EDX ;// Invert bits if negative
SUB EAX, EDX ;// Add 1 if negative. Now eax = abs(n)
JZ RETZERO ;// End if n = 0
FLD1 ;// ST(0) = 1.0 (FPU push1)
FLD QWORD PTR [ESP+4] ;// Load 'x' : ST(0) = 'x', ST(1) = 1.0 (FPU push2)
JMP L2 ;// Jump into loop
L1: ;// Top of loop
FMUL ST(0), ST(0) ;// Square x
L2: ;// Loop entered here
SHR EAX, 1 ;// Get each bit of n into carry flag
JNC L1 ;// No carry. Skip multiplication, goto next
FMUL ST(1), ST(0) ;// Multiply by x squared i times for bit # i
JNZ L1 ;// End of loop. Stop when n = 0
FSTP ST(0) ;// Discard ST(0) (FPU Pop1)
TEST EDX, EDX ;// Test if 'n' was negative
JNS RETPOS ;// Finish if 'n' was positive
FLD1 ;// ST(0) = 1.0, ST(1) = x^abs(n)
FDIVR ;// Reciprocal
RETPOS: ;// Finish, success!
RET 4 ;// (FPU Pop2 occurs in compiler-generated code on assignment)
RETZERO:
FLDZ ;// Ret 0.0, fail, if n was 0
RET 4
}}
My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.
Given this enum:
enum MyEnum {
Entry1,
Entry2,
... // Entry3..27 are the same, omitted for size.
Entry28,
Entry29
};
And this function:
bool MyFunction(MyEnum e)
{
if (
e == MyEnum::Entry1 ||
e == MyEnum::Entry3 ||
e == MyEnum::Entry8 ||
e == MyEnum::Entry14 ||
e == MyEnum::Entry15 ||
e == MyEnum::Entry18 ||
e == MyEnum::Entry21 ||
e == MyEnum::Entry22 ||
e == MyEnum::Entry25)
{
return true;
}
return false;
}
For the function, MSVC generates this assembly when compiled with -Ox optimization flag (Godbolt):
bool MyFunction(MyEnum) PROC ; MyFunction
cmp ecx, 24
ja SHORT $LN5@MyFunction
mov eax, 20078725 ; 01326085H
bt eax, ecx
jae SHORT $LN5@MyFunction
mov al, 1
ret 0
$LN5@MyFunction:
xor al, al
ret 0
Clang generates similar (slightly better, one less jump) assembly when compiled with -O3 flag:
MyFunction(MyEnum): # #MyFunction(MyEnum)
cmp edi, 24
ja .LBB0_2
mov eax, 20078725
mov ecx, edi
shr eax, cl
and al, 1
ret
.LBB0_2:
xor eax, eax
ret
What is happening here? I see that even if I add more enum comparisons to the function, the generated assembly does not actually become "more"; only this magic number (20078725) changes, depending on which enum comparisons happen in the function. I do not understand what is happening here.
The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:
bool MyFunction2(MyEnum e)
{
if (
e == MyEnum::Entry1 |
e == MyEnum::Entry3 |
e == MyEnum::Entry8 |
e == MyEnum::Entry14 |
e == MyEnum::Entry15 |
e == MyEnum::Entry18 |
e == MyEnum::Entry21 |
e == MyEnum::Entry22 |
e == MyEnum::Entry25)
{
return true;
}
return false;
}
This results in this generated assembly with MSVC:
bool MyFunction2(MyEnum) PROC ; MyFunction2
xor edx, edx
mov r9d, 1
cmp ecx, 24
mov eax, edx
mov r8d, edx
sete r8b
cmp ecx, 21
sete al
or r8d, eax
mov eax, edx
cmp ecx, 20
cmove r8d, r9d
cmp ecx, 17
sete al
or r8d, eax
mov eax, edx
cmp ecx, 14
cmove r8d, r9d
cmp ecx, 13
sete al
or r8d, eax
cmp ecx, 7
cmove r8d, r9d
cmp ecx, 2
sete dl
or r8d, edx
test ecx, ecx
cmove r8d, r9d
test r8d, r8d
setne al
ret 0
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
Quite smart! The first comparison with 24 does a rough range check - if the value is above 24 or below 0 it bails out; this matters because the instructions that follow, which operate on the magic number, require a shift/bit-test operand in the [0, 31] range.
For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.
>>> bin(20078725)
'0b1001100100110000010000101'
It's easy to spot that the 1st and 3rd bits (counting from 1 and from the right) are set, then the 8th, the 14th, the 15th, ...
MSVC checks it "directly" using the BT (bit test) instruction and branching; clang instead shifts the mask right by the appropriate amount (to get the relevant bit into the lowest position) and keeps just that bit by ANDing it with 1 (avoiding a branch).
The C code corresponding to the clang version would be something like:
bool MyFunction(MyEnum e) {
if(unsigned(e) > 24) return false;
return (20078725 >> e) & 1;
}
as for the MSVC version, it's more like
inline bool bit_test(unsigned val, int bit) {
return val & (1<<bit);
}
bool MyFunction(MyEnum e) {
if(unsigned(e) > 24) return false;
return bit_test(20078725, e);
}
(I kept the bit_test function separate to emphasize that it's actually a single instruction in assembly; the val & (1<<bit) expression has no direct correspondence in the original assembly.)
As for the bitwise-| version, it's quite bad - it uses a lot of CMOVs and ORs the results together, which is both longer code and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).
Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.
Ok, doing a quick test with all the versions against all compilers, we can see that:
the C translation of the clang output above results in pretty much the same code (equal to the clang output) in all compilers; similarly for the MSVC translation;
the bitwise-or version is the same as the logical-or version (= good) in both clang and gcc;
in general, gcc does essentially the same thing as clang, except for the switch case;
switch results are varied:
clang does best, generating the exact same code;
both gcc and MSVC generate jump-table-based code, which in this case is less good; however:
gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
MSVC instead emits a table of BYTEs, paying for it in setup-code size; I couldn't get gcc to emit similar code even when changing -O3 to -Os (optimize for size).
Ah, the old immediate bitmap trick.
GCC does this too, at least for a switch.
x86 asm casetable implementation. Unfortunately GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3 ; GCC8 and earlier do a better job.
Another example of using it, this time for code-golf (fewest bytes of code, in this case x86 machine code) to detect certain letters: User Appreciation Challenge #1: Dennis ♦
The basic idea is to use the input as an index into a bitmap of true/false results.
First you have to range-check, because the bitmap is fixed-width and x86 shifts wrap the shift count. We don't want high inputs to alias into the range where some values should return true. That's what the cmp edi, 24 / ja is doing.
(If the range between the lowest and highest true values was from 120 to 140, for example, it might start with a sub edi,120 to range-shift everything before the cmp.)
Then you use bitmap & (1<<e) (the bt instruction), or (bitmap >> e) & 1 (shr / and) to check the bit in the bitmap that tells you whether that e value should return true or false.
There are many ways to implement that check, logically equivalent but with performance differences.
If the range was wider than 32, it would have to use 64-bit operand-size. If it was wider than 64, the compiler might not attempt this optimization at all. Or might still do it for some of the conditions that are in a narrow range.
Using an even larger bitmap (in .rodata memory) would be possible but probably not something most compilers will invent for you. Either with bt [mem],reg (inefficient) or manually indexing a dword and checking that the same way this code checks the immediate bitmap. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless...
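A sketch of what such a memory-resident bitmap could look like (the table contents and names here are hypothetical, not compiler output):
#include <cstdint>
static const std::uint32_t bitmap_table[8] = { /* 256 bits of results */ };
bool in_big_set(unsigned v) {
    if (v >= 256) return false;  // range check, same as the immediate version
    return (bitmap_table[v >> 5] >> (v & 31)) & 1;  // pick dword, test bit
}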
Clang/LLVM has other tricks up its sleeve for efficiently comparing against multiple values (when it doesn't matter which one is hit), e.g. broadcast a value into a SIMD register and use a packed compare. That isn't dependent on the values being in a dense range. (Clang generates worse code for 7 comparisons than for 8 comparisons)
that optimizes the code way more than what I would think is possible.
These kinds of optimizations come from smart human compiler developers that notice common patterns in source code and think of clever ways to implement them. Then get compilers to recognize those patterns and transform their internal representation of the program logic to use the trick.
Turns out that switch and switch-like if() statements are common, and aggressive optimizations are common.
Compilers are far from perfect, but sometimes they do come close to living up to what people often claim; that compilers will optimize your code for you so you can write it in a human-readable way and still have it run near-optimally. This is sometimes true over the small scale.
Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.
The immediate bitmap is vastly more efficient. There's no data memory access in either one, so no cache-miss loads. The only "expensive" instruction is a variable-count shift (3 uops on mainstream Intel because of x86's annoying FLAGS-setting semantics; BMI2 shrx is only 1 uop and avoids having to mov the count to ecx.) https://agner.org/optimize. And see other performance analysis links in https://stackoverflow.com/tags/x86/info.
Each instruction in the cmp/cmov chain is at least 1 uop, and there's a pretty long dependency chain through each cmov because MSVC didn't bother to break it into 2 or more parallel chains. But regardless it's just a lot of uops, far more than the bitmap version, so worse for throughput (ability for out-of-order exec to overlap the work with surrounding code) as well as latency.
bt is also cheap: 1 uop on modern AMD and Intel. (bts, btr, btc are 2 on AMD, still 1 on Intel).
The branch in the immediate-bitmap version could have been a setna / and to make it branchless, but especially for this enum definition the compiler expected that it would be in range. It could have increased branch predictability by only requiring e <= 31, not e <= 24.
Since the enum only goes up to 29, and IIRC it's UB to have out-of-range enum values, it could actually optimize it away entirely.
Even if the e>24 branch doesn't predict very well, it's still probably better overall. Given current compilers, we only get a choice between the nasty chain of cmp/cmov or branch + bitmap. Unless we turn the asm logic back into C to hand-hold compilers into making the asm we want; then we can maybe get branchless with an AND or CMOV to make it always zero for out-of-range e.
But if we're lucky, profile-guided optimization might let some compilers make the bitmap range check branchless. (In asm the behaviour of shl reg, cl with cl > 31 or 63 is well-defined: on x86 it simply masks the count. In a C equivalent, you could use bitmap >> (e&31) which can still optimize to a shr; compilers know that x86 shr masks the count so they can optimize that away. But not for other ISAs that saturate the shift count...)
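A sketch of that hand-holding (the function name is mine; the & 31 lets the compiler drop the mask on x86, and the single & keeps the combine branchless):
bool my_function_branchless(unsigned e) {
    bool in_range = e <= 24;                  // range check, no branch needed
    bool bit = (20078725u >> (e & 31)) & 1;   // compiles to a plain shr on x86
    return in_range & bit;                    // & (not &&) avoids a branch on e
}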
There are lots of ways to implement the bitmap check that are pretty much equivalent. e.g. you could even use the CF output of shr, set according to the last bit shifted out. At least if you make sure CF has a known state ahead of time for the cl=0 case.
When you want an integer bool result, right-shifting seems to make more sense than bt / setcc, but with shr costing 3 uops on Intel it might actually be best to use bt reg,reg / setc al. Especially if you only need a bool, and can use EAX as your bitmap destination so the previous value of EAX is definitely ready before setcc. (Avoiding a false dependency on some unrelated earlier dep chain.)
BTW, MSVC has other silliness: as What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains, xor al,al is totally stupid compared to xor eax,eax when you want to zero AL. If you don't need to leave the upper bytes of RAX unmodified, zero the full register with a zeroing idiom.
And of course branching just to return 0 or return 1 makes little sense, unless you expect it to be very predictable and want to break the data dependency. I'd expect that setc al would make more sense, to read the CF result of bt.
I played on Godbolt to see how x86-64 gcc (6.3) compiles the following code:
typedef __int128_t int128_t;
typedef __uint128_t uint128_t;
uint128_t mul_to_128(uint64_t x, uint64_t y) {
return uint128_t(x)*uint128_t(y);
}
uint128_t mul(uint128_t x, uint128_t y) {
return x*y;
}
uint128_t div(uint128_t x, uint128_t y) {
return x/y;
}
and I got:
mul_to_128(unsigned long, unsigned long):
mov rax, rdi
mul rsi
ret
mul(unsigned __int128, unsigned __int128):
imul rsi, rdx
mov rax, rdi
imul rcx, rdi
mul rdx
add rcx, rsi
add rdx, rcx
ret
div(unsigned __int128, unsigned __int128):
sub rsp, 8
call __udivti3 //what is this???
add rsp, 8
ret
3 questions:
The 1st function (casting 64-bit uints to 128-bit, then multiplying them) is much simpler than the multiplication of two 128-bit uints (the 2nd function): basically just 1 multiplication. If you multiply 2 maximal 64-bit uints, the result definitely overflows a 64-bit register... How does it produce a 128-bit result with just 1 64-bit-by-64-bit multiplication?
I cannot read the second result really well... my guess is to break each 64-bit number into 2 32-bit numbers (say, hi as the higher 4 bytes and lo as the lower 4 bytes), and assemble the result like (hi1*hi2)<<64 + (hi1*lo2)<<32 + (hi2*lo1)<<32 + (lo1*lo2). Apparently I was wrong... because it uses only 3 multiplications (2 of them are even imul... signed multiplication? why?). Can anyone tell me what gcc is thinking? And is it optimal?
I cannot even understand the assembly of the division... push stack -> call something called __udivti3 -> pop stack... is __udivti3 something big (like a table look-up)? And what does gcc try to push before the call?
the godbolt link: https://godbolt.org/g/sIIaM3
You're right that multiplying two unsigned 64-bit values can produce a 128-bit result. Funny thing, hardware designers know that, too. <g> So multiplying two 64-bit values produces a 128-bit result by storing the lower half of the result in one 64-bit register and the upper half of the result in another 64-bit register. The compiler-writer knows which registers are used, and when you call mul_to_128 it will look for the results in the appropriate registers.
In the second example, think of the values as a1*2^64 + a0 and b1*2^64 + b0 (that is, split each 128-bit value into two parts, the upper 64 bits and the lower 64 bits). When you multiply those you get a1*b1*2^64*2^64 + a1*b0*2^64 + a0*b1*2^64 + a0*b0. That's essentially what the assembly code is doing. The parts of the result that overflow 128 bits are ignored.
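A sketch of that decomposition in code (the helper is mine, written with gcc's __uint128_t; the a1*b1 term is dropped because it only affects bits above 128):
#include <cstdint>
__uint128_t mul128_sketch(std::uint64_t a1, std::uint64_t a0,
                          std::uint64_t b1, std::uint64_t b0) {
    __uint128_t lo = (__uint128_t)a0 * b0;    // the one full widening mul
    std::uint64_t cross = a1 * b0 + a0 * b1;  // low halves only: the two imuls
    return lo + ((__uint128_t)cross << 64);   // fold into the upper 64 bits
}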
In the third example,__udivti3 is a function that does the division. It's not simple, so it doesn't get expanded inline.
The mul rsi will produce a 128 bit result in rdx:rax, as any instruction set reference will tell you.
The imul is used to get a 64-bit result. It works even for unsigned. Again, the instruction set reference says: "The two- and three-operand forms may also be used with unsigned operands because the lower half of the product is the same regardless if the operands are signed or unsigned." Other than that, yes, basically it's doing the double-width equivalent of what you described. Only 3 multiplies, because the result of the 4th would not fit in the 128-bit output anyway.
__udivti3 is just a helper function, you can look at its disassembly to see what it's doing.
I am reading through the new C++ FAQ, and I see that even when x == y for double x, y, it is possible for:
std::cos(x) == std::cos(y)
to evaluate to false. This is because the machine can have a processor which supports extended precision, such that one part of the == is a 64 bit number while the other one is an 80 bit number.
However, the next example seems to be incorrect:
void foo(double x, double y)
{
double cos_x = cos(x);
double cos_y = cos(y);
// the behavior might depend on what's in here
if (cos_x != cos_y) {
std::cout << "Huh?!?\n"; // You might end up here when x == y!!
}
}
As far as I read on en.cppreference.com here:
Cast and assignment strip away any extraneous range and precision:
this models the action of storing a value from an extended-precision
FPU register into a standard-sized memory location.
Hence, assigning:
double cos_x = cos(x);
double cos_y = cos(y);
should trim away any extra precision and make the program perfectly predictable.
So, who is right? The C++ FAQ, or the en.cppreference.com?
Neither the ISOCPP FAQ nor cppreference is an authoritative source. cppreference is essentially the equivalent of Wikipedia: it contains good information, but anybody can add anything without sources, so you have to read it with a grain of salt. Are the following three statements true?
Did you catch that? Your particular installation might store the result of one of the cos() calls out into RAM, truncating it in the process, then later compare that truncated value with the untruncated result of the second cos() call. Depending on lots of details, those two values might not be equal.
This is because the machine can have a processor which supports extended precision, such that one part of the == is a 64 bit number while the other one is an 80 bit number.
Cast and assignment strip away any extraneous range and precision: this models the action of storing a value from an extended-precision FPU register into a standard-sized memory location.
Maybe. It depends on the platform, your compiler, the compiler options, or any number of things. Realistically, the above statements only hold true for GCC in 32-bit mode, which defaults to the legacy x87 FPU (where everything is 80 bits internally) and rounds down to 64 bits when storing to memory. 64-bit mode uses SSE, so double temporaries are always 64-bit. You can force SSE with -mfpmath=sse and -msse2. Either way, the assembly might look something like this:
movsd QWORD PTR [rsp+8], xmm1
call cos
movsd xmm1, QWORD PTR [rsp+8]
movsd QWORD PTR [rsp], xmm0
movapd xmm0, xmm1
call cos
movsd xmm2, QWORD PTR [rsp]
ucomisd xmm2, xmm0
As you can see, there is no truncation of any sort, nor are 80-bit registers used here.
I wish that assignment or casting would toss away the extra precision. I have worked with "clever" compilers for which this does not happen. Many do, but if you want truly machine-independent code, you need to do extra work.
You can force the compiler to always use the value in memory with the expected precision by declaring the variable with the "volatile" keyword.
Some compilers pass parameters in registers rather than on the stack. For these compilers, x or y could carry unintended extra precision. To be totally safe, you could do
volatile double db_x = x, db_y = y;
volatile double cos_x = cos(db_x), cos_y = cos(db_y);
if (cos_x != cos_y)...
Assuming I have a: usize and a negative b: isize, how do I achieve the following semantics: reduce a by the absolute value of b, in the fastest manner possible?
I already thought of a - (b.abs() as usize), but I'm wondering if there is a faster way. Something with bit manipulation, perhaps?
Why do you assume this is slow? If that code is put in a function and compiled, on x86-64 linux, it generates the following:
_ZN6simple20h0f921f89f1d823aeeaaE:
mov rax, rsi
neg rax
cmovl rax, rsi
sub rdi, rax
mov rax, rdi
ret
That's assuming it doesn't get inlined... which I had to work at for a few minutes to prevent the optimiser from doing, in order to get the above.
That's not to say it definitely couldn't be done faster, but I'm unconvinced it could be done faster by much.
If b is guaranteed to be negative, then you can just do a + b.
In Rust, we must first cast one of the operands to the same type as the other, then we must use wrapping_add instead of plain operator +, as debug builds panic on overflow (an overflow occurs when using + on usize because negative numbers become very large positive numbers after the cast).
fn main() {
let a: usize = 5;
let b: isize = -2;
let c: usize = a.wrapping_add(b as usize);
println!("{}", c); // prints 3
}
With optimizations, wrapping_add compiles to a single add instruction.
I have come across this CRC32 code and was curious why the author would choose to use
crc = crc ^ ~0U;
instead of
crc = ~crc;
As far as I can tell, they are equivalent.
I have even disassembled the two versions in Visual Studio 2010.
Not optimized build:
crc = crc ^ ~0U;
009D13F4 mov eax,dword ptr [crc]
009D13F7 xor eax,0FFFFFFFFh
009D13FA mov dword ptr [crc],eax
crc = ~crc;
011C13F4 mov eax,dword ptr [crc]
011C13F7 not eax
011C13F9 mov dword ptr [crc],eax
I also cannot justify the code by thinking about the number of cycles that each instruction takes since both should be taking 1 cycle to complete. In fact, the xor might have a penalty by having to load the literal from somewhere, though I am not certain of this.
So I'm left thinking that it is possibly just a preferred way to describe the algorithm, rather than an optimization... Would that be correct?
Edit 1:
Since I just realized that the type of the crc variable is probably important to mention I am including the whole code (less the lookup table, way too big) here so you don't have to follow the link.
uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
const uint8_t *p;
p = buf;
crc = crc ^ ~0U;
while (size--)
{
crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
}
return crc ^ ~0U;
}
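A quick usage sketch (the buffer and names here are mine, not from the original source); passing 0 as the initial crc matches the calls in the optimized build below:
const uint8_t data[] = { 'a', 'b', 'c' };
uint32_t c = crc32(0, data, sizeof data);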
Edit 2:
Since someone has brought up the fact that an optimized build would be of interest, I have made one and included it below.
Optimized build:
Do note that the whole function (included in the previous edit) was inlined.
// crc = crc ^ ~0U;
zeroCrc = 0;
zeroCrc = crc32(zeroCrc, zeroBufferSmall, sizeof(zeroBufferSmall));
00971148 mov ecx,14h
0097114D lea edx,[ebp-40h]
00971150 or eax,0FFFFFFFFh
00971153 movzx esi,byte ptr [edx]
00971156 xor esi,eax
00971158 and esi,0FFh
0097115E shr eax,8
00971161 xor eax,dword ptr ___defaultmatherr+4 (973018h)[esi*4]
00971168 add edx,ebx
0097116A sub ecx,ebx
0097116C jne main+153h (971153h)
0097116E not eax
00971170 mov ebx,eax
// crc = ~crc;
zeroCrc = 0;
zeroCrc = crc32(zeroCrc, zeroBufferSmall, sizeof(zeroBufferSmall));
01251148 mov ecx,14h
0125114D lea edx,[ebp-40h]
01251150 or eax,0FFFFFFFFh
01251153 movzx esi,byte ptr [edx]
01251156 xor esi,eax
01251158 and esi,0FFh
0125115E shr eax,8
01251161 xor eax,dword ptr ___defaultmatherr+4 (1253018h)[esi*4]
01251168 add edx,ebx
0125116A sub ecx,ebx
0125116C jne main+153h (1251153h)
0125116E not eax
01251170 mov ebx,eax
Something nobody's mentioned yet: if this code is being compiled on a machine with a 16-bit unsigned int, then these two code snippets are different.
crc is specified as a 32-bit unsigned integral type. ~crc will invert all 32 bits, but if unsigned int is 16-bit, then crc = crc ^ ~0U will only invert the lower 16 bits.
I don't know enough about the CRC algorithm to know whether this is intentional or a bug, perhaps hivert can clarify; although looking at the sample code posted by OP, it certainly does make a difference to the loop that follows.
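A small sketch of the difference, assuming a hypothetical platform where unsigned int is 16 bits wide:
uint32_t crc = 0x12345678;
uint32_t a = crc ^ ~0U;  // ~0U is 0xFFFF here: a == 0x1234A987, low 16 bits flipped
uint32_t b = ~crc;       // operates on the full uint32_t: b == 0xEDCBA987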
NB. Sorry for posting this as an "answer" because it isn't an answer, but it's too big to just fit in a comment :)
The short answer is: because it allows a uniform algorithm for all CRCs.
The reason is the following: there are a lot of variants of CRC. Each one depends on a polynomial over Z/2Z which is used for a Euclidean division. Usually it is implemented using the algorithm described in this paper by Aram Perez. Now, depending on the polynomial you are using, there is a final XOR at the end of the algorithm, whose goal is to eliminate some corner cases. It happens that for CRC32 this is the same as a global NOT, but this is not true for all CRCs. As evidence, on this web page you can read (emphasis mine):
Consider a message that begins with some number of zero bits. The remainder will never contain anything other than zero until the first one in the message is shifted into it. That's a dangerous situation, since packets beginning with one or more zeros may be completely legitimate and a dropped or added zero would not be noticed by the CRC. (In some applications, even a packet of all zeros may be legitimate!) The simple way to eliminate this weakness is to start with a nonzero remainder. The parameter called initial remainder tells you what value to use for a particular CRC standard. And only one small change is required to the crcSlow() and crcFast() functions:
crc remainder = INITIAL_REMAINDER;
The final XOR value exists for a similar reason. To implement this capability, simply change the value that's returned by crcSlow() and crcFast() as follows:
return (remainder ^ FINAL_XOR_VALUE);
If the final XOR value consists of all ones (as it does in the CRC-32 standard), this extra step will have the same effect as complementing the final remainder. However, implementing it this way allows any possible value to be used in your specific application.
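A sketch of that parameterized form applied to the crc32() above (the parameter names follow the quoted article; with both values all-ones this reduces to the original function):
#include <stddef.h>
#include <stdint.h>
extern const uint32_t crc32_tab[256]; // the lookup table omitted above
uint32_t crc32_general(uint32_t initial_remainder, uint32_t final_xor_value,
                       const void *buf, size_t size)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = initial_remainder;
    while (size--)
        crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ final_xor_value;
}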
Just to add my own guess to the mix: x ^ 0x0001 flips the last bit and keeps the others; to turn off the last bit use x & 0xFFFE or x & ~0x0001; to turn on the last bit unconditionally use x | 0x0001. I.e., if you are doing lots of bit-twiddling, your fingers probably know those idioms and just roll them out without much thinking.
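Spelled out as code, those finger idioms look like this (a trivial sketch):
unsigned flip_low (unsigned x) { return x ^ 0x0001; }  // toggle the last bit
unsigned clear_low(unsigned x) { return x & ~0x0001; } // force it off
unsigned set_low  (unsigned x) { return x | 0x0001; }  // force it on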
I doubt there's any deep reason. Maybe that's how the author thought about it ("I'll just xor with all ones"), or perhaps how it was expressed in the algorithm definition.
I think it is for the same reason that some write
const int zero = 0;
and others write
const int zero = 0x00000000;
Different people think different ways. Even about a fundamental operation.