In special cases: Is & faster than %?


I saw the chosen answer to this post.
I was surprised that (x & 255) == (x % 256) if x is an unsigned integer. I wondered whether it makes sense to always replace % with & (that is, x & (n - 1)) in x % n for n = 2^a (a = [1, ...]) and x being a positive integer.
This is a special case in which I as a human can decide, because I know which values the program will deal with and the compiler does not. Can I gain a significant performance boost if my program uses a lot of modulo operations?
Sure, I could just compile and look at the disassembly. But this would only answer my question for one compiler/architecture. I would like to know if this is in principle faster.

If your integral type is unsigned, the compiler will optimize it, and the result will be the same. If it's signed, something is different...
This program:
int mod_signed(int i) {
return i % 256;
}
int and_signed(int i) {
return i & 255;
}
unsigned mod_unsigned(unsigned int i) {
return i % 256;
}
unsigned and_unsigned(unsigned int i) {
return i & 255;
}
will be compiled (by GCC 6.2 with -O3; Clang 3.9 produces very similar code) into:
mod_signed(int):
mov edx, edi
sar edx, 31
shr edx, 24
lea eax, [rdi+rdx]
movzx eax, al
sub eax, edx
ret
and_signed(int):
movzx eax, dil
ret
mod_unsigned(unsigned int):
movzx eax, dil
ret
and_unsigned(unsigned int):
movzx eax, dil
ret
The resulting assembly of mod_signed is different because
If both operands to a multiplication, division, or modulus expression have the same sign, the result is positive. Otherwise, the result is negative. The result of a modulus operation's sign is implementation-defined.
and AFAICT, most implementations decided that the sign of the result of a modulus expression is always the same as the sign of the first operand. See this documentation.
Hence, mod_signed is optimized to (from nwellnhof's comment):
int d = i < 0 ? 255 : 0;
return ((i + d) & 255) - d;
Logically, we can prove that i % 256 == i & 255 for all unsigned integers, hence, we can trust the compiler to do its job.
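As a hedged illustration of the difference discussed above, here is a minimal C++ sketch (C++14 or later assumed; the function names are made up for this example) that spells out the bias-and-mask sequence for signed values and checks a few cases at compile time:

```cpp
// Signed remainder by 256, written the way the optimized assembly computes it:
// negative inputs are biased by 255 before masking and un-biased afterwards.
// (Hypothetical helper name, for illustration only.)
constexpr int mod256_signed(int i) {
    const int d = i < 0 ? 255 : 0;
    return ((i + d) & 255) - d;
}

// For unsigned values the plain mask already equals the remainder.
constexpr unsigned mod256_unsigned(unsigned i) {
    return i & 255u;
}

static_assert(mod256_signed(-1) == -1 % 256, "remainder keeps the dividend's sign");
static_assert(mod256_signed(-257) == -257 % 256, "bias-and-mask matches % for negatives");
static_assert(mod256_signed(300) == 300 % 256, "non-negative inputs reduce to a mask");
static_assert(mod256_unsigned(1000u) == 1000u % 256u, "unsigned: mask equals remainder");
static_assert((-1 & 255) != (-1 % 256), "a bare mask differs from % for negative ints");
```

The last assertion is the reason the compiler cannot turn the signed version into a single mask: for negative inputs the two expressions give different values.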

I did some measurements with gcc, and
if the argument of a / or % is a compile-time constant that's a power of 2, gcc can turn it into the corresponding bit operation.
Here are some of my benchmarks for divisions (What has a better performance: multiplication or division?), and as you can see, the running times with divisors that are statically known powers of two are noticeably lower than with other statically known divisors.
So if / and % with statically known power-of-two arguments describe your algorithm better than bit ops, feel free to prefer / and %.
You shouldn't lose any performance with a decent compiler.

Related

Why does static_cast conversion speed up an un-optimized build of my integer division function?

... or rather, why does not static_cast-ing slow down my function?
Consider the function below, which performs integer division:
int Divide(int x, int y) {
int ret = 0, i = 32;
long j = static_cast<long>(y) << i;
while (x >= y) {
while (x < j) --i, j >>= 1;
ret += 1 << i, x -= j;
}
return ret;
}
This performs reasonably well, as one might expect.
However, if we remove the static_cast on line 3, like so:
int Divide(int x, int y) {
int ret = 0, i = 32;
long j = y << i;
while (x >= y) {
while (x < j) --i, j >>= 1;
ret += 1 << i, x -= j;
}
return ret;
}
This version performs noticeably slower, sometimes several hundred times slower (I haven't measured rigorously, but it shouldn't be far off) for pathological inputs where x is large and y is small. I was curious and wanted to look into why, and tried digging into the assembly code. However, apart from the casting differences on line 3, I get the exact same output. Here's the line 3 output for reference (source):
With static_cast:
movsxd rax, dword ptr [rbp - 8]
mov ecx, dword ptr [rbp - 16]
shl rax, cl
mov qword ptr [rbp - 24], rax
Without static_cast:
mov eax, dword ptr [rbp - 8]
mov ecx, dword ptr [rbp - 16]
shl eax, cl
cdqe
mov qword ptr [rbp - 24], rax
The rest is identical.
I'm really curious where the overhead is occurring.
EDIT: I've tested a bit further, and it looks like the while loop is where most of the time is spent, not when y is initialized. The additional cdqe instruction doesn't seem to be significant enough to warrant the total increase in wall time.
Some disclaimers, since I've been getting a lot of comments peripheral to the actual question:
I'm aware that shifting an int further than 32 bits is UB.
I'm assuming only positive inputs.
long is 8 bytes long on my platform, so it doesn't overflow.
I'd like to know what might be causing the increased runtime, which the comments criticizing the above don't actually address.

Widening after the shift reduces your loop to naive repeated subtraction
It's not the run-time of cdqe or movsxd vs. mov that's relevant, it's the different starting values for your loop, resulting in a different iteration count, especially for pathological cases.
Clang without optimization compiled your source exactly the way it was written, doing the shift on an int and then sign-extending the result to long. The shift-count UB is invisible to the compiler with optimization disabled because, for consistent debugging, it assumes variable values can change between statements, so the behaviour depends on what the target machine does with a shift-count by the operand-size.
When compiling for x86-64, that results in long j = (long)(y<<0);, i.e. long j = y;, rather than having those bits at the top of a 64-bit value.
x86 scalar shifts like shl eax, cl mask the count with &31 (except with 64-bit operand size) so the shift used a count of 32 % 32 == 0. AArch64 would I think saturate the shift count, i.e. let you shift out all the bits.
Notice that it does a 32-bit operand-size shl eax, cl and then sign-extends the result with cdqe, instead of doing a sign-extending reload of y and then a 64-bit operand-size shl rax,cl.
Your loop has a data-dependent iteration count
If you single-step with a debugger, you could see the local variable values accurately. (That's the main benefit of an un-optimized debug build, which is not what you should be benchmarking.) And you can count iterations.
while (x >= y) {
while (x < j) --i, j >>= 1;
ret += 1 << i, x -= j;
}
With j = y, if we enter the outer loop at all, then the inner loop condition is always false.
So it never runs, j stays constant the whole time, and i stays constant at 32.
1<<32 again compiles to a variable-count shift with 32-bit operand-size, because 1 has type int. (1LL has type long long, and can safely be left-shifted by 32). On x86-64, this is just a slow way to do ret += 1;.
x -= j; is of course just x -= y;, so we're counting how many subtractions to make x < y.
It's well-known that division by repeated subtraction is extremely slow for large quotients, since the run time scales linearly with the quotient.
You do happen to get the right result, though. Yay.
BTW, long is only 32-bit on some targets like Windows x64 and 32-bit platforms; use long long or int64_t if you want a type twice the width of int. And maybe static_assert to make sure int isn't that wide.
With optimization enabled, I think the same things would still hold true: clang looks like it's compiling to similar asm just without the store/reload. So it's effectively / de-facto defining the behaviour of 1<<32 to just compile to an x86 shift instruction.
But I didn't test, that's just from a quick look at the asm https://godbolt.org/z/M33vqGj5P and noting things like mov r8d, 1 ; shl r8d, cl (32-bit operand-size) ; add eax, r8d
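To make the suggestions above concrete, here is a hedged sketch of the same shift-and-subtract loop with the widening done before the shift and a fixed-width 64-bit type instead of long. The name divide_fixed and the assertions are additions for this example, and it keeps the question's assumption of non-negative x and positive y:

```cpp
#include <cassert>
#include <cstdint>

// Shift-and-subtract division, assuming x >= 0 and y > 0 as in the question.
// (divide_fixed is a hypothetical name for this sketch.)
int divide_fixed(int x, int y) {
    static_assert(sizeof(int) <= 4, "sketch assumes int is at most 32 bits wide");
    assert(x >= 0 && y > 0);

    int ret = 0;
    int i = 32;
    // Widen BEFORE shifting so the shift is performed on a 64-bit value.
    std::int64_t j = static_cast<std::int64_t>(y) << i;
    while (x >= y) {
        // j starts above any int value, so this loop always lowers i to 30
        // or less before the first "1 << i" below is evaluated.
        while (x < j) {
            --i;
            j >>= 1;
        }
        ret += 1 << i;
        x -= static_cast<int>(j);  // j <= x here, so it fits in an int
    }
    return ret;
}
```

With this version the inner loop actually runs, so the outer loop executes at most once per set bit of the quotient instead of once per unit of the quotient.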

int Divide(int x, int y) {
int ret = 0, i = 32;
long j = y << i;
On most systems, the size of int is 32 bits or less. Left-shifting a signed integer by equal or higher number of bits as its size results in undefined behaviour. Don't do this. Since the program is broken, it's irrelevant whether it's slower or faster.
Sidenote: Left shifting a signed 32 bit integer by 31 or fewer bits may also be undefined if that shift causes the left most bit to change due to arithmetic overflow.

I'm keeping this answer up for now as the comments are useful.

Why does division by 3 require a rightshift (and other oddities) on x86?

I have the following C/C++ function:
unsigned div3(unsigned x) {
return x / 3;
}
When compiled using clang 10 at -O3, this results in:
div3(unsigned int):
mov ecx, edi # tmp = x
mov eax, 2863311531 # result = 3^-1
imul rax, rcx # result *= tmp
shr rax, 33 # result >>= 33
ret
What I do understand is: division by 3 is equivalent to multiplying with the multiplicative inverse 3^-1 mod 2^32, which is 2863311531.
There are some things that I don't understand though:
Why do we need to use ecx/rcx at all? Can't we multiply rax with edi directly?
Why do we multiply in 64-bit mode? Wouldn't it be faster to multiply eax and ecx?
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
Edit 1
For those who don't understand what I mean by 3^-1 mod 2^32, I am talking about the multiplicative inverse here.
For example:
// multiplying with inverse of 3:
15 * 2863311531 = 42949672965
42949672965 mod 2^32 = 5
// using fixed-point multiplication
15 * 2863311531 = 42949672965
42949672965 >> 33 = 5
// simply dividing by 3
15 / 3 = 5
So multiplying with 2863311531 is actually equivalent to dividing by 3. I assumed clang's optimization is based on modular arithmetic, when it's really based on fixed point arithmetic.
Edit 2
I have now realized that the multiplicative inverse can only be used for divisions without a remainder. For example, multiplying 1 times 3^-1 is equal to 3^-1, not zero. Only fixed point arithmetic has correct rounding.
Unfortunately, clang does not make any use of modular arithmetic which would just be a single imul instruction in this case, even when it could. The following function has the same compile output as above.
unsigned div3(unsigned x) {
__builtin_assume(x % 3 == 0);
return x / 3;
}
(Canonical Q&A about fixed-point multiplicative inverses for exact division that work for every possible input: Why does GCC use multiplication by a strange number in implementing integer division? - not quite a duplicate because it only covers the math, not some of the implementation details like register width and imul vs. mul.)

Can't we multiply rax with edi directly?
We can't imul rax, rdi because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.
(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).
Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
Why do we multiply in 64-bit mode?
(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)
You could mul edi to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul. (Although imul r64, r64 is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF)
(Fun fact: mul rdi is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)
Also the part you want is in EDX so you'd need another mov eax, edx to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
GCC 8.3 and earlier did use 32-bit mul instead of 64-bit imul (https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov instruction copying EDI to EAX, making this way look even worse :/
# gcc8.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop, stupid wasted instruction
mov edx, -1431655765 # 1 uop (same 32-bit constant, just printed differently)
mul edx # 3 uops on Sandybridge-family
mov eax, edx # 1 uop
shr eax # 1 uop
ret
# total of 7 uops on SnB-family
Would only be 6 uops with mov eax, 0xAAAAAAAB / mul edi, but still worse than:
# gcc9.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop
mov edi, 2863311531 # 1 uop
imul rax, rdi # 1 uop
shr rax, 33 # 1 uop
ret
# total 4 uops, not counting ret
Unfortunately, 64-bit 0x00000000AAAAAAAB can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg doesn't produce the high half. Only the one-operand forms of mul and imul are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.
Only imul has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r whose only difference from imul r,r is the FLAGS output.
Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
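A small, hedged C++ sketch of that point: the low 32 bits of a product are the same whether the operands are interpreted as signed or unsigned. The helper names are invented for this example; C++14 or later is assumed, and the unsigned-to-signed casts rely on the usual two's-complement wraparound (guaranteed since C++20):

```cpp
#include <cstdint>

// Low 32 bits of the product when both inputs are treated as unsigned.
constexpr std::uint32_t low_half_unsigned(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>(std::uint64_t{a} * std::uint64_t{b});
}

// Low 32 bits of the product when the same bit patterns are treated as signed.
constexpr std::uint32_t low_half_signed(std::uint32_t a, std::uint32_t b) {
    const std::int64_t sa = static_cast<std::int32_t>(a);
    const std::int64_t sb = static_cast<std::int32_t>(b);
    return static_cast<std::uint32_t>(sa * sb);
}

static_assert(low_half_unsigned(0xFFFFFFFFu, 3u) == low_half_signed(0xFFFFFFFFu, 3u),
              "low halves agree even when one operand is 'negative'");
static_assert(low_half_unsigned(0xAAAAAAABu, 15u) == low_half_signed(0xAAAAAAABu, 15u),
              "low halves agree for the div-by-3 constant too");
```

Only the upper half of the full product depends on signedness, which is why a single low-half multiply instruction can serve both cases.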
What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
No, there's no multiplier constant that would give the exact right answer for every possible input x if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x other than full range of unsigned, compilers don't have that option. (-ffast-math only applies to floating point; if you want faster approximations for integer math, code them manually like below):
See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed
// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
int64_t invDivisor = 0x1999999A;
return (int32_t) ((invDivisor * dividend) >> 32);
}
Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183 when 1073741829/10 is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.
As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab = 0xaaaaaab0 when truncated to 32-bit, not 5.
unsigned div3_exact_only(unsigned x) {
__builtin_assume(x % 3 == 0); // or an equivalent with if() __builtin_unreachable()
return x / 3;
}
Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.
div3_exact_only:
imul eax, edi, 0xAAAAAAAB # 1 uop, 3c latency
ret
However, it is something you can do yourself in source code, at least for known type widths like uint32_t:
uint32_t div3_exact_only(uint32_t x) {
return x * 0xaaaaaaabU;
}

What's up with the 33-bit right shift at the end? I thought we can just drop the highest 32-bits.
Instead of 3^(-1) mod 2^32, you have to think more about 0.3333333, where the 0 before the . is located in the upper 32 bits and the 3333 is located in the lower 32 bits.
This fixed point operation works fine, but the result is obviously shifted to the upper part of rax; therefore the CPU must shift the result down again after the operation.
Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
There is no MUL instruction equivalent to the IMUL instruction. The IMUL variant that is used takes two registers:
a <= a * b
There is no MUL instruction that does that. MUL instructions are more expensive because they store the result as 128 Bit in two registers.
Of course you could use the legacy instructions, but this does not change the fact that the result is stored in two registers.

If you look at my answer to the prior question:
Why does GCC use multiplication by a strange number in implementing integer division?
It contains a link to a pdf article that explains this (my answer clarifies the stuff that isn't explained well in this pdf article):
https://gmplib.org/~tege/divcnst-pldi94.pdf
Note that one extra bit of precision is needed for some divisors, such as 7: the multiplier would normally require 33 bits, and the product would normally require 65 bits, but this can be avoided by handling the 2^32 bit separately with 3 additional instructions as shown in my prior answer and below.
Take a look at the generated code if you change to
unsigned div7(unsigned x) {
return x / 7;
}
So to explain the process, let L = ceil(log2(divisor)). For the question above, L = ceil(log2(3)) == 2. The right shift count would initially be 32+L = 34.
To generate a multiplier with a sufficient number of bits, two potential multipliers are generated: mhi will be the multiplier to be used, and the shift count will be 32+L.
mhi = (2^(32+L) + 2^(L))/3 = 5726623062
mlo = (2^(32+L) )/3 = 5726623061
Then a check is made to see if the number of required bits can be reduced:
while((L > 0) && ((mhi>>1) > (mlo>>1))){
mhi = mhi>>1;
mlo = mlo>>1;
L = L-1;
}
if(mhi >= 2^32){
mhi = mhi-2^32
L = L-1;
; use 3 additional instructions for missing 2^32 bit
}
... mhi>>1 = 5726623062>>1 = 2863311531
... mlo>>1 = 5726623061>>1 = 2863311530 (mhi>>1) > (mlo>>1)
... mhi = mhi>>1 = 2863311531
... mlo = mlo>>1 = 2863311530
... L = L-1 = 1
... the next loop exits since now (mhi>>1) == (mlo>>1)
So the multiplier is mhi = 2863311531 and the shift count = 32+L = 33.
On a modern x86, multiply and shift instructions are constant time, so there's no point in reducing the multiplier (mhi) to less than 32 bits, so that while(...) above is changed to an if(...).
In the case of 7, the loop exits on the first iteration, and requires 3 extra instructions to handle the 2^32 bit, so that mhi is <= 32 bits:
L = ceil(log2(7)) = 3
mhi = (2^(32+L) + 2^(L))/7 = 4908534053
mhi = mhi-2^32 = 613566757
Let ecx = dividend, the simple approach could overflow on the add:
mov eax, 613566757 ; eax = mhi
mul ecx ; edx:eax = ecx*mhi
add edx, ecx ; edx:eax = ecx*(mhi + 2^32), potential overflow
shr edx, 3
To avoid the potential overflow, note that eax = eax*2 - eax:
(ecx*eax) = ((ecx*eax)<<1) - (ecx*eax)
(ecx*(eax+2^32)) = ((ecx*eax)<<1) + (ecx*2^32) - (ecx*eax)
(ecx*(eax+2^32))>>3 = (((ecx*eax)<<1) + (ecx*2^32) - (ecx*eax))>>3
= (((ecx*eax) )+(((ecx*2^32)-(ecx*eax))>>1))>>2
so the actual code, using u32() to mean upper 32 bits:
... visual studio generated code for div7, dividend is ecx
mov eax, 613566757
mul ecx ; edx = u32( (ecx*eax) )
sub ecx, edx ; ecx = u32( ((ecx*2^32)-(ecx*eax)) )
shr ecx, 1 ; ecx = u32( (((ecx*2^32)-(ecx*eax))>>1) )
lea eax, DWORD PTR [edx+ecx] ; eax = u32( (ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1) )
shr eax, 2 ; eax = u32(((ecx*eax)+(((ecx*2^32)-(ecx*eax))>>1))>>2)
If a remainder is wanted, then the following steps can be used:
mhi and L are generated based on divisor during compile time
...
quotient = (x*mhi)>>(32+L)
product = quotient*divisor
remainder = x - product
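The multiplier search and the overflow-safe fix-up described above can be sketched in C++ as follows. This is a hedged illustration and not the code of any particular compiler: the names Magic, choose_multiplier and divide_by_constant are made up, it keeps the while(...) form of the reduction loop, and it assumes 32-bit unsigned dividends with divisors from 2 to 2^31 so that all intermediates fit in 64 bits:

```cpp
#include <cassert>
#include <cstdint>

struct Magic {
    std::uint32_t multiplier;  // low 32 bits of mhi
    int shift;                 // shift applied after taking the upper 32 bits
    bool add_fixup;            // true when the real multiplier is multiplier + 2^32
};

// Smallest L such that 2^L >= d.
int ceil_log2(std::uint32_t d) {
    int l = 0;
    while ((std::uint64_t{1} << l) < d) ++l;
    return l;
}

Magic choose_multiplier(std::uint32_t d) {
    assert(d >= 2 && d <= 0x80000000u);  // restriction of this sketch
    int L = ceil_log2(d);
    const std::uint64_t two_32_L = std::uint64_t{1} << (32 + L);
    std::uint64_t mlo = two_32_L / d;
    std::uint64_t mhi = (two_32_L + (std::uint64_t{1} << L)) / d;
    // Drop low bits of the multiplier while the two candidates still differ.
    while (L > 0 && (mhi >> 1) > (mlo >> 1)) {
        mhi >>= 1;
        mlo >>= 1;
        --L;
    }
    const bool fixup = mhi >= (std::uint64_t{1} << 32);
    if (fixup) {
        mhi -= std::uint64_t{1} << 32;  // the 2^32 bit is handled separately
        --L;
    }
    return {static_cast<std::uint32_t>(mhi), L, fixup};
}

std::uint32_t divide_by_constant(std::uint32_t x, const Magic& m) {
    // Upper 32 bits of the 32x32 -> 64-bit product.
    std::uint32_t hi = static_cast<std::uint32_t>((std::uint64_t{x} * m.multiplier) >> 32);
    if (m.add_fixup) {
        hi += (x - hi) >> 1;  // overflow-safe form of (x + hi) >> 1
    }
    return hi >> m.shift;
}
```

For a divisor of 3 this yields the multiplier 2863311531 with a total shift of 33, and for 7 it yields 613566757 with the fix-up path and a final shift of 2, matching the values worked out above. A remainder can then be formed as x - divide_by_constant(x, m) * d.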

x/3 is approximately (x * (2^32/3)) / 2^32. So we can perform a single 32x32->64 bit multiplication, take the higher 32 bits, and get approximately x/3.
There is some error because we cannot multiply exactly by 2^32/3, only by this number rounded to an integer. We get more precision using x/3 ≈ (x * (2^33/3)) / 2^33. (We can't use 2^34/3 because that is > 2^32). And that turns out to be good enough to get x/3 in all cases exactly. You would prove this by checking that the formula gives a result of k if the input is 3k or 3k+2.
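As a minimal sketch of that formula (C++11 or later assumed; the function name is invented for this example), the constant 0xAAAAAAAB is 2^33/3 rounded up to an integer, and the 64-bit product is shifted right by 33:

```cpp
#include <cstdint>

// x / 3 via fixed-point multiplication: 0xAAAAAAAB is 2^33 / 3 rounded up.
constexpr std::uint32_t div3_fixed_point(std::uint32_t x) {
    return static_cast<std::uint32_t>((std::uint64_t{x} * 0xAAAAAAABull) >> 33);
}

static_assert(div3_fixed_point(15u) == 5u, "exact multiple");
static_assert(div3_fixed_point(16u) == 5u, "remainder is discarded");
static_assert(div3_fixed_point(0xFFFFFFFFu) == 0xFFFFFFFFu / 3u, "largest input");
```

This is the same computation as the imul / shr rax, 33 pair in the compiler output from the question, just written at source level.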

Saturate short (int16) in C++

I am optimizing bottleneck code:
int sum = ........
sum = (sum >> _bitShift);
if (sum > 32000)
sum = 32000; //if we get an overflow, saturate output
else if (sum < -32000)
sum = -32000; //if we get an underflow, saturate output
short result = static_cast<short>(sum);
I would like to write the saturation condition as one "if condition" or even better with no "if condition" to make this code faster. I don't need saturation exactly at value 32000, any similar value like 32768 is acceptable.
According this page, there is a saturation instruction in ARM. Is there anything similar in x86/x64?

I'm not at all convinced that attempting to eliminate the if statement(s) is likely to do any real good. A quick check indicates that given this code:
int clamp(int x) {
if (x < -32768)
x = -32768;
else if (x > 32767)
x = 32767;
return x;
}
...both gcc and Clang produce branch-free results like this:
clamp(int):
cmp edi, 32767
mov eax, 32767
cmovg edi, eax
mov eax, -32768
cmp edi, -32768
cmovge eax, edi
ret
You can do something like x = std::min(std::max(x, -32768), 32767);, but this produces the same sequence, and the source seems less readable, at least to me.
You can do considerably better than this if you use Intel's vector instructions, but probably only if you're willing to put a fair amount of work into it--in particular, you'll probably need to operate on an entire (small) vector of values simultaneously to accomplish much this way. If you do go that way, you usually want to take a somewhat different approach to the task than you seem to be taking right now. Right now, you're apparently depending on int being a 32-bit type, so you're doing the arithmetic on a 32-bit type, then afterwards truncating it back down to a (saturated) 16-bit value.
With something like AVX, you'd typically want to use an instruction like _mm256_adds_epi16 to take a vector of 16 values (of 16-bits apiece), and do a saturating addition on all of them at once (or, likewise, _mm256_subs_epi16 to do saturating subtraction).
Since you're writing C++, what I've given above are the names of the compiler intrinsics used in most current compilers (gcc, icc, clang, msvc) for x86 processors. If you're writing assembly language directly, the instructions would be vpaddsw and vpsubsw respectively.
If you can count on a really current processor (one that supports AVX 512 instructions) you can use them instead to operate on a vector of 32 16-bit values simultaneously.
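For the scalar case, a hedged sketch of the clamp wrapped in a small helper might look like this. The name saturate_to_int16 is made up for this example, and it uses the std::min/std::max form mentioned above with the full int16_t range as its bounds:

```cpp
#include <algorithm>
#include <cstdint>

// Clamp a wider intermediate value to the int16_t range, then narrow it.
// (saturate_to_int16 is a hypothetical name for this sketch.)
inline std::int16_t saturate_to_int16(int sum) {
    const int clamped = std::min(std::max(sum, -32768), 32767);
    return static_cast<std::int16_t>(clamped);
}
```

Whether this ends up as conditional moves or branches is up to the compiler and its flags, so it is worth checking the generated code for the target you care about.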

Are you sure you can beat the compiler at this?
Here's x64 retail with max size optimizations enabled. Visual Studio v15.7.5.
ecx contains the initial value at the start of this block. eax is filled with the saturated value when it is done.
return (x > 32767) ? 32767 : ((x < -32768) ? -32768 : x);
mov edx,0FFFF8000h
movzx eax,cx
cmp ecx,edx
cmovl eax,edx
mov edx,7FFFh
cmp ecx,edx
movzx eax,ax
cmovg eax,edx

Convert flag into either 0xFF or 0, based on whether flag equals 1 or 0

I have a binary flag f, equal to either zero or one.
If equal to one, I would like to convert to 0xFF, otherwise, to 0.
Current solution is f*0xFF, but I would rather use bit twiddling to achieve this.

How about just:
(unsigned char)-f
or alternately:
0xFF & -f
If f is already a char, then you just need -f.
This approach works because -0 == 0 and -1 == 0xFFFFF..., so the negation gets you what you want directly, perhaps with some extra high bits set if f is larger than a char (you didn't say).
Remember though that compilers are smart. I tried all of the following solutions, and all compiled down to 3 instructions or less, and none had a branch (even the solution with a conditional):
Conditional
int remap_cond(int f) {
return f ? 0xFF : 0;
}
Compiles to:
remap_cond:
test edi, edi
mov eax, 255
cmove eax, edi
ret
So even the "obvious" conditional works well, in three instructions and a latency of 2 or 3 cycles on most modern x86 hardware, depending on cmov performance.
Multiplication
Your original solution of:
int remap_mul(int f) {
return f * 0xFF;
}
Actually compiles into nice code that avoids the multiplication entirely, replacing it with a shift and subtract:
remap_mul:
mov eax, edi
sal eax, 8
sub eax, edi
ret
This will generally take two cycles on machines with mov-elimination, and the mov would often be removed by inlining anyway.
Subtraction
As corn3lius pointed out, you can do some subtraction from 0x100 and a mask, like so:
int remap_shift_sub(int f) {
return 0xFF & (0x100 - f);
}
This compiles to (see footnote 1):
remap_shift_sub:
neg edi
movzx eax, dil
ret
So that's the best so far I think - a latency of 2 cycles on most hosts, and the movzx can often be eliminated by inlining (see footnote 2) - e.g., since it could use the 8-bit register in a subsequent consuming instruction.
Note that the compiler has smartly eliminated both the masking operation (you could perhaps argue the movzx accounts for it), and the use of the 0x100 constant, because it understands that a simple negation does the same thing here (in particular, all the bits that differ between -f and 0x100 - f are masked away by the 0xFF & ... operation).
That leads directly to the following C code:
int remap_neg_mask(int f) {
return 0xFF & -f;
}
which compiles down to the exact same thing.
You can play with all of this on godbolt.
1 Except on clang, which inserts an extra mov to get the result in eax rather than generating it there in the first place.
2 Note that by "inlining" I mean both real inlining the compiler does if you actually write this as a function, but also what happens if you just do the remapping operation directly at the place you need it without a function.
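As a hedged, self-contained version of the negation trick, the sketch below uses unsigned arithmetic so that the wraparound is well defined. The names flag_to_mask and keep_if are made up for this example, and it assumes f is exactly 0 or 1 as stated in the question:

```cpp
#include <cstdint>

// Maps 0 -> 0x00 and 1 -> 0xFF by negating and keeping the low 8 bits.
// Assumes f is exactly 0 or 1.
constexpr std::uint8_t flag_to_mask(unsigned f) {
    return static_cast<std::uint8_t>(0u - f);
}

// Typical use of such a mask: keep a byte when the flag is set, else zero it.
constexpr std::uint8_t keep_if(unsigned f, std::uint8_t value) {
    return static_cast<std::uint8_t>(value & flag_to_mask(f));
}

static_assert(flag_to_mask(0u) == 0x00, "flag 0 maps to 0");
static_assert(flag_to_mask(1u) == 0xFF, "flag 1 maps to 0xFF");
static_assert(keep_if(1u, 0x5A) == 0x5A, "mask of all ones keeps the value");
static_assert(keep_if(0u, 0x5A) == 0x00, "mask of zero clears the value");
```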

value = 0xFF & ((1 << 8) - f )
If f is one, subtract it from 0x100 giving you 0xFF; otherwise subtract 0 and bitmask with 0xFF and get 0.
Too obvious?
value = ( f == 1 ) ? 0xFF : 0;

What is the correct way to obtain (-1)^n?

Many algorithms require computing (-1)^n (both integers), usually as a factor in a series. That is, a factor that is -1 for odd n and 1 for even n. In a C++ environment, one often sees:
#include<iostream>
#include<cmath>
int main(){
int n = 13;
std::cout << std::pow(-1, n) << std::endl;
}
What is better or the usual convention? (or something else),
std::pow(-1, n)
std::pow(-1, n%2)
(n%2?-1:1)
(1-2*(n%2)) // (gives incorrect value for negative n)
EDIT:
In addition, user @SeverinPappadeux proposed another alternative based on (a global?) array lookups. My version of it is:
const int res[] {-1, 1, -1}; // three elements are needed for negative modulo results
const int* const m1pow = res + 1;
...
m1pow[n%2]
This is probably not going to settle the question, but by using the emitted code we can discard some options.
First without optimization, the final contenders are:
1 - ((n & 1) << 1);
(7 operations, no memory access)
mov eax, DWORD PTR [rbp-20]
add eax, eax
and eax, 2
mov edx, 1
sub edx, eax
mov eax, edx
mov DWORD PTR [rbp-16], eax
and
retvals[n&1];
(5 operations, memory --registers?-- access)
mov eax, DWORD PTR [rbp-20]
and eax, 1
cdqe
mov eax, DWORD PTR main::retvals[0+rax*4]
mov DWORD PTR [rbp-8], eax
Now with optimization (-O3)
1 - ((n & 1) << 1);
(4 operations, no memory access)
add edx, edx
mov ebp, 1
and edx, 2
sub ebp, edx
.
retvals[n&1];
(4 operations, memory --registers?-- access)
mov eax, edx
and eax, 1
movsx rcx, eax
mov r12d, DWORD PTR main::retvals[0+rcx*4]
.
n%2?-1:1;
(4 operations, no memory access)
cmp eax, 1
sbb ebx, ebx
and ebx, 2
sub ebx, 1
The tests are here. I had to do some acrobatics to have meaningful code that doesn't elide operations altogether.
Conclusion (for now)
So in the end it depends on the level of optimization and expressiveness:
1 - ((n & 1) << 1); is always good but not very expressive.
retvals[n&1]; pays a price for memory access.
n%2?-1:1; is expressive and good but only with optimization.
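The following hedged sketch (C++11 or later; the function names are invented here) puts the bit-based expression next to the remainder-based one from the question, to show why only the former is safe for negative n. It assumes two's-complement integers, which C++20 guarantees and which mainstream compilers used before that:

```cpp
// (-1)^n without a branch or a table: n & 1 is 1 for odd n and 0 for even n.
constexpr int minus_one_pow(int n) {
    return 1 - ((n & 1) << 1);
}

// The remainder-based formula from the question, kept for comparison.
constexpr int minus_one_pow_rem(int n) {
    return 1 - 2 * (n % 2);
}

static_assert(minus_one_pow(4) == 1, "even exponent");
static_assert(minus_one_pow(13) == -1, "odd exponent");
static_assert(minus_one_pow(-3) == -1, "negative odd exponent still works");
static_assert(minus_one_pow_rem(-3) == 3, "n % 2 is -1 for negative odd n");
```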

You can use (n & 1) instead of n % 2 and << 1 instead of * 2 if you want to be super-pedantic, er I mean optimized.
So the fastest way to compute in an 8086 processor is:
1 - ((n & 1) << 1)
I just want to clarify where this answer is coming from. The original poster alfC did an excellent job of posting a lot of different ways to compute (-1)^n some being faster than others.
Nowadays with processors being as fast as they are and optimizing compilers being as good as they are we usually value readability over the slight (even negligible) improvements from shaving a few CPU cycles from an operation.
There was a time when one pass compilers ruled the earth and MUL operations were new and decadent; in those days a power of 2 operation was an invitation for gratuitous optimization.

Usually you don't actually calculate (-1)^n, instead you track the current sign (as a number being either -1 or 1) and flip it every operation (sign = -sign), do this as you handle your n in order and you will get the same result.
EDIT: Note that part of the reason I recommend this is because there is rarely actually semantic value in the representation (-1)^n; it is merely a convenient method of flipping the sign between iterations.
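A minimal sketch of this sign-tracking idea, using the alternating harmonic series 1 - 1/2 + 1/3 - ... (which converges to ln 2) purely as an example series; the function name is made up for this illustration:

```cpp
// Partial sum of 1 - 1/2 + 1/3 - ..., flipping a running sign
// instead of evaluating (-1)^n for every term.
double alternating_harmonic(int terms) {
    double sum = 0.0;
    double sign = 1.0;  // sign of the current term
    for (int n = 1; n <= terms; ++n) {
        sum += sign / n;
        sign = -sign;   // flip for the next term
    }
    return sum;
}
```

The loop body never computes a power or a remainder; the only per-term cost of the alternating factor is one negation.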

First of all, the fastest isOdd test I do know (in an inline method)
/**
* Return true if the value is odd
* @param value the value to check
*/
inline bool isOdd(int value)
{
return (value & 1);
}
Then make use of this test to return -1 if odd, 1 otherwise (which is the actual output of (-1)^N )
/**
* Return the computation of (-1)^N
* @param n the N factor
*/
inline int minusOneToN(int n)
{
return isOdd(n)?-1:1;
}
Lastly, as suggested by @Guvante, you can spare a multiplication by just flipping the sign of a value (avoiding the minusOneToN function):
/**
* Example of the general usage. Avoids a useless multiplication
* @param value The value to flip if it is odd
*/
inline int flipSignIfOdd(int value)
{
return isOdd(value)?-value:value;
}

Many algorithms require to compute (-1)^n (both integer), usually as a
factor in a series. That is, a factor that is -1 for odd n and 1 for
even n.
Consider evaluating the series as a function of -x instead.

If it's speed you need, here goes ...
int inline minus_1_pow(int n) {
static const int retvals[] {1, -1};
return retvals[n&1];
}
The Visual C++ compiler with optimization turned to 11 compiles this down to two machine instructions, neither of which is a branch. It optimizes-away the retvals array also, so no cache misses.

What about
(1 - (n%2)) - (n%2)
n%2 most likely will be computed only once
UPDATE
Actually, simplest and most correct way would be using table
const int res[] {-1, 1, -1};
return res[n%2 + 1];

Well if we are performing the calculation in a series, why not handle the calculation in a positive loop and a negative loop, skipping the evaluation completely?
The Taylor series expansion to approximate the natural log of (1+x) is a perfect example of this type of problem. Each term has (-1)^(n+1), or (-1)^(n-1). There is no need to calculate this factor. You can "slice" the problem by either executing 1 loop for every two terms, or two loops, one for the odd terms and one for the even terms.
Of course, since the calculation, by its nature, is one over the domain of real numbers, you will be using a floating point processor to evaluate the individual terms anyway. Once you have decided to do that, you should just use the library implementation for the natural logarithm. But if for some reason, you decide not to, it will certainly be faster, but not by much, not to waste cycles calculating the value of -1 to the nth power.
Perhaps each can even be done in separate threads. Maybe the problem can be vectorized, even.
