Bit twiddling reorder - bit-manipulation

I need to do an arbitrary reorder of a 7 bit value (Yes I know I should be using a table) and am wondering if there are any bit hacks to do this.
Example:
// <b0, b1, b2, b3, b4, b5, b6> -> <b3, b2, b4, b1, b5, b0, b6>
// the naive way
out =
(0x020 & In) << 5 |
(0x008 & In) << 2 |
(0x040 & In) |
(0x012 & In) >> 1 |
(0x004 & In) >> 2 |
(0x001 & In) >> 3;
// 6 ANDs, 5 ORs, 5 shifts = 16 ops
edit:
I was thinking of something along the lines of this
Just for kicks and because I was AFTK I'm trying a brute force search for solutions of the form:
((In * C1) >> C2) & 0x7f
No solutions found.
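A sketch of that kind of search (the helper name and search bounds are mine, not the original code):
#include <stdint.h>
#include <stdio.h>
// Reference permutation: input bit i moves to bit dst[i] (the mapping from the comment).
static unsigned shuffle_ref(unsigned in) {
    static const int dst[7] = {3, 2, 4, 1, 5, 0, 6};
    unsigned out = 0;
    for (int i = 0; i < 7; i++)
        out |= ((in >> i) & 1u) << dst[i];
    return out;
}
int main(void) {
    for (uint64_t c1 = 1; c1 < (1u << 24); c1++) {        // search range is arbitrary
        for (unsigned c2 = 0; c2 < 32; c2++) {
            unsigned in = 0;
            while (in < 128 && (((in * c1) >> c2) & 0x7f) == shuffle_ref(in))
                in++;
            if (in == 128)
                printf("C1=%llu C2=%u\n", (unsigned long long)c1, c2);
        }
    }
    return 0;
}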

The first step seems to be to understand a mathematical solution and optimize that.
See here for a collection of bit hacks.

Have a look at the compiler output of your "naive" code, it might surprise you. I once did something like that and the compiler (VC++2005) completely changed the values of all the ands and shifts for me to make them more efficient, eg I'm sure it would remove your "(0x001 & In) >> 3".
But yes, if the reshuffle is a fixed function then a table is probably best.
Update
For a laugh I looked at the compiler output from VC++ 2005....
First I tried a constant value for "In" but the compiler wasn't fooled one bit, it produced this code:
mov eax,469h
ie. it completely optimized it away.
So ... I tried a proper input and got this:
00401D4F mov eax,ecx
00401D51 and eax,20h
00401D54 shl eax,3
00401D57 mov edx,ecx
00401D59 and edx,8
00401D5C or eax,edx
00401D5E mov edx,ecx
00401D60 sar edx,1
00401D62 and edx,2
00401D65 or edx,ecx
00401D67 sar edx,1
00401D69 shl eax,2
00401D6C and edx,9
00401D6F or eax,edx
00401D71 and ecx,40h
00401D74 or eax,ecx
That's four shift operations, five ANDs, four ORs - not bad for six inputs. Probably better than most people could do by hand.
It's probably also optimized for out-of-order execution, so it'll take fewer clock cycles than it seems. :-)

There are plenty of bit-twiddling hacks for common operations, e.g. reversing the order of the bits in a 32-bit word (10 each of shift, AND and OR, AFAICR).
In this case, with an apparently completely arbitrary mapping from input to output, I can't see any way of cleaning this up.
Use a lookup table :)
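Since a fixed reshuffle really is best served by a table, here is a minimal table-based sketch (names are mine) built from the same mapping:
#include <stdint.h>
static uint8_t reorder_lut[128];
// Build the table once: input bit i moves to bit dst[i].
static void init_reorder_lut(void) {
    static const int dst[7] = {3, 2, 4, 1, 5, 0, 6};
    for (unsigned in = 0; in < 128; in++) {
        unsigned out = 0;
        for (int i = 0; i < 7; i++)
            out |= ((in >> i) & 1u) << dst[i];
        reorder_lut[in] = (uint8_t)out;
    }
}
// Each reorder is then a single indexed load:  out = reorder_lut[In & 0x7f];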

Before you optimize, you should make sure your 'naive' way is doing what you intend. If I make your code into a function and run this loop:
for (b=0;b<7;b++)
{
i=1<<b;
printf("%d: %02x -> %02x\n", b, i, shuffle(i));
}
It produces this output, which contradicts the comments. In fact, it loses bits.
0: 01 -> 00
1: 02 -> 01
2: 04 -> 01
3: 08 -> 20
4: 10 -> 08
5: 20 -> 00
6: 40 -> 40
In order to get the shuffle you describe, I would code it like this:
// 0 1 2 3 4 5 6
//-> 3 2 4 1 5 0 6
(0x001 & In) << 3 |
(0x004 & In) << 2 |
(0x040 & In) |
(0x012 & In) << 1 |
(0x008 & In) >> 2 |
(0x020 & In) >> 5 ;

Related

Convert flag into either 0xFF or 0, based on whether flag equals 1 or 0

I have a binary flag f, equal to either zero or one.
If equal to one, I would like to convert to 0xFF, otherwise, to 0.
Current solution is f*0xFF, but I would rather use bit twiddling to achieve this.
How about just:
(unsigned char)-f
or alternately:
0xFF & -f
If f is already a char, then you just need -f.
This approach works because -0 == 0 and -1 == 0xFFFFF..., so the negation gets you what you want directly, perhaps with some extra high bits set if f is larger than a char (you didn't say).
Remember though that compilers are smart. I tried all of the following solutions, and all compiled down to 3 instructions or less, and none had a branch (even the solution with a conditional):
Conditional
int remap_cond(int f) {
return f ? 0xFF : 0;
}
Compiles to:
remap_cond:
test edi, edi
mov eax, 255
cmove eax, edi
ret
So even the "obvious" conditional works well, in three instructions and a latency of 2 or 3 cycles on most modern x86 hardware, depending on cmov performance.
Multiplication
Your original solution of:
int remap_mul(int f) {
return f * 0xFF;
}
Actually compiles into nice code that avoids the multiplication entirely, replacing it with a shift and subtract:
remap_mul:
mov eax, edi
sal eax, 8
sub eax, edi
ret
This will generally take two cycles on machines with mov-elimination, and the mov would often be removed by inlining anyway.
Subtraction
As corn3lius pointed out, you can do some subtraction from 0x100 and a mask, like so:
int remap_shift_sub(int f) {
return 0xFF & (0x100 - f);
}
This compiles to[1]:
remap_shift_sub:
neg edi
movzx eax, dil
ret
So that's the best so far I think - a latency of 2 cycles on most hosts, and the movzx can often be eliminated by inlining[2] - e.g., since it could use the 8-bit register in a subsequent consuming instruction.
Note that the compiler has smartly eliminated both the masking operation (you could perhaps argue the movzx accounts for it), and the use of the 0x100 constant, because it understands that a simple negation does the same thing here (in particular, all the bits that differ between -f and 0x100 - f are masked away by the 0xFF & ... operation).
That leads directly to the following C code:
int remap_neg_mask(int f) {
return -f & 0xFF;
}
which compiles down to the exact same thing.
You can play with all of this on godbolt.
[1] Except on clang, which inserts an extra mov to get the result in eax rather than generating it there in the first place.
[2] Note that by "inlining" I mean both real inlining the compiler does if you actually write this as a function, but also what happens if you just do the remapping operation directly at the place you need it without a function.
value = 0xFF & ((1 << 8) - f )
If f is one, subtract it from 0x100 giving you 0xFF; otherwise subtract 0 and bitmask with 0xFF and get 0.
Too obvious?
value = ( f == 1 ) ? 0xFF : 0;
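For completeness, a standalone check (using the function names from the answer above) that the main variants agree for f = 0 and f = 1:
#include <stdio.h>
int remap_cond(int f)      { return f ? 0xFF : 0; }
int remap_mul(int f)       { return f * 0xFF; }
int remap_shift_sub(int f) { return 0xFF & (0x100 - f); }
int remap_neg_mask(int f)  { return -f & 0xFF; }
int main(void) {
    for (int f = 0; f <= 1; f++)
        printf("f=%d: %02x %02x %02x %02x\n", f,
               remap_cond(f), remap_mul(f), remap_shift_sub(f), remap_neg_mask(f));
    return 0;   // prints 00 00 00 00 and ff ff ff ff
}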

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:
nasm -felf64 p14.asm && gcc p14.o -o p14
The C++ was compiled with:
g++ p14.cpp -o p14
Assembly, p14.asm:
section .data
fmt db "%d", 10, 0
global main
extern printf
section .text
main:
mov rcx, 1000000
xor rdi, rdi ; max i
xor rsi, rsi ; i
l1:
dec rcx
xor r10, r10 ; count
mov rax, rcx
l2:
test rax, 1
jpe even
mov rbx, 3
mul rbx
inc rax
jmp c1
even:
mov rbx, 2
xor rdx, rdx
div rbx
c1:
inc r10
cmp rax, 1
jne l2
cmp rdi, r10
cmovl rdi, r10
cmovl rsi, rcx
cmp rcx, 2
jne l1
mov rdi, fmt
xor rax, rax
call printf
ret
C++, p14.cpp:
#include <iostream>
int sequence(long n) {
int count = 1;
while (n != 1) {
if (n % 2 == 0)
n /= 2;
else
n = 3*n + 1;
++count;
}
return count;
}
int main() {
int max = 0, maxi;
for (int i = 999999; i > 0; --i) {
int s = sequence(i);
if (s > max) {
max = s;
maxi = i;
}
}
std::cout << maxi << std::endl;
}
I know about the compiler optimizations to improve speed and everything, but I don’t see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).
The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.
But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.
Execution times
My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).
g++ (unoptimized): avg 1272 ms.
g++ -O3: avg 578 ms.
asm (div) (original): avg 2650 ms.
asm (shr): avg 679 ms.
#johnfound asm (assembled with NASM): avg 501 ms.
#hidefromkgb asm: avg 200 ms.
#hidefromkgb asm, optimized by #Peter Cordes: avg 145 ms.
#Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.
If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).
See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.
See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code?. TL:DR: yes if you do it wrong (like this question).
Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see is assembly faster than compiled languages?. One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” is in a similar vein.
even:
mov rbx, 2
xor rdx, rdx
div rbx
On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early). High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.
shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.
For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.
As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all)
GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).
gcc -Os (optimize for size) does use IDIV for non-power-of-2 division,
unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.
Helping the compiler
(summary for this case: use uint64_t n)
First of all, it's only interesting to look at optimized compiler output. (-O3).
-O0 speed is basically meaningless.
Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place, writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in #Veedrac's code below.
This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake. e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on #johnfound's answer discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.
My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.
Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory which adds overhead.)
In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry). (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)
You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).
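That change is one line in the source; for reference, a sketch of the question's sequence() with an unsigned type:
#include <stdint.h>
int sequence(uint64_t n) {       // unsigned, so n/2 can compile to a plain SHR
    int count = 1;
    while (n != 1) {
        if (n % 2 == 0)
            n /= 2;
        else
            n = 3*n + 1;
        ++count;
    }
    return count;
}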
BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:
# from gcc5.4 -O3 plus my comments
# edx= count=1
# rax= uint64_t n
.L9: # do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
mov rdi, rax
shr rdi # rdi = n>>1;
test al, 1 # set flags based on n%2 (aka n&1)
mov rax, rcx
cmove rax, rdi # n= (n%2) ? 3*n+1 : n/2;
add edx, 1 # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
cmp/branch to update max and maxi, and then do the next n
The inner loop is branchless, and the critical path of the loop-carried dependency chain is:
3-component LEA (3 cycles)
cmov (2 cycles on Haswell, 1c on Broadwell or later).
Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).
The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.
Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency). The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.
The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.
Beating the compiler
GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.
It could also save all the MOV instructions, and the TEST: SHR sets CF= the bit shifted out, so we can use cmovc instead of test / cmovz.
### Hand-optimized version of what gcc does
.L9: #do{
lea rcx, [rax+1+rax*2] # rcx = 3*n + 1
shr rax, 1 # n>>=1; CF = n&1 = n%2
cmovc rax, rcx # n= (n&1) ? 3*n+1 : n/2;
inc edx # ++count;
cmp rax, 1
jne .L9 #}while(n!=1)
See #johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)
Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency (and Ice Lake with updated microcode). The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See #johnfound's answer, where he has timing results from an AMD CPU.
Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict). It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).
LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much).
So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.
Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)
Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.
Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.
The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each chain working on useful data, otherwise you'd have to conditionally increment its counter.
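A rough C sketch of the simpler wait-for-both strategy (which, as noted, needs the conditional counter increments; names are mine and it isn't tuned):
#include <stdint.h>
// Advance two independent chains in one loop so their loop-carried dependency
// chains can overlap in the out-of-order core.
static void collatz_pair(uint64_t a, uint64_t b, int *count_a, int *count_b) {
    *count_a = 1;
    *count_b = 1;
    while (a != 1 || b != 1) {
        if (a != 1) { a = (a & 1) ? 3*a + 1 : a/2; ++*count_a; }
        if (b != 1) { b = (b & 1) ? 3*b + 1 : b/2; ++*count_b; }
    }
}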
You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth with 256b vector (4x uint64_t).
I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.
Untested idea for manual vectorization
# starting with YMM0 = [ n_d, n_c, n_b, n_a ] (64-bit elements)
# ymm4 = _mm256_set1_epi64x(1): increment vector
# ymm5 = all-zeros: count vector
.inner_loop:
vpaddq ymm1, ymm0, xmm0
vpaddq ymm1, ymm1, xmm0
vpaddq ymm1, ymm1, set1_epi64(1) # ymm1= 3*n + 1. Maybe could do this more efficiently?
vpsllq ymm3, ymm0, 63 # shift bit 1 to the sign bit
vpsrlq ymm0, ymm0, 1 # n /= 2
# FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
vpblendvpd ymm0, ymm0, ymm1, ymm3 # variable blend controlled by the sign bit of each 64-bit element. I might have the source operands backwards, I always have to look this up.
# ymm0 = updated n in each element.
vpcmpeqq ymm1, ymm0, set1_epi64(1)
vpandn ymm4, ymm1, ymm4 # zero out elements of ymm4 where the compare was true
vpaddq ymm5, ymm5, ymm4 # count++ in elements where n has never been == 1
vptest ymm4, ymm4
jnz .inner_loop
# Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero
vextracti128 ymm0, ymm5, 1
vpmaxq .... crap this doesn't exist
# Actually just delay doing a horizontal max until the very very end. But you need some way to record max and maxi.
You can and should implement this with intrinsics instead of hand-written asm.
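An AVX2-intrinsics sketch of the same inner loop (untested, mirroring the asm above; the horizontal max / maxi bookkeeping is still left out, and the function name is mine):
#include <stdint.h>
#include <immintrin.h>
// Run 4 Collatz chains in parallel; keep incrementing the count only in lanes
// that have never reached 1, using the "sticky" masked increment vector.
static __m256i collatz4_counts(__m256i n) {
    const __m256i one = _mm256_set1_epi64x(1);
    __m256i inc   = one;                        // zeroed per-lane once that lane hits 1
    __m256i count = _mm256_setzero_si256();
    for (;;) {
        __m256i n3p1 = _mm256_add_epi64(_mm256_add_epi64(n, n), n);   // 3*n
        n3p1 = _mm256_add_epi64(n3p1, one);                           // 3*n + 1
        __m256i odd  = _mm256_slli_epi64(n, 63);                      // bit 0 -> sign bit
        __m256i half = _mm256_srli_epi64(n, 1);                       // n / 2
        // per-qword blend on the sign bit: odd lanes take 3n+1, even lanes take n/2
        n = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_castsi256_pd(half),
                                                 _mm256_castsi256_pd(n3p1),
                                                 _mm256_castsi256_pd(odd)));
        __m256i done = _mm256_cmpeq_epi64(n, one);   // lanes that just reached 1
        inc   = _mm256_andnot_si256(done, inc);      // stop incrementing those lanes, stickily
        count = _mm256_add_epi64(count, inc);
        if (_mm256_testz_si256(inc, inc))            // all lanes finished
            return count;
    }
}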
Algorithmic / implementation improvement:
Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work. e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer)
#EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.
So the loop might look like this:
goto loop_entry; // C++ structured like the asm, for illustration only
do {
n = n*3 + 1;
loop_entry:
shift = _tzcnt_u64(n);
n >>= shift;
count += shift;
} while(n != 1);
This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2. 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway)). This is the kind that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way. (i.e. this is part of the "x86 tax" that costs speed / power.) SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).
It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. #Veedrac's answer overcomes all this by deferring the tzcnt/shift for multiple iterations, which is highly effective (see below).
We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF).
TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.
They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed. e.g. Windows 9x's assumption of no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)
Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for #Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake where Intel removed the false dep for LZCNT/TZCNT they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.
Improvements to ideas / code from other answers:
#hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.
I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in #hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)
Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.
#Veedrac's code
I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.
code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.
See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)
Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake. And especially in this case. The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.
The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.
(The below code is 32-bit, but can be easily converted to 64-bit)
For example, the sequence function can be optimized to only 5 instructions:
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
The whole code looks like:
include "%lib%/freshlib.inc"
#BinaryType console, compact
options.DebugMode = 1
include "%lib%/freshlib.asm"
start:
InitializeAll
mov ecx, 999999
xor edi, edi ; max
xor ebx, ebx ; max i
.main_loop:
xor esi, esi
mov eax, ecx
.seq:
inc esi ; counter
lea edx, [3*eax+1] ; edx = 3*n+1
shr eax, 1 ; eax = n/2
cmovc eax, edx ; if CF eax = edx
jnz .seq ; jmp if n<>1
cmp edi, esi
cmovb edi, esi
cmovb ebx, ecx
dec ecx
jnz .main_loop
OutputValue "Max sequence: ", edi, 10, -1
OutputValue "Max index: ", ebx, 10, -1
FinalizeAll
stdcall TerminateAll, 0
In order to compile this code, FreshLib is needed.
In my tests, (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question (when compiled with -O0: 430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.
The output of both programs is the same: max sequence = 525 on i = 837799.
For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:
while (n % 2 == 0) n /= 2;
if (n > 1) for (;;) {
n = (3*n + 1) / 2;
if (n % 2 == 0) {
do n /= 2; while (n % 2 == 0);
if (n == 1) break;
}
}
Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001 then the next steps are:
3n+1 -> ???? 0000 0100
/ 2 -> ???? ?000 0010
/ 2 -> ???? ??00 0001
3n+1 -> ???? ??00 0100
/ 2 -> ???? ???0 0010
/ 2 -> ???? ???? 0001
3n+1 -> ???? ???? 0100
/ 2 -> ???? ???? ?010
/ 2 -> ???? ???? ??01
3n+1 -> ???? ???? ??00
/ 2 -> ???? ???? ???0
/ 2 -> ???? ???? ????
So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:
k = n / 256;
m = n % 256;
switch (m) {
case 0: n = 1 * k + 0; break;
case 1: n = 81 * k + 1; break;
case 2: n = 81 * k + 2; break;
...
case 155: n = 729 * k + 445; break;
...
}
Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are needed to reach 1.
PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like
static const unsigned int multipliers [256] = { ... }
static const unsigned int adders [256] = { ... }
while (n > 128) {
size_t lastBits = n % 256;
n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
}
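The two tables can be generated once by simulating the eight halvings for every low-byte value; a sketch (names match the code above; writing n = 256*k + m, the result after eight halvings is 3^r * k + adder):
#include <stdint.h>
static unsigned int multipliers[256];
static unsigned int adders[256];
static void init_tables(void) {
    for (unsigned m = 0; m < 256; m++) {
        uint64_t mult = 1;   // becomes 3^r, r = number of 3n+1 steps
        uint64_t add  = m;   // what happens to the low-order part
        int halvings = 0;
        while (halvings < 8) {
            if (add & 1) {           // odd: 3n+1 also triples k's coefficient
                add = 3*add + 1;
                mult *= 3;
            } else {                 // even: n/2 halves k's coefficient; just count it
                add >>= 1;
                halvings++;
            }
        }
        multipliers[m] = (unsigned)mult;   // e.g. multipliers[1] == 81, adders[1] == 1
        adders[m]      = (unsigned)add;
    }
}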
In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I expect a slowdown when the tables don't fit into L1 cache anymore.
PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on number of iterations of the loop).
We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).
If we do the loop until n ≤ 1,000,000 and have a precomputed table how many iterations are needed from any start point n ≤ 1,000,000 then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.
On a rather unrelated note: more performance hacks!
[the first «conjecture» has been finally debunked by #ShreevatsaR; removed]
When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):
[even] [odd]
[odd] [even]
[even] [even]
To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.
Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.
Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2-bit long and its bits are ba from most- to least-significant, then a = 1, and the following holds:
(N << 1) + N + 1:      (N >> 1) + N + 1:
       b10                    b1
        b1                     b
      +  1                  +  1
      ----                   ---
      bBb0                    bBb
where B = !b. Right-shifting the first result gives us exactly what we want.
Q.E.D.: (N & 1) == 1 ⇒ (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.
As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2× time reduction.
The resulting algorithm looks like this:
uint64_t sequence(uint64_t size, uint64_t *path) {
uint64_t n, i, c, maxi = 0, maxc = 0;
for (n = i = (size - 1) | 1; i > 2; n = i -= 2) {
c = 2;
while ((n = ((n & 3)? (n >> 1) + n + 1 : (n >> 2))) > 2)
c += 2;
if (n == 2)
c++;
if (c > maxc) {
maxi = i;
maxc = c;
}
}
*path = maxc;
return maxi;
}
int main() {
uint64_t maxi, maxc;
maxi = sequence(1000000, &maxc);
printf("%llu, %llu\n", maxi, maxc);
return 0;
}
Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of the sequence is odd.
[EDIT:]
Let's translate this into assembly!
MOV RCX, 1000000;
DEC RCX;
AND RCX, -2;
XOR RAX, RAX;
MOV RBX, RAX;
#main:
XOR RSI, RSI;
LEA RDI, [RCX + 1];
#loop:
ADD RSI, 2;
LEA RDX, [RDI + RDI*2 + 2];
SHR RDX, 1;
SHRD RDI, RDI, 2; ror rdi,2 would do the same thing
CMOVL RDI, RDX; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
CMOVS RDI, RDX;
CMP RDI, 2;
JA #loop;
LEA RDX, [RSI + 1];
CMOVE RSI, RDX;
CMP RAX, RSI;
CMOVB RAX, RSI;
CMOVB RBX, RCX;
SUB RCX, 2;
JA #main;
MOV RDI, RCX;
ADD RCX, 10;
PUSH RDI;
PUSH RCX;
#itoa:
XOR RDX, RDX;
DIV RCX;
ADD RDX, '0';
PUSH RDX;
TEST RAX, RAX;
JNE #itoa;
PUSH RCX;
LEA RAX, [RBX + 1];
TEST RBX, RBX;
MOV RBX, RDI;
JNE #itoa;
POP RCX;
INC RDI;
MOV RDX, RDI;
#outp:
MOV RSI, RSP;
MOV RAX, RDI;
SYSCALL;
POP RAX;
TEST RAX, RAX;
JNE #outp;
LEA RAX, [RDI + 59];
DEC RDI;
SYSCALL;
Use these commands to compile:
nasm -f elf64 file.asm
ld -o file file.o
See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)
C++ programs are translated to assembly programs during the generation of machine code from the source code, so it is essentially wrong to say assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler. So a smart C++ compiler may produce binary code more optimal and efficient than a dumb assembler's code.
However I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:
Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use CPU intensively (or poll over the network).
Your data set must be large enough.
Your test must run for something more than 5-10 seconds.
Do not rely on just one sample. Perform your test N times. Collect results and calculate the mean or median of the result.
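For the last point, a minimal sketch (std::chrono; work is a placeholder for whatever you are benchmarking):
#include <algorithm>
#include <chrono>
#include <vector>
// Time the workload several times and report the median, rather than trusting one run.
template <class F>
double median_ms(F work, int runs = 10) {
    std::vector<double> t;
    for (int i = 0; i < runs; i++) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto stop  = std::chrono::steady_clock::now();
        t.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(t.begin(), t.end());
    return t[t.size() / 2];
}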
From comments:
But, this code never stops (because of integer overflow) !?! Yves Daoust
For many numbers it will not overflow.
If it will overflow - for one of those unlucky initial seeds, the overflown number will very likely converge toward 1 without another overflow.
Still this poses interesting question, is there some overflow-cyclic seed number?
Any simple final converging series starts with power of two value (obvious enough?).
2^64 will overflow to zero, which is undefined infinite loop according to algorithm (ends only with 1), but the most optimal solution in answer will finish due to shr rax producing ZF=1.
Can we produce 2^64? If the starting number is 0x5555555555555555, it's odd number, next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically in undefined state of algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in infinite loop (QED variant 1, "cheapo" through undefined 0 number).
How about some more complex number, which will create cycle without 0?
Frankly, I'm not sure, my Math theory is too hazy to get any serious idea, how to deal with it in serious way. But intuitively I would say the series will converge to 1 for every number : 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of original number (or intermediate) into some power of 2, sooner or later. So we don't need to worry about infinite loop for original series, only overflow can hamper us.
So I just put few numbers into sheet and took a look on 8 bit truncated numbers.
There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, other two progressing toward 85).
But there's no value creating cyclic overflow seed.
Funnily enough I did a check, which is the first number to suffer from 8 bit truncation, and already 27 is affected! It does reach value 9232 in proper non-truncated series (first truncated value is 322 in 12th step), and the maximum value reached for any of the 2-255 input numbers in non-truncated way is 13120 (for the 255 itself), maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to count, etc...).
Interestingly enough (for me) the number 9232 is maximum for many other source numbers, what's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.
Unfortunately I can't get any deep grasp of this series, why does it converge and what are the implications of truncating them to k bits, but with cmp number,1 terminating condition it's certainly possible to put the algorithm into infinite loop with particular input value ending as 0 after truncation.
But the value 27 overflowing for 8 bit case is sort of alerting, this looks like if you count the number of steps to reach value 1, you will get wrong result for majority of numbers from the total k-bit set of integers. For the 8 bit integers the 146 numbers out of 256 have affected series by truncation (some of them may still hit the correct number of steps by accident maybe, I'm too lazy to check).
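The 8-bit experiment above is easy to reproduce; a sketch that reports which starting values end up overflowing to 0 under truncation (the step limit is only a safety net against unexpected cycles):
#include <stdint.h>
#include <stdio.h>
int main(void) {
    for (unsigned start = 1; start < 256; start++) {
        uint8_t n = (uint8_t)start;   // every intermediate value truncated to 8 bits
        for (int step = 0; step < 1000 && n > 1; step++)
            n = (n & 1) ? (uint8_t)(3*n + 1) : (uint8_t)(n / 2);
        if (n == 0)
            printf("%u overflows to 0\n", start);
    }
    return 0;
}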
You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:
test rax, 1
jpe even
... has a 50% chance of mispredicting the branch, and that will be expensive.
The compiler almost certainly does both computations (which costs negligibly more since the div/mod is quite long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a zero percent chance of being mispredicted.
For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization
(https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.
Example python implementation:
import sys

inner_loop = 0

def collatz_sequence(N, cache):
    global inner_loop
    l = [ ]
    stop = False
    n = N
    tails = [ ]
    while not stop:
        inner_loop += 1
        tmp = n
        l.append(n)
        if n <= 1:
            stop = True
        elif n in cache:
            stop = True
        elif n % 2:
            n = 3*n + 1
        else:
            n = n // 2
        tails.append((tmp, len(l)))
    for key, offset in tails:
        if not key in cache:
            cache[key] = l[offset:]
    return l

def gen_sequence(l, cache):
    for elem in l:
        yield elem
        if elem in cache:
            yield from gen_sequence(cache[elem], cache)
            return  # raising StopIteration inside a generator is an error since PEP 479

if __name__ == "__main__":
    le_cache = {}
    for n in range(1, 4711, 5):
        l = collatz_sequence(n, le_cache)
        print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))
    print("inner_loop = {}".format(inner_loop))
As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high level language.
Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high level language, inline assembly is often useful in these cases and still lets you use a high level language.
In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.
Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>=1 and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.
Edit: your mileage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that the modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then compiler optimizing /=2 as >>=1 would be the best 1st place to look.
On an unrelated note, if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to
{
n = (n*3+1) >> 1;
count += 2;
}
So the whole statement would then be
if (n & 1)
{
n = (n*3 + 1) >> 1;
count += 2;
}
else
{
n >>= 1;
++count;
}
The simple answer:
doing a MOV RBX, 3 and MUL RBX is expensive; just ADD RBX, RBX twice
ADD 1 is probably faster than INC here
MOV 2 and DIV is very expensive; just shift right
64-bit code is usually noticeably slower than 32-bit code and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code
If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.

Why is one of these sooooo much faster than the other?

I'm writing C++ code to find the first byte in memory that is not 0xFF. To exploit bitscanforward, I had written inline assembly code that I like very much. But for "readability" as well as future proofing (i.e. SIMD vectorization) I thought I would give the g++ optimizer a chance. g++ didn't vectorize, but it did get to nearly the same non-SIMD solution I did. But for some reason, its version runs much slower, 260000x slower (i.e. I have to loop my version 260,000x more to get to the same execution time). I expected some difference but not THAT much! Can someone point out why it might be? I just want to know, so that I can avoid such mistakes in future inline assembly code.
The C++ starting point is following, (in terms of counting accuracy, there is a bug in this code, but I've simplified it for this speed test):
uint64_t count3 (const void *data, uint64_t const &nBytes) {
uint64_t count = 0;
uint64_t block;
do {
block = *(uint64_t*)(data+count);
if ( block != (uint64_t)-1 ) {
/* count += __builtin_ctz(~block); ignore this for speed test*/
goto done;
};
count += sizeof(block);
} while ( count < nBytes );
done:
return (count>nBytes ? nBytes : count);
}
The assembly code g++ came up with is:
_Z6count3PKvRKm:
.LFB33:
.cfi_startproc
mov rdx, QWORD PTR [rsi]
xor eax, eax
jmp .L19
.p2align 4,,10
.p2align 3
.L21:
add rax, 8
cmp rax, rdx
jnb .L18
.L19:
cmp QWORD PTR [rdi+rax], -1
je .L21
.L18:
cmp rax, rdx
cmova rax, rdx
ret
.cfi_endproc
My inline assembly is
_Z6count2PKvRKm:
.LFB32:
.cfi_startproc
push rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
mov rbx, QWORD PTR [rsi]
# count trailing bytes of 0xFF
xor rax, rax
.ctxff_loop_69:
mov r9, QWORD PTR [rdi+rax]
xor r9, -1
jnz .ctxff_final_69
add rax, 8
cmp rax, rbx
jl .ctxff_loop_69
.ctxff_final_69:
cmp rax,rbx
cmova rax,rbx
pop rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
As far as I can see, it is substantially identical, except for the method by which it compares the data byte against 0xFF. But I cannot believe this would cause a great difference in computation time.
It's conceivable my test method is causing the error, but all I do is change the function name and iteration length in the following, simple for-loop shown below: (when N is 1<<20, and all bytes of 'a' except the last byte is 0xFF)
test 1
for (uint64_t i=0; i < ((uint64_t)1<<15); i++) {
n = count3(a,N);
}
test 2
for (uint64_t i=0; i < ((uint64_t)1<<33); i++) {
n = count2(a,N);
}
EDIT:
Here are my real inline assembly codes: SSE count1(), x86-64 count2(), and then the plain-old-C++ versions count0() and count3(). I fell down this rabbit hole hoping that I could get g++ to take my count0() and arrive, on its own, at my count1() or even count2(). But alas it did nothing, absolutely no optimization :( I should add that my platform doesn't have AVX2, which is why I was hoping to get g++ to automatically vectorize, so that the code would automatically update when I update my platform.
In terms of the explicit register use in the inline assembly, if I didn't make them explicitly, g++ would reuse the same registers for nBytes and count.
In terms of speedup, between XMM and QWORD, I found the real benefit is simply the "loop-unroll" effect, which I replicate in count2().
uint32_t count0(const uint8_t *data, uint64_t const &nBytes) {
for (int i=0; i<nBytes; i++)
if (data[i] != 0xFF) return i;
return nBytes;
}
uint32_t count1(const void *data, uint64_t const &nBytes) {
uint64_t count;
__asm__("# count trailing bytes of 0xFF \n"
" xor %[count], %[count] \n"
" vpcmpeqb xmm0, xmm0, xmm0 \n" // make array of 0xFF
".ctxff_next_block_%=: \n"
" vpcmpeqb xmm1, xmm0, XMMWORD PTR [%[data]+%[count]] \n"
" vpmovmskb r9, xmm1 \n"
" xor r9, 0xFFFF \n" // test if all match (bonus negate r9)
" jnz .ctxff_tzc_%= \n" // if !=0, STOP & tzcnt negated r9
" add %[count], 16 \n" // else inc
" cmp %[count], %[nBytes] \n"
" jl .ctxff_next_block_%= \n" // while count < nBytes, loop
" jmp .ctxff_done_%= \n" // else done + ALL bytes were 0xFF
".ctxff_tzc_%=: \n"
" tzcnt r9, r9 \n" // count bytes up to non-0xFF
" add %[count], r9 \n"
".ctxff_done_%=: \n" // more than 'nBytes' could be tested,
" cmp %[count],%[nBytes] \n" // find minimum
" cmova %[count],%[nBytes] "
: [count] "=a" (count)
: [nBytes] "b" (nBytes), [data] "d" (data)
: "r9", "xmm0", "xmm1"
);
return count;
};
uint64_t count2 (const void *data, uint64_t const &nBytes) {
uint64_t count;
__asm__("# count trailing bytes of 0xFF \n"
" xor %[count], %[count] \n"
".ctxff_loop_%=: \n"
" mov r9, QWORD PTR [%[data]+%[count]] \n"
" xor r9, -1 \n"
" jnz .ctxff_final_%= \n"
" add %[count], 8 \n"
" mov r9, QWORD PTR [%[data]+%[count]] \n" // <--loop-unroll
" xor r9, -1 \n"
" jnz .ctxff_final_%= \n"
" add %[count], 8 \n"
" cmp %[count], %[nBytes] \n"
" jl .ctxff_loop_%= \n"
" jmp .ctxff_done_%= \n"
".ctxff_final_%=: \n"
" bsf r9, r9 \n" // do tz count on r9 (either of first QWORD bits or XMM bytes)
" shr r9, 3 \n" // scale BSF count accordiningly
" add %[count], r9 \n"
".ctxff_done_%=: \n" // more than 'nBytes' bytes could have been tested,
" cmp %[count],%[nBytes] \n" // find minimum of count and nBytes
" cmova %[count],%[nBytes] "
: [count] "=a" (count)
: [nBytes] "b" (nBytes), [data] "D" (data)
: "r9"
);
return count;
}
inline static uint32_t tzcount(uint64_t const &qword) {
uint64_t tzc;
asm("tzcnt %0, %1" : "=r" (tzc) : "r" (qword) );
return tzc;
};
uint64_t count3 (const void *data, uint64_t const &nBytes) {
uint64_t count = 0;
uint64_t block;
do {
block = *(uint64_t*)(data+count);
if ( block != (uint64_t)-1 ) {
count += tzcount(~block);
goto done;
};
count += sizeof(block);
} while ( count < nBytes );
done:
return (count>nBytes ? nBytes : count);
}
uint32_t N = 1<<20;
int main(int argc, char **argv) {
unsigned char a[N];
__builtin_memset(a,0xFF,N);
uint64_t n = 0, j;
for (uint64_t i=0; i < ((uint64_t)1<<18); i++) {
n += count2(a,N);
}
printf("\n\n %x %x %x\n",N, n, 0);
return n;
}
Answer to the question title
Now that you've posted the full code: the call to count2(a,N) is hoisted out of the loop in main. The run time still increases very slightly with the loop count (e.g. 1<<18), but all that loop is doing is a single add. The compiler optimizes it to look more like this source:
uint64_t hoisted_count = count2(a,N);
for (uint64_t i=0; i < ((uint64_t)1<<18); i++) {
n += hoisted_count; // doesn't optimize to a multiply
}
There is no register conflict: %rax holds the result of the asm statement inlined from count2. It's then used as a source operand in the tiny loop that effectively multiplies it by the loop count through repeated addition into n.
(see the asm on the Godbolt Compiler Explorer, and note all the compiler warnings about arithmetic on void*s: clang refuses to compile your code):
## the for() loop in main, when using count2()
.L23:
addq %rax, %r12
subq $1, %rdx
jne .L23
%rdx is the loop counter here, and %r12 is the accumulator that holds n. IDK why gcc doesn't optimize it to a constant-time multiply.
Presumably the version that was 260k times slower didn't manage to hoist the whole count2 out of the loop. From gcc's perspective, the inline asm version is much simpler: the asm statement is treated as a pure function of its inputs, and gcc doesn't even know anything about it touching memory. The C version touches a bunch of memory, and is much more complicated to prove that it can be hoisted.
Using a "memory" clobber in the asm statement did prevent it from being hoisted when I checked on godbolt. You can tell from the presence or absence of a branch target in main before the vector block.
But anyway, the run time will be something like n + rep_count vs. n * rep_count.
The asm statement doesn't use a "memory" clobber or any memory inputs to tell gcc that it reads the memory pointed to by the input pointers. Incorrect optimizations could happen, e.g. being hoisted out of a loop that modified array elements. (See the Clobbers section in the manual for an example of using a dummy anonymous struct memory input instead of a blanket "memory" clobber. Unfortunately I don't think that's usable when the block of memory doesn't have compile-time-constant size.)
I think -fno-inline prevents hoisting because the function isn't marked with __attribute__((const)) or the slightly weaker __attribute__((pure)) to indicate no side-effects. After inlining, the optimizer can see that for the asm statement.
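For example (a sketch; only valid because this particular function has no side effects and its result depends only on its arguments and the memory they point to):
// 'pure' lets GCC merge or hoist calls with unchanged inputs even when the
// body isn't inlined ('const' would be wrong here, since count2 reads memory).
__attribute__((pure))
uint64_t count2(const void *data, uint64_t const &nBytes);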
count0 doesn't get optimized to anything good because gcc and clang can't auto-vectorize loops where the number of iterations isn't known at the start. i.e. they suck at stuff like strlen or memchr, or search loops in general, even if they're told that it's safe to access memory beyond the end of the point where the search loop exits early (e.g. using char buf[static 512] as a function arg).
Optimizations for your asm code:
Like I commented on the question, using xor reg, 0xFFFF / jnz is silly compared to cmp reg, 0xFFFF / jnz, because cmp/jcc can macro-fuse into a compare-and-branch uop. cmp reg, mem / jne can also macro-fuse, so the scalar version that does a load/xor/branch is using 3x as many uops per compare. (Of course, Sandybridge can only micro-fuse the load if it doesn't use an indexed addressing mode. Also, SnB can only macro-fuse one pair per decode block, but you'd probably get the first cmp/jcc and the loop branch to macro-fuse.) Anyway, the xor is a bad idea. It's better to only xor right before the tzcnt, since saving uops in the loop is more important than code-size or uops total.
Your scalar loop is 9 fused-domain uops, which is one too many to issue at one iteration per 2 clocks. (SnB is a 4-wide pipeline, and for tiny loops it can actually sustain that.)
The indenting in the code in the first version of the question, with the count += __builtin_ctz at the same level as the if, made me think you were counting mismatch blocks, rather than just finding the first.
Unfortunately the asm code I wrote for the first version of this answer doesn't solve the same problem as the OP's updated and clearer code. See an old version of this answer for SSE2 asm that counts 0xFF bytes using pcmpeqb/paddb, and psadbw for the horizontal sum to avoid wraparound.
Getting a speedup with SSE2 (or AVX):
Branching on the result of a pcmpeq takes many more uops than branching on a cmp. If our search array is big, we can use a loop that tests multiple vectors at once, and then figure out which byte had our hit after breaking out of the loop.
This optimization applies to AVX2 as well.
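As a baseline, the same idea in plain SSE2 intrinsics (a minimal sketch: one 16-byte vector per iteration, no unrolling, and it assumes len is a multiple of 16; the hand-written two-vectors-per-iteration AVX loop follows below):
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>
// Return the offset of the first byte that is not 0xFF, or len if there is none.
static size_t find_first_not_ff_sse2(const uint8_t *data, size_t len) {
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    for (size_t i = 0; i < len; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i eq = _mm_cmpeq_epi8(v, ones);            // 0xFF where byte == 0xFF
        unsigned m = (unsigned)_mm_movemask_epi8(eq);    // 16-bit mask, 1 = matched 0xFF
        if (m != 0xFFFF)                                 // some byte wasn't 0xFF
            return i + (size_t)__builtin_ctz(~m & 0xFFFF);
    }
    return len;
}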
Here's my attempt, using GNU C inline asm with -masm=intel syntax. (Intrinsics might give better results, esp. when inlining, because the compiler understands intrinsics and so can do constant-propagation through them, and stuff like that. OTOH, you can often beat the compiler with hand-written asm if you understand the trade-offs and the microarchitecture you're targeting. Also, if you can safely make some assumptions, but you can't easily communicate them to the compiler.)
#include <stdint.h>
#include <immintrin.h>
// compile with -masm=intel
// len must be a multiple of 32 (TODO: cleanup loop)
// buf should be 16B-aligned for best performance
size_t find_first_zero_bit_avx1(const char *bitmap, size_t len) {
// return size_t not uint64_t. This same code works in 32bit mode, and in the x32 ABI where pointers are 32bit
__m128i pattern, vtmp1, vtmp2;
const char *result_pos;
int tmpi;
const char *bitmap_start = bitmap;
asm ( // modifies the bitmap pointer, but we're inside a wrapper function
"vpcmpeqw %[pat], %[pat],%[pat]\n\t" // all-ones
".p2align 4\n\t" // force 16B loop alignment, for the benefit of CPUs without a loop buffer
//IACA_START // See the godbolt link for the macro definition
".Lcount_loop%=:\n\t"
// " movdqu %[v1], [ %[p] ]\n\t"
// " pcmpeqb %[v1], %[pat]\n\t" // for AVX: fold the load into vpcmpeqb, making sure to still use a one-register addressing mode so it can micro-fuse
// " movdqu %[v2], [ %[p] + 16 ]\n\t"
// " pcmpeqb %[v2], %[pat]\n\t"
" vpcmpeqb %[v1], %[pat], [ %[p] ]\n\t" // Actually use AVX, to get a big speedup over the OP's scalar code on his SnB CPU
" vpcmpeqb %[v2], %[pat], [ %[p] + 16 ]\n\t"
" vpand %[v2], %[v2], %[v1]\n\t" // combine the two results from this iteration
" vpmovmskb %k[result], %[v2]\n\t"
" cmp %k[result], 0xFFFF\n\t" // k modifier: eax instead of rax
" jne .Lfound%=\n\t"
" add %[p], 32\n\t"
" cmp %[p], %[endp]\n\t" // this is only 2 uops after the previous cmp/jcc. We could re-arrange the loop and put the branches farther apart if needed. (e.g. start with a vpcmpeqb outside the loop, so each iteration actually sets up for the next)
" jb .Lcount_loop%=\n\t"
//IACA_END
// any necessary code for the not-found case, e.g. bitmap = endp
" mov %[result], %[endp]\n\t"
" jmp .Lend%=\n\t"
".Lfound%=:\n\t" // we have to figure out which vector the first non-match was in, based on v1 and (v2&v1)
// We could just search the bytes over again, but we don't have to.
// we could also check v1 first and branch, instead of checking both and using a branchless check.
" xor %k[result], 0xFFFF\n\t"
" tzcnt %k[result], %k[result]\n\t" // runs as bsf on older CPUs: same result for non-zero inputs, but different flags. Faster than bsf on AMD
" add %k[result], 16\n\t" // result = byte count in case v1 is all-ones. In that case, v2&v1 = v2
" vpmovmskb %k[tmp], %[v1]\n\t"
" xor %k[tmp], 0xFFFF\n\t"
" bsf %k[tmp], %k[tmp]\n\t" // bsf sets ZF if its *input* was zero. tzcnt's flag results are based on its output. For AMD, it would be faster to use more insns (or a branchy strategy) and avoid bsf, but Intel has fast bsf.
" cmovnz %k[result], %k[tmp]\n\t" // if there was a non-match in v1, use it instead of tzcnt(v2)+16
" add %[result], %[p]\n\t" // If we needed to force 64bit, we could use %q[p]. But size_t should be 32bit in the x32 ABI, where pointers are 32bit. This is one advantage to using size_t over uint64_t
".Lend%=:\n\t"
: [result] "=&a" (result_pos), // force compiler to pic eax/rax to save a couple bytes of code-size from the special cmp eax, imm32 and xor eax,imm32 encodings
[p] "+&r" (bitmap),
// throw-away outputs to let the compiler allocate registers. All early-clobbered so they aren't put in the same reg as an input
[tmp] "=&r" (tmpi),
[pat] "=&x" (pattern),
[v1] "=&x" (vtmp1), [v2] "=&x" (vtmp2)
: [endp] "r" (bitmap+len)
// doesn't compile: len isn't a compile-time constant
// , "m" ( ({ struct { char x[len]; } *dummy = (typeof(dummy))bitmap ; *dummy; }) ) // tell the compiler *which* memory is an input.
: "memory" // we read from data pointed to by bitmap, but bitmap[0..len] isn't an input, only the pointer.
);
return result_pos - bitmap_start;
}
This actually compiles and assembles to asm that looks like what I expected, but I didn't test it. Note that it leaves all register allocation to the compiler, so it's more inlining-friendly. Even without inlining, it doesn't force use of a call-preserved register that has to get saved/restored (e.g. your use of a "b" constraint).
Not done: scalar code to handle the last sub-32B chunk of data.
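For completeness, here's roughly what that scalar cleanup could look like in plain C (my own sketch, not part of the answer, and untested). It returns the byte offset of the first byte that isn't all-ones, matching the byte-offset convention the asm uses, and leaves the within-byte bit scan to the caller:
#include <stddef.h>
static size_t find_first_zero_bit_tail(const unsigned char *bitmap, size_t start, size_t len) {
    // 'start' is where the vector loop stopped (a multiple of 32); scan the rest byte by byte.
    size_t i = start;
    while (i < len && bitmap[i] == 0xFF)   // skip fully-set bytes
        ++i;
    return i;                              // first byte containing a zero bit, or len if none
}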
Static perf analysis for Intel SnB-family CPUs, based on Agner Fog's guides / tables. See also the x86 tag wiki. I'm assuming we're not bottlenecked on cache throughput, so this analysis only applies when the data is hot in L2 cache (or maybe only L1 is fast enough).
This loop can issue out of the front-end at one iteration (two vectors) per 2 clocks, because it's 7 fused-domain uops. (The front-end issues in groups of 4). (It's probably actually 8 uops, if the two cmp/jcc pairs are decoded in the same block. Haswell and later can do two macro-fusions per decode group, but previous CPUs can only macro-fuse the first. We could software-pipeline the loop so the early-out branch is farther from the p < endp branch.)
All of these fused-domain uops include an ALU uop, so the bottleneck will be on ALU execution ports. Haswell added a 4th ALU unit that can handle simple non-vector ops, including branches, so could run this loop at one iteration per 2 clocks (16B per clock). Your i5-2550k (mentioned in comments) is a SnB CPU.
I used IACA to count uops per port, since it's time consuming to do it by hand. IACA is dumb and thinks there's some kind of inter-iteration dependency other than the loop counter, so I had to use -no_interiteration:
g++ -masm=intel -Wall -Wextra -O3 -mtune=haswell find-first-zero-bit.cpp -c -DIACA_MARKS
iaca -64 -arch IVB -no_interiteration find-first-zero-bit.o
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - find-first-zero-bit.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 2.50 Cycles Throughput Bottleneck: Port1, Port5
Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------------------------
| Cycles | 2.0 0.0 | 2.5 | 1.0 1.0 | 1.0 1.0 | 0.0 | 2.5 |
-------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
# - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 2^ | | 1.0 | 1.0 1.0 | | | | CP | vpcmpeqb xmm1, xmm0, xmmword ptr [rdx]
| 2^ | | 0.6 | | 1.0 1.0 | | 0.4 | CP | vpcmpeqb xmm2, xmm0, xmmword ptr [rdx+0x10]
| 1 | 0.9 | 0.1 | | | | 0.1 | CP | vpand xmm2, xmm2, xmm1
| 1 | 1.0 | | | | | | | vpmovmskb eax, xmm2
| 1 | | | | | | 1.0 | CP | cmp eax, 0xffff
| 0F | | | | | | | | jnz 0x18
| 1 | 0.1 | 0.9 | | | | | CP | add rdx, 0x20
| 1 | | | | | | 1.0 | CP | cmp rdx, rsi
| 0F | | | | | | | | jb 0xffffffffffffffe1
On SnB: pcmpeqb can run on p1/p5. Fused compare-and-branch can only run on p5. Non-fused cmp can run on p015. Anyway, if one of the branches doesn't macro-fuse, the loop can run at one iteration per 8/3 = 2.666 cycles. With macro-fusion, best-case is 7/3 = 2.333 cycles. (IACA doesn't try to simulate distribution of uops to ports exactly the way the hardware would dynamically make those decisions. However, we can't expect perfect scheduling from the hardware either, so 2 vectors per 2.5 cycles is probably reasonable with both macro-fusions happening. Uops that could have used port0 will sometimes steal port1 or port5, reducing throughput.)
As I said before, Haswell handles this loop better. IACA thinks HSW could run the loop at one iteration per 1.75c, but that's clearly wrong because the taken loop-branch ends the issue group. It will issue in a repeating 4,3 uop pattern. But the execution units can handle more throughput than the frontend for this loop, so it should really be able to keep up with the frontend on Haswell/Broadwell/Skylake and run at one iteration per 2 clocks.
Further unrolling of more vpcmpeqb / vpand is only 2 uops per vector (or 3 without AVX, where we'd load into a scratch and then use that as the destination for pcmpeqb.) So with sufficient unrolling, we should be able to do 2 vector loads per clock. Without AVX, this wouldn't be possible without the PAND trick, since a vector load/compare/movmsk/test-and-branch is 4 uops. Bigger unrolls make more work to decode the final position where we found a match: a scalar cmp-based cleanup loop might be a good idea once we're in the area. You could maybe use the same scalar loop for cleanup of non-multiple-of-32B sizes.
If using SSE, with movdqu / pcmpeqb xmm,xmm, we can use an indexed addressing mode without it costing us uops, because a movdqu load is always a single load uop regardless of addressing mode. (It doesn't need to micro-fuse with anything, unlike a store). This lets us save a uop of loop overhead by using a base pointer pointing to the end of the array, and the index counting up from -len toward zero. e.g. add %[idx], 32 / js to loop while the index is negative.
With AVX, however, we can save 2 uops by using a single-register addressing mode so vpcmpeqb %[v1], %[pat], [ %[p] + 16 ] can micro-fuse. This means we need the add/cmp/jcc loop structure I used in the example. The same applies to AVX2.
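For comparison, here's roughly what the intrinsics route mentioned at the top could look like (a sketch of mine, untested, with the same simplifications as the asm: len is assumed to be a multiple of 32 and the tail still needs scalar cleanup; build with -mavx so the loads fold into vpcmpeqb as above):
#include <immintrin.h>
#include <stddef.h>
size_t find_first_zero_bit_avx_intrin(const char *bitmap, size_t len) {
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    for (size_t i = 0; i + 32 <= len; i += 32) {
        __m128i c1 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(bitmap + i)), ones);
        __m128i c2 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(bitmap + i + 16)), ones);
        unsigned both = (unsigned)_mm_movemask_epi8(_mm_and_si128(c1, c2));
        if (both != 0xFFFF) {                          // some byte in this 32B chunk isn't 0xFF
            unsigned m1 = (unsigned)_mm_movemask_epi8(c1);
            if (m1 != 0xFFFF)                          // first non-match is in the first 16 bytes
                return i + __builtin_ctz(~m1 & 0xFFFFu);
            return i + 16 + __builtin_ctz(~both & 0xFFFFu);  // c1 all-ones, so both == mask(c2)
        }
    }
    return len;                                        // no byte with a zero bit found
}
The compiler keeps register allocation and scheduling to itself here, so it's worth checking the generated asm to see whether it matches the hand-written loop structure.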
So I think I found the problem. One of the registers used in my inline assembly, despite the clobber list, was conflicting with g++'s use of it and was corrupting the test iteration. I fed g++'s own output back in as inline assembly and got the same 260000x acceleration as with my code. Also, in retrospect, the "accelerated" computation time was absurdly short.
Finally, I was so focused on the code embodied as a function that I failed to notice that g++ had, in fact, inlined the function into the test for-loop as well (I was using -O3 optimization). When I forced g++ not to inline it (i.e. -fno-inline), the 260000x acceleration disappeared.
I think g++ failed to take the inline assembly code's "clobber list" into account when it inlined the entire function without my permission.
Lesson learned: I need to do a better job with inline assembly constraints, or block inlining of the function with __attribute__((noinline)).
EDIT: Definitely found that g++ is using rax for the main() for-loop counter, in conflict with my use of rax.
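For what it's worth, a minimal sketch (mine, not the original code) of declaring a directly-written register so gcc won't keep anything, such as a loop counter, live in it across the asm statement; Intel syntax, so build with -masm=intel like the example above:
static inline void clobbers_rax_demo(void) {
    asm volatile ("xor eax, eax"        // placeholder template that writes eax directly
                  :                     // no outputs
                  :                     // no inputs
                  : "rax", "cc");       // declare rax and the flags as clobbered
}
The nicer fix, as in the answer above, is to avoid hard-coding registers entirely and let the compiler pick them via constraints like "=&a" or "=&r".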

Checksum for binary PLC communication

I've been scratching my head over how to calculate a checksum to communicate with Unitronics PLCs using binary commands. They offer the source code, but it's a Windows-only C# implementation, which is of little help to me beyond the basic syntax.
Specification PDF (the checksum calculation is near the end)
C# driver source (checksum calculation in Utils.cs)
Intended Result
Below is the byte index, message description and the sample which does work.
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | 24 25 26 27 28 29 | 30 31 32
# sx--------------- id FE 01 00 00 00 cn 00 specific--------- lengt CHKSM | numbr ot FF addr- | CHKSM ex
# 2F 5F 4F 50 4C 43 00 FE 01 00 00 00 4D 00 00 00 00 00 01 00 06 00 F1 FC | 01 00 01 FF 01 00 | FE FE 5C
The specification calls for calculating the accumulated value of the 22-byte message header and, separately, of the 6+ byte detail section, taking that sum modulo 65536, and then returning the two's complement of that value.
Attempt #1
My understanding is the tilde (~) operator in Python is directly derived from C/C++. After a day of writing the Python that creates the message I came up with this (stripped down version):
#!/usr/bin/env python
def Checksum( s ):
    x = ( int( s, 16 ) ) % 0x10000
    x = ( ~x ) + 1
    return hex( x ).split( 'x' )[1].zfill( 4 )
Details = ''
Footer = ''
Header = ''
Message = ''
Details += '0x010001FF0100'
Header += '0x2F5F4F504C4300FE010000004D000000000001000600'
Header += Checksum( Header )
Footer += Checksum( Details )
Footer += '5C'
Message += Header.split( 'x' )[1].zfill( 4 )
Message += Details.split( 'x' )[1].zfill( 4 )
Message += Footer
print Message
Message: 2F5F4F504C4300FE010000004D000000000001000600600L010001FF010001005C
I see an L in there; that's a different result from yesterday's, which wasn't any closer. If you want a quick formula result based on the rest of the message: Checksum(Header) should return F1FC and Checksum(Details) should return FEFE.
The value it returns is nowhere near the specification's example. I believe the issue is one of two things: either the Checksum method isn't calculating the sum of the hex string correctly, or the Python ~ operator is not equivalent to the C++ ~ operator.
Attempt #2
A friend has given me his C++ interpretation of what the calculation SHOULD be; I just can't get my head around this code, as my C++ knowledge is minimal.
short PlcBinarySpec::CalcHeaderChecksum( std::vector<byte> _header ) {
    short bytesum = 0;
    for ( std::vector<byte>::iterator it = _header.begin(); it != _header.end(); ++it ) {
        bytesum = bytesum + ( *it );
    }
    return ( ~( bytesum % 0x10000 ) ) + 1;
}
I'm not entirely sure what the correct code should be… but if the intention is for Checksum(Header) to return f705, and it's returning 08fb, here's the problem:
x = ( ~( x % 0x10000 ) ) + 1
The short version is that you want this:
x = (( ~( x % 0x10000 ) ) + 1) % 0x10000
The problem here isn't that ~ means something different. As the documentation says, ~x returns "the bits of x inverted", which is effectively the same thing it means in C (at least on 2s-complement platforms, which includes all Windows platforms).
You can run into a problem with the difference between C and Python types here (C integral types are fixed-size, and overflow; Python integral types are effectively infinite-sized, and grow as needed). But I don't think that's your problem here.
The problem is just a matter of how you convert the result to a string.
The result of calling Checksum(Header), up to the formatting, is -2299, or -0x08fb, in both versions.
In C, you can pretty much treat a signed integer as an unsigned integer of the same size (although you may have to ignore warnings to do so, in some cases). What exactly that does depends on your platform, but on a 2s-complement platform, signed short -0x08fb is the bit-for-bit equivalent of unsigned 0xf705. So, for example, if you do sprintf(buf, "%04hx", -0x08fb), it works just fine—and it gives you (on most platforms, including everything Windows) the unsigned equivalent, f705.
But in Python, there are no unsigned integers. The int -0x08fb has nothing to do with 0xf705. If you do "%04hx" % -0x08fb, you'll get -8fb, and there's no way to forcibly "cast it to unsigned" or anything like that.
Your code actually does hex(-0x08fb), which gives you -0x8fb, which you then split on the x, giving you 8fb, which you zfill to 08fb, which makes the problem a bit harder to notice (because that looks like a perfectly valid pair of hex bytes, instead of a minus sign and three hex digits), but it's the same problem.
Anyway, you have to explicitly decide what you mean by "unsigned equivalent", and write the code to do that. Since you're trying to match what C does on a 2s-complement platform, you can write that explicit conversion as % 0x10000. If you do "%04hx" % (-0x08fb % 0x10000), you'll get f705, just as you did in C. And likewise for your existing code.
It's quite simple. I checked your friend's algorithm by adding all the header bytes manually on a calculator, and it yields the correct result (0xfcf1).
Now, I don't actually know Python, but it looks to me like you are adding up half-byte values. You have made your header string like this:
Header = '2F5F4F504C4300FE010000004D000000000001000600'
And then you go through converting each character in that string from hex and adding it. That means you are dealing with values from 0 to 15. You need to treat every two characters as a pair representing one byte and convert that (values from 0 to 255), or use actual binary data instead of a text representation of it.
At the end of the algorithm, you don't really need to use the ~ operator if you don't trust it. Instead you can do (0xffff - (x % 0x10000)) + 1. Bear in mind that prior to adding 1, the value might actually be 0xffff, so you need to take the entire result modulo 0x10000 afterwards. Your friend's C++ version uses the short datatype, so no modulo is necessary at all because the short will naturally overflow.
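To tie the two answers together, here's a small C++ sketch (my own illustration, not the driver's code) that works on raw bytes rather than a hex string: sum the byte values, reduce modulo 0x10000, and take the two's complement.
#include <cstdint>
#include <cstdio>
#include <vector>

static uint16_t checksum(const std::vector<uint8_t> &bytes) {
    uint32_t sum = 0;
    for (uint8_t b : bytes)
        sum += b;                                  // raw byte values, 0..255 each
    return (uint16_t)(~(sum % 0x10000) + 1);       // two's complement, truncated to 16 bits
}

int main() {
    // The 22-byte header from the sample message, as raw bytes (not a text representation).
    std::vector<uint8_t> header = {
        0x2F, 0x5F, 0x4F, 0x50, 0x4C, 0x43, 0x00, 0xFE, 0x01, 0x00, 0x00,
        0x00, 0x4D, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x06, 0x00
    };
    std::printf("%04X\n", (unsigned)checksum(header));   // FCF1, stored little-endian as F1 FC
}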

re implement modulo using bit shifts?

I'm writing some code for a very limited system where the mod operator is very slow. In my code a modulo needs to be used about 180 times per second, and I figured that removing it as much as possible would significantly increase the speed of my code; as of now, one cycle of my main loop does not run in 1/60 of a second as it should. I was wondering if it is possible to re-implement the modulo using only bit shifts, as is possible with multiplication and division. Here is my code so far in C++ (if I can perform the modulo using assembly it would be even better). How can I remove the modulo without using division or multiplication?
while(input > 0)
{
    out = (out << 3) + (out << 1);
    out += input % 10;
    input = (input >> 8) + (input >> 1);
}
EDIT: Actually, I realized that I need to do it way more than 180 times per second, seeing as the value of input can be a very large number, up to 40 digits.
What you can do with simple bitwise operations is take the modulo of a value (the dividend) by a power-of-two divisor, by AND'ing it with divisor-1. A few examples:
unsigned int val = 123; // initial value
unsigned int rem;
rem = val & 0x3; // remainder after value is divided by 4.
// Equivalent to 'val % 4'
rem = val % 5; // remainder after value is divided by 5.
// Because 5 isn't power of two, we can't simply AND it with 5-1(=4).
Why does it work? Let's consider the bit pattern for the value 123, which is 1111011, and then the divisor 4, which has the bit pattern 00000100. As we know by now, the divisor has to be a power of two (as 4 is) and we need to decrement it by one (from 4 to 3 in decimal), which yields the bit pattern 00000011. After we bitwise-AND the original 123 and 3, the resulting bit pattern is 00000011, which is 3 in decimal. The reason we need a power-of-two divisor is that once we decrement it by one, all the less significant bits are set to 1 and the rest are 0. Once we do the bitwise-AND, it 'cancels out' the more significant bits from the original value and leaves us with simply the remainder of the original value divided by the divisor.
However, applying something like this for arbitrary divisors is not going to work unless you know your divisors beforehand (at compile time, and even then it requires divisor-specific code paths); resolving it at run time is not feasible, especially not in your case where performance matters.
Also there's a previous question related to the subject which probably has interesting information on the matter from different points of view.
Actually, division by constants is a well-known compiler optimization, and in fact gcc is already doing it.
This simple code snippet:
int mod(int val) {
    return val % 10;
}
Generates the following code on my rather old gcc with -O3:
_mod:
    push ebp
    mov edx, 1717986919
    mov ebp, esp
    mov ecx, DWORD PTR [ebp+8]
    pop ebp
    mov eax, ecx
    imul edx
    mov eax, ecx
    sar eax, 31
    sar edx, 2
    sub edx, eax
    lea eax, [edx+edx*4]
    mov edx, ecx
    add eax, eax
    sub edx, eax
    mov eax, edx
    ret
If you disregard the function prologue/epilogue, it's basically two multiplies (indeed on x86 we're lucky and can use lea for one) and some shifts and adds/subs. I know that I already explained the theory behind this optimization somewhere, so I'll see if I can find that post before explaining it yet again.
Now, on modern CPUs that's certainly faster than accessing memory (even if you hit the cache), but whether it's faster for your obviously rather more ancient CPU is a question that can only be answered with benchmarking (and also make sure your compiler is doing that optimization, otherwise you can always just "steal" the gcc version here ;) ). Especially considering that it depends on an efficient mulhs (i.e. getting the high bits of a multiply).
Note that this code is not size-independent; to be exact, the magic number changes (and maybe also parts of the adds/shifts), but it can be adapted.
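If you'd rather "steal" the idea than the exact gcc output, here's a hand-written sketch of the same reciprocal-multiplication trick for unsigned 32-bit values (my illustration; the magic constant is the usual ceil(2^35 / 10), but verify against your own compiler's output before relying on it):
#include <stdint.h>

// x % 10 without a divide: multiply by roughly 2^35/10 to get the quotient, then
// subtract q*10 (which could itself be done as (q << 3) + (q << 1) if multiplies hurt).
static inline uint32_t mod10(uint32_t x) {
    uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);  // q == x / 10
    return x - q * 10u;
}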
Doing modulo 10 with bit shifts is going to be hard and ugly, since bit shifts are inherently binary (on any machine you're going to be running on today). If you think about it, bit shifts are simply multiply or divide by 2.
But there's an obvious space-time trade you could make here: set up a table of values for out and out % 10 and look it up. Then the line becomes
out += tab[out]
and with any luck at all, that will turn out to be one 16-bit add and a store operation.
If you want to do modulo 10 using shifts, maybe you can adapt the double dabble algorithm to your needs?
This algorithm is used to convert binary numbers to decimal without using modulo or division.
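Here's a deliberately naive sketch (mine, for illustration rather than speed) of what such an adaptation could look like for a 16-bit value; it uses only shifts, adds and compares, and the low BCD digit at the end is the remainder:
#include <stdint.h>

// Double dabble ("shift-and-add-3"): build the BCD form one bit at a time.
static uint8_t mod10_double_dabble(uint16_t x) {
    uint32_t bcd = 0;                                // room for 5 BCD digits
    for (int i = 15; i >= 0; --i) {
        for (int d = 0; d < 5; ++d) {                // adjust digits >= 5 before shifting
            uint32_t digit = (bcd >> (4 * d)) & 0xF;
            if (digit >= 5)
                bcd += (uint32_t)3 << (4 * d);
        }
        bcd = (bcd << 1) | ((x >> i) & 1u);          // shift in the next binary bit
    }
    return (uint8_t)(bcd & 0xF);                     // low BCD digit == x % 10
}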
Every power of 16 ends in 6. If you represent the number as a sum of powers of 16 (i.e. break it into nybbles), then each term contributes to the last digit in the same way, except the one's place.
0x481A % 10 = ( 0x4 * 6 + 0x8 * 6 + 0x1 * 6 + 0xA ) % 10
Note that 6 = 5 + 1, and the 5's will cancel out if there are an even number of them. So just sum the nybbles (except the last one) and add 5 if the result is odd.
0x481A % 10 = ( 0x4 + 0x8 + 0x1 /* sum = 13 */
+ 5 /* so add 5 */ + 0xA /* and the one's place */ ) % 10
= 28 % 10
This reduces the 16-bit, 4-nybble modulo to a number of at most 0xF * 4 + 5 = 65. In binary, that annoyingly still spans two nybbles, so you would need to repeat the reduction (although the upper nybble is at most 0x4, so it barely counts).
But the 286 should have reasonably efficient BCD addition that you can use to perform the sum and obtain the result in one pass. (That requires converting each nybble to BCD manually; I don't know enough about the platform to say how to optimize that or whether it's problematic.)
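As a concrete (if unoptimized) C rendering of the nybble idea above, assuming a 16-bit value and finishing the now-small remainder with repeated subtraction instead of the BCD adds a real 286 version would use:
#include <stdint.h>

static uint8_t mod10_nybbles(uint16_t x) {
    // Sum the upper three nybbles; each contributes 6 per unit mod 10, and the 5s cancel in pairs.
    unsigned s = ((x >> 4) & 0xF) + ((x >> 8) & 0xF) + ((x >> 12) & 0xF);
    unsigned t = s + ((s & 1) ? 5 : 0) + (x & 0xF);  // at most 45 + 5 + 15 = 65
    while (t >= 10)                                  // finish the small remainder by hand
        t -= 10;
    return (uint8_t)t;
}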