SSE instructions: which CPUs can do atomic 16B memory operations? - concurrency

Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.
The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.
The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer - PDF document lookup, brute force testing, math proof, or whatever other method you used to determine the answer.
This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html
Update:
I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.
// Compile with:
// gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.
#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;
unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));

void* thread1(void *arg) {
  for (int i=0; i<100*1000*1000; i++) {
    int mask = _mm_movemask_ps((__m128)x);
    n1[mask]++;
    x = (v4si){0,0,0,0};
  }
  return NULL;
}

void* thread2(void *arg) {
  for (int i=0; i<100*1000*1000; i++) {
    int mask = _mm_movemask_ps((__m128)x);
    n2[mask]++;
    x = (v4si){-1,-1,-1,-1};
  }
  return NULL;
}

int main() {
  // Check memory alignment
  if ( (((uintptr_t)&x) & 0x0f) != 0 )
    abort();
  memset(n1, 0, sizeof(n1));
  memset(n2, 0, sizeof(n2));
  pthread_t t1, t2;
  pthread_create(&t1, NULL, thread1, NULL);
  pthread_create(&t2, NULL, thread2, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  for (unsigned i=0; i<16; i++) {
    for (int j=3; j>=0; j--)
      printf("%d", (i>>j)&1);
    printf(" %10u %10u", n1[i], n2[i]);
    if (i>0 && i<0x0f) {
      if (n1[i] || n2[i])
        printf(" Not a single memory access!");
    }
    printf("\n");
  }
  return 0;
}
The CPU I have in my notebook is a Core Duo (not Core2). This particular CPU fails the test; it implements 16-byte memory reads/writes with a granularity of 8 bytes. The output is:
0000 96905702 10512
0001 0 0
0010 0 0
0011 22 12924 Not a single memory access!
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 3092557 1175 Not a single memory access!
1101 0 0
1110 0 0
1111 1719 99975389

In the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.1.1 that:
The Intel486 processor (and newer processors since) guarantees that
the following basic memory operations will always be carried out
atomically:
Reading or writing a byte.
Reading or writing a word aligned on a 16-bit boundary.
Reading or writing a doubleword aligned on a 32-bit boundary.
The Pentium processor (and newer processors since) guarantees that the
following additional memory operations will always be carried out
atomically:
Reading or writing a quadword aligned on a 64-bit boundary.
16-bit accesses to uncached memory locations that fit within a 32-bit data bus.
The P6 family processors (and newer processors since) guarantee that
the following additional memory operation will always be carried out
atomically:
Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.
Processors that enumerate support for Intel® AVX (by setting the
feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte
memory operations performed by the following instructions will always
be carried out atomically:
MOVAPD, MOVAPS, and MOVDQA.
VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).
(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)
Each of the writes x = (v4si){0,0,0,0} and x = (v4si){-1,-1,-1,-1} is probably compiled into a single 16-byte MOVAPS. The address of x is 16-byte aligned. On an Intel processor that supports AVX, these writes are atomic. Otherwise, they are not atomic.
On AMD processors, AMD64 Architecture Programmer's Manual, Section 3.9.1.3 states that
Single load or store operations (from instructions that do just a single load or store) are naturally
atomic on any AMD64 processor as long as they do not cross an aligned 8-byte boundary. Accesses up
to eight bytes in size which do cross such a boundary may be performed atomically using certain
instructions with a lock prefix, such as XCHG, CMPXCHG or CMPXCHG8B, as long as all such
accesses are done using the same technique. (Note that misaligned locked accesses may be subject to
heavy performance penalties.) CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-
bit mode (with certain alignment restrictions).
AMD processors thus do not guarantee that AVX instructions provide 16-byte atomicity.
On Intel processors that don't support AVX and on AMD processors, the CMPXCHG16B instruction with the LOCK prefix can be used. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
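For illustration, here is a minimal sketch of that fallback, assuming GCC/Clang on x86-64: it checks the CX16 bit (CPUID leaf 1, ECX bit 13) and emulates a 16-byte atomic load/store with a compare-exchange, which the compiler can turn into LOCK CMPXCHG16B when built with -mcx16. The helper names are mine, not from any library:

#include <cpuid.h>      // __get_cpuid (GCC/Clang)

using u128 = unsigned __int128;

static bool has_cx16() {
    unsigned a, b, c, d;
    return __get_cpuid(1, &a, &b, &c, &d) && (c & (1u << 13)); // CPUID.01H:ECX.CX16 [bit 13]
}

// 16-byte atomic load: a compare-exchange that writes back what it read.
static u128 atomic_load_16B(u128 *p) {
    u128 expected = 0;
    __atomic_compare_exchange_n(p, &expected, expected, false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;   // on failure the CAS copied the current value here
}

// 16-byte atomic store: retry the compare-exchange until it succeeds.
static void atomic_store_16B(u128 *p, u128 v) {
    u128 expected = *p;   // plain read as a starting guess
    while (!__atomic_compare_exchange_n(p, &expected, v, false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
        ;                 // 'expected' now holds the latest value; retry
}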
EDIT: Test program results
(Test program modified to increase #iterations by a factor of 10)
On a Xeon X3450 (x86-64):
0000 999998139 1572
0001 0 0
0010 0 0
0011 0 0
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 0 0
1101 0 0
1110 0 0
1111 1861 999998428
On a Xeon 5150 (32-bit):
0000 999243100 283087
0001 0 0
0010 0 0
0011 0 0
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 0 0
1101 0 0
1110 0 0
1111 756900 999716913
On an Opteron 2435 (x86-64):
0000 999995893 1901
0001 0 0
0010 0 0
0011 0 0
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 0 0
1101 0 0
1110 0 0
1111 4107 999998099
Note that the Intel Xeon X3450 and Xeon 5150 don't support AVX. The Opteron 2435 is an AMD processor.
Does this mean that Intel and/or AMD guarantee that 16 byte memory accesses are atomic on these machines? IMHO, it does not. It's not in the documentation as guaranteed architectural behavior, and thus one cannot know whether on these particular processors 16 byte memory accesses really are atomic, or whether the test program merely fails to trigger a violation for one reason or another. And thus relying on it is dangerous.
EDIT 2: How to make the test program fail
Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:
0000 999998634 5990
0001 0 0
0010 0 0
0011 0 0
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 0 1 Not a single memory access!
1101 0 0
1110 0 0
1111 1366 999994009
So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.
EDIT 3: ASM for the thread functions, on request of "GJ."
Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:
.globl thread2
.type thread2, @function
thread2:
.LFB537:
.cfi_startproc
movdqa .LC3(%rip), %xmm1
xorl %eax, %eax
.p2align 5,,24
.p2align 3
.L11:
movaps x(%rip), %xmm0
incl %eax
movaps %xmm1, x(%rip)
movmskps %xmm0, %edx
movslq %edx, %rdx
incl n2(,%rdx,4)
cmpl $1000000000, %eax
jne .L11
xorl %eax, %eax
ret
.cfi_endproc
.LFE537:
.size thread2, .-thread2
.p2align 5,,31
.globl thread1
.type thread1, @function
thread1:
.LFB536:
.cfi_startproc
pxor %xmm1, %xmm1
xorl %eax, %eax
.p2align 5,,24
.p2align 3
.L15:
movaps x(%rip), %xmm0
incl %eax
movaps %xmm1, x(%rip)
movmskps %xmm0, %edx
movslq %edx, %rdx
incl n1(,%rdx,4)
cmpl $1000000000, %eax
jne .L15
xorl %eax, %eax
ret
.cfi_endproc
and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:
.LC3:
.long -1
.long -1
.long -1
.long -1
.ident "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
.section .note.GNU-stack,"",@progbits
Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with -march=native, which makes GCC prefer MOVAPS; but it doesn't matter: if I use -march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.

The "AMD Architecture Programmer's Manual Volume 1: Application Programming" says in section 3.9.1: "CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions)."
However, there is no such comment about SSE instructions. In fact, there is a comment in 4.8.3 that the LOCK prefix "causes an invalid-opcode exception when used with 128-bit media instructions". It therefore seems pretty conclusive to me that the AMD processors do NOT guarantee atomic 128-bit accesses for SSE instructions, and the only way to do an atomic 128-bit access is to use CMPXCHG16B.
The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1" says in 8.1.1 "An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses." This is pretty conclusive that 128-bit SSE instructions are not guaranteed atomic by the ISA. Volume 2A of the Intel docs says of CMPXCHG16B: "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically."
Further, CPU manufacturers haven't published written guarantees of atomic 128b SSE operations for specific CPU models where that is the case.

Erik Rigtorp has done some experimental testing on recent Intel and AMD CPUs to look for tearing. Results at https://rigtorp.se/isatomic/. Keep in mind there's no documentation or guarantee about this behaviour, and IDK if it's possible for a custom many-socket machine using such CPUs to have less atomicity than the machines he tested on. But on current x86 CPUs (not K10), SIMD atomicity for aligned loads/stores simply scales with the width of the data path between the execution units and the L1d cache.
The x86 ISA only guarantees atomicity for things up to 8B, so that implementations are free to implement SSE / AVX support the way Pentium III / Pentium M / Core Duo does: internally data is handled in 64bit halves. A 128bit store is done as two 64bit stores. The data path to/from cache is only 64b wide in the Yonah microarchitecture (Core Duo). (Source: Agner Fog's microarch doc.)
More recent implementations do have wider data paths internally, and handle 128b instructions as a single op. Core 2 Duo (conroe/merom) was the first Intel P6-descended microarch with 128b data paths. (IDK about P4, but fortunately it's old enough to be totally irrelevant.)
This is why the OP finds that 128b ops are not atomic on Intel Core Duo (Yonah), but other posters find that they are atomic on later Intel designs, starting with Core 2 (Merom).
The diagrams on this Realworldtech writeup about Merom vs. Yonah show the 128bit path between ALU and L1 data-cache in Merom (and P4), while the low-power Yonah has a 64bit data path. The data path between L1 and L2 cache is 256b in all 3 designs.
The next jump in data path width came with Intel's Haswell, featuring 256b (32B) AVX/AVX2 loads/stores, and a 64Byte path between L1 and L2 cache. I expect that 256b loads/stores are atomic in Haswell, Broadwell, and Skylake, but I don't have one to test. I forget if Skylake again widened the paths in preparation for AVX512 in Skylake-EP (the server version), or if perhaps the initial implementation of AVX512 will be like SnB/IvB's AVX, and have 512b loads/stores occupy a load/store port for 2 cycles.
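In case anyone wants to probe that the same way the OP did for 128-bit accesses, here is a hedged sketch of a 256-bit variant of the tearing test (my own adaptation, not from the question; compile with something like g++ -O2 -mavx2 -pthread). As with the 128-bit test, and like the original it is deliberately a data race, a clean run is only evidence, not an architectural guarantee:

#include <immintrin.h>
#include <cstdio>
#include <thread>

alignas(32) static __m256i shared256;            // the contended 32-byte location
static unsigned long counts[2][256];

static void worker(int id)
{
    const __m256i mine = _mm256_set1_epi32(id ? -1 : 0);  // one thread writes 0s, the other all-1s
    for (long i = 0; i < 100000000; i++) {
        __m256i v = _mm256_load_si256(&shared256);        // 32-byte aligned load
        counts[id][_mm256_movemask_ps(_mm256_castsi256_ps(v)) & 0xff]++;
        _mm256_store_si256(&shared256, mine);             // 32-byte aligned store
        asm volatile("" ::: "memory");   // barrier: force a real load/store every iteration
    }
}

int main()
{
    std::thread t1(worker, 0), t2(worker, 1);
    t1.join();
    t2.join();
    for (int m = 1; m < 255; m++)        // any mixed sign-mask means a torn 32-byte access
        if (counts[0][m] || counts[1][m])
            std::printf("mask %02x observed: not a single 32-byte access!\n", m);
}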
As janneb points out in his excellent experimental answer, the cache-coherency protocol between sockets in a multi-core system might be narrower than what you get within a shared-last-level-cache CPU. There is no architectural requirement on atomicity for wide loads/stores, so designers are free to make them atomic within a socket but non-atomic across sockets if that's convenient. IDK how wide the inter-socket logical data path is for AMD's Bulldozer-family, or for Intel. (I say "logical", because even if the data is transferred in smaller chunks, it might not modify a cache line until it's fully received.)
Finding similar articles about AMD CPUs should allow drawing reasonable conclusions about whether 128b ops are atomic or not. Just checking instruction tables is some help:
K8 decodes movaps reg, [mem] to 2 m-ops, while K10 and bulldozer-family decode it to 1 m-op. AMD's low-power bobcat decodes it to 2 ops, while jaguar decodes 128b movaps to 1 m-op. (It supports AVX1 similar to bulldozer-family CPUs: 256b insns (even ALU ops) are split into two 128b ops. Intel SnB only splits 256b loads/stores, while having full-width ALUs.)
janneb's Opteron 2435 is a 6-core Istanbul CPU, which is part of the K10 family, so this single-m-op -> atomic conclusion appears accurate within a single socket.
Intel Silvermont does 128b loads/stores with a single uop, and a throughput of one per clock. This is the same as for integer loads/stores, so it's quite probably atomic.

There is actually a warning in the Intel Architecture Manual Vol 3A, Section 8.1.1 (May 2011), under the section on guaranteed atomic operations:
An x87 instruction or an SSE instructions that accesses data larger
than a quadword may be implemented using multiple memory accesses. If
such an instruction stores to memory, some of the accesses may
complete (writing to memory) while another causes the operation to
fault for architectural reasons (e.g. due to a page-table entry that is
marked “not present”). In this case, the effects of the completed
accesses may be visible to software even though the overall
instruction caused a fault. If TLB invalidation has been delayed (see
Section 4.10.4.4), such page faults may occur even if all accesses are
to the same page.
Thus SSE instructions are not guaranteed to be atomic, even if the underlying architecture does use a single memory access (this is one reason why memory fencing was introduced).
Combine that with this statement from the Intel Optimization Manual, Section 13.3 (April 2011)
AVX and FMA instructions do not introduce any new guaranteed atomic
memory operations.
and the fact that none of the SIMD load or store operations guarantees atomicity, we can come to the conclusion that Intel does not support any form of atomic SIMD (yet).
As an extra bit, if the access is split across a cache line or page boundary (which can happen when using instructions like movdqu that permit unaligned access), the following processors do not guarantee an atomic access, regardless of alignment, although most of them can let the external memory subsystem make such split accesses atomic (again from the Intel Architecture Manual):
Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4,
Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel
Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel
Xeon, and P6 family processors provide bus control signals that permit
external memory subsystems to make split accesses atomic; however,
nonaligned data accesses will seriously impact the performance of the
processor and should be avoided.

It looks like AMD will also specify in the next revision of their manual that aligned 16-byte loads and stores are atomic on their x86 processors which support AVX. (Source)
Apologies for late response!
We would update the AMD APM manuals in the next revision.
For all AMD architectures,
Processors that support AVX extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.
which means all 128b instructions, even the *MOVDQU instructions, are atomic if they end up being naturally aligned.
Can we extend this patch to AMD processors as well. If not, I will plan to submit the patch for stage-1!
With this, the patch that makes libatomic use vmovdqa in its implementation of __atomic_load_16 and __atomic_store_16 not only on Intel processors with AVX but also on AMD processors with AVX has landed on the master branch.
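To tie this back to portable source code, here is a hedged sketch of what that looks like from C++: the compiler/libatomic decides whether a plain 16-byte instruction is allowed (the struct and function names are mine, for illustration; with GCC you typically link with -latomic):

#include <atomic>
#include <cstdint>

struct alignas(16) Pair64 {
    uint64_t lo, hi;
};

std::atomic<Pair64> shared;     // 16 bytes, 16-byte aligned

Pair64 read_pair() {
    // May compile to VMOVDQA (on AVX CPUs), LOCK CMPXCHG16B, or a libatomic
    // call, depending on compiler version and target CPU.
    return shared.load(std::memory_order_acquire);
}

void write_pair(Pair64 v) {
    shared.store(v, std::memory_order_release);
}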

A lot of answers have been posted so far, and hence a lot of information is already available (and, as a side effect, a lot of confusion too). I would like to cite facts from the Intel manual regarding hardware-guaranteed atomic operations ...
In Intel's latest processors of the Nehalem and Sandy Bridge families, reading or writing a quadword aligned to a 64-bit boundary is guaranteed to be atomic.
Even unaligned 2-, 4-, or 8-byte reads or writes are guaranteed to be atomic provided they are to cached memory and fit within a cache line.
Having said that, the test posted in this question passes on a Sandy Bridge based Intel i5 processor.

EDIT:
In the last two days I have made several tests on my three PCs and I didn't reproduce any memory error, so I can't say anything more precise. Maybe this memory error also depends on the OS.
EDIT:
I'm programming in Delphi and not in C, but I should understand C. So I have translated the code; here are the thread procedures, where the main part is written in assembler:
procedure TThread1.Execute;
var
  n :cardinal;
const
  ConstAll0 :array[0..3] of integer =(0,0,0,0);
begin
  for n := 0 to 100000000 do
    asm
      movdqa   xmm0, dqword [x]
      movmskps eax, xmm0
      inc      dword ptr[n1 + eax *4]
      movdqu   xmm0, dqword [ConstAll0]
      movdqa   dqword [x], xmm0
    end;
end;

{ TThread2 }

procedure TThread2.Execute;
var
  n :cardinal;
const
  ConstAll1 :array[0..3] of integer =(-1,-1,-1,-1);
begin
  for n := 0 to 100000000 do
    asm
      movdqa   xmm0, dqword [x]
      movmskps eax, xmm0
      inc      dword ptr[n2 + eax *4]
      movdqu   xmm0, dqword [ConstAll1]
      movdqa   dqword [x], xmm0
    end;
end;
Result: no mistake on my quad core PC and no mistake on my dual core PC as expected!
PC with Intel Pentium4 CPU
PC with Intel Core2 Quad CPU Q6600
PC with Intel Core2 Duo CPU P8400
Can you show how the debugger sees your thread procedure code? Please...

Related

Why is memcpy with 16 byte twice as fast as memcpy with 1 byte?

I'm doing some performance measurements. When looking at memcpy, I found a curious effect with small byte sizes. In particular, the fastest byte count to use for my system is 16 bytes. Both smaller and larger sizes get slower. Here's a screenshot of my results from a larger test program.
I've minimized a complete program to reproduce the effect just for 1 and 16 bytes (note this uses MSVC-specific code to suppress inlining, to prevent the optimizer from nuking everything):
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>

using dbl_ns = std::chrono::duration<double, std::nano>;

template<size_t n>
struct memcopy_perf {
    uint8_t m_source[n]{};
    uint8_t m_target[n]{};

    __declspec(noinline) auto f() -> void
    {
        constexpr int repeats = 50'000;
        const auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < repeats; ++i)
            std::memcpy(m_target, m_source, n);
        const auto t1 = std::chrono::high_resolution_clock::now();
        std::cout << "time per memcpy: " << dbl_ns(t1 - t0).count()/repeats << " ns\n";
    }
};

int main()
{
    memcopy_perf<1>{}.f();
    memcopy_perf<16>{}.f();
    return 0;
}
I would have hand-waved away a minimum at 8 bytes (maybe because of the 64-bit register size) or at 64 bytes (cache line size). But I'm somewhat puzzled at the 16 bytes. The effect is reproducible on my system and not a fluke.
Notes: I'm aware of the intricacies of performance measurements. Yes, this is in release mode. Yes, I made sure things are not optimized away. Yes, I'm aware there are libraries for this. Yes, n is too low, etc etc. This is a minimal example. I checked the asm, it calls memcpy.
I assume that MSVC 19 is used, since there is no information about the MSVC version yet.
The benchmark is biased, so it cannot actually help to tell which one is faster, especially for such small timings.
Indeed, here is the assembly code of MSVC 19 with /O2 between the two calls to std::chrono::high_resolution_clock::now:
mov cl, BYTE PTR [esi]
add esp, 4
mov eax, 50000 ; 0000c350H
npad 6
$LL4@f:
sub eax, 1 <--- This is what you measure !
jne SHORT $LL4@f
lea eax, DWORD PTR _t1$[esp+20]
mov BYTE PTR [esi+1], cl
push eax
One can see that you measure a nearly empty loop that is completely useless and more than half the instructions are not the ones meant to be measured. The only useful instructions are certainly:
mov cl, BYTE PTR [esi]
mov BYTE PTR [esi+1], cl
Even assuming that the compiler could generate such code, the call to std::chrono::high_resolution_clock::now certainly takes several dozen nanoseconds (since the usual solution for measuring very small timings precisely is to use RDTSC and RDTSCP). This is far more than the time to execute such instructions. In fact, at such a granularity the notion of wall-clock time vanishes. Instructions are typically executed in parallel, out of order, and pipelined. At this scale one needs to consider the latency of each instruction, their reciprocal throughput, their dependencies, etc.
An alternative solution is to reimplement the benchmark so the compiler cannot optimize this. But this is pretty hard to do since copying 1 byte is nearly free on modern x86 architectures (compared to the overhead of the related instructions needed to compute the addresses, loop, etc.).
AFAIK copying 1 byte on an AMD Zen2 processor has a reciprocal throughput of ~1 cycle (1 load + 1 store scheduled on both 2 load ports and 2 store ports) assuming data is in the L1 cache. The latency to read/write a value in the L1 cache is 4-5 cycles so the latency of the copy may be 8-10 cycles. For more information about this architecture please check this.
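As a concrete illustration of that "reimplement the benchmark" idea, one common approach with GCC/Clang is an empty asm statement acting as an optimization barrier (similar in spirit to benchmark::DoNotOptimize). This is only a sketch and uses GNU inline-asm syntax, so it won't build with MSVC as-is:

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <iostream>

template <std::size_t N>
void time_memcpy()
{
    alignas(64) uint8_t src[N]{}, dst[N]{};
    constexpr int repeats = 50'000;

    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < repeats; ++i) {
        std::memcpy(dst, src, N);
        // Optimization barrier: pretend dst is read and memory may change,
        // so the copy can neither be hoisted out of the loop nor deleted.
        asm volatile("" : : "g"(dst) : "memory");
    }
    const auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << N << " bytes: "
              << std::chrono::duration<double, std::nano>(t1 - t0).count() / repeats
              << " ns per copy\n";
}

int main()
{
    time_memcpy<1>();
    time_memcpy<16>();
}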
The best way to know is to check the resulting assembly. For example, if the 16-byte version is aligned to 16, then it can use aligned load/store operations and become faster than the 4-byte / 1-byte version.
Maybe if it cannot vectorize (use a register) due to bad alignment, it falls back to doing the copy with mov operations on memory addresses instead of registers.
The memcpy implementation depends on many factors, such as:
OS: Windows 10, Linux, macOS, ...
Compiler: GCC 4.4, GCC 8.5, CL, Cygwin-GCC, Clang, ...
CPU architecture: arm, arm64, x86, x86_64, ...
...
So, on different systems, memcpy may have different performance/behavior.
Example: an SSE2, SSE3, AVX, or AVX-512 memcpy version may be implemented on one system but not on another.
In your case, I think it's mainly caused by memory alignment. And with a 1-byte copy, unaligned memory accesses always happen (more read/write cycles).

why does GDB not tab-complete mmx register name(mm0-mm7)

I use gdb info registers <tab> to see all the registers, but I don't see the MMX registers.
My CPU is a Xeon Platinum 8163, a modern Xeon CPU that supports SSE and MMX. So I think it's a gdb problem (if I am right).
Why does gdb not support showing the MMX registers, while the MMX registers should be of the same importance as the basic registers and the SSE registers?
The MMX registers don't have their own separate architectural state; they alias the x87 registers st0..st7. (Intel did this so OSes wouldn't need special support to save/restore the MMX state on context switch via FXSAVE/FXRSTOR.) That's different from all the other registers.
But I think this is a GDB bug, not an intentional decision to not expose the MMX state except via the x87 state. info reg mmx tab-completes but prints nothing. (GDB 10.1 on x86-64 Arch GNU/Linux)
Even when running a program with the FPU in MMX state (after executing movd mm0, eax for example), it still doesn't tab complete. In fact, even p $mm0 just prints void (because that GDB variable name isn't recognized as being tied to an MMX register).
You can see the MMX state via i r float
e.g. after mov eax, 231 / movd mm0, eax,
starti
stepi
si
(gdb) p $mm0
$1 = void
(gdb) i r mm0
Invalid register `mm0'
(gdb) i r mmx
(gdb) i r float
st0 <invalid float value> (raw 0xffff00000000000000e7)
st1 0 (raw 0x00000000000000000000)
...
After another single step, of pshufw mm1, mm0, 0
(gdb) si
0x000000000040100c in ?? ()
(gdb) i r float
st0 <invalid float value> (raw 0xffff00000000000000e7)
st1 <invalid float value> (raw 0xffff00e700e700e700e7)
st2 0 (raw 0x00000000000000000000)
So if you ignore the high 16 bits of the 80-bit extended precision bit-pattern, you can look at the 64-bit mantissa part as the MMX register value.
I assume this has gone unfixed for so long because SSE2 makes MMX mostly obsolete, providing more and wider registers and not needing a slow emms to leave MMX state before a potential x87 FPU instruction like fld. (And on modern CPUs like Skylake, MMX instructions don't have mov-elimination, and some run on fewer execution ports than their SSE2 equivalents, like paddd.)
Of course, some existing code, notably x264 and FFmpeg's h.264 software decoder, still use hand-written MMX asm instead of the low qword of an XMM register. This is sometimes advantageous, e.g. to allow punpcklbw mm0, [rdi].
BTW, the test program I single-stepped was assembled + linked from this NASM source into a static executable:
mov eax, 231 ; __NR_exit_group = 0xe7
movd mm0, eax
pshufw mm1, mm0, 0 ; broadcast the low word
emms
nop
syscall

Intel store instructions on delibrately overlapping memory regions

I have to store the lower 3 doubles of a YMM register into an unaligned double array of size 3 (that is, I cannot write the 4th element). But being a bit naughty, I'm wondering if the AVX intrinsic _mm256_storeu2_m128d can do the trick. I had
reg = _mm256_permute4x64_pd(reg, 0b10010100); // [0 1 1 2]
_mm256_storeu2_m128d(vec, vec + 1, reg);
and compiling by clang gives
vmovupd xmmword ptr [rsi + 8], xmm1 # reg in ymm1 after perm
vextractf128 xmmword ptr [rsi], ymm0, 1
If storeu2 had semantics like memcpy then it would most definitely trigger undefined behavior. But with the generated instructions, would this be free of race conditions (or other potential problems)?
Other ways to store YMM into size 3 arrays are welcomed as well.
There isn't really a formal spec for Intel's intrinsics, AFAIK, other than what Intel has published as documentation. e.g. their intrinsics guide. Also examples from their whitepapers and so on; e.g. examples that need to work are one way GCC/clang know they have to define __m128 with __attribute__((may_alias)).
It's all within one thread, fully synchronous, so definitely no "race condition". In your case it doesn't even matter which order the stores happen in (assuming they don't overlap with the __m256d reg object itself! That would be the equivalent of an overlapping memcpy problem.) What you're doing might be like two indeterminately sequenced memcpy to overlapping destinations: they definitely happen in one order or the other, and the compiler could pick either.
The observable difference for order of stores is performance: if you want to do a SIMD reload very soon after, then store forwarding will work better if the 16-byte reload takes its data from one 16-byte store, not the overlap of two stores.
In general overlapping stores are fine for performance, though; the store buffer will absorb them. It means one of them is unaligned, though, and crossing a cache-line boundary would be more expensive.
However, that's all moot: Intel's intrinsics guide does list an "operation" section for that compound intrinsic:
Operation
MEM[loaddr+127:loaddr] := a[127:0]
MEM[hiaddr+127:hiaddr] := a[255:128]
So it's strictly defined as low address store first (the second arg; I think you got this backwards).
And all of that is also moot because there's a more efficient way
Your way costs 1 lane-crossing shuffle + vmovups + vextractf128 [mem], ymm, 1. Depending on how it compiles, neither store can start until after the shuffle. (Although it looks like clang might have avoided that problem).
On Intel CPUs, vextractf128 [mem], ymm, imm costs 2 uops for the front-end, not micro-fused into one. (Also 2 uops on Zen for some reason.)
On AMD CPUs before Zen 2, lane-crossing shuffles are more than 1 uop, so _mm256_permute4x64_pd is more expensive than necessary.
You just want to store the low lane of the input vector, and the low element of the high lane. The cheapest shuffle is vextractf128 xmm, ymm, 1 - 1 uop / 1c latency on Zen (which splits YMM vectors into two 128-bit halves anyway). It's as cheap as any other lane-crossing shuffle on Intel.
The asm you want the compiler to make is probably this, which only requires AVX1. AVX2 doesn't have any useful instructions for this.
vextractf128 xmm1, ymm0, 1 ; single uop everywhere
vmovupd [rdi], xmm0 ; single uop everywhere
vmovsd [rdi+2*8], xmm1 ; single uop everywhere
So you want something like this, which should compile efficiently.
_mm_storeu_pd(vec, _mm256_castpd256_pd128(reg)); // low half
__m128d hi = _mm256_extractf128_pd(reg, 1);
_mm_store_sd(vec+2, hi);
// or vec[2] = _mm_cvtsd_f64(hi);
vmovlps (_mm_storel_pi) would also work, but with AVX VEX encoding it doesn't save any code size, and would require even more casting to keep compilers happy.
There's unfortunately no vpextractq [mem], ymm, only with an XMM source so that doesn't help.
Masked store:
As discussed in comments, yes you could do vmaskmovps but it's unfortunately not as efficient as we might like on all CPUs. Until AVX512 makes masked loads/stores first-class citizens, it may be best to shuffle and do 2 stores. Or pad your array / struct so you can at least temporarily step on later stuff.
Zen has 2-uop vmaskmovpd ymm loads, but very expensive vmaskmovpd stores (42 uops, 1 per 11 cycles for YMM). Or Zen+ and Zen2 are 18 or 19 uops, 6 cycle throughput. If you care at all about Zen, avoid vmaskmov.
On Intel Broadwell and earlier, vmaskmov stores are 4 uops according to Agner Fog's testing, so that's 1 more fused-domain uop than we get from shuffle + movups + movsd. But still, Haswell and later do manage 1/clock throughput so if that's a bottleneck then it beats the 2-cycle throughput of 2 stores. SnB/IvB of course take 2 cycles for a 256-bit store, even without masking.
On Skylake, vmaskmov mem, ymm, ymm is only 3 uops (Agner Fog lists 4, but his spreadsheets are hand-edited and have been wrong before; I think it's safe to assume uops.info's automated testing is right). And that makes sense: Skylake-client is basically the same core as Skylake-AVX512, just without actually enabling AVX512, so they could implement vmaskmovpd by decoding it into a test into a mask register (1 uop) + a masked store (2 more uops without micro-fusion).
So if you only care about Skylake and later, and can amortize the cost of loading a mask into a vector register (reusable for loads and stores), vmaskmovpd is actually pretty good. Same front-end cost but cheaper in the back-end: only 1 each store-address and store-data uops, instead of 2 separate stores. Note the 1/clock throughput on Haswell and later vs. the 2-cycle throughput for doing 2 separate stores.
vmaskmovpd might even store-forward efficiently to a masked reload; I think Intel mentioned something about this in their optimization manual.
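For reference, a sketch of what the masked-store alternative might look like in intrinsics (AVX1; in a loop the mask constant could be hoisted and reused). This is my illustration of the idea discussed above, not code from the question:

#include <immintrin.h>

// Store only the low 3 doubles of reg to an unaligned array of 3 doubles.
void store3_masked(double *vec, __m256d reg)
{
    // Elements whose mask element has its top bit set are stored; element 3 is skipped.
    const __m256i mask = _mm256_setr_epi64x(-1, -1, -1, 0);
    _mm256_maskstore_pd(vec, mask, reg);   // compiles to vmaskmovpd
}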

Split a number into several numbers, each with only one significant bit

Is there any efficient algorithm (or processor instruction) that will help split a number (32-bit or 64-bit) into several numbers, each of which has only one bit set?
I want to isolate each set bit in a number. For example,
input:
01100100
output:
01000000
00100000
00000100
Only comes to mind number & mask.
Assembly or C++.
Yes, in a similar way as Brian Kernighan's algorithm to count set bits, except instead of counting the bits we extract and use the lowest set bit in every intermediary result:
while (number) {
// extract lowest set bit in number
uint64_t m = number & -number;
/// use m
...
// remove lowest set bit from number
number &= number - 1;
}
In modern x64 assembly, number & -number may be compiled to blsi, and number &= number - 1 may be compiled to blsr which are both fast, so this would only take a couple of efficient instructions to implement.
Since m is available, resetting the lowest set bit may be done with number ^= m but that may make it harder for the compiler to see that it can use blsr, which is a better choice because it depends only directly on number so it shortens the loop carried dependency chain.
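If BMI1 is known to be available at compile time, the same loop can also be written directly with the corresponding intrinsics. This is only a sketch, assuming GCC/Clang with -mbmi; the portable expressions above compile to the same blsi/blsr instructions anyway:

#include <immintrin.h>
#include <cstdint>

// Calls use(m) once per set bit of number, with m having only that bit set.
template <typename F>
void for_each_set_bit(uint64_t number, F use)
{
    while (number) {
        use(_blsi_u64(number));        // blsi: isolate lowest set bit (n & -n)
        number = _blsr_u64(number);    // blsr: clear lowest set bit (n & (n-1))
    }
}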
The standard way is
while (num) {
unsigned mask = num ^ (num & (num-1)); // This will have just one bit set
...
num ^= mask;
}
for example starting with num = 2019 you will get in order
1
2
32
64
128
256
512
1024
If you are going to iterate over the single-bit-isolated masks one at a time, generating them one at a time is efficient; see #harold's answer.
But if you truly just want all the masks, x86 with AVX512F can usefully parallelize this. (At least potentially useful depending on surrounding code. More likely this is just a fun exercise in applying AVX512 and not useful for most use-cases).
The key building block is AVX512F vpcompressd : given a mask (e.g. from a SIMD compare) it will shuffle the selected dword elements to contiguous elements at the bottom of a vector.
An AVX512 ZMM / __m512i vector holds 16x 32-bit integers, so we only need 2 vectors to hold every possible single-bit mask. Our input number is a mask that selects which of those elements should be part of the output. (No need to broadcast it into a vector and vptestmd or anything like that; we can just kmov it into a mask register and use it directly.)
See also my AVX512 answer on AVX2 what is the most efficient way to pack left based on a mask?
#include <stdint.h>
#include <immintrin.h>
// suggest 64-byte alignment for out_array
// returns count of set bits = length stored
unsigned bit_isolate_avx512(uint32_t out_array[32], uint32_t x)
{
const __m512i bitmasks_lo = _mm512_set_epi32(
1UL << 15, 1UL << 14, 1UL << 13, 1UL << 12,
1UL << 11, 1UL << 10, 1UL << 9, 1UL << 8,
1UL << 7, 1UL << 6, 1UL << 5, 1UL << 4,
1UL << 3, 1UL << 2, 1UL << 1, 1UL << 0
);
const __m512i bitmasks_hi = _mm512_slli_epi32(bitmasks_lo, 16); // compilers actually do constprop and load another 64-byte constant, but this is more readable in the source.
__mmask16 set_lo = x;
__mmask16 set_hi = x>>16;
int count_lo = _mm_popcnt_u32(set_lo); // doesn't actually cost a kmov, __mmask16 is really just uint16_t
_mm512_mask_compressstoreu_epi32(out_array, set_lo, bitmasks_lo);
_mm512_mask_compressstoreu_epi32(out_array+count_lo, set_hi, bitmasks_hi);
return _mm_popcnt_u32(x);
}
Compiles nicely with clang on Godbolt, and with gcc other than a couple minor sub-optimal choices with mov, movzx, and popcnt, and making a frame pointer for no reason. (It also can compile with -march=knl; it doesn't depend on AVX512BW or DQ.)
# clang9.0 -O3 -march=skylake-avx512
bit_isolate_avx512(unsigned int*, unsigned int):
movzx ecx, si
popcnt eax, esi
shr esi, 16
popcnt edx, ecx
kmovd k1, ecx
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_0] # zmm0 = [1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192,16384,32768]
vpcompressd zmmword ptr [rdi] {k1}, zmm0
kmovd k1, esi
vmovdqa64 zmm0, zmmword ptr [rip + .LCPI0_1] # zmm0 = [65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648]
vpcompressd zmmword ptr [rdi + 4*rdx] {k1}, zmm0
vzeroupper
ret
On Skylake-AVX512, vpcompressd zmm{k1}, zmm is 2 uops for port 5. Latency from input vector -> output is 3 cycles, but latency from input mask -> output is 6 cycles. (https://www.uops.info/table.html / https://www.uops.info/html-instr/VPCOMPRESSD_ZMM_K_ZMM.html). The memory destination version is 4 uops: 2p5 + the usual store-address and store-data uops which can't micro-fuse when part of a larger instruction.
It might be better to compress into a ZMM reg and then store, at least for the first compress, to save total uops. The 2nd should probably still take advantage of the masked-store feature of vpcompressd [mem]{k1} so the output array doesn't need padding for it to step on. IDK if that helps with cache-line splits, i.e. whether masking can avoid replaying the store uop for the part with an all-zero mask in the 2nd cache line.
On KNL, vpcompressd zmm{k1} is only a single uop. Agner Fog didn't test it with a memory destination (https://agner.org/optimize/).
This is 14 fused-domain uops for the front-end on Skylake-X for the real work (e.g. after inlining into a loop over multiple x values, so we could hoist the vmovdqa64 loads out of the loop. Otherwise that's another 2 uops). So front-end bottleneck = 14 / 4 = 3.5 cycles.
Back-end port pressure: 6 uops for port 5 (2x kmov(1) + 2x vpcompressd(2)): 1 iteration per 6 cycles. (Even on IceLake (instlatx64), vpcompressd is still 2c throughput, unfortunately, so apparently ICL's extra shuffle port doesn't handle either of those uops. And kmovw k, r32 is still 1/clock, so presumably still port 5 as well.)
(Other ports are fine: popcnt runs on port 1, and that port's vector ALU is shut down when 512-bit uops are in flight. But not its scalar ALU, the only one that handles 3-cycle latency integer instructions. movzx dword, word can't be eliminated, only movzx dword, byte can do that, but it runs on any port.)
Latency: integer result is just one popcnt (3 cycles). First part of the memory result is stored about 7 cycles after the mask is ready. (kmov -> vpcompressd). The vector source for vpcompressd is a constant so OoO exec can get it ready plenty early unless it misses in cache.
Compacting the 1<<0..15 constant would be possible but probably not worth it, by building it with a shift. e.g. loading 16-byte _mm_setr_epi8(0..15) with vpmovzxbd, then using that with vpsllvd on a vector of set1(1) (which you can get from a broadcast or generate on the fly with vpternlogd+shift). But that's probably not worth it even if you're writing by hand in asm (so it's your choice instead of the compiler) since this already uses a lot of shuffles, and constant-generation would take at least 3 or 4 instructions (each of which is at least 6 bytes long; EVEX prefixes alone are 4 bytes each).
I would generate the hi part with a shift from lo, instead of loading it separately, though. Unless the surrounding code bottlenecks hard on port 0, an ALU uop isn't worse than a load uop. One 64-byte constant fills a whole cache line.
You could compress the lo constant with a vpmovzxwd load: each element fits in 16 bits. Worth considering if you can hoist that outside of a loop so it doesn't cost an extra shuffle per operation.
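A sketch of that vpmovzxwd idea, under the same AVX-512F assumptions as the function above (my illustration, not part of the original answer):

#include <immintrin.h>
#include <cstdint>

// Load the 1<<0 .. 1<<15 dword constants from a 32-byte table of 16-bit
// elements, widening with vpmovzxwd so the constant only needs half a cache line.
static inline __m512i load_bitmasks_lo()
{
    alignas(32) static const uint16_t lo16[16] = {
        1u<<0, 1u<<1, 1u<<2,  1u<<3,  1u<<4,  1u<<5,  1u<<6,  1u<<7,
        1u<<8, 1u<<9, 1u<<10, 1u<<11, 1u<<12, 1u<<13, 1u<<14, 1u<<15
    };
    return _mm512_cvtepu16_epi32(
        _mm256_load_si256(reinterpret_cast<const __m256i*>(lo16)));
}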
If you wanted the result in a SIMD vector instead of stored to memory, you could 2x vpcompressd into registers and maybe use count_lo to look up a shuffle control vector for vpermt2d. Possibly from a sliding-window on an array instead of 16x 64-byte vectors? But the result isn't guaranteed to fit in one vector unless you know your input had 16 or fewer bits set.
Things are much worse for 64-bit integers: 8x 64-bit elements means we need 8 vectors. So maybe not worth it vs. scalar, unless your inputs have lots of bits set.
You can do it in a loop, though, using vpslld by 8 to move bits in vector elements. You'd think kshiftrq would be good, but with 4 cycle latency that's a long loop-carried dep chain. And you need scalar popcnt of each 8-bit chunk anyway to adjust the pointer. So your loop should use shr / kmov and movzx / popcnt. (Using a counter += 8 and bzhi to feed popcnt would cost more uops).
The loop-carried dependencies are all short (and the loop only runs 8 iterations to cover mask 64 bits), so out-of-order exec should be able to nicely overlap work for multiple iterations. Especially if we unroll by 2 so the vector and mask dependencies can get ahead of the pointer update.
vector: vpslld immediate, starting from the vector constant
mask: shr r64, 8 starting with x. (Could stop looping when this becomes 0 after shifting out all the bits. This 1-cycle dep chain is short enough for OoO exec to zip through it and hide most of the mispredict penalty, when it happens.)
pointer: lea rdi, [rdi + rax*4] where RAX holds a popcnt result.
The rest of the work is all independent across iterations. Depending on surrounding code, we probably bottleneck on port 5 with vpcompressd shuffles and kmov.

fastest way to convert two-bit number to low-memory representation

I have a 56-bit number with potentially two set bits, e.g., 00000000 00000000 00000000 00000000 00000000 00000000 00000011. In other words, two bits are distributed among 56 bits, so that we have bin(56,2)=1540 possible combinations.
I now look for a loss-free mapping of such a 56-bit number to an 11-bit number that can encode 2048 values and therefore also 1540. Knowing the structure, this 11-bit number is enough to store the value of my low-density (of ones) 56-bit number.
I want to maximize performance (this function should run millions or even billions of times per second if possible). So far, I only came up with some loop:
int64_t inputNumber = 24; // 0b11000
int64_t bitMask = 1;
int bit1 = 0, bit2 = 0;
for (int n = 0; n < 56; ++n, bitMask *= 2)
{
    if ((inputNumber & bitMask) != 0)
    {
        if (bit1 == 0)      // first set bit found: remember its position
            bit1 = n;
        else                // second set bit found: remember it and stop
        {
            bit2 = n;
            break;
        }
    }
}
and using these two bit positions, I can easily generate a number that is at most 1540.
But is there no faster version than using such a loop?
Most ISAs have hardware support for a bit-scan instruction that finds the position of a set bit. Use that instead of a naive loop or bithack for any architecture where you care about this running fast. https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious has some tricks that are better than nothing, but those are all still much worse than a single efficient asm instruction.
But ISO C++ doesn't portably expose clz/ctz operations; it's only available via intrinsics / builtins for various implementations. (And the x86 intrinsics have quirks for all-zero input, corresponding to the asm instruction behaviour).
For some ISAs, it's a count-leading-zeros giving you 31 - highbit_index. For others, it's a CTZ count trailing zeros operation, giving you the index of the low bit. x86 has both. (And its high-bit finder actually directly finds the high-bit index, not a leading-zero count, unless you use BMI1 lzcnt instead of traditional bsr) https://en.wikipedia.org/wiki/Find_first_set has a table of what different ISAs have.
GCC portably provides __builtin_clz and __builtin_ctz; on ISAs without hardware support, they compile to a call to a helper functions. See What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C? and Implementation of __builtin_clz
(For 64-bit integers, you want the long long versions: like __builtin_ctzll GCC manual.)
If we only have a CLZ, use high=63-CLZ(n) and low= 63-CLZ((-n) & n) to isolate the low bit. Note that x86's bsr instruction actually produces 63-CLZ(), i.e. the bit-index instead of the leading-zero count. So 63-__builtin_clzll(n) can compile to a single instruction on x86; IIRC gcc does notice this. Or 2 instructions if GCC uses an extra xor-zeroing to avoid the inconvenient false dependency.
If we only have CTZ, do low = CTZ(n) and high = CTZ(n & (n - 1)) to clear the lowest set bit. (Leaving the high bit, assuming the number has exactly 2 set bits).
If we have both, low = CTZ(n) and high = 63-CLZ(n). I'm not sure what GCC does on non-x86 ISAs where they aren't both available natively. The GCC builtins are always available even when targeting HW that doesn't have it. But the internal implementation can't use the above tricks because it doesn't know there are always exactly 2 bits set.
(I wrote out the full formulas; an earlier version of this answer had CLZ and CTZ reversed in this part. I find that happens to me easily, especially when I also have to keep track of x86's bsr and bsf (bit-scan reverse and forward) and remember that those are leading and trailing, respectively.)
So if you just use both CTZ and CLZ, you might end up with slow emulation for one of them. Or fast emulation on ARM with rbit to bit-reverse for clz, which is 100% fine.
AVX512CD has SIMD VPLZCNTQ for 64-bit integers, so you could encode 2, 4, or 8x 64-bit integers in parallel with that on recent Intel CPUs. For SSSE3 or AVX2, you can build a SIMD lzcnt by using pshufb _mm_shuffle_epi8 byte-shuffle as a 4-bit LUT and combining with _mm_max_epu8. There was a recent Q&A about this but I can't find it. (It might have been for 16-bit integers only; wider requires more work.)
With this, a Skylake-X or Cascade Lake CPU could maybe compress 8x 64-bit integers per 2 or 3 clock cycles once you factor in the throughput cost of packing the results. SIMD is certainly useful for packing 12-bit or 11-bit results into a contiguous bitstream, e.g. with variable-shift instructions, if that's what you want to do with the results. At ~3 or 4GHz clock speed, that could maybe get you over 10 billion per second with a single thread. But only if the inputs come from contiguous memory. Depending what you want to do with the results, it might cost a few more cycles to do more than just pack them down to 16-bit integers. e.g. to pack into a bitstream. But SIMD should be good for that with variable-shift instructions that can line up the 11 or 12 bits from each register into the right position to OR together after shuffling.
There's a tradeoff between coding efficiency and encode performance. Using 12 bits for two 6-bit indices (of bit positions) is very simple both to compress and decompress, at least on hardware that has bit-scan instructions.
Or instead of bit-indices, one or both could be leading-zero counts, so decoding would be (1ULL << 63) >> a. 1ULL << 63 is a fixed constant that you can actually right-shift, or the compiler could turn it into a left-shift of 1ULL << (63-a), which IIRC optimizes to 1 << (-a) in assembly for ISAs like x86 where shift instructions mask the shift count (look only at the low 6 bits).
Also, 2x 12 bits is a whole number of bytes, but 11 bits only gives you a whole number of bytes every 8 outputs, if you're packing them. So indexing a bit-packed array is simpler.
0 is still a special case: maybe handle that by using all-ones bit-indices (i.e. index = bit 63, which is outside the low 56 bits). On decode/decompress, you set the 2 bit positions (1ULL<<a) | (1ULL<<b) and then & mask to clear high bits. Or bias your bit indices and have decode right shift by 1.
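Here is a short sketch of that simple 2x6-bit encode/decode in C++, using the GCC/Clang builtins discussed above and the bit-63 convention for a zero input (the function names are mine, and this is the 12-bit variant, not the tighter 11-bit packing):

#include <cstdint>

static inline uint16_t encode2(uint64_t n)       // n has at most 2 bits set
{
    if (n == 0) return (63u << 6) | 63u;         // zero: use the out-of-range index 63 twice
    unsigned low  = __builtin_ctzll(n);          // index of lowest set bit
    unsigned high = 63 - __builtin_clzll(n);     // index of highest set bit
    return (uint16_t)((low << 6) | high);        // 12 bits total
}

static inline uint64_t decode2(uint16_t packed)
{
    uint64_t v = (1ULL << (packed & 63)) | (1ULL << (packed >> 6));
    return v & ((1ULL << 56) - 1);               // mask clears the bit-63 "zero" marker
}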
If we didn't have to handle zero then a modern x86 CPU could do 1 or 2 billion encodes per second if it didn't have to do anything else. e.g. Skylake has 1 per clock throughput for bit-scan instructions and should be able to encode at 1 number per 2 clocks just bottlenecked on that. (Or maybe better with SIMD). With just 4 scalar instructions, we can get the low and high indices (64-bit tzcnt + bsr), shift by 6 bits, and OR together.1 Or on AMD, avoid bsr / bsf and manually do 63-lzcnt.
A branchy or branchless check for input == 0 to set the final result to whatever hard-coded constant (like 63, 63) should be cheap, though.
Compression on other ISAs like AArch64 is also cheap. It has clz but not ctz. Probably your best bet there is to use an intrinsic for rbit to bit-reverse a number (so clz on the bit-reversed number directly gives you the bit-index of the low bit, which is now the high bit of the reversed version). Assuming rbit is as fast as add / sub, this is cheaper than using multiple instructions to clear the low bit.
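A sketch of that AArch64 approach; hedged: I'm assuming the ACLE __rbitll intrinsic from arm_acle.h is available, and __builtin_ctzll would normally be the portable way to write this anyway:

#if defined(__aarch64__)
#include <arm_acle.h>
#include <cstdint>

// ctz via rbit + clz: after bit-reversal the lowest set bit becomes the
// highest, so its index equals the leading-zero count of the reversed value.
static inline unsigned low_bit_index(uint64_t n)   // n must be non-zero
{
    return __builtin_clzll(__rbitll(n));
}
#endif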
If you really want 11 bits then you need to avoid the redundancy of 2x 6-bit being able to have either index larger than the other. Like maybe have 6-bit a and 5-bit b, and have a<=b mean something special like b+=32. I haven't thought this through fully. You need to be able to encode 2 adjacent bits either near the top or bottom of the registers, or the 2 set bits could be as far apart as 28 bits, if we consider wrapping at the boundaries like a 56-bit rotate.
Melpomene's suggestion to isolate the low and high set bits might be useful as part of something else, but is only useful for encoding on targets where you only have one direction of bit-scan available, not both. Even so, you wouldn't actually use both expressions. Leading-zero count doesn't require you to isolate the low bit, you just need to clear it to get at the high bit.
Footnote 1: decoding on x86 is also cheap: x |= (1<<a) is 1 instruction: bts. But many compilers have missed optimizations and don't notice this, instead actually shifting a 1. bts reg, reg is 1 uop / 1 cycle latency on Intel since PPro, or sometimes 2 uops on AMD. (Only the memory destination version is slow.) https://agner.org/optimize/
Best encoding performance on AMD CPUs requires BMI1 tzcnt / lzcnt because bsr and bsf are slower (6 uops instead of 1 https://agner.org/optimize/). On Ryzen, lzcnt is 1 uop, 1c latency, 4 per clock throughput. But tzcnt is 2 uops.
With BMI1, the compiler could use blsr to clear the lowest set bit of a register (and copy it). i.e. modern x86 has an instruction for dst = (SRC-1) bitwiseAND ( SRC ); that is single-uop on Intel but 2 uops on AMD.
But with lzcnt being more efficient than tzcnt on AMD Ryzen, probably the best asm for AMD doesn't use it.
Or maybe something like this (assuming exactly 2 bits, which apparently we can do).
(This asm is what you'd like to get your compiler to emit. Don't actually use inline asm!)
Ryzen_encode_scalar: ; input in RDI, output in EAX
lzcnt rcx, rdi ; 63-high bit index
tzcnt rdx, rdi ; low bit
mov eax, 63
sub eax, ecx
shl edx, 6
or eax, edx ; (low_bit << 6) | high_bit
ret ; goes away with inlining.
Shifting the low bit-index balances the lengths of the critical path, allowing better instruction-level parallelism, if we need 63-CLZ for the high bit.
Throughput: 7 uops total, and no execution-unit bottlenecks. So at 5 uops per clock pipeline width, that's better than 1 per 2 clocks.
Skylake_encode_scalar: ; input in RDI, output in EAX
tzcnt rax, rdi ; low bit. No false dependency on Skylake. GCC will probably xor-zero RAX because there is on Broadwell and earlier.
bsr rdi, rdi ; high bit index. same,same reg avoids false dep
shl eax, 6
or eax, edi
ret ; goes away with inlining.
This has 5 cycle latency from input to output: bitscan instructions are 3 cycles on Intel vs. 1 on AMD. SHL + OR each add 1 cycle.
For throughput, we only bottleneck on one bit-scan per cycle (execution port 1), so we can do one encode per 2 cycles with 4 uops of front-end bandwidth left over for load, store, and loop overhead (or something else), assuming we have multiple independent encodes to do.
(But for the multiple independent encode case, SIMD may still be better for both AMD and Intel, if a cheap emulation of vplzcntq exists and the data is coming from memory.)
Scalar decode can be something like this:
decode: ;; input in EDI, output in RAX
xor eax, eax ; RAX=0
bts rax, rdi ; RAX |= 1ULL << (high_bit_idx & 63)
shr edi, 6 ; extract low_bit_idx
bts rax, rdi ; RAX |= 1ULL << low_bit_idx
ret
This has 3 shifts (including the bts) which on Skylake can only run on port0 or port6. So on Intel it only costs 4 uops for the front-end (so 1 per clock as part of doing something else). But if doing only this, it bottlenecks on shift throughput at 1 decode per 1.5 clock cycles.
On a 4GHz CPU, that's 2.666 billion decodes per second, so yeah we're doing pretty well hitting your targets :)
On Ryzen, bts reg,reg is 2 uops, with 0.5c throughput, but shr can run on any port. So it doesn't steal throughput from bts, and the whole thing is 6 uops (vs. Ryzen's pipeline being 5-wide at the narrowest point). So 1 decode per 1.2 clock cycles, just bottlenecked on front-end cost.
With BMI2 available, starting with a 1 in a register and using shlx rax, rbx, rdi can replace the xor-zeroing + first BTS with a single uop, assuming the 1 in a register can be reused in a loop.
(This optimization is totally dependent on your compiler to find; flag-less shifts are just more efficient ways to copy-and-shift that become available with -march=haswell or -march=znver1, or other targets that have BMI2.)
Either way you're just going to write retval = 1ULL << (packed & 63) for decoding the first bit. But if you're wondering which compilers make nice code here, this is what you're looking for.