Store four 16-bit integers with SSE intrinsics - C++

I multiply and round four 32-bit floats, then convert them to four 16-bit integers with SSE intrinsics. I'd like to store the four integer results to an array. With floats it's easy: _mm_store_ps(float_ptr, m128value). However, I haven't found any instruction to do this with 16-bit (__m64) integers.
void process(float *fptr, int16_t *sptr, __m128 factor)
{
__m128 a = _mm_load_ps(fptr);
__m128 b = _mm_mul_ps(a, factor);
__m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
__m64 s =_mm_cvtps_pi16(c);
// now store the values to sptr
}
Any help would be appreciated.

Personally I would avoid using MMX. Also, I would use an explicit store rather than an implicit one, which often only works on certain compilers. The following code works fine in MSVC 2012 with SSE 4.1.
Note that fptr needs to be 16-byte aligned. This is not a problem if you compile in 64-bit mode but in 32-bit mode you should make sure it's aligned.
#include <stdio.h>
#include <stdint.h>
#include <smmintrin.h>
void process(float *fptr, int16_t *sptr, __m128 factor)
{
__m128 a = _mm_load_ps(fptr);
__m128 b = _mm_mul_ps(a, factor);
__m128i c = _mm_cvttps_epi32(b);
__m128i d = _mm_packs_epi32(c,c);
_mm_storel_epi64((__m128i*)sptr, d);
}
int main() {
float x[] = {1.0, 2.0, 3.0, 4.0};
int16_t y[4];
__m128 factor = _mm_set1_ps(3.14159f);
process(x, y, factor);
printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]);
}
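If the 16-byte alignment of fptr can't be guaranteed (e.g. in a 32-bit build), here is a minimal sketch of an unaligned variant of the same function (the name is mine; it keeps the same truncating conversion as above, and the movq store done by _mm_storel_epi64 has no alignment requirement anyway):
void process_unaligned(float *fptr, int16_t *sptr, __m128 factor)
{
    __m128  b = _mm_mul_ps(_mm_loadu_ps(fptr), factor);   // unaligned load
    __m128i c = _mm_cvttps_epi32(b);                      // truncate toward zero, as above
    __m128i d = _mm_packs_epi32(c, c);                    // signed saturation 32 -> 16 bit
    _mm_storel_epi64((__m128i*)sptr, d);                  // store the low 4 x int16_t
}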
Note that _mm_cvtps_pi16 is not a simple intrinsic; the Intel Intrinsic Guide says "This intrinsic creates a sequence of two or more instructions, and may perform worse than a native instruction. Consider the performance impact of this intrinsic."
Here is the assembly output using the MMX version
mulps (%rdi), %xmm0
roundps $0, %xmm0, %xmm0
movaps %xmm0, %xmm1
cvtps2pi %xmm0, %mm0
movhlps %xmm0, %xmm1
cvtps2pi %xmm1, %mm1
packssdw %mm1, %mm0
movq %mm0, (%rsi)
ret
Here is the assembly output using the SSE-only version
mulps (%rdi), %xmm0
cvttps2dq %xmm0, %xmm0
packssdw %xmm0, %xmm0
movq %xmm0, (%rsi)
ret

With __m64 types, you can just cast the destination pointer appropriately:
void process(float *fptr, int16_t *sptr, __m128 factor)
{
__m128 a = _mm_load_ps(fptr);
__m128 b = _mm_mul_ps(a, factor);
__m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
__m64 s =_mm_cvtps_pi16(c);
*((__m64 *) sptr) = s;
}
There is no distinction between aligned and unaligned stores with MMX instructions like there is with SSE/AVX; therefore, you don't need the intrinsics to perform a store.

I think you're safe moving that to a general-purpose 64-bit register (long long works for both Linux's LP64 and Windows' LLP64 models) and copying it yourself.
From what I read in xmmintrin.h, gcc will handle the cast perfectly fine from __m64 to a long long.
To be sure, you can use _mm_cvtsi64_si64x.
int16_t *f = sptr;
long long b = _mm_cvtsi64_si64x(s);
f[0] = b & 0xFFFF;          // element 0 is in the low 16 bits
f[1] = (b >> 16) & 0xFFFF;
f[2] = (b >> 32) & 0xFFFF;
f[3] = (b >> 48) & 0xFFFF;
You could type-pun that with a union to make it look better, but I guess that would fall into undefined behavior.
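A small sketch of the memcpy alternative (hypothetical helper, not from the answer above): unlike a union, copying the raw bytes is well-defined, and compilers turn it into a single 8-byte store.
#include <string.h>
void store_m64(int16_t *sptr, __m64 s)
{
    memcpy(sptr, &s, sizeof(s));   // element 0 of the packed result lands in sptr[0]
}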

Related

Slow performances of CPUs set with AVX2 in handling int and double datatypes in C++

I have a strange problem with some AVX / AVX2 code that I am working on. I have set up a console test application in C++ (Visual Studio 2017 on Windows 7) with the aim of comparing routines written in plain C++ against equivalent routines written with the AVX / AVX2 instruction set; each routine is timed.
A first problem: the measured time of a single routine changes depending on where in the program it is called;
void TraditionalAVG_UncharToDouble(const unsigned char *vec1, const unsigned char *vec2, double* doubleArray, const unsigned int length) {
int sumTot = 0;
double* ptrDouble = doubleArray;
for (unsigned int packIdx = 0; packIdx < length; ++packIdx) {
*ptrDouble = ((double)(*(vec1 + packIdx) + *(vec2 + packIdx)))/ ((double)2);
ptrDouble++;
}
}
void AVG_uncharToDoubleArray(const unsigned char *vec1, const unsigned char *vec2, double* doubleArray, const unsigned int length) {
//constexpr unsigned int memoryAlignmentBytes = 32;
constexpr unsigned int bytesPerPack = 256 / 16;
unsigned int packCount = length / bytesPerPack;
double* ptrDouble = doubleArray;
__m128d divider=_mm_set1_pd(2);
for (unsigned int packIdx = 0; packIdx < packCount; ++packIdx)
{
auto x1 = _mm_loadu_si128((const __m128i*)vec1);
auto x2 = _mm_loadu_si128((const __m128i*)vec2);
unsigned char index = 0;
while(index < 8) {
index++;
auto x1lo = _mm_cvtepu8_epi64(x1);
auto x2lo = _mm_cvtepu8_epi64(x2);
__m128d x1_pd = int64_to_double_full(x1lo);
__m128d x2_pd = int64_to_double_full(x2lo);
_mm_store_pd(ptrDouble, _mm_div_pd(_mm_add_pd(x1_pd, x2_pd), divider));
ptrDouble = ptrDouble + 2;
x1 = _mm_srli_si128(x1, 2);
x2 = _mm_srli_si128(x2, 2);
}
vec1 += bytesPerPack;
vec2 += bytesPerPack;
}
for (unsigned int ii = 0 ; ii < length % packCount; ++ii)
{
*(ptrDouble + ii) = (double)(*(vec1 + ii) + *(vec2 + ii))/ (double)2;
}
}
... in main ...
timeAvg02 = 0;
Start_TimerMS();
AVG_uncharToDoubleArray(unCharArray, unCharArrayBis, doubleArray, N);
End_TimerMS(&timeAvg02);
std::cout << "AVX2_AVG UncharTodoubleArray:: " << timeAvg02 << " ms" << std::endl;
//printerDouble("AvxDouble", doubleArray, N);
std::cout << std::endl;
timeAvg01 = 0;
Start_TimerMS3();
TraditionalAVG_UncharToDouble(unCharArray, unCharArrayBis, doubleArray, N);
End_TimerMS3(&timeAvg01);
std::cout << "Traditional_AVG UncharTodoubleArray: " << timeAvg01 << " ms" << std::endl;
//printerDouble("TraditionalAvgDouble", doubleArray, N);
std::cout << std::endl;
The second problem is that the routines written with AVX2 intrinsics are slower than the routines written in plain C++. The images represent the results of the two tests.
How can I overcome this strange behavior? What is the reason behind it?
MSVC doesn't optimize intrinsics (much), so you get an actual vdivpd by 2.0, not a multiply by 0.5. That's a worse bottleneck than scalar, less than one element per clock cycle on most CPUs. (e.g. Skylake / Ice Lake / Alder Lake-P: 4 cycle throughput for vdivpd xmm, or 8 cycles for vdivpd ymm, either way 2 cycles per element. https://uops.info)
From Godbolt, with MSVC 19.33 -O2 -arch:AVX2, with a version that compiles (replacing your undefined int64_to_double_full with efficient 32-bit conversion). Your version is probably even worse.
$LL5#AVG_unchar:
vpmovzxbd xmm0, xmm5
vpmovzxbd xmm1, xmm4
vcvtdq2pd xmm3, xmm0
vcvtdq2pd xmm2, xmm1
vaddpd xmm0, xmm3, xmm2
vdivpd xmm3, xmm0, xmm6 ;; performance disaster
vmovupd XMMWORD PTR [r8], xmm3
add r8, 16
vpsrldq xmm4, xmm4, 2
vpsrldq xmm5, xmm5, 2
sub rax, 1
jne SHORT $LL5#AVG_unchar
Also, AVX2 implies support for 256-bit integer as well as FP vectors, so you can use __m256i. Although with this shift strategy for using the chars of a vector, you wouldn't want to. You'd just want to use __m256d.
Look at how clang vectorizes the scalar C++: https://godbolt.org/z/Yzze98qnY 2x vpmovzxbd-load of __m128i / vpaddd __m128i / vcvtdq2pd to __m256d / vmulpd __m256d (by 0.5) / vmovupd. (Narrow loads as a memory source for vpmovzxbd are good, especially with an XMM destination so they can micro-fuse on Intel CPUs. Writing this with intrinsics relies on compilers optimizing _mm_loadu_si32 into a memory source for _mm_cvtepu8_epi32. Looping to use all bytes of a wider load isn't crazy, but costs more shuffles. clang unrolls that loop, replacing later vpsrldq / vpmovzxbd with vpshufb shuffles to move bytes directly to where they're needed, at the cost of needing more constants.)
IDK what's wrong with MSVC, why it failed to auto-vectorize with -O2 -arch:AVX2, but at least it optimized /2.0 to *0.5. When the reciprocal is exactly representable as a double, that's a well-known safe and valuable optimization.
With a good compiler, there'd be no need for intrinsics. But "good" seems to only include clang; GCC makes a bit of a mess with converting vector widths.
Your scalar C is strangely obfuscated as *ptrDouble = ((double)(*(vec1 + packIdx) + *(vec2 + packIdx)))/ ((double)2); instead of
(vec1[packIdx] + vec2[packIdx]) / 2.0.
Doing integer addition like this scalar code before conversion to FP is a good idea, especially for a vectorized version, so there's only one conversion. Each input already needs to get widened separately to 32-bit elements.
IDK what int64_to_double_full is, but if it's manual emulation of AVX-512 vcvtqq2pd, it makes no sense to use it on values zero-extended from char. That value-range fits comfortably in int32_t, so you can widen only to 32-bit elements, and let hardware packed int->FP conversion with _mm256_cvtepi32_pd (vcvtdq2pd) widen the elements.
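A sketch of what that looks like with intrinsics (my illustration of the approach described above, not the asker's code; the function name and loop structure are mine): widen 8-bit values to 32-bit, add as integers, convert once to double, and multiply by 0.5 instead of dividing by 2.0.
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
void avg_u8_to_double(const unsigned char *vec1, const unsigned char *vec2,
                      double *out, unsigned length)
{
    const __m256d half = _mm256_set1_pd(0.5);
    unsigned i = 0;
    for (; i + 4 <= length; i += 4) {
        uint32_t w1, w2;                 // 4-byte chunks; memcpy avoids aliasing problems
        memcpy(&w1, vec1 + i, 4);
        memcpy(&w2, vec2 + i, 4);
        __m128i a = _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)w1));   // zero-extend to 4 x int32
        __m128i b = _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)w2));
        __m128i sum = _mm_add_epi32(a, b);                           // integer add before conversion
        __m256d d = _mm256_cvtepi32_pd(sum);                         // one int -> double conversion
        _mm256_storeu_pd(out + i, _mm256_mul_pd(d, half));           // *0.5, not /2.0
    }
    for (; i < length; ++i)                                          // scalar tail
        out[i] = (vec1[i] + vec2[i]) * 0.5;
}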

Efficient overflow-immune arithmetic mean in C/C++

The arithmetic mean of two unsigned integers is defined as:
mean = (a+b)/2
Directly implementing this in C/C++ may overflow and produce a wrong result. A correct implementation would avoid this. One way of coding it could be:
mean = a/2 + b/2 + (a%2 + b%2)/2
But this produces rather a lot of code with typical compilers. In assembler, this usually can be done much more efficiently. For example, the x86 can do this in the following way (assembler pseudo code, I hope you get the point):
ADD a,b ; addition, leaving the overflow condition in the carry bit
RCR a,1 ; rotate right through carry, effectively a division by 2
After those two instructions, the result is in a, and the remainder of the division is in the carry bit. If correct rounding is desired, a third ADC instruction would have to add the carry into the result.
Note that the RCR instruction is used, which rotates a register through the carry. In our case, it is a rotate by one position, so that the previous carry becomes the most significant bit in the register, and the new carry holds the previous LSB from the register. It seems that MSVC doesn't even offer an intrinsic for this instruction.
Is there a known C/C++ pattern that can be expected to be recognized by an optimizing compiler so that it produces such efficient code? Or, more generally, is there a rational way how to program in C/C++ source level so that the carry bit is being used by the compiler to optimize the generated code?
EDIT:
A 1-hour lecture about std::midpoint: https://www.youtube.com/watch?v=sBtAGxBh-XI
Wow!
EDIT2: Great discussion on Microsoft blog
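(For reference, a minimal use of the std::midpoint mentioned above, which is specified to be overflow-free; C++20, header <numeric>:)
#include <numeric>
#include <cstdint>
uint32_t mean(uint32_t a, uint32_t b)
{
    return std::midpoint(a, b);   // overflow-free; rounds towards a when the sum is odd
}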
The following method avoids overflow and should result in fairly efficient assembly (example) without depending on non-standard features:
mean = (a&b) + (a^b)/2;
This works because a + b == 2*(a & b) + (a ^ b): the AND holds the bits both operands have in common (which count twice in the sum), and the XOR holds the differing bits (which count once), so halving the XOR part first cannot overflow.
There are three typical methods to compute average without overflow, one of which is limited to uint32_t (on 64-bit architectures).
// average "SWAR" / Montgomery
uint32_t avg(uint32_t a, uint32_t b) {
return (a & b) + ((a ^ b) >> 1);
}
// in case the relative magnitudes are known
uint32_t avg2(uint32_t min, uint32_t max) {
return min + (max - min) / 2;
}
// in case the relative magnitudes are not known
uint32_t avg2_constrained(uint32_t a, uint32_t b) {
return a + (int32_t)(b - a) / 2;
}
// average increase width (not applicable to uint64_t)
uint32_t avg3(uint32_t a, uint32_t b) {
return ((uint64_t)a + b) >> 1;
}
The corresponding assembler sequences from clang on two architectures (x86-64, then ARM64) are
avg(unsigned int, unsigned int)
mov eax, esi
and eax, edi
xor esi, edi
shr esi
add eax, esi
avg2(unsigned int, unsigned int)
sub esi, edi
shr esi
lea eax, [rsi + rdi]
avg3(unsigned int, unsigned int)
mov ecx, edi
mov eax, esi
add rax, rcx
shr rax
vs.
avg(unsigned int, unsigned int)
and w8, w1, w0
eor w9, w1, w0
add w0, w8, w9, lsr #1
ret
avg2(unsigned int, unsigned int)
sub w8, w1, w0
add w0, w0, w8, lsr #1
ret
avg3(unsigned int, unsigned int):
mov w8, w1
add x8, x8, w0, uxtw
lsr x0, x8, #1
ret
Out of these three versions, avg2 would perform as well on ARM64 as the optimal sequence using the carry flag -- and avg3 would likely perform as well too, noting that the mov w8, w1 is only there to clear the top 32 bits, which may be unnecessary given that the compiler knows they are already cleared by whatever previous instruction produced the value.
A similar statement can be made about the Intel version of avg3, which would in the optimal case compile to just the two meaningful instructions:
add rax, rcx
shr rax
See https://godbolt.org/z/5TMd3zr81 for online comparison.
The "SWAR"/Montgomery version is typically only justified, when trying to compute multiple averages packed to a single (large) integer in which case the full formula contains masking with the bit positions of the highest bits: return (a & b) + (((a ^ b) >> 1) & ~kH;.

Convert 16 bits mask to 16 bytes mask

Is there any way to convert the following code:
int mask16 = 0b1010101010101010; // int or short, signed or unsigned, it does not matter
to
__uint128_t mask128 = ((__uint128_t)0x0100010001000100 << 64) | 0x0100010001000100;
So to be extra clear something like:
int mask16 = 0b1010101010101010;
__uint128_t mask128 = intrinsic_bits_to_bytes(mask16);
or by applying directly the mask:
int mask16 = 0b1010101010101010;
__uint128_t v = ((__uint128_t)0x2828282828282828 << 64) | 0x2828282828282828;
__uint128_t w = intrinsic_bits_to_bytes_mask(v, mask16); // w = ((__uint128_t)0x2928292829282928 << 64) | 0x2928292829282928;
Bit/byte order: Unless noted, these follow the question, putting the LSB of the uint16_t in the least significant byte of the __uint128_t (lowest memory address on little-endian x86). This is what you want for an ASCII dump of a bitmap for example, but it's opposite of place-value printing order for the base-2 representation of a single 16-bit number.
The discussion of efficiently getting values (back) into RDX:RAX integer registers has no relevance for most normal use-cases since you'd just store to memory from vector registers, whether that's 0/1 byte integers or ASCII '0'/'1' digits (which you can get most efficiently without ever having 0/1 integers in a __m128i, let alone in an unsigned __int128).
Table of contents:
SSE2 / SSSE3 version: good if you want the result in a vector, e.g. for storing a char array.
(SSE2 NASM version, shuffling into MSB-first printing order and converting to ASCII.)
BMI2 pdep: good for scalar unsigned __int128 on Intel CPUs with BMI2, if you're going to make use of the result in scalar registers. Slow on AMD.
Pure C++ with a multiply bithack: pretty reasonable for scalar
AVX-512: AVX-512 has masking as a first-class operation using scalar bitmaps. Possibly not as good as BMI2 pdep if you're using the result as scalar halves, otherwise even better than SSSE3.
AVX2 printing order (MSB at lowest address) dump of a 32-bit integer.
See also is there an inverse instruction to the movemask instruction in intel avx2? for other variations on element size and mask width. (SSE2 and multiply bithack were adapted from answers linked from that collection.)
With SSE2 (preferably SSSE3)
See @aqrit's How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD answer
Adapting that to work with 16 bits -> 16 bytes, we need a shuffle that replicates the first byte of the mask to the first 8 bytes of the vector, and the 2nd mask byte to the high 8 vector bytes. That's doable with one SSSE3 pshufb, or with punpcklbw same,same + punpcklwd same,same + punpckldq same,same to finally duplicate things up to two 64-bit qwords.
typedef unsigned __int128 u128;
u128 mask_to_u128_SSSE3(unsigned bitmap)
{
const __m128i shuffle = _mm_setr_epi32(0,0, 0x01010101, 0x01010101);
__m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap), shuffle); // SSSE3 pshufb
const __m128i bitselect = _mm_setr_epi8(
1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );
v = _mm_and_si128(v, bitselect);
v = _mm_min_epu8(v, _mm_set1_epi8(1)); // non-zero -> 1 : 0 -> 0
// return v; // if you want a SIMD vector result
alignas(16) u128 tmp;
_mm_store_si128((__m128i*)&tmp, v);
return tmp; // optimizes to movq / pextrq (with SSE4)
}
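For reference, a sketch of the SSE2-only shuffle mentioned above (pure punpck, no pshufb); everything after the byte duplication is the same as in the SSSE3 version:
u128 mask_to_u128_SSE2(unsigned bitmap)
{
    __m128i v = _mm_cvtsi32_si128(bitmap);
    v = _mm_unpacklo_epi8(v, v);       // punpcklbw: b0 b0 b1 b1 ...
    v = _mm_unpacklo_epi16(v, v);      // punpcklwd: b0 b0 b0 b0 b1 b1 b1 b1 ...
    v = _mm_unpacklo_epi32(v, v);      // punpckldq: each mask byte fills one 64-bit half
    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );
    v = _mm_and_si128(v, bitselect);
    v = _mm_min_epu8(v, _mm_set1_epi8(1));   // non-zero -> 1 : 0 -> 0
    alignas(16) u128 tmp;
    _mm_store_si128((__m128i*)&tmp, v);
    return tmp;
}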
(To get 0 / 0xFF instead of 0 / 1, replace _mm_min_epu8 with v= _mm_cmpeq_epi8(v, bitselect). If you want a string of ASCII '0' / '1' characters, do cmpeq and _mm_sub_epi8(_mm_set1_epi8('0'), v). That avoids the set1(1) vector constant.)
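Spelled out as a sketch (same SSSE3 approach as above, returning the vector of ASCII digits directly; illustrative, not separately tested):
__m128i mask_to_ascii_SSSE3(unsigned bitmap)
{
    const __m128i shuffle = _mm_setr_epi32(0,0, 0x01010101, 0x01010101);
    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );
    __m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap), shuffle);
    v = _mm_and_si128(v, bitselect);
    v = _mm_cmpeq_epi8(v, bitselect);                  // 0 or -1 per byte
    return _mm_sub_epi8(_mm_set1_epi8('0'), v);        // '0' - (-1) = '1'
}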
Godbolt including test-cases. (For this and other non-AVX-512 versions.)
# clang -O3 for Skylake
mask_to_u128_SSSE3(unsigned int):
vmovd xmm0, edi # _mm_cvtsi32_si128
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI2_0] # xmm0 = xmm0[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1] # 1<<0, 1<<1, etc.
vpminub xmm0, xmm0, xmmword ptr [rip + .LCPI2_2] # set1_epi8(1)
# done here if you return __m128i v or store the u128 to memory
vmovq rax, xmm0
vpextrq rdx, xmm0, 1
ret
BMI2 pdep: good on Intel, bad on AMD
BMI2 pdep is fast on Intel CPUs that have it (since Haswell), but very slow on AMD (over a dozen uops, high latency.)
typedef unsigned __int128 u128;
inline u128 assemble_halves(uint64_t lo, uint64_t hi) {
return ((u128)hi << 64) | lo; }
// could replace this with __m128i using _mm_set_epi64x(hi, lo) to see how that compiles
#ifdef __BMI2__
#include <immintrin.h>
auto mask_to_u128_bmi2(unsigned bitmap) {
// fast on Intel, slow on AMD
uint64_t tobytes = 0x0101010101010101ULL;
uint64_t lo = _pdep_u64(bitmap, tobytes);
uint64_t hi = _pdep_u64(bitmap>>8, tobytes);
return assemble_halves(lo, hi);
}
#endif  // __BMI2__
Good if you want the result in scalar registers (not one vector) otherwise probably prefer the SSSE3 way.
# clang -O3
mask_to_u128_bmi2(unsigned int):
movabs rcx, 72340172838076673 # 0x0101010101010101
pdep rax, rdi, rcx
shr edi, 8
pdep rdx, rdi, rcx
ret
# returns in RDX:RAX
Portable C++ with a magic multiply bithack
Not bad on x86-64; AMD since Zen has fast 64-bit multiply, and Intel's had that since Nehalem. Some low-power CPUs still have slowish imul r64, r64
This version may be optimal for __uint128_t results, at least for latency on Intel without BMI2, and on AMD, since it avoids a round-trip to XMM registers. But for throughput it's quite a few instructions
See @phuclv's answer on How to create a byte out of 8 bool values (and vice versa)? for an explanation of the multiply, and for the reverse direction. Use the algorithm from unpack8bools once for each 8-bit half of your mask.
//#include <endian.h> // glibc / BSD
auto mask_to_u128_magic_mul(uint32_t bitmap) {
//uint64_t MAGIC = htobe64(0x0102040810204080ULL); // For MSB-first printing order in a char array after memcpy. 0x8040201008040201ULL on little-endian.
uint64_t MAGIC = 0x0102040810204080ULL; // LSB -> LSB of the u128, regardless of memory order
uint64_t MASK = 0x0101010101010101ULL;
uint64_t lo = ((MAGIC*(uint8_t)bitmap) ) >> 7;
uint64_t hi = ((MAGIC*(bitmap>>8)) ) >> 7;
return assemble_halves(lo & MASK, hi & MASK);
}
If you're going to store the __uint128_t to memory with memcpy, you might want to control for host endianness by using htole64(0x0102040810204080ULL); (from GNU / BSD <endian.h>) or equivalent to always map the low bit of input to the lowest byte of output, i.e. to the first element of a char or bool array. Or htobe64 for the other order, e.g. for printing. Using that function on a constant instead of the variable data allows constant-propagation at compile time.
Otherwise, if you truly want a 128-bit integer whose low bit matches the low bit of the u16 input, the multiplier constant is independent of host endianness; there's no byte access to wider types.
clang 12.0 -O3 for x86-64:
mask_to_u128_magic_mul(unsigned int):
movzx eax, dil
movabs rdx, 72624976668147840 # 0x0102040810204080
imul rax, rdx
shr rax, 7
shr edi, 8
imul rdx, rdi
shr rdx, 7
movabs rcx, 72340172838076673 # 0x0101010101010101
and rax, rcx
and rdx, rcx
ret
AVX-512
This is easy with AVX-512BW; you can use the mask for a zero-masked load from a repeated 0x01 constant.
__m128i bits_to_bytes_avx512bw(unsigned mask16) {
return _mm_maskz_mov_epi8(mask16, _mm_set1_epi8(1));
// alignas(16) unsigned __int128 tmp;
// _mm_store_si128((__m128i*)&u128, v); // should optimize into vmovq / vpextrq
// return tmp;
}
Or avoid a memory constant (because compilers can do set1(-1) with just a vpcmpeqd xmm0,xmm0): Do a zero-masked absolute-value of -1. The constant setup can be hoisted, same as with set1(1).
__m128i bits_to_bytes_avx512bw_noconst(unsigned mask16) {
__m128i ones = _mm_set1_epi8(-1); // extra instruction *off* the critical path
return _mm_maskz_abs_epi8(mask16, ones);
}
But note that if doing further vector stuff, the result of maskz_mov might be able to optimize into other operations. For example vec += maskz_mov could optimize into a merge-masked add. But if not, vmovdqu8 xmm{k}{z}, xmm needs an ALU port like vpabsb xmm{k}{z}, xmm, but vpabsb can't run on port 5 on Skylake/Ice Lake. (A zero-masked vpsubb from a zeroed register would avoid that possible throughput problem, but then you'd be setting up 2 registers just to avoid loading a constant. In hand-written asm, you'd just materialize set1(1) using vpcmpeqd / vpabsb yourself if you wanted to avoid a 4-byte broadcast-load of a constant.)
(Godbolt compiler explorer with gcc and clang -O3 -march=skylake-avx512. Clang sees through the masked vpabsb and compiles it the same as the first version, with a memory constant.)
Even better if you can use a vector 0 / -1 instead of 0 / 1: use return _mm_movm_epi8(mask16). Compiles to just kmovd k0, edi / vpmovm2b xmm0, k0
If you want a vector of ASCII characters like '0' or '1', you could use _mm_mask_blend_epi8(mask, ones, zeroes). (That should be more efficient than a merge-masked add into a vector of set1(1) which would require an extra register copy, and also better than sub between set1('0') and _mm_movm_epi8(mask16) which would require 2 instructions: one to turn the mask into a vector, and a separate vpsubb.)
AVX2 with bits in printing order (MSB at lowest address), bytes in mem order, as ASCII '0' / '1'
With [] delimiters and \t tabs like this output format, from this codereview Q&A:
[01000000] [01000010] [00001111] [00000000]
Obviously if you want all 16 or 32 ASCII digits contiguous, that's easier and doesn't require shuffling the output to store each 8-byte chunk separately. Most of the reason for posting it here is that it has the shuffle and mask constants in the right order for printing, and to show a version optimized for ASCII output after it turned out that's what the question really wanted.
Using How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?, basically a 256-bit version of the SSSE3 code.
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>
#include <string.h>
// https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb
void binary_dump_4B_avx2(const void *input)
{
char buf[CHAR_BIT*4 + 2*4 + 3 + 1 + 1]; // bits, 4x [], 3x \t, \n, 0
buf[0] = '[';
for (int i=9 ; i<sizeof(buf) - 8; i+=11){ // GCC strangely doesn't unroll this loop
memcpy(&buf[i], "]\t[", 4); // 4-byte store as a single; we overlap the 0 later
}
__m256i v = _mm256_castps_si256(_mm256_broadcast_ss(input)); // aliasing-safe load; use _mm256_set1_epi32 if you know you have an int
const __m256i shuffle = _mm256_setr_epi64x(0x0000000000000000, // low byte first, bytes in little-endian memory order
0x0101010101010101, 0x0202020202020202, 0x0303030303030303);
v = _mm256_shuffle_epi8(v, shuffle);
// __m256i bit_mask = _mm256_set1_epi64x(0x8040201008040201); // low bits to low bytes
__m256i bit_mask = _mm256_set1_epi64x(0x0102040810204080); // MSB to lowest byte; printing order
v = _mm256_and_si256(v, bit_mask); // x & mask == mask
// v = _mm256_cmpeq_epi8(v, _mm256_setzero_si256()); // -1 / 0 bytes
// v = _mm256_add_epi8(v, _mm256_set1_epi8('1')); // '0' / '1' bytes
v = _mm256_cmpeq_epi8(v, bit_mask); // 0 / -1 bytes
v = _mm256_sub_epi8(_mm256_set1_epi8('0'), v); // '0' / '1' bytes
__m128i lo = _mm256_castsi256_si128(v);
_mm_storeu_si64(buf+1, lo);
_mm_storeh_pi((__m64*)&buf[1+8+3], _mm_castsi128_ps(lo));
// TODO?: shuffle first and last bytes into the high lane initially to allow 16-byte vextracti128 stores, with later stores overlapping to replace garbage.
__m128i hi = _mm256_extracti128_si256(v, 1);
_mm_storeu_si64(buf+1+11*2, hi);
_mm_storeh_pi((__m64*)&buf[1+11*3], _mm_castsi128_ps(hi));
// buf[32 + 2*4 + 3] = '\n';
// buf[32 + 2*4 + 3 + 1] = '\0';
// fputs
memcpy(&buf[32 + 2*4 + 2], "]", 2); // including '\0'
puts(buf); // appends a newline
// appending our own newline and using fputs or fwrite is probably more efficient.
}
void binary_dump(const void *input, size_t bytecount) {
}
// not shown: portable version, see Godbolt, or my or @chux's answer on the codereview question
int main(void)
{
int t = 1000000;
binary_dump_4B_avx2(&t);
binary_dump(&t, sizeof(t));
t++;
binary_dump_4B_avx2(&t);
binary_dump(&t, sizeof(t));
}
Runnable Godbolt demo with gcc -O3 -march=haswell.
Note that GCC10.3 and earlier are dumb and duplicate the AND/CMPEQ vector constant, once as bytes and once as qwords. (In that case, comparing against zero would be better, or using OR with an inverted mask and comparing against all-ones). GCC11.1 fixes that with a .set .LC1,.LC2, but still loads it twice, as memory operands instead of loading once into a register. Clang doesn't have either of these problems.
Fun fact: clang -march=icelake-client manages to turn the 2nd part of this into an AVX-512 masked blend between '0' and '1' vectors, but instead of just kmov it uses a broadcast-load, vpermb byte shuffle, then test-into-mask with the bitmask.
For each bit in the mask, you want to move a bit at position n to the low-order bit of the byte at position n, i.e. bit position 8 * n. You can do this with a loop:
__uint128_t intrinsic_bits_to_bytes(uint16_t mask)
{
int i;
__uint128_t result = 0;
for (i=0; i<16; i++) {
result |= (__uint128_t )((mask >> i) & 1) << (8 * i);
}
return result;
}
If you can use AVX512, you can do it in one instruction, no loop:
#include <immintrin.h>
__m128i intrinsic_bits_to_bytes(uint16_t mask16) {
    const __m128i zeroes = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi8(1);
    // the blend picks from the third operand where a mask bit is set, so the ones go there
    return _mm_mask_blend_epi8(mask16, zeroes, ones);
}
For building with gcc, I use:
g++ -std=c++11 -march=native -O3 src.cpp -pthread
This will build OK, but if your processor doesn't support AVX512, it will throw an illegal instruction at run time.

Producing good add with carry code from clang

I'm trying to produce code (currently using clang++-3.8) that adds two numbers consisting of multiple machine words. To simplify things for the moment I'm only adding 128-bit numbers, but I'd like to be able to generalise this.
First some typedefs:
typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;
And a "result" type:
struct Result
{
unsigned_word lo;
unsigned_word hi;
};
The first function, f, takes two pairs of unsigned words and returns a result; as an intermediate step it puts each pair of 64-bit words into a 128-bit word before adding, like so:
Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
unsigned_128 r1 = n1 + n2;
x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
x.hi = r1 >> 64;
return x;
}
This actually gets inlined quite nicely like so:
movq 8(%rsp), %rsi
movq (%rsp), %rbx
addq 24(%rsp), %rsi
adcq 16(%rsp), %rbx
Now, instead I've written a simpler function using the clang multi-precision primitives, as below:
static Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
return x;
}
This produces the following assembly:
movq 24(%rsp), %rsi
movq (%rsp), %rbx
addq 16(%rsp), %rbx
addq 8(%rsp), %rsi
adcq $0, %rbx
In this case, there's an extra add. Instead of doing an ordinary add on the lo-words, then an adc on the hi-words, it just adds the hi-words, then adds the lo-words, then does an adc on the hi-word again with an argument of zero.
This may not look too bad, but when you try this with larger words (say 192bit, 256bit) you soon get a mess of ors and other instructions dealing with the carries up the chain, instead of a simple chain of add, adc, adc, ... adc.
The multi-precision primitives seem to be doing a terrible job at exactly what they're intended to do.
So what I'm looking for is code that I could generalise to any length (no need to actually do it, just enough so I can work out how to), where clang produces the additions in a manner that is as efficient as what it does with its built-in 128-bit type (which unfortunately I can't easily generalise). I presume this should just be a chain of adcs, but I'm welcome to arguments and code that it should be something else.
There is an intrinsic to do this: _addcarry_u64. However, only Visual Studio and ICC (at least VS 2013 and 2015 and ICC 13 and ICC 15) do this efficiently. Clang 3.7 and GCC 5.2 still don't produce efficient code with this intrinsic.
Clang in addition has a built-in which one would think does this, __builtin_addcll, but it does not produce efficient code either.
The reason Visual Studio does this is that it does not allow inline assembly in 64-bit mode so the compiler should provide a way to do this with an intrinsic (though Microsoft took their time implementing this).
Therefore, with Visual Studio use _addcarry_u64. With ICC use _addcarry_u64 or inline assembly. With Clang and GCC use inline assembly.
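For the 128-bit case from the question, a minimal sketch with that intrinsic looks like this (illustrative only; it reuses the question's Result and unsigned_word types):
#include <immintrin.h>
static Result g_addcarry(unsigned_word lo1, unsigned_word hi1,
                         unsigned_word lo2, unsigned_word hi2)
{
    Result x;
    unsigned char c = _addcarry_u64(0, lo1, lo2, &x.lo);   // carry-out of the low words
    _addcarry_u64(c, hi1, hi2, &x.hi);                     // carry-in to the high words
    return x;
}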
Note that since the Broadwell microarchitecture there are two new instructions, adcx and adox, which you can access with the _addcarryx_u64 intrinsic. Intel's documentation for these intrinsics used to be different than the assembly produced by the compiler, but it appears their documentation is correct now. However, Visual Studio still only appears to produce adcx with _addcarryx_u64, whereas ICC produces both adcx and adox with this intrinsic. But even though ICC produces both instructions, it does not produce the most optimal code (ICC 15), and so inline assembly is still necessary.
Personally, I think the fact that a non-standard feature of C/C++, such as inline assembly or intrinsics, is required to do this is a weakness of C/C++ but others might disagree. The adc instruction has been in the x86 instruction set since 1979. I would not hold my breath on C/C++ compilers being able to optimally figure out when you want adc. Sure they can have built-in types such as __int128 but the moment you want a larger type that's not built-in you have to use some non-standard C/C++ feature such as inline assembly or intrinsics.
In terms of inline assembly code to do this, I already posted a solution for 256-bit addition of eight 64-bit integers in registers at multi-word addition using the carry flag.
Here is that code reposted.
#define ADD256(X1, X2, X3, X4, Y1, Y2, Y3, Y4) \
__asm__ __volatile__ ( \
"addq %[v1], %[u1] \n" \
"adcq %[v2], %[u2] \n" \
"adcq %[v3], %[u3] \n" \
"adcq %[v4], %[u4] \n" \
: [u1] "+&r" (X1), [u2] "+&r" (X2), [u3] "+&r" (X3), [u4] "+&r" (X4) \
: [v1] "r" (Y1), [v2] "r" (Y2), [v3] "r" (Y3), [v4] "r" (Y4))
If you want to explicitly load the values from memory you can do something like this
//uint64_t dst[4] = {1,1,1,1};
//uint64_t src[4] = {1,2,3,4};
asm (
"movq (%[in]), %%rax\n"
"addq %%rax, %[out]\n"
"movq 8(%[in]), %%rax\n"
"adcq %%rax, 8%[out]\n"
"movq 16(%[in]), %%rax\n"
"adcq %%rax, 16%[out]\n"
"movq 24(%[in]), %%rax\n"
"adcq %%rax, 24%[out]\n"
: [out] "=m" (dst)
: [in]"r" (src)
: "%rax"
);
That produces nearly identical assembly to what ICC generates from the following function
void add256(uint256 *x, uint256 *y) {
unsigned char c = 0;
c = _addcarry_u64(c, x->x1, y->x1, &x->x1);
c = _addcarry_u64(c, x->x2, y->x2, &x->x2);
c = _addcarry_u64(c, x->x3, y->x3, &x->x3);
_addcarry_u64(c, x->x4, y->x4, &x->x4);
}
I have limited experience with GCC inline assembly (or inline assembly in general - I usually use an assembler such as NASM) so maybe there are better inline assembly solutions.
So what I'm looking for is code that I could generalize to any length
To answer this question here is another solution using template meta programming. I used this same trick for loop unrolling. This produces optimal code with ICC. If Clang or GCC ever implement _addcarry_u64 efficiently this would be a good general solution.
#include <x86intrin.h>
#include <inttypes.h>
#define LEN 4 // N = N*64-bit add e.g. 4=256-bit add, 3=192-bit add, ...
static unsigned char c = 0;
template<int START, int N>
struct Repeat {
static void add (uint64_t *x, uint64_t *y) {
c = _addcarry_u64(c, x[START], y[START], &x[START]);
Repeat<START+1, N>::add(x,y);
}
};
template<int N>
struct Repeat<LEN, N> {
static void add (uint64_t *x, uint64_t *y) {}
};
void sum_unroll(uint64_t *x, uint64_t *y) {
Repeat<0,LEN>::add(x,y);
}
Assembly from ICC
xorl %r10d, %r10d #12.13
movzbl c(%rip), %eax #12.13
cmpl %eax, %r10d #12.13
movq (%rsi), %rdx #12.13
adcq %rdx, (%rdi) #12.13
movq 8(%rsi), %rcx #12.13
adcq %rcx, 8(%rdi) #12.13
movq 16(%rsi), %r8 #12.13
adcq %r8, 16(%rdi) #12.13
movq 24(%rsi), %r9 #12.13
adcq %r9, 24(%rdi) #12.13
setb %r10b
Meta programming is a basic feature of assemblers so it's too bad C and C++ (except through template meta programming hacks) have no solution for this either (the D language does).
The inline assembly I used above which referenced memory was causing some problems in a function. Here is a new version which seems to work better
void foo(uint64_t *dst, uint64_t *src)
{
__asm (
"movq (%[in]), %%rax\n"
"addq %%rax, (%[out])\n"
"movq 8(%[in]), %%rax\n"
"adcq %%rax, 8(%[out])\n"
"movq 16(%[in]), %%rax\n"
"addq %%rax, 16(%[out])\n"
"movq 24(%[in]), %%rax\n"
"adcq %%rax, 24(%[out])\n"
:
: [in] "r" (src), [out] "r" (dst)
: "%rax"
);
}
On Clang 6, both __builtin_addcll and __builtin_add_overflow produce the same, optimal assembly.
Result g(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout);
return x;
}
Result h(unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
Result x;
unsigned_word carryout;
carryout = __builtin_add_overflow(lo1, lo2, &x.lo);
carryout = __builtin_add_overflow(hi1, carryout, &hi1);
__builtin_add_overflow(hi1, hi2, &x.hi);
return x;
}
Assembly for both:
add rdi, rdx
adc rsi, rcx
mov rax, rdi
mov rdx, rsi
ret
Starting with clang 5.0 it is possible to get good results using __uint128_t-addition and getting the carry bit by shifting:
inline uint64_t add_with_carry(uint64_t &a, const uint64_t &b, const uint64_t &c)
{
__uint128_t s = __uint128_t(a) + b + c;
a = s;
return s >> 64;
}
In many situations clang still does strange operations (I assume because of possible aliasing?), but usually copying one variable into a temporary helps.
Usage examples with
template<int size> struct LongInt
{
uint64_t data[size];
};
Manual usage:
void test(LongInt<3> &a, const LongInt<3> &b_)
{
const LongInt<3> b = b_; // need to copy b_ into local temporary
uint64_t c0 = add_with_carry(a.data[0], b.data[0], 0);
uint64_t c1 = add_with_carry(a.data[1], b.data[1], c0);
uint64_t c2 = add_with_carry(a.data[2], b.data[2], c1);
}
Generic solution:
template<int size>
void addTo(LongInt<size> &a, const LongInt<size> b)
{
__uint128_t c = __uint128_t(a.data[0]) + b.data[0];
for(int i=1; i<size; ++i)
{
c = __uint128_t(a.data[i]) + b.data[i] + (c >> 64);
a.data[i] = c;
}
}
Godbolt Link: All examples above are compiled to only mov, add and adc instructions (starting with clang 5.0, and at least -O2).
The examples don't produce good code with gcc (up to 8.1, which at the moment is the highest version on godbolt).
And I did not yet manage to get anything usable with __builtin_addcll ...
The code using __builtin_addcll is fully optimized by Clang since version 10, for chains of at least 3 (which require an adc with variable carry-in that also produces a carry-out). Godbolt shows clang 9 making a mess of setc/movzx for that case.
Clang 6 and later handle it well for the much easier case of chains of 2, as shown in @zneak's answer, where no carry-out from an adc is needed.
The idiomatic code without builtins is good too. Moreover, it works in every compiler and is also fully optimized by GCC 5+ for chains of 2 (add/adc, without using the carry-out from the adc). It's tricky to write correct C that generates carry-out when there's carry-in, so this doesn't extend easily.
Result h (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)
{
unsigned_word lo = lo1 + lo2;
bool carry = lo < lo1;
unsigned_word hi = hi1 + hi2 + carry;
return Result{lo, hi};
}
https://godbolt.org/z/ThxGj1WGK
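To illustrate the chain-of-3 case mentioned above, here is a sketch of a 192-bit add with __builtin_addcll (my example, using the question's unsigned_word; the middle limb is the one that needs an adc with both carry-in and carry-out):
struct Result3 { unsigned_word w0, w1, w2; };
Result3 add192(unsigned_word a0, unsigned_word a1, unsigned_word a2,
               unsigned_word b0, unsigned_word b1, unsigned_word b2)
{
    Result3 r;
    unsigned_word c0, c1;
    r.w0 = __builtin_addcll(a0, b0, 0, &c0);
    r.w1 = __builtin_addcll(a1, b1, c0, &c1);   // carry-in and carry-out: wants a real adc
    r.w2 = __builtin_addcll(a2, b2, c1, &c1);   // final carry-out discarded
    return r;
}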

Moving a single float to a xmm register

I want to multiply the data stored in one xmm register with a single float value and save the result in a xmm register.
I made a little graphic to explain it a bit better.
As you see I got a xmm0 register with my data in it. For example it contains:
xmm0 = |4.0|2.5|3.5|2.0|
Each floating-point value is stored in 4 bytes; my xmm0 register is 128 bits (16 bytes) long.
That works pretty well. Now I want to store 0.5 in another xmm register, e.g. xmm1, and multiply that register with the xmm0 register so that each value stored in xmm0 is multiplied by 0.5.
I have absolutely no idea how to store 0.5 in an XMM register.
Any suggestions?
Btw: It's Inline Assembler in C++.
void filter(image* src_image, image* dst_image)
{
float* src = src_image->data;
float* dst = dst_image->data;
__asm__ __volatile__ (
"movaps (%%esi), %%xmm0\n"
// Multiply %xmm0 with a float, e.g. 0.5
"movaps %%xmm0, (%%edi)\n"
:
: "S"(src), "D"(dst) :
);
}
This is the quite simple version of the thing I want to do. I have some image data stored in a float array. The pointers to these arrays are passed to assembly. movaps takes the first 4 float values of the array and stores these 16 bytes in the xmm0 register. After this, xmm0 should be multiplied with e.g. 0.5. Then the "new" values shall be stored in the array pointed to by edi.
As people noted in comments, for this sort of very simple operation, it's essentially always better to use intrinsics:
void filter(image* src_image, image* dst_image)
{
const __m128 data = _mm_load_ps(src_image->data);
const __m128 scaled = _mm_mul_ps(data, _mm_set1_ps(0.5f));
_mm_store_ps(dst_image->data, scaled);
}
You should only resort to inline ASM if the compiler is generating bad code (and only after filing a bug with the compiler vendor).
If you really want to stay in assembly, there are many ways to accomplish this task. You could define a scale vector outside of the ASM block:
const __m128 half = _mm_set1_ps(0.5f);
and then use it inside the ASM just like you use other operands.
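A minimal sketch of what that can look like (my illustration, using generic "r" pointer constraints and an "x" constraint for the vector operand; not the only way to write it):
void filter(image* src_image, image* dst_image)
{
    const __m128 half = _mm_set1_ps(0.5f);
    float* src = src_image->data;
    float* dst = dst_image->data;
    __asm__ __volatile__ (
        "movaps (%[src]), %%xmm0\n"
        "mulps  %[scale], %%xmm0\n"      // multiply all four floats by 0.5
        "movaps %%xmm0, (%[dst])\n"
        :
        : [src] "r" (src), [dst] "r" (dst), [scale] "x" (half)
        : "xmm0", "memory"
    );
}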
You can do it without any loads, if you really want to:
"mov $0x3f000000, %%eax\n" // encoding of 0.5
"movd %%eax, %%xmm1\n" // move to xmm1
"shufps $0, %%xmm1, %%xmm1\n" // splat across all lanes of xmm1
Those are just two approaches. There are lots of other ways. You might spend some quality time with the Intel Instruction Set Reference.
Assuming you're using intrinsics: __m128 halfx4 = _mm_set1_ps(0.5f);
Edit:
You're much better off using intrinsics:
__m128 x = _mm_mul_ps(_mm_load_ps(src), halfx4);
_mm_store_ps(dst, x);
If the src and dst float data is not 16-byte aligned, you need: _mm_loadu_ps and _mm_storeu_ps - which are slower.
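For example, a quick sketch of the unaligned variant (illustrative name):
void filter_unaligned(float* src, float* dst)
{
    const __m128 halfx4 = _mm_set1_ps(0.5f);
    __m128 x = _mm_mul_ps(_mm_loadu_ps(src), halfx4);   // movups load: no alignment requirement
    _mm_storeu_ps(dst, x);                              // movups store
}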
You are looking for the MOVSS instruction (which loads a single precision float from memory into the lowest 4 bytes of an SSE register), followed by a shuffle to fill the other 3 floats with this value:
movss (whatever), %%xmm1
shufps $0, %%xmm1, %%xmm1
That's also how the _mm_set1_ps intrinsic probably does it. Then you can just multiply these SSE values or do whatever you want:
mulps %%xmm1, %%xmm0
If you are using C++ with GCC and have EasySSE, your code can be as follows:
void filter(float* src_image, float* dst_image){
*(PackedFloat128*)dst_image = PackedFloat128(0.5) * (src_image+0);
}
This is assuming the given pointers are 16-byte aligned.
You can check the assembly code to verify that the variables are properly mapped to vector registers.
Here's one way to do it:
#include <stdio.h>
#include <stdlib.h>
typedef struct img {
float *data;
} image_t;
image_t *src_image;
image_t *dst_image;
void filter(image_t*, image_t*);
int main()
{
image_t src, dst;
src.data = malloc(64);
dst.data = malloc(64);
src_image=&src;
dst_image=&dst;
*src.data = 42.0;
filter(src_image, dst_image);
printf("%f\n", *dst.data);
free(src.data);
free(dst.data);
return 0;
}
void filter(image_t* src_image, image_t* dst_image)
{
float* src = src_image->data;
float* dst = dst_image->data;
__asm__ __volatile__ (
"movd %%esi, %%xmm0;"
"movd %%xmm0, %%edi;"
: "=D" (*dst)
: "S" (*src)
);
}