With AVX512, there is the intrinsic _mm256_lzcnt_epi32, which returns a vector that, for each of the 8 32-bit elements, contains the number of leading zero bits in the input vector's element.
Is there an efficient way to implement this using AVX and AVX2 instructions only?
Currently I'm using a loop which extracts each element and applies the _lzcnt_u32 function.
Related: to bit-scan one large bitmap, see Count leading zeros in __m256i word which uses pmovmskb -> bitscan to find which byte to do a scalar bitscan on.
This question is about doing 8 separate lzcnts on 8 separate 32-bit elements when you're actually going to use all 8 results, not just select one.
float represents numbers in an exponential format, so int->FP conversion gives us the position of the highest set bit encoded in the exponent field.
We want int->float with magnitude rounded down (truncate the value towards 0), not the default rounding to nearest. That could round up and make 0x3FFFFFFF look like 0x40000000. If you're doing a lot of these conversions without doing any FP math, you could set the rounding mode in the MXCSR¹ to truncation, then set it back when you're done.
Otherwise you can use v & ~(v>>8) to keep the 8 most-significant bits and zero some or all lower bits, including a potentially-set bit 8 below the MSB. That's enough to ensure all rounding modes never round up to the next power of two. It always keeps the 8 MSB because v>>8 shifts in 8 zeros, so inverted that's 8 ones. At lower bit positions, wherever the MSB is, 8 zeros are shifted past there from higher positions, so it will never clear the most significant bit of any integer. Depending on how set bits below the MSB line up, it might or might not clear more below the 8 most significant.
After conversion, we use an integer shift on the bit-pattern to bring the exponent (and sign bit) to the bottom and undo the bias with a saturating subtract. We use min to set the result to 32 if no bits were set in the original 32-bit input.
__m256i avx2_lzcnt_epi32 (__m256i v) {
// prevent value from being rounded up to the next power of two
v = _mm256_andnot_si256(_mm256_srli_epi32(v, 8), v); // keep 8 MSB
v = _mm256_castps_si256(_mm256_cvtepi32_ps(v)); // convert an integer to float
v = _mm256_srli_epi32(v, 23); // shift down the exponent
v = _mm256_subs_epu16(_mm256_set1_epi32(158), v); // undo bias
v = _mm256_min_epi16(v, _mm256_set1_epi32(32)); // clamp at 32
return v;
}
Footnote 1: fp->int conversion is available with truncation (cvtt), but int->fp conversion is only available with default rounding (subject to MXCSR).
AVX512F introduces rounding-mode overrides for 512-bit vectors which would solve the problem: __m512 _mm512_cvt_roundepi32_ps(__m512i a, int r). But all CPUs with AVX512F also support AVX512CD, so you could just use _mm512_lzcnt_epi32. And with AVX512VL, _mm256_lzcnt_epi32.
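For reference, here is the same exponent trick written as scalar C. This is a sketch only: it assumes IEEE-754 binary32 floats, two's-complement wraparound on the uint-to-int cast, and type-punning through memcpy; the helper name lzcnt32_via_float is mine, not part of the answer.

```c
#include <stdint.h>
#include <string.h>

/* Scalar sketch of the vector function above: truncate low bits so
   rounding can't bump the value to the next power of two, convert to
   float, then read the position of the MSB out of the exponent field. */
static uint32_t lzcnt32_via_float(uint32_t v)
{
    v &= ~(v >> 8);                  /* keep the 8 MSB, prevent round-up    */
    float f = (float)(int32_t)v;     /* signed conversion, like cvtdq2ps    */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* reinterpret the float's bit pattern */
    uint32_t e = bits >> 23;         /* exponent (plus sign bit) field      */
    uint32_t r = (e < 158) ? 158 - e : 0; /* undo bias, saturate at 0       */
    return (r < 32) ? r : 32;        /* v == 0 gives e == 0, clamp to 32    */
}
```

For negative inputs (MSB set) the float is negative, so the sign bit makes e exceed 158 and the saturating subtract correctly yields 0, mirroring what _mm256_subs_epu16 does in the vector version.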
@aqrit's answer looks like a more-clever use of FP bithacks. My answer below is based on the first place I looked for a bithack, which was old and aimed at scalar, so it didn't try to avoid double (which is wider than int32 and thus a problem for SIMD).
It uses HW signed int->float conversion and saturating integer subtracts to handle the MSB being set (negative float), instead of stuffing bits into a mantissa for manual uint->double. If you can set MXCSR to round down across a lot of these _mm256_lzcnt_epi32, that's even more efficient.
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogIEEE64Float suggests stuffing integers into the mantissa of a large double, then subtracting to get the FPU hardware to produce a normalized double. (I think this bit of magic is doing uint32_t -> double, with the technique @Mysticial explains in How to efficiently perform double/int64 conversions with SSE/AVX?, which works for uint64_t up to 2^52 - 1.)
Then grab the exponent bits of the double and undo the bias.
I think integer log2 is basically the same thing as lzcnt (log2(v) = 31 - lzcnt(v) for nonzero v), but there might be an off-by-1 at powers of 2 depending on the exact bithack.
The Stanford Graphics bithacks page lists other branchless bithacks you could use that would probably still be better than 8x scalar lzcnt.
If you knew your numbers were always small-ish (like less than 2^23) you could maybe do this with float and avoid splitting and blending.
int v; // 32-bit integer to find the log base 2 of
int r; // result of log_2(v) goes here
union { unsigned int u[2]; double d; } t; // temp
t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
t.d -= 4503599627370496.0;
r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
The code above loads a 64-bit (IEEE-754 floating-point) double with a 32-bit integer (with no padding bits) by storing the integer in the mantissa while the exponent is set to 2^52. From this newly minted double, 2^52 (expressed as a double) is subtracted, which sets the resulting exponent to the log base 2 of the input value, v. All that is left is shifting the exponent bits into position (20 bits right) and subtracting the bias, 0x3FF (which is 1023 decimal).
To do this with AVX2, blend and shift+blend odd/even halves with set1_epi32(0x43300000) and _mm256_castps_pd to get a __m256d. And after subtracting, _mm256_castpd_si256 and shift / blend the low/high halves into place then mask to get the exponents.
Doing integer operations on FP bit-patterns is very efficient with AVX2, just 1 cycle of extra latency for a bypass delay when doing integer shifts on the output of an FP math instruction.
(TODO: write it with C++ intrinsics, edit welcome or someone else could just post it as an answer.)
I'm not sure if you can do anything with int -> double conversion and then reading the exponent field. Negative numbers have no leading zeros and positive numbers give an exponent that depends on the magnitude.
If you did want that, you'd go one 128-bit lane at a time, shuffling to feed xmm -> ymm packed int32_t -> packed double conversion.
The question is also tagged AVX, but there are no instructions for integer processing in AVX, which means one needs to fall back to SSE on platforms that support AVX but not AVX2. I am showing an exhaustively tested, if somewhat pedestrian, version below. The basic idea here is as in the other answers, in that the count of leading zeros is determined by the floating-point normalization that occurs during integer to floating-point conversion. The exponent of the result has a one-to-one correspondence with the count of leading zeros, except that the result is wrong in the case of an argument of zero. Conceptually:
clz (a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
where float_as_uint32() is a re-interpreting cast and uint32_to_float_rz() is a conversion from unsigned integer to floating-point with truncation. A normal, rounding, conversion could bump up the conversion result to the next power of two, resulting in an incorrect count of leading zero bits.
SSE does not provide truncating integer to floating-point conversion as a single instruction, nor conversions from unsigned integers. This functionality needs to be emulated. The emulation does not need to be exact, as long as it does not change the magnitude of the conversion result. The truncation part is handled by the invert - right shift - andn technique from aqrit's answer. To use signed conversion, we cut the number in half before the conversion, then double and increment after the conversion:
float approximate_uint32_to_float_rz (uint32_t a)
{
float r = (float)(int)((a >> 1) & ~(a >> 2));
return r + r + 1.0f;
}
This approach is translated into SSE intrinsics in sse_clz() below.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "immintrin.h"
/* compute count of leading zero bits using floating-point normalization.
clz(a) = (158 - (float_as_uint32 (uint32_to_float_rz (a)) >> 23)) + (a == 0)
The problematic part here is uint32_to_float_rz(). SSE does not offer
conversion of unsigned integers, and no rounding modes in integer to
floating-point conversion. Since all we need is an approximate version
that preserves order of magnitude:
float approximate_uint32_to_float_rz (uint32_t a)
{
float r = (float)(int)((a >> 1) & ~(a >> 2));
return r + r + 1.0f;
}
*/
__m128i sse_clz (__m128i a)
{
__m128 fp1 = _mm_set_ps1 (1.0f);
__m128i zero = _mm_set1_epi32 (0);
__m128i i158 = _mm_set1_epi32 (158);
__m128i iszero = _mm_cmpeq_epi32 (a, zero);
__m128i lsr1 = _mm_srli_epi32 (a, 1);
__m128i lsr2 = _mm_srli_epi32 (a, 2);
__m128i atrunc = _mm_andnot_si128 (lsr2, lsr1);
__m128 atruncf = _mm_cvtepi32_ps (atrunc);
__m128 atruncf2 = _mm_add_ps (atruncf, atruncf);
__m128 conv = _mm_add_ps (atruncf2, fp1);
__m128i convi = _mm_castps_si128 (conv);
__m128i lsr23 = _mm_srli_epi32 (convi, 23);
__m128i res = _mm_sub_epi32 (i158, lsr23);
return _mm_sub_epi32 (res, iszero);
}
/* Portable reference implementation of 32-bit count of leading zeros */
int clz32 (uint32_t a)
{
uint32_t r = 32;
if (a >= 0x00010000) { a >>= 16; r -= 16; }
if (a >= 0x00000100) { a >>= 8; r -= 8; }
if (a >= 0x00000010) { a >>= 4; r -= 4; }
if (a >= 0x00000004) { a >>= 2; r -= 2; }
r -= a - (a & (a >> 1));
return r;
}
/* Test floating-point based count leading zeros exhaustively */
int main (void)
{
__m128i res;
uint32_t resi[4], refi[4];
uint32_t count = 0;
do {
refi[0] = clz32 (count);
refi[1] = clz32 (count + 1);
refi[2] = clz32 (count + 2);
refi[3] = clz32 (count + 3);
res = sse_clz (_mm_set_epi32 (count + 3, count + 2, count + 1, count));
memcpy (resi, &res, sizeof resi);
if ((resi[0] != refi[0]) || (resi[1] != refi[1]) ||
(resi[2] != refi[2]) || (resi[3] != refi[3])) {
printf ("error # %08x %08x %08x %08x\n",
count, count+1, count+2, count+3);
return EXIT_FAILURE;
}
count += 4;
} while (count);
return EXIT_SUCCESS;
}
I'm trying to multiply two __m128i byte per byte (8-bit signed integers).
The problem here is overflow. My solution is to widen these 8-bit signed integers to 16-bit signed integers, multiply, then pack the whole thing back into a __m128i of 16 x 8-bit integers.
Here is the __m128i mulhi_epi8(__m128i a, __m128i b) emulation I made:
inline __m128i mulhi_epi8(__m128i a, __m128i b)
{
auto a_decomposed = decompose_epi8(a);
auto b_decomposed = decompose_epi8(b);
__m128i r1 = _mm_mullo_epi16(a_decomposed.first, b_decomposed.first);
__m128i r2 = _mm_mullo_epi16(a_decomposed.second, b_decomposed.second);
return _mm_packs_epi16(_mm_srai_epi16(r1, 8), _mm_srai_epi16(r2, 8));
}
decompose_epi8 is implemented in a non-simd way:
inline std::pair<__m128i, __m128i> decompose_epi8(__m128i input)
{
std::pair<__m128i, __m128i> result;
// result.first => should contain 8 shorts in [-128, 127] (8 first bytes of the input)
// result.second => should contain 8 shorts in [-128, 127] (8 last bytes of the input)
for (int i = 0; i < 8; ++i)
{
result.first.m128i_i16[i] = input.m128i_i8[i];
result.second.m128i_i16[i] = input.m128i_i8[i + 8];
}
return result;
}
This code works well. My goal now is to implement a simd version of this for loop. I looked at the Intel Intrinsics Guide but I can't find a way to do this. I guess shuffle could do the trick but I have trouble conceptualising this.
As you want to do signed multiplication, you need to sign-extend each byte to a 16-bit word, or move it into the upper half of a 16-bit word. Since you pack the results back together afterwards, you can split the input into odd and even bytes instead of the higher and lower half. Sign-extension of the odd bytes could then be done by arithmetically shifting all 16-bit parts to the right, but it is cheaper to extract the odd bytes by masking out the even bytes; to get the even bytes, you can shift all 16-bit parts to the left (both halves are then multiplied with _mm_mulhi_epi16).
The following should work with SSE2:
__m128i mulhi_epi8(__m128i a, __m128i b)
{
__m128i mask = _mm_set1_epi16(0xff00);
// mask higher bytes:
__m128i a_hi = _mm_and_si128(a, mask);
__m128i b_hi = _mm_and_si128(b, mask);
__m128i r_hi = _mm_mulhi_epi16(a_hi, b_hi);
// mask out garbage in lower half:
r_hi = _mm_and_si128(r_hi, mask);
// shift lower bytes to upper half
__m128i a_lo = _mm_slli_epi16(a,8);
__m128i b_lo = _mm_slli_epi16(b,8);
__m128i r_lo = _mm_mulhi_epi16(a_lo, b_lo);
// shift result to the lower half:
r_lo = _mm_srli_epi16(r_lo,8);
// join result and return:
return _mm_or_si128(r_hi, r_lo);
}
Note: a previous version used shifts to sign-extend the odd bytes. On most Intel CPUs this would increase P0 usage (which needs to be used for multiplication as well). Bit-logic can operate on more ports, so this version should have better throughput.
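As a scalar cross-check of the semantics (my own reference helper, not part of the answer): each output byte is the high byte of the exact 16-bit signed product, which is what both the question's srai-then-pack code and the masking version above compute per lane. Arithmetic right shift on signed values is assumed, as mainstream compilers provide.

```c
#include <stdint.h>

/* Scalar reference for one lane of the emulated mulhi_epi8:
   the upper 8 bits of the exact 16-bit signed product. */
static int8_t mulhi_i8(int8_t a, int8_t b)
{
    int16_t p = (int16_t)((int16_t)a * (int16_t)b); /* exact product   */
    return (int8_t)(p >> 8);                        /* keep high byte;
                                                       arithmetic shift */
}
```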
I need to perform the following operation:
w[i] = scale * v[i] + point
scale and point are fixed, whereas v[] is a vector of 4-bit integers.
I need to compute w[] for the arbitrary input vector v[] and I want to speed up the process using AVX intrinsics. However, v[i] is a vector of 4-bit integers.
The question is how to perform operations on 4-bit integers using intrinsics? I could use 8-bit integers and perform operations that way, but is there a way to do the following:
[a,b] + [c,d] = [a+c,b+d]
[a,b] * [c,d] = [a*c,b*d]
(Ignoring overflow)
Using AVX intrinsics, where [...,...] is an 8-bit integer and a,b,c,d are 4-bit integers?
If yes, would it be possible to give a short example on how this could work?
Just a partial answer (only addition) and in pseudo code (should be easy to extend to AVX2 intrinsics):
uint8_t a, b; // input containing two nibbles each
uint8_t c = a + b; // add with (unwanted) carry between nibbles
uint8_t x = a ^ b ^ c; // bits which are result of a carry
x &= 0x10; // only bit 4 is of interest
c -= x; // undo carry of lower to upper nibble
If either a or b is known to have bit 4 unset (i.e. the lowest bit of the upper nibble), it can be left out of the computation of x.
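The pseudo code above, turned into a compilable per-byte helper (a sketch of mine; this is exactly what the vectorized version would do per byte, with the final subtraction mapping to _mm256_sub_epi8):

```c
#include <stdint.h>

/* Two 4-bit lanes packed in one byte: add them independently by
   adding normally, then cancelling the carry that crossed bit 4. */
static uint8_t add_nibble_pair(uint8_t a, uint8_t b)
{
    uint8_t c = (uint8_t)(a + b);     /* add with unwanted inter-nibble carry */
    uint8_t x = (uint8_t)(a ^ b ^ c); /* bit set wherever a carry arrived     */
    return (uint8_t)(c - (x & 0x10)); /* undo the carry into the upper nibble */
}
```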
As for multiplication: If scale is the same for all products, you can likely get away with some shifting and adding/subtracting (masking out overflow bits where necessary). Otherwise, I'm afraid you need to mask out 4 bits of each 16-bit word, do the operation, and fiddle them together at the end. Pseudo code (there is no AVX 8-bit multiplication, so we need to operate with 16-bit words):
uint16_t m0=0xf, m1=0xf0, m2=0xf00, m3=0xf000; // masks for each nibble
uint16_t a, b; // input containing 4 nibbles each.
uint16_t p0 = (a*b) & m0; // lowest nibble, does not require masking a,b
uint16_t p1 = ((a>>4) * (b&m1)) & m1;
uint16_t p2 = ((a>>8) * (b&m2)) & m2;
uint16_t p3 = ((a>>12)* (b&m3)) & m3;
uint16_t result = p0 | p1 | p2 | p3; // join results together
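The same pseudo code as a compilable function of my own (widened to uint32_t internally, since multiplying two uint16_t values after integer promotion can overflow int, which is undefined behaviour; each per-nibble product is taken mod 16):

```c
#include <stdint.h>

/* Multiply the four 4-bit lanes of two 16-bit words independently.
   Each partial product is masked back to its own nibble position;
   cross-nibble garbage lands above the mask and is discarded. */
static uint16_t mul_nibbles(uint16_t a, uint16_t b)
{
    uint32_t A = a, B = b;  /* widen so A*B can't overflow int */
    uint32_t p0 = ( A        *  B           ) & 0x000Fu;
    uint32_t p1 = ((A >> 4)  * (B & 0x00F0u)) & 0x00F0u;
    uint32_t p2 = ((A >> 8)  * (B & 0x0F00u)) & 0x0F00u;
    uint32_t p3 = ((A >> 12) * (B & 0xF000u)) & 0xF000u;
    return (uint16_t)(p0 | p1 | p2 | p3);
}
```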
For fixed a, b in w[i]=v[i] * a + b, you can simply use a lookup table w_0_3 = _mm_shuffle_epi8(LUT_03, input) for the LSB. Split the input to even and odd nibbles, with the odd LUT preshifted by 4.
auto a = input & 15; // per element
auto b = (input >> 4) & 15; // shift as 16 bits
return LUTA[a] | LUTB[b];
How to generate those LUTs dynamically (if that's needed at all) is another issue.
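A scalar sketch of that split-LUT idea for one byte (two nibbles). The helper names and the example values scale = 3, point = 1 are hypothetical; each per-nibble result is taken mod 16 here. In SIMD, each of the two lookups would be a single _mm_shuffle_epi8.

```c
#include <stdint.h>

/* Per-nibble w = (v*scale + point) mod 16, two results packed per byte.
   LUTA answers the low nibble; LUTB is the same table preshifted by 4
   so the two disjoint results can simply be ORed together. */
static uint8_t scale_nibbles(uint8_t v, const uint8_t LUTA[16],
                             const uint8_t LUTB[16])
{
    return (uint8_t)(LUTA[v & 15] | LUTB[(v >> 4) & 15]);
}

/* Build the two 16-entry tables for fixed scale/point. */
static void build_luts(int scale, int point,
                       uint8_t LUTA[16], uint8_t LUTB[16])
{
    for (int i = 0; i < 16; ++i) {
        uint8_t w = (uint8_t)((i * scale + point) & 15);
        LUTA[i] = w;
        LUTB[i] = (uint8_t)(w << 4); /* preshifted for the high nibble */
    }
}
```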
4-bit additions/multiplications can be done using AVX2, particularly if you want to apply those computations on larger vectors (say, more than 128 elements). However, if you want to add just 4 numbers, use straight scalar code.
We have done extensive work on how to deal with 4-bit integers, and we have recently developed a library for it: Clover: 4-bit Quantized Linear Algebra Library (with a focus on quantization). The code is also available at GitHub.
As you mentioned only 4-bit integers, I would assume that you are referring to signed integers (i.e. two's complement), and base my answer accordingly. Note that handling unsigned is in fact much simpler.
I would also assume that you would like to take a vector int8_t v[n/2] that contains n 4-bit integers, and produce int8_t v_sum[n/4] having n/2 4-bit integers. All the code relating to the description below is available as a gist.
Packing / Unpacking
Obviously, AVX2 does not offer any instructions to perform additions / multiplications on 4-bit integers; therefore, you must resort to the given 8- or 16-bit instructions. The first step in dealing with 4-bit arithmetic is to devise methods for placing the 4-bit nibbles into 8-, 16-, or 32-bit chunks.
For the sake of clarity, let's assume that you want to unpack a given nibble from a 32-bit chunk that stores multiple 4-bit signed values into a corresponding 32-bit integer (figure below). This can be done with two bit shifts:
a logical left shift is used to shift the nibble so that it occupies the highest-order 4-bits of the 32-bit entity.
an arithmetic right shift is used to shift the nibble to the lowest order 4-bits of the 32-bit entity.
The arithmetic right shift performs sign extension, filling the high-order 28 bits with the sign bit of the nibble, yielding a 32-bit integer with the same value as the two's complement 4-bit value.
The goal of packing (left part of figure above) is to revert the unpacking operation. Two bit shifts can be used to place the lowest order 4 bits of a 32-bit integer anywhere within a 32-bit entity.
a logical left shift is used to shift the nibble so that it occupies the highest-order 4-bits of the 32-bit entity.
a logical right shift is used to shift the nibble to somewhere within the 32-bit entity.
The first sets the bits lower-ordered than the nibble to zero, and the second sets the bits higher-ordered than the nibble to zero. A bitwise OR operation can then be used to store up to eight nibbles in the 32-bit entity.
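The two-shift unpack and pack described above, as scalar C (a sketch of mine; it assumes arithmetic right shift on signed types and modular uint-to-int conversion, which mainstream compilers provide):

```c
#include <stdint.h>

/* Extract signed nibble k (k = 0 is the lowest) from a 32-bit word
   holding eight packed 4-bit two's-complement values. */
static int32_t unpack_nibble(uint32_t word, int k)
{
    int32_t t = (int32_t)(word << (28 - 4 * k)); /* nibble to the top 4 bits */
    return t >> 28;               /* arithmetic shift sign-extends it down   */
}

/* Inverse: place the low 4 bits of x at nibble position k,
   with all other bits zero, ready to be ORed with its neighbours. */
static uint32_t pack_nibble(int32_t x, int k)
{
    return ((uint32_t)x << 28) >> (28 - 4 * k); /* both shifts logical */
}
```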
How to apply this in practice?
Let's assume that you have 64 x 32-bit integer values stored in 8 AVX registers __m256i q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8. Let's also assume that each value is in the [-8, 7] range. If you want to pack them into a single AVX register of 64 x 4-bit values, you can do as follows:
//
// Transpose the 8x8 registers
//
_mm256_transpose8_epi32(q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8);
//
// Shift values left
//
q_1 = _mm256_slli_epi32(q_1, 28);
q_2 = _mm256_slli_epi32(q_2, 28);
q_3 = _mm256_slli_epi32(q_3, 28);
q_4 = _mm256_slli_epi32(q_4, 28);
q_5 = _mm256_slli_epi32(q_5, 28);
q_6 = _mm256_slli_epi32(q_6, 28);
q_7 = _mm256_slli_epi32(q_7, 28);
q_8 = _mm256_slli_epi32(q_8, 28);
//
// Shift values right (zero-extend)
//
q_1 = _mm256_srli_epi32(q_1, 7 * 4);
q_2 = _mm256_srli_epi32(q_2, 6 * 4);
q_3 = _mm256_srli_epi32(q_3, 5 * 4);
q_4 = _mm256_srli_epi32(q_4, 4 * 4);
q_5 = _mm256_srli_epi32(q_5, 3 * 4);
q_6 = _mm256_srli_epi32(q_6, 2 * 4);
q_7 = _mm256_srli_epi32(q_7, 1 * 4);
q_8 = _mm256_srli_epi32(q_8, 0 * 4);
//
// Pack together
//
__m256i t1 = _mm256_or_si256(q_1, q_2);
__m256i t2 = _mm256_or_si256(q_3, q_4);
__m256i t3 = _mm256_or_si256(q_5, q_6);
__m256i t4 = _mm256_or_si256(q_7, q_8);
__m256i t5 = _mm256_or_si256(t1, t2);
__m256i t6 = _mm256_or_si256(t3, t4);
__m256i t7 = _mm256_or_si256(t5, t6);
Shifts usually take 1 cycle of throughput and 1 cycle of latency, thus you can assume that they are in fact quite inexpensive. If you have to deal with unsigned 4-bit values, the left shifts can be skipped altogether.
To reverse the procedure, you can apply the same method. Let's assume that you have loaded 64 4-bit values into a single AVX register __m256i qu_64. In order to produce 64 x 32-bit integers __m256i q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8, you can execute the following:
//
// Shift values left
//
const __m256i qu_1 = _mm256_slli_epi32(qu_64, 4 * 7);
const __m256i qu_2 = _mm256_slli_epi32(qu_64, 4 * 6);
const __m256i qu_3 = _mm256_slli_epi32(qu_64, 4 * 5);
const __m256i qu_4 = _mm256_slli_epi32(qu_64, 4 * 4);
const __m256i qu_5 = _mm256_slli_epi32(qu_64, 4 * 3);
const __m256i qu_6 = _mm256_slli_epi32(qu_64, 4 * 2);
const __m256i qu_7 = _mm256_slli_epi32(qu_64, 4 * 1);
const __m256i qu_8 = _mm256_slli_epi32(qu_64, 4 * 0);
//
// Shift values right (sign-extent) and obtain 8x8
// 32-bit values
//
__m256i q_1 = _mm256_srai_epi32(qu_1, 28);
__m256i q_2 = _mm256_srai_epi32(qu_2, 28);
__m256i q_3 = _mm256_srai_epi32(qu_3, 28);
__m256i q_4 = _mm256_srai_epi32(qu_4, 28);
__m256i q_5 = _mm256_srai_epi32(qu_5, 28);
__m256i q_6 = _mm256_srai_epi32(qu_6, 28);
__m256i q_7 = _mm256_srai_epi32(qu_7, 28);
__m256i q_8 = _mm256_srai_epi32(qu_8, 28);
//
// Transpose the 8x8 values
//
_mm256_transpose8_epi32(q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8);
If dealing with unsigned 4-bit values, the arithmetic right shifts (_mm256_srai_epi32) can be replaced with logical right shifts (_mm256_srli_epi32), since no sign extension is needed.
To see more details, have a look at the gist here.
Adding Odd and Even 4-bit entries
Let's assume that you load from the vector using AVX:
const __m256i qv = _mm256_loadu_si256( ... );
Now, we can easily extract the odd and the even parts. Life would have been much easier if there were 8-bit shifts in AVX2, but there are none, so we have to deal with 16-bit shifts:
const __m256i hi_mask_08 = _mm256_set1_epi8(-16);
const __m256i qv_odd_dirty = _mm256_slli_epi16(qv, 4);
const __m256i qv_odd_shift = _mm256_and_si256(hi_mask_08, qv_odd_dirty);
const __m256i qv_evn_shift = _mm256_and_si256(hi_mask_08, qv);
At this point in time, you have essentially separated the odd and the even nibbles, in two AVX registers that hold their values in the high 4-bits (i.e. values in the range [-8 * 2^4, 7 * 2^4]). The procedure is the same even when dealing with unsigned 4-bit values. Now it is time to add the values.
const __m256i qv_sum_shift = _mm256_add_epi8(qv_odd_shift, qv_evn_shift);
This will work with both signed and unsigned values, as binary addition works the same in two's complement. However, if you want to avoid overflows or underflows, you can also consider addition with saturation, already supported in AVX (for both signed and unsigned):
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
qv_sum_shift will be in the range [-8 * 2^4, 7 * 2^4]. To set it to the right value, we need to shift it back (Note that if qv_sum has to be unsigned, we can use _mm256_srli_epi16 instead):
const __m256i qv_sum = _mm256_srai_epi16(qv_sum_shift, 4);
The summation is now complete. Depending on your use case, this could as well be the end of the program, assuming that you want to produce 8-bit chunks of memory as a result. But let's assume that you want to solve a harder problem. Let's assume that the output is again a vector of 4-bit elements, with the same memory layout as the input one. In that case, we need to pack the 8-bit chunks into 4-bit chunks. However, the problem is that instead of having 64 values, we will end up with 32 values (i.e. half the size of the vector).
From this point there are two options. We either look ahead in the vector, processing 128 x 4-bit values, such that we produce 64 x 4-bit values. Or we revert to SSE, dealing with 32 x 4-bit values. Either way, the fastest way to pack the 8-bit chunks into 4-bit chunks would be to use the vpackuswb (or packuswb for SSE) instruction:
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
This instruction converts packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and stores the results in dst. This means that we have to interleave the odd and even 4-bit values, such that they reside in the 8 low bits of a 16-bit memory chunk. We can proceed as follows:
const __m256i lo_mask_16 = _mm256_set1_epi16(0x0F);
const __m256i hi_mask_16 = _mm256_set1_epi16(0xF0);
const __m256i qv_sum_lo = _mm256_and_si256(lo_mask_16, qv_sum);
const __m256i qv_sum_hi_dirty = _mm256_srli_epi16(qv_sum_shift, 8);
const __m256i qv_sum_hi = _mm256_and_si256(hi_mask_16, qv_sum_hi_dirty);
const __m256i qv_sum_16 = _mm256_or_si256(qv_sum_lo, qv_sum_hi);
The procedure will be identical for both signed and unsigned 4-bit values. Now, qv_sum_16 contains two consecutive 4-bit values, stored in the low-bits of a 16-bit memory chunk. Assuming that we have obtained qv_sum_16 from the next iteration (call it qv_sum_16_next), we can pack everything with:
const __m256i qv_sum_pack = _mm256_packus_epi16(qv_sum_16, qv_sum_16_next);
const __m256i result = _mm256_permute4x64_epi64(qv_sum_pack, 0xD8);
Alternatively, if we want to produce only 32 x 4-bit values, we can do the following:
const __m128i lo = _mm256_extractf128_si256(qv_sum_16, 0);
const __m128i hi = _mm256_extractf128_si256(qv_sum_16, 1);
const __m128i result = _mm_packus_epi16(lo, hi);
Putting it all together
Assuming signed nibbles and a vector size n, such that n is larger than 128 elements and is a multiple of 128, we can perform the odd-even addition, producing n/2 elements, as follows:
void add_odd_even(uint64_t n, int8_t * v, int8_t * r)
{
//
// Make sure that the vector size is a multiple of 128
//
assert(n % 128 == 0);
const uint64_t blocks = n / 64;
//
// Define constants that will be used for masking operations
//
const __m256i hi_mask_08 = _mm256_set1_epi8(-16);
const __m256i lo_mask_16 = _mm256_set1_epi16(0x0F);
const __m256i hi_mask_16 = _mm256_set1_epi16(0xF0);
for (uint64_t b = 0; b < blocks; b += 2) {
//
// Calculate the offsets
//
const uint64_t offset0 = b * 32;
const uint64_t offset1 = b * 32 + 32;
const uint64_t offset2 = b * 32 / 2;
//
// Load 128 values in two AVX registers. Each register will
// contain 64 x 4-bit values in the range [-8, 7].
//
const __m256i qv_1 = _mm256_loadu_si256((__m256i *) (v + offset0));
const __m256i qv_2 = _mm256_loadu_si256((__m256i *) (v + offset1));
//
// Extract the odd and the even parts. The values will be split in
// two registers qv_odd_shift and qv_evn_shift, each of them having
// 32 x 8-bit values, such that each value is multiplied by 2^4
// and resides in the range [-8 * 2^4, 7 * 2^4]
//
const __m256i qv_odd_dirty_1 = _mm256_slli_epi16(qv_1, 4);
const __m256i qv_odd_shift_1 = _mm256_and_si256(hi_mask_08, qv_odd_dirty_1);
const __m256i qv_evn_shift_1 = _mm256_and_si256(hi_mask_08, qv_1);
const __m256i qv_odd_dirty_2 = _mm256_slli_epi16(qv_2, 4);
const __m256i qv_odd_shift_2 = _mm256_and_si256(hi_mask_08, qv_odd_dirty_2);
const __m256i qv_evn_shift_2 = _mm256_and_si256(hi_mask_08, qv_2);
//
// Perform addition. In case of overflows / underflows, behaviour
// is undefined. Values are still in the range [-8 * 2^4, 7 * 2^4].
//
const __m256i qv_sum_shift_1 = _mm256_add_epi8(qv_odd_shift_1, qv_evn_shift_1);
const __m256i qv_sum_shift_2 = _mm256_add_epi8(qv_odd_shift_2, qv_evn_shift_2);
//
// Divide by 2^4. At this point in time, each of the two AVX registers holds
// 32 x 8-bit values that are in the range of [-8, 7]. Summation is complete.
//
const __m256i qv_sum_1 = _mm256_srai_epi16(qv_sum_shift_1, 4);
const __m256i qv_sum_2 = _mm256_srai_epi16(qv_sum_shift_2, 4);
//
// Now, we want to take the even numbers of the 32 x 4-bit register, and
// store them in the high bits of the odd numbers. We do this with
// logical right shifts (which shift in zeros) and 16-bit masks. This operation
// results in two registers qv_sum_lo and qv_sum_hi that hold 32
// values. However, each consecutive 4-bit values reside in the
// low-bits of a 16-bit chunk.
//
const __m256i qv_sum_1_lo = _mm256_and_si256(lo_mask_16, qv_sum_1);
const __m256i qv_sum_1_hi_dirty = _mm256_srli_epi16(qv_sum_shift_1, 8);
const __m256i qv_sum_1_hi = _mm256_and_si256(hi_mask_16, qv_sum_1_hi_dirty);
const __m256i qv_sum_2_lo = _mm256_and_si256(lo_mask_16, qv_sum_2);
const __m256i qv_sum_2_hi_dirty = _mm256_srli_epi16(qv_sum_shift_2, 8);
const __m256i qv_sum_2_hi = _mm256_and_si256(hi_mask_16, qv_sum_2_hi_dirty);
const __m256i qv_sum_16_1 = _mm256_or_si256(qv_sum_1_lo, qv_sum_1_hi);
const __m256i qv_sum_16_2 = _mm256_or_si256(qv_sum_2_lo, qv_sum_2_hi);
//
// Pack the two registers of 32 x 4-bit values, into a single one having
// 64 x 4-bit values. Use the unsigned version, to avoid saturation.
//
const __m256i qv_sum_pack = _mm256_packus_epi16(qv_sum_16_1, qv_sum_16_2);
//
// Interleave the 64-bit chunks.
//
const __m256i qv_sum = _mm256_permute4x64_epi64(qv_sum_pack, 0xD8);
//
// Store the result
//
_mm256_storeu_si256((__m256i *)(r + offset2), qv_sum);
}
}
A self-contained tester and validator of this code is available in the gist here.
Multiplying Odd and Even 4-bit entries
For the multiplication of the odd and even entries, we can use the same strategy as described above to extract the 4-bits into larger chunks.
AVX2 does not offer 8-bit multiplication, only 16-bit. However, we can implement 8-bit multiplication following the method implemented in the Agner Fog's C++ vector class library:
static inline Vec32c operator * (Vec32c const & a, Vec32c const & b) {
// There is no 8-bit multiply in SSE2. Split into two 16-bit multiplies
__m256i aodd = _mm256_srli_epi16(a,8); // odd numbered elements of a
__m256i bodd = _mm256_srli_epi16(b,8); // odd numbered elements of b
__m256i muleven = _mm256_mullo_epi16(a,b); // product of even numbered elements
__m256i mulodd = _mm256_mullo_epi16(aodd,bodd); // product of odd numbered elements
mulodd = _mm256_slli_epi16(mulodd,8); // put odd numbered elements back in place
__m256i mask = _mm256_set1_epi32(0x00FF00FF); // mask for even positions
__m256i product = selectb(mask,muleven,mulodd); // interleave even and odd
return product;
}
I would suggest however to extract the nibbles into 16-bit chunks first and then use _mm256_mullo_epi16 to avoid performing unnecessary shifts.
Basically how can I write the equivalent of this with AVX2 intrinsics? We assume here that result_in_float is of type __m256, while result is of type short int* or short int[8].
for(i = 0; i < 8; i++)
result[i] = (short int)result_in_float[i];
I know that floats can be converted to 32 bit integers using the __m256i _mm256_cvtps_epi32(__m256 m1) intrinsic, but have no idea how to convert these 32 bit integers further to 16 bit integers. And I don't want just that but also to store those values (in the form of 16 bit integers) to the memory, and I want to do that all using vector instructions.
Searching around the internet, I found an intrinsic by the name of _mm256_mask_storeu_epi16, but I'm not really sure if that would do the trick, as I couldn't find an example of its usage.
_mm256_cvtps_epi32 is a good first step, the conversion to a packed vector of shorts is a bit annoying, requiring a cross-slice shuffle (so it's good that it's not in a dependency chain here).
Since the values can be assumed to be in the right range (as per the comment), we can use _mm256_packs_epi32 instead of _mm256_shuffle_epi8 to do the conversion, either way it's a 1-cycle instruction on port 5 but using _mm256_packs_epi32 avoids having to get a shuffle mask from somewhere.
So to put it together (not tested)
__m256i tmp = _mm256_cvtps_epi32(result_in_float);
tmp = _mm256_packs_epi32(tmp, _mm256_setzero_si256());
tmp = _mm256_permute4x64_epi64(tmp, 0xD8);
__m128i res = _mm256_castsi256_si128(tmp);
// _mm_store_si128 that
The last step (cast) is free, it just changes the type.
If you had two vectors of floats to convert, you could re-use most of the instructions, eg: (not tested either)
__m256i tmp1 = _mm256_cvtps_epi32(result_in_float1);
__m256i tmp2 = _mm256_cvtps_epi32(result_in_float2);
tmp1 = _mm256_packs_epi32(tmp1, tmp2);
tmp1 = _mm256_permute4x64_epi64(tmp1, 0xD8);
// _mm256_store_si256 this
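A scalar sketch of what a single lane goes through (my own simplification: this uses C's truncating cast, whereas cvtps2dq rounds according to the current rounding mode; the saturation matches what _mm256_packs_epi32 / packssdw do; NaN inputs are not handled):

```c
#include <stdint.h>

/* One lane of cvtps2dq + packssdw, except the conversion truncates
   toward zero here instead of rounding to nearest-even.
   Out-of-range values saturate to the int16_t limits. */
static int16_t float_to_i16_sat(float f)
{
    if (f >= 32767.0f)  return  32767;  /* packs_epi32 saturates high */
    if (f <= -32768.0f) return -32768;  /* ... and low                */
    return (int16_t)(int32_t)f;         /* in-range: plain truncation */
}
```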