Related
I am new at c++ and working with simd.
What i am trying to do is creating a bitmask from input in constant time.
For example:
input 1,3,4 => output 13 (or 26 indexing scheme does not matter): 1101 (1st, 3rd and 4th bits are 1)
input 2,5,7 => output 82 : 1010010 (2nd, 5th and 7th bits are 1)
input type does not matter, it can be array for example.
I accomplished this with for loop, which is not wanted. Is there a function to create bitmask in a constant time?
Constant time cannot work if you have a variable number inputs. You have to iterate over the values at least once, right?
In any case, you can use intrinsics to minimize the number of operations. You have not specified your target architecture or the integer size. So I assume AVX2 and 64 bit integers as output. Also, for convenience, I assume that the inputs are 64 bit.
If your inputs are smaller-sized integers than the output, you have to add some zero-extensions.
#include <immintrin.h>
#include <array>
#include <cstdint>
#include <cstdio>
std::uint64_t accum(const std::uint64_t* bitpositions, std::size_t n)
{
// 2 x 64 bit integers set to 1
const __m128i ones2 = _mm_set1_epi64(_m_from_int64(1));
// 4 x 64 bit integers set to 1
const __m256i ones4 = _mm256_broadcastsi128_si256(ones2);
// 4 x 64 bit integers serving as partial bit masks
__m256i accum4 = _mm256_setzero_si256();
std::size_t i;
for(i = 0; i + 4 <= n; i += 4) {
// may be replaced with aligned load
__m256i positions = _mm256_loadu_si256((const __m256i*)(bitpositions + i));
// vectorized (1 << position) bit shift
__m256i shifted = _mm256_sllv_epi64(ones4, positions);
// add new bits to accumulator
accum4 = _mm256_or_si256(accum4, shifted);
}
// reduce 4 to 2 64 bit integers
__m128i accum2 = _mm256_castsi256_si128(accum4);
__m128i high2 = _mm256_extracti128_si256(accum4, 1);
if(i + 2 <= n) {
// zero or one iteration with 2 64 bit integers
__m128i positions = _mm_loadu_si128((const __m128i*)(bitpositions + i));
__m128i shifted = _mm_sllv_epi64(ones2, positions);
accum2 = _mm_or_si128(accum2, shifted);
i += 2;
}
// high2 folded in with delay to account for extract latency
accum2 = _mm_or_si128(accum2, high2);
// reduce to 1 64 bit integer
__m128i high1 = _mm_unpackhi_epi64(accum2, accum2);
accum2 = _mm_or_si128(accum2, high1);
std::uint64_t accum1 = static_cast<std::uint64_t>(_mm_cvtsi128_si64(accum2));
if(i < n)
accum1 |= 1 << bitpositions[i];
return accum1;
}
EDIT
I have just seen that your example inputs use 1-based indexing. So bit 1 would be set for value 1 and input value 0 is probably undefined behavior. I suggest switching to zero-based indexing. But if you are stuck with that notation, just add a _mm256_sub_epi64(positions, ones4) or _mm_sub_epi64(positions, ones2) before the shift.
With smaller input size
And here is a version for byte-sized input integers.
std::uint64_t accum(const std::uint8_t* bitpositions, std::size_t n)
{
const __m128i ones2 = _mm_set1_epi64(_m_from_int64(1));
const __m256i ones4 = _mm256_broadcastsi128_si256(ones2);
__m256i accum4 = _mm256_setzero_si256();
std::size_t i;
for(i = 0; i + 4 <= n; i += 4) {
/*
* As far as I can see, there is no point in loading a full 128 or 256 bit
* vector. To zero-extend more values, we would need to use many shuffle
* instructions and those have a lower throughput than repeated
* 32 bit loads
*/
__m128i positions = _mm_cvtsi32_si128(*(const int*)(bitpositions + i));
__m256i extended = _mm256_cvtepu8_epi64(positions);
__m256i shifted = _mm256_sllv_epi64(ones4, extended);
accum4 = _mm256_or_si256(accum4, shifted);
}
__m128i accum2 = _mm256_castsi256_si128(accum4);
__m128i high2 = _mm256_extracti128_si256(accum4, 1);
accum2 = _mm_or_si128(accum2, high2);
/*
* Until AVX512, there is no single instruction to load 2 byte into a vector
* register. So we don't bother. Instead, the scalar code below will run up
* to 3 times
*/
__m128i high1 = _mm_unpackhi_epi64(accum2, accum2);
accum2 = _mm_or_si128(accum2, high1);
std::uint64_t accum1 = static_cast<std::uint64_t>(_mm_cvtsi128_si64(accum2));
/*
* We use a separate accumulator to avoid the long dependency chain through
* the reduction above
*/
std::uint64_t tail = 0;
/*
* Compilers create a ton of code if we give them a simple loop because they
* think they can vectorize. So we unroll the loop, even if it is stupid
*/
if(i + 2 <= n) {
tail = std::uint64_t(1) << bitpositions[i++];
tail |= std::uint64_t(1) << bitpositions[i++];
}
if(i < n)
tail |= std::uint64_t(1) << bitpositions[i];
return accum1 | tail;
}
I'm trying to multiply two m128i byte per byte (8 bit signed integers).
The problem here is overflow. My solution is to store these 8 bit signed integers into 16 bit signed integers, multiply, then pack the whole thing into a m128i of 16 x 8 bit integers.
Here is the __m128i mulhi_epi8(__m128i a, __m128i b) emulation I made:
inline __m128i mulhi_epi8(__m128i a, __m128i b)
{
auto a_decomposed = decompose_epi8(a);
auto b_decomposed = decompose_epi8(b);
__m128i r1 = _mm_mullo_epi16(a_decomposed.first, b_decomposed.first);
__m128i r2 = _mm_mullo_epi16(a_decomposed.second, b_decomposed.second);
return _mm_packs_epi16(_mm_srai_epi16(r1, 8), _mm_srai_epi16(r2, 8));
}
decompose_epi8 is implemented in a non-simd way:
inline std::pair<__m128i, __m128i> decompose_epi8(__m128i input)
{
std::pair<__m128i, __m128i> result;
// result.first => should contain 8 shorts in [-128, 127] (8 first bytes of the input)
// result.second => should contain 8 shorts in [-128, 127] (8 last bytes of the input)
for (int i = 0; i < 8; ++i)
{
result.first.m128i_i16[i] = input.m128i_i8[i];
result.second.m128i_i16[i] = input.m128i_i8[i + 8];
}
return result;
}
This code works well. My goal now is to implement a simd version of this for loop. I looked at the Intel Intrinsics Guide but I can't find a way to do this. I guess shuffle could do the trick but I have trouble conceptualising this.
As you want to do signed multiplication, you need to sign-extend each byte to 16bit words, or move them into the upper half of each 16bit word. Since you pack the results back together afterwards, you can split the input into odd and even bytes, instead of the higher and lower half. Then sign-extension of the odd bytes can be done by arithmetically shifting all 16bit parts to the right You can extract the odd bytes by masking out the even bytes, and to get the even bytes, you can shift all 16bit parts to the left (both need to be multiplied by _mm_mulhi_epi16).
The following should work with SSE2:
__m128i mulhi_epi8(__m128i a, __m128i b)
{
__m128i mask = _mm_set1_epi16(0xff00);
// mask higher bytes:
__m128i a_hi = _mm_and_si128(a, mask);
__m128i b_hi = _mm_and_si128(b, mask);
__m128i r_hi = _mm_mulhi_epi16(a_hi, b_hi);
// mask out garbage in lower half:
r_hi = _mm_and_si128(r_hi, mask);
// shift lower bytes to upper half
__m128i a_lo = _mm_slli_epi16(a,8);
__m128i b_lo = _mm_slli_epi16(b,8);
__m128i r_lo = _mm_mulhi_epi16(a_lo, b_lo);
// shift result to the lower half:
r_lo = _mm_srli_epi16(r_lo,8);
// join result and return:
return _mm_or_si128(r_hi, r_lo);
}
Note: a previous version used shifts to sign-extend the odd bytes. On most Intel CPUs this would increase P0 usage (which needs to be used for multiplication as well). Bit-logic can operate on more ports, so this version should have better throughput.
I have the function below:
void CopyImageBitsWithAlphaRGBA(unsigned char *dest, const unsigned char *src, int w, int stride, int h,
unsigned char minredmask, unsigned char mingreenmask, unsigned char minbluemask, unsigned char maxredmask, unsigned char maxgreenmask, unsigned char maxbluemask)
{
auto pend = src + w * h * 4;
for (auto p = src; p < pend; p += 4, dest += 4)
{
dest[0] = p[0]; dest[1] = p[1]; dest[2] = p[2];
if ((p[0] >= minredmask && p[0] <= maxredmask) || (p[1] >= mingreenmask && p[1] <= maxgreenmask) || (p[2] >= minbluemask && p[2] <= maxbluemask))
dest[3] = 255;
else
dest[3] = 0;
}
}
What it does is it copies a 32 bit bitmap from one memory block to another, setting the alpha channel to fully transparent when the pixel color falls within a certain color range.
How do I make this use SSE/AVX in VC++ 2017? Right now it's not generating vectorized code. Failing an automatic way of doing it, what functions can I use to do this myself?
Because really, I'd imagine testing if bytes are in a range would be one of the most obviously useful operations possible, but I can't see any built in function to take care of it.
I don't think you're going to get a compiler to auto-vectorize as well as you can do by hand with Intel's intrinsics. (err, as well as I can do by hand anyway :P).
Possibly once we manually vectorize it, we can see how to hand-hold a compiler with scalar code that works that way, but we really need packed-compare into a 0/0xFF with byte elements, and it's hard to write something in C that compilers will auto-vectorize well. The default integer promotions mean that most C expressions actually produce 32-bit results, even when you use uint8_t, and that often tricks compilers into unpacking 8-bit to 32-bit elements, costing a lot of shuffles on top of the automatic factor of 4 throughput loss (fewer elements per register), like in #harold's small tweak to your source.
SSE/AVX (before AVX512) has signed comparisons for SIMD integer, not unsigned. But you can range-shift things to signed -128..127 by subtracting 128. XOR (add-without-carry) is slightly more efficient on some CPUs, so you actually just XOR with 0x80 to flip the high bit. But mathematically you're subtracting 128 from a 0..255 unsigned value, giving a -128..127 signed value.
It's even still possible to implement the "unsigned compare trick" of (x-min) < (max-min). (For example, detecting alphabetic ASCII characters). As a bonus, we can bake the range-shift into that subtract. If x<min, it wraps around and becomes a large value greater than max-min. This obviously works for unsigned, but it does in fact work (with a range-shifted max-min) with SSE/AVX2 signed-compare instructions. (A previous version of this answer claimed this trick only worked if max-min < 128, but that's not the case. x-min can't wrap all the way around and become lower than max-min, or get into that range if it started above max).
An earlier version of this answer had code that made the range exclusive, i.e. not including the ends, so you even redmin=0 / redmax=255 would exclude pixels with red=0 or red=255. But I solved that by comparing the other way (thanks to ideas from #Nejc's and #chtz's answers).
#chtz's idea of using a saturating add/sub instead of a compare is very cool. If you arrange things so saturation means in-range, it works for an inclusive range. (And you can set the Alpha component to a known value by choosing a min/max that makes all 256 possible inputs in-range). This lets us avoid range-shifting to signed, because unsigned-saturation is available
We can combine the sub/cmp range-check with the saturation trick to do sub (wraps on out-of-bounds low) / subs (only reaches zero if the first sub didn't wrap). Then we don't need an andnot or or to combine two separate checks on each component; we already have a 0 / non-zero result in one vector.
So it only takes two operations to give us a 32-bit value for the whole pixel that we can check. Iff all 3 RGB components are in-range, that element will have a specific value. (Because we've arranged for the Alpha component to already give a known value, too). If any of the 3 components are out-of-range, it will have some other value.
If you do this the other way, so saturation means out-of-range, then you have an exclusive range in that direction, because you can't choose a limit such that no value reaches 0 or reaches 255. You can always saturate the alpha component to give yourself a known value there, regardless of what it means for the RGB components. An exclusive range would let you abuse this function to be always-false by choosing a range that no pixel could ever match. (Or if there's a third condition, besides per-component min/max, then maybe you want an override).
The obvious thing would be to use a packed-compare instruction with 32-bit element size (_mm256_cmpeq_epi32 / vpcmpeqd) to generate a 0xFF or 0x00 (which we can apply / blend into the original RGB pixel value) for in/out of range.
// AVX2 core idea: wrapping-compare trick with saturation to achieve unsigned compare
__m256i tmp = _mm256_sub_epi8(src, min_values); // wraps to high unsigned if below min
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min); // unsigned saturation to 0 means in-range
__m256i new_alpha = _mm256_cmpeq_epi32(RGB_inrange, _mm256_setzero_si256());
// then blend the high byte of each element with RGB from the src vector
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, _mm256_set1_epi32(0x00FFFFFF)); // alpha from new_alpha, RGB from src
Note that an SSE2 version would only need one MOVDQA instructions to copy src; the same register is the destination for every instruction.
Also note that you could saturate the other direction: add then adds (with (256-max) and (256-(min-max)), I think) to saturate to 0xFF for in-range. This could be useful with AVX512BW if you use zero-masking with a fixed mask (e.g. for alpha) or variable mask (for some other condition) to exclude a component based on some other condition. AVX512BW zero-masking for the sub/subs version would consider components in-range even when they aren't, which could also be useful.
But extending that to AVX512 requires a different approach: AVX512 compares produce a bit-mask (in a mask register), not a vector, so we can't turn around and use the high byte of each 32-bit compare result separately.
Instead of cmpeq_epi32, we can produce the value we want in the high byte of each pixel using carry/borrow from a subtract, which propagates left to right.
0x00000000 - 1 = 0xFFFFFFFF # high byte = 0xFF = new alpha
0x00?????? - 1 = 0x00?????? # high byte = 0x00 = new alpha
Where ?????? has at least one non-zero bit, so it's a 32-bit number >=0 and <=0x00FFFFFFFF
Remember we choose an alpha range that makes the high byte always zero
i.e. _mm256_sub_epi32(RGB_inrange, _mm_set1_epi32(1)). We only need the high byte of each 32-bit element to have the alpha value we want, because we use a byte-blend to merge it with the source RGB values. For AVX512, this avoids a VPMOVM2D zmm1, k1 instruction to convert a compare result back into a vector of 0/-1, or (much more expensive) to interleave each mask bit with 3 zeros to use it for a byte-blend.
This sub instead of cmp has a minor advantage even for AVX2: sub_epi32 runs on more ports on Skylake (p0/p1/p5 vs. p0/p1 for pcmpgt/pcmpeq). On all other CPUs, vector integer add/sub run on the same ports as vector integer compare. (Agner Fog's instruction tables).
Also, if you compile _mm256_cmpeq_epi32() with -march=native on a CPU with AVX512, or otherwise enable AVX512 and then compile normal AVX2 intrinsics, some compilers will stupidly use AVX512 compare-into-mask and then expand back to a vector instead of just using the VEX-coded vpcmpeqd. Thus, we use sub instead of cmp even for the _mm256 intrinsics version, because I already spent the time to figure it out and show that it's at least as efficient in the normal case of compiling for regular AVX2. (Although _mm256_setzero_si256() is cheaper than set1(1); vpxor can zero a register cheaply instead of loading a constant, but this setup happens outside the loop.)
#include <immintrin.h>
#ifdef __AVX2__
// inclusive min and max
__m256i setAlphaFromRangeCheck_AVX2(__m256i src, __m256i mins, __m256i max_minus_min)
{
__m256i tmp = _mm256_sub_epi8(src, mins); // out-of-range wraps to a high signed value
// (x-min) <= (max-min) equivalent to:
// (x-min) - (max-min) saturates to zero
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min);
// 0x00000000 for in-range pixels, 0x00?????? (some higher value) otherwise
// this has minor advantages over compare against zero, see full comments on Godbolt
__m256i new_alpha = _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1));
// 0x00000000 - 1 = 0xFFFFFFFF
// 0x00?????? - 1 = 0x00?????? high byte = new alpha value
const __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF); // blend mask
// without AVX512, the only byte-granularity blend is a 2-uop variable-blend with a control register
// On Ryzen, it's only 1c latency, so probably 1 uop that can only run on one port. (1c throughput).
// For 256-bit, that's 2 uops of course.
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, RGB_mask); // RGB from src, 0/FF from new_alpha
return alpha_replaced;
}
#endif // __AVX2__
Set up vector args for this function and loop over your array with _mm256_load_si256 / _mm256_store_si256. (Or loadu/storeu if you can't guarantee alignment.)
This compiles very efficiently (Godbolt Compiler explorer) with gcc, clang, and MSVC. (AVX2 version on Godbolt is good, AVX512 and SSE versions are still a mess, not all the tricks applied to them yet.)
;; MSVC's inner loop from a caller that loops over an array with it:
;; see the Godbolt link
$LL4#:
vmovdqu ymm3, YMMWORD PTR [rdx+rax*4]
vpsubb ymm0, ymm3, ymm7
vpsubusb ymm1, ymm0, ymm6
vpsubd ymm2, ymm1, ymm5
vpblendvb ymm3, ymm2, ymm3, ymm4
vmovdqu YMMWORD PTR [rcx+rax*4], ymm3
add eax, 8
cmp eax, r8d
jb SHORT $LL4#
So MSVC managed to hoist the constant setup after inlining. We get similar loops from gcc/clang.
The loop has 4 vector ALU instructions, one of which takes 2 uops. Total 5 vector ALU uops. But total fused-domain uops on Haswell/Skylake = 9 with no unrolling, so with luck this can run at 32 bytes (1 vector) per 2.25 clock cycles. It could come close to actually achieving that with data hot in L1d or L2 cache, but L3 or memory would be a bottleneck. With unrolling, it could maybe bottlenck on L2 cache bandwidth.
An AVX512 version (also included in the Godbolt link), only needs 1 uop to blend, and could run faster in vectors per cycle, thus more than twice as fast using 512-byte vectors.
This is one possible way to make this function work with SSE instructions. I used SSE instead of AVX because I wanted to keep the answer simple. Once you understand how the solution works, rewriting the function with AVX intrinsics should not be much of a problem though.
EDIT: please note that my approach is very similar to one by PeterCordes, but his code should be faster because he uses AVX. If you want to rewrite the function below with AVX intrinsics, change step value to 8.
void CopyImageBitsWithAlphaRGBA(
unsigned char *dest,
const unsigned char *src, int w, int stride, int h,
unsigned char minred, unsigned char mingre, unsigned char minblu,
unsigned char maxred, unsigned char maxgre, unsigned char maxblu)
{
char low = 0x80; // -128
char high = 0x7f; // 127
char mnr = *(char*)(&minred) - low;
char mng = *(char*)(&mingre) - low;
char mnb = *(char*)(&minblu) - low;
int32_t lowest = mnr | (mng << 8) | (mnb << 16) | (low << 24);
char mxr = *(char*)(&maxred) - low;
char mxg = *(char*)(&maxgre) - low;
char mxb = *(char*)(&maxblu) - low;
int32_t highest = mxr | (mxg << 8) | (mxb << 16) | (high << 24);
// SSE
int step = 4;
int sse_width = (w / step)*step;
for (int y = 0; y < h; ++y)
{
for (int x = 0; x < w; x += step)
{
if (x == sse_width)
{
x = w - step;
}
int ptr_offset = y * stride + x;
const unsigned char* src_ptr = src + ptr_offset;
unsigned char* dst_ptr = dest + ptr_offset;
__m128i loaded = _mm_loadu_si128((__m128i*)src_ptr);
// subtract 128 from every 8-bit int
__m128i subtracted = _mm_sub_epi8(loaded, _mm_set1_epi8(low));
// greater than top limit?
__m128i masks_hi = _mm_cmpgt_epi8(subtracted, _mm_set1_epi32(highest));
// lower that bottom limit?
__m128i masks_lo = _mm_cmplt_epi8(subtracted, _mm_set1_epi32(lowest));
// perform OR operation on both masks
__m128i combined = _mm_or_si128(masks_hi, masks_lo);
// are 32-bit integers equal to zero?
__m128i eqzer = _mm_cmpeq_epi32(combined, _mm_setzero_si128());
__m128i shifted = _mm_slli_epi32(eqzer, 24);
// EDIT: fixed a bug:
__m128 alpha_unmasked = _mm_and_si128(loaded, _mm_set1_epi32(0x00ffffff));
__m128i combined = _mm_or_si128(alpha_unmasked, shifted);
_mm_storeu_si128((__m128i*)dst_ptr, combined);
}
}
}
EDIT: as #PeterCordes stated in the comments, the code included a bug that is now fixed.
Based on #PeterCordes solution, but replacing the shift+compare by saturated subtract and adding:
// mins_compl shall be [255-minR, 255-minG, 255-minB, 0]
// maxs shall be [maxR, maxG, maxB, 0]
__m256i setAlphaFromRangeCheck(__m256i src, __m256i mins_compl, __m256i maxs)
{
__m256i in_lo = _mm256_adds_epu8(src, mins_compl); // is 255 iff src+mins_coml>=255, i.e. src>=mins
__m256i in_hi = _mm256_subs_epu8(src, maxs); // is 0 iff src - maxs <= 0, i.e., src <= maxs
__m256i inbounds_components = _mm256_andnot_si256(in_hi, in_lo);
// per-component mask, 0xff, iff (mins<=src && src<=maxs).
// alpha-channel is always (~src & src) == 0
// Use a 32-bit element compare to check that all 3 components are in-range
__m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF);
__m256i inbounds = _mm256_cmpeq_epi32(inbounds_components, RGB_mask);
__m256i new_alpha = _mm256_slli_epi32(inbounds, 24);
// alternatively _mm256_andnot_si256(RGB_mask, inbounds) ?
// byte blends (vpblendvb) are at least 2 uops, and Haswell requires port5
// instead clear alpha and then OR in the new alpha (0 or 0xFF)
__m256i alphacleared = _mm256_and_si256(src, RGB_mask); // off the critical path
__m256i new_alpha_applied = _mm256_or_si256(alphacleared, new_alpha);
return new_alpha_applied;
}
This saves on vpxor (no modification of src required) and one vpand (the alpha-channel is automatically 0 -- I guess that would be possible with Peter's solution as well by choosing the boundaries accordingly).
Godbolt-Link, apparently, neither gcc nor clang think it is worthwhile to re-use RGB_mask for both usages ...
Simple testing with SSE2 variant: https://wandbox.org/permlink/eVzFHljxfTX5HDcq (you can play around with the source and the boundaries)
I need to perform the following operation:
w[i] = scale * v[i] + point
scale and point are fixed, whereas v[] is a vector of 4-bit integers.
I need to compute w[] for the arbitrary input vector v[] and I want to speed up the process using AVX intrinsics. However, v[i] is a vector of 4-bit integers.
The question is how to perform operations on 4-bit integers using intrinsics? I could use 8-bit integers and perform operations that way, but is there a way to do the following:
[a,b] + [c,d] = [a+b,c+d]
[a,b] * [c,d] = [a * b,c * d]
(Ignoring overflow)
Using AVX intrinsics, where [...,...] Is an 8-bit integer and a,b,c,d are 4-bit integers?
If yes, would it be possible to give a short example on how this could work?
Just a partial answer (only addition) and in pseudo code (should be easy to extent to AVX2 intrinsics):
uint8_t a, b; // input containing two nibbles each
uint8_t c = a + b; // add with (unwanted) carry between nibbles
uint8_t x = a ^ b ^ c; // bits which are result of a carry
x &= 0x10; // only bit 4 is of interest
c -= x; // undo carry of lower to upper nibble
If either a or b is known to have bit 4 unset (i.e. the lowest bit of the upper nibble), it can be left out the computation of x.
As for multiplication: If scale is the same for all products, you can likely get away with some shifting and adding/subtracting (masking out overflow bits where necessarry). Otherwise, I'm afraid you need to mask out 4 bits of each 16bit word, do the operation, and fiddle them together at the end. Pseudo code (there is no AVX 8bit multiplication, so we need to operate with 16bit words):
uint16_t m0=0xf, m1=0xf0, m2=0xf00, m3=0xf000; // masks for each nibble
uint16_t a, b; // input containing 4 nibbles each.
uint16_t p0 = (a*b) & m0; // lowest nibble, does not require masking a,b
uint16_t p1 = ((a>>4) * (b&m1)) & m1;
uint16_t p2 = ((a>>8) * (b&m2)) & m2;
uint16_t p3 = ((a>>12)* (b&m3)) & m3;
uint16_t result = p0 | p1 | p2 | p3; // join results together
For fixed a, b in w[i]=v[i] * a + b, you can simply use a lookup table w_0_3 = _mm_shuffle_epi8(LUT_03, input) for the LSB. Split the input to even and odd nibbles, with the odd LUT preshifted by 4.
auto a = input & 15; // per element
auto b = (input >> 4) & 15; // shift as 16 bits
return LUTA[a] | LUTB[b];
How to generate those LUTs dynamically, is another issue, if at all.
4-bit aditions/multiplication can be done using AVX2, particularly if you want to apply those computations on larger vectors (say more than 128 elements). However, if you want to add just 4 numbers use straight scalar code.
We have done an extensive work on how to deal with 4-bit integers, and we have recently developed a library to do it Clover: 4-bit Quantized Linear Algebra Library (with focus on quantization). The code is also available at GitHub.
As you mentioned only 4-bit integers, I would assume that you are referring to signed integers (i.e. two's complements), and base my answer accordingly. Note that handling unsigned is in fact much simpler.
I would also assume that you would like to take vector int8_t v[n/2] that contains n 4-bit integers, and produce int8_t v_sum[n/4] having n/2 4-bit integers. All the code relative to the description bellow is available as a gist.
Packing / Unpacking
Obviously AVX2 does not offer any instructions to perform additions / multiplication on 4-bit integers, therefore, you must resort to the given 8- or 16-bit instruction. The first step in dealing with 4-bit arithmetics is to devise methods on how to place the 4-bit nibble into larger chunks of 8-, 16-, or 32-bit chunk.
For a sake of clarity let's assume that you want to unpack a given nibble from a 32-bit chunk that stores multiple 4-bit signed values into a corresponding 32-bit integer (figure below). This can be done with two bit shifts:
a logical left shift is used to shift the nibble so that it occupies the highest-order 4-bits of the 32-bit entity.
an arithmetic right shift is used to shift the nibble to the lowest order 4-bits of the 32-bit entity.
The arithmetic right shift has sign extension, filling the high-order 28 bits with the sign bit of the nibble. yielding a 32-bit integer with the same value as the two’s complement 4-bit value.
The goal of packing (left part of figure above) is to revert the unpacking operation. Two bit shifts can be used to place the lowest order 4 bits of a 32-bit integer anywhere within a 32-bit entity.
a logical left shift is used to shift the nibble so that it occupies the highest-order 4-bits of the 32-bit entity.
a logical right shift is used to shift the nibble to somewhere within the 32-bit entity.
The first sets the bits lower-ordered than the nibble to zero, and the second sets the bits higher-ordered than the nibble to zero. A bitwise OR operation can then be used to store up to eight nibbles in the 32-bit entity.
How to apply this in practice?
Let's assume that you have 64 x 32-bit integer values stored in 8 AVX registers __m256i q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8. Let's also assume that each value is in the [-8, 7], range. If you want to pack them into a single AVX register of 64 x 4-bit values, you can do as follows:
//
// Transpose the 8x8 registers
//
_mm256_transpose8_epi32(q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8);
//
// Shift values left
//
q_1 = _mm256_slli_epi32(q_1, 28);
q_2 = _mm256_slli_epi32(q_2, 28);
q_3 = _mm256_slli_epi32(q_3, 28);
q_4 = _mm256_slli_epi32(q_4, 28);
q_5 = _mm256_slli_epi32(q_5, 28);
q_6 = _mm256_slli_epi32(q_6, 28);
q_7 = _mm256_slli_epi32(q_7, 28);
q_8 = _mm256_slli_epi32(q_8, 28);
//
// Shift values right (zero-extend)
//
q_1 = _mm256_srli_epi32(q_1, 7 * 4);
q_2 = _mm256_srli_epi32(q_2, 6 * 4);
q_3 = _mm256_srli_epi32(q_3, 5 * 4);
q_4 = _mm256_srli_epi32(q_4, 4 * 4);
q_5 = _mm256_srli_epi32(q_5, 3 * 4);
q_6 = _mm256_srli_epi32(q_6, 2 * 4);
q_7 = _mm256_srli_epi32(q_7, 1 * 4);
q_8 = _mm256_srli_epi32(q_8, 0 * 4);
//
// Pack together
//
__m256i t1 = _mm256_or_si256(q_1, q_2);
__m256i t2 = _mm256_or_si256(q_3, q_4);
__m256i t3 = _mm256_or_si256(q_5, q_6);
__m256i t4 = _mm256_or_si256(q_7, q_8);
__m256i t5 = _mm256_or_si256(t1, t2);
__m256i t6 = _mm256_or_si256(t3, t4);
__m256i t7 = _mm256_or_si256(t5, t6);
Shifts usually take 1 cycle of throughput, and 1 cycle of latency, thus you can assume that are in fact quite inexpensive. If you have to deal with unsigned 4-bit values, the left shifts can be skipped all together.
To reverse the procedure, you can apply the same method. Let's assume that you have loaded 64 4-bit values into a single AVX register __m256i qu_64. In order to produce 64 x 32-bit integers __m256i q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8, you can execute the following:
//
// Shift values left
//
const __m256i qu_1 = _mm256_slli_epi32(qu_64, 4 * 7);
const __m256i qu_2 = _mm256_slli_epi32(qu_64, 4 * 6);
const __m256i qu_3 = _mm256_slli_epi32(qu_64, 4 * 5);
const __m256i qu_4 = _mm256_slli_epi32(qu_64, 4 * 4);
const __m256i qu_5 = _mm256_slli_epi32(qu_64, 4 * 3);
const __m256i qu_6 = _mm256_slli_epi32(qu_64, 4 * 2);
const __m256i qu_7 = _mm256_slli_epi32(qu_64, 4 * 1);
const __m256i qu_8 = _mm256_slli_epi32(qu_64, 4 * 0);
//
// Shift values right (sign-extent) and obtain 8x8
// 32-bit values
//
__m256i q_1 = _mm256_srai_epi32(qu_1, 28);
__m256i q_2 = _mm256_srai_epi32(qu_2, 28);
__m256i q_3 = _mm256_srai_epi32(qu_3, 28);
__m256i q_4 = _mm256_srai_epi32(qu_4, 28);
__m256i q_5 = _mm256_srai_epi32(qu_5, 28);
__m256i q_6 = _mm256_srai_epi32(qu_6, 28);
__m256i q_7 = _mm256_srai_epi32(qu_7, 28);
__m256i q_8 = _mm256_srai_epi32(qu_8, 28);
//
// Transpose the 8x8 values
//
_mm256_transpose8_epi32(q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8);
If dealing with unsigned 4-bits, the right shifts (_mm256_srai_epi32) can be skipped all-together, and instead of left shifts, we can perform left-logical shifts (_mm256_srli_epi32 ).
To see more details have a look a the gist here.
Adding Odd and Even 4-bit entries
Let's assume that you load from the vector using AVX:
const __m256i qv = _mm256_loadu_si256( ... );
Now, we can easily extract the odd and the even parts. Life would have been much easier if there were 8-bit shifts in AVX2, but there are none, so we have to deal with 16-bit shifts:
const __m256i hi_mask_08 = _mm256_set1_epi8(-16);
const __m256i qv_odd_dirty = _mm256_slli_epi16(qv, 4);
const __m256i qv_odd_shift = _mm256_and_si256(hi_mask_08, qv_odd_dirty);
const __m256i qv_evn_shift = _mm256_and_si256(hi_mask_08, qv);
At this point in time, you have essentially separated the odd and the even nibbles, in two AVX registers that hold their values in the high 4-bits (i.e. values in the range [-8 * 2^4, 7 * 2^4]). The procedure is the same even when dealing with unsigned 4-bit values. Now it is time to add the values.
const __m256i qv_sum_shift = _mm256_add_epi8(qv_odd_shift, qv_evn_shift);
This will work with both signed and unsigned, as binary addition work with two's complements. However, if you want to avoid overflows or underflows you can also consider addition with saturation already supported in AVX (for both signed and unsigned):
__m256i _mm256_adds_epi8 (__m256i a, __m256i b)
__m256i _mm256_adds_epu8 (__m256i a, __m256i b)
qv_sum_shift will be in the range [-8 * 2^4, 7 * 2^4]. To set it to the right value, we need to shift it back (Note that if qv_sum has to be unsigned, we can use _mm256_srli_epi16 instead):
const __m256i qv_sum = _mm256_srai_epi16(qv_sum_shift, 4);
The summation is now complete. Depending on your use case, this could as well be the end of the program, assuming that you want to produce 8-bit chunks of memory as a result. But let's assume that you want to solve a harder problem. Let's assume that the output is again a vector of 4-bit elements, with the same memory layout as the input one. In that case, we need to pack the 8-bit chunks into 4-bit chunks. However, the problem is that instead of having 64 values, we will end up with 32 values (i.e. half the size of the vector).
From this point there are two options. We either look ahead in the vector, processing 128 x 4-bit values, such that we produce 64 x 4-bit values. Or we revert to SSE, dealing with 32 x 4-bit values. Either way, the fastest way to pack the 8-bit chunks into 4-bit chunks would be to use the vpackuswb (or packuswb for SSE) instruction:
__m256i _mm256_packus_epi16 (__m256i a, __m256i b)
This instruction convert packed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and store the results in dst. This means that we have to interleave the odd and even 4-bit values, such that they reside in the 8 low-bits of a 16-bit memory chunk. We can proceed as follows:
const __m256i lo_mask_16 = _mm256_set1_epi16(0x0F);
const __m256i hi_mask_16 = _mm256_set1_epi16(0xF0);
const __m256i qv_sum_lo = _mm256_and_si256(lo_mask_16, qv_sum);
const __m256i qv_sum_hi_dirty = _mm256_srli_epi16(qv_sum_shift, 8);
const __m256i qv_sum_hi = _mm256_and_si256(hi_mask_16, qv_sum_hi_dirty);
const __m256i qv_sum_16 = _mm256_or_si256(qv_sum_lo, qv_sum_hi);
The procedure will be identical for both signed and unsigned 4-bit values. Now, qv_sum_16 contains two consecutive 4-bit values, stored in the low-bits of a 16-bit memory chunk. Assuming that we have obtained qv_sum_16 from the next iteration (call it qv_sum_16_next), we can pack everything with:
const __m256i qv_sum_pack = _mm256_packus_epi16(qv_sum_16, qv_sum_16_next);
const __m256i result = _mm256_permute4x64_epi64(qv_sum_pack, 0xD8);
Alternatively, if we want to produce only 32 x 4-bit values, we can do the following:
const __m128i lo = _mm256_extractf128_si256(qv_sum_16, 0);
const __m128i hi = _mm256_extractf128_si256(qv_sum_16, 1);
const __m256i result = _mm_packus_epi16(lo, hi)
Putting it all together
Assuming signed nibbles, and vector size n, such that n is larger than 128 elements and is multiple of 128, we can perform the odd-even addition, producing n/2 elements as follows:
void add_odd_even(uint64_t n, int8_t * v, int8_t * r)
{
//
// Make sure that the vector size that is a multiple of 128
//
assert(n % 128 == 0);
const uint64_t blocks = n / 64;
//
// Define constants that will be used for masking operations
//
const __m256i hi_mask_08 = _mm256_set1_epi8(-16);
const __m256i lo_mask_16 = _mm256_set1_epi16(0x0F);
const __m256i hi_mask_16 = _mm256_set1_epi16(0xF0);
for (uint64_t b = 0; b < blocks; b += 2) {
//
// Calculate the offsets
//
const uint64_t offset0 = b * 32;
const uint64_t offset1 = b * 32 + 32;
const uint64_t offset2 = b * 32 / 2;
//
// Load 128 values in two AVX registers. Each register will
// contain 64 x 4-bit values in the range [-8, 7].
//
const __m256i qv_1 = _mm256_loadu_si256((__m256i *) (v + offset0));
const __m256i qv_2 = _mm256_loadu_si256((__m256i *) (v + offset1));
//
// Extract the odd and the even parts. The values will be split in
// two registers qv_odd_shift and qv_evn_shift, each of them having
// 32 x 8-bit values, such that each value is multiplied by 2^4
// and resides in the range [-8 * 2^4, 7 * 2^4]
//
const __m256i qv_odd_dirty_1 = _mm256_slli_epi16(qv_1, 4);
const __m256i qv_odd_shift_1 = _mm256_and_si256(hi_mask_08, qv_odd_dirty_1);
const __m256i qv_evn_shift_1 = _mm256_and_si256(hi_mask_08, qv_1);
const __m256i qv_odd_dirty_2 = _mm256_slli_epi16(qv_2, 4);
const __m256i qv_odd_shift_2 = _mm256_and_si256(hi_mask_08, qv_odd_dirty_2);
const __m256i qv_evn_shift_2 = _mm256_and_si256(hi_mask_08, qv_2);
//
// Perform addition. In case of overflows / underflows, behaviour
// is undefined. Values are still in the range [-8 * 2^4, 7 * 2^4].
//
const __m256i qv_sum_shift_1 = _mm256_add_epi8(qv_odd_shift_1, qv_evn_shift_1);
const __m256i qv_sum_shift_2 = _mm256_add_epi8(qv_odd_shift_2, qv_evn_shift_2);
//
// Divide by 2^4. At this point in time, each of the two AVX registers holds
// 32 x 8-bit values that are in the range of [-8, 7]. Summation is complete.
//
const __m256i qv_sum_1 = _mm256_srai_epi16(qv_sum_shift_1, 4);
const __m256i qv_sum_2 = _mm256_srai_epi16(qv_sum_shift_2, 4);
//
// Now, we want to take the even numbers of the 32 x 4-bit register, and
// store them in the high-bits of the odd numbers. We do this with
// left shifts that extend in zero, and 16-bit masks. This operation
// results in two registers qv_sum_lo and qv_sum_hi that hold 32
// values. However, each consecutive 4-bit values reside in the
// low-bits of a 16-bit chunk.
//
const __m256i qv_sum_1_lo = _mm256_and_si256(lo_mask_16, qv_sum_1);
const __m256i qv_sum_1_hi_dirty = _mm256_srli_epi16(qv_sum_shift_1, 8);
const __m256i qv_sum_1_hi = _mm256_and_si256(hi_mask_16, qv_sum_1_hi_dirty);
const __m256i qv_sum_2_lo = _mm256_and_si256(lo_mask_16, qv_sum_2);
const __m256i qv_sum_2_hi_dirty = _mm256_srli_epi16(qv_sum_shift_2, 8);
const __m256i qv_sum_2_hi = _mm256_and_si256(hi_mask_16, qv_sum_2_hi_dirty);
const __m256i qv_sum_16_1 = _mm256_or_si256(qv_sum_1_lo, qv_sum_1_hi);
const __m256i qv_sum_16_2 = _mm256_or_si256(qv_sum_2_lo, qv_sum_2_hi);
//
// Pack the two registers of 32 x 4-bit values, into a single one having
// 64 x 4-bit values. Use the unsigned version, to avoid saturation.
//
const __m256i qv_sum_pack = _mm256_packus_epi16(qv_sum_16_1, qv_sum_16_2);
//
// Interleave the 64-bit chunks.
//
const __m256i qv_sum = _mm256_permute4x64_epi64(qv_sum_pack, 0xD8);
//
// Store the result
//
_mm256_storeu_si256((__m256i *)(r + offset2), qv_sum);
}
}
A self-contained tester and validator of this code is available in the gist here.
Multiplying Odd and Even 4-bit entries
For the multiplication of the odd and even entries, we can use the same strategy as described above to extract the 4-bits into larger chunks.
AVX2 does not offer 8-bit multiplication, only 16-bit. However, we can implement 8-bit multiplication following the method implemented in the Agner Fog's C++ vector class library:
static inline Vec32c operator * (Vec32c const & a, Vec32c const & b) {
// There is no 8-bit multiply in SSE2. Split into two 16-bit multiplies
__m256i aodd = _mm256_srli_epi16(a,8); // odd numbered elements of a
__m256i bodd = _mm256_srli_epi16(b,8); // odd numbered elements of b
__m256i muleven = _mm256_mullo_epi16(a,b); // product of even numbered elements
__m256i mulodd = _mm256_mullo_epi16(aodd,bodd); // product of odd numbered elements
mulodd = _mm256_slli_epi16(mulodd,8); // put odd numbered elements back in place
__m256i mask = _mm256_set1_epi32(0x00FF00FF); // mask for even positions
__m256i product = selectb(mask,muleven,mulodd); // interleave even and odd
return product;
}
I would suggest however to extract the nibbles into 16-bit chunks first and then use _mm256_mullo_epi16 to avoid performing unnecessary shifts.
I run a bench of computations using SIMD intructions. These instructions return a vector of 16 bytes as result, named compare, with each byte being 0x00 or 0xff :
0 1 2 3 4 5 6 7 15 16
compare : 0x00 0x00 0x00 0x00 0xff 0x00 0x00 0x00 ... 0xff 0x00
Bytes set to 0xff mean I need to run the function do_operation(i) with i being the position of the byte.
For instance, the above compare vector mean, I need to run this sequence of operations :
do_operation(4);
do_operation(15);
Here is the fastest solution I came up with until now :
for(...) {
//
// SIMD computations
//
__m128i compare = ... // Result of SIMD computations
// Extract high and low quadwords for compare vector
std::uint64_t cmp_low = (_mm_cvtsi128_si64(compare));
std::uint64_t cmp_high = (_mm_extract_epi64(compare, 1));
// Process low quadword
if (cmp_low) {
const std::uint64_t low_possible_positions = 0x0706050403020100;
const std::uint64_t match_positions = _pext_u64(
low_possible_positions, cmp_low);
const int match_count = _popcnt64(cmp_low) / 8;
const std::uint8_t* match_pos_array =
reinterpret_cast<const std::uint8_t*>(&match_positions);
for (int i = 0; i < match_count; ++i) {
do_operation(i);
}
}
// Process high quadword (similarly)
if (cmp_high) {
const std::uint64_t high_possible_positions = 0x0f0e0d0c0b0a0908;
const std::uint64_t match_positions = _pext_u64(
high_possible_positions, cmp_high);
const int match_count = _popcnt64(cmp_high) / 8;
const std::uint8_t* match_pos_array =
reinterpret_cast<const std::uint8_t*>(&match_positions);
for(int i = 0; i < match_count; ++i) {
do_operation(i);
}
}
}
I start with extracting the first and second 64 bits integers of the 128 bits vector (cmp_low and cmp_high). Then I use popcount to compute the number of bytes set to 0xff (number of bits set to 1 divided by 8). Finally, I use pext to get positions, without zeros, like this :
0x0706050403020100
0x000000ff00ff0000
|
PEXT
|
0x0000000000000402
I would like to find a faster solution to extract the positions of the bytes set to 0xff in the compare vector. More precisely, the are very often only 0, 1 or 2 bytes set to 0xff in the compare vector and I would like to use this information to avoid some branches.
Here's a quick outline of how you could reduce the number of tests:
First use a function to project all the lsb or msb of each byte of your 128bit integer into a 16bit value (for instance, there's a SSE2 assembly instruction for that on X86 cpus: pmovmskb, which is supported on Intel and MS compilers with the _mm_movemask_pi8 intrinsic, and gcc has also an intrinsic: __builtin_ia32_ppmovmskb128, );
Then split that value in 4 nibbles;
define functions to handle each possible values of a nibble (from 0 to 15) and put these in an array;
Finally call the function indexed by each nibble (with extra parameters to indicate which nibble in the 16bits it is).
Since in your case very often only 0, 1 or 2 bytes are set to 0xff in the compare vector, a short
while-loop on the bitmask might be more efficient than a solution based on the pext
instruction. See also my answer on a similar question.
/*
gcc -O3 -Wall -m64 -mavx2 -march=broadwell esbsimd.c
*/
#include <stdio.h>
#include <immintrin.h>
int do_operation(int i){ /* some arbitrary do_operation() */
printf("i = %d\n",i);
return 0;
}
int main(){
__m128i compare = _mm_set_epi8(0xFF,0,0,0, 0,0,0,0, 0,0,0,0xFF, 0,0,0,0); /* Take some randon value for compare */
int k = _mm_movemask_epi8(compare);
while (k){
int i=_tzcnt_u32(k); /* Count the number of trailing zero bits in k. BMI1 instruction set, Haswell or newer. */
do_operation(i);
k=_blsr_u32(k); /* Clear the lowest set bit in k. */
}
return 0;
}
/*
Output:
i = 4
i = 15
*/