Looking for an index of an element in array via SIMD. A fast way

Looking for an index of an element in array via SIMD. A fast way - bit-manipulation

I need to find an index/position of an 8-bit value element N in an array ARR via SIMD. It must be a fast fashion.
For now the algorithm is that I'd load 8-bit values of ARR into one SIMD register and a character code of N into other SIMD register.
Then I'd use negation and check which byte is successful with popcnt.
Is there a faster way?
The operations may be saturated used if needed.

Which instruction set/architecture are you using? That will somewhat impact the 'correct' answer to this question.
in SSE:
#include <immintrin.h>
#include <stdio.h>
int byteIndex(__m128i ARR, __m128i N)
{
__m128i cmp = _mm_cmpeq_epi8(ARR, N);
int mask = _mm_movemask_epi8(cmp);
return _tzcnt_u32(mask);
}
int main()
{
__m128i ARR = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
// test case that will work
__m128i N = _mm_set1_epi8(3);
printf("%d\n", byteIndex(ARR, N)); ///< prints '3'
// test case that will fail
__m128i F = _mm_set1_epi8(16);
printf("%d\n", byteIndex(ARR, F)); ///< prints '32'
return 1;
}

Related

SSE2 packed 8-bit integer signed multiply (high-half): Decomposing a m128i (16x8 bit) into two m128i (8x16 each) and repack

I'm trying to multiply two m128i byte per byte (8 bit signed integers).
The problem here is overflow. My solution is to store these 8 bit signed integers into 16 bit signed integers, multiply, then pack the whole thing into a m128i of 16 x 8 bit integers.
Here is the __m128i mulhi_epi8(__m128i a, __m128i b) emulation I made:
inline __m128i mulhi_epi8(__m128i a, __m128i b)
{
auto a_decomposed = decompose_epi8(a);
auto b_decomposed = decompose_epi8(b);
__m128i r1 = _mm_mullo_epi16(a_decomposed.first, b_decomposed.first);
__m128i r2 = _mm_mullo_epi16(a_decomposed.second, b_decomposed.second);
return _mm_packs_epi16(_mm_srai_epi16(r1, 8), _mm_srai_epi16(r2, 8));
}
decompose_epi8 is implemented in a non-simd way:
inline std::pair<__m128i, __m128i> decompose_epi8(__m128i input)
{
std::pair<__m128i, __m128i> result;
// result.first => should contain 8 shorts in [-128, 127] (8 first bytes of the input)
// result.second => should contain 8 shorts in [-128, 127] (8 last bytes of the input)
for (int i = 0; i < 8; ++i)
{
result.first.m128i_i16[i] = input.m128i_i8[i];
result.second.m128i_i16[i] = input.m128i_i8[i + 8];
}
return result;
}
This code works well. My goal now is to implement a simd version of this for loop. I looked at the Intel Intrinsics Guide but I can't find a way to do this. I guess shuffle could do the trick but I have trouble conceptualising this.

As you want to do signed multiplication, you need to sign-extend each byte to 16bit words, or move them into the upper half of each 16bit word. Since you pack the results back together afterwards, you can split the input into odd and even bytes, instead of the higher and lower half. Then sign-extension of the odd bytes can be done by arithmetically shifting all 16bit parts to the right You can extract the odd bytes by masking out the even bytes, and to get the even bytes, you can shift all 16bit parts to the left (both need to be multiplied by _mm_mulhi_epi16).
The following should work with SSE2:
__m128i mulhi_epi8(__m128i a, __m128i b)
{
__m128i mask = _mm_set1_epi16(0xff00);
// mask higher bytes:
__m128i a_hi = _mm_and_si128(a, mask);
__m128i b_hi = _mm_and_si128(b, mask);
__m128i r_hi = _mm_mulhi_epi16(a_hi, b_hi);
// mask out garbage in lower half:
r_hi = _mm_and_si128(r_hi, mask);
// shift lower bytes to upper half
__m128i a_lo = _mm_slli_epi16(a,8);
__m128i b_lo = _mm_slli_epi16(b,8);
__m128i r_lo = _mm_mulhi_epi16(a_lo, b_lo);
// shift result to the lower half:
r_lo = _mm_srli_epi16(r_lo,8);
// join result and return:
return _mm_or_si128(r_hi, r_lo);
}
Note: a previous version used shifts to sign-extend the odd bytes. On most Intel CPUs this would increase P0 usage (which needs to be used for multiplication as well). Bit-logic can operate on more ports, so this version should have better throughput.

SIMD: implement _mm256_max_epu64_ and _mm256_min_epu64_

I want to ask a question about SIMD.
I don't get the AVX512 in my CPU but want to have a _mm256_max_epu64.
How can we implement this function with AVX2?
Here I try to have my trivial one. Maybe we can let it be a discussion and improve that.
#define SIMD_INLINE inline __attribute__ ((always_inline))
SIMD_INLINE __m256i __my_mm256_max_epu64_(__m256i a, __m256i b) {
uint64_t *val_a = (uint64_t*) &a;
uint64_t *val_b = (uint64_t*) &b;
uint64_t e[4];
for (size_t i = 0; i < 4; ++i) e[i] = (*(val_a + i) > *(val_b + i)) ? *(val_a + i) : *(val_b + i);
return _mm256_set_epi64x(e[3], e[2], e[1], e[0]);
}
EDIT as a Summary:
We had a discussion about the __mm256 unsigned comparing. I gave my trivial implementation above just following the very basic concept: a single __m256i is just equivalent with 4 uint64_t or 4 float, which also make up 256 bits together.
Then we had the answer from #chtz, which makes more AVX sense with calling more bit programming functions from AVX.
At end it turns out these two implementation result in the same assembly thanks to CLang. Assembly example from compiler explorer
Another _mm256_min_epu64_ added. It is just mirroring the _mm256_max_epu64_ above. Make it easier to be searched for the future use.
SIMD_INLINE __m256i __my_mm256_min_epu64_(__m256i a, __m256i b) {
uint64_t *val_a = (uint64_t*) &a;
uint64_t *val_b = (uint64_t*) &b;
uint64_t e[4];
for (size_t i = 0; i < 4; ++i) e[i] = (*(val_a + i) < *(val_b + i)) ? *(val_a + i) : *(val_b + i);
return _mm256_set_epi64x(e[3], e[2], e[1], e[0]);
}

The simplest solution would be a combination of _mm256_cmpgt_epi64 with a blend.
However, if you want the unsigned maximum, you need to first subtract 1<<63 from each element (before comparison, not before blending).
There is no _mm256_blendv_epu64 instruction, but it is possible to use _mm256_blendv_epi8 since the mask will set at every bit of the relevant elements. Also note that subtracting the uppermost bit can be done by a slightly faster xor:
__m256i pmax_epu64(__m256i a, __m256i b)
{
__m256i signbit = _mm256_set1_epi64x(0x8000'0000'0000'0000);
__m256i mask = _mm256_cmpgt_epi64(_mm256_xor_si256(a,signbit),_mm256_xor_si256(b,signbit));
return _mm256_blendv_epi8(b,a,mask);
}
Actually, clang almost manages to get the same instructions from your code: https://godbolt.org/z/afhdOa
It only uses vblendvpd instead of vpblendvb, which may introduce latencies (see #PeterCordes comment for details).
With some bit-twiddeling you could actually save setting the register for the signbit.
An unsigned comparison gives the same result if the signs of both operands match and the opposite results if they don't match, i.e.
unsigned_greater_than(signed a, signed b) == (a<0) ^ (b<0) ^ (a>b)
This can be used if you use the _mm256_blendv_pd with some casting as a _mm256_blendv_epi64 (because now only the uppermost bit is valid):
__m256i _mm256_blendv_epi64(__m256i a, __m256i b, __m256i mask)
{
return _mm256_castpd_si256(_mm256_blendv_pd(
_mm256_castsi256_pd(a),_mm256_castsi256_pd(b),_mm256_castsi256_pd(mask)));
}
__m256i pmax_epu64_b(__m256i a, __m256i b)
{
__m256i opposite_sign = _mm256_xor_si256(a,b);
__m256i mask = _mm256_cmpgt_epi64(a,b);
return _mm256_blendv_epi64(b,a,_mm256_xor_si256(mask, opposite_sign));
}
Just for reference, a signed maximum is of course just:
__m256i pmax_epi64(__m256i a, __m256i b)
{
__m256i mask = _mm256_cmpgt_epi64(a,b);
return _mm256_blendv_epi8(b,a,mask);
}

How to vectorize range check during block copy?

I have the function below:
void CopyImageBitsWithAlphaRGBA(unsigned char *dest, const unsigned char *src, int w, int stride, int h,
unsigned char minredmask, unsigned char mingreenmask, unsigned char minbluemask, unsigned char maxredmask, unsigned char maxgreenmask, unsigned char maxbluemask)
{
auto pend = src + w * h * 4;
for (auto p = src; p < pend; p += 4, dest += 4)
{
dest[0] = p[0]; dest[1] = p[1]; dest[2] = p[2];
if ((p[0] >= minredmask && p[0] <= maxredmask) || (p[1] >= mingreenmask && p[1] <= maxgreenmask) || (p[2] >= minbluemask && p[2] <= maxbluemask))
dest[3] = 255;
else
dest[3] = 0;
}
}
What it does is it copies a 32 bit bitmap from one memory block to another, setting the alpha channel to fully transparent when the pixel color falls within a certain color range.
How do I make this use SSE/AVX in VC++ 2017? Right now it's not generating vectorized code. Failing an automatic way of doing it, what functions can I use to do this myself?
Because really, I'd imagine testing if bytes are in a range would be one of the most obviously useful operations possible, but I can't see any built in function to take care of it.

I don't think you're going to get a compiler to auto-vectorize as well as you can do by hand with Intel's intrinsics. (err, as well as I can do by hand anyway :P).
Possibly once we manually vectorize it, we can see how to hand-hold a compiler with scalar code that works that way, but we really need packed-compare into a 0/0xFF with byte elements, and it's hard to write something in C that compilers will auto-vectorize well. The default integer promotions mean that most C expressions actually produce 32-bit results, even when you use uint8_t, and that often tricks compilers into unpacking 8-bit to 32-bit elements, costing a lot of shuffles on top of the automatic factor of 4 throughput loss (fewer elements per register), like in #harold's small tweak to your source.
SSE/AVX (before AVX512) has signed comparisons for SIMD integer, not unsigned. But you can range-shift things to signed -128..127 by subtracting 128. XOR (add-without-carry) is slightly more efficient on some CPUs, so you actually just XOR with 0x80 to flip the high bit. But mathematically you're subtracting 128 from a 0..255 unsigned value, giving a -128..127 signed value.
It's even still possible to implement the "unsigned compare trick" of (x-min) < (max-min). (For example, detecting alphabetic ASCII characters). As a bonus, we can bake the range-shift into that subtract. If x<min, it wraps around and becomes a large value greater than max-min. This obviously works for unsigned, but it does in fact work (with a range-shifted max-min) with SSE/AVX2 signed-compare instructions. (A previous version of this answer claimed this trick only worked if max-min < 128, but that's not the case. x-min can't wrap all the way around and become lower than max-min, or get into that range if it started above max).
An earlier version of this answer had code that made the range exclusive, i.e. not including the ends, so you even redmin=0 / redmax=255 would exclude pixels with red=0 or red=255. But I solved that by comparing the other way (thanks to ideas from #Nejc's and #chtz's answers).
#chtz's idea of using a saturating add/sub instead of a compare is very cool. If you arrange things so saturation means in-range, it works for an inclusive range. (And you can set the Alpha component to a known value by choosing a min/max that makes all 256 possible inputs in-range). This lets us avoid range-shifting to signed, because unsigned-saturation is available
We can combine the sub/cmp range-check with the saturation trick to do sub (wraps on out-of-bounds low) / subs (only reaches zero if the first sub didn't wrap). Then we don't need an andnot or or to combine two separate checks on each component; we already have a 0 / non-zero result in one vector.
So it only takes two operations to give us a 32-bit value for the whole pixel that we can check. Iff all 3 RGB components are in-range, that element will have a specific value. (Because we've arranged for the Alpha component to already give a known value, too). If any of the 3 components are out-of-range, it will have some other value.
If you do this the other way, so saturation means out-of-range, then you have an exclusive range in that direction, because you can't choose a limit such that no value reaches 0 or reaches 255. You can always saturate the alpha component to give yourself a known value there, regardless of what it means for the RGB components. An exclusive range would let you abuse this function to be always-false by choosing a range that no pixel could ever match. (Or if there's a third condition, besides per-component min/max, then maybe you want an override).
The obvious thing would be to use a packed-compare instruction with 32-bit element size (_mm256_cmpeq_epi32 / vpcmpeqd) to generate a 0xFF or 0x00 (which we can apply / blend into the original RGB pixel value) for in/out of range.
// AVX2 core idea: wrapping-compare trick with saturation to achieve unsigned compare
__m256i tmp = _mm256_sub_epi8(src, min_values); // wraps to high unsigned if below min
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min); // unsigned saturation to 0 means in-range
__m256i new_alpha = _mm256_cmpeq_epi32(RGB_inrange, _mm256_setzero_si256());
// then blend the high byte of each element with RGB from the src vector
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, _mm256_set1_epi32(0x00FFFFFF)); // alpha from new_alpha, RGB from src
Note that an SSE2 version would only need one MOVDQA instructions to copy src; the same register is the destination for every instruction.
Also note that you could saturate the other direction: add then adds (with (256-max) and (256-(min-max)), I think) to saturate to 0xFF for in-range. This could be useful with AVX512BW if you use zero-masking with a fixed mask (e.g. for alpha) or variable mask (for some other condition) to exclude a component based on some other condition. AVX512BW zero-masking for the sub/subs version would consider components in-range even when they aren't, which could also be useful.
But extending that to AVX512 requires a different approach: AVX512 compares produce a bit-mask (in a mask register), not a vector, so we can't turn around and use the high byte of each 32-bit compare result separately.
Instead of cmpeq_epi32, we can produce the value we want in the high byte of each pixel using carry/borrow from a subtract, which propagates left to right.
0x00000000 - 1 = 0xFFFFFFFF # high byte = 0xFF = new alpha
0x00?????? - 1 = 0x00?????? # high byte = 0x00 = new alpha
Where ?????? has at least one non-zero bit, so it's a 32-bit number >=0 and <=0x00FFFFFFFF
Remember we choose an alpha range that makes the high byte always zero
i.e. _mm256_sub_epi32(RGB_inrange, _mm_set1_epi32(1)). We only need the high byte of each 32-bit element to have the alpha value we want, because we use a byte-blend to merge it with the source RGB values. For AVX512, this avoids a VPMOVM2D zmm1, k1 instruction to convert a compare result back into a vector of 0/-1, or (much more expensive) to interleave each mask bit with 3 zeros to use it for a byte-blend.
This sub instead of cmp has a minor advantage even for AVX2: sub_epi32 runs on more ports on Skylake (p0/p1/p5 vs. p0/p1 for pcmpgt/pcmpeq). On all other CPUs, vector integer add/sub run on the same ports as vector integer compare. (Agner Fog's instruction tables).
Also, if you compile _mm256_cmpeq_epi32() with -march=native on a CPU with AVX512, or otherwise enable AVX512 and then compile normal AVX2 intrinsics, some compilers will stupidly use AVX512 compare-into-mask and then expand back to a vector instead of just using the VEX-coded vpcmpeqd. Thus, we use sub instead of cmp even for the _mm256 intrinsics version, because I already spent the time to figure it out and show that it's at least as efficient in the normal case of compiling for regular AVX2. (Although _mm256_setzero_si256() is cheaper than set1(1); vpxor can zero a register cheaply instead of loading a constant, but this setup happens outside the loop.)
#include <immintrin.h>
#ifdef __AVX2__
// inclusive min and max
__m256i setAlphaFromRangeCheck_AVX2(__m256i src, __m256i mins, __m256i max_minus_min)
{
__m256i tmp = _mm256_sub_epi8(src, mins); // out-of-range wraps to a high signed value
// (x-min) <= (max-min) equivalent to:
// (x-min) - (max-min) saturates to zero
__m256i RGB_inrange = _mm256_subs_epu8(tmp, max_minus_min);
// 0x00000000 for in-range pixels, 0x00?????? (some higher value) otherwise
// this has minor advantages over compare against zero, see full comments on Godbolt
__m256i new_alpha = _mm256_sub_epi32(RGB_inrange, _mm256_set1_epi32(1));
// 0x00000000 - 1 = 0xFFFFFFFF
// 0x00?????? - 1 = 0x00?????? high byte = new alpha value
const __m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF); // blend mask
// without AVX512, the only byte-granularity blend is a 2-uop variable-blend with a control register
// On Ryzen, it's only 1c latency, so probably 1 uop that can only run on one port. (1c throughput).
// For 256-bit, that's 2 uops of course.
__m256i alpha_replaced = _mm256_blendv_epi8(new_alpha, src, RGB_mask); // RGB from src, 0/FF from new_alpha
return alpha_replaced;
}
#endif // __AVX2__
Set up vector args for this function and loop over your array with _mm256_load_si256 / _mm256_store_si256. (Or loadu/storeu if you can't guarantee alignment.)
This compiles very efficiently (Godbolt Compiler explorer) with gcc, clang, and MSVC. (AVX2 version on Godbolt is good, AVX512 and SSE versions are still a mess, not all the tricks applied to them yet.)
;; MSVC's inner loop from a caller that loops over an array with it:
;; see the Godbolt link
$LL4#:
vmovdqu ymm3, YMMWORD PTR [rdx+rax*4]
vpsubb ymm0, ymm3, ymm7
vpsubusb ymm1, ymm0, ymm6
vpsubd ymm2, ymm1, ymm5
vpblendvb ymm3, ymm2, ymm3, ymm4
vmovdqu YMMWORD PTR [rcx+rax*4], ymm3
add eax, 8
cmp eax, r8d
jb SHORT $LL4#
So MSVC managed to hoist the constant setup after inlining. We get similar loops from gcc/clang.
The loop has 4 vector ALU instructions, one of which takes 2 uops. Total 5 vector ALU uops. But total fused-domain uops on Haswell/Skylake = 9 with no unrolling, so with luck this can run at 32 bytes (1 vector) per 2.25 clock cycles. It could come close to actually achieving that with data hot in L1d or L2 cache, but L3 or memory would be a bottleneck. With unrolling, it could maybe bottlenck on L2 cache bandwidth.
An AVX512 version (also included in the Godbolt link), only needs 1 uop to blend, and could run faster in vectors per cycle, thus more than twice as fast using 512-byte vectors.

This is one possible way to make this function work with SSE instructions. I used SSE instead of AVX because I wanted to keep the answer simple. Once you understand how the solution works, rewriting the function with AVX intrinsics should not be much of a problem though.
EDIT: please note that my approach is very similar to one by PeterCordes, but his code should be faster because he uses AVX. If you want to rewrite the function below with AVX intrinsics, change step value to 8.
void CopyImageBitsWithAlphaRGBA(
unsigned char *dest,
const unsigned char *src, int w, int stride, int h,
unsigned char minred, unsigned char mingre, unsigned char minblu,
unsigned char maxred, unsigned char maxgre, unsigned char maxblu)
{
char low = 0x80; // -128
char high = 0x7f; // 127
char mnr = *(char*)(&minred) - low;
char mng = *(char*)(&mingre) - low;
char mnb = *(char*)(&minblu) - low;
int32_t lowest = mnr | (mng << 8) | (mnb << 16) | (low << 24);
char mxr = *(char*)(&maxred) - low;
char mxg = *(char*)(&maxgre) - low;
char mxb = *(char*)(&maxblu) - low;
int32_t highest = mxr | (mxg << 8) | (mxb << 16) | (high << 24);
// SSE
int step = 4;
int sse_width = (w / step)*step;
for (int y = 0; y < h; ++y)
{
for (int x = 0; x < w; x += step)
{
if (x == sse_width)
{
x = w - step;
}
int ptr_offset = y * stride + x;
const unsigned char* src_ptr = src + ptr_offset;
unsigned char* dst_ptr = dest + ptr_offset;
__m128i loaded = _mm_loadu_si128((__m128i*)src_ptr);
// subtract 128 from every 8-bit int
__m128i subtracted = _mm_sub_epi8(loaded, _mm_set1_epi8(low));
// greater than top limit?
__m128i masks_hi = _mm_cmpgt_epi8(subtracted, _mm_set1_epi32(highest));
// lower that bottom limit?
__m128i masks_lo = _mm_cmplt_epi8(subtracted, _mm_set1_epi32(lowest));
// perform OR operation on both masks
__m128i combined = _mm_or_si128(masks_hi, masks_lo);
// are 32-bit integers equal to zero?
__m128i eqzer = _mm_cmpeq_epi32(combined, _mm_setzero_si128());
__m128i shifted = _mm_slli_epi32(eqzer, 24);
// EDIT: fixed a bug:
__m128 alpha_unmasked = _mm_and_si128(loaded, _mm_set1_epi32(0x00ffffff));
__m128i combined = _mm_or_si128(alpha_unmasked, shifted);
_mm_storeu_si128((__m128i*)dst_ptr, combined);
}
}
}
EDIT: as #PeterCordes stated in the comments, the code included a bug that is now fixed.

Based on #PeterCordes solution, but replacing the shift+compare by saturated subtract and adding:
// mins_compl shall be [255-minR, 255-minG, 255-minB, 0]
// maxs shall be [maxR, maxG, maxB, 0]
__m256i setAlphaFromRangeCheck(__m256i src, __m256i mins_compl, __m256i maxs)
{
__m256i in_lo = _mm256_adds_epu8(src, mins_compl); // is 255 iff src+mins_coml>=255, i.e. src>=mins
__m256i in_hi = _mm256_subs_epu8(src, maxs); // is 0 iff src - maxs <= 0, i.e., src <= maxs
__m256i inbounds_components = _mm256_andnot_si256(in_hi, in_lo);
// per-component mask, 0xff, iff (mins<=src && src<=maxs).
// alpha-channel is always (~src & src) == 0
// Use a 32-bit element compare to check that all 3 components are in-range
__m256i RGB_mask = _mm256_set1_epi32(0x00FFFFFF);
__m256i inbounds = _mm256_cmpeq_epi32(inbounds_components, RGB_mask);
__m256i new_alpha = _mm256_slli_epi32(inbounds, 24);
// alternatively _mm256_andnot_si256(RGB_mask, inbounds) ?
// byte blends (vpblendvb) are at least 2 uops, and Haswell requires port5
// instead clear alpha and then OR in the new alpha (0 or 0xFF)
__m256i alphacleared = _mm256_and_si256(src, RGB_mask); // off the critical path
__m256i new_alpha_applied = _mm256_or_si256(alphacleared, new_alpha);
return new_alpha_applied;
}
This saves on vpxor (no modification of src required) and one vpand (the alpha-channel is automatically 0 -- I guess that would be possible with Peter's solution as well by choosing the boundaries accordingly).
Godbolt-Link, apparently, neither gcc nor clang think it is worthwhile to re-use RGB_mask for both usages ...
Simple testing with SSE2 variant: https://wandbox.org/permlink/eVzFHljxfTX5HDcq (you can play around with the source and the boundaries)

How would you transpose a binary matrix?

I have binary matrices in C++ that I repesent with a vector of 8-bit values.
For example, the following matrix:
1 0 1 0 1 0 1
0 1 1 0 0 1 1
0 0 0 1 1 1 1
is represented as:
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
The reason why I'm doing it this way is because then computing the product of such a matrix and a 8-bit vector becomes really simple and efficient (just one bitwise AND and a parity computation, per row), which is much better than calculating each bit individually.
I'm now looking for an efficient way to transpose such a matrix, but I haven't been able to figure out how to do it without having to manually calculate each bit.
Just to clarify, for the above example, I'd like to get the following result from the transposition:
const uint8_t transposed[] = {
0b00000000,
0b00000100,
0b00000010,
0b00000110,
0b00000001,
0b00000101,
0b00000011,
0b00000111,
};
NOTE: I would prefer an algorithm that can calculate this with arbitrary-sized matrices but am also interested in algorithms that can only handle certain sizes.

I've spent more time looking for a solution, and I've found some good ones.
The SSE2 way
On a modern x86 CPU, transposing a binary matrix can be done very efficiently with SSE2 instructions. Using such instructions it is possible to process a 16×8 matrix.
This solution is inspired by this blog post by mischasan and is vastly superior to every suggestion I've got so far to this question.
The idea is simple:
#include <emmintrin.h>
Pack 16 uint8_t variables into an __m128i
Use _mm_movemask_epi8 to get the MSBs of each byte, producing an uint16_t
Use _mm_slli_epi64 to shift the 128-bit register by one
Repeat until you've got all 8 uint16_ts
A generic 32-bit solution
Unfortunately, I also need to make this work on ARM. After implementing the SSE2 version, it would be easy to just just find the NEON equivalents, but the Cortex-M CPU, (contrary to the Cortex-A) does not have SIMD capabilities, so NEON isn't too useful for me at the moment.
NOTE: Because the Cortex-M doesn't have native 64-bit arithmetics, I could not use the ideas in any answers that suggest to do it by treating a 8x8 block as an uint64_t. Most microcontrollers that have a Cortex-M CPU also don't have too much memory so I prefer to do all this without a lookup table.
After some thinking, the same algorithm can be implemented using plain 32-bit arithmetics and some clever coding. This way, I can work with 4×8 blocks at a time. It was suggested by a collegaue and the magic lies in the way 32-bit multiplication works: you can find a 32-bit number with which you can multiply and then the MSB of each byte gets next to each other in the upper 32 bits of the result.
Pack 4 uint8_ts in a 32-bit variable
Mask the 1st bit of each byte (using 0x80808080)
Multiply it with 0x02040810
Take the 4 LSBs of the upper 32 bits of the multiplication
Generally, you can mask the Nth bit in each byte (shift the mask right by N bits) and multiply with the magic number, shifted left by N bits. The advantage here is that if your compiler is smart enough to unroll the loop, both the mask and the 'magic number' become compile-time constants so shifting them does not incur any performance penalty whatsoever. There's some trouble with the last series of 4 bits, because then one LSB is lost, so in that case I needed to shift the input left by 8 bits and use the same method as the first series of 4-bits.
If you do this with two 4×8 blocks, then you can get an 8x8 block done and arrange the resulting bits so that everything goes into the right place.

My suggestion is that, you don't do the transposition, rather you add one bit information to your matrix data, indicating whether the matrix is transposed or not.
Now, if you want to multiply a transposd matrix with a vector, it will be the same as multiplying the matrix on the left by the vector (and then transpose). This is easy: just some xor operations of your 8-bit numbers.
This however makes some other operations complicated (e.g. adding two matrices). But in the comment you say that multiplication is exactly what you want to optimize.

Here is the text of Jay Foad's email to me regarding fast Boolean matrix
transpose:
The heart of the Boolean transpose algorithm is a function I'll call transpose8x8 which transposes an 8x8 Boolean matrix packed in a 64-bit word (in row major order from MSB to LSB). To transpose any rectangular matrix whose width and height are multiples of 8, break it down into 8x8 blocks, transpose each one individually and store them at the appropriate place in the output. To load an 8x8 block you have to load 8 individual bytes and shift and OR them into a 64-bit word. Same kinda thing for storing.
A plain C implementation of transpose8x8 relies on the fact that all the bits on any diagonal line parallel to the leading diagonal move the same distance up/down and left/right. For example, all the bits just above the leading diagonal have to move one place left and one place down, i.e. 7 bits to the right in the packed 64-bit word. This leads to an algorithm like this:
transpose8x8(word) {
return
(word & 0x0100000000000000) >> 49 // top right corner
| (word & 0x0201000000000000) >> 42
| ...
| (word & 0x4020100804020100) >> 7 // just above diagonal
| (word & 0x8040201008040201) // leading diagonal
| (word & 0x0080402010080402) << 7 // just below diagonal
| ...
| (word & 0x0000000000008040) << 42
| (word & 0x0000000000000080) << 49; // bottom left corner
}
This runs about 10x faster than the previous implementation, which copied each bit individually from the source byte in memory and merged it into the destination byte in memory.
Alternatively, if you have PDEP and PEXT instructions you can implement a perfect shuffle, and use that to do the transpose as mentioned in Hacker's Delight. This is significantly faster (but I don't have timings handy):
shuffle(word) {
return pdep(word >> 32, 0xaaaaaaaaaaaaaaaa) | pdep(word, 0x5555555555555555);
} // outer perfect shuffle
transpose8x8(word) { return shuffle(shuffle(shuffle(word))); }
POWER's vgbbd instruction effectively implements the whole of transpose8x8 in a single instruction (and since it's a 128-bit vector instruction it does it twice, independently, on the low 64 bits and the high 64 bits). This gave about 15% speed-up over the plain C implementation. (Only 15% because, although the bit twiddling is much faster, the overall run time is now dominated by the time it takes to load 8 bytes and assemble them into the argument to transpose8x8, and to take the result and store it as 8 separate bytes.)

My suggestion would be to use a lookup table to speed up the processing.
Another thing to note is with the current definition of your matrix the maximum size will be 8x8 bits. This fits into a uint64_t so we can use this to our advantage especially when using a 64-bit platform.
I have worked out a simple example using a lookup table which you can find below and run using: http://www.tutorialspoint.com/compile_cpp11_online.php online compiler.
Example code
#include <iostream>
#include <bitset>
#include <stdint.h>
#include <assert.h>
using std::cout;
using std::endl;
using std::bitset;
/* Static lookup table */
static uint64_t lut[256];
/* Helper function to print array */
template<int N>
void print_arr(const uint8_t (&arr)[N]){
for(int i=0; i < N; ++i){
cout << bitset<8>(arr[i]) << endl;
}
}
/* Transpose function */
template<int N>
void transpose_bitmatrix(const uint8_t (&matrix)[N], uint8_t (&transposed)[8]){
assert(N <= 8);
uint64_t value = 0;
for(int i=0; i < N; ++i){
value = (value << 1) + lut[matrix[i]];
}
/* Ensure safe copy to prevent misalignment issues */
/* Can be removed if input array can be treated as uint64_t directly */
for(int i=0; i < 8; ++i){
transposed[i] = (value >> (i * 8)) & 0xFF;
}
}
/* Calculate lookup table */
void calculate_lut(void){
/* For all byte values */
for(uint64_t i = 0; i < 256; ++i){
auto b = std::bitset<8>(i);
auto v = std::bitset<64>(0);
/* For all bits in current byte */
for(int bit=0; bit < 8; ++bit){
if(b.test(bit)){
v.set((7 - bit) * 8);
}
}
lut[i] = v.to_ullong();
}
}
int main()
{
calculate_lut();
const uint8_t matrix[] = {
0b01010101,
0b00110011,
0b00001111,
};
uint8_t transposed[8];
transpose_bitmatrix(matrix, transposed);
print_arr(transposed);
return 0;
}
How it works
your 3x8 matrix will be transposed to a 8x3 matrix, represented in an 8x8 array.
The issue is that you want to convert bits, your "horizontal" representation to a vertical one, divided over several bytes.
As I mentioned above, we can take advantage of the fact that the output (8x8) will always fit into a uint64_t. We will use this to our advantage because now we can use an uint64_t to write the 8 byte array, but we can also use it for to add, xor, etc. because we can perform basic arithmetic operations on a 64 bit integer.
Each entry in your 3x8 matrix (input) is 8 bits wide, to optimize processing we first generate 256 entry lookup table (for each byte value). The entry itself is a uint64_t and will contain a rotated version of the bits.
example:
byte = 0b01001111 = 0x4F
lut[0x4F] = 0x0001000001010101 = (uint8_t[]){ 0, 1, 0, 0, 1, 1, 1, 1 }
Now for the calculation:
For the calculations we use the uint64_t but keep in mind that under water it will represent a uint8_t[8] array. We simple shift the current value (start with 0), look up our first byte and add it to the current value.
The 'magic' here is that each byte of the uint64_t in the lookup table will either be 1 or 0 so it will only set the least significant bit (of each byte). Shifting the uint64_t will shift each byte, as long as we make sure we do not do this more than 8 times! we can do operations on each byte individually.
Issues
As someone noted in the comments: Translate(Translate(M)) != M so if you need this you need some additional work.
Perfomance can be improved by directly mapping uint64_t's instead of uint8_t[8] arrays since it omits a "safe-copy" to prevent alignment issues.

I have added a new awnser instead of editing my original one to make this more visible (no comment rights unfortunatly).
In your own awnser you add an additional requirement not present in the first one: It has to work on ARM Cortex-M
I did come up with an alternative solution for ARM in my original awnser but omitted it as it was not part of the question and seemed off topic (mostly because of the C++ tag).
ARM Specific solution Cortex-M:
Some or most Cortex-M 3/4 have a bit banding region which can be used for exactly what you need, it expands bits into 32-bit fields, this region can be used to perform atomic bit operations.
If you put your array in a bitbanded region it will have an 'exploded' mirror in the bitband region where you can just use move operations on the bits itself. If you make a loop the compiler will surely be able to unroll and optimize to just move operations.
If you really want to, you can even setup a DMA controller to process an entire batch of transpose operations with a bit of effort and offload it entirely from the cpu :)
Perhaps this might still help you.

This is a bit late, but I just stumbled across this interchange today.
If you look at Hacker's Delight, 2nd Edition,there are several algorithms for efficiently transposing Boolean arrays, starting on page 141.
They are quite efficient: a colleague of mine obtained a factor about 10X
speedup compared to naive coding, on an X86.

Here's what I posted on gitub (mischasan/sse2/ssebmx.src)
Changing INP() and OUT() to use induction vars saves an IMUL each.
AVX256 does it twice as fast.
AVX512 is not an option, because there is no _mm512_movemask_epi8().
#include <stdint.h>
#include <emmintrin.h>
#define INP(x,y) inp[(x)*ncols/8 + (y)/8]
#define OUT(x,y) out[(y)*nrows/8 + (x)/8]
void ssebmx(char const *inp, char *out, int nrows, int ncols)
{
int rr, cc, i, h;
union { __m128i x; uint8_t b[16]; } tmp;
// Do the main body in [16 x 8] blocks:
for (rr = 0; rr <= nrows - 16; rr += 16)
for (cc = 0; cc < ncols; cc += 8) {
for (i = 0; i < 16; ++i)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
*(uint16_t*)&OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
if (rr == nrows) return;
// The remainder is a row of [8 x 16]* [8 x 8]?
// Do the [8 x 16] blocks:
for (cc = 0; cc <= ncols - 16; cc += 16) {
for (i = 8; i--;)
tmp.b[i] = h = *(uint16_t const*)&INP(rr + i, cc),
tmp.b[i + 8] = h >> 8;
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = h = _mm_movemask_epi8(tmp.x),
OUT(rr, cc + i + 8) = h >> 8;
}
if (cc == ncols) return;
// Do the remaining [8 x 8] block:
for (i = 8; i--;)
tmp.b[i] = INP(rr + i, cc);
for (i = 8; i--; tmp.x = _mm_slli_epi64(tmp.x, 1))
OUT(rr, cc + i) = _mm_movemask_epi8(tmp.x);
}
HTH.

Inspired by Roberts answer, polynomial multiplication in Arm Neon can be utilised to scatter the bits --
inline poly8x16_t mull_lo(poly8x16_t a) {
auto b = vget_low_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
inline poly8x16_t mull_hi(poly8x16_t a) {
auto b = vget_high_p8(a);
return vreinterpretq_p8_p16(vmull_p8(b,b));
}
auto a = mull_lo(word);
auto b = mull_lo(a), c = mull_hi(a);
auto d = mull_lo(b), e = mull_hi(b);
auto f = mull_lo(c), g = mull_hi(c);
Then the vsli can be used to combine the bits pairwise.
auto ab = vsli_p8(vget_high_p8(d), vget_low_p8(d), 1);
auto cd = vsli_p8(vget_high_p8(e), vget_low_p8(e), 1);
auto ef = vsli_p8(vget_high_p8(f), vget_low_p8(f), 1);
auto gh = vsli_p8(vget_high_p8(g), vget_low_p8(g), 1);
auto abcd = vsli_p8(ab, cd, 2);
auto efgh = vsli_p8(ef, gh, 2);
return vsli_p8(abcd, efgh, 4);
Clang optimizes this code to avoid vmull2 instructions, using heavily ext q0,q0,8 to vget_high_p8.
An iterative approach would possibly be not only faster, but also uses less registers and also simdifies for 2x or more throughput.
// transpose bits in 2x2 blocks, first 4 rows
// x = a b|c d|e f|g h a i|c k|e m|g o | byte 0
// i j|k l|m n|o p b j|d l|f n|h p | byte 1
// q r|s t|u v|w x q A|s C|u E|w G | byte 2
// A B|C D|E F|G H r B|t D|v F|h H | byte 3 ...
// ----------------------
auto a = (x & 0x00aa00aa00aa00aaull);
auto b = (x & 0x5500550055005500ull);
auto c = (x & 0xaa55aa55aa55aa55ull) | (a << 7) | (b >> 7);
// transpose 2x2 blocks (first 4 rows shown)
// aa bb cc dd aa ii cc kk
// ee ff gg hh -> ee mm gg oo
// ii jj kk ll bb jj dd ll
// mm nn oo pp ff nn hh pp
auto d = (c & 0x0000cccc0000ccccull);
auto e = (c & 0x3333000033330000ull);
auto f = (c & 0xcccc3333cccc3333ull) | (d << 14) | (e >> 14);
// Final transpose of 4x4 bit blocks
auto g = (f & 0x00000000f0f0f0f0ull);
auto h = (f & 0x0f0f0f0f00000000ull);
x = (f & 0xf0f0f0f00f0f0f0full) | (g << 28) | (h >> 28);
In ARM each step can now be composed with 3 instructions:
auto tmp = vrev16_u8(x);
tmp = vshl_u8(tmp, plus_minus_1); // 0xff01ff01ff01ff01ull
x = vbsl_u8(mask_1, x, tmp); // 0xaa55aa55aa55aa55ull
tmp = vrev32_u16(x);
tmp = vshl_u16(tmp, plus_minus_2); // 0xfefe0202fefe0202ull
x = vbsl_u8(mask_2, x, tmp); // 0xcccc3333cccc3333ull
tmp = vrev64_u32(x);
tmp = vshl_u32(tmp, plus_minus_4); // 0xfcfcfcfc04040404ull
x = vbsl_u8(mask_4, x, tmp); // 0xf0f0f0f00f0f0f0full

Sparse array compression using SIMD (AVX2)

I have a sparse array a (mostly zeroes):
unsigned char a[1000000];
and I would like to create an array b of indexes to non-zero elements of a using SIMD instructions on Intel x64 architecture with AVX2. I'm looking for tips how to do it efficiently. Specifically, are there SIMD instruction(s) to get positions of consecutive non-zero elements in SIMD register, arranged contiguously?

Five methods to compute the indices of the nonzeros are:
Semi vectorized loop: Load a SIMD vector with chars, compare with zero and apply a movemask. Use a small scalar loop if any of the chars is nonzero
(also suggested by #stgatilov). This works well for very sparse arrays. Function arr2ind_movmsk in the code below uses BMI1 instructions
for the scalar loop.
Vectorized loop: Intel Haswell processors and newer support the BMI1 and BMI2 instruction sets. BMI2 contains
the pext instruction (Parallel bits extract, see wikipedia link),
which turns out to be useful here. See arr2ind_pext in the code below.
Classic scalar loop with if statement: arr2ind_if.
Scalar loop without branches: arr2ind_cmov.
Lookup table: #stgatilov shows that it is possible to use a lookup table instead of the pdep and other integer
instructions. This might work well, however, the lookup table is quite large: it doesn't fit in the L1 cache.
Not tested here. See also the discussion here.
/*
gcc -O3 -Wall -m64 -mavx2 -fopenmp -march=broadwell -std=c99 -falign-loops=16 sprs_char2ind.c
example: Test different methods with an array a of size 20000 and approximate 25/1024*100%=2.4% nonzeros:
./a.out 20000 25
*/
#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>
#include <omp.h>
#include <string.h>
__attribute__ ((noinline)) int arr2ind_movmsk(const unsigned char * restrict a, int n, int * restrict ind, int * m){
int i, m0, k;
__m256i msk;
m0=0;
for (i=0;i<n;i=i+32){ /* Load 32 bytes and compare with zero: */
msk=_mm256_cmpeq_epi8(_mm256_load_si256((__m256i *)&a[i]),_mm256_setzero_si256());
k=_mm256_movemask_epi8(msk);
k=~k; /* Search for nonzero bits instead of zero bits. */
while (k){
ind[m0]=i+_tzcnt_u32(k); /* Count the number of trailing zero bits in k. */
m0++;
k=_blsr_u32(k); /* Clear the lowest set bit in k. */
}
}
*m=m0;
return 0;
}
__attribute__ ((noinline)) int arr2ind_pext(const unsigned char * restrict a, int n, int * restrict ind, int * m){
int i, m0;
uint64_t cntr_const = 0xFEDCBA9876543210;
__m256i shft = _mm256_set_epi64x(0x04,0x00,0x04,0x00);
__m256i vmsk = _mm256_set1_epi8(0x0F);
__m256i cnst16 = _mm256_set1_epi32(16);
__m256i shf_lo = _mm256_set_epi8(0x80,0x80,0x80,0x0B, 0x80,0x80,0x80,0x03, 0x80,0x80,0x80,0x0A, 0x80,0x80,0x80,0x02,
0x80,0x80,0x80,0x09, 0x80,0x80,0x80,0x01, 0x80,0x80,0x80,0x08, 0x80,0x80,0x80,0x00);
__m256i shf_hi = _mm256_set_epi8(0x80,0x80,0x80,0x0F, 0x80,0x80,0x80,0x07, 0x80,0x80,0x80,0x0E, 0x80,0x80,0x80,0x06,
0x80,0x80,0x80,0x0D, 0x80,0x80,0x80,0x05, 0x80,0x80,0x80,0x0C, 0x80,0x80,0x80,0x04);
__m128i pshufbcnst = _mm_set_epi8(0x80,0x80,0x80,0x80,0x80,0x80,0x80,0x80, 0x0E,0x0C,0x0A,0x08,0x06,0x04,0x02,0x00);
__m256i i_vec = _mm256_setzero_si256();
m0=0;
for (i=0;i<n;i=i+16){
__m128i v = _mm_load_si128((__m128i *)&a[i]); /* Load 16 bytes. */
__m128i msk = _mm_cmpeq_epi8(v,_mm_setzero_si128()); /* Generate 16x8 bit mask. */
msk = _mm_srli_epi64(msk,4); /* Pack 16x8 bit mask to 16x4 bit mask. */
msk = _mm_shuffle_epi8(msk,pshufbcnst); /* Pack 16x8 bit mask to 16x4 bit mask. */
msk = _mm_xor_si128(msk,_mm_set1_epi32(-1)); /* Invert 16x4 mask. */
uint64_t msk64 = _mm_cvtsi128_si64x(msk); /* _mm_popcnt_u64 and _pext_u64 work on 64-bit general-purpose registers, not on simd registers.*/
int p = _mm_popcnt_u64(msk64)>>2; /* p is the number of nonzeros in 16 bytes of a. */
uint64_t cntr = _pext_u64(cntr_const,msk64); /* parallel bits extract. cntr contains p 4-bit integers. The 16 4-bit integers in cntr_const are shuffled to the p 4-bit integers that we want */
/* The next 7 intrinsics unpack these p 4-bit integers to p 32-bit integers. */
__m256i cntr256 = _mm256_set1_epi64x(cntr);
cntr256 = _mm256_srlv_epi64(cntr256,shft);
cntr256 = _mm256_and_si256(cntr256,vmsk);
__m256i cntr256_lo = _mm256_shuffle_epi8(cntr256,shf_lo);
__m256i cntr256_hi = _mm256_shuffle_epi8(cntr256,shf_hi);
cntr256_lo = _mm256_add_epi32(i_vec,cntr256_lo);
cntr256_hi = _mm256_add_epi32(i_vec,cntr256_hi);
_mm256_storeu_si256((__m256i *)&ind[m0],cntr256_lo); /* Note that the stores of iteration i and i+16 may overlap. */
_mm256_storeu_si256((__m256i *)&ind[m0+8],cntr256_hi); /* Array ind has to be large enough to avoid segfaults. At most 16 integers are written more than strictly necessary */
m0 = m0+p;
i_vec = _mm256_add_epi32(i_vec,cnst16);
}
*m=m0;
return 0;
}
__attribute__ ((noinline)) int arr2ind_if(const unsigned char * restrict a, int n, int * restrict ind, int * m){
int i, m0;
m0=0;
for (i=0;i<n;i++){
if (a[i]!=0){
ind[m0]=i;
m0=m0+1;
}
}
*m=m0;
return 0;
}
__attribute__((noinline)) int arr2ind_cmov(const unsigned char * restrict a, int n, int * restrict ind, int * m){
int i, m0;
m0=0;
for (i=0;i<n;i++){
ind[m0]=i;
m0=(a[i]==0)? m0 : m0+1; /* Compiles to cmov instruction. */
}
*m=m0;
return 0;
}
__attribute__ ((noinline)) int print_nonz(const unsigned char * restrict a, const int * restrict ind, const int m){
int i;
for (i=0;i<m;i++) printf("i=%d, ind[i]=%d a[ind[i]]=%u\n",i,ind[i],a[ind[i]]);
printf("\n"); fflush( stdout );
return 0;
}
__attribute__ ((noinline)) int print_chk(const unsigned char * restrict a, const int * restrict ind, const int m){
int i; /* Compute a hash to compare the results of different methods. */
unsigned int chk=0;
for (i=0;i<m;i++){
chk=((chk<<1)|(chk>>31))^(ind[i]);
}
printf("chk = %10X\n",chk);
return 0;
}
int main(int argc, char **argv){
int n, i, m;
unsigned int j, k, d;
unsigned char *a;
int *ind;
double t0,t1;
int meth, nrep;
char txt[30];
sscanf(argv[1],"%d",&n); /* Length of array a. */
n=n>>5; /* Adjust n to a multiple of 32. */
n=n<<5;
sscanf(argv[2],"%u",&d); /* The approximate fraction of nonzeros in a is: d/1024 */
printf("n=%d, d=%u\n",n,d);
a=_mm_malloc(n*sizeof(char),32);
ind=_mm_malloc(n*sizeof(int),32);
/* Generate a pseudo random array a. */
j=73659343;
for (i=0;i<n;i++){
j=j*653+1;
k=(j & 0x3FF00)>>8; /* k is a pseudo random number between 0 and 1023 */
if (k<d){
a[i] = (j&0xFE)+1; /* Set a[i] to nonzero. */
}else{
a[i] = 0;
}
}
/* for (i=0;i<n;i++){if (a[i]!=0){printf("i=%d, a[i]=%u\n",i,a[i]);}} printf("\n"); */ /* Uncomment this line to print the nonzeros in a. */
char txt0[]="arr2ind_movmsk: ";
char txt1[]="arr2ind_pext: ";
char txt2[]="arr2ind_if: ";
char txt3[]="arr2ind_cmov: ";
nrep=10000; /* Repeat a function nrep times to make relatively accurate timings possible. */
/* With nrep=1000000: ./a.out 10016 4 ; ./a.out 10016 48 ; ./a.out 10016 519 */
/* With nrep=10000: ./a.out 1000000 5 ; ./a.out 1000000 52 ; ./a.out 1000000 513 */
printf("nrep = \%d \n\n",nrep);
arr2ind_movmsk(a,n,ind,&m); /* Make sure that the arrays a and ind are read and/or written at least one time before benchmarking. */
for (meth=0;meth<4;meth++){
t0=omp_get_wtime();
switch (meth){
case 0: for(i=0;i<nrep;i++) arr2ind_movmsk(a,n,ind,&m); strcpy(txt,txt0); break;
case 1: for(i=0;i<nrep;i++) arr2ind_pext(a,n,ind,&m); strcpy(txt,txt1); break;
case 2: for(i=0;i<nrep;i++) arr2ind_if(a,n,ind,&m); strcpy(txt,txt2); break;
case 3: for(i=0;i<nrep;i++) arr2ind_cmov(a,n,ind,&m); strcpy(txt,txt3); break;
default: ;
}
t1=omp_get_wtime();
printf("method = %s ",txt);
/* print_chk(a,ind,m); */
printf(" elapsed time = %6.2f\n",t1-t0);
}
print_nonz(a, ind, 2); /* Do something with the results */
printf("density = %f %% \n\n",((double)m)/((double)n)*100); /* Actual nonzero density of array a. */
/* print_nonz(a, ind, m); */ /* Uncomment this line to print the indices of the nonzeros. */
return 0;
}
/*
With nrep=1000000:
./a.out 10016 4 ; ./a.out 10016 4 ; ./a.out 10016 48 ; ./a.out 10016 48 ; ./a.out 10016 519 ; ./a.out 10016 519
With nrep=10000:
./a.out 1000000 5 ; ./a.out 1000000 5 ; ./a.out 1000000 52 ; ./a.out 1000000 52 ; ./a.out 1000000 513 ; ./a.out 1000000 513
*/
The code was tested with array size of n=10016 (the data fits in L1 cache) and n=1000000, with
different nonzero densities of about 0.5%, 5% and 50%. For accurate timing the functions were called 1000000
and 10000 times, respectively.
Time in seconds, size n=10016, 1e6 function calls. Intel core i5-6500
0.53% 5.1% 50.0%
arr2ind_movmsk: 0.27 0.53 4.89
arr2ind_pext: 1.44 1.59 1.45
arr2ind_if: 5.93 8.95 33.82
arr2ind_cmov: 6.82 6.83 6.82
Time in seconds, size n=1000000, 1e4 function calls.
0.49% 5.1% 50.1%
arr2ind_movmsk: 0.57 2.03 5.37
arr2ind_pext: 1.47 1.47 1.46
arr2ind_if: 5.88 8.98 38.59
arr2ind_cmov: 6.82 6.81 6.81
In these examples the vectorized loops are faster than the scalar loops.
The performance of arr2ind_movmsk depends a lot on the density of a. It is only
faster than arr2ind_pext if the density is sufficiently small. The break-even point also depends on the array size n.
Function 'arr2ind_if' clearly suffers from failing branch prediction at 50% nonzero density.

If you expect number of nonzero elements to be very low (i.e. much less than 1%), then you can simply check each 16-byte chunk for being nonzero:
int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(reg, _mm_setzero_si128());
if (mask != 65535) {
//store zero bits of mask with scalar code
}
If percentage of good elements is sufficiently small, the cost of mispredicted branches and the cost of slow scalar code inside 'if' would be negligible.
As for a good general solution, first consider SSE implementation of stream compaction. It removes all zero elements from byte array (idea taken from here):
__m128i shuf [65536]; //must be precomputed
char cnt [65536]; //must be precomputed
int compress(const char *src, int len, char *dst) {
char *ptr = dst;
for (int i = 0; i < len; i += 16) {
__m128i reg = _mm_load_si128((__m128i*)&src[i]);
__m128i zeroMask = _mm_cmpeq_epi8(reg, _mm_setzero_si128());
int mask = _mm_movemask_epi8(zeroMask);
__m128i compressed = _mm_shuffle_epi8(reg, shuf[mask]);
_mm_storeu_si128((__m128i*)ptr, compressed);
ptr += cnt[mask]; //alternative: ptr += 16-_mm_popcnt_u32(mask);
}
return ptr - dst;
}
As you see, (_mm_shuffle_epi8 + lookup table) can do wonders. I don't know any other way of vectorizing structurally complex code like stream compaction.
Now the only remaining problem with your request is that you want to get indices. Each index must be stored in 4-byte value, so a chunk of 16 input bytes may produce up to 64 bytes of output, which do not fit into single SSE register.
One way to handle this is to honestly unpack the output to 64 bytes. So you replace reg with constant (0,1,2,3,4,...,15) in the code, then unpack the SSE register into 4 registers, and add a register with four i values. This would take much more instructions: 6 unpack instructions, 4 adds, and 3 stores (one is already there). As for me, that is a huge overhead, especially if you expect less than 25% of nonzero elements.
Alternatively, you can limit the number of nonzero bytes processed by single loop iteration by 4, so that one register is always enough for output.
Here is the sample code:
__m128i shufMask [65536]; //must be precomputed
char srcMove [65536]; //must be precomputed
char dstMove [65536]; //must be precomputed
int compress_ids(const char *src, int len, int *dst) {
const char *ptrSrc = src;
int *ptrDst = dst;
__m128i offsets = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
__m128i base = _mm_setzero_si128();
while (ptrSrc < src + len) {
__m128i reg = _mm_loadu_si128((__m128i*)ptrSrc);
__m128i zeroMask = _mm_cmpeq_epi8(reg, _mm_setzero_si128());
int mask = _mm_movemask_epi8(zeroMask);
__m128i ids8 = _mm_shuffle_epi8(offsets, shufMask[mask]);
__m128i ids32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(ids8, _mm_setzero_si128()), _mm_setzero_si128());
ids32 = _mm_add_epi32(ids32, base);
_mm_storeu_si128((__m128i*)ptrDst, ids32);
ptrDst += dstMove[mask]; //alternative: ptrDst += min(16-_mm_popcnt_u32(mask), 4);
ptrSrc += srcMove[mask]; //no alternative without LUT
base = _mm_add_epi32(base, _mm_set1_epi32(dstMove[mask]));
}
return ptrDst - dst;
}
One drawback of this approach is that now each subsequent loop iteration cannot start until the line ptrDst += dstMove[mask]; is executed on the previous iteration. So the critical path has increased dramatically. Hardware hyperthreading or its manual emulation can remove this penalty.
So, as you see, there are many variations of this basic idea, all of which solve your problem with different degree of efficiency. You can also reduce size of LUT if you don't like it (again, at the cost of decreasing throughput performance).
This approach cannot be fully extended to wider registers (i.e. AVX2 and AVX-512), but you can try to combine instructions of several consecutive iterations into single AVX2 or AVX-512 instruction, thus slightly increasing throughput.
Note: I didn't test any code (because precomputing LUT correctly requires noticeable effort).

Although AVX2 instruction set has many GATHER instructions, but its performance is too slow. And the most effective way to do this - to process an array manually.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js