Please tell me, I can't figure it out myself:
Here I have an __m128i SIMD vector where each of the 16 bytes contains either a 0 or a 1:
1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1
Is it possible to somehow transform this vector so that all the ones are removed, and each zero is replaced by the index of that element in the vector? That is, like this:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1
1 4 6 10 12 14
And finally get a vector with only these values:
1 4 6 10 12 14
What logic could produce such a result? Which SIMD instructions should be used?
PS: I'm just starting to learn SIMD, so there is a lot I don't know or understand yet.
If you have BMI2, use the following version.
#include <immintrin.h>
#include <cstdint>
#include <bit>   // std::popcount (C++20)

__m128i compressZeroIndices_bmi2( __m128i v )
{
const __m128i zero = _mm_setzero_si128();
// Replace zeros with 0xFF
v = _mm_cmpeq_epi8( v, zero );
// Extract low/high pieces into scalar registers for PEXT instruction
uint64_t low = (uint64_t)_mm_cvtsi128_si64( v );
uint64_t high = (uint64_t)_mm_extract_epi64( v, 1 );
// Count payload bytes in the complete vector
v = _mm_sub_epi8( zero, v );
v = _mm_sad_epu8( v, zero );
v = _mm_add_epi64( v, _mm_srli_si128( v, 8 ) );
v = _mm_shuffle_epi8( v, zero );
// Make a mask vector filled with 0 for payload bytes, 0xFF for padding
const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
v = _mm_max_epu8( v, identity );
__m128i mask = _mm_cmpeq_epi8( v, identity );
// The following line requires C++20.
// If you don't have it, use #ifdef _MSC_VER to switch between __popcnt64() and _popcnt64() intrinsics.
uint64_t lowBits = std::popcount( low );
// Use BMI2 to gather these indices
low = _pext_u64( 0x0706050403020100ull, low );
high = _pext_u64( 0x0F0E0D0C0B0A0908ull, high );
// Merge payload into a vector
// Note: if the low half of the input has no zeros ( lowBits == 0 ) or is entirely zeros ( lowBits == 64 ),
// one of these shift counts becomes 64, which is technically undefined behaviour in C++; strictly, that case needs special handling.
v = _mm_cvtsi64_si128( low | ( high << lowBits ) );
v = _mm_insert_epi64( v, high >> ( 64 - lowBits ), 1 );
// Apply the mask to set unused elements to -1, enables pmovmskb + tzcnt to find the length
return _mm_or_si128( v, mask );
}
Here’s another version without BMI2. Probably slower on most CPUs, but the code is way simpler, and doesn’t use any scalar instructions.
inline __m128i sortStep( __m128i a, __m128i perm, __m128i blend )
{
// The min/max are independent and their throughput is 0.33-0.5 cycles,
// so this whole function only takes 3 (AMD) or 4 (Intel) cycles to complete
__m128i b = _mm_shuffle_epi8( a, perm );
__m128i i = _mm_min_epu8( a, b );
__m128i ax = _mm_max_epu8( a, b );
return _mm_blendv_epi8( i, ax, blend );
}
__m128i compressZeroIndices( __m128i v )
{
// Replace zeros with 0-based indices, ones with 0xFF
v = _mm_cmpgt_epi8( v, _mm_setzero_si128() );
const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
v = _mm_or_si128( v, identity );
// Sort bytes in the vector with a network
// https://demonstrations.wolfram.com/SortingNetworks/
// Click the "transposition" algorithm on that demo
const __m128i perm1 = _mm_setr_epi8( 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 );
const __m128i blend1 = _mm_set1_epi16( (short)0xFF00 );
const __m128i perm2 = _mm_setr_epi8( 0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 15 );
const __m128i blend2 = _mm_setr_epi8( 0, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0 );
for( size_t i = 0; i < 8; i++ )
{
v = sortStep( v, perm1, blend1 );
v = sortStep( v, perm2, blend2 );
}
return v;
}
P.S. If you want the length of the output vector, use this function:
uint32_t vectorLength( __m128i v )
{
uint32_t mask = (uint32_t)_mm_movemask_epi8( v );
mask |= 0x10000;
return _tzcnt_u32( mask );
}
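To see how these pieces fit together, here's a minimal usage sketch of my own (untested), assuming <immintrin.h>, <cstdint> and <cstdio> are included above the functions:
int main()
{
    alignas(16) const uint8_t src[16] = { 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1 };
    __m128i v = _mm_load_si128( reinterpret_cast<const __m128i*>( src ) );
    __m128i packed = compressZeroIndices( v );   // or compressZeroIndices_bmi2( v )
    uint32_t len = vectorLength( packed );
    alignas(16) uint8_t out[16];
    _mm_store_si128( reinterpret_cast<__m128i*>( out ), packed );
    for( uint32_t i = 0; i < len; i++ )
        printf( "%u ", out[ i ] );   // prints: 1 4 6 10 12 14
    return 0;
}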
Horizontal data-dependent stuff is hard. This is not something traditional SIMD building blocks are good at. This is a tricky problem to start learning SIMD with.
If you had AVX512VBMI2 (Ice Lake), vpcompressb could do this in one instruction on a constant. (Well two, counting the test-into-mask of the input.)
Or with AVX-512BW (Skylake-avx512), you could use vpcompressd on a constant vector of 16x uint32_t and then pack that __m512i down to bytes after compressing with vpmovdb. (After the same test-into-mask of the byte vector).
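For concreteness, here's a hedged sketch of what that vpcompressb version might look like (my code, untested; it assumes AVX512VL is also available so the 128-bit forms of the intrinsics can be used):
__m128i compressZeroIndices_vbmi2( __m128i v )
{
    // test-into-mask: one bit per byte that equals zero
    __mmask16 zeros = _mm_cmpeq_epi8_mask( v, _mm_setzero_si128() );
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    // vpcompressb: left-pack the index bytes selected by the mask; unselected output bytes become 0
    return _mm_maskz_compress_epi8( zeros, identity );
}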
16 separate elements means a single table-lookup is not viable; 2^16 x __m128i would be 64K x 16-bytes = 1 MiB, most accesses would miss in cache. (The code would be simple though; just _mm_cmpeq_epi8 against zero or _mm_slli_epi32(v, 7) / _mm_movemask_epi8 / use that 16-bit bitmask as an array index).
Possibly 4 lookups of 4-byte chunks, using 4 mask bits at a time, could work. (With a SWAR add of 0x04040404 / 0x08080808 / 0x0c0c0c0c to offset the result). Your table could also store offset values, or you could _lzcnt_u32 or something to figure out how much to increment a pointer until the next store, or _popcnt_u32(zpos&0xf).
#include <stdint.h>
#include <immintrin.h>
#include <stdalign.h>
#include <string.h>
// untested but compiles ok
char *zidx_SSE2(char *outbuf, __m128i v)
{
alignas(64) static struct __attribute__((packed)) {
uint32_t idx;
uint8_t count; // or make this also uint32_t, but still won't allow a memory-source add unless it's uintptr_t. Indexing could be cheaper in a PIE though, *8 instead of *5 which needs both base and idx
}lut[] = { // 16x 5-byte entries
/*[0b0000]=*/ {0, 0}, /* [0b0001]= */ {0x00000000, 1}, /* [0b0010]= */ {0x00000001, 1 },
//... left-packed indices, count of non-zero bits
/* [0b1111]=*/ {0x03020100, 4}
};
// Maybe pack the length into the high 4 bits, and mask? Maybe not, it's a small LUT
unsigned zpos = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
for (int i=0 ; i<16 ; i+=4){
uint32_t idx = lut[zpos & 0xf].idx;
idx += (0x01010101 * i); // this strength-reduces or is a constant after fully unrolling. GCC -O2 even realizes it can use add reg, 0x04040404 *as* the loop counter; clang -fno-unroll-loops doesn't
// idxs from bits 0..3, bits 4..7, bits 8..11, or bits 12..15
memcpy(outbuf, &idx, sizeof(idx)); // x86 is little-endian. Aliasing safe unaligned store.
outbuf += lut[zpos & 0xf].count; // or popcount(zpos&0xf)
zpos >>= 4;
}
return outbuf; // pointer to next byte after the last valid idx
}
https://godbolt.org/z/59Ev1Tz37 shows GCC and clang without loop unrolling. gcc -O3 does fully unroll it, as does clang at -O2 by default.
It will never store more than 16 bytes into outbuf, but stores fewer than that for inputs with fewer zero bytes. (But every store to outbuf is 4 bytes wide, even if there were zero actual indices in this chunk.) If all the input vector bytes are 0, the 4 stores won't overlap at all, otherwise they will (partially or fully) overlap. This is fine; cache and store buffers can absorb this easily.
SIMD vectors are fixed width, so IDK exactly what you meant about your output only having those values. The upper bytes have to be something; if you want zeros, then you could zero the outbuf first. Note that reloading from it into a __m128i vector would cause a store-forwarding stall (extra latency) if done right away after being written by 4x 32-bit stores. That's not a disaster, but it's not great. Best to do this into whatever actual output you want to write in the first place.
BMI2 pext is a horizontal pack operation
You said in comments you want this for an i7 with AVX2.
That also implies you have fast BMI2 pext / pdep (Intel since Haswell, AMD since Zen3.) Earlier AMD support those instructions, but not fast. Those do the bitwise equivalent of vpcompressb / vpexpandb on a uint64_t in an integer register.
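As a tiny scalar illustration of that bit-level left-pack (my own example, not from the linked Q&A):
// pext gathers the source bits selected by the mask into contiguous low bits.
// mask 0b11001100 selects bits 7,6,3,2 of the source; here those bits are 1,0,0,1,
// so the packed result is 0b1001.
uint32_t packed = _pext_u32( 0b10110100u, 0b11001100u );   // packed == 0b1001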
This could allow a trick similar to the one in "AVX2 what is the most efficient way to pack left based on a mask?".
After turning your vector into a mask of 0 / 0xf nibbles, we can extract the corresponding nibbles with values 0..15 into the bottom of an integer register with one pext instruction.
Or maybe keep things packed to bytes at the smallest to avoid having to unpack nibbles back to bytes, so then you'd need two separate 8-byte left-pack operations and need a popcnt or lzcnt to figure out how they should overlap.
Your pext operands would be the 0 / 0xff bytes from a _mm_cmpeq_epi8(v, _mm_setzero_si128()), extracted in two uint64_t halves with lo = _mm_cvtsi128_si64(cmp) and hi = _mm_extract_epi64(cmp, 1).
Use memcpy as an unaligned aliasing-safe store, as in the LUT version.
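Putting those pieces together, here's a hedged sketch of that pext approach (my own code, untested, reusing the headers included above; the caller's outbuf needs at least 16 bytes of room because both stores are 8 bytes wide):
char *zidx_pext(char *outbuf, __m128i v)
{
    __m128i cmp = _mm_cmpeq_epi8(v, _mm_setzero_si128()); // 0xFF where the input byte is 0
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(cmp);
    uint64_t hi = (uint64_t)_mm_extract_epi64(cmp, 1);
    uint64_t lo_idx = _pext_u64(0x0706050403020100ull, lo); // left-pack indices 0..7
    uint64_t hi_idx = _pext_u64(0x0F0E0D0C0B0A0908ull, hi); // left-pack indices 8..15
    unsigned lo_count = (unsigned)_mm_popcnt_u64(lo) / 8;   // number of zero bytes in the low half
    unsigned hi_count = (unsigned)_mm_popcnt_u64(hi) / 8;
    memcpy(outbuf, &lo_idx, 8);                 // unaligned aliasing-safe store
    memcpy(outbuf + lo_count, &hi_idx, 8);      // overlapping store right after the low-half indices
    return outbuf + lo_count + hi_count;        // pointer past the last valid index
}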
Slightly customized from here.
This SSSE3 strategy processes the two 64-bit halves and then recombines the result into a 128-bit word. Merging the 64-bit halves in an xmm register is more expensive than a compress-store to memory using overlapping writes.
/// `v` input bytes are either a 1 or 0
/// `v` output bytes are the "compressed" indices of zeros locations in the input
/// unused trailing bytes in the output are filled with garbage.
/// returns the number of used bytes in `v`
static inline
size_t zidx_SSSE3 (__m128i &v) {
static const uint64_t table[27] = { /* 216 bytes */
0x0000000000000706, 0x0000000000070600, 0x0000000007060100, 0x0000000000070602,
0x0000000007060200, 0x0000000706020100, 0x0000000007060302, 0x0000000706030200,
0x0000070603020100, 0x0000000000070604, 0x0000000007060400, 0x0000000706040100,
0x0000000007060402, 0x0000000706040200, 0x0000070604020100, 0x0000000706040302,
0x0000070604030200, 0x0007060403020100, 0x0000000007060504, 0x0000000706050400,
0x0000070605040100, 0x0000000706050402, 0x0000070605040200, 0x0007060504020100,
0x0000070605040302, 0x0007060504030200, 0x0706050403020100
};
const __m128i id = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
// adding 8 to each shuffle index is cheaper than extracting the high qword
const __m128i offset = _mm_cvtsi64_si128(0x0808080808080808);
// bits[4:0] = index -> ((trit_d * 0) + (trit_c * 9) + (trit_b * 3) + (trit_a * 1))
// bits[15:7] = popcnt
const __m128i sadmask = _mm_set1_epi64x(0x8080898983838181);
// detect 1's (spaces)
__m128i mask = _mm_sub_epi8(_mm_setzero_si128(), v);
// manually process 16-bit lanes to reduce possible combinations
v = _mm_add_epi8(v, id);
// extract bitfields describing each qword: index, popcnt
__m128i desc = _mm_sad_epu8(_mm_and_si128(mask, sadmask), sadmask);
size_t lo_desc = (size_t)_mm_cvtsi128_si32(desc);
size_t hi_desc = (size_t)_mm_extract_epi16(desc, 4);
// load shuffle control indices from pre-computed table
__m128i lo_shuf = _mm_loadl_epi64((__m128i*)&table[lo_desc & 0x1F]);
__m128i hi_shuf = _mm_or_si128(_mm_loadl_epi64((__m128i*)&table[hi_desc & 0x1F]), offset);
//// recombine shuffle control qwords ////
// emulates a variable `_mm_bslli_si128(hi_shuf, lo_popcnt)` operation
desc = _mm_srli_epi16(desc, 7); // isolate popcnts
__m128i shift = _mm_shuffle_epi8(desc, _mm_setzero_si128()); // broadcast popcnt of low qword
hi_shuf = _mm_shuffle_epi8(hi_shuf, _mm_sub_epi8(id, shift)); // byte shift left
__m128i shuf = _mm_max_epu8(lo_shuf, hi_shuf); // merge
v = _mm_shuffle_epi8(v, shuf);
return (hi_desc + lo_desc) >> 7; // popcnt
}
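A short usage sketch of my own (untested), storing the packed indices out to memory:
alignas(16) const uint8_t in[16] = { 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1 };
__m128i v = _mm_load_si128((const __m128i*)in);
size_t n = zidx_SSSE3(v);            // v now holds the zero indices in its first n bytes
alignas(16) uint8_t out[16];
_mm_store_si128((__m128i*)out, v);   // out[0..n-1] == 1 4 6 10 12 14, the rest is garbage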
If we're extracting these indices just for future scalar processing, then we might wish to consider using pmovmskb and then peeling off each index as needed.
unsigned x = (unsigned)_mm_movemask_epi8(compare_mask);
while (x) {
    unsigned idx = count_trailing_zeros(x); // e.g. _tzcnt_u32 or __builtin_ctz
    x &= x - 1; // clear lowest set bit
    DoSomethingTM(idx);
}
I want to be able to retain the same number of bits in my vector whilst still performing binary addition. For example:
int numOfBits = 4;
int myVecVal = 3;
vector< bool > myVec;
GetBinaryVector(&myVec,myVecVal, numOfBits);
and its output would be:
{0, 0, 1, 1}
I don't know how to write the GetBinaryVector function though, any ideas?
This seems to work (although the article I added in my initial comment seems to suggest you only have byte-level access):
void GetBinaryVector(vector<bool> *v, int val, int bits) {
v->resize(bits);
for(int i = 0; i < bits; i++) {
(*v)[bits - 1 - i] = (val >> i) & 0x1;
}
}
The left hand side sets the i'th least significant bit, which is at index bits - 1 - i. The right hand side isolates the i'th least significant bit by shifting the value down by i bits and masking off everything but the least significant bit.
In your example val = 3, bits = 4. In the first iteration i = 0: we have (*v)[4 - 1 - 0] = (3 >> 0) & 0x1. 3 is binary 0011, shifting it down by 0 leaves 0011, and 0011 & 0x1 is 1, so we set (*v)[3] = 1. For i = 1: (*v)[2] = (3 >> 1) & 0x1 = 1. For i = 2 and i = 3 the shifted value has a 0 in its lowest bit, so (*v)[1] = 0 and (*v)[0] = 0. The resulting vector is { 0, 0, 1, 1 }.
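A small self-contained usage sketch of my own (untested), matching the example in the question:
#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<bool> myVec;
    GetBinaryVector(&myVec, 3, 4);
    for (bool bit : myVec)
        cout << bit << ' ';   // prints: 0 0 1 1
}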
I am working on a C++ app on the Windows platform. There's an unsigned char pointer that gets bytes in decimal format.
unsigned char array[160];
This will have values like this,
array[0] = 0
array[1] = 0
array[2] = 176
array[3] = 52
array[4] = 0
array[5] = 0
array[6] = 223
array[7] = 78
array[8] = 0
array[9] = 0
array[10] = 123
array[11] = 39
array[12] = 0
array[13] = 0
array[14] = 172
array[15] = 51
.......
........
.........
and so forth...
I need to take each block of 4 bytes and then calculate its decimal value.
So, for example, for the first 4 bytes the combined hex value is B034. Now I need to convert this to decimal and divide by 1000.
As you can see, for each 4-byte block the first 2 bytes are always 0, so I can ignore those and take the last 2 bytes of that block. From the above example, that's 176 & 52.
There are many ways of doing this, but I want to do it using bitwise operators.
Below is what I tried, but it's not working. Basically I am ignoring the first 2 bytes of every 4-byte block.
int index = 0;
for (int i = 0 ; i <= 160; i++) {
index++;
index++;
float Val = ((Array[index]<<8)+Array[index+1])/1000.0f;
index++;
}
Since you're processing the array four bytes at a time, I recommend that you increment i by 4 in the for loop. You can also avoid confusion by dropping the unnecessary index variable - you have i in the loop and can use it directly, no?
Another thing: prefer bitwise OR over arithmetic addition when you're trying to "concatenate" numbers, even though the outcome here is identical.
for (int i = 0; i < 160; i += 4) {
    float val = ((array[i + 2] << 8) | array[i + 3]) / 1000.0f;
    // ... use val here ...
}
First of all, i <= 160 is one iteration too many.
Second, your incrementation is wrong; for index, you have
Iteration 1:
1, 2, 3
And you're combining 2 and 3 - this is correct.
Iteration 2:
4, 5, 6
And you're combining 5 and 6 - should be 6 and 7.
Iteration 3:
7, 8, 9
And you're combining 8 and 9 - should be 10 and 11.
You need to increment four times per iteration, not three.
But I think it's simpler to start looping at the first index you're interested in - 2 - and increment by 4 (the "stride") directly:
for (int i = 2; i < 160; i += 4) {
float Val = ((Array[i]<<8)+Array[i+1])/1000.0f;
}
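A fuller sketch of my own (untested) of that loop, collecting all 40 computed values; the sample data and names are mine:
#include <cstdio>

int main() {
    unsigned char Array[160] = { 0, 0, 176, 52, 0, 0, 223, 78 /* ... */ };
    float Vals[40];
    for (int i = 2, n = 0; i < 160; i += 4, ++n) {
        Vals[n] = ((Array[i] << 8) + Array[i + 1]) / 1000.0f;
    }
    printf("%f\n", Vals[0]);   // 0xB034 = 45108, so this prints roughly 45.108
    return 0;
}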
I have to translate the following instructions from SSE to Neon
uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a,SHUFFLE_MASK) );
Where:
static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1);
So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and put them into a uint32_t. It looks like a packing instruction (in SSE I seem to remember I used shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions).
How does this operation translate to Neon? Should I use packing instructions? How do I then extract the 32 bits? (Is there anything equivalent to _mm_cvtsi128_si32?)
Edit:
To start with, vgetq_lane_u32 should allow replacing _mm_cvtsi128_si32
(but I will have to cast my uint8x16_t to uint32x4_t)
uint32_t vgetq_lane_u32(uint32x4_t vec, __constrange(0,3) int lane);
or directly store the lane vst1q_lane_u32
void vst1q_lane_u32(__transfersize(1) uint32_t * ptr, uint32x4_t val, __constrange(0,3) int lane); // VST1.32 {d0[0]}, [r0]
I found this excellent guide.
I am working on that; it seems that my operation could be done with one VTBL instruction (table lookup), but I will implement it with 2 deinterleaving operations because for the moment that looks simpler.
uint8x8x2_t vuzp_u8(uint8x8_t a, uint8x8_t b);
So something like:
uint8x16_t a;
uint8_t* out;
[...]
//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0
vst1q_lane_u32(out,a,0);
The last line does not give a warning when __attribute__((optimize("lax-vector-conversions"))) is used.
But, because of the data conversion, the two assignments are not possible. One workaround is like this (Edit: This breaks strict aliasing rules! The compiler could assume that a does not change when *d is written.):
uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
I have implemented a more general workaround through a flexible data type:
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
Edit:
Here is the version with a shuffle mask / look-up table. It does indeed make my inner loop a little bit faster. Again, I have used the data type described here.
static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);
I would write it as so:
uint32_t extract (uint8x16_t x)
{
uint8x8x2_t a = vuzp_u8 (vget_low_u8 (x), vget_high_u8 (x));
uint8x8x2_t b = vuzp_u8 (a.val[0], a.val[1]);
return vget_lane_u32 (vreinterpret_u32_u8 (b.val[0]), 0);
}
Which on a recent GCC version compiles to:
extract:
vuzp.8 d0, d1
vuzp.8 d0, d1
vmov.32 r0, d0[0]
bx lr
I just wanted to know if there is any difference between these two expressions:
1 : a = ( a | ( b&0x7F ) >> 7 );
2 : a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
I'm not only talking about the result, but also about efficiency (though the first one looks better).
The purpose is to concatenate the 7 lower bits of multiple bytes, and I was at first using number 2 like this:
while(thereIsByte)
{
a = ( ( a << 8 ) | ( b&0x7F ) << i );
++i;
}
Thanks.
The two expressions don't do the same thing at all:
a = ( a | ( b&0x7F ) >> 7 );
Explaining:
a = 0010001000100011
b = 1000100010001100
0x7f = 0000000001111111
b&0x7f = 0000000000001100
(b&0x7f) >> 7 = 0000000000000000 (this is always 0: you are selecting the lowest
                                  7 bits of 'b' and then shifting right by 7 bits,
                                  discarding all of the selected bits)
(a | (b&0x7f) >> 7) is always equal to `a`
a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
Explaining:
a = 0010001000100011
b = 1000100010001100
0x7f = 0000000001111111
b&0x7f = 0000000000001100
(b&0x7f) << 1 = 0000000000011000
(a << 8) = 0010001100000000
(a << 8) | (b&0x7F) << 1 = 0010001100011000
In the second expression, the result has the 3 lowest bytes of a moved up into the 3 highest bytes, plus the lowest byte of b without its highest bit, shifted 1 bit to the left. It would be like a = a * 256 + (b & 0x7f) * 2.
If you want to concatenate the lowest 7 bits of b into a, it would be:
while (thereIsByte) {
a = (a << 7) | (b & 0x7f);
// read the next byte into `b`
}
For example, if sizeof(a) is 4 bytes and you are concatenating four 7-bit values, the result of the pseudocode would be:
a = uuuuzzzzzzzyyyyyyyxxxxxxxwwwwwww
Where the z bits are the 7 bits of the first byte read, the y bits are the 7 bits of the second, and so on. The u bits are unused (they contain whatever was in the lowest 4 bits of a at the beginning).
In this case the size of a needs to be at least the total number of bits you want to concatenate (e.g. at least 32 bits if you want to concatenate four 7-bit values).
If a and b are each only one byte in size, there isn't really much to concatenate; you probably need a data structure like boost::dynamic_bitset, where you can append bits repeatedly and it grows accordingly.
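For example, a hedged sketch of my own (untested) that appends the low 7 bits of each incoming byte to a boost::dynamic_bitset:
#include <boost/dynamic_bitset.hpp>
#include <cstdint>

void append7bits(boost::dynamic_bitset<> &bits, uint8_t b) {
    for (int i = 6; i >= 0; --i)
        bits.push_back((b >> i) & 1u);   // most significant of the 7 bits first; the bitset grows as needed
}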
Yes they are different. On MSVC2010 here is the disassembly when both a and b are chars.
a = ( a | ( b&0x7F ) >> 7 );
012713A6 movsx eax,byte ptr [a]
012713AA movsx ecx,byte ptr [b]
012713AE and ecx,7Fh
012713B1 sar ecx,7
012713B4 or eax,ecx
012713B6 mov byte ptr [a],al
a = ( ( a << 8 ) | ( b&0x7F ) << 1 );
012813A6 movsx eax,byte ptr [a]
012813AA shl eax,8
012813AD movsx ecx,byte ptr [b]
012813B1 and ecx,7Fh
012813B4 shl ecx,1
012813B6 or eax,ecx
012813B8 mov byte ptr [a],al
Notice that the second method does two shift operations while the first does only a single shift. Basically this is caused by the order of operations. The first method does produce slightly shorter code; however, shifting is one of the CPU's cheapest operations (a shift by a constant takes a single cycle regardless of the shift count), so the difference is negligible for most applications.
Also notice that the compiler treated a and b as bytes (sign-extending them into registers for the arithmetic and storing back only the low byte of the result), not as full-width ints.