I'm interested in identifying overflowing values when adding unsigned 8-bit integers, and clamping the result to 0xFF:
__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m3 = _mm_adds_epu8(m1, m2);
I would like to perform a "less than" comparison on these unsigned integers, similar to _mm_cmplt_epi8 for signed ones:
__m128i mask = _mm_cmplt_epi8 (m3, m1);
m1 = _mm_or_si128(m3, mask);
If an "epu8" equivalent was available, mask would have 0xFF where m3[i] < m1[i] (overflow!), 0x00 otherwise, and we would be able to clamp m1 using the "or", so m1 will hold the addition result where valid, and 0xFF where it overflowed.
Problem is, _mm_cmplt_epi8 performs a signed comparison, so for instance if m1[i] = 0x70 and m2[i] = 0x10, then m3[i] = 0x80 and mask[i] = 0xFF, which is obviously not what I require.
Using VS2012.
I would appreciate another approach for performing this. Thanks!
One way of implementing comparisons for unsigned 8-bit vectors is to exploit _mm_max_epu8, which returns the element-wise maximum of unsigned 8-bit integers. You can compare the (unsigned) maximum of two elements for equality with one of the source elements and then return the appropriate result. This translates to 2 instructions for >= or <=, and 3 instructions for > or <.
Example code:
#include <stdio.h>
#include <emmintrin.h> // SSE2
#define _mm_cmpge_epu8(a, b) \
_mm_cmpeq_epi8(_mm_max_epu8(a, b), a)
#define _mm_cmple_epu8(a, b) _mm_cmpge_epu8(b, a)
#define _mm_cmpgt_epu8(a, b) \
_mm_xor_si128(_mm_cmple_epu8(a, b), _mm_set1_epi8(-1))
#define _mm_cmplt_epu8(a, b) _mm_cmpgt_epu8(b, a)
int main(void)
{
__m128i va = _mm_setr_epi8(0, 0, 1, 1, 1, 127, 127, 127, 128, 128, 128, 254, 254, 254, 255, 255);
__m128i vb = _mm_setr_epi8(0, 255, 0, 1, 255, 0, 127, 255, 0, 128, 255, 0, 254, 255, 0, 255);
__m128i v_ge = _mm_cmpge_epu8(va, vb);
__m128i v_le = _mm_cmple_epu8(va, vb);
__m128i v_gt = _mm_cmpgt_epu8(va, vb);
__m128i v_lt = _mm_cmplt_epu8(va, vb);
printf("va = %4vhhu\n", va);
printf("vb = %4vhhu\n", vb);
printf("v_ge = %4vhhu\n", v_ge);
printf("v_le = %4vhhu\n", v_le);
printf("v_gt = %4vhhu\n", v_gt);
printf("v_lt = %4vhhu\n", v_lt);
return 0;
}
Compile and run:
$ gcc -Wall _mm_cmplt_epu8.c && ./a.out
va = 0 0 1 1 1 127 127 127 128 128 128 254 254 254 255 255
vb = 0 255 0 1 255 0 127 255 0 128 255 0 254 255 0 255
v_ge = 255 0 255 255 0 255 255 0 255 255 0 255 255 0 255 255
v_le = 255 255 0 255 255 0 255 255 0 255 255 0 255 255 0 255
v_gt = 0 0 255 0 0 255 0 0 255 0 0 255 0 0 255 0
v_lt = 0 255 0 0 255 0 0 255 0 0 255 0 0 255 0 0
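As a hypothetical usage for the original question (my addition, not part of the answer above): note that a compare can only reveal the overflow if it is applied to the wrapping sum, since a saturating sum is never below its operands.
__m128i m3   = _mm_add_epi8(m1, m2);       // wrapping sum
__m128i mask = _mm_cmplt_epu8(m3, m1);     // 0xFF where the sum wrapped, i.e. overflowed
m1 = _mm_or_si128(m3, mask);               // clamp overflowed lanes to 0xFF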
The other answers got me thinking of a simpler method to answer the specific question more directly:
To simply detect clamping, do saturating and non-saturating additions, and compare the results.
__m128i m1 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m2 = _mm_loadu_si128(/* 16 8-bit unsigned integers */);
__m128i m1m2_sat = _mm_adds_epu8(m1, m2);
__m128i m1m2_wrap = _mm_add_epi8(m1, m2);
__m128i non_clipped = _mm_cmpeq_epi8(m1m2_sat, m1m2_wrap);
So that's just two instructions beyond the saturating add, and one of them (the wrapping add) can run in parallel with it, so the non_clipped mask is ready one cycle after the addition result. (Potentially 3 instructions (an extra movdqa) without AVX 3-operand non-destructive vector ops.)
If the non-saturating add result is 0xFF, it will match the saturating-add result, and be detected as not clipping. This is why it's different from just checking the output of the saturating add for 0xFF bytes.
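If you also want an explicit 0xFF-per-lane overflow mask, as the question asked for, one extra instruction inverts non_clipped (a small addition of mine, not part of the original answer); m1m2_sat is already the clamped result:
__m128i clipped = _mm_xor_si128(non_clipped, _mm_set1_epi8(-1)); // 0xFF where the addition saturated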
Another way to compare unsigned bytes: add 0x80 and compare them as signed ones.
__m128i _mm_cmplt_epu8(__m128i a, __m128i b) {
__m128i as = _mm_add_epi8(a, _mm_set1_epi8((char)0x80));
__m128i bs = _mm_add_epi8(b, _mm_set1_epi8((char)0x80));
return _mm_cmplt_epi8(as, bs);
}
I don't think it is very efficient, but it works, and it may be useful in some cases. Also, you can use xor instead of addition if you want.
In some cases you can even do bidirectional range checking at once, i.e. compare a value with both lower and upper bounds. To do so, align the lower bound with 0x80, similar to what this answer does.
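Here is a minimal sketch of that idea (my own illustration, not taken from the linked answer; the name in_range_epu8 is made up, and it needs <emmintrin.h> and <stdint.h>). The value is biased so that the lower bound lands on 0x80 (the most negative signed byte); a single signed compare against the biased upper bound then rejects values below lo and above hi at once:
static inline __m128i in_range_epu8(__m128i x, uint8_t lo, uint8_t hi) // assumes lo <= hi
{
    __m128i biased = _mm_add_epi8(x, _mm_set1_epi8((char)(0x80 - lo))); // lo maps to -128
    __m128i bound  = _mm_set1_epi8((char)(hi - lo - 0x80));             // where hi maps to
    __m128i out    = _mm_cmpgt_epi8(biased, bound);                     // out of range on either side
    return _mm_xor_si128(out, _mm_set1_epi8(-1));                       // 0xFF where lo <= x <= hi
}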
Here is an implementation of comparisons for 8-bit unsigned integers:
inline __m128i NotEqual8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(a, b), _mm_set1_epi8(-1));
}
inline __m128i Greater8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_min_epu8(a, b), a), _mm_set1_epi8(-1));
}
inline __m128i GreaterOrEqual8u(__m128i a, __m128i b)
{
return _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
}
inline __m128i Lesser8u(__m128i a, __m128i b)
{
return _mm_andnot_si128(_mm_cmpeq_epi8(_mm_max_epu8(a, b), a), _mm_set1_epi8(-1));
}
inline __m128i LesserOrEqual8u(__m128i a, __m128i b)
{
return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
}
Please help me, I can't figure this out myself:
I have a __m128i SIMD vector - each of the 16 bytes contains one of the following values:
1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1
Is it possible to somehow transform this vector so that all the ones are removed, and each zero is replaced by its element index within the vector? That is, like this:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1
1 4 6 10 12 14
And finally get a vector with only these values:
1 4 6 10 12 14
What logic can be used to obtain such a result? What SIMD instructions should I use?
PS: I'm just starting to learn SIMD, so I don't know much yet.
If you have BMI2, use the following version.
#include <immintrin.h> // SSE4.1, BMI2 and popcnt intrinsics
#include <cstdint>
#include <bit>         // std::popcount (C++20)

__m128i compressZeroIndices_bmi2( __m128i v )
{
const __m128i zero = _mm_setzero_si128();
// Replace zeros with 0xFF
v = _mm_cmpeq_epi8( v, zero );
// Extract low/high pieces into scalar registers for PEXT instruction
uint64_t low = (uint64_t)_mm_cvtsi128_si64( v );
uint64_t high = (uint64_t)_mm_extract_epi64( v, 1 );
// Count payload bytes in the complete vector
v = _mm_sub_epi8( zero, v );
v = _mm_sad_epu8( v, zero );
v = _mm_add_epi64( v, _mm_srli_si128( v, 8 ) );
v = _mm_shuffle_epi8( v, zero );
// Make a mask vector filled with 0 for payload bytes, 0xFF for padding
const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
v = _mm_max_epu8( v, identity );
__m128i mask = _mm_cmpeq_epi8( v, identity );
// The following line requires C++20
// If you don't have it, use #ifdef _MSC_VER to switch between __popcnt64() and _popcnt64() intrinsics.
uint64_t lowBits = std::popcount( low );
// Use BMI2 to gather these indices
low = _pext_u64( 0x0706050403020100ull, low );
high = _pext_u64( 0x0F0E0D0C0B0A0908ull, high );
// Merge payload into a vector
// (note: lowBits can be 0 or 64 here, and variable shifts by 64 bits are undefined in C++,
// so the no-zeros / all-zeros-in-one-half edge cases may need guarding in production code)
v = _mm_cvtsi64_si128( low | ( high << lowBits ) );
v = _mm_insert_epi64( v, high >> ( 64 - lowBits ), 1 );
// Apply the mask to set unused elements to -1, enables pmovmskb + tzcnt to find the length
return _mm_or_si128( v, mask );
}
Here’s another version without BMI2. Probably slower on most CPUs, but the code is way simpler, and doesn’t use any scalar instructions.
inline __m128i sortStep( __m128i a, __m128i perm, __m128i blend )
{
// The min/max are independent and their throughput is 0.33-0.5 cycles,
// so this whole function only takes 3 (AMD) or 4 (Intel) cycles to complete
__m128i b = _mm_shuffle_epi8( a, perm );
__m128i i = _mm_min_epu8( a, b );
__m128i ax = _mm_max_epu8( a, b );
return _mm_blendv_epi8( i, ax, blend );
}
__m128i compressZeroIndices( __m128i v )
{
// Replace zeros with 0-based indices, ones with 0xFF
v = _mm_cmpgt_epi8( v, _mm_setzero_si128() );
const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
v = _mm_or_si128( v, identity );
// Sort bytes in the vector with a network
// https://demonstrations.wolfram.com/SortingNetworks/
// Click the "transposition" algorithm on that demo
const __m128i perm1 = _mm_setr_epi8( 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 );
const __m128i blend1 = _mm_set1_epi16( (short)0xFF00 );
const __m128i perm2 = _mm_setr_epi8( 0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 15 );
const __m128i blend2 = _mm_setr_epi8( 0, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0 );
for( size_t i = 0; i < 8; i++ )
{
v = sortStep( v, perm1, blend1 );
v = sortStep( v, perm2, blend2 );
}
return v;
}
P.S. If you want the length of the output vector, use this function:
uint32_t vectorLength( __m128i v )
{
uint32_t mask = (uint32_t)_mm_movemask_epi8( v );
mask |= 0x10000;
return _tzcnt_u32( mask );
}
Horizontal data-dependent stuff is hard. This is not something traditional SIMD building blocks are good at. This is a tricky problem to start learning SIMD with.
If you had AVX512VBMI2 (Ice Lake), vpcompressb could do this in one instruction on a constant. (Well two, counting the test-into-mask of the input.)
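A hedged sketch of what that could look like (my illustration, untested; requires AVX512VBMI2 + AVX512VL + AVX512BW and <immintrin.h>; the function name is made up):
__m128i compressZeroIndices_vbmi2( __m128i v )
{
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    __mmask16 zeros = _mm_cmpeq_epi8_mask( v, _mm_setzero_si128() );  // test into mask
    return _mm_maskz_compress_epi8( zeros, identity );                // vpcompressb on a constant; unused upper bytes are zeroed
}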
Or with AVX-512BW (Skylake-avx512), you could use vpcompressd on a constant vector of 16x uint32_t and then pack that __m512i down to bytes after compressing with vpmovdb. (After the same test-into-mask of the byte vector).
16 separate elements means a single table-lookup is not viable; 2^16 x __m128i would be 64K x 16-bytes = 1 MiB, most accesses would miss in cache. (The code would be simple though; just _mm_cmpeq_epi8 against zero or _mm_slli_epi32(v, 7) / _mm_movemask_epi8 / use that 16-bit bitmask as an array index).
Possibly 4 lookups of 4-byte chunks, using 4 mask bits at a time, could work. (With a SWAR add of 0x04040404 / 0x08080808 / 0x0c0c0c0c to offset the result.) Your table could also store offset values, or you could _lzcnt_u32 or something to figure out how much to increment a pointer until the next store, or _popcnt_u32(zpos&0xf).
#include <stdint.h>
#include <immintrin.h>
#include <stdalign.h>
#include <string.h>
// untested but compiles ok
char *zidx_SSE2(char *outbuf, __m128i v)
{
alignas(64) static struct __attribute__((packed)) {
uint32_t idx;
uint8_t count; // or make this also uint32_t, but still won't allow a memory-source add unless it's uintptr_t. Indexing could be cheaper in a PIE though, *8 instead of *5 which needs both base and idx
}lut[] = { // 16x 5-byte entries
/*[0b0000]=*/ {0, 0}, /* [0b0001]= */ {0x00000000, 1}, /* [0b0010]= */ {0x00000001, 1 },
//... left-packed indices, count of non-zero bits
/* [0b1111]=*/ {0x03020100, 4}
};
// Maybe pack the length into the high 4 bits, and mask? Maybe not, it's a small LUT
unsigned zpos = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
for (int i=0 ; i<16 ; i+=4){
uint32_t idx = lut[zpos & 0xf].idx;
idx += (0x01010101 * i); // this strength-reduces or is a constant after fully unrolling. GCC -O2 even realizes it can use add reg, 0x04040404 *as* the loop counter; clang -fno-unroll-loops doesn't
// idxs from bits 0..3, bits 4..7, bits 8..11, or bits 12..15
memcpy(outbuf, &idx, sizeof(idx)); // x86 is little-endian. Aliasing safe unaligned store.
outbuf += lut[zpos & 0xf].count; // or popcount(zpos&0xf)
zpos >>= 4;
}
return outbuf; // pointer to next byte after the last valid idx
}
https://godbolt.org/z/59Ev1Tz37 shows GCC and clang without loop unrolling. gcc -O3 does fully unroll it, as does clang at -O2 by default.
It will never store more than 16 bytes into outbuf, but stores fewer than that for inputs with fewer zero bytes. (But every store to outbuf is 4 bytes wide, even if there were zero actual indices in this chunk.) If all the input vector bytes are 0, the 4 stores won't overlap at all, otherwise they will (partially or fully) overlap. This is fine; cache and store buffers can absorb this easily.
SIMD vectors are fixed width, so IDK exactly what you meant about your output only having those values. The upper bytes have to be something; if you want zeros, then you could zero the outbuf first. Note that reloading from it into a __m128i vector would cause a store-forwarding stall (extra latency) if done right away after being written by 4x 32-bit stores. That's not a disaster, but it's not great. Best to do this into whatever actual output you want to write in the first place.
BMI2 pext is a horizontal pack operation
You said in comments you want this for an i7 with AVX2.
That also implies you have fast BMI2 pext / pdep (Intel since Haswell, AMD since Zen 3). Earlier AMD CPUs support those instructions, but not fast. Those do the bitwise equivalent of vpcompressb / vpexpandb on a uint64_t in an integer register.
This could allow a similar trick to AVX2 what is the most efficient way to pack left based on a mask?
After turning your vector into a mask of 0 / 0xf nibbles, we can extract the corresponding nibbles with values 0..15 into the bottom of an integer register with one pext instruction.
Or maybe keep things packed to bytes at the smallest to avoid having to unpack nibbles back to bytes, so then you'd need two separate 8-byte left-pack operations and need a popcnt or lzcnt to figure out how they should overlap.
Your pext operands would be the 0 / 0xff bytes from a _mm_cmpeq_epi8(v, _mm_setzero_si128()), extracted in two uint64_t halves with lo = _mm_cvtsi128_si64(cmp) and hi = _mm_extract_epi64(cmp, 1).
Use memcpy as an unaligned aliasing-safe store, as in the LUT version.
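A hedged sketch of that idea (my own, untested; the name zidx_pext is made up, and it assumes the same headers as the LUT version plus fast BMI2 pext):
char *zidx_pext(char *outbuf, __m128i v)
{
    __m128i cmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());   // 0xFF where the input byte is zero
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(cmp);
    uint64_t hi = (uint64_t)_mm_extract_epi64(cmp, 1);
    // left-pack the byte indices of the zeros within each 8-byte half
    uint64_t lo_idx = _pext_u64(0x0706050403020100ull, lo);
    uint64_t hi_idx = _pext_u64(0x0F0E0D0C0B0A0908ull, hi);
    unsigned lo_count = (unsigned)_mm_popcnt_u64(lo) / 8;   // number of zero bytes in the low half
    unsigned hi_count = (unsigned)_mm_popcnt_u64(hi) / 8;
    memcpy(outbuf, &lo_idx, 8);              // overlapping 8-byte stores, as in the LUT version
    memcpy(outbuf + lo_count, &hi_idx, 8);
    return outbuf + lo_count + hi_count;     // pointer past the last valid index
}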
Slightly customized from here.
This SSSE3 strategy processes 64-bit words then recombines the result into a 128-bit word. Merging the 64-bit halves in an xmm register is more expensive than a compress-store to memory using overlapping writes.
/// `v` input bytes are either a 1 or 0
/// `v` output bytes are the "compressed" indices of zeros locations in the input
/// unused leading bytes in the output are filled with garbage.
/// returns the number of used bytes in `v`
static inline
size_t zidx_SSSE3 (__m128i &v) {
static const uint64_t table[27] = { /* 216 bytes */
0x0000000000000706, 0x0000000000070600, 0x0000000007060100, 0x0000000000070602,
0x0000000007060200, 0x0000000706020100, 0x0000000007060302, 0x0000000706030200,
0x0000070603020100, 0x0000000000070604, 0x0000000007060400, 0x0000000706040100,
0x0000000007060402, 0x0000000706040200, 0x0000070604020100, 0x0000000706040302,
0x0000070604030200, 0x0007060403020100, 0x0000000007060504, 0x0000000706050400,
0x0000070605040100, 0x0000000706050402, 0x0000070605040200, 0x0007060504020100,
0x0000070605040302, 0x0007060504030200, 0x0706050403020100
};
const __m128i id = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
// adding 8 to each shuffle index is cheaper than extracting the high qword
const __m128i offset = _mm_cvtsi64_si128(0x0808080808080808);
// bits[4:0] = index -> ((trit_d * 0) + (trit_c * 9) + (trit_b * 3) + (trit_a * 1))
// bits[15:7] = popcnt
const __m128i sadmask = _mm_set1_epi64x(0x8080898983838181);
// detect 1's (spaces)
__m128i mask = _mm_sub_epi8(_mm_setzero_si128(), v);
// manually process 16-bit lanes to reduce possible combinations
v = _mm_add_epi8(v, id);
// extract bitfields describing each qword: index, popcnt
__m128i desc = _mm_sad_epu8(_mm_and_si128(mask, sadmask), sadmask);
size_t lo_desc = (size_t)_mm_cvtsi128_si32(desc);
size_t hi_desc = (size_t)_mm_extract_epi16(desc, 4);
// load shuffle control indices from pre-computed table
__m128i lo_shuf = _mm_loadl_epi64((__m128i*)&table[lo_desc & 0x1F]);
__m128i hi_shuf = _mm_or_si128(_mm_loadl_epi64((__m128i*)&table[hi_desc & 0x1F]), offset);
//// recombine shuffle control qwords ////
// emulates a variable `_mm_bslli_si128(hi_shuf, lo_popcnt)` operation
desc = _mm_srli_epi16(desc, 7); // isolate popcnts
__m128i shift = _mm_shuffle_epi8(desc, _mm_setzero_si128()); // broadcast popcnt of low qword
hi_shuf = _mm_shuffle_epi8(hi_shuf, _mm_sub_epi8(id, shift)); // byte shift left
__m128i shuf = _mm_max_epu8(lo_shuf, hi_shuf); // merge
v = _mm_shuffle_epi8(v, shuf);
return (hi_desc + lo_desc) >> 7; // popcnt
}
If we're extracting these indices just for future scalar processing, then we might wish to consider using pmovmskb and then peeling off each index as needed.
x = (unsigned)_mm_movemask_epi8(compare_mask);
while (x) {
idx = count_trailing_zeros(x); // e.g. _tzcnt_u32 or __builtin_ctz
x &= x - 1; // clear lowest set bit
DoSomethingTM(idx);
}
I have a problem with multiplying two registers (or just a register by a float constant). One register is of __m128i type and contains one channel of RGBA pixel color from 16 pixels (the array of 16 pixels is passed as a parameter to a C++ DLL). I want to multiply this register by a constant to get the grayscale value for that channel, and do the same for the other channels stored in __m128i registers.
I think a good way to convert an image to grayscale with SIMD is to use this algorithm.
fY(R, G, B) = R x 0.29891 + G x 0.58661 + B x 0.11448
I have the following code; for now it only decomposes the image into channels and packs them back together to return via the src pointer. Now I need to make it produce grayscale :)
The src variable is a pointer to an unsigned char array.
__m128i vecSrc = _mm_loadu_si128((__m128i*) &src[srcIndex]);
__m128i maskR = _mm_setr_epi16(1, 0, 0, 0, 1, 0, 0, 0);
__m128i maskG = _mm_setr_epi16(0, 1, 0, 0, 0, 1, 0, 0);
__m128i maskB = _mm_setr_epi16(0, 0, 1, 0, 0, 0, 1, 0);
__m128i maskA = _mm_setr_epi16(0, 0, 0, 1, 0, 0, 0, 1);
// Creating factors.
const __m128i factorR = _mm_set1_epi16((short)(0.29891 * 0x10000)); //8 coefficients - R scale factor.
const __m128i factorG = _mm_set1_epi16((short)(0.58661 * 0x10000)); //8 coefficients - G scale factor.
const __m128i factorB = _mm_set1_epi16((short)(0.11448 * 0x10000)); //8 coefficients - B scale factor.
__m128i zero = _mm_setzero_si128();
// Widen the low and high 8 bytes of the src register to 16-bit lanes.
__m128i vectSrcLowInHighPart = _mm_cvtepu8_epi16(vecSrc);
__m128i vectSrcHighInHighPart = _mm_unpackhi_epi8(vecSrc, zero);
// Multiply the low half of the widened source by the channel masks, keeping the low 16 bits. This separates each channel (in two parts, L and H).
__m128i vecR_L = _mm_mullo_epi16(vectSrcLowInHighPart, maskR);
__m128i vecG_L = _mm_mullo_epi16(vectSrcLowInHighPart, maskG);
__m128i vecB_L = _mm_mullo_epi16(vectSrcLowInHighPart, maskB);
__m128i vecA_L = _mm_mullo_epi16(vectSrcLowInHighPart, maskA);
// Multiply the high half of the widened source by the channel masks, keeping the low 16 bits.
__m128i vecR_H = _mm_mullo_epi16(vectSrcHighInHighPart, maskR);
__m128i vecG_H = _mm_mullo_epi16(vectSrcHighInHighPart, maskG);
__m128i vecB_H = _mm_mullo_epi16(vectSrcHighInHighPart, maskB);
__m128i vecA_H = _mm_mullo_epi16(vectSrcHighInHighPart, maskA);
// Lower and high masks using to packing.
__m128i maskLo = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 14, 12, 10, 8, 6, 4, 2, 0);
__m128i maskHi = _mm_set_epi8(14, 12, 10, 8, 6, 4, 2, 0, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
// Pack the high and low halves back into one 16 x 8-bit register for each channel.
__m128i R = _mm_or_si128(_mm_shuffle_epi8(vecR_L, maskLo), _mm_shuffle_epi8(vecR_H, maskHi));
__m128i G = _mm_or_si128(_mm_shuffle_epi8(vecG_L, maskLo), _mm_shuffle_epi8(vecG_H, maskHi));
__m128i B = _mm_or_si128(_mm_shuffle_epi8(vecB_L, maskLo), _mm_shuffle_epi8(vecB_H, maskHi));
__m128i A = _mm_or_si128(_mm_shuffle_epi8(vecA_L, maskLo), _mm_shuffle_epi8(vecA_H, maskHi));
// Add all sub-vectors to get one 128-bit result vector with all the edited channels.
__m128i resultVect = _mm_add_epi8(_mm_add_epi8(R, G), _mm_add_epi8(B, A));
// Put result vector into array to return as src pointer.
_mm_storel_epi64((__m128i*)&src[srcIndex], resultVect);
Thanks for your help! This is my first program using SIMD (SSE) instructions.
Based on the comments to my question, I created a solution, and also a project where I learned how the registers actually work when using SSE instructions.
// Function displaying a register as 16 x uint8 values, plus a message.
void printRegister(__m128i registerToprint, const string &msg) {
unsigned char tab_debug[16] = { 0 };
unsigned char *dest = tab_debug;
_mm_storeu_si128((__m128i*)&dest[0], registerToprint); // unaligned store: the local array isn't guaranteed to be 16-byte aligned
cout << msg << endl;
cout << "\/\/\/\/ LO \/\/\/\/" << endl;
for (int i = 0; i < 16; i++)
cout << dec << (unsigned int)dest[i] << endl;
cout << "/\/\/\/\ HI /\/\/\/" << endl;
}
int main()
{
// Example array filling one 128-bit register with 16 x uint8. It represents 4 pixels with channels in BGRA order.
unsigned char tab[] = { 100,200,250,255, 101,201,251,255, 102,202,252,255, 103,203,253,255 };
// A pointer to source tab for simulate dll parameters reference.
unsigned char *src = tab;
// Start index into the src tab.
int srcIndex = 0;
// How to encode the float coefficients as 16-bit fixed-point integers.
const __m128i r_coef = _mm_set1_epi16((short)(0.2989*32768.0 + 0.5));
const __m128i g_coef = _mm_set1_epi16((short)(0.5870*32768.0 + 0.5));
const __m128i b_coef = _mm_set1_epi16((short)(0.1140*32768.0 + 0.5));
// vecSrc - source vector (BGRA BGRA BGRA BGRA).
// Load data from tab[] into a 128-bit register, starting from the address in pointer src. (From index 0, so load all 16 x 8-bit elements.)
__m128i vecSrc = _mm_loadu_si128((__m128i*) &src[srcIndex]);
// Shuffle to configuration A0A1A2A3_R0R1R2R3_G0G1G2G3_B0B1B2B3
// Note: _mm_set_epi8 lists elements from the highest (left) to the lowest (right); the rightmost index 0 picks source byte 0 (B0) into the lowest result byte.
__m128i shuffleMask = _mm_set_epi8(15, 11, 7, 3, 14, 10, 6, 2, 13, 9, 5, 1, 12, 8, 4, 0);
__m128i AAAA_R0RRR_G0GGG_B0BBB = _mm_shuffle_epi8(vecSrc, shuffleMask);
// Put B0BBB in lower part.
__m128i B0_XXX = _mm_slli_si128(AAAA_R0RRR_G0GGG_B0BBB, 12);
__m128i XXX_B0 = _mm_srli_si128(B0_XXX, 12);
// Put G0GGG in Lower part.
__m128i G0_B_XX = _mm_slli_si128(AAAA_R0RRR_G0GGG_B0BBB, 8);
__m128i XXX_G0 = _mm_srli_si128(G0_B_XX, 12);
// Put R0RRR in Lower part.
__m128i R0_G_XX = _mm_slli_si128(AAAA_R0RRR_G0GGG_B0BBB, 4);
__m128i XXX_R0 = _mm_srli_si128(R0_G_XX, 12);
// Unpack uint8 elements to uint16 elements.
// The sequence in uInt8 is like (Hi) XXXX XXXX XXXX XXXX (Lo) where X represent uInt8.
// In uInt16 is like (Hi) X_X_ X_X_ X_X_ X_X_ (Lo)
__m128i B0BBB = _mm_cvtepu8_epi16(XXX_B0);
__m128i G0GGG = _mm_cvtepu8_epi16(XXX_G0);
__m128i R0RRR = _mm_cvtepu8_epi16(XXX_R0);
// Multiply epi16 registers.
__m128i B0BBB_mul = _mm_mulhrs_epi16(B0BBB, b_coef);
__m128i G0GGG_mul = _mm_mulhrs_epi16(G0GGG, g_coef);
__m128i R0RRR_mul = _mm_mulhrs_epi16(R0RRR, r_coef);
__m128i BGR_gray = _mm_add_epi16(_mm_add_epi16(B0BBB_mul, G0GGG_mul), R0RRR_mul);
__m128i grayMsk = _mm_setr_epi8(0, 0, 0, 0, 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6);
__m128i vectGray = _mm_shuffle_epi8(BGR_gray, grayMsk);
printRegister(vectGray, "Gray");
}
How it works
The unsigned char tab[] contains 16 uint8 elements to fill one 128-bit register. This array simulates 4 pixels whose channels are in BGRA order.
void printRegister(__m128i registerToprint, const string &msg);
This function is used to print the register passed as a parameter to the console as decimal values.
If someone wants to test it, the full project is available on GitHub: Full project demo gitHub
I hope all the comments are valid; if not, please correct me :) Thanks for the support.
I don't see the error in my code. I'm trying to compare a buffer of unsigned char values to a constant, and then store 1 or 0 depending on the comparison. Here is my code (an operator() in a struct):
void operator()(const uint8* src, int32 swidth, int32 sheight, uint8* dst, uint8 value) {
uint8 t[16];
__m128i v_one = _mm_set1_epi8((uint8)1);
__m128i v_value = _mm_set1_epi8(value);
printf("value = %d\n", value);
SHOW(t, v_one);
SHOW(t, v_value);
std::cout << "****" << std::endl;
for (int32 i = 0; i < sheight; ++i) {
const uint8* sdata = src + i * swidth;
uint8* ddata = dst + i * swidth;
int32 j = 0;
for ( ; j <= swidth - 16; j += 16) {
__m128i s = _mm_load_si128((const __m128i*)(sdata + j));
__m128i mask = _mm_cmpgt_epi8(s, v_value);
SHOW(t, s);
SHOW(t, mask);
std::cout << std::endl;
}
}
}
My first lines are what I would expect:
value = 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
But then my comparisons are wrong:
214 100 199 203 232 50 85 195 70 141 121 160 93 130 242 233
0 0 0 0 0 0 0 0 0 0 255 0 0 0 0 0
And I really don't get where the mistakes are.
The SHOW macro is:
#define SHOW(t, r) \
_mm_storeu_si128((__m128i*)t, r); \
printf("%3d", (int32)t[0]); \
for (int32 k = 1; k < 16; ++k) \
printf(" %3d", (int32)t[k]); \
printf("\n")
You are comparing the elements in your s array with your value array.
All the values in the value array are 100.
You have a mix of values in your s array.
However, _mm_cmpgt_epi8 works on signed values and as these are bytes it considers values from -128 to +127.
So the only possible values that are > 100 are values in the range 101 to 127.
As you've only got 1 value in that range (121), that's the only one which has its mask set.
To see this, change uint8 t[16]; to int8 t[16]; and you should get a more expected result.
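If you actually need an unsigned comparison and a 0/1 result, here is a hedged sketch using the usual sign-bias trick (flip the top bit of both operands, then the signed compare orders them as unsigned), reusing the variables from the question:
__m128i signbit = _mm_set1_epi8((char)0x80);
__m128i mask_u  = _mm_cmpgt_epi8(_mm_xor_si128(s, signbit), _mm_xor_si128(v_value, signbit));
__m128i ones    = _mm_and_si128(mask_u, v_one);         // 1 where sdata[j + k] > value, else 0
_mm_storeu_si128((__m128i*)(ddata + j), ones);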
I have to translate the following instructions from SSE to Neon
uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a,SHUFFLE_MASK) );
Where:
static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1);
So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and put them into a uint32_t. It looks like a packing operation (in SSE I seem to remember I used shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions).
How does this operation translate to NEON? Should I use packing instructions? How do I then extract the 32 bits? (Is there anything equivalent to _mm_cvtsi128_si32?)
Edit:
To start with, vgetq_lane_u32 should allow replacing _mm_cvtsi128_si32
(but I will have to cast my uint8x16_t to uint32x4_t)
uint32_t vgetq_lane_u32(uint32x4_t vec, __constrange(0,3) int lane);
or directly store the lane vst1q_lane_u32
void vst1q_lane_u32(__transfersize(1) uint32_t * ptr, uint32x4_t val, __constrange(0,3) int lane); // VST1.32 {d0[0]}, [r0]
I found this excellent guide.
I am working on it; it seems that my operation could be done with one VTBL instruction (table lookup), but I will implement it with 2 deinterleaving operations because for the moment that looks simpler.
uint8x8x2_t vuzp_u8(uint8x8_t a, uint8x8_t b);
So something like:
uint8x16_t a;
uint8_t* out;
[...]
//a = 138 0 0 0 140 0 0 0 146 0 0 0 147 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 0 140 0 146 0 147 0 0 0 0 0 0 0 0 0
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
//a = 138 140 146 147 0 0 0 0 0 0 0 0 0 0 0 0
vst1q_lane_u32(out,a,0);
The last one does not give a warning when using __attribute__((optimize("lax-vector-conversions")))
But, because of data conversion, the 2 assignments are not possible. One workaround is like this (Edit: This breaks strict aliasing rules! The compiler could assume that a does not change while assigning the address of d.):
uint8x8x2_t* d = (uint8x8x2_t*) &a;
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
*d = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
I have implemented a more general workaround through a flexible data type:
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
a = vuzp_u8(vget_low_u8(a), vget_high_u8(a) );
vst1q_lane_u32(out,a,0);
Edit:
Here is the version with the shuffle mask / lookup table. It does indeed make my inner loop a little bit faster. Again, I have used the data type described here.
static const uint8x8_t MASK = {0x00,0x04,0x08,0x0C,0xff,0xff,0xff,0xff};
NeonVectorType<uint8x16_t> a; //a can be used as a uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
NeonVectorType<uint8x8_t> res; //res can be used as uint8x8_t, uint32x2_t, etc.
[...]
res = vtbl2_u8(a, MASK);
vst1_lane_u32(out,res,0);
I would write it like so:
uint32_t extract (uint8x16_t x)
{
uint8x8x2_t a = vuzp_u8 (vget_low_u8 (x), vget_high_u8 (x));
uint8x8x2_t b = vuzp_u8 (a.val[0], a.val[1]);
return vget_lane_u32 (vreinterpret_u32_u8 (b.val[0]), 0);
}
Which on a recent GCC version compiles to:
extract:
vuzp.8 d0, d1
vuzp.8 d0, d1
vmov.32 r0, d0[0]
bx lr
I have been using the algorithm from Microsoft here:
INT iWidth = bitmap.GetWidth();
INT iHeight = bitmap.GetHeight();
Color color, colorTemp;
for(INT iRow = 0; iRow < iHeight; iRow++)
{
for(INT iColumn = 0; iColumn < iWidth; iColumn++)
{
bitmap.GetPixel(iColumn, iRow, &color);
colorTemp.SetValue(color.MakeARGB(
(BYTE)(255 * iColumn / iWidth),
color.GetRed(),
color.GetGreen(),
color.GetBlue()));
bitmap.SetPixel(iColumn, iRow, colorTemp);
}
}
to create a gradient alpha blend. Theirs goes left to right, I need one going from bottom to top, so I changed their line
(BYTE)(255 * iColumn / iWidth)
to
(BYTE)(255 - ((iRow * 255) / iHeight))
This makes row 0 have alpha 255, through to the last row having alpha 8.
How can I alter the calculation to make the alpha go from 255 to 0 (instead of 255 to 8)?
f(x) = 255 * (x - 8) / (255 - 8)?
Where x is in [8, 255] and f(x) is in [0, 255]
The original problem is probably related to the fact that if you have a width of 100 and you iterate over the horizontal pixels, you only get values 0 to 99. So dividing 99 by 100 is never 1. What you need is something like 255*(column+1)/width
(BYTE)( 255 - 255 * iRow / (iHeight-1) )
iRow is between 0 and (iHeight-1), so if we want a value between 0 and 1 we need to divide by (iHeight-1). We actually want a value between 0 and 255, so we just scale up by 255. Finally we want to start at the maximum and descend to the minimum, so we just subtract the value from 255.
At the endpoints:
iRow = 0
255 - 255 * 0 / (iHeight-1) = 255
iRow = (iHeight-1)
255 - 255 * (iHeight-1) / (iHeight-1) = 255 - 255 * 1 = 0
Note that iHeight must be greater than or equal to 2 for this to work (you'll get a divide by zero if it is 1).
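For concreteness, here is a sketch of the questioner's loop with this alpha calculation hoisted out of the inner loop (my rearrangement, assuming iHeight >= 2):
for (INT iRow = 0; iRow < iHeight; iRow++)
{
    BYTE alpha = (BYTE)(255 - 255 * iRow / (iHeight - 1)); // 255 at the top row, 0 at the last row
    for (INT iColumn = 0; iColumn < iWidth; iColumn++)
    {
        bitmap.GetPixel(iColumn, iRow, &color);
        colorTemp.SetValue(color.MakeARGB(alpha, color.GetRed(), color.GetGreen(), color.GetBlue()));
        bitmap.SetPixel(iColumn, iRow, colorTemp);
    }
}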
Edit:
This will cause only the last row to have an alpha value of 0. You can get a more even distribution of alpha values with
(BYTE)( 255 - 256 * iRow / iHeight )
however, if iHeight is less than 256 the last row won't have an alpha value of 0 (e.g. with iHeight = 100, the last row gets 255 - 256*99/100 = 255 - 253 = 2).
Try using one of the following calculations (they give the same result):
(BYTE)(255 - (iRow * 256 - 1) / (iHeight - 1))
(BYTE)(((iHeight - 1 - iRow) * 256 - 1) / (iHeight - 1))
This will only work if using signed division (you use the type INT which seems to be the same as int, so it should work).
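A quick hedged check of the endpoint behaviour of both expressions (my test, assuming signed truncating division and an arbitrary iHeight of 100):
#include <cstdio>

int main()
{
    const int iHeight = 100;
    const int rows[2] = { 0, iHeight - 1 };
    for (int k = 0; k < 2; k++) {
        int iRow = rows[k];
        int a = 255 - (iRow * 256 - 1) / (iHeight - 1);
        int b = ((iHeight - 1 - iRow) * 256 - 1) / (iHeight - 1);
        std::printf("iRow = %2d: %3d %3d\n", iRow, a, b); // both give 255 for the top row and 0 for the last row
    }
    return 0;
}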