Compute norm between two integers interpreted as 4 bytes - c++

I would like to write a function norm2 which computes
uint32_t norm2(uint32_t a, uint32_t b) {
    return sqd( a & 0x000000FF      ,  b & 0x000000FF      )
         + sqd((a & 0x0000FF00) >>  8, (b & 0x0000FF00) >>  8)
         + sqd((a & 0x00FF0000) >> 16, (b & 0x00FF0000) >> 16)
         + sqd((a & 0xFF000000) >> 24, (b & 0xFF000000) >> 24);
}

uint32_t sqd(uint32_t a, uint32_t b) {
    uint32_t x = (a > b) ? a - b : b - a;
    return x * x;
}
What is the fastest way to do so under GCC? For example using assembler, SSE or similar.

Very simple to do the whole thing in a few instructions using SSE:
#include <immintrin.h>
#include <stdint.h>

uint32_t norm2(uint32_t a, uint32_t b) {
    const __m128i vec_zero = _mm_setzero_si128();
    __m128i vec_a = _mm_unpacklo_epi8(_mm_cvtsi32_si128(a), vec_zero);
    __m128i vec_b = _mm_unpacklo_epi8(_mm_cvtsi32_si128(b), vec_zero);
    __m128i vec_diff = _mm_sub_epi16(vec_a, vec_b);
    __m128i vec_dsq = _mm_madd_epi16(vec_diff, vec_diff);
    return _mm_cvtsi128_si32(_mm_hadd_epi32(vec_dsq, vec_dsq));
}
What we’re doing here is “unpacking” both a and b with a zero vector to expand the individual bytes into vectors of 16-bit integers. We then subtract them (as 16-bit integers, avoiding risk of overflow), and multiply and accumulate them (as 32-bit integers, again avoiding risk of overflow).
I don’t have GCC installed to test with, but the above generates near-optimal assembly with clang; it shouldn’t be necessary to drop into assembly for such a simple task.
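As a quick sanity check (my own harness, not part of the original answer; it assumes the SSE norm2 above is in scope), you can compare it against the scalar version from the question. Note that _mm_hadd_epi32 is SSSE3, so build with -mssse3 or higher on GCC:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

// scalar reference, same math as the question's code
static uint32_t sqd_ref(uint32_t a, uint32_t b) {
    uint32_t x = (a > b) ? a - b : b - a;
    return x * x;
}

static uint32_t norm2_ref(uint32_t a, uint32_t b) {
    return sqd_ref( a        & 0xFF,  b        & 0xFF)
         + sqd_ref((a >>  8) & 0xFF, (b >>  8) & 0xFF)
         + sqd_ref((a >> 16) & 0xFF, (b >> 16) & 0xFF)
         + sqd_ref( a >> 24        ,  b >> 24        );
}

int main(void) {
    uint32_t a = 0x01FE7F80u, b = 0x80017F00u;
    // both calls should print the same value
    printf("scalar=%u simd=%u\n", norm2_ref(a, b), norm2(a, b));
    return 0;
}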

If you can read a and b in sets of 4, this can be done most cleanly/elegantly/efficiently by operating on 4-tuples, because that more fully saturates some of the instructions, so that every part of the computation contributes to the result. The solution below uses instructions up to SSSE3. Of course you will be better off pulling this out of the function, initializing the constants up front, and finding the most efficient way to get the values into the __m128i registers, depending on how the surrounding code is structured.
// a, b, and out must all point to 4 integers
void norm2x4(const unsigned *a, const unsigned *b, unsigned *out) {
    // load up registers a and b; in practice this should probably not be in a function,
    // initialization of zero can happen outside of a loop,
    // and a and b can be loaded directly from memory into __m128i registers
    __m128i const zero = _mm_setzero_si128();
    __m128i alo = _mm_loadu_si128((__m128i*)a); // this can also be adapted to aligned read instructions if you ensure an aligned buffer
    __m128i blo = _mm_loadu_si128((__m128i*)b);
    // everything is already in the register where we need it, except that it
    // needs to be expanded to 2-byte ints for the computations to work correctly
    __m128i ahi = _mm_unpackhi_epi8(alo, zero);
    __m128i bhi = _mm_unpackhi_epi8(blo, zero);
    alo = _mm_unpacklo_epi8(alo, zero);
    blo = _mm_unpacklo_epi8(blo, zero);
    alo = _mm_sub_epi16(alo, blo);  // don't care if a - b or b - a; the "wrong" one will result in a
    ahi = _mm_sub_epi16(ahi, bhi);  // negation the square will later correct
    alo = _mm_madd_epi16(alo, alo); // perform the square, and add every two adjacent
    ahi = _mm_madd_epi16(ahi, ahi);
    alo = _mm_hadd_epi32(alo, ahi); // add horizontal elements; alo now contains the 4 ints which are your results
    // store the result to output; this can be adapted to an aligned store if you ensure an aligned buffer,
    // or the individual values can be extracted directly to 32-bit registers using _mm_extract_epi32
    _mm_storeu_si128((__m128i*)out, alo);
}
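For example, a caller working through two rows of pixels four at a time might look like this (a sketch; the buffer names and the leftover handling are my own assumptions):

#include <stdint.h>

// Hypothetical driver: squared distances for a whole row, 4 pixels per call.
void norm2_row(const unsigned *rowA, const unsigned *rowB, unsigned *dist, unsigned n)
{
    unsigned i = 0;
    for (; i + 4 <= n; i += 4)
        norm2x4(rowA + i, rowB + i, dist + i);
    // any remaining 1-3 pixels can be handled with the scalar norm2 from the question
}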

A branchless version (as square(-x) == square(x)):

uint32_t sqd(int32_t a, int32_t b) {
    int32_t x = a - b;
    return x * x;
}

uint32_t norm2(uint32_t a, uint32_t b) {
    return sqd( a & 0x000000FF      ,  b & 0x000000FF      )
         + sqd((a & 0x0000FF00) >>  8, (b & 0x0000FF00) >>  8)
         + sqd((a & 0x00FF0000) >> 16, (b & 0x00FF0000) >> 16)
         + sqd((a & 0xFF000000) >> 24, (b & 0xFF000000) >> 24);
}

Related

Implement a function that blends two colors encoded with RGB565 using Alpha blending

I am trying to implement a function that blends two colors encoded as RGB565 using alpha blending:
C_rgb565 = (1 - a) * A_rgb565 + a * B_rgb565
where a is the alpha parameter, and the alpha blending value of 0.0-1.0 is mapped to an unsigned char value in the range 0-32.
We can choose to use a five-bit representation for a instead, restricting it to the range 0-31 (effectively mapping to an alpha blending value of 0.0-0.96875).
Below is the code I am trying to implement. Can you please suggest a better way with respect to fewer temporary variables and memory optimization (the number of multiplications and required memory accesses)? Is my logic for the alpha blending correct? I am not getting the correct/expected output, so it seems like I am missing something; please review the code. Every suggestion is appreciated. I have some doubts about the alpha parameter, which I have put in the code comments. Is there any way of shortening the alpha blending equations (the division operation)?
=====================================================
unsigned short blend_rgb565(unsigned short A, unsigned short B, unsigned char Alpha)
{
    unsigned short res = 0;
    // Alpha converted from [0..255] to [0..31] (8 bit to 5 bit)
    /* I want the alpha parameter in 0-32; do I need to add something to Alpha before the right shift?? */
    Alpha = Alpha >> 3;
    // Split image A into R, G, B components
    /* Do I need to take these as unsigned short, or will uint8_t also work fine?? */
    unsigned short A_r = A >> 11;
    unsigned short A_g = (A >> 5) & ((1u << 6) - 1); // ((1u << 6) - 1) --> 00000000 00111111
    unsigned short A_b = A & ((1u << 5) - 1);        // ((1u << 5) - 1) --> 00000000 00011111
    // Split image B into R, G, B components
    unsigned short B_r = B >> 11;
    unsigned short B_g = (B >> 5) & ((1u << 6) - 1);
    unsigned short B_b = B & ((1u << 5) - 1);
    // Alpha blend components
    /* Do I need to use 255 (8 bit) instead of 32 (5 bit)? Why are we dividing by it? I took the reference from the internet but need a little more clarification ?? */
    unsigned short uiC_r = (A_r * Alpha + B_r * (32 - Alpha)) / 32;
    unsigned short uiC_g = (A_g * Alpha + B_g * (32 - Alpha)) / 32;
    unsigned short uiC_b = (A_b * Alpha + B_b * (32 - Alpha)) / 32;
    // Pack result
    res = (unsigned short)((uiC_r << 11) | (uiC_g << 5) | uiC_b);
    return res;
}
=====================
EDIT:
Adding method 2. Is this approach correct?
Method 2:
// rrrrrggggggbbbbb
#define RB_MASK     63519    // 0b1111100000011111     --> hex: F81F
#define G_MASK      2016     // 0b0000011111100000     --> hex: 07E0
#define RB_MUL_MASK 2032608  // 0b111110000001111100000 --> hex: 1F03E0
#define G_MUL_MASK  64512    // 0b000001111110000000000 --> hex: FC00

unsigned short blend_rgb565(unsigned short A, unsigned short B, unsigned char Alpha) {
    // Alpha converted from [0..255] to [0..31]
    Alpha = Alpha >> 3;
    uint8_t beta = 32 - Alpha;
    // Alpha + beta == 32, so each weighted channel sum fits within its widened field
    return (unsigned short)
    (
        (
            ( ( Alpha * (uint32_t)( A & RB_MASK ) + beta * (uint32_t)( B & RB_MASK )) & RB_MUL_MASK )
            |
            ( ( Alpha * ( A & G_MASK ) + beta * ( B & G_MASK )) & G_MUL_MASK )
        )
        >> 5 // removing the alpha component's 5 bits
    );
}
It's possible to reduce the multiplies from 6 to 2 if you space out the RGB values into 2 32-bit integers before multiplying:
unsigned short blend_rgb565(unsigned short A, unsigned short B, unsigned char Alpha)
{
    unsigned short res = 0;
    // Alpha converted from [0..255] to [0..31] (8 bit to 5 bit)
    Alpha = Alpha >> 3;
    // Alpha = (Alpha + (Alpha >> 5)) >> 3; // map from 0-255 to 0-32 (if Alpha is unsigned short or larger)

    // Space out A and B from RRRRRGGGGGGBBBBB to 00000RRRRR000000GGGGGG00000BBBBB
    // 31 = 11111 binary
    // 63 = 111111 binary
    unsigned int A32 = (unsigned int)A;
    unsigned int A_spaced = A32 & 31;        // B
    A_spaced |= (A32 & (63 << 5)) << 5;      // G
    A_spaced |= (A32 & (31 << 11)) << 11;    // R

    unsigned int B32 = (unsigned int)B;
    unsigned int B_spaced = B32 & 31;        // B
    B_spaced |= (B32 & (63 << 5)) << 5;      // G
    B_spaced |= (B32 & (31 << 11)) << 11;    // R

    // multiply and add the alpha to give a result RRRRRrrrrrGGGGGGgggggBBBBBbbbbb,
    // where RGB are the most significant bits we want to keep
    unsigned int C_spaced = (A_spaced * Alpha) + (B_spaced * (32 - Alpha));

    // remap back to RRRRRGGGGGGBBBBB
    res = (unsigned short)(((C_spaced >> 5) & 31) + ((C_spaced >> 10) & (63 << 5)) + ((C_spaced >> 16) & (31 << 11)));
    return res;
}
You need to profile this to see if it is actually faster; it assumes that the multiplications you save are slower than the extra bit manipulations you replace them with.
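Before profiling, it is also easy to confirm that the spaced-out version agrees bit-for-bit with the per-channel version above (my own harness; it assumes the two implementations have been given distinct names, say blend_rgb565_ref for the per-channel one and blend_rgb565_spaced for this one):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // black, white, pure red, pure green in RGB565
    const unsigned short pix[4] = { 0x0000, 0xFFFF, 0xF800, 0x07E0 };
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            for (int alpha = 0; alpha < 256; alpha++)
                if (blend_rgb565_ref(pix[a], pix[b], (unsigned char)alpha) !=
                    blend_rgb565_spaced(pix[a], pix[b], (unsigned char)alpha)) {
                    printf("mismatch: A=%04X B=%04X alpha=%d\n",
                           (unsigned)pix[a], (unsigned)pix[b], alpha);
                    return 1;
                }
    printf("all combinations match\n");
    return 0;
}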
can you please suggest better way wrt less temp variable
There is no advantage to removing temporary variables from the implementation. When you compile with optimizations turned on (e.g. -O2 or /O2), those temp variables will get optimized away.
Two adjustments I would make to your code:
Use uint16_t instead of unsigned short. For most platforms it won't matter, since sizeof(uint16_t) == sizeof(unsigned short), but it helps to be definitive.
There is no point in converting alpha from an 8-bit value to a 5-bit value. You'll get better accuracy with the blending if you let alpha have its full range.
Some of your bit-shifting looks weird. It might work, but I use a simpler approach.
Here's an adjustment to your implementation:
#include <stdint.h>

#define MAKE_RGB565(r, g, b) (((r) << 11) | ((g) << 5) | (b))

uint16_t blend_rgb565(uint16_t a, uint16_t b, uint8_t Alpha)
{
    const uint8_t invAlpha = 255 - Alpha;
    uint16_t A_r = a >> 11;
    uint16_t A_g = (a >> 5) & 0x3f;
    uint16_t A_b = a & 0x1f;
    uint16_t B_r = b >> 11;
    uint16_t B_g = (b >> 5) & 0x3f;
    uint16_t B_b = b & 0x1f;
    uint32_t C_r = (A_r * invAlpha + B_r * Alpha) / 255;
    uint32_t C_g = (A_g * invAlpha + B_g * Alpha) / 255;
    uint32_t C_b = (A_b * invAlpha + B_b * Alpha) / 255;
    return MAKE_RGB565(C_r, C_g, C_b);
}
But the bigger issue is that this function works on exactly one pair of pixel colors. If you are invoking this function across an entire image or pair of images, the overhead of the function call is going to be a major performance issue, even with compiler optimizations and inlining. So if you are calling this function row x col times, you should probably manually inline the code into the loop that enumerates every pixel of an image (or pair of images), as sketched below.
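For illustration, a manually inlined loop over a row might look something like this (my own sketch; the row/width names are assumptions, and it reuses the 255-based math from the function above):

#include <stdint.h>

// Hypothetical: blend two RGB565 rows into dst with one alpha for the whole row.
void blend_row_rgb565(const uint16_t *rowA, const uint16_t *rowB, uint16_t *dst,
                      int width, uint8_t Alpha)
{
    const uint8_t invAlpha = 255 - Alpha;
    for (int x = 0; x < width; ++x) {
        uint16_t a = rowA[x], b = rowB[x];
        uint32_t r  = ((a >> 11)         * invAlpha + (b >> 11)         * Alpha) / 255;
        uint32_t g  = (((a >> 5) & 0x3f) * invAlpha + ((b >> 5) & 0x3f) * Alpha) / 255;
        uint32_t bl = ((a & 0x1f)        * invAlpha + (b & 0x1f)        * Alpha) / 255;
        dst[x] = (uint16_t)((r << 11) | (g << 5) | bl);
    }
}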
In the same vein as #samgak's answer, you can implement this more efficiently on a 64-bit architecture by "post-masking", as follows:
rrrrrggggggbbbbb
Replicate to a long long (by shifting or mapping the long long to four shorts)
---------------- rrrrrggggggbbbbb rrrrrggggggbbbbb rrrrrggggggbbbbb
Mask out the useless bits
---------------- rrrrr----------- -----gggggg----- -----------bbbbb
Multiply by α
-----------rrrrr rrrrr----------- ggggggggggg----- ------bbbbbbbbbb
Mask out the low order bits
-----------rrrrr ---------------- gggggg---------- ------bbbbb-----
Pack
rrrrrggggggbbbbb
Another saving is possible by rewriting
(1 - α) X + α Y
as
X + α (Y - X)
(or X - α (X - Y) to avoid negatives). This spares a multiply (at the expense of a comparison).
Update:
The "saving" above cannot work because the negatives should be handled component-wise.

How to optimize blend by combining red and blue unsigned byte?

I have this function for RGB blending. What I'm trying to do is process red and blue together to reduce the number of operations.
Here's the original code:
#define REDMASK   (0xff0000)
#define GREENMASK (0x00ff00)
#define BLUEMASK  (0x0000ff)

typedef unsigned int Pixel;

inline Pixel AddBlend( Pixel a_Color1, Pixel a_Color2 )
{
    const unsigned int r = (a_Color1 & REDMASK)   + (a_Color2 & REDMASK);
    const unsigned int g = (a_Color1 & GREENMASK) + (a_Color2 & GREENMASK);
    const unsigned int b = (a_Color1 & BLUEMASK)  + (a_Color2 & BLUEMASK);
    const unsigned r1 = (r & REDMASK)   | (REDMASK   * (r >> 24));
    const unsigned g1 = (g & GREENMASK) | (GREENMASK * (g >> 16));
    const unsigned b1 = (b & BLUEMASK)  | (BLUEMASK  * (b >> 8));
    return (r1 + g1 + b1);
}
And here's what I got so far. My problem right now is that the colours are not blending correctly. What am I doing wrong here?
typedef unsigned int Pixel;

inline Pixel AddBlend( Pixel a_Color1, Pixel a_Color2 ){
    const unsigned int rb = ( ( a_Color1 & 0xff00ff ) + ( a_Color2 & 0xff00ff ) );
    const unsigned int g  = ( a_Color1 & GREENMASK ) + ( a_Color2 & GREENMASK );
    const unsigned rb1 = ( rb & 0xff00ff ) | ( 0xff00ff * ( rb >> 8 ));
    const unsigned g1  = ( g & GREENMASK ) | ( GREENMASK * ( g >> 16 ));
    return (rb1 + g1);
}
The (REDMASK * (r >> 24)) style of term in the original code handles clamping values that overflow. This works with one color part, but not two. You'll need to split it into two parts, one to handle the red overflow and one for the blue. Handling the overflow for red can be done as in the original, but the blue overflow needs a little adjustment so that it ignores any contribution from red:
BLUEMASK * ((rb & 0x100) >> 8)
This results in
const unsigned rb1 = (rb & 0xff00ff) | (REDMASK * (rb >> 24)) | (BLUEMASK * ((rb & 0x100) >> 8));
Combining two colors like this works because there is a gap between red and blue that the overflow can occupy (the green bits). If you tried this with red/green or green/blue the overflow for the part stored in the lower byte would collide with the value for the part stored in the higher byte.
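Putting the pieces together, the combined red/blue path would look something like this (a sketch assembled from the advice above, using the masks defined in the original snippet; not benchmarked here):

typedef unsigned int Pixel;

inline Pixel AddBlend( Pixel a_Color1, Pixel a_Color2 )
{
    const unsigned int rb = (a_Color1 & 0xff00ff) + (a_Color2 & 0xff00ff);
    const unsigned int g  = (a_Color1 & GREENMASK) + (a_Color2 & GREENMASK);
    // red carry lands in bit 24, blue carry in bit 8 (the gap left by green)
    const unsigned rb1 = (rb & 0xff00ff)
                       | (REDMASK  * (rb >> 24))
                       | (BLUEMASK * ((rb & 0x100) >> 8));
    const unsigned g1  = (g & GREENMASK) | (GREENMASK * (g >> 16));
    return (rb1 + g1);
}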

Shift masked bits to the lsb

When you AND some data with a mask, you get a result of the same size as the data/mask.
What I want to do is take the masked bits in the result (where there was a 1 in the mask) and shift them to the right so they are next to each other, so that I can then perform a CTZ (Count Trailing Zeroes) on them.
I didn't know what to call such a procedure, so Google has failed me. The operation should preferably not be a loop solution; this has to be as fast an operation as possible.
This operation is known as compress right. It is implemented as part of BMI2 as the PEXT instruction, available in Intel processors as of Haswell.
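If you can require BMI2, the whole operation is a single intrinsic call (a minimal sketch; build with BMI2 enabled, e.g. -mbmi2 on GCC):

#include <immintrin.h>
#include <stdint.h>

// Hardware compress right: gathers the bits of x selected by mask m into the low bits.
uint32_t compress_bmi2(uint32_t x, uint32_t m)
{
    return _pext_u32(x, m);
}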
Unfortunately, without hardware support it is a quite annoying operation. Of course there is the obvious solution of just moving the bits one by one in a loop; here is the one given by Hacker's Delight:
unsigned compress(unsigned x, unsigned m) {
    unsigned r, s, b;    // result, shift, mask bit
    r = 0;
    s = 0;
    do {
        b = m & 1;
        r = r | ((x & b) << s);
        s = s + b;
        x = x >> 1;
        m = m >> 1;
    } while (m != 0);
    return r;
}
But there is another way, also given by Hacker's Delight, which does less looping (the number of iterations is logarithmic in the number of bits) but does more work per iteration:
unsigned compress(unsigned x, unsigned m) {
    unsigned mk, mp, mv, t;
    int i;
    x = x & m;          // Clear irrelevant bits.
    mk = ~m << 1;       // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);             // Parallel prefix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;                     // Bits to move.
        m = m ^ mv | (mv >> (1 << i));   // Compress m.
        t = x & mv;
        x = x ^ t | (t >> (1 << i));     // Compress x.
        mk = mk & ~mp;
    }
    return x;
}
Notice that a lot of the values there depend only on m. Since you only have 512 different masks, you could precompute those and simplify the code to something like this (not tested):
unsigned compress(unsigned x, int maskindex) {
    unsigned t;
    int i;
    x = x & masks[maskindex][0];
    for (i = 0; i < 5; i++) {
        t = x & masks[maskindex][i + 1];
        x = x ^ t | (t >> (1 << i));
    }
    return x;
}
Of course all of these can be turned into "not a loop" by unrolling; the second and third ways are probably more suitable for that. That's a bit of a cheat, however.
You can use the pack-by-multiplication technique similar to the one described here. This way you don't need any loop and can mix the bits in any order.
For example, with the mask 0b10101001 == 0xA9 as above and 8-bit data abcdefgh (where a-h are the 8 bits), you can use the expression below to get 0000aceh:
uint8_t compress_maskA9(uint8_t x)
{
    const uint8_t mask1 = 0xA9 & 0xF0;
    const uint8_t mask2 = 0xA9 & 0x0F;
    return (((x & mask1)*0x03000000 >> 28) & 0x0C) | ((x & mask2)*0x50000000 >> 30);
}
In this specific case there are some overlaps of the 4 bits while adding during the multiplication step (which would incur an unexpected carry), so I've split them into 2 parts: the first one extracts bits a and c, then e and h are extracted in the latter part. There are other ways to split the bits as well, like a & h then c & e. You can see the results compared to Harold's function live on ideone.
An alternate way with only one multiplication:
const uint32_t X = (x << 8) | x;
return (X & 0x8821)*0x12050000 >> 28;
I got this by duplicating the bits so that they're spaced out farther, leaving enough space to avoid the carry. This is often better than splitting into 2 multiplications.
If you want the result's bits reversed (i.e. heca0000) you can easily change the magic numbers accordingly
// result: he00 | 00ca;
return (((x & 0x09)*0x88000000 >> 28) & 0x0C) | (((x & 0xA0)*0x04800000) >> 30);
Or you can extract the 3 bits e, c and a at the same time, leaving h separate (as I mentioned above, there are often multiple solutions), and you need only one multiplication:
return ((x & 0xA8)*0x12400000 >> 29) | (x & 0x01) << 3; // result: 0eca | h000
But there might be a better alternative, like the second snippet above:
const uint32_t X = (x << 8) | x;
return (X & 0x2881)*0x80290000 >> 28;
Correctness check: http://ideone.com/PYUkty
For a larger number of masks you can precompute the magic numbers corresponding to those masks and store them in an array, so that you can look them up immediately for use. I calculated these masks by hand, but you can do that automatically; a quick exhaustive check against the loop-based compress above is sketched below.
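As a sanity check, a candidate magic number can be verified exhaustively against the loop-based compress shown earlier (my own harness; it assumes the first compress() and compress_maskA9() above are both in scope):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (unsigned x = 0; x < 256; x++) {
        uint8_t expect = (uint8_t)compress(x, 0xA9);   // reference loop version
        uint8_t got    = compress_maskA9((uint8_t)x);  // magic-number version
        if (expect != got) {
            printf("mismatch at 0x%02X: expected %02X, got %02X\n", x, expect, got);
            return 1;
        }
    }
    printf("all 256 inputs match\n");
    return 0;
}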
Explanation
We have abcdefgh & mask1 = a0c00000. Multiply it with magic1
........................a0c00000
× 00000011000000000000000000000000 (magic1 = 0x03000000)
────────────────────────────────
a0c00000........................
+ a0c00000......................... (the leading "a" bit is outside int's range
──────────────────────────────── so it'll be truncated)
r1 = acc.............................
=> (r1 >> 28) & 0x0C = 0000ac00
Similarly we multiply abcdefgh & mask2 = 0000e00h with magic2
........................0000e00h
× 01010000000000000000000000000000 (magic2 = 0x50000000)
────────────────────────────────
e00h............................
+ 0h..............................
────────────────────────────────
r2 = eh..............................
=> (r2 >> 30) = 000000eh
Combine them together we have the expected result
((r1 >> 28) & 0x0C) | (r2 >> 30) = 0000aceh
And here's the demo for the second snippet
abcdefghabcdefgh
& 1000100000100001 (0x8821)
────────────────────────────────
a000e00000c0000h
× 00010010000001010000000000000000 (0x12050000)
────────────────────────────────
000h
00e00000c0000h
+ 0c0000h
a000e00000c0000h
────────────────────────────────
= acehe0h0c0c00h0h
& 11110000000000000000000000000000
────────────────────────────────
= aceh
For the reversed order case:
abcdefghabcdefgh
& 0010100010000001 (0x2881)
────────────────────────────────
00c0e000a000000h
x 10000000001010010000000000000000 (0x80290000)
────────────────────────────────
000a000000h
00c0e000a000000h
+ 0e000a000000h
h
────────────────────────────────
hecaea00a0h0h00h
& 11110000000000000000000000000000
────────────────────────────────
= heca
Related:
How to create a byte out of 8 bool values (and vice versa)?
Redistribute least significant bits from a 4-byte array to a nibble

packing 10 bit values into a byte stream with SIMD [duplicate]

This question already has answers here: Keep only the 10 useful bits in 16-bit words (2 answers).
I'm trying to pack 10-bit pixels into a continuous byte stream using SIMD instructions. The code below does it "in principle", but the SIMD version is slower than the scalar version.
The problem seems to be that I can't find good gather/scatter operations that load the register efficiently.
Any suggestions for improvement?
// SIMD_test.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "Windows.h"
#include <tmmintrin.h>
#include <stdint.h>
#include <string.h>

// reference non-SIMD implementation that "works"
// 4 uint16 at a time as input, and 5 uint8 as output per loop iteration
void packSlow(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
    for(uint32_t j=0;j<NCOL;j+=4)
    {
        streamBuffer[0] = (uint8_t)(ptr[0]);
        streamBuffer[1] = (uint8_t)(((ptr[0]&0x3FF)>>8) | ((ptr[1]&0x3F) <<2));
        streamBuffer[2] = (uint8_t)(((ptr[1]&0x3FF)>>6) | ((ptr[2]&0x0F) <<4));
        streamBuffer[3] = (uint8_t)(((ptr[2]&0x3FF)>>4) | ((ptr[3]&0x03) <<6));
        streamBuffer[4] = (uint8_t)((ptr[3]&0x3FF)>>2) ;
        streamBuffer += 5;
        ptr += 4;
    }
}

// poorly written SIMD implementation. Attempts to do the same
// as the packSlow, but 8 iterations at a time
void packFast(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
    const __m128i maska = _mm_set_epi16(0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF);
    const __m128i maskb = _mm_set_epi16(0x3F,0x3F,0x3F,0x3F,0x3F,0x3F,0x3F,0x3F);
    const __m128i maskc = _mm_set_epi16(0x0F,0x0F,0x0F,0x0F,0x0F,0x0F,0x0F,0x0F);
    const __m128i maskd = _mm_set_epi16(0x03,0x03,0x03,0x03,0x03,0x03,0x03,0x03);
    for(uint32_t j=0;j<NCOL;j+=4*8)
    {
        _mm_prefetch((const char*)(ptr+j),_MM_HINT_T0);
    }
    for(uint32_t j=0;j<NCOL;j+=4*8)
    {
        // this "fetch" stage is costly. Each term takes 2 cycles
        __m128i ptr0 = _mm_set_epi16(ptr[0],ptr[4],ptr[8],ptr[12],ptr[16],ptr[20],ptr[24],ptr[28]);
        __m128i ptr1 = _mm_set_epi16(ptr[1],ptr[5],ptr[9],ptr[13],ptr[17],ptr[21],ptr[25],ptr[29]);
        __m128i ptr2 = _mm_set_epi16(ptr[2],ptr[6],ptr[10],ptr[14],ptr[18],ptr[22],ptr[26],ptr[30]);
        __m128i ptr3 = _mm_set_epi16(ptr[3],ptr[7],ptr[11],ptr[15],ptr[19],ptr[23],ptr[27],ptr[31]);
        // I think this part is fairly well optimized
        __m128i streamBuffer0 = ptr0;
        __m128i streamBuffer1 = _mm_or_si128(_mm_srl_epi16 (_mm_and_si128 (ptr0 , maska), _mm_set_epi32(0, 0, 0, 8)) , _mm_sll_epi16 (_mm_and_si128 (ptr1 , maskb) , _mm_set_epi32(0, 0, 0, 2)));
        __m128i streamBuffer2 = _mm_or_si128(_mm_srl_epi16 (_mm_and_si128 (ptr1 , maska), _mm_set_epi32(0, 0, 0, 6)) , _mm_sll_epi16 (_mm_and_si128 (ptr2 , maskc) , _mm_set_epi32(0, 0, 0, 4)));
        __m128i streamBuffer3 = _mm_or_si128(_mm_srl_epi16 (_mm_and_si128 (ptr2 , maska), _mm_set_epi32(0, 0, 0, 4)) , _mm_sll_epi16 (_mm_and_si128 (ptr3 , maskd) , _mm_set_epi32(0, 0, 0, 6)));
        __m128i streamBuffer4 = _mm_srl_epi16 (_mm_and_si128 (ptr3 , maska), _mm_set_epi32(0, 0, 0, 2)) ;
        // this again is terribly slow. ~2 cycles per byte output
        for(int j=15;j>=0;j-=2)
        {
            streamBuffer[0] = streamBuffer0.m128i_u8[j];
            streamBuffer[1] = streamBuffer1.m128i_u8[j];
            streamBuffer[2] = streamBuffer2.m128i_u8[j];
            streamBuffer[3] = streamBuffer3.m128i_u8[j];
            streamBuffer[4] = streamBuffer4.m128i_u8[j];
            streamBuffer += 5;
        }
        ptr += 32;
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    uint16_t pixels[512];
    uint8_t packed1[512*10/8];
    uint8_t packed2[512*10/8];
    for(int i=0;i<512;i++)
    {
        pixels[i] = i;
    }
    LARGE_INTEGER t0,t1,t2;
    QueryPerformanceCounter(&t0);
    for(int k=0;k<1000;k++) packSlow(pixels,packed1,512);
    QueryPerformanceCounter(&t1);
    for(int k=0;k<1000;k++) packFast(pixels,packed2,512);
    QueryPerformanceCounter(&t2);
    printf("%d %d\n",t1.QuadPart-t0.QuadPart,t2.QuadPart-t1.QuadPart);
    if (memcmp(packed1,packed2,sizeof(packed1)))
    {
        printf("failed\n");
    }
    return 0;
}
On re-reading your code, it looks like you are almost definitely murdering your load/store unit, which wouldn't even get complete relief with the new AVX2 VGATHER[D/Q]P[D/S] instruction family. Even Haswell's architecture still requires a uop per load element, each hitting the L1D TLB and cache, regardless of locality, with efficiency improvements showing in Skylake ca. 2016 at earliest.
Your best recourse at present is probably to do 16B register reads and manually construct your streamBuffer values with register copies, _mm_shuffle_epi8(), and _mm_or_si128() calls, and the inverse for the finishing stores.
In the near future, AVX2 will provide (and already does on newer desktops) VPS[LL/RL/RA]V[D/Q] instructions that allow variable per-element shifting, which, combined with a horizontal add, could do this packing pretty quickly. In this case, you could use simple MOVDQU instructions for loading your values, since you can process contiguous uint16_t input values in a single xmm register.
Also, consider reworking your prefetching. Your j-over-NCOL loop processes 64B (one cache line) per iteration, so you should probably do a single prefetch of ptr + 32 at the beginning of your second loop's body. You might even consider omitting it, since it's a simple forward scan that the hardware prefetcher will detect and handle for you after a very small number of iterations anyway.
I have no experience specifically in SSE. But I would have tried to optimize the code as follows.
// warning. This routine requires streamBuffer to have at least 3 extra spare bytes
// at the end to be used as scratch space. It will write 0's to those bytes.
// for example, streamBuffer needs to be 640+3 bytes of allocated memory if
// 512 10-bit samples are output.
void packSlow1(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
    for(uint32_t j=0;j<NCOL;j+=4*4)
    {
        uint64_t *dst;
        uint64_t src[4][4];

        // __m128i s01 = _mm_set_epi64(ptr[0], ptr[1]);
        // __m128i s23 = _mm_set_epi64(ptr[2], ptr[3]);
        // ---- or ----
        // __m128i s0123 = _mm_load_si128(ptr[0])
        // __m128i s01 = _?????_(s0123) // some instruction to extract s01 from s0123
        // __m128i s23 = _?????_(s0123) // some instruction to extract s23

        src[0][0] = ptr[0] & 0x3ff;
        src[0][1] = ptr[1] & 0x3ff;
        src[0][2] = ptr[2] & 0x3ff;
        src[0][3] = ptr[3] & 0x3ff;
        src[1][0] = ptr[4] & 0x3ff;
        src[1][1] = ptr[5] & 0x3ff;
        src[1][2] = ptr[6] & 0x3ff;
        src[1][3] = ptr[7] & 0x3ff;
        src[2][0] = ptr[8] & 0x3ff;
        src[2][1] = ptr[9] & 0x3ff;
        src[2][2] = ptr[10] & 0x3ff;
        src[2][3] = ptr[11] & 0x3ff;
        src[3][0] = ptr[12] & 0x3ff;
        src[3][1] = ptr[13] & 0x3ff;
        src[3][2] = ptr[14] & 0x3ff;
        src[3][3] = ptr[15] & 0x3ff;

        // looks like _mm_maskmoveu_si128 can store result efficiently
        dst = (uint64_t*)streamBuffer;
        dst[0] = src[0][0] | (src[0][1] << 10) | (src[0][2] << 20) | (src[0][3] << 30);
        dst = (uint64_t*)(streamBuffer + 5);
        dst[0] = src[1][0] | (src[1][1] << 10) | (src[1][2] << 20) | (src[1][3] << 30);
        dst = (uint64_t*)(streamBuffer + 10);
        dst[0] = src[2][0] | (src[2][1] << 10) | (src[2][2] << 20) | (src[2][3] << 30);
        dst = (uint64_t*)(streamBuffer + 15);
        dst[0] = src[3][0] | (src[3][1] << 10) | (src[3][2] << 20) | (src[3][3] << 30);

        streamBuffer += 5 * 4;
        ptr += 4 * 4;
    }
}
UPDATE:
Benchmarks:
Ubuntu 12.04, x86_64 GNU/Linux, gcc v4.6.3 (Virtual Box)
Intel Core i7 (Macbook pro)
compiled with -O3
5717633386 (1X): packSlow
3868744491 (1.4X): packSlow1 (version from the post)
4471858853 (1.2X): packFast2 (from Mark Lakata's post)
1820784764 (3.1X): packFast3 (version from the post)
Windows 8.1, x64, VS2012 Express
Intel Core i5 (Asus)
compiled with standard 'Release' options and SSE2 enabled
00413185 (1X) packSlow
00782005 (0.5X) packSlow1
00236639 (1.7X) packFast2
00148906 (2.8X) packFast3
I see completely different results on the Asus notebook with Windows 8.1 and VS Express 2012 (code compiled with -O2). packSlow1 is 2x slower than the original packSlow, while packFast2 is 1.7X (not 2.9X) faster than packSlow. After researching this problem, I understood the reason: the VC compiler was unable to keep all the constants in XMM registers for packFast2, so it inserted additional memory accesses into the loop (see the generated assembly). Slow memory access explains the performance degradation.
In order to get more stable results I increased the pixels buffer to 256x512 and increased the loop counter from 1000 to 10000000/256.
Here is my version of the SSE-optimized function.
// warning. This routine requires streamBuffer to have at least 3 extra spare bytes
// at the end to be used as scratch space. It will write 0's to those bytes.
// for example, streamBuffer needs to be 640+3 bytes of allocated memory if
// 512 10-bit samples are output.
void packFast3(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
    const __m128i m0 = _mm_set_epi16(0, 0x3FF, 0, 0x3FF, 0, 0x3FF, 0, 0x3FF);
    const __m128i m1 = _mm_set_epi16(0x3FF, 0, 0x3FF, 0, 0x3FF, 0, 0x3FF, 0);
    const __m128i m2 = _mm_set_epi32(0, 0xFFFFFFFF, 0, 0xFFFFFFFF);
    const __m128i m3 = _mm_set_epi32(0xFFFFFFFF, 0, 0xFFFFFFFF, 0);
    const __m128i m4 = _mm_set_epi32(0, 0, 0xFFFFFFFF, 0xFFFFFFFF);
    const __m128i m5 = _mm_set_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0, 0);
    __m128i s0, t0, r0, x0, x1;
    // unrolled and normal loop gives the same result
    for(uint32_t j=0;j<NCOL;j+=8)
    {
        // load 8 samples into s0
        s0 = _mm_loadu_si128((__m128i*)ptr);           // s0=00070006_00050004_00030002_00010000
        // join 16-bit samples into 32-bit words
        x0 = _mm_and_si128(s0, m0);                    // x0=00000006_00000004_00000002_00000000
        x1 = _mm_and_si128(s0, m1);                    // x1=00070000_00050000_00030000_00010000
        t0 = _mm_or_si128(x0, _mm_srli_epi32(x1, 6));  // t0=00001c06_00001404_00000c02_00000400
        // join 32-bit words into 64-bit dwords
        x0 = _mm_and_si128(t0, m2);                    // x0=00000000_00001404_00000000_00000400
        x1 = _mm_and_si128(t0, m3);                    // x1=00001c06_00000000_00000c02_00000000
        t0 = _mm_or_si128(x0, _mm_srli_epi64(x1, 12)); // t0=00000001_c0601404_00000000_c0200400
        // join 64-bit dwords
        x0 = _mm_and_si128(t0, m4);                    // x0=00000000_00000000_00000000_c0200400
        x1 = _mm_and_si128(t0, m5);                    // x1=00000001_c0601404_00000000_00000000
        r0 = _mm_or_si128(x0, _mm_srli_si128(x1, 3));  // r0=00000000_000001c0_60140400_c0200400
        // and store result
        _mm_storeu_si128((__m128i*)streamBuffer, r0);
        streamBuffer += 10;
        ptr += 8;
    }
}
I came up with a "better" solution using SIMD, but it doesn't not leverage parallelization, just more efficient loads and stores (I think).
I'm posting it here for reference, not necessarily the best answer.
The benchmarks are (in arbitrary ticks)
gcc4.8.1 -O3 VS2012 /O2 Implementation
-----------------------------------------
369 (1X) 3394 (1X) packSlow (original code)
212 (1.7X) 2010 (1.7X) packSlow (from #alexander)
147 (2.5X) 1178 (2.9X) packFast2 (below)
Here's the code. Essentially #alexander's code except using 128 bit registers instead of 64 bit registers, and unrolled 2x instead of 4x.
void packFast2(uint16_t* ptr, uint8_t* streamBuffer, uint32_t NCOL)
{
    const __m128i maska = _mm_set_epi16(0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF,0x3FF);
    const __m128i mask0 = _mm_set_epi16(0,0,0,0,0,0,0,0x3FF);
    const __m128i mask1 = _mm_set_epi16(0,0,0,0,0,0,0x3FF,0);
    const __m128i mask2 = _mm_set_epi16(0,0,0,0,0,0x3FF,0,0);
    const __m128i mask3 = _mm_set_epi16(0,0,0,0,0x3FF,0,0,0);
    const __m128i mask4 = _mm_set_epi16(0,0,0,0x3FF,0,0,0,0);
    const __m128i mask5 = _mm_set_epi16(0,0,0x3FF,0,0,0,0,0);
    const __m128i mask6 = _mm_set_epi16(0,0x3FF,0,0,0,0,0,0);
    const __m128i mask7 = _mm_set_epi16(0x3FF,0,0,0,0,0,0,0);
    for(uint32_t j=0;j<NCOL;j+=16)
    {
        __m128i s  = _mm_load_si128((__m128i*)ptr);     // load 8 16-bit values
        __m128i s2 = _mm_load_si128((__m128i*)(ptr+8)); // load 8 16-bit values

        __m128i a = _mm_and_si128(s,mask0);
        a = _mm_or_si128( a, _mm_srli_epi64 (_mm_and_si128(s, mask1),6));
        a = _mm_or_si128( a, _mm_srli_epi64 (_mm_and_si128(s, mask2),12));
        a = _mm_or_si128( a, _mm_srli_epi64 (_mm_and_si128(s, mask3),18));
        a = _mm_or_si128( a, _mm_srli_si128 (_mm_and_si128(s, mask4),24/8));                    // special: shift 24 bits to the right, straddling the middle; luckily just one whole-register byte shift (24/8)
        a = _mm_or_si128( a, _mm_srli_si128 (_mm_srli_epi64 (_mm_and_si128(s, mask5),6),24/8)); // special: shift a net 30 bits; first shift 6 bits, then 3 bytes
        a = _mm_or_si128( a, _mm_srli_si128 (_mm_srli_epi64 (_mm_and_si128(s, mask6),4),32/8)); // special: shift a net 36 bits; first shift 4 bits, then 4 bytes (32 bits)
        a = _mm_or_si128( a, _mm_srli_epi64 (_mm_and_si128(s, mask7),42));
        _mm_storeu_si128((__m128i*)streamBuffer, a);

        __m128i a2 = _mm_and_si128(s2,mask0);
        a2 = _mm_or_si128( a2, _mm_srli_epi64 (_mm_and_si128(s2, mask1),6));
        a2 = _mm_or_si128( a2, _mm_srli_epi64 (_mm_and_si128(s2, mask2),12));
        a2 = _mm_or_si128( a2, _mm_srli_epi64 (_mm_and_si128(s2, mask3),18));
        a2 = _mm_or_si128( a2, _mm_srli_si128 (_mm_and_si128(s2, mask4),24/8));                    // special: shift 24 bits to the right, straddling the middle; luckily just one whole-register byte shift (24/8)
        a2 = _mm_or_si128( a2, _mm_srli_si128 (_mm_srli_epi64 (_mm_and_si128(s2, mask5),6),24/8)); // special: shift a net 30 bits; first shift 6 bits, then 3 bytes
        a2 = _mm_or_si128( a2, _mm_srli_si128 (_mm_srli_epi64 (_mm_and_si128(s2, mask6),4),32/8)); // special: shift a net 36 bits; first shift 4 bits, then 4 bytes (32 bits)
        a2 = _mm_or_si128( a2, _mm_srli_epi64 (_mm_and_si128(s2, mask7),42));
        _mm_storeu_si128((__m128i*)(streamBuffer+10), a2);

        streamBuffer += 20;
        ptr += 16;
    }
}

Horizontal minimum and maximum using SSE

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time.
I have been using the following implementation for the minimum for instance:
static inline int16_t hMin(__m128i buffer) {
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m4));
    return ((int8_t*) ((void *) &buffer))[0];
}
I need to compute the minimum and the maximum of 16 1-byte integers, as you see.
Any good suggestions are highly appreciated :)
Thanks
SSE 4.1 has an instruction that does almost what you want. Its name is PHMINPOSUW, and the C/C++ intrinsic is _mm_minpos_epu16. It is limited to 16-bit unsigned values and cannot give the maximum, but these problems are easily solved:
If you need to find the minimum of non-negative bytes, do nothing. If the bytes may be negative, add 128 to each. If you need the maximum, subtract each from 127.
Use either _mm_srli_epi16 or _mm_shuffle_epi8, and then _mm_min_epu8 to get 8 pairwise minimum values in the even bytes and zeros in the odd bytes of some XMM register. (These zeros are produced by the shift/shuffle instruction and should remain in their places after _mm_min_epu8.)
Use _mm_minpos_epu16 to find the minimum among these values.
Extract the resulting minimum value with _mm_cvtsi128_si32.
Undo the effect of step 1 to get the original byte value.
Here is an example that returns maximum of 16 signed bytes:
static inline int16_t hMax(__m128i buffer)
{
    __m128i tmp1 = _mm_sub_epi8(_mm_set1_epi8(127), buffer);
    __m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
    __m128i tmp3 = _mm_minpos_epu16(tmp2);
    return (int8_t)(127 - _mm_cvtsi128_si32(tmp3));
}
I suggest two changes:
Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32.
Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16 which have lower latency on recent AMD processors and Intel Atom, and will save you memory load operations:
static inline int16_t hMin(__m128i buffer)
{
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
    buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
    return (int8_t)_mm_cvtsi128_si32(buffer);
}
Here's an implementation without shuffle; shuffle is slow on the AMD Ryzen 7 5000 series for some reason:
float max_elem3() const {
    __m128 a = _mm_unpacklo_ps(mm, mm);  // x x y y
    __m128 b = _mm_unpackhi_ps(mm, mm);  // z z w w
    __m128 c = _mm_max_ps(a, b);         // ..., max(x, z), ..., ...
    Vector4 res = _mm_max_ps(mm, c);     // ..., max(y, max(x, z)), ..., ...
    return res.y;
}

float min_elem3() const {
    __m128 a = _mm_unpacklo_ps(mm, mm);  // x x y y
    __m128 b = _mm_unpackhi_ps(mm, mm);  // z z w w
    __m128 c = _mm_min_ps(a, b);         // ..., min(x, z), ..., ...
    Vector4 res = _mm_min_ps(mm, c);     // ..., min(y, min(x, z)), ..., ...
    return res.y;
}