Fast conversion of 16-bit big-endian to little-endian in ARM

Fast conversion of 16-bit big-endian to little-endian in ARM - c++

I need to convert big arrays of 16-bit integer values from big-endian to little-endian format.
Now I use for conversion the following function:
inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
uint16_t value = *(uint16_t*)src;
*(uint16_t*)dst = value >> 8 | value << 8;
}
void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
assert(size%2 == 0);
for(size_t i = 0; i < size; i += 2)
Reorder16bit(src + i, dst + i);
}
I use GCC. Target platform is ARMv7 (Raspberry Phi 2B).
Is there any way to optimize it?
This conversion is needed for loading audio samples which can be as in little- endian as in big-endian format. Of course it is not a bottleneck now, but it takes about 10% of total processing time. And I think that is too much for such a simple operation.

If you want to improve performance of your code you can make following:
1) Processing of 4-bytes for one step:
inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
uint16_t value = *(uint16_t*)src;
*(uint16_t*)dst = value >> 8 | value << 8;
}
inline void Reorder16bit2(const uint8_t * src, uint8_t * dst)
{
uint32_t value = *(uint32_t*)src;
*(size_t*)dst = (value & 0xFF00FF00) >> 8 | (value & 0x00FF00FF) << 8;
}
void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
assert(size%2 == 0);
size_t alignedSize = size/4*4;
for(size_t i = 0; i < alignedSize; i += 4)
Reorder16bit2(src + i, dst + i);
for(size_t i = alignedSize; i < size; i += 2)
Reorder16bit(src + i, dst + i);
}
If you use a 64-bit platform, it is possible to process 8 bytes for one step the same way.
2) ARMv7 platform supports SIMD instructions called NEON.
With using of them you can make you code even faster then in 1):
inline void Reorder16bit(const uint8_t * src, uint8_t * dst)
{
uint16_t value = *(uint16_t*)src;
*(uint16_t*)dst = value >> 8 | value << 8;
}
inline void Reorder16bit8(const uint8_t * src, uint8_t * dst)
{
uint8x16_t _src = vld1q_u8(src);
vst1q_u8(dst, vrev16q_u8(_src));
}
void Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
assert(size%2 == 0);
size_t alignedSize = size/16*16;
for(size_t i = 0; i < alignedSize; i += 16)
Reorder16bit8(src + i, dst + i);
for(size_t i = alignedSize; i < size; i += 2)
Reorder16bit(src + i, dst + i);
}

https://goo.gl/4bRGNh
int swap(int b) {
return __builtin_bswap16(b);
}
becomes
swap(int):
rev16 r0, r0
uxth r0, r0
bx lr
So yours could be written as (gcc-explorer: https://goo.gl/HFLdMb)
void fast_Reorder16bit(const uint16_t * src, size_t size, uint16_t * dst)
{
assert(size%2 == 0);
for(size_t i = 0; i < size; i++)
dst[i] = __builtin_bswap16(src[i]);
}
which should make for loop
.L13:
ldrh r4, [r0, r3]
rev16 r4, r4
strh r4, [r2, r3] # movhi
adds r3, r3, #2
cmp r3, r1
bne .L13
read more about __builtin_bswap16 at GCC builtin docs.
Neon suggestion (kinda tested, gcc-explorer: https://goo.gl/fLNYuc):
void neon_Reorder16bit(const uint8_t * src, size_t size, uint8_t * dst)
{
assert(size%16 == 0);
//uint16x8_t vld1q_u16 (const uint16_t *)
//vrev64q_u16(uint16x8_t vec);
//void vst1q_u16 (uint16_t *, uint16x8_t)
for (size_t i = 0; i < size; i += 16)
vst1q_u8(dst + i, vrev16q_u8(vld1q_u8(src + i)));
}
which becomes
.L23:
adds r5, r0, r3
adds r4, r2, r3
adds r3, r3, #16
vld1.8 {d16-d17}, [r5]
cmp r1, r3
vrev16.8 q8, q8
vst1.8 {d16-d17}, [r4]
bhi .L23
See more about neon intrinsics here: https://gcc.gnu.org/onlinedocs/gcc-4.4.1/gcc/ARM-NEON-Intrinsics.html
Bonus from ARM ARM A8.8.386:
VREV16 (Vector Reverse in halfwords) reverses the order of 8-bit elements in each halfword of the vector, and places the result in the corresponding destination vector.
VREV32 (Vector Reverse in words) reverses the order of 8-bit or 16-bit elements in each word of the vector, and places the result in the corresponding destination vector.
VREV64 (Vector Reverse in doublewords) reverses the order of 8-bit, 16-bit, or 32-bit elements in each doubleword of the vector, and places the result in the corresponding destination vector.
There is no distinction between data types, other than size.

If it's specifically for ARM there's a REV instruction, specifically REV16 which would do two 16 bit ints at a time.

I don't know much about ARM instruction sets, but I guess there are some special instruction to endianess conversion. Apparently, ARMv7 has things like rev etc.
Have you tried the compiler intrinsic __builtin_bswap16? It should compile to CPU-specific code, e.g. rev on ARM. In addition, it helps the compiler to recognize that you are actually doing a byte-swap, and it performs other optimizations with that knowledge, e.g. eliminate redundant byte swaps entirely in cases like y=swap(x); y &= some_value; x = swap(y);.
I googled a little bit, and this thread discusses an issue with the optimization potential. According to this discussion, the compiler can also vectorize the conversion if the CPU supports the vrev NEON instruction.

You'd want to measure to see which is faster, but an alternative body for Reorder16bit would be
*(uint16_t*)dst = 256 * src[0] + src[1];
assuming that your native ints are little-endian. Another possibility:
dst[0] = src[1];
dst[1] = src[0];

Related

Matrix transpose and population count

I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column.
For instance for n=4:
1101
0101
0001
1001
M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };
result = { 2, 2, 0, 4};
I can obviously
transpose the matrix M into a matrix M'
popcount each row of M'.
Good algorithms exist for matrix transposition and popcounting through bit manipulation.
My question is: would it be possible to "merge" such algorithms into a single one ?
Note that N could be quite large (say 1024 and more) regarding 64 bits architecture.

Related: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 and https://github.com/mklarqvist/positional-popcount
I had another idea which I haven't finished writing up nicely.
Godbolt link to messy work-in-progress which doesn't have correct loop bounds / cleanup, but for large buffers runs ~3x faster than #edrezen's version on my Skylake i7-6700k, with g++7.3 -O3 -march=native. See the test_SWAR_avx2 function. (I know it doesn't compile on Godbolt; Agner Fog's asmlib.h isn't present.)
I might have some columns in the wrong order, too, but from stepping through the asm I think it's doing the right amount of work. i.e. any necessary bugfixes won't slow it down.
I used 16-bit accumulators, so another outer loop might be necessary if you care about inputs large enough to overflow 16-bit per-column counters.
Interesting observation: An earlier buggy version of my loop used sum0123 twice in store_globalsums_from_vec16, leaving sum4567 unused, so it optimized away in the main loop. With less work, gcc fully unrolled the large for(int i=0 ; i<5 ; i++) loop, and the code ran slower, like about 1 cycle per byte instead of 0.5. The loop was probably too big for the uop cache or something (I didn't profile yet but a front-end decode bottleneck would explain it). For some reason #edrezen's version is only running at about 1.5c/B for me, not the ~1.25 reported in the answer. My CPU is actually running 3.9GHz, but Agner Fog's library detects it at 4.0, but that's not enough to explain it.
Also, gcc spills sum4567_16bit to the stack, so we're already pushing the boundary of register pressure without AVX512. It's updated infrequently and isn't a problem, but needing more accumulators in the inner loop could be.
Your data layout isn't clear about when the number of columns isn't 32.
It seems that for each uint32_t chunk of 32 columns, you have all the rows stored contiguously in memory. i.e. looping over the rows for a column is efficient. If you had more than 32 columns, the rows for columns 32..63 will be contiguous and come after all the rows for columns 0..31.
(If instead you have all the columns for a single row contiguous, you could still use this idea, but might need to spill/reload some accumulators to memory, or let the compiler do that for you if it makes good choices.)
So loading a 32-byte (8 dword) vector gets 8 rows of data for one column chunk. That's extremely convenient, and allows widening from 1-bit (in memory) to 2-bit accumulators, then grab more data before we widen to 4-bit, and so on, summing along the way so we get significant work done while the data is still dense. (Rather than only adding 1 bit (0 or 1) per byte to vector accumulators.)
The more we unroll, the more data we can grab from memory to make better use of the coding space in our vectors. i.e. our variables have higher entropy. Throwing around more data (in terms of bits of memory that contributed to it) per vpaddb/w/d/q or unpack/shuffle instruction is a Good Thing.
Accumulators narrower than 1 byte within a SIMD vector is basically an https://en.wikipedia.org/wiki/SWAR technique, where you have to AND away bits that you shift past an element boundary, because we don't have SIMD element boundaries to do it for us. (And we avoid overflow anyway, so ADD carrying into the next element isn't a problem.)
Each inner loop iteration:
take a vector of data from the same columns in each of 2 or 3 (groups of) rows. So you either have 3 * 8 rows from one chunk of 32 columns, or 3 rows of 256 columns.
mask them with set1(0b01010101) to get the even (low) bits, and with (vec>>1) & mask (_mm256_srli_epi32(v,1)) to get the odd (high) bits. Use _mm256_add_epi8 to accumulate within those 2-bit accumulators. They can't overflow with only 3 ones, so carry-propagation boundaries don't actually matter.
Each byte of your vector has 4 separate vertical sums, and you have two vectors (odd/even).
Repeat the above again, to get another pair of vectors from 3 vectors of data from memory.
Combine again to get 4 vectors of 4-bit accumulators (with possible values 0..6). Still without mixing bits from within a single 32-bit element, of course, because we must never do that. Shifts only move bits for odd / high columns to the bottom of the 2-bit or 4-bit unit that contains them so they can be added with bits that were moved the same way in other vectors.
_mm256_unpacklo/hi_epi8 and mask or shift+mask to get 8-bit accumulators
Put the above in a loop that runs up to 5 times, so the 0..12 accumulator values go up to 0..60 (i.e. leaving 2 bits of headroom for unpacking the 8-bit accumulators, using all their coding space.)
If you have the data layout from your answer, then we can add data from dword elements within the same vector. We can do that so we don't run out of registers when widening our accumulators up to 16-bit (because x86-64 only has 16 YMM registers, and we need some for constants.)
_mm256_unpacklo/hi_epi16 and add, to interleave pairs of 8-bit counters so a group of counters for the same column has expanded from a dword to a qword.
Repeat this general idea to reduce the number of registers (or __m256i variables) your accumulators are spread over.
Efficiently handling the lack of a lane-crossing 2-input byte or word shuffle is inconvenient, but it's a pretty small part of the total work. vextracti128 / vpaddb xmm -> vpmovzxbw worked well enough.

I made some benchmark between the two approaches:
transpose + popcount
update row by row
I wrote a naive version and an AVX2 one for both approaches. I used some functions (found on stackoverflow or elsewhere) for the AVX2 "transpose+popcount" approach.
In my test, I make the assumption that the input is a nbRowsx32 matrix in a bits packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array of uint32_t.
The code is the following:
#include <cinttypes>
#include <cstdio>
#include <cstring>
#include <cmath>
#include <cassert>
#include <chrono>
#include <immintrin.h>
#include <asmlib.h>
using namespace std;
using namespace std::chrono;
// see https://stackoverflow.com/questions/24225786/fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector
static __m256i expand_bits_to_bytes (uint32_t x);
// see https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/
static void sse_trans(char const *inp, char *out);
static double deviation (double n, double sum2, double sum);
////////////////////////////////////////////////////////////////////////////////
// Naive approach (matrix transposition)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint8_t transpo[32][32]; memset (transpo, 0, sizeof(transpo));
for (uint64_t k=0; k<nbRows; k+=32)
{
// We unpack and transpose the input into a 32x32 bytes matrix
for (size_t row=0; row<32; row++)
{
for (size_t col=0; col<32; col++) { transpo[col][row] = (bitmap[k+row] >> col) & 1 ; }
}
for (size_t row=0; row<32; row++)
{
// We popcount the current row
u_int8_t sum=0;
for (size_t col=0; col<32; col++) { sum += transpo[row][col]; }
// We update the corresponding global sum
globalSums[row] += sum;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// Naive approach (row by row)
////////////////////////////////////////////////////////////////////////////////
void test_update_row_by_row_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
for (uint64_t row=0; row<nbRows; row++)
{
for (size_t col=0; col<32; col++)
{
globalSums[col] += (bitmap[row] >> col) & 1;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 (matrix transposition + popcount)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint32_t transpo[32];
const uint32_t* loop = bitmap;
for (uint64_t k=0; k<nbRows; loop+=32, k+=32)
{
// We transpose the input as a 32x32 bytes matrix
sse_trans ((const char*)loop, (char*)transpo);
// We update the global sums
for (size_t i=0; i<32; i++)
{
globalSums[i] += __builtin_popcount (transpo[i]);
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 approach (update totals row by row)
////////////////////////////////////////////////////////////////////////////////
// Note: we use template specialization to unroll some portions of a loop
template<int N>
void UpdateLocalSums (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
// We update the local sums with the current row
localSums = _mm256_sub_epi8 (localSums, expand_bits_to_bytes (bitmap[k++]));
// Go recursively
UpdateLocalSums<N-1>(localSums, bitmap, k);
}
template<>
void UpdateLocalSums<0> (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
}
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
#define USE_AVX2_FOR_GRAND_TOTALS 1
void test_update_row_by_row_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
union U256i { __m256i v; uint8_t a[32]; uint32_t b[8]; };
// We use 1 register for updating local totals
__m256i localSums = _mm256_setzero_si256();
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
__m256i globalSumsReg[4]; for (size_t r=0; r<4; r++) { globalSumsReg[r] = _mm256_setzero_si256(); }
#endif
uint64_t steps = nbRows / 255;
uint64_t k=0;
const int divisorOf255 = 5;
// We iterate over all rows
for (uint64_t i=0; i<steps; i++)
{
// we update the local totals (255*32=8160 additions)
for (int j=0; j<255/divisorOf255; j++)
{
// unroll some portion of the 255 loop through template specialization
UpdateLocalSums<divisorOf255>(localSums, bitmap, k);
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums
// We take the 128 high bits of the local sums
__m256i localSums2 = _mm256_broadcastsi128_si256(_mm256_extracti128_si256(localSums,1));
globalSumsReg[0] = _mm256_add_epi32 (globalSumsReg[0],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 0)))
);
globalSumsReg[1] = _mm256_add_epi32 (globalSumsReg[1],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 8)))
);
globalSumsReg[2] = _mm256_add_epi32 (globalSumsReg[2],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 0)))
);
globalSumsReg[3] = _mm256_add_epi32 (globalSumsReg[3],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 8)))
);
#else
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
#endif
// we reset the local totals
localSums = _mm256_setzero_si256();
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// We update the global totals into the final uint32_t array
for (size_t r=0; r<4; r++)
{
U256i tmp = { globalSumsReg[r] };
for (size_t k=0; k<8; k++) { globalSums[r*8+k] += tmp.b[k]; }
}
#endif
// we update the remaining local totals
for (uint64_t i=steps*255; i<nbRows; i++)
{
UpdateLocalSums<1>(localSums, bitmap, k);
}
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint64_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%4.2lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
check,
nbRows * cycleTotal / timeTotal / 1000000.0
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build an bitmap of size nbRows*32
uint32_t* bitmap = new uint32_t[nbRows];
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
// We fill the bitmap with random values
srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("naive (transpo) ", test_transpose_popcnt_naive, nbRuns, nbRows, bitmap);
execute ("naive (row by row)", test_update_row_by_row_naive, nbRuns, nbRows, bitmap);
execute ("AVX2 (transpo) ", test_transpose_popcnt_avx2, nbRuns, nbRows, bitmap);
execute ("AVX2 (row by row)", test_update_row_by_row_avx2, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
delete[] bitmap;
return EXIT_SUCCESS;
}
////////////////////////////////////////////////////////////////////////////////
__m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x);
// Each byte gets the source byte containing the corresponding bit
__m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
__m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_and_si256(shuf, andmask);
// Avoid an _mm256_add_epi8 thanks to Peter Cordes's comment
return _mm256_cmpeq_epi8(isolated_inverted, andmask);
}
////////////////////////////////////////////////////////////////////////////////
void sse_trans(char const *inp, char *out)
{
#define INP(x,y) inp[(x)*4 + (y)/8]
#define OUT(x,y) out[(y)*4 + (x)/8]
int rr, cc, i, h;
union { __m256i x; uint8_t b[32]; } tmp;
for (cc = 0; cc < 32; cc += 8)
{
for (i = 0; i < 32; ++i)
tmp.b[i] = INP(i, cc);
for (i = 8; i--; tmp.x = _mm256_slli_epi64(tmp.x, 1))
*(uint32_t*)&OUT(0, cc + i) = _mm256_movemask_epi8(tmp.x);
}
}
////////////////////////////////////////////////////////////////////////////////
double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
Some remarks:
I used the Agner Fog's asmlib to have a function that returns CPU cycles
The compilation command is g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64
The gcc version is 7.3.1
The CPU is Intel(R) Core(TM) i7-6700HQ CPU # 2.60GHz
I compute some dummy checksum to compare the results of the different tests
Now the results:
------------------------------------------------------------------------------------------------------------
name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum
------------------------------------------------------------------------------------------------------------
naive (transpo) | 4548 ( 36.5) | 43.91 (0.35) | 2.592 | 0x9affeb5a6
naive (row by row) | 3033 ( 11.0) | 29.29 (0.11) | 2.592 | 0x9affeb5a6
AVX2 (transpo) | 767 ( 12.8) | 7.40 (0.12) | 2.592 | 0x9affeb5a6
AVX2 (row by row) | 130 ( 4.0) | 1.25 (0.04) | 2.591 | 0x9affeb5a6
So it seems that the "row by row" in AVX2 is the best so far.
Note that when I saw this result (less than 2 cycles per row), I made no more effort to optimize the AVX2 "transpose+popcount" method, which should be feasable by computing several popcounts in parallel (I may test it later).

I eventually wrote another implementation, following the high entropy SWAR approach proposed by Peter Cordes. This implementation is recursive and relies on C++ template specialization.
The global idea is to fill N-bit accumulators to their maximum without carry overflow (this is where recursion is used). When these accumulators are filled, we update the grand totals and we start again with new N-bit accumulators to fill until all rows have been processed.
Here is the code (see function test_SWAR_recursive):
#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
using namespace std;
using namespace std::chrono;
// avoid the #include <asmlib.h>
extern "C" u_int64_t ReadTSC();
static double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
////////////////////////////////////////////////////////////////////////////////
// Recursive SWAR approach (with template specialization)
////////////////////////////////////////////////////////////////////////////////
template<int DEPTH>
struct RecursiveSWAR
{
// Number of accumulators for current depth
static const int N = 1<<DEPTH;
// Array of N-bit accumulators
typedef __m256i Array[N];
// Magic numbers (0x55555555, 0x33333333, ...) computed recursively
static const u_int32_t MAGIC_NUMBER =
RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER
* (1 + (1<<(1<<(DEPTH-1))))
/ (1 + (1<<(1<<(DEPTH+0))));
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array accumulators)
{
// We reset the N-bit accumulators
for (int i=0; i<N; i++) { accumulators[i] = _mm256_setzero_si256(); }
// We check (only for depth big enough) that we have still rows to process
if (DEPTH>=3) if (begin>=end) { return; }
typename RecursiveSWAR<DEPTH-1>::Array accumulatorsMinusOne;
// We load a register with the mask
__m256i mask = _mm256_set1_epi32 (RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER);
// We fill the N-bit accumulators to their maximum capacity without carry overflow
for (int i=0; i<N+1; i++)
{
// We fill (N-1)-bit accumulators recursively
RecursiveSWAR<DEPTH-1>::fillAccumulators (begin, end, accumulatorsMinusOne);
// We update the N-bit accumulators from the (N-1)-bit accumulators
for (int j=0; j<RecursiveSWAR<DEPTH-1>::N; j++)
{
// LOW part
accumulators[2*j+0] = _mm256_add_epi32 (
accumulators[2*j+0],
_mm256_and_si256 (
accumulatorsMinusOne[j],
mask
)
);
// HIGH part
accumulators[2*j+1] = _mm256_add_epi32 (
accumulators[2*j+1],
_mm256_and_si256 (
_mm256_srli_epi32 (
accumulatorsMinusOne[j],
RecursiveSWAR<DEPTH-1>::N
),
mask
)
);
}
}
}
};
// Template specialization for DEPTH=0
template<>
struct RecursiveSWAR<0>
{
static const int N = 1;
typedef __m256i Array[N];
static const u_int32_t MAGIC_NUMBER = 0x55555555;
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array result)
{
// We just load 8 rows in the AVX2 register
result[0] = _mm256_loadu_si256 ((__m256i*)begin);
// We update the iterator
begin += 1*sizeof(__m256i)/sizeof(u_int32_t);
}
};
template<int DEPTH> struct TypeInfo { };
template<> struct TypeInfo<3> { typedef u_int8_t Type; };
template<> struct TypeInfo<4> { typedef u_int16_t Type; };
template<> struct TypeInfo<5> { typedef u_int32_t Type; };
unsigned char reversebits (unsigned char b)
{
return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}
void test_SWAR_recursive (uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums)
{
static const int DEPTH = 4;
RecursiveSWAR<DEPTH>::Array accumulators;
uint32_t* begin = (uint32_t*) bitmap;
const uint32_t* end = bitmap + nbRows;
// We reset the grand totals
for (int i=0; i<32; i++) { globalSums[i] = 0; }
while (begin < end)
{
// We fill the N-bit accumulators to the maximum without overflow
RecursiveSWAR<DEPTH>::fillAccumulators (begin, end, accumulators);
// We update grand totals from the filled N-bit accumulators
for (int i=0; i<RecursiveSWAR<DEPTH>::N; i++)
{
int r = reversebits(i) >> (8-DEPTH);
u_int32_t* sums = globalSums+r;
TypeInfo<DEPTH>::Type* values = (TypeInfo<DEPTH>::Type*) (accumulators+i);
for (int j=0; j<8*(1<<(5-DEPTH)); j++)
{
sums[(j*RecursiveSWAR<DEPTH>::N) % 32] += values[j];
}
}
}
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint32_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += (k+1)*sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%5.3lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
nbRows * cycleTotal / timeTotal / 1000000.0,
check/nbRuns
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build an bitmap of size nbRows*32
uint64_t actualNbRows = nbRows + 100000;
uint32_t* bitmap = (uint32_t*)_mm_malloc(sizeof(uint32_t)*actualNbRows, 256);
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
memset (bitmap, 0, sizeof(u_int32_t)*actualNbRows);
// We fill the bitmap with random values
// srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("AVX2 (SWAR rec) ", test_SWAR_recursive, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
_mm_free (bitmap);
return EXIT_SUCCESS;
}
The size of the accumulators is 2DEPTH in this code. Note that this implementation is valid up to DEPTH=5. For DEPTH=4, here are the performance results compared to the implementation of Peter Cordes (named high entropy SWAR):
The graph gives the number of cycles required to process a row (of 32 items) as a function of the number of rows of the matrix. As expected, the results are pretty similar since the main idea is the same. It is interesting to note the three parts of the graph:
constant value for log2(n)<=20
increasing value for log2(n) between 20 and 22
constant value for log2(n)>=22
I guess that CPU caches properties can explain this behaviour.

Entrywise addition of two double arrays using AVX

I need a function to entrywise add the elements of two double arrays and store the result in a third array. Currently I use (simplified)
void add( double* result, const double* a, const double* b, size_t size) {
memcpy(result, a, size*sizeof(double));
for(size_t i = 0; i < size; ++i) {
result[i] += b[i];
}
}
As far as I know the memcpy function uses AVX. In order to improve the performance I would like to also enforce AVX use for the addition. This should be one of the most basic examples for AVX, however I couldn't find any description how to do this in C/C++. I would like to avoid the use of external libraries if possible.

You'll need something like this, assuming AVX-512:
void add( double* result, const double* a, const double* b, size_t size)
{
size_t i = 0;
// Note we are doing as many blocks of 8 as we can. If the size is not divisible by 8
// then we will have some left over that will then be performed serially.
// AVX-512 loop
for( ; i < (size & ~0x7); i += 8)
{
const __m512d kA8 = _mm512_load_pd( &a[i] );
const __m512d kB8 = _mm512_load_pd( &b[i] );
const __m512d kRes = _mm512_add_pd( kA8, kB8 );
_mm512_stream_pd( &res[i], kRes );
}
// AVX loop
for ( ; i < (size & ~0x3); i += 4 )
{
const __m256d kA4 = _mm256_load_pd( &a[i] );
const __m256d kB4 = _mm256_load_pd( &b[i] );
const __m256d kRes = _mm256_add_pd( kA4, kB4 );
_mm256_stream_pd( &res[i], kRes );
}
// SSE2 loop
for ( ; i < (size & ~0x1); i += 2 )
{
const __m128d kA2 = _mm_load_pd( &a[i] );
const __m128d kB2 = _mm_load_pd( &b[i] );
const __m128d kRes = _mm_add_pd( kA2, kB2 );
_mm_stream_pd( &res[i], kRes );
}
// Serial loop
for( ; i < size; i++ )
{
result[i] = a[i] + b[i];
}
}
(Though be warned I've just thrown that together off the top of my head).
Something to note form the above code is that I essentially process the remaining values using the next best parallel code. Primarily this is for illustration of the 3 possible ways you could do it parallely. The loops will work perfectly well on their own. For example if you can't support AVX-512 then you'd jump straight to the AVX loop. If you can't support AVX even then if you jump straight to the SSE2 loop then you'll be using the most performant loop that your hardware can support.
For best performance your arrays should be aligned to the relevant size used in the load. So for AVX-512 you would want 512-bit of 64 byte alignment. For AVX, 256-bit or 32 byte alignment. For SSE2 128-bit or 16 byte alignment. If you use 64 byte alignment for all your arrays then you will always have good alignment, though you may want to go for 128 byte alignment to ease moving over to AVX-1024 when that appears ;)

Why is this slower than memcmp

I am trying to compare two rows of pixels.
A pixel is defined as a struct containing 4 float values (RGBA).
The reason I am not using memcmp is because I need to return the position of the 1st different pixel, which memcmp does not do.
My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128 x = _mm_load_ps((float*)(a + i));
__m128 y = _mm_load_ps((float*)(b + i));
__m128 cmp = _mm_cmpeq_ps(x, y);
if (_mm_movemask_ps(cmp) != 15) return i;
}
return -1;
}
I then found that treating the values as integers instead of floats sped things up a bit, and is now only ~20% slower than memcmp.
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128i x = _mm_load_si128((__m128i*)(a + i));
__m128i y = _mm_load_si128((__m128i*)(b + i));
__m128i cmp = _mm_cmpeq_epi32(x, y);
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
}
return -1;
}
From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is what other tricks does the MS implementation have up it's sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?
Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16 byte boundary?
I am compiling with /o2 and all the optimization flags.

I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".
If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.
The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.
You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.
Edit: The last paragraph may need an example. This code assumes an unrolled loop of at least two:
__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));
for(int i = 0; i < count; i+=2)
{
__m128i cmp = _mm_cmpeq_epi32(x, y);
__m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
__m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
cmp = _mm_cmpeq_epi32(x1, y1);
__m128i x = _mm_load_si128((__m128i*)(a + i + 2));
__m128i y = _mm_load_si128((__m128i*)(b + i + 2));
if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
Roughly something like that.

You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function, it starts with some sanity checks and then checks if the pointers are aligned or not:
aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );
If they are not aligned it compares the pointers byte by byte until the start of an aligned address:
while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
if(*a++ != *b++) return -1;
--len;
}
And then compares the remaining memory with SSE instructions similar to your code:
if(!len) return 0;
while( len && !(len & 7 ) )
{
__m128i x = _mm_load_si128( (__m128i*)&a[i]);
__m128i y = _mm_load_si128( (__m128i*)&b[i]);
....

I cannot help you directly because I'm using Mac, but there's an easy way to figure out what happens:
You just step into memcpy in the debug mode and switch to Disassembly view. As the memcpy is a simple little function, you will easily figure out all the implementation tricks.

SSE intrinsics cause normal float operation to return -1.#INV

I am having a problem with a SSE method I am writing that performs audio processing. I have implemented a SSE random function based on Intel's paper here:
http://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor/
I also have a method that is performing conversions from Float to S16 using SSE also, the conversion is performed quite simply as follows:
unsigned int Float_S16LE(float *data, const unsigned int samples, uint8_t *dest)
{
int16_t *dst = (int16_t*)dest;
const __m128 mul = _mm_set_ps1((float)INT16_MAX);
__m128 rand;
const uint32_t even = count & ~0x3;
for(uint32_t i = 0; i < even; i += 4, data += 4, dst += 4)
{
/* random round to dither */
FloatRand4(-0.5f, 0.5f, NULL, &rand);
__m128 rmul = _mm_add_ps(mul, rand);
__m128 in = _mm_mul_ps(_mm_load_ps(data),rmul);
__m64 con = _mm_cvtps_pi16(in);
memcpy(dst, &con, sizeof(int16_t) * 4);
}
}
FloatRand4 is defined as follows:
static inline void FloatRand4(const float min, const float max, float result[4], __m128 *sseresult = NULL)
{
const float delta = (max - min) / 2.0f;
const float factor = delta / (float)INT32_MAX;
...
}
If sseresult != NULL the __m128 result is returned and result is unused.
This performs perfectly on the first loop, but on the next loop delta becomes -1.#INF instead of 1.0. If I comment out the line __m64 con = _mm_cvtps_pi16(in); the problem goes away.
I think that the FPU is getting into an unknown state or something.

Mixing SSE Integer arithmetic and (regular) Floating point math. Can produce weird results because both are operating on the same registers. If you use:
_mm_empty()
the FPU is reset into a correct state. Microsoft has Guidelines for When to Use EMMS

_mm_load_ps is not guaranteed to do an aligned load. float* data can be aligned to 4 bytes instead of 16 _ => _mm_loadu_ps
memcpy will probably kill the advantages achieved with SSE, you should use a store command for __m64 but here again, take care of the alignment.
If it's impossible to do an unaligned stream or store of an __m64, I'd either keep it inside an _m128i and do a masked write with _mm_maskmoveu_si128 or store those 8 bytes by hand.
http://msdn.microsoft.com/en-us/library/bytwczae.aspx

Faster alpha blending using a lookup table?

I have made a lookup table that allows you to blend two single-byte channels (256 colors per channel) using a single-byte alpha channel using no floating point values (hence no float to int conversions). Each index in the lookup table corresponds to the value of 256ths of a channel, as related to an alpha value.
In all, to fully calculate a 3-channel RGB blend, it would require two lookups into the array per channel, plus an addition. This is a total of 6 lookups and 3 additions. In the example below, I split the colors into separate values for ease of demonstration. This example shows how to blend three channels, R G and B by an alpha value ranging from 0 to 256.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
BYTE rem = 255 - av; // Remaining fraction
rDest = _lookup[r1][rem] + _lookup[r2][av];
gDest = _lookup[g1][rem] + _lookup[g2][av];
bDest = _lookup[b1][rem] + _lookup[b2][av];
It works great. Precise as you can get using 256 color channels. In fact, you would get the same exact values using the actual floating point calculations. The lookup table was calculated using doubles to begin with. The lookup table is too big to fit in this post (65536 bytes). (If you would like a copy of it, email me at ten.turtle.toes#gmail.com, but don't expect a reply until tomorrow because I am going to sleep now.)
So... what do you think? Is it worth it or not?

I would be interested in seeing some benchmarks.
There is an algorithm that can do perfect alpha blending without any floating point calculations or lookup tables. You can find more info in the following document (the algorithm and code is described at the end)
I also did an SSE implementation of this long ago, if you are interested...
void PreOver_SSE2(void* dest, const void* source1, const void* source2, size_t size)
{
static const size_t STRIDE = sizeof(__m128i)*4;
static const u32 PSD = 64;
static const __m128i round = _mm_set1_epi16(128);
static const __m128i lomask = _mm_set1_epi32(0x00FF00FF);
assert(source1 != NULL && source2 != NULL && dest != NULL);
assert(size % STRIDE == 0);
const __m128i* source128_1 = reinterpret_cast<const __m128i*>(source1);
const __m128i* source128_2 = reinterpret_cast<const __m128i*>(source2);
__m128i* dest128 = reinterpret_cast<__m128i*>(dest);
__m128i d, s, a, rb, ag, t;
for(size_t k = 0, length = size/STRIDE; k < length; ++k)
{
// TODO: put prefetch between calculations?(R.N)
_mm_prefetch(reinterpret_cast<const s8*>(source128_1+PSD), _MM_HINT_NTA);
_mm_prefetch(reinterpret_cast<const s8*>(source128_2+PSD), _MM_HINT_NTA);
// work on entire cacheline before next prefetch
for(int n = 0; n < 4; ++n, ++dest128, ++source128_1, ++source128_2)
{
// TODO: assembly optimization use PSHUFD on moves before calculations, lower latency than MOVDQA (R.N) http://software.intel.com/en-us/articles/fast-simd-integer-move-for-the-intel-pentiumr-4-processor/
// TODO: load entire cacheline at the same time? are there enough registers? 32 bit mode (special compile for 64bit?) (R.N)
s = _mm_load_si128(source128_1); // AABGGRR
d = _mm_load_si128(source128_2); // AABGGRR
// PRELERP(S, D) = S+D - ((S*D[A]+0x80)>>8)+(S*D[A]+0x80))>>8
// T = S*D[A]+0x80 => PRELERP(S,D) = S+D - ((T>>8)+T)>>8
// set alpha to lo16 from dest_
a = _mm_srli_epi32(d, 24); // 000000AA
rb = _mm_slli_epi32(a, 16); // 00AA0000
a = _mm_or_si128(rb, a); // 00AA00AA
rb = _mm_and_si128(lomask, s); // 00BB00RR
rb = _mm_mullo_epi16(rb, a); // BBBBRRRR
rb = _mm_add_epi16(rb, round); // BBBBRRRR
t = _mm_srli_epi16(rb, 8);
t = _mm_add_epi16(t, rb);
rb = _mm_srli_epi16(t, 8); // 00BB00RR
ag = _mm_srli_epi16(s, 8); // 00AA00GG
ag = _mm_mullo_epi16(ag, a); // AAAAGGGG
ag = _mm_add_epi16(ag, round);
t = _mm_srli_epi16(ag, 8);
t = _mm_add_epi16(t, ag);
ag = _mm_andnot_si128(lomask, t); // AA00GG00
rb = _mm_or_si128(rb, ag); // AABGGRR pack
rb = _mm_sub_epi8(s, rb); // sub S-[(D[A]*S)/255]
d = _mm_add_epi8(d, rb); // add D+[S-(D[A]*S)/255]
_mm_stream_si128(dest128, d);
}
}
_mm_mfence(); //ensure last WC buffers get flushed to memory
}

Today's processors can do a lot of calculation in the time it takes to get one value from memory, especially if it's not in cache. That makes it especially important to benchmark possible solutions, because you can't easily reason what the result will be.
I don't know why you're concerned about floating point conversions, this can all be done as integers.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value BYTE
rem = 255 - av; // Remaining fraction
rDest = (r1*rem + r2*av) / 255;
gDest = (g1*rem + g2*av) / 255;
bDest = (b1*rem + b2*av) / 255;
If you want to get really clever, you can replace the divide with a multiply followed by a right-shift.
Edit: Here's the version using a right-shift. Adding a constant might reduce any truncation errors, but I'll leave that as an exercise for the reader.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value BYTE
int aMult = 0x10000 * av / 255;
rem = 0x10000 - aMult; // Remaining fraction
rDest = (r1*rem + r2*aMult) >> 16;
gDest = (g1*rem + g2*aMult) >> 16;
bDest = (b1*rem + b2*aMult) >> 16;

Mixing colors is computationally trivial. I would be surprised if this yielded a significant benefit, but as always, you must first prove that this calculation is a bottleneck in performance in the first place. And I would suggest that a better solution is to use hardware accelerated blending.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js