Matrix transpose and population count - bit-manipulation

I have a square boolean matrix M of size N, stored by rows, and I want to count the number of bits set to 1 in each column.
For instance, for N=4:
1101
0101
0001
1001
M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };
result = { 2, 2, 0, 4};
I can obviously:
transpose the matrix M into a matrix M'
popcount each row of M'.
Good algorithms exist for matrix transposition and popcounting through bit manipulation.
My question is: would it be possible to "merge" such algorithms into a single one?
Note that N could be quite large (say 1024 and more) relative to a 64-bit architecture.
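To make the expected computation concrete, here is a minimal scalar sketch of what I want (just an illustration, assuming each row is bit-packed into N/64 uint64_t words; the names are only placeholders):

#include <cstdint>
#include <cstddef>

// counts[c] = number of rows whose bit in column c is set
void column_popcounts(const uint64_t* M, size_t N, uint64_t* counts)
{
    const size_t wordsPerRow = N / 64;              // assume N is a multiple of 64
    for (size_t c = 0; c < N; ++c) counts[c] = 0;
    for (size_t r = 0; r < N; ++r)
        for (size_t c = 0; c < N; ++c)
            counts[c] += (M[r * wordsPerRow + c / 64] >> (c % 64)) & 1;
}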

Related: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 and https://github.com/mklarqvist/positional-popcount
I had another idea which I haven't finished writing up nicely.
Godbolt link to messy work-in-progress which doesn't have correct loop bounds / cleanup, but for large buffers runs ~3x faster than @edrezen's version on my Skylake i7-6700k, with g++7.3 -O3 -march=native. See the test_SWAR_avx2 function. (I know it doesn't compile on Godbolt; Agner Fog's asmlib.h isn't present.)
I might have some columns in the wrong order, too, but from stepping through the asm I think it's doing the right amount of work. i.e. any necessary bugfixes won't slow it down.
I used 16-bit accumulators, so another outer loop might be necessary if you care about inputs large enough to overflow 16-bit per-column counters.
Interesting observation: An earlier buggy version of my loop used sum0123 twice in store_globalsums_from_vec16, leaving sum4567 unused, so it optimized away in the main loop. With less work, gcc fully unrolled the large for(int i=0 ; i<5 ; i++) loop, and the code ran slower, like about 1 cycle per byte instead of 0.5. The loop was probably too big for the uop cache or something (I didn't profile yet, but a front-end decode bottleneck would explain it). For some reason @edrezen's version is only running at about 1.5c/B for me, not the ~1.25 reported in the answer. My CPU is actually running at 3.9GHz while Agner Fog's library detects it at 4.0, but that's not enough to explain the difference.
Also, gcc spills sum4567_16bit to the stack, so we're already pushing the boundary of register pressure without AVX512. It's updated infrequently and isn't a problem, but needing more accumulators in the inner loop could be.
It isn't clear from your description what the data layout is when the number of columns isn't 32.
It seems that for each uint32_t chunk of 32 columns, you have all the rows stored contiguously in memory, i.e. looping over the rows for a column is efficient. If you had more than 32 columns, the rows for columns 32..63 would be contiguous and come after all the rows for columns 0..31.
(If instead you have all the columns for a single row contiguous, you could still use this idea, but might need to spill/reload some accumulators to memory, or let the compiler do that for you if it makes good choices.)
So loading a 32-byte (8 dword) vector gets 8 rows of data for one column chunk. That's extremely convenient, and allows widening from 1-bit (in memory) to 2-bit accumulators, then grab more data before we widen to 4-bit, and so on, summing along the way so we get significant work done while the data is still dense. (Rather than only adding 1 bit (0 or 1) per byte to vector accumulators.)
The more we unroll, the more data we can grab from memory to make better use of the coding space in our vectors. i.e. our variables have higher entropy. Throwing around more data (in terms of bits of memory that contributed to it) per vpaddb/w/d/q or unpack/shuffle instruction is a Good Thing.
Accumulators narrower than 1 byte within a SIMD vector is basically an https://en.wikipedia.org/wiki/SWAR technique, where you have to AND away bits that you shift past an element boundary, because we don't have SIMD element boundaries to do it for us. (And we avoid overflow anyway, so ADD carrying into the next element isn't a problem.)
Each inner loop iteration (a sketch of the first step appears after this list):
take a vector of data from the same columns in each of 2 or 3 (groups of) rows. So you either have 3 * 8 rows from one chunk of 32 columns, or 3 rows of 256 columns.
mask them with set1(0b01010101) to get the even (low) bits, and with (vec>>1) & mask (_mm256_srli_epi32(v,1)) to get the odd (high) bits. Use _mm256_add_epi8 to accumulate within those 2-bit accumulators. They can't overflow with only 3 ones, so carry-propagation boundaries don't actually matter.
Each byte of your vector has 4 separate vertical sums, and you have two vectors (odd/even).
Repeat the above again, to get another pair of vectors from 3 vectors of data from memory.
Combine again to get 4 vectors of 4-bit accumulators (with possible values 0..6). Still without mixing bits from within a single 32-bit element, of course, because we must never do that. Shifts only move bits for odd / high columns to the bottom of the 2-bit or 4-bit unit that contains them so they can be added with bits that were moved the same way in other vectors.
_mm256_unpacklo/hi_epi8 and mask or shift+mask to get 8-bit accumulators
Put the above in a loop that runs up to 5 times, so the 0..12 accumulator values go up to 0..60 (i.e. leaving 2 bits of headroom for unpacking the 8-bit accumulators, using all their coding space.)
If you have the data layout from your answer, then we can add data from dword elements within the same vector. We can do that so we don't run out of registers when widening our accumulators up to 16-bit (because x86-64 only has 16 YMM registers, and we need some for constants.)
_mm256_unpacklo/hi_epi16 and add, to interleave pairs of 8-bit counters so a group of counters for the same column has expanded from a dword to a qword.
Repeat this general idea to reduce the number of registers (or __m256i variables) your accumulators are spread over.
Efficiently handling the lack of a lane-crossing 2-input byte or word shuffle is inconvenient, but it's a pretty small part of the total work. vextracti128 / vpaddb xmm -> vpmovzxbw worked well enough.
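To make the first step above concrete, here is a rough sketch of the 1-bit to 2-bit widening (illustrative only, not the actual test_SWAR_avx2 code; it assumes v0, v1, v2 are three __m256i loads covering the same columns):

#include <immintrin.h>

// Accumulate 3 vectors of 1-bit data into 2-bit SWAR accumulators (even/low columns).
// The odd (high) columns are handled the same way after _mm256_srli_epi32(v, 1).
static inline __m256i accum_3rows_even(__m256i v0, __m256i v1, __m256i v2)
{
    const __m256i mask01 = _mm256_set1_epi8(0x55);              // 0b01010101
    __m256i sum = _mm256_and_si256(v0, mask01);                 // each 2-bit field holds 0..1
    sum = _mm256_add_epi8(sum, _mm256_and_si256(v1, mask01));   // 0..2, no carry across fields
    sum = _mm256_add_epi8(sum, _mm256_and_si256(v2, mask01));   // 0..3, still fits in 2 bits
    return sum;
}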

I made some benchmarks comparing the two approaches:
transpose + popcount
update row by row
I wrote a naive version and an AVX2 one for both approaches. I used some functions (found on stackoverflow or elsewhere) for the AVX2 "transpose+popcount" approach.
In my test, I assume that the input is a nbRows x 32 matrix in bit-packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array of uint32_t.
The code is the following:
#include <cinttypes>
#include <cstdio>
#include <cstring>
#include <cmath>
#include <cassert>
#include <chrono>
#include <immintrin.h>
#include <asmlib.h>
using namespace std;
using namespace std::chrono;
// see https://stackoverflow.com/questions/24225786/fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector
static __m256i expand_bits_to_bytes (uint32_t x);
// see https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/
static void sse_trans(char const *inp, char *out);
static double deviation (double n, double sum2, double sum);
////////////////////////////////////////////////////////////////////////////////
// Naive approach (matrix transposition)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint8_t transpo[32][32]; memset (transpo, 0, sizeof(transpo));
for (uint64_t k=0; k<nbRows; k+=32)
{
// We unpack and transpose the input into a 32x32 bytes matrix
for (size_t row=0; row<32; row++)
{
for (size_t col=0; col<32; col++) { transpo[col][row] = (bitmap[k+row] >> col) & 1 ; }
}
for (size_t row=0; row<32; row++)
{
// We popcount the current row
u_int8_t sum=0;
for (size_t col=0; col<32; col++) { sum += transpo[row][col]; }
// We update the corresponding global sum
globalSums[row] += sum;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// Naive approach (row by row)
////////////////////////////////////////////////////////////////////////////////
void test_update_row_by_row_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
for (uint64_t row=0; row<nbRows; row++)
{
for (size_t col=0; col<32; col++)
{
globalSums[col] += (bitmap[row] >> col) & 1;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 (matrix transposition + popcount)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint32_t transpo[32];
const uint32_t* loop = bitmap;
for (uint64_t k=0; k<nbRows; loop+=32, k+=32)
{
// We transpose the input as a 32x32 bytes matrix
sse_trans ((const char*)loop, (char*)transpo);
// We update the global sums
for (size_t i=0; i<32; i++)
{
globalSums[i] += __builtin_popcount (transpo[i]);
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 approach (update totals row by row)
////////////////////////////////////////////////////////////////////////////////
// Note: we use template specialization to unroll some portions of a loop
template<int N>
void UpdateLocalSums (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
// We update the local sums with the current row
localSums = _mm256_sub_epi8 (localSums, expand_bits_to_bytes (bitmap[k++]));
// Go recursively
UpdateLocalSums<N-1>(localSums, bitmap, k);
}
template<>
void UpdateLocalSums<0> (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
}
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
#define USE_AVX2_FOR_GRAND_TOTALS 1
void test_update_row_by_row_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
union U256i { __m256i v; uint8_t a[32]; uint32_t b[8]; };
// We use 1 register for updating local totals
__m256i localSums = _mm256_setzero_si256();
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
__m256i globalSumsReg[4]; for (size_t r=0; r<4; r++) { globalSumsReg[r] = _mm256_setzero_si256(); }
#endif
uint64_t steps = nbRows / 255;
uint64_t k=0;
const int divisorOf255 = 5;
// We iterate over all rows
for (uint64_t i=0; i<steps; i++)
{
// we update the local totals (255*32=8160 additions)
for (int j=0; j<255/divisorOf255; j++)
{
// unroll some portion of the 255 loop through template specialization
UpdateLocalSums<divisorOf255>(localSums, bitmap, k);
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums
// We take the 128 high bits of the local sums
__m256i localSums2 = _mm256_broadcastsi128_si256(_mm256_extracti128_si256(localSums,1));
globalSumsReg[0] = _mm256_add_epi32 (globalSumsReg[0],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 0)))
);
globalSumsReg[1] = _mm256_add_epi32 (globalSumsReg[1],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 8)))
);
globalSumsReg[2] = _mm256_add_epi32 (globalSumsReg[2],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 0)))
);
globalSumsReg[3] = _mm256_add_epi32 (globalSumsReg[3],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 8)))
);
#else
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
#endif
// we reset the local totals
localSums = _mm256_setzero_si256();
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// We update the global totals into the final uint32_t array
for (size_t r=0; r<4; r++)
{
U256i tmp = { globalSumsReg[r] };
for (size_t k=0; k<8; k++) { globalSums[r*8+k] += tmp.b[k]; }
}
#endif
// we update the remaining local totals
for (uint64_t i=steps*255; i<nbRows; i++)
{
UpdateLocalSums<1>(localSums, bitmap, k);
}
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint64_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%4.2lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
nbRows * cycleTotal / timeTotal / 1000000.0,
check
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build a bitmap of size nbRows*32
uint32_t* bitmap = new uint32_t[nbRows];
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
// We fill the bitmap with random values
srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("naive (transpo) ", test_transpose_popcnt_naive, nbRuns, nbRows, bitmap);
execute ("naive (row by row)", test_update_row_by_row_naive, nbRuns, nbRows, bitmap);
execute ("AVX2 (transpo) ", test_transpose_popcnt_avx2, nbRuns, nbRows, bitmap);
execute ("AVX2 (row by row)", test_update_row_by_row_avx2, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
delete[] bitmap;
return EXIT_SUCCESS;
}
////////////////////////////////////////////////////////////////////////////////
__m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x);
// Each byte gets the source byte containing the corresponding bit
__m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
__m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_and_si256(shuf, andmask);
// Avoid an _mm256_add_epi8 thanks to Peter Cordes's comment
return _mm256_cmpeq_epi8(isolated_inverted, andmask);
}
////////////////////////////////////////////////////////////////////////////////
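// sse_trans: 32x32 bit-matrix transpose (from the mischasan link above). Each pass
// loads the same byte of all 32 input rows into one YMM register, then
// _mm256_movemask_epi8 repeatedly extracts the top bit of every byte to emit one
// 32-bit row of the transposed output, shifting left by 1 between extractions.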
void sse_trans(char const *inp, char *out)
{
#define INP(x,y) inp[(x)*4 + (y)/8]
#define OUT(x,y) out[(y)*4 + (x)/8]
int rr, cc, i, h;
union { __m256i x; uint8_t b[32]; } tmp;
for (cc = 0; cc < 32; cc += 8)
{
for (i = 0; i < 32; ++i)
tmp.b[i] = INP(i, cc);
for (i = 8; i--; tmp.x = _mm256_slli_epi64(tmp.x, 1))
*(uint32_t*)&OUT(0, cc + i) = _mm256_movemask_epi8(tmp.x);
}
}
////////////////////////////////////////////////////////////////////////////////
double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
Some remarks:
I used Agner Fog's asmlib to have a function that returns CPU cycles
The compilation command is g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64
The gcc version is 7.3.1
The CPU is Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
I compute some dummy checksum to compare the results of the different tests
Now the results:
------------------------------------------------------------------------------------------------------------
name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum
------------------------------------------------------------------------------------------------------------
naive (transpo) | 4548 ( 36.5) | 43.91 (0.35) | 2.592 | 0x9affeb5a6
naive (row by row) | 3033 ( 11.0) | 29.29 (0.11) | 2.592 | 0x9affeb5a6
AVX2 (transpo) | 767 ( 12.8) | 7.40 (0.12) | 2.592 | 0x9affeb5a6
AVX2 (row by row) | 130 ( 4.0) | 1.25 (0.04) | 2.591 | 0x9affeb5a6
So it seems that the "row by row" in AVX2 is the best so far.
Note that when I saw this result (less than 2 cycles per row), I made no further effort to optimize the AVX2 "transpose+popcount" method, which should be feasible by computing several popcounts in parallel (I may test it later).

I eventually wrote another implementation, following the high entropy SWAR approach proposed by Peter Cordes. This implementation is recursive and relies on C++ template specialization.
The global idea is to fill N-bit accumulators to their maximum without carry overflow (this is where recursion is used). When these accumulators are filled, we update the grand totals and we start again with new N-bit accumulators to fill until all rows have been processed.
Here is the code (see function test_SWAR_recursive):
#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
using namespace std;
using namespace std::chrono;
// avoid the #include <asmlib.h>
extern "C" u_int64_t ReadTSC();
static double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
////////////////////////////////////////////////////////////////////////////////
// Recursive SWAR approach (with template specialization)
////////////////////////////////////////////////////////////////////////////////
template<int DEPTH>
struct RecursiveSWAR
{
// Number of accumulators for current depth
static const int N = 1<<DEPTH;
// Array of N-bit accumulators
typedef __m256i Array[N];
// Magic numbers (0x55555555, 0x33333333, ...) computed recursively
static const u_int32_t MAGIC_NUMBER =
RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER
* (1 + (1<<(1<<(DEPTH-1))))
/ (1 + (1<<(1<<(DEPTH+0))));
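// For example:
//   DEPTH=1: 0x55555555 * (1+2^1) / (1+2^2) = 0xFFFFFFFF / 5   = 0x33333333
//   DEPTH=2: 0x33333333 * (1+2^2) / (1+2^4) = 0xFFFFFFFF / 17  = 0x0F0F0F0F
//   DEPTH=3: 0x0F0F0F0F * (1+2^4) / (1+2^8) = 0xFFFFFFFF / 257 = 0x00FF00FF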
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array accumulators)
{
// We reset the N-bit accumulators
for (int i=0; i<N; i++) { accumulators[i] = _mm256_setzero_si256(); }
// We check (only for depth big enough) that we have still rows to process
if (DEPTH>=3) if (begin>=end) { return; }
typename RecursiveSWAR<DEPTH-1>::Array accumulatorsMinusOne;
// We load a register with the mask
__m256i mask = _mm256_set1_epi32 (RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER);
// We fill the N-bit accumulators to their maximum capacity without carry overflow
for (int i=0; i<N+1; i++)
{
// We fill (N-1)-bit accumulators recursively
RecursiveSWAR<DEPTH-1>::fillAccumulators (begin, end, accumulatorsMinusOne);
// We update the N-bit accumulators from the (N-1)-bit accumulators
for (int j=0; j<RecursiveSWAR<DEPTH-1>::N; j++)
{
// LOW part
accumulators[2*j+0] = _mm256_add_epi32 (
accumulators[2*j+0],
_mm256_and_si256 (
accumulatorsMinusOne[j],
mask
)
);
// HIGH part
accumulators[2*j+1] = _mm256_add_epi32 (
accumulators[2*j+1],
_mm256_and_si256 (
_mm256_srli_epi32 (
accumulatorsMinusOne[j],
RecursiveSWAR<DEPTH-1>::N
),
mask
)
);
}
}
}
};
// Template specialization for DEPTH=0
template<>
struct RecursiveSWAR<0>
{
static const int N = 1;
typedef __m256i Array[N];
static const u_int32_t MAGIC_NUMBER = 0x55555555;
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array result)
{
// We just load 8 rows in the AVX2 register
result[0] = _mm256_loadu_si256 ((__m256i*)begin);
// We update the iterator
begin += 1*sizeof(__m256i)/sizeof(u_int32_t);
}
};
template<int DEPTH> struct TypeInfo { };
template<> struct TypeInfo<3> { typedef u_int8_t Type; };
template<> struct TypeInfo<4> { typedef u_int16_t Type; };
template<> struct TypeInfo<5> { typedef u_int32_t Type; };
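// Reverse the bits of a byte (classic multiply/mask trick); used below to map an
// accumulator index to its starting column offset in globalSums.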
unsigned char reversebits (unsigned char b)
{
return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}
void test_SWAR_recursive (uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums)
{
static const int DEPTH = 4;
RecursiveSWAR<DEPTH>::Array accumulators;
uint32_t* begin = (uint32_t*) bitmap;
const uint32_t* end = bitmap + nbRows;
// We reset the grand totals
for (int i=0; i<32; i++) { globalSums[i] = 0; }
while (begin < end)
{
// We fill the N-bit accumulators to the maximum without overflow
RecursiveSWAR<DEPTH>::fillAccumulators (begin, end, accumulators);
// We update grand totals from the filled N-bit accumulators
for (int i=0; i<RecursiveSWAR<DEPTH>::N; i++)
{
int r = reversebits(i) >> (8-DEPTH);
u_int32_t* sums = globalSums+r;
TypeInfo<DEPTH>::Type* values = (TypeInfo<DEPTH>::Type*) (accumulators+i);
for (int j=0; j<8*(1<<(5-DEPTH)); j++)
{
sums[(j*RecursiveSWAR<DEPTH>::N) % 32] += values[j];
}
}
}
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint32_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += (k+1)*sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%5.3lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
nbRows * cycleTotal / timeTotal / 1000000.0,
check/nbRuns
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build a bitmap of size nbRows*32
uint64_t actualNbRows = nbRows + 100000;
uint32_t* bitmap = (uint32_t*)_mm_malloc(sizeof(uint32_t)*actualNbRows, 256);
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
memset (bitmap, 0, sizeof(u_int32_t)*actualNbRows);
// We fill the bitmap with random values
// srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("AVX2 (SWAR rec) ", test_SWAR_recursive, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
_mm_free (bitmap);
return EXIT_SUCCESS;
}
The size (in bits) of the accumulators is 2^DEPTH in this code. Note that this implementation is valid up to DEPTH=5. For DEPTH=4, here are the performance results compared to the implementation of Peter Cordes (named high entropy SWAR):
The graph gives the number of cycles required to process a row (of 32 items) as a function of the number of rows of the matrix. As expected, the results are pretty similar since the main idea is the same. It is interesting to note the three parts of the graph:
constant value for log2(n)<=20
increasing value for log2(n) between 20 and 22
constant value for log2(n)>=22
I guess that CPU cache properties can explain this behaviour.

Casting AVX512 mask types

I'm trying to figure out how to use masked loads and stores for the last few elements to be processed. My use case involves converting a packed 10-bit data stream to 16 bit, which means loading 5 bytes before storing 4 shorts. This results in different masks of different types.
The main loop itself is not a problem. But at the end I'm left with up to 19 bytes input / 15 shorts output which I thought I could process in up to two loop iterations using the 128 bit vectors. Here is the outline of the code.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
void convert(uint16_t* out, ptrdiff_t n, const uint8_t* in)
{
uint16_t* const out_end = out + n;
for(uint16_t* out32_end = out + (n & -32); out < out32_end; in += 40, out += 32) {
/*
* insert main loop here using ZMM vectors
*/
}
if(out_end - out >= 16) {
/*
* insert half-sized iteration here using YMM vectors
*/
in += 20;
out += 16;
}
// up to 19 byte input remaining, up to 15 shorts output
const unsigned out_remain = out_end - out;
const unsigned in_remain = (out_remain * 10 + 7) / 8;
unsigned in_mask = (1 << in_remain) - 1;
unsigned out_mask = (1 << out_remain) - 1;
while(out_mask) {
__mmask16 load_mask = _cvtu32_mask16(in_mask);
__m128i packed = _mm_maskz_loadu_epi8(load_mask, in);
/* insert computation here. No masks required */
__mmask8 store_mask = _cvtu32_mask8(out_mask);
_mm_mask_storeu_epi16(out, store_mask, packed);
in += 10;
out += 8;
in_mask >>= 10;
out_mask >>= 8;
}
}
(Compile with -O3 -mavx2 -mavx512f -mavx512bw -mavx512vl -mavx512dq)
My idea was to create a bit mask from the number of remaining elements (since I know it fits comfortably in an integer / mask register), then shift values out of the mask as they are processed.
I have two issues with this approach:
I'm re-setting the masks from GP registers each iteration instead of using the kshift family of instructions
_cvtu32_mask8 (kmovb) is the only instruction in this code that requires AVX512DQ. Limiting the number of suitable hardware platforms just for that seems weird
What I'm wondering about:
Can I cast __mmask32 to __mmask16 and __mmask8?
If I can, I could set it once from the GP register, then shift it in its own register. Like this:
__mmask32 load_mask = _cvtu32_mask32(in_mask);
__mmask32 store_mask = _cvtu32_mask32(out_mask);
while(out < out_end) {
__m128i packed = _mm_maskz_loadu_epi8((__mmask16) load_mask, in);
/* insert computation here. No masks required */
_mm_mask_storeu_epi16(out, (__mmask8) store_mask, packed);
load_mask = _kshiftri_mask32(load_mask, 10);
store_mask = _kshiftri_mask32(store_mask, 8);
in += 10;
out += 8;
}
GCC seems to be fine with this pattern. But Clang and MSVC create worse code, moving the mask in and out of GP registers without any apparent reason.

Accumulating Doubles Into Bins via intrinsics

I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.
A naive implementation is below:
const int N_OBS = 100'000'000;
const int N_BINS = 16;
double obs[N_OBS]; // Observations
int8_t offsets[N_OBS];
double acc[N_BINS] = {0};
for (int i = 0; i < N_OBS; ++i) {
acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}
Is this possible using simd/avx intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.
Modern CPUs are surprisingly good at running your naïve version. On AMD Zen3, I'm getting 48 ms for 100M random numbers of input; that's 18 GB/sec of RAM read bandwidth. That's about 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).
No SIMD is going to help, I'm afraid. Still, the best version I got is the following. Compile with OpenMP support; the switch depends on your C++ compiler.
void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
{
// Count of OpenMP threads = CPU cores to use
constexpr int ompThreadsCount = 4;
// Use an independent set of accumulators per thread, otherwise concurrent updates would corrupt the data.
// Aligning by 64 = cache line, we want to assign cache lines to CPU cores, sharing them is extremely expensive
alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
memset( &accumulators, 0, sizeof( accumulators ) );
// Minimize OMP overhead by dispatching very few large tasks
#pragma omp parallel for schedule(static, 1)
for( int i = 0; i < ompThreadsCount; i++ )
{
// Grab a slice of the output buffer
double* const acc = &accumulators[ i * 16 ];
// Compute a slice of the source data for this thread
const size_t first = i * length / ompThreadsCount;
const size_t last = ( i + 1 ) * length / ompThreadsCount;
// Accumulate into thread-local portion of the buffer
for( size_t i = first; i < last; i++ )
{
const int8_t idx = indices[ i ];
acc[ idx ] += rsi[ i ];
}
}
// Reduce 16*N scalars to 16 with a few AVX instructions
for( int i = 0; i < 16; i += 4 )
{
__m256d v = _mm256_load_pd( &accumulators[ i ] );
for( int j = 1; j < ompThreadsCount; j++ )
{
__m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
v = _mm256_add_pd( v, v2 );
}
_mm256_storeu_pd( rdi + i, v );
}
}
The above version runs in 20.5 ms, which translates to 88% of the RAM bandwidth limit.
P.S. I have no idea why the optimal thread count is 4 here; I have 8 cores / 16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.
If the offsets indeed stay unchanged for thousands (probably even tens) of uses, it is likely worthwhile to "transpose" them, i.e., to store all indices which need to be added to acc[0], then all indices which need to be added to acc[1], etc.
Essentially, what you are doing originally is a sparse-matrix times dense-vector product, with the matrix in compressed-column-storage format (without explicitly storing the 1-values).
As shown in this answer, sparse GEMV products are usually faster if the matrix is stored in compressed-row-storage (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).
Untested example implementation:
#include <cstdint>
#include <cstddef>
#include <vector>
using sparse_matrix = std::vector<std::vector<int> >;
// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs){
sparse_matrix res;
res.resize(n_bins);
// count entries for each bin:
for(int i=0; i<n_obs; ++i) {
// assert(offsets[i] < n_bins);
res[offsets[i]].push_back(i);
}
return res;
}
void accumulate(double acc[], sparse_matrix const& indexes, double const* obs){
for(std::size_t row=0; row<indexes.size(); ++row) {
double sum = 0;
for(int col : indexes[row]) {
// you can manually vectorize this using _mm256_i32gather_pd,
// but clang/gcc should autovectorize this with -ffast-math -O3 -march=native
sum += obs[col];
}
acc[row] = sum;
}
}

my intrinsic function for getting the dot product of an int array is slower than the normal code, what am I doing wrong?

I'm trying to learn about intrinsics and how to properly utilize and optimize with them, and I decided to implement a function to get the dot product of two arrays as a starting point.
I created two functions to get the dot product of two int arrays. One is coded in the normal way: loop through the elements of the two arrays, multiply each pair of elements, and add/accumulate the resulting products to get the dot product.
The other uses intrinsics: I operate on four elements of each array at a time, multiply them using _mm_mullo_epi32, then use two horizontal adds (_mm_hadd_epi32) to get the sum of the current 4 products, add that to the dot product, and move on to the next four elements, repeating until I reach the calculated limit vec_loop; the remaining elements are handled the normal way to avoid reading past the end of the arrays. Then I compare the performance of the two.
header file with the two types of dot product function:
// main.hpp
#ifndef main_hpp
#define main_hpp
#include <iostream>
#include <immintrin.h>
template<typename T>
T scalar_dot(T* a, T* b, size_t len){
T dot_product = 0;
for(size_t i=0; i<len; ++i) dot_product += a[i]*b[i];
return dot_product;
}
int sse_int_dot(int* a, int* b, size_t len){
size_t vec_loop = len/4;
size_t non_vec = len%4;
size_t start_non_vec_i = len-non_vec;
int dot_prod = 0;
for(size_t i=0; i<vec_loop; ++i)
{
__m128i va = _mm_loadu_si128((__m128i*)(a+(i*4)));
__m128i vb = _mm_loadu_si128((__m128i*)(b+(i*4)));
va = _mm_mullo_epi32(va,vb);
va = _mm_hadd_epi32(va,va);
va = _mm_hadd_epi32(va,va);
dot_prod += _mm_cvtsi128_si32(va);
}
for(size_t i=start_non_vec_i; i<len; ++i) dot_prod += a[i]*b[i];
return dot_prod;
}
#endif
cpp code to measure the time taken of each function
// main.cpp
#include <iostream>
#include <chrono>
#include <random>
#include "main.hpp"
int main()
{
// generate random integers
unsigned seed = std::chrono::steady_clock::now().time_since_epoch().count();
std::mt19937_64 rand_engine(seed);
std::mt19937_64 rand_engine2(seed/2);
std::uniform_int_distribution<int> random_number(0,9);
size_t LEN = 10000000;
int* a = new int[LEN];
int* b = new int[LEN];
for(size_t i=0; i<LEN; ++i)
{
a[i] = random_number(rand_engine);
b[i] = random_number(rand_engine2);
}
#ifdef SCALAR
int dot1 = 0;
#endif
#ifdef VECTOR
int dot2 = 0;
#endif
// timing
auto start = std::chrono::high_resolution_clock::now();
#ifdef SCALAR
dot1 = scalar_dot(a,b,LEN);
#endif
#ifdef VECTOR
dot2 = sse_int_dot(a,b,LEN);
#endif
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
std::cout<<"proccess taken "<<duration.count()<<" nanoseconds\n";
#ifdef SCALAR
std::cout<<"\nScalar : Dot product = "<<dot1<<"\n";
#endif
#ifdef VECTOR
std::cout<<"\nVector : Dot product = "<<dot2<<"\n";
#endif
return 0;
}
compilation:
intrinsic version : g++ main.cpp -DVECTOR -msse4.1 -o main.o
normal version : g++ main.cpp -DSCALAR -msse4.1 -o main.o
my machine:
Architecture: x86_64
CPU(s) : 1
CPU core(s): 4
Thread(s) per core: 1
Model name: Intel(R) Pentium(R) CPU N3700 @ 1.60GHz
L1d cache: 96 KiB
L1i cache: 128 KiB
L2 cache: 2 MiB
some Flags : sse, sse2, sse4_1, sse4_2
In main.cpp there are 10000000 elements in each int array. When I compile the code above on my machine, the intrinsic function runs slower than the normal version most of the time: the intrinsic version takes around 97529675 nanoseconds and sometimes even longer, while the normal code only takes around 87568313 nanoseconds. I thought that my intrinsic function should run faster even with the optimization flags off, but it turns out to be somewhat slower.
so my questions are:
why does my intrinsic function run slower? (am I doing something wrong?)
how can I correct my intrinsic implementation, what is the proper way?
does the compiler auto-vectorize/unroll the normal code even when the optimization flags are off?
what is the fastest way to get the dot product given the specs of my machine?
I hope someone can help, thanks
So with @Peter Cordes', @Qubit's and @j6t's suggestions, I tweaked the code a little bit: I now only do the multiplication inside the loop (accumulating into a vector), and I moved the horizontal addition outside the loop. This increased the performance of the intrinsic version from around 97529675 nanoseconds to around 56444187 nanoseconds, which is significantly faster than my previous implementation, with the same compilation flags and 10000000 elements per int array.
here is the new function from main.hpp
int _sse_int_dot(int* a, int* b, size_t len){
size_t vec_loop = len/4;
size_t non_vec = len%4;
size_t start_non_vec_i = len-non_vec;
int dot_product;
__m128i vdot_product = _mm_set1_epi32(0);
for(size_t i=0; i<vec_loop; ++i)
{
__m128i va = _mm_loadu_si128((__m128i*)(a+(i*4)));
__m128i vb = _mm_loadu_si128((__m128i*)(b+(i*4)));
__m128i vc = _mm_mullo_epi32(va,vb);
vdot_product = _mm_add_epi32(vdot_product,vc);
}
vdot_product = _mm_hadd_epi32(vdot_product,vdot_product);
vdot_product = _mm_hadd_epi32(vdot_product,vdot_product);
dot_product = _mm_cvtsi128_si32(vdot_product);
for(size_t i=start_non_vec_i; i<len; ++i) dot_product += a[i]*b[i];
return dot_product;
}
If there is more to improve in this code, please point it out; for now I'm just going to leave it here as the answer.

Efficient C++ code (no libs) for image transformation into custom RGB pixel greyscale

I'm currently working on a C++ implementation of a ToGreyscale method, and I want to ask what the most efficient way is to transform "unsigned char* source" using custom RGB input params.
Below is a current idea, but maybe using a Vector would be better?
uint8_t* pixel = source;
for (int i = 0; i < sourceInfo.height; ++i) {
for (int j = 0; j < sourceInfo.width; ++j, pixel += pixelSize) {
float r = pixel[0];
float g = pixel[1];
float b = pixel[2];
// Do something with r, g, b
}
}
The most efficient single-threaded CPU implementation is a manually optimized SIMD implementation.
SIMD extensions are specific to the processor architecture.
For x86 there are the SSE and AVX extensions, NEON for ARM, AltiVec for PowerPC...
In many cases the compiler is able to generate very efficient code that utilize the SIMD extension without any knowledge of the programmer (just by setting compiler flags).
There are also many cases where the compiler can't generate efficient code (many reasons for that).
When you need to get very high performance, it's recommended to implement it using C intrinsic functions.
Most intrinsic functions are converted directly to assembly instructions (one intrinsic to one instruction), without the need to know assembly.
There are many downsides to using intrinsics (compared to a generic C implementation): the implementation is complicated to code and maintain, and the code is platform-specific and not portable.
A good reference for x86 intrinsics is Intel Intrinsics Guide.
The posted code uses SSE instruction set extension.
The implementation is very efficient, but not the top performance (using AVX2 for example may be faster, but less portable).
For better efficiency my code uses a fixed-point implementation.
In many cases fixed point is more efficient than floating point (but more difficult).
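As a small worked example of the fixed-point scaling used in Rgb2Yx8 below: each coefficient is multiplied by 2^15 and rounded to int16, e.g. (short)(0.2989*32768.0 + 0.5) = 9794; _mm_mulhrs_epi16(r, r_coef) then computes round(r*9794/32768) per element, so summing the three products approximates 0.2989*R + 0.5870*G + 0.1140*B using integer arithmetic only.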
The most complicated part of the specific algorithm is reordering the RGB elements.
When RGB elements are ordered in triples r,g,b,r,g,b,r,g,b... you need to reorder them to rrrr... gggg... bbbb... in order to utilize SIMD.
Naming conventions:
Don't be scared by the long weird variable names.
I am using this weird naming convention (it's my convention), because it helps me follow the code.
r7_r6_r5_r4_r3_r2_r1_r0 for example marks an XMM register with 8 uint16 elements.
The following implementation includes code with and without SSE intrinsics:
//Optimized implementation (use SSE intrinsics):
//----------------------------------------------
#include <intrin.h>
//Convert from RGBRGBRGB... to RRR..., GGG..., BBB...
//Input: Two XMM registers (24 uint8 elements) ordered RGBRGB...
//Output: Three XMM registers ordered RRR..., GGG... and BBB...
// Unpack the result from uint8 elements to uint16 elements.
static __inline void GatherRGBx8(const __m128i r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0,
const __m128i b7_g7_r7_b6_g6_r6_b5_g5,
__m128i &r7_r6_r5_r4_r3_r2_r1_r0,
__m128i &g7_g6_g5_g4_g3_g2_g1_g0,
__m128i &b7_b6_b5_b4_b3_b2_b1_b0)
{
//Shuffle mask for gathering 4 R elements, 4 G elements and 4 B elements (also set last 4 elements to duplication of first 4 elements).
const __m128i shuffle_mask = _mm_set_epi8(9,6,3,0, 11,8,5,2, 10,7,4,1, 9,6,3,0);
__m128i b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4 = _mm_alignr_epi8(b7_g7_r7_b6_g6_r6_b5_g5, r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, 12);
//Gather 4 R elements, 4 G elements and 4 B elements.
//Remark: As I recall _mm_shuffle_epi8 instruction is not so efficient (I think execution is about 5 times longer than other shuffle instructions).
__m128i r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0 = _mm_shuffle_epi8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0, shuffle_mask);
__m128i r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4 = _mm_shuffle_epi8(b7_g7_r7_b6_g6_r6_b5_g5_r5_b4_g4_r4, shuffle_mask);
//Put 8 R elements in lower part.
__m128i b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4_r3_r2_r1_r0 = _mm_alignr_epi8(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 12);
//Put 8 G elements in lower part.
__m128i g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 8);
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 4);
__m128i r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_g3_g2_g1_g0 = _mm_alignr_epi8(zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4, g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz_zz_zz_zz_zz, 12);
//Put 8 B elements in lower part.
__m128i b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz = _mm_slli_si128(r3_r2_r1_r0_b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0, 4);
__m128i zz_zz_zz_zz_zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4 = _mm_srli_si128(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4, 8);
__m128i zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_b3_b2_b1_b0 = _mm_alignr_epi8(zz_zz_zz_zz_zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4, b3_b2_b1_b0_g3_g2_g1_g0_r3_r2_r1_r0_zz_zz_zz_zz, 12);
//Unpack uint8 elements to uint16 elements.
r7_r6_r5_r4_r3_r2_r1_r0 = _mm_cvtepu8_epi16(b7_b6_b5_b4_g7_g6_g5_g4_r7_r6_r5_r4_r3_r2_r1_r0);
g7_g6_g5_g4_g3_g2_g1_g0 = _mm_cvtepu8_epi16(r7_r6_r5_r4_b7_b6_b5_b4_g7_g6_g5_g4_g3_g2_g1_g0);
b7_b6_b5_b4_b3_b2_b1_b0 = _mm_cvtepu8_epi16(zz_zz_zz_zz_r7_r6_r5_r4_b7_b6_b5_b4_b3_b2_b1_b0);
}
//Calculate 8 Grayscale elements from 8 RGB elements.
//Y = 0.2989*R + 0.5870*G + 0.1140*B
//Conversion model used by MATLAB https://www.mathworks.com/help/matlab/ref/rgb2gray.html
static __inline __m128i Rgb2Yx8(__m128i r7_r6_r5_r4_r3_r2_r1_r0,
__m128i g7_g6_g5_g4_g3_g2_g1_g0,
__m128i b7_b6_b5_b4_b3_b2_b1_b0)
{
//Each coefficient is expanded by 2^15, and rounded to int16 (add 0.5 for rounding).
const __m128i r_coef = _mm_set1_epi16((short)(0.2989*32768.0 + 0.5)); //8 coefficients - R scale factor.
const __m128i g_coef = _mm_set1_epi16((short)(0.5870*32768.0 + 0.5)); //8 coefficients - G scale factor.
const __m128i b_coef = _mm_set1_epi16((short)(0.1140*32768.0 + 0.5)); //8 coefficients - B scale factor.
//Multiply input elements by 64 for improved accuracy.
r7_r6_r5_r4_r3_r2_r1_r0 = _mm_slli_epi16(r7_r6_r5_r4_r3_r2_r1_r0, 6);
g7_g6_g5_g4_g3_g2_g1_g0 = _mm_slli_epi16(g7_g6_g5_g4_g3_g2_g1_g0, 6);
b7_b6_b5_b4_b3_b2_b1_b0 = _mm_slli_epi16(b7_b6_b5_b4_b3_b2_b1_b0, 6);
//Use the special intrinsic _mm_mulhrs_epi16 that calculates round(r*r_coef/2^15).
//Calculate Y = 0.2989*R + 0.5870*G + 0.1140*B (use fixed point computations)
__m128i y7_y6_y5_y4_y3_y2_y1_y0 = _mm_add_epi16(_mm_add_epi16(
_mm_mulhrs_epi16(r7_r6_r5_r4_r3_r2_r1_r0, r_coef),
_mm_mulhrs_epi16(g7_g6_g5_g4_g3_g2_g1_g0, g_coef)),
_mm_mulhrs_epi16(b7_b6_b5_b4_b3_b2_b1_b0, b_coef));
//Divide result by 64.
y7_y6_y5_y4_y3_y2_y1_y0 = _mm_srli_epi16(y7_y6_y5_y4_y3_y2_y1_y0, 6);
return y7_y6_y5_y4_y3_y2_y1_y0;
}
//Convert single row from RGB to Grayscale (use SSE intrinsics).
//I0 points source row, and J0 points destination row.
//I0 -> rgbrgbrgbrgbrgbrgb...
//J0 -> yyyyyy
static void Rgb2GraySingleRow_useSSE(const unsigned char I0[],
const int image_width,
unsigned char J0[])
{
int x; //Index in J0.
int srcx; //Index in I0.
__m128i r7_r6_r5_r4_r3_r2_r1_r0;
__m128i g7_g6_g5_g4_g3_g2_g1_g0;
__m128i b7_b6_b5_b4_b3_b2_b1_b0;
srcx = 0;
//Process 8 pixels per iteration.
for (x = 0; x < image_width; x += 8)
{
//Load 8 elements of each color channel R,G,B from first row.
__m128i r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0 = _mm_loadu_si128((__m128i*)&I0[srcx]); //Unaligned load of 16 uint8 elements
__m128i b7_g7_r7_b6_g6_r6_b5_g5 = _mm_loadu_si128((__m128i*)&I0[srcx+16]); //Unaligned load of (only) 8 uint8 elements (lower half of XMM register).
//Separate RGB, and put together R elements, G elements and B elements (together in same XMM register).
//Result is also unpacked from uint8 to uint16 elements.
GatherRGBx8(r5_b4_g4_r4_b3_g3_r3_b2_g2_r2_b1_g1_r1_b0_g0_r0,
b7_g7_r7_b6_g6_r6_b5_g5,
r7_r6_r5_r4_r3_r2_r1_r0,
g7_g6_g5_g4_g3_g2_g1_g0,
b7_b6_b5_b4_b3_b2_b1_b0);
//Calculate 8 Y elements.
__m128i y7_y6_y5_y4_y3_y2_y1_y0 = Rgb2Yx8(r7_r6_r5_r4_r3_r2_r1_r0,
g7_g6_g5_g4_g3_g2_g1_g0,
b7_b6_b5_b4_b3_b2_b1_b0);
//Pack uint16 elements to 16 uint8 elements (put result in single XMM register). Only lower 8 uint8 elements are relevant.
__m128i j7_j6_j5_j4_j3_j2_j1_j0 = _mm_packus_epi16(y7_y6_y5_y4_y3_y2_y1_y0, y7_y6_y5_y4_y3_y2_y1_y0);
//Store 8 elements of Y in row Y0, and 8 elements of Y in row Y1.
_mm_storel_epi64((__m128i*)&J0[x], j7_j6_j5_j4_j3_j2_j1_j0);
srcx += 24; //Advance 24 source bytes per iteration.
}
}
//Convert image I from pixel ordered RGB to Grayscale format.
//Conversion formula: Y = 0.2989*R + 0.5870*G + 0.1140*B (Rec.ITU-R BT.601)
//Formula is based on MATLAB rgb2gray function: https://www.mathworks.com/help/matlab/ref/rgb2gray.html
//Implementation uses SSE intrinsics for performance optimization.
//Use fixed point computations for better performance.
//I - Input image in pixel ordered RGB format.
//image_width - Number of columns of I.
//image_height - Number of rows of I.
//J - Destination "image" in Grayscale format.
//I is pixel ordered RGB color format (size in bytes is image_width*image_height*3):
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//RGBRGBRGBRGBRGBRGB
//
//J is in Grayscale format (size in bytes is image_width*image_height):
//YYYYYY
//YYYYYY
//YYYYYY
//YYYYYY
//
//Limitations:
//1. image_width must be a multiple of 8.
//2. I and J must be two separate arrays (in place computation is not supported).
//3. Rows of I and J are continues in memory (bytes stride is not supported, [but simple to add]).
//
//Comments:
//1. The conversion formula is incorrect, but it's a commonly used approximation.
//2. Code uses SSE 4.1 instruction set.
// Better performance can be achieved using an AVX2 implementation.
// (AVX2 is supported by Intel Core 4'th generation and above, and new AMD processors).
//3. The code is not the best SSE optimization:
// Uses unaligned load and store operations.
// Utilize only half XMM register in few cases.
// Instruction selection is probably sub-optimal.
void Rgb2Gray_useSSE(const unsigned char I[],
const int image_width,
const int image_height,
unsigned char J[])
{
//I0 points source image row.
const unsigned char *I0; //I0 -> rgbrgbrgbrgbrgbrgb...
//J0 points destination image row.
unsigned char *J0; //J0 -> YYYYYY
int y; //Row index
//Process one row per iteration.
for (y = 0; y < image_height; y ++)
{
I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is R,G,B).
J0 = &J[y*image_width]; //Output Y row width is image_width bytes (one Y element per pixel).
//Convert row I0 from RGB to Grayscale.
Rgb2GraySingleRow_useSSE(I0,
image_width,
J0);
}
}
//Convert single row from RGB to Grayscale (simple C code without intrinsics).
static void Rgb2GraySingleRow_Simple(const unsigned char I0[],
const int image_width,
unsigned char J0[])
{
int x; //index in J0.
int srcx; //Index in I0.
srcx = 0;
//Process 1 pixel per iteration.
for (x = 0; x < image_width; x++)
{
float r = (float)I0[srcx]; //Load red pixel and convert to float
float g = (float)I0[srcx+1]; //Green
float b = (float)I0[srcx+2]; //Blue
float gray = 0.2989f*r + 0.5870f*g + 0.1140f*b; //Convert to Grayscale (use BT.601 conversion coefficients).
J0[x] = (unsigned char)(gray + 0.5f); //Add 0.5 for rounding.
srcx += 3; //Advance 3 source bytes per iteration.
}
}
//Convert RGB to Grayscale using simple C code (without SIMD intrinsics).
//Use as reference (for time measurements).
void Rgb2Gray_Simple(const unsigned char I[],
const int image_width,
const int image_height,
unsigned char J[])
{
//I0 points source image row.
const unsigned char *I0; //I0 -> rgbrgbrgbrgbrgbrgb...
//J0 points destination image row.
unsigned char *J0; //J0 -> YYYYYY
int y; //Row index
//Process one row per iteration.
for (y = 0; y < image_height; y ++)
{
I0 = &I[y*image_width*3]; //Input row width is image_width*3 bytes (each pixel is R,G,B).
J0 = &J[y*image_width]; //Output Y row width is image_width bytes (one Y element per pixel).
//Convert row I0 from RGB to Grayscale.
Rgb2GraySingleRow_Simple(I0,
image_width,
J0);
}
}
On my machine, the manually optimized code is about 3 times faster.
You can compute the mean value of the RGB channels, then assign the result to each channel.
uint8_t* pixel = source;
for (int i = 0; i < sourceInfo.height; ++i) {
for (int j = 0; j < sourceInfo.width; ++j, pixel += pixelSize) {
float grayscaleValue = 0;
for (int k = 0; k < 3; k++) {
grayscaleValue += pixel[k];
}
grayscaleValue /= 3;
for (int k = 0; k < 3; k++) {
pixel[k] = grayscaleValue;
}
}
}

Parallelizing a for loop gives no performance gain

I have an algorithm which converts a bayer image channel to RGB. In my implementation I have a single nested for loop which iterates over the bayer channel, calculates the rgb index from the bayer index and then sets that pixel's value from the bayer channel.
The main thing to notice here is that each pixel can be calculated independently of the other pixels (it doesn't rely on previous calculations), so the algorithm is a natural candidate for parallelization. The calculation does however rely on some preset arrays which all threads access at the same time but do not change.
However, when I tried parallelizing the main for with MS's concurrency::parallel_for, I gained no boost in performance. In fact, for an input of size 3264x2540 running on a 4-core CPU, the non-parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).
Using Intel's compiler with tbb::parallel_for gave near exact results.
For comparison, I started out with this algorithm implemented in C#, in which I also used parallel_for loops, and there I encountered nearly 4x performance gains (I opted for C++ because for this particular task C++ was faster even with a single core).
Any ideas what is preventing my code from parallelizing well?
My code:
template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
//Translates index offset in Bayer image to channel offset in RGB image
int offsets[4];
//calculate offsets according to color space
switch (ColorSpace)
{
case ColorSpace::BGGR:
offsets[0] = 2;
offsets[1] = 1;
offsets[2] = 1;
offsets[3] = 0;
break;
...other color spaces
}
memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row%2)*2 + (col%2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
First of all, your algorithm is memory-bandwidth bound. That is, memory loads/stores outweigh any index calculations you do.
Vector operations like SSE/AVX would not help either - you are not doing any intensive calculations.
Increasing the amount of work per iteration is also useless: both PPL and TBB are smart enough not to create a thread per iteration; they use some good partitioning, which additionally tries to preserve locality. For instance, here is a quote about tbb::parallel_for:
When worker threads are available, parallel_for executes iterations in non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.
What really matters is to reduce memory operations. Any superfluous traversal over input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.
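For instance, one way to avoid the separate full-image memset (an untested sketch based on the loop from the question) is to zero each output row inside the same parallel loop that fills it, so the output buffer is only swept through once:

parallel_for(0, Height, [&] (int row)
{
    T* rgbRow = RgbChannel + row * Width * 3;
    memset(rgbRow, 0, Width * 3 * sizeof(T));    // zero only this row's output slice
    for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
    {
        auto offset = (row % 2) * 2 + (col % 2); // 0...3
        rgbRow[col * 3 + offsets[offset]] = BayerChannel[bayerIndex];
    }
});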
You are fully traversing the input and output data. Even if you skip something in the output, that doesn't matter, because memory operations happen in 64-byte chunks on modern hardware. So, calculate the size of your input and output, measure the time of the algorithm, divide size by time, and compare the result with the maximal characteristics of your system (measured with a benchmark, for instance).
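As a rough worked example (assuming 4-byte float pixels, as in the test program below): the input is 3264 * 2540 * 4 bytes ≈ 33 MB and the output is three times that (≈ 100 MB), so one full pass in 34 ms already moves roughly 133 MB / 0.034 s ≈ 3.9 GB/s, not counting the extra write pass done by the memset.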
I have made tests for Microsoft PPL, OpenMP and a native for loop; the results are (I used 8x of your height):
Native_For 0.21 s
OpenMP_For 0.15 s
Intel_TBB_For 0.15 s
MS_PPL_For 0.15 s
If the memset is removed, then:
Native_For 0.15 s
OpenMP_For 0.09 s
Intel_TBB_For 0.09 s
MS_PPL_For 0.09 s
As you can see, memset (which is highly optimized) is responsible for a significant amount of the execution time, which shows how memory-bound your algorithm is.
FULL SOURCE CODE:
#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>
using namespace boost;
using namespace std;
const auto Width = 3264;
const auto Height = 2540*8;
struct MS_PPL_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
concurrency::parallel_for(first,last,f);
}
};
struct Intel_TBB_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
tbb::parallel_for(first,last,f);
}
};
struct Native_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
for(; first!=last; ++first) f(first);
}
};
struct OpenMP_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
#pragma omp parallel for
for(auto i=first; i<last; ++i) f(i);
}
};
template<typename T>
struct ConvertBayerToRgbImageAsIs
{
const T* BayerChannel;
T* RgbChannel;
template<typename For>
void operator()(For for_)
{
cout << type_name<For>() << "\t";
progress_timer t;
int offsets[] = {2,1,1,0};
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
for_(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row % 2)*2 + (col % 2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
};
int main()
{
vector<float> bayer(Width*Height);
vector<float> rgb(Width*Height*3);
ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
for(auto i=0;i!=4;++i)
{
mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
cout << string(16,'_') << endl;
}
}
Synchronization overhead
I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and run the computation in parallel, you would have noticed a large gain. Try to design the loop in a way that causes fewer iterations and more work per iteration. The reasoning behind this is that there is too much synchronization being done.
Cache usage
An important factor may be how the data is split (partitioned) for the processing. If the processed rows are separated as in the bad case below, then more rows will cause a cache miss. This effect will become more important with each additional thread, because the distance between rows will be greater. If you are certain that the parallelizing function performs reasonable partitioning, then manual work-splitting will not give any results.
bad good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2
Also make sure that you access your data in the same order it is laid out; it is possible that each access to offsets[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive: almost every operation either reads from or writes to memory, so preventing cache misses and minimizing memory accesses is crucial.
Code optimizations
The optimizations shown below may already be done by the compiler and may not give better results, but it is worth knowing that they can be done.
// is the memset really necessary?
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
    int rowMod = (row & 1) << 1;
    for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex = row * Width * 3;
         col < Width; col += 2, bayerIndex += 2, tripleBayerIndex += 6)
    {
        auto rgbIndex = tripleBayerIndex + offsets[rowMod];
        RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
        // unrolled the loop to save the col & 1 operation
        rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
        RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
    }
});
Here comes my suggestion:
Compute larger chunks in parallel
get rid of modulo/multiplication
unroll inner loop to compute one full pixel (simplifies code)
template<typename T> static void ConvertBayerToRgbImageAsIsNew(T* BayerChannel, T* RgbChannel, int Width, int Height)
{
    // convert BGGR->RGB
    // one chunk of rows per hardware thread
    const int chunk = Height / static_cast<int>(thread::hardware_concurrency());
    parallel_for(0, Height, chunk, [&] (int stride)
    {
        for (auto row = stride; row < stride + chunk && row < Height; row++)
        {
            for (auto col = row*Width, rgbCol = row*Width; col < row*Width + Width; rgbCol += 3, col += 4)
            {
                RgbChannel[rgbCol+0] = BayerChannel[col+3];
                RgbChannel[rgbCol+1] = BayerChannel[col+1];
                // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
                RgbChannel[rgbCol+2] = BayerChannel[col+0];
            }
        }
    });
}
This code is 60% faster than the original version but still only half as fast as the non-parallelized version on my laptop. This seemed to be due to the memory-boundedness of the algorithm, as others have already pointed out.
Edit: But I was not happy with that. I could greatly improve the parallel performance by going from parallel_for to std::async:
int hc = thread::hardware_concurrency();
future<void>* res = new future<void>[hc];
for (int i = 0; i < hc; ++i)
{
    res[i] = async(Converter<char>(bayerChannel, rgbChannel, rows, cols, rows/hc*i, rows/hc*(i+1)));
}
for (int i = 0; i < hc; ++i)
{
    res[i].wait();
}
delete [] res;
with Converter being a simple class:
template <class T> class Converter
{
public:
    Converter(T* BayerChannel, T* RgbChannel, int Width, int Height, int startRow, int endRow) :
        BayerChannel(BayerChannel), RgbChannel(RgbChannel), Width(Width), Height(Height),
        startRow(startRow), endRow(endRow)
    {
    }

    void operator()()
    {
        // convert BGGR->RGB
        for (int row = startRow; row < endRow; row++)
        {
            for (auto col = row*Width, rgbCol = row*Width; col < row*Width + Width; rgbCol += 3, col += 4)
            {
                RgbChannel[rgbCol+0] = BayerChannel[col+3];
                RgbChannel[rgbCol+1] = BayerChannel[col+1];
                // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
                RgbChannel[rgbCol+2] = BayerChannel[col+0];
            }
        }
    }

private:
    T* BayerChannel;
    T* RgbChannel;
    int Width;
    int Height;
    int startRow;
    int endRow;
};
This is now 3.5 times faster than the non-parallelized version. From what I have seen in the profiler so far, I assume that the work-stealing approach of parallel_for incurs a lot of waiting and synchronization overhead.
I have not used tbb::parallel_for nor concurrency::parallel_for, but if your numbers are correct they seem to carry too much overhead. However, I strongly advise you to run more than 10 iterations when testing, and also be sure to do enough warm-up iterations before timing.
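A minimal sketch of what I mean by warm-up plus averaging (the helper name, run counts and clock are illustrative):
#include <chrono>

template <typename Work>
double time_averaged_ms(Work work, int warmup = 10, int runs = 100)
{
    for (int i = 0; i < warmup; ++i)
        work();                                  // discard warm-up runs (caches, thread pools, ...)

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
        work();
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count() / runs;
}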
I tested your exact code using three different methods, averaged over 1000 tries.
Serial: 14.6 ± 1.0 ms
std::async: 13.6 ± 1.6 ms
workers: 11.8 ± 1.2 ms
The first is serial calculation. The second is done using four calls to std::async. The last is done by sending four jobs to four already started (but sleeping) background threads.
The gains aren't big, but at least they are gains. I did the test on a 2012 MacBook Pro with two hyper-threaded cores (4 logical cores).
For reference, here's my std::async parallel for:
// requires <thread>, <future>, <vector>
template<typename Int = int, class Fun>
void std_par_for(Int beg, Int end, const Fun& fun)
{
    auto N = static_cast<Int>(std::thread::hardware_concurrency());
    std::vector<std::future<void>> futures;
    for (Int ti = 0; ti < N; ++ti) {
        // each task gets one contiguous slice of [beg, end)
        Int b = beg + ti * (end - beg) / N;
        Int e = beg + (ti + 1) * (end - beg) / N;
        if (ti == N - 1) { e = end; }
        futures.emplace_back(std::async([&, b, e]() {
            for (Int ix = b; ix < e; ++ix) {
                fun(ix);
            }
        }));
    }
    for (auto&& f : futures) {
        f.wait();
    }
}
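Usage against the original per-row loop would look roughly like this (the buffer names and the conversion body are placeholders matching the question's code, not part of std_par_for itself):
// assumes std_par_for from above plus the question's Width, Height and offsets
void convert_parallel(const float* BayerChannel, float* RgbChannel)
{
    std_par_for(0, Height, [&](int row) {
        for (int col = 0, bayerIndex = row * Width; col < Width; ++col, ++bayerIndex) {
            int offset = (row % 2) * 2 + (col % 2);   // 0...3
            RgbChannel[bayerIndex * 3 + offsets[offset]] = BayerChannel[bayerIndex];
        }
    });
}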
Things to check or do
Are you using a Core 2 or older processor? They have a very narrow memory bus that's easy to saturate with code like this. In contrast, 4-channel Sandy Bridge-E processors require multiple threads to saturate the memory bus (it's not possible for a single memory-bound thread to fully saturate it).
Have you populated all of your memory channels? E.g. if you have a dual-channel CPU but have just one RAM stick installed, or two sticks on the same channel, you're getting half the available bandwidth.
How are you timing your code?
The timing should be done inside the application like Evgeny Panasyuk suggests.
You should do multiple runs within the same application. Otherwise, you may be timing one-time startup code to launch the thread pools, etc.
Remove the superfluous memset, as others have explained.
As ogni42 and others have suggested, unroll your inner loop (I didn't bother checking the correctness of that solution, but if it's wrong, you should be able to fix it). This is orthogonal to the main question of parallelization, but it's a good idea anyway.
Make sure your machine is otherwise idle when doing performance testing.
Additional timings
I've merged the suggestions of Evgeny Panasyuk and ogni42 in a bare-bones C++03 Win32 implementation:
#include "stdafx.h"
#include <omp.h>
#include <vector>
#include <iostream>
#include <stdio.h>
using namespace std;
const int Width = 3264;
const int Height = 2540*8;
class Timer {
private:
    string name;
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    LARGE_INTEGER frequency;
public:
    Timer(const char *name) : name(name) {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start);
    }
    ~Timer() {
        QueryPerformanceCounter(&stop);
        LARGE_INTEGER time;
        time.QuadPart = stop.QuadPart - start.QuadPart;
        double elapsed = ((double)time.QuadPart / (double)frequency.QuadPart);
        printf("%-20s : %5.2f\n", name.c_str(), elapsed);
    }
};
static const int offsets[] = {2,1,1,0};

template <typename T>
void Inner_Orig(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = 0, bayerIndex = row * Width;
         col < Width; col++, bayerIndex++)
    {
        int offset = (row % 2)*2 + (col % 2); // 0...3
        int rgbIndex = bayerIndex * 3 + offsets[offset];
        RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
    }
}

// adapted from ogni42's answer
template <typename T>
void Inner_Unrolled(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = row*Width, rgbCol = row*Width;
         col < row*Width + Width; rgbCol += 3, col += 4)
    {
        RgbChannel[rgbCol+0] = BayerChannel[col+3];
        RgbChannel[rgbCol+1] = BayerChannel[col+1];
        // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
        RgbChannel[rgbCol+2] = BayerChannel[col+0];
    }
}
int _tmain(int argc, _TCHAR* argv[])
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    for (int i = 0; i < 4; ++i)
    {
        {
            Timer t("serial_orig");
            for (int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_orig");
            #pragma omp parallel for
            for (int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_orig");
            #pragma omp parallel for schedule(static)
            for (int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("serial_unrolled");
            for (int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_unrolled");
            #pragma omp parallel for
            for (int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_unrolled");
            #pragma omp parallel for schedule(static)
            for (int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        printf("-----------------------------\n");
    }
    return 0;
}
Here are the timings I see on a triple-channel 8-way hyperthreaded Core i7-950 box:
serial_orig : 0.13
omp_dynamic_orig : 0.10
omp_static_orig : 0.10
serial_unrolled : 0.06
omp_dynamic_unrolled : 0.04
omp_static_unrolled : 0.04
The "static" versions tell the compiler to evenly divide up the work between threads at loop entry. This avoids the overhead of attempting to do work stealing or other dynamic load balancing. For this code snippet, it doesn't seem to make a difference, even though the workload is very uniform across threads.
The performance reduction might be happening because you are trying to distribute the for loop over "row"-many threads, far more than the cores actually available, so it again degrades into something like sequential execution, with the overhead of parallelism on top.
I'm not very familiar with parallel for loops, but it seems to me the contention is in the memory access. It appears your threads are overlapping in their access to the same pages.
Can you break up your array accesses into 4k chunks roughly aligned with the page boundary?
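A minimal sketch of how one might pick such a chunk size (the page-size constant and helper are illustrative, and this assumes the buffer itself starts on a page boundary):
#include <cstddef>
#include <numeric>   // std::lcm, C++17

// Rows per chunk such that each chunk spans a whole number of 4 KiB pages,
// so different threads never touch the same page of the buffer.
std::size_t rows_per_page_aligned_chunk(std::size_t widthPixels, std::size_t bytesPerPixel)
{
    const std::size_t pageSize = 4096;
    const std::size_t rowBytes = widthPixels * bytesPerPixel;
    return std::lcm(rowBytes, pageSize) / rowBytes;
}
// e.g. for Width = 3264 floats, one row is 13056 bytes, and this yields
// 16 rows per chunk = 208896 bytes = exactly 51 pages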
There is no point in talking about parallel performance before having optimized the for loop for serial code. Here is my attempt at that (some good compilers may be able to produce similarly optimized versions, but I'd rather not rely on that):
parallel_for(0, Height, [=] (int row) noexcept
{
    for (auto col = 0, bayerindex = row*Width,
              rgb0 = 3*bayerindex + offsets[(row%2)*2],
              rgb1 = 3*bayerindex + offsets[(row%2)*2+1];
         col < Width; col += 2, bayerindex += 2, rgb0 += 6, rgb1 += 6)
    {
        RgbChannel[rgb0] = BayerChannel[bayerindex  ];
        RgbChannel[rgb1] = BayerChannel[bayerindex+1];
    }
});