Accumulating Doubles Into Bins via intrinsics

Accumulating Doubles Into Bins via intrinsics - c++

I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.
A naive implementation is below:
const int N_OBS = 100`000`000;
const int N_BINS = 16;
double obs[N_OBS]; // Observations
int8_t offsets[N_OBS];
double acc[N_BINS] = {0};
for (int i = 0; i < N_OBS; ++i) {
acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}
Is this possible using simd/avx intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.

Modern CPUs are surprisingly good running your naïve version. On AMD Zen3, I’m getting 48ms for 100M random numbers on input, that’s 18 GB/sec RAM read bandwidth. That’s like 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).
No SIMD gonna help, I’m afraid. Still, the best version I got is the following. Compile with OpenMP support, the switch depends on your C++ compiler.
void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
{
// Count of OpenMP threads = CPU cores to use
constexpr int ompThreadsCount = 4;
// Use independent set of accumulators per thread, otherwise concurrency gonna corrupt data.
// Aligning by 64 = cache line, we want to assign cache lines to CPU cores, sharing them is extremely expensive
alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
memset( &accumulators, 0, sizeof( accumulators ) );
// Minimize OMP overhead by dispatching very few large tasks
#pragma omp parallel for schedule(static, 1)
for( int i = 0; i < ompThreadsCount; i++ )
{
// Grab a slice of the output buffer
double* const acc = &accumulators[ i * 16 ];
// Compute a slice of the source data for this thread
const size_t first = i * length / ompThreadsCount;
const size_t last = ( i + 1 ) * length / ompThreadsCount;
// Accumulate into thread-local portion of the buffer
for( size_t i = first; i < last; i++ )
{
const int8_t idx = indices[ i ];
acc[ idx ] += rsi[ i ];
}
}
// Reduce 16*N scalars to 16 with a few AVX instructions
for( int i = 0; i < 16; i += 4 )
{
__m256d v = _mm256_load_pd( &accumulators[ i ] );
for( int j = 1; j < ompThreadsCount; j++ )
{
__m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
v = _mm256_add_pd( v, v2 );
}
_mm256_storeu_pd( rdi + i, v );
}
}
The above version results in 20.5ms time, translates to 88% of RAM bandwidth limit.
P.S. I have no idea why the optimal threads count is 4 here, I have 8 cores/16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.

If indeed the offsets do not change for thousands (probably even tens) of times, it is likely worthwile to "transpose" them, i.e., to store all indices which need to be added to acc[0], then all indices which need to be added to acc[1], etc.
Essentially, what you are doing originally is a sparse-matrix times dense-vector product with the matrix in compressed-column-storage format (without explicitly storing the 1-values).
As shown in this answer sparse GEMV products are usually faster if the matrix is stored in compressed-row-storage (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).
Untested example implementation:
using sparse_matrix = std::vector<std::vector<int> >;
// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs){
sparse_matrix res;
res.resize(n_bins);
// count entries for each bin:
for(int i=0; i<n_obs; ++i) {
// assert(offsets[i] < n_bins);
res[offsets[i]].push_back(i);
}
return res;
}
void accumulate(double acc[], sparse_matrix const& indexes, double const* obs){
for(std::size_t row=0; row<indexes.size(); ++row) {
double sum = 0;
for(int col : indexes[row]) {
// you can manually vectorize this using _mm256_i32gather_pd,
// but clang/gcc should autovectorize this with -ffast-math -O3 -march=native
sum += obs[col];
}
acc[row] = sum;
}
}

Related

Fully Connected Layer (dot product) using AVX

I have the following C++ code to perform the multiply and accumulate steps of a fully connected layer (without the bias). Basically I just do a dot product using a vector (inputs) and a matrix (weights). I used AVX vectors to speed up the operation.
const float* src = inputs[0]->buffer();
const float* scl = weights->buffer();
float* dst = outputs[0]->buffer();
SizeVector in_dims = inputs[0]->getTensorDesc().getDims();
SizeVector out_dims = outputs[0]->getTensorDesc().getDims();
const int in_neurons = static_cast<int>(in_dims[1]);
const int out_neurons = static_cast<int>(out_dims[1]);
for(size_t n = 0; n < out_neurons; n++){
float accum = 0.0;
float temp[4] = {0,0,0,0};
float *p = temp;
__m128 in, ws, dp;
for(size_t i = 0; i < in_neurons; i+=4){
// read and save the weights correctly by applying the mask
temp[0] = scl[(i+0)*out_neurons + n];
temp[1] = scl[(i+1)*out_neurons + n];
temp[2] = scl[(i+2)*out_neurons + n];
temp[3] = scl[(i+3)*out_neurons + n];
// load input neurons sequentially
in = _mm_load_ps(&src[i]);
// load weights
ws = _mm_load_ps(p);
// dot product
dp = _mm_dp_ps(in, ws, 0xff);
// accumulator
accum += dp.m128_f32[0];
}
// save the final result
dst[n] = accum.m128_f32[0];
}
It works but the speedup is far from what I expected. As you can see below a convolutional layer with x24 more operations than my custom dot product layer takes less time. This makes no sense and there should be much more room for improvements. What are my major mistakes when trying to use AVX? (I'm new to AVX programming so I don't fully understand from where I should start to look to fully optimize the code).
**Convolutional Convolutional Layer Fully Optimized (AVX)**
Layer: CONV3-32
Input: 28x28x32 = 25K
Weights: (3*3*32)*32 = 9K
Number of MACs: 3*3*27*27*32*32 = 7M
Execution Time on OpenVINO framework: 0.049 ms
**My Custom Dot Product Layer Far From Optimized (AVX)**
Layer: FC
Inputs: 1x1x512
Outputs: 576
Weights: 3*3*64*512 = 295K
Number of MACs: 295K
Execution Time on OpenVINO framework: 0.197 ms
Thanks for all help in advance!

Addendum: What you are doing is actually a Matrix-Vector-product. It is well-understood how to implement this efficiently (although with caching and instruction-level parallelism it is not completely trivial). The rest of the answer just shows a very simple vectorized implementation.
You can drastically simplify your implementation by incrementing n+=8 and i+=1 (assuming out_neurons is a multiple of 8, otherwise, some special handling needs to be done for the last elements), i.e., you can accumulate 8 dst values at once.
A very simple implementation assuming FMA is available (otherwise use multiplication and addition):
void dot_product(const float* src, const float* scl, float* dst,
const int in_neurons, const int out_neurons)
{
for(size_t n = 0; n < out_neurons; n+=8){
__m256 accum = _mm256_setzero_ps();
for(size_t i = 0; i < in_neurons; i++){
accum = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons+n]), _mm256_set1_ps(src[i]), accum);
}
// save the result
_mm256_storeu_ps(dst+n ,accum);
}
}
This could still be optimized e.g., by accumulating 2, 4, or 8 dst packets inside the inner loop, which would not only save some broadcast operations (the _mm256_set1_ps instruction), but also compensate latencies of the FMA instruction.
Godbolt-Link, if you want to play around with the code: https://godbolt.org/z/mm-YHi

Matrix transpose and population count

I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column.
For instance for n=4:
1101
0101
0001
1001
M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };
result = { 2, 2, 0, 4};
I can obviously
transpose the matrix M into a matrix M'
popcount each row of M'.
Good algorithms exist for matrix transposition and popcounting through bit manipulation.
My question is: would it be possible to "merge" such algorithms into a single one ?
Note that N could be quite large (say 1024 and more) regarding 64 bits architecture.

Related: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 and https://github.com/mklarqvist/positional-popcount
I had another idea which I haven't finished writing up nicely.
Godbolt link to messy work-in-progress which doesn't have correct loop bounds / cleanup, but for large buffers runs ~3x faster than #edrezen's version on my Skylake i7-6700k, with g++7.3 -O3 -march=native. See the test_SWAR_avx2 function. (I know it doesn't compile on Godbolt; Agner Fog's asmlib.h isn't present.)
I might have some columns in the wrong order, too, but from stepping through the asm I think it's doing the right amount of work. i.e. any necessary bugfixes won't slow it down.
I used 16-bit accumulators, so another outer loop might be necessary if you care about inputs large enough to overflow 16-bit per-column counters.
Interesting observation: An earlier buggy version of my loop used sum0123 twice in store_globalsums_from_vec16, leaving sum4567 unused, so it optimized away in the main loop. With less work, gcc fully unrolled the large for(int i=0 ; i<5 ; i++) loop, and the code ran slower, like about 1 cycle per byte instead of 0.5. The loop was probably too big for the uop cache or something (I didn't profile yet but a front-end decode bottleneck would explain it). For some reason #edrezen's version is only running at about 1.5c/B for me, not the ~1.25 reported in the answer. My CPU is actually running 3.9GHz, but Agner Fog's library detects it at 4.0, but that's not enough to explain it.
Also, gcc spills sum4567_16bit to the stack, so we're already pushing the boundary of register pressure without AVX512. It's updated infrequently and isn't a problem, but needing more accumulators in the inner loop could be.
Your data layout isn't clear about when the number of columns isn't 32.
It seems that for each uint32_t chunk of 32 columns, you have all the rows stored contiguously in memory. i.e. looping over the rows for a column is efficient. If you had more than 32 columns, the rows for columns 32..63 will be contiguous and come after all the rows for columns 0..31.
(If instead you have all the columns for a single row contiguous, you could still use this idea, but might need to spill/reload some accumulators to memory, or let the compiler do that for you if it makes good choices.)
So loading a 32-byte (8 dword) vector gets 8 rows of data for one column chunk. That's extremely convenient, and allows widening from 1-bit (in memory) to 2-bit accumulators, then grab more data before we widen to 4-bit, and so on, summing along the way so we get significant work done while the data is still dense. (Rather than only adding 1 bit (0 or 1) per byte to vector accumulators.)
The more we unroll, the more data we can grab from memory to make better use of the coding space in our vectors. i.e. our variables have higher entropy. Throwing around more data (in terms of bits of memory that contributed to it) per vpaddb/w/d/q or unpack/shuffle instruction is a Good Thing.
Accumulators narrower than 1 byte within a SIMD vector is basically an https://en.wikipedia.org/wiki/SWAR technique, where you have to AND away bits that you shift past an element boundary, because we don't have SIMD element boundaries to do it for us. (And we avoid overflow anyway, so ADD carrying into the next element isn't a problem.)
Each inner loop iteration:
take a vector of data from the same columns in each of 2 or 3 (groups of) rows. So you either have 3 * 8 rows from one chunk of 32 columns, or 3 rows of 256 columns.
mask them with set1(0b01010101) to get the even (low) bits, and with (vec>>1) & mask (_mm256_srli_epi32(v,1)) to get the odd (high) bits. Use _mm256_add_epi8 to accumulate within those 2-bit accumulators. They can't overflow with only 3 ones, so carry-propagation boundaries don't actually matter.
Each byte of your vector has 4 separate vertical sums, and you have two vectors (odd/even).
Repeat the above again, to get another pair of vectors from 3 vectors of data from memory.
Combine again to get 4 vectors of 4-bit accumulators (with possible values 0..6). Still without mixing bits from within a single 32-bit element, of course, because we must never do that. Shifts only move bits for odd / high columns to the bottom of the 2-bit or 4-bit unit that contains them so they can be added with bits that were moved the same way in other vectors.
_mm256_unpacklo/hi_epi8 and mask or shift+mask to get 8-bit accumulators
Put the above in a loop that runs up to 5 times, so the 0..12 accumulator values go up to 0..60 (i.e. leaving 2 bits of headroom for unpacking the 8-bit accumulators, using all their coding space.)
If you have the data layout from your answer, then we can add data from dword elements within the same vector. We can do that so we don't run out of registers when widening our accumulators up to 16-bit (because x86-64 only has 16 YMM registers, and we need some for constants.)
_mm256_unpacklo/hi_epi16 and add, to interleave pairs of 8-bit counters so a group of counters for the same column has expanded from a dword to a qword.
Repeat this general idea to reduce the number of registers (or __m256i variables) your accumulators are spread over.
Efficiently handling the lack of a lane-crossing 2-input byte or word shuffle is inconvenient, but it's a pretty small part of the total work. vextracti128 / vpaddb xmm -> vpmovzxbw worked well enough.

I made some benchmark between the two approaches:
transpose + popcount
update row by row
I wrote a naive version and an AVX2 one for both approaches. I used some functions (found on stackoverflow or elsewhere) for the AVX2 "transpose+popcount" approach.
In my test, I make the assumption that the input is a nbRowsx32 matrix in a bits packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array of uint32_t.
The code is the following:
#include <cinttypes>
#include <cstdio>
#include <cstring>
#include <cmath>
#include <cassert>
#include <chrono>
#include <immintrin.h>
#include <asmlib.h>
using namespace std;
using namespace std::chrono;
// see https://stackoverflow.com/questions/24225786/fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector
static __m256i expand_bits_to_bytes (uint32_t x);
// see https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/
static void sse_trans(char const *inp, char *out);
static double deviation (double n, double sum2, double sum);
////////////////////////////////////////////////////////////////////////////////
// Naive approach (matrix transposition)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint8_t transpo[32][32]; memset (transpo, 0, sizeof(transpo));
for (uint64_t k=0; k<nbRows; k+=32)
{
// We unpack and transpose the input into a 32x32 bytes matrix
for (size_t row=0; row<32; row++)
{
for (size_t col=0; col<32; col++) { transpo[col][row] = (bitmap[k+row] >> col) & 1 ; }
}
for (size_t row=0; row<32; row++)
{
// We popcount the current row
u_int8_t sum=0;
for (size_t col=0; col<32; col++) { sum += transpo[row][col]; }
// We update the corresponding global sum
globalSums[row] += sum;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// Naive approach (row by row)
////////////////////////////////////////////////////////////////////////////////
void test_update_row_by_row_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
for (uint64_t row=0; row<nbRows; row++)
{
for (size_t col=0; col<32; col++)
{
globalSums[col] += (bitmap[row] >> col) & 1;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 (matrix transposition + popcount)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint32_t transpo[32];
const uint32_t* loop = bitmap;
for (uint64_t k=0; k<nbRows; loop+=32, k+=32)
{
// We transpose the input as a 32x32 bytes matrix
sse_trans ((const char*)loop, (char*)transpo);
// We update the global sums
for (size_t i=0; i<32; i++)
{
globalSums[i] += __builtin_popcount (transpo[i]);
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 approach (update totals row by row)
////////////////////////////////////////////////////////////////////////////////
// Note: we use template specialization to unroll some portions of a loop
template<int N>
void UpdateLocalSums (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
// We update the local sums with the current row
localSums = _mm256_sub_epi8 (localSums, expand_bits_to_bytes (bitmap[k++]));
// Go recursively
UpdateLocalSums<N-1>(localSums, bitmap, k);
}
template<>
void UpdateLocalSums<0> (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
}
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
#define USE_AVX2_FOR_GRAND_TOTALS 1
void test_update_row_by_row_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
union U256i { __m256i v; uint8_t a[32]; uint32_t b[8]; };
// We use 1 register for updating local totals
__m256i localSums = _mm256_setzero_si256();
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
__m256i globalSumsReg[4]; for (size_t r=0; r<4; r++) { globalSumsReg[r] = _mm256_setzero_si256(); }
#endif
uint64_t steps = nbRows / 255;
uint64_t k=0;
const int divisorOf255 = 5;
// We iterate over all rows
for (uint64_t i=0; i<steps; i++)
{
// we update the local totals (255*32=8160 additions)
for (int j=0; j<255/divisorOf255; j++)
{
// unroll some portion of the 255 loop through template specialization
UpdateLocalSums<divisorOf255>(localSums, bitmap, k);
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums
// We take the 128 high bits of the local sums
__m256i localSums2 = _mm256_broadcastsi128_si256(_mm256_extracti128_si256(localSums,1));
globalSumsReg[0] = _mm256_add_epi32 (globalSumsReg[0],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 0)))
);
globalSumsReg[1] = _mm256_add_epi32 (globalSumsReg[1],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 8)))
);
globalSumsReg[2] = _mm256_add_epi32 (globalSumsReg[2],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 0)))
);
globalSumsReg[3] = _mm256_add_epi32 (globalSumsReg[3],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 8)))
);
#else
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
#endif
// we reset the local totals
localSums = _mm256_setzero_si256();
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// We update the global totals into the final uint32_t array
for (size_t r=0; r<4; r++)
{
U256i tmp = { globalSumsReg[r] };
for (size_t k=0; k<8; k++) { globalSums[r*8+k] += tmp.b[k]; }
}
#endif
// we update the remaining local totals
for (uint64_t i=steps*255; i<nbRows; i++)
{
UpdateLocalSums<1>(localSums, bitmap, k);
}
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint64_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%4.2lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
check,
nbRows * cycleTotal / timeTotal / 1000000.0
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build an bitmap of size nbRows*32
uint32_t* bitmap = new uint32_t[nbRows];
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
// We fill the bitmap with random values
srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("naive (transpo) ", test_transpose_popcnt_naive, nbRuns, nbRows, bitmap);
execute ("naive (row by row)", test_update_row_by_row_naive, nbRuns, nbRows, bitmap);
execute ("AVX2 (transpo) ", test_transpose_popcnt_avx2, nbRuns, nbRows, bitmap);
execute ("AVX2 (row by row)", test_update_row_by_row_avx2, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
delete[] bitmap;
return EXIT_SUCCESS;
}
////////////////////////////////////////////////////////////////////////////////
__m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x);
// Each byte gets the source byte containing the corresponding bit
__m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
__m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_and_si256(shuf, andmask);
// Avoid an _mm256_add_epi8 thanks to Peter Cordes's comment
return _mm256_cmpeq_epi8(isolated_inverted, andmask);
}
////////////////////////////////////////////////////////////////////////////////
void sse_trans(char const *inp, char *out)
{
#define INP(x,y) inp[(x)*4 + (y)/8]
#define OUT(x,y) out[(y)*4 + (x)/8]
int rr, cc, i, h;
union { __m256i x; uint8_t b[32]; } tmp;
for (cc = 0; cc < 32; cc += 8)
{
for (i = 0; i < 32; ++i)
tmp.b[i] = INP(i, cc);
for (i = 8; i--; tmp.x = _mm256_slli_epi64(tmp.x, 1))
*(uint32_t*)&OUT(0, cc + i) = _mm256_movemask_epi8(tmp.x);
}
}
////////////////////////////////////////////////////////////////////////////////
double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
Some remarks:
I used the Agner Fog's asmlib to have a function that returns CPU cycles
The compilation command is g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64
The gcc version is 7.3.1
The CPU is Intel(R) Core(TM) i7-6700HQ CPU # 2.60GHz
I compute some dummy checksum to compare the results of the different tests
Now the results:
------------------------------------------------------------------------------------------------------------
name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum
------------------------------------------------------------------------------------------------------------
naive (transpo) | 4548 ( 36.5) | 43.91 (0.35) | 2.592 | 0x9affeb5a6
naive (row by row) | 3033 ( 11.0) | 29.29 (0.11) | 2.592 | 0x9affeb5a6
AVX2 (transpo) | 767 ( 12.8) | 7.40 (0.12) | 2.592 | 0x9affeb5a6
AVX2 (row by row) | 130 ( 4.0) | 1.25 (0.04) | 2.591 | 0x9affeb5a6
So it seems that the "row by row" in AVX2 is the best so far.
Note that when I saw this result (less than 2 cycles per row), I made no more effort to optimize the AVX2 "transpose+popcount" method, which should be feasable by computing several popcounts in parallel (I may test it later).

I eventually wrote another implementation, following the high entropy SWAR approach proposed by Peter Cordes. This implementation is recursive and relies on C++ template specialization.
The global idea is to fill N-bit accumulators to their maximum without carry overflow (this is where recursion is used). When these accumulators are filled, we update the grand totals and we start again with new N-bit accumulators to fill until all rows have been processed.
Here is the code (see function test_SWAR_recursive):
#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
using namespace std;
using namespace std::chrono;
// avoid the #include <asmlib.h>
extern "C" u_int64_t ReadTSC();
static double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
////////////////////////////////////////////////////////////////////////////////
// Recursive SWAR approach (with template specialization)
////////////////////////////////////////////////////////////////////////////////
template<int DEPTH>
struct RecursiveSWAR
{
// Number of accumulators for current depth
static const int N = 1<<DEPTH;
// Array of N-bit accumulators
typedef __m256i Array[N];
// Magic numbers (0x55555555, 0x33333333, ...) computed recursively
static const u_int32_t MAGIC_NUMBER =
RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER
* (1 + (1<<(1<<(DEPTH-1))))
/ (1 + (1<<(1<<(DEPTH+0))));
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array accumulators)
{
// We reset the N-bit accumulators
for (int i=0; i<N; i++) { accumulators[i] = _mm256_setzero_si256(); }
// We check (only for depth big enough) that we have still rows to process
if (DEPTH>=3) if (begin>=end) { return; }
typename RecursiveSWAR<DEPTH-1>::Array accumulatorsMinusOne;
// We load a register with the mask
__m256i mask = _mm256_set1_epi32 (RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER);
// We fill the N-bit accumulators to their maximum capacity without carry overflow
for (int i=0; i<N+1; i++)
{
// We fill (N-1)-bit accumulators recursively
RecursiveSWAR<DEPTH-1>::fillAccumulators (begin, end, accumulatorsMinusOne);
// We update the N-bit accumulators from the (N-1)-bit accumulators
for (int j=0; j<RecursiveSWAR<DEPTH-1>::N; j++)
{
// LOW part
accumulators[2*j+0] = _mm256_add_epi32 (
accumulators[2*j+0],
_mm256_and_si256 (
accumulatorsMinusOne[j],
mask
)
);
// HIGH part
accumulators[2*j+1] = _mm256_add_epi32 (
accumulators[2*j+1],
_mm256_and_si256 (
_mm256_srli_epi32 (
accumulatorsMinusOne[j],
RecursiveSWAR<DEPTH-1>::N
),
mask
)
);
}
}
}
};
// Template specialization for DEPTH=0
template<>
struct RecursiveSWAR<0>
{
static const int N = 1;
typedef __m256i Array[N];
static const u_int32_t MAGIC_NUMBER = 0x55555555;
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array result)
{
// We just load 8 rows in the AVX2 register
result[0] = _mm256_loadu_si256 ((__m256i*)begin);
// We update the iterator
begin += 1*sizeof(__m256i)/sizeof(u_int32_t);
}
};
template<int DEPTH> struct TypeInfo { };
template<> struct TypeInfo<3> { typedef u_int8_t Type; };
template<> struct TypeInfo<4> { typedef u_int16_t Type; };
template<> struct TypeInfo<5> { typedef u_int32_t Type; };
unsigned char reversebits (unsigned char b)
{
return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}
void test_SWAR_recursive (uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums)
{
static const int DEPTH = 4;
RecursiveSWAR<DEPTH>::Array accumulators;
uint32_t* begin = (uint32_t*) bitmap;
const uint32_t* end = bitmap + nbRows;
// We reset the grand totals
for (int i=0; i<32; i++) { globalSums[i] = 0; }
while (begin < end)
{
// We fill the N-bit accumulators to the maximum without overflow
RecursiveSWAR<DEPTH>::fillAccumulators (begin, end, accumulators);
// We update grand totals from the filled N-bit accumulators
for (int i=0; i<RecursiveSWAR<DEPTH>::N; i++)
{
int r = reversebits(i) >> (8-DEPTH);
u_int32_t* sums = globalSums+r;
TypeInfo<DEPTH>::Type* values = (TypeInfo<DEPTH>::Type*) (accumulators+i);
for (int j=0; j<8*(1<<(5-DEPTH)); j++)
{
sums[(j*RecursiveSWAR<DEPTH>::N) % 32] += values[j];
}
}
}
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint32_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += (k+1)*sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%5.3lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
nbRows * cycleTotal / timeTotal / 1000000.0,
check/nbRuns
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build an bitmap of size nbRows*32
uint64_t actualNbRows = nbRows + 100000;
uint32_t* bitmap = (uint32_t*)_mm_malloc(sizeof(uint32_t)*actualNbRows, 256);
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
memset (bitmap, 0, sizeof(u_int32_t)*actualNbRows);
// We fill the bitmap with random values
// srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("AVX2 (SWAR rec) ", test_SWAR_recursive, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
_mm_free (bitmap);
return EXIT_SUCCESS;
}
The size of the accumulators is 2DEPTH in this code. Note that this implementation is valid up to DEPTH=5. For DEPTH=4, here are the performance results compared to the implementation of Peter Cordes (named high entropy SWAR):
The graph gives the number of cycles required to process a row (of 32 items) as a function of the number of rows of the matrix. As expected, the results are pretty similar since the main idea is the same. It is interesting to note the three parts of the graph:
constant value for log2(n)<=20
increasing value for log2(n) between 20 and 22
constant value for log2(n)>=22
I guess that CPU caches properties can explain this behaviour.

OpenMp parallel for

I have the following method called pgain which calls the method dist that I am trying to parallize:
/******************************************************************************/
/* For a given point x, find the cost of the following operation:
* -- open a facility at x if there isn't already one there,
* -- for points y such that the assignment distance of y exceeds dist(y, x),
* make y a member of x,
* -- for facilities y such that reassigning y and all its members to x
* would save cost, realize this closing and reassignment.
*
* If the cost of this operation is negative (i.e., if this entire operation
* saves cost), perform this operation and return the amount of cost saved;
* otherwise, do nothing.
*/
/* numcenters will be updated to reflect the new number of centers */
/* z is the facility cost, x is the number of this point in the array
points */
double pgain ( long x, Points *points, double z, long int *numcenters )
{
int i;
int number_of_centers_to_close = 0;
static double *work_mem;
static double gl_cost_of_opening_x;
static int gl_number_of_centers_to_close;
int stride = *numcenters + 2;
//make stride a multiple of CACHE_LINE
int cl = CACHE_LINE/sizeof ( double );
if ( stride % cl != 0 ) {
stride = cl * ( stride / cl + 1 );
}
int K = stride - 2 ; // K==*numcenters
//my own cost of opening x
double cost_of_opening_x = 0;
work_mem = ( double* ) malloc ( 2 * stride * sizeof ( double ) );
gl_cost_of_opening_x = 0;
gl_number_of_centers_to_close = 0;
/*
* For each center, we have a *lower* field that indicates
* how much we will save by closing the center.
*/
int count = 0;
for ( int i = 0; i < points->num; i++ ) {
if ( is_center[i] ) {
center_table[i] = count++;
}
}
work_mem[0] = 0;
//now we finish building the table. clear the working memory.
memset ( switch_membership, 0, points->num * sizeof ( bool ) );
memset ( work_mem, 0, stride*sizeof ( double ) );
memset ( work_mem+stride,0,stride*sizeof ( double ) );
//my *lower* fields
double* lower = &work_mem[0];
//global *lower* fields
double* gl_lower = &work_mem[stride];
#pragma omp parallel for
for ( i = 0; i < points->num; i++ ) {
float x_cost = dist ( points->p[i], points->p[x], points->dim ) * points->p[i].weight;
float current_cost = points->p[i].cost;
if ( x_cost < current_cost ) {
// point i would save cost just by switching to x
// (note that i cannot be a median,
// or else dist(p[i], p[x]) would be 0)
switch_membership[i] = 1;
cost_of_opening_x += x_cost - current_cost;
} else {
// cost of assigning i to x is at least current assignment cost of i
// consider the savings that i's **current** median would realize
// if we reassigned that median and all its members to x;
// note we've already accounted for the fact that the median
// would save z by closing; now we have to subtract from the savings
// the extra cost of reassigning that median and its members
int assign = points->p[i].assign;
lower[center_table[assign]] += current_cost - x_cost;
}
}
// at this time, we can calculate the cost of opening a center
// at x; if it is negative, we'll go through with opening it
for ( int i = 0; i < points->num; i++ ) {
if ( is_center[i] ) {
double low = z + work_mem[center_table[i]];
gl_lower[center_table[i]] = low;
if ( low > 0 ) {
// i is a median, and
// if we were to open x (which we still may not) we'd close i
// note, we'll ignore the following quantity unless we do open x
++number_of_centers_to_close;
cost_of_opening_x -= low;
}
}
}
//use the rest of working memory to store the following
work_mem[K] = number_of_centers_to_close;
work_mem[K+1] = cost_of_opening_x;
gl_number_of_centers_to_close = ( int ) work_mem[K];
gl_cost_of_opening_x = z + work_mem[K+1];
// Now, check whether opening x would save cost; if so, do it, and
// otherwise do nothing
if ( gl_cost_of_opening_x < 0 ) {
// we'd save money by opening x; we'll do it
for ( int i = 0; i < points->num; i++ ) {
bool close_center = gl_lower[center_table[points->p[i].assign]] > 0 ;
if ( switch_membership[i] || close_center ) {
// Either i's median (which may be i itself) is closing,
// or i is closer to x than to its current median
points->p[i].cost = points->p[i].weight * dist ( points->p[i], points->p[x], points->dim );
points->p[i].assign = x;
}
}
for ( int i = 0; i < points->num; i++ ) {
if ( is_center[i] && gl_lower[center_table[i]] > 0 ) {
is_center[i] = false;
}
}
if ( x >= 0 && x < points->num ) {
is_center[x] = true;
}
*numcenters = *numcenters + 1 - gl_number_of_centers_to_close;
} else {
gl_cost_of_opening_x = 0; // the value we'll return
}
free ( work_mem );
return -gl_cost_of_opening_x;
}
The function that I am trying to parallelize:
/* compute Euclidean distance squared between two points */
float dist ( Point p1, Point p2, int dim )
{
float result=0.0;
#pragma omp parallel for reduction(+:result)
for (int i=0; i<dim; i++ ){
result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}
return ( result );
}
With Point being this:
/* this structure represents a point */
/* these will be passed around to avoid copying coordinates */
typedef struct {
float weight;
float *coord;
long assign; /* number of point where this one is assigned */
float cost; /* cost of that assignment, weight*distance */
} Point;
I have a large application of streamcluster(815 lines of code) that produces real time numbers and sorts them in a specific way. I have used scalasca tool on Linux so I can measure the methods that take up most of the time and I have found that method dist listed above is the most time-consuming. I am trying to use openMP tools but the time that the parallelized code runs is more than the time the serial code. If serial code runs in 1,5 sec the parallelized takes 20 but the results are the same. And I am wondering is it that I can't parallelize this part of code for some reason or that I don't do it correctly.
The method I am trying to parallelize its in a call tree: main->pkmedian->pFL->pgain->dist (-> means that calls the following method)

The code you've chosen to parallelize:
float result=0.0;
#pragma omp parallel for reduction(+:result)
for (int i=0; i<dim; i++ ){
result += ( p1.coord[i] - p2.coord[i] ) * ( p1.coord[i] - p2.coord[i] );
}
is a poor candidate to benefit from parallelization. You should not use parallel for here. You should probably not use parallelization on an inner loop. If you can parallelize some outer loop, you're much more like to see gains.
There is an overhead to coordinate the thread team to start the parallel region and another overhead for performing the reduction afterwards. Meanwhile, the parallel region's contents take essentially no time to run. Given that, you'd need dim to be extremely large before you'd expect this to give a performance benefit.
To express that point more graphically, consider that the math you're doing will take nanoseconds and compare it against this chart showing the overhead of various OpenMP directives.
If you need this to run faster, your first stop should be to use appropriate compilation flags, followed by looking into SIMD operations: SSE and AVX are good keywords. Your compiler might even invoke them automatically.
I've built some test code (see below) and compiled it with various optimizations enabled, as listed below, and run it on arrays of 100,000 elements. Note that enabling -O3 results in a run-time that is on the order of the OpenMP directives. This implies that you'd want arrays of about 400,000 before you'd want to think about using OpenMP and probably more like 1,000,000, to be safe.
No optimizations. Run-time is ~1900μs.
-O3: Enables many optimizations. Run-time is ~200μs.
-ffast-math: You want this, unless you're doing some very tricky things. Run-time is about the same.
-march=native: Compile code to use the full capabilities of your CPU, rather than a generic instruction set that would work on many CPUs. Run-time is ~100μs.
So there we go, strategic use of compiler options (-march=native) can double the speed of the code in question without having to muck about in parallelism.
Here is a handy slide presentation with some tips explaining how to use OpenMP in a performant manner.
Test code:
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iostream>
int main(){
std::vector<double> a;
std::vector<double> b;
for(int i=0;i<100000;i++){
a.push_back(rand()/(double)RAND_MAX);
b.push_back(rand()/(double)RAND_MAX);
}
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
float result = 0.0;
//#pragma omp parallel for reduction(+:result)
for (unsigned int i=0; i<a.size(); i++ )
result += ( a[i] - b[i] ) * ( a[i] - b[i] );
std::chrono::steady_clock::time_point end= std::chrono::steady_clock::now();
std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds"<<std::endl;
}

Entrywise addition of two double arrays using AVX

I need a function to entrywise add the elements of two double arrays and store the result in a third array. Currently I use (simplified)
void add( double* result, const double* a, const double* b, size_t size) {
memcpy(result, a, size*sizeof(double));
for(size_t i = 0; i < size; ++i) {
result[i] += b[i];
}
}
As far as I know the memcpy function uses AVX. In order to improve the performance I would like to also enforce AVX use for the addition. This should be one of the most basic examples for AVX, however I couldn't find any description how to do this in C/C++. I would like to avoid the use of external libraries if possible.

You'll need something like this, assuming AVX-512:
void add( double* result, const double* a, const double* b, size_t size)
{
size_t i = 0;
// Note we are doing as many blocks of 8 as we can. If the size is not divisible by 8
// then we will have some left over that will then be performed serially.
// AVX-512 loop
for( ; i < (size & ~0x7); i += 8)
{
const __m512d kA8 = _mm512_load_pd( &a[i] );
const __m512d kB8 = _mm512_load_pd( &b[i] );
const __m512d kRes = _mm512_add_pd( kA8, kB8 );
_mm512_stream_pd( &res[i], kRes );
}
// AVX loop
for ( ; i < (size & ~0x3); i += 4 )
{
const __m256d kA4 = _mm256_load_pd( &a[i] );
const __m256d kB4 = _mm256_load_pd( &b[i] );
const __m256d kRes = _mm256_add_pd( kA4, kB4 );
_mm256_stream_pd( &res[i], kRes );
}
// SSE2 loop
for ( ; i < (size & ~0x1); i += 2 )
{
const __m128d kA2 = _mm_load_pd( &a[i] );
const __m128d kB2 = _mm_load_pd( &b[i] );
const __m128d kRes = _mm_add_pd( kA2, kB2 );
_mm_stream_pd( &res[i], kRes );
}
// Serial loop
for( ; i < size; i++ )
{
result[i] = a[i] + b[i];
}
}
(Though be warned I've just thrown that together off the top of my head).
Something to note form the above code is that I essentially process the remaining values using the next best parallel code. Primarily this is for illustration of the 3 possible ways you could do it parallely. The loops will work perfectly well on their own. For example if you can't support AVX-512 then you'd jump straight to the AVX loop. If you can't support AVX even then if you jump straight to the SSE2 loop then you'll be using the most performant loop that your hardware can support.
For best performance your arrays should be aligned to the relevant size used in the load. So for AVX-512 you would want 512-bit of 64 byte alignment. For AVX, 256-bit or 32 byte alignment. For SSE2 128-bit or 16 byte alignment. If you use 64 byte alignment for all your arrays then you will always have good alignment, though you may want to go for 128 byte alignment to ease moving over to AVX-1024 when that appears ;)

Parallelizing a for loop gives no performance gain

I have an algorithm which converts a bayer image channel to RGB. In my implementation I have a single nested for loop which iterates over the bayer channel, calculates the rgb index from the bayer index and then sets that pixel's value from the bayer channel.
The main thing to notice here is that each pixel can be calculated independently from other pixels (doesn't rely on previous calculations) and so the algorithm is a natural candidate for paralleization. The calculation does however rely on some preset arrays which all threads will be accessing in the same time but will not change.
However, when I tried parallelizing the main forwith MS's cuncurrency::parallel_for I gained no boost in performance. In fact, for an input of size 3264X2540 running over a 4-core CPU, the non parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).
Using Intel's compiler with tbb::parallel_for gave near exact results.
For comparison, I started out with this algorithm implemented in C# in which I also used parallel_for loops and there I encountered near X4 performance gains (I opted for C++ because for this particular task C++ was faster even with a single core).
Any ideas what is preventing my code from parallelizing well?
My code:
template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
//Translates index offset in Bayer image to channel offset in RGB image
int offsets[4];
//calculate offsets according to color space
switch (ColorSpace)
{
case ColorSpace::BGGR:
offsets[0] = 2;
offsets[1] = 1;
offsets[2] = 1;
offsets[3] = 0;
break;
...other color spaces
}
memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row%2)*2 + (col%2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}

First of all, your algorithm is memory bandwidth bounded. That is memory load/store would outweigh any index calculations you do.
Vector operations like SSE/AVX would not help either - you are not doing any intensive calculations.
Increasing work amount per iteration is also useless - both PPL and TBB are smart enough, to not create thread per iteration, they would use some good partition, which would additionaly try to preserve locality. For instance, here is quote from TBB::parallel_for:
When worker threads are available, parallel_for executes iterations is non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.
What really matters is to reduce memory operations. Any superfluous traversal over input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.
You are fully traversing input and output data. Even if you skip something in output - that doesn't mater, because memory operations are happening by 64 byte chunks at modern hardware. So, calculate size of your input and output, measure time of algorithm, divide size/time and compare result with maximal characteristics of your system (for instance, measure with benchmark).
I have made test for Microsoft PPL, OpenMP and Native for, results are (I used 8x of your height):
Native_For 0.21 s
OpenMP_For 0.15 s
Intel_TBB_For 0.15 s
MS_PPL_For 0.15 s
If remove memset then:
Native_For 0.15 s
OpenMP_For 0.09 s
Intel_TBB_For 0.09 s
MS_PPL_For 0.09 s
As you can see memset (which is highly optimized) is responsoble for significant amount of execution time, which shows how your algorithm is memory bounded.
FULL SOURCE CODE:
#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>
using namespace boost;
using namespace std;
const auto Width = 3264;
const auto Height = 2540*8;
struct MS_PPL_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
concurrency::parallel_for(first,last,f);
}
};
struct Intel_TBB_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
tbb::parallel_for(first,last,f);
}
};
struct Native_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
for(; first!=last; ++first) f(first);
}
};
struct OpenMP_For
{
template<typename F,typename Index>
void operator()(Index first,Index last,F f) const
{
#pragma omp parallel for
for(auto i=first; i<last; ++i) f(i);
}
};
template<typename T>
struct ConvertBayerToRgbImageAsIs
{
const T* BayerChannel;
T* RgbChannel;
template<typename For>
void operator()(For for_)
{
cout << type_name<For>() << "\t";
progress_timer t;
int offsets[] = {2,1,1,0};
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
for_(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row % 2)*2 + (col % 2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
};
int main()
{
vector<float> bayer(Width*Height);
vector<float> rgb(Width*Height*3);
ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
for(auto i=0;i!=4;++i)
{
mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
cout << string(16,'_') << endl;
}
}

Synchronization overhead
I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and ran the computation in parallel, you would have noticed a large gain. Try to design the loop in a way that would case less iterations and more work per iteration. The reasoning behind this is that there is too much synchronization done.
Cache usage
An important factor may be how the data is split (partitioned) for the processing. If the proceessed rows are separated as in the bad case below, then more rows will cause a cache miss. This effect will become more important with each additional thread, because the distance between rows will be greater. If you are certain that the parallelizing function performs reasonable partitioning, then manual work-splitting will not give any results
bad good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2
Also make sure that you access your data in the same way it is aligned; it is possible that each call to offset[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive. Almost all operations are either accessing a memory segment or writing to it. Preventing cache misses and minimizing memory access is crucial.
Code optimizations
the optimizations shown below may be done by the compiler and may not give better results. It is worth knowing that they can be done.
// is the memset really necessary?
//memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
int rowMod = (row & 1) << 1;
for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
{
auto rgbIndex = tripleBayerIndex + offsets[rowMod];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
//unrolled the loop to save col & 1 operation
rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
}
});

Here comes my suggestion:
Computer larger chunks in parallel
get rid of modulo/multiplication
unroll inner loop to compute one full pixel (simplifies code)
template<typename T> void static ConvertBayerToRgbImageAsIsNew(T* BayerChannel, T* RgbChannel, int Width, int Height)
{
// convert BGGR->RGB
// have as many threads as the hardware concurrency is
parallel_for(0, Height, static_cast<int>(Height/(thread::hardware_concurrency())), [&] (int stride)
{
for (auto row = stride; row<2*stride; row++)
{
for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
{
RgbChannel[rgbCol+0] = BayerChannel[col+3];
RgbChannel[rgbCol+1] = BayerChannel[col+1];
// RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
RgbChannel[rgbCol+2] = BayerChannel[col+0];
}
}
});
}
This code is 60% faster than the original version but still only half as fast as the non parallelized version on my laptop. This seemed to be due to the memory boundedness of the algorithm as others have pointed out already.
edit: But I was not happy with that. I could greatly improve the parallel performance when going from parallel_for to std::async:
int hc = thread::hardware_concurrency();
future<void>* res = new future<void>[hc];
for (int i = 0; i<hc; ++i)
{
res[i] = async(Converter<char>(bayerChannel, rgbChannel, rows, cols, rows/hc*i, rows/hc*(i+1)));
}
for (int i = 0; i<hc; ++i)
{
res[i].wait();
}
delete [] res;
with converter being a simple class:
template <class T> class Converter
{
public:
Converter(T* BayerChannel, T* RgbChannel, int Width, int Height, int startRow, int endRow) :
BayerChannel(BayerChannel), RgbChannel(RgbChannel), Width(Width), Height(Height), startRow(startRow), endRow(endRow)
{
}
void operator()()
{
// convert BGGR->RGB
for(int row = startRow; row < endRow; row++)
{
for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
{
RgbChannel[rgbCol+0] = BayerChannel[col+3];
RgbChannel[rgbCol+1] = BayerChannel[col+1];
// RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
RgbChannel[rgbCol+2] = BayerChannel[col+0];
}
};
}
private:
T* BayerChannel;
T* RgbChannel;
int Width;
int Height;
int startRow;
int endRow;
};
This is now 3.5 times faster than the non parallelized version. From what I have seen in the profiler so far, I assume that the work stealing approach of parallel_for incurs a lot of waiting and synchronization overhead.

I have not used tbb::parallel_for not cuncurrency::parallel_for, but if your numbers are correct they seem to carry too much overhead. However, I strongly advice you to run more that 10 iterations when testing, and also be sure to do as many warmup iterations before timing.
I tested your code exactly using three different methods, averaged over 1000 tries.
Serial: 14.6 += 1.0 ms
std::async: 13.6 += 1.6 ms
workers: 11.8 += 1.2 ms
The first is serial calculation. The second is done using four calls to std::async. The last is done by sending four jobs to four already started (but sleeping) background threads.
The gains aren't big, but at least they are gains. I did the test on a 2012 MacBook Pro, with dual hyper threaded cores = 4 logical cores.
For reference, here's my std::async parallel for:
template<typename Int=int, class Fun>
void std_par_for(Int beg, Int end, const Fun& fun)
{
auto N = std::thread::hardware_concurrency();
std::vector<std::future<void>> futures;
for (Int ti=0; ti<N; ++ti) {
Int b = ti * (end - beg) / N;
Int e = (ti+1) * (end - beg) / N;
if (ti == N-1) { e = end; }
futures.emplace_back( std::async([&,b,e]() {
for (Int ix=b; ix<e; ++ix) {
fun( ix );
}
}));
}
for (auto&& f : futures) {
f.wait();
}
}

Things to check or do
Are you using a Core 2 or older processor? They have a very narrow memory bus that's easy to saturate with code like this. In contrast, 4-channel Sandy Bridge-E processors require multiple threads to saturate the memory bus (it's not possible for a single memory-bound thread to fully saturate it).
Have you populated all of your memory channels? E.g. if you have a dual-channel CPU but have just one RAM card installed or two that are on the same channel, you're getting half the available bandwidth.
How are you timing your code?
The timing should be done inside the application like Evgeny Panasyuk suggests.
You should do multiple runs within the same application. Otherwise, you may be timing one-time startup code to launch the thread pools, etc.
Remove the superfluous memset, as others have explained.
As ogni42 and others have suggested, unroll your inner loop (I didn't bother checking the correctness of that solution, but if it's wrong, you should be able to fix it). This is orthogonal to the main question of parallelization, but it's a good idea anyway.
Make sure your machine is otherwise idle when doing performance testing.
Additional timings
I've merged the suggestions of Evgeny Panasyuk and ogni42 in a bare-bones C++03 Win32 implementation:
#include "stdafx.h"
#include <omp.h>
#include <vector>
#include <iostream>
#include <stdio.h>
using namespace std;
const int Width = 3264;
const int Height = 2540*8;
class Timer {
private:
string name;
LARGE_INTEGER start;
LARGE_INTEGER stop;
LARGE_INTEGER frequency;
public:
Timer(const char *name) : name(name) {
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&start);
}
~Timer() {
QueryPerformanceCounter(&stop);
LARGE_INTEGER time;
time.QuadPart = stop.QuadPart - start.QuadPart;
double elapsed = ((double)time.QuadPart /(double)frequency.QuadPart);
printf("%-20s : %5.2f\n", name.c_str(), elapsed);
}
};
static const int offsets[] = {2,1,1,0};
template <typename T>
void Inner_Orig(const T* BayerChannel, T* RgbChannel, int row)
{
for (int col = 0, bayerIndex = row * Width;
col < Width; col++, bayerIndex++)
{
int offset = (row % 2)*2 + (col % 2); //0...3
int rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
}
// adapted from ogni42's answer
template <typename T>
void Inner_Unrolled(const T* BayerChannel, T* RgbChannel, int row)
{
for (int col = row*Width, rgbCol =row*Width;
col < row*Width+Width; rgbCol +=3, col+=4)
{
RgbChannel[rgbCol+0] = BayerChannel[col+3];
RgbChannel[rgbCol+1] = BayerChannel[col+1];
// RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
RgbChannel[rgbCol+2] = BayerChannel[col+0];
}
}
int _tmain(int argc, _TCHAR* argv[])
{
vector<float> bayer(Width*Height);
vector<float> rgb(Width*Height*3);
for(int i = 0; i < 4; ++i)
{
{
Timer t("serial_orig");
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_dynamic_orig");
#pragma omp parallel for
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_static_orig");
#pragma omp parallel for schedule(static)
for(int row = 0; row < Height; ++row) {
Inner_Orig<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("serial_unrolled");
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_dynamic_unrolled");
#pragma omp parallel for
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
{
Timer t("omp_static_unrolled");
#pragma omp parallel for schedule(static)
for(int row = 0; row < Height; ++row) {
Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
}
}
printf("-----------------------------\n");
}
return 0;
}
Here are the timings I see on a triple-channel 8-way hyperthreaded Core i7-950 box:
serial_orig : 0.13
omp_dynamic_orig : 0.10
omp_static_orig : 0.10
serial_unrolled : 0.06
omp_dynamic_unrolled : 0.04
omp_static_unrolled : 0.04
The "static" versions tell the compiler to evenly divide up the work between threads at loop entry. This avoids the overhead of attempting to do work stealing or other dynamic load balancing. For this code snippet, it doesn't seem to make a difference, even though the workload is very uniform across threads.

The performance reduction might be happening because your are trying to distribute for loop on "row" number of cores, which wont be available and hence again it become like a sequential execution with the overhead of parallelism.

Not very familiar with parallel for loops but it seems to me the contention is in the memory access. It appears your threads are overlapping access to the same pages.
Can you break up your array access into 4k chunks somewhat align with the page boundary?

There is no point talking about parallel performance before not having optimized the for loop for serial code. Here is my attempt at that (some good compilers may be able to obtain similarly optimized versions, but I'd rather not rely on that)
parallel_for(0, Height, [=] (int row) noexcept
{
for (auto col=0, bayerindex=row*Width,
rgb0=3*bayerindex+offset[(row%2)*2],
rgb1=3*bayerindex+offset[(row%2)*2+1];
col < Width; col+=2, bayerindex+=2, rgb0+=6, rgb1+=6 )
{
RgbChannel[rgb0] = BayerChannel[bayerindex ];
RgbChannel[rgb1] = BayerChannel[bayerindex+1];
}
});

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js