Casting AVX512 mask types

Casting AVX512 mask types - c++

I'm trying to figure out how to use masked loads and stores for the last few elements to be processed. My use case involves converting a packed 10 bit data stream to 16 bit which means loading 5 bytes before storing 4 shorts. This results in different masks of different types.
The main loop itself is not a problem. But at the end I'm left with up to 19 bytes input / 15 shorts output which I thought I could process in up to two loop iterations using the 128 bit vectors. Here is the outline of the code.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
void convert(uint16_t* out, ptrdiff_t n, const uint8_t* in)
{
uint16_t* const out_end = out + n;
for(uint16_t* out32_end = out + (n & -32); out < out32_end; in += 40, out += 32) {
/*
* insert main loop here using ZMM vectors
*/
}
if(out_end - out >= 16) {
/*
* insert half-sized iteration here using YMM vectors
*/
in += 20;
out += 16;
}
// up to 19 byte input remaining, up to 15 shorts output
const unsigned out_remain = out_end - out;
const unsigned in_remain = (out_remain * 10 + 7) / 8;
unsigned in_mask = (1 << in_remain) - 1;
unsigned out_mask = (1 << out_remain) - 1;
while(out_mask) {
__mmask16 load_mask = _cvtu32_mask16(in_mask);
__m128i packed = _mm_maskz_loadu_epi8(load_mask, in);
/* insert computation here. No masks required */
__mmask8 store_mask = _cvtu32_mask8(out_mask);
_mm_mask_storeu_epi16(out, store_mask, packed);
in += 10;
out += 8;
in_mask >>= 10;
out_mask >>= 8;
}
}
(Compile with -O3 -mavx2 -mavx512f -mavx512bw -mavx512vl -mavx512dq)
My idea was to create a bit mask from the number of remaining elements (since I know it fits comfortably in an integer / mask register), then shift values out of the mask as they are processed.
I have two issues with this approach:
I'm re-setting the masks from GP registers each iteration instead of using the kshift family of instructions
_cvtu32_mask8 (kmovb) is the only instruction in this code that requires AVX512DQ. Limiting the number of suitable hardware platforms just for that seems weird
What I'm wondering about:
Can I cast mmask32 to mmask16 and mmask8?
If I can, I could set it once from the GP register, then shift it in its own register. Like this:
__mmask32 load_mask = _cvtu32_mask32(in_mask);
__mmask32 store_mask = _cvtu32_mask32(out_mask);
while(out < out_end) {
__m128i packed = _mm_maskz_loadu_epi8((__mmask16) load_mask, in);
/* insert computation here. No masks required */
_mm_mask_storeu_epi16(out, (__mmask8) store_mask, packed);
load_mask = _kshiftri_mask32(load_mask, 10);
store_mask = _kshiftri_mask32(store_mask, 8);
in += 10;
out += 8;
}
GCC seems to be fine with this pattern. But Clang and MSVC create worse code, moving the mask in and out of GP registers without any apparent reason.

Related

Arduino c+ >> operation

I am adapting the example for the Arduino AutoAnalogAudio library entitled
SDAudioWavPlayer
which can be found in Examples->AutoAnalogAudio->SDAudio->SDAudioWavPlayer
This example uses interrupts to repeatedly call the function
void loadBuffer(). The code for that is below
/* Function called from DAC interrupt after dacHandler(). Loads data into the dacBuffer */
void loadBuffer() {
if (myFile) {
if (myFile.available()) {
if (aaAudio.dacBitsPerSample == 8) {
//Load 32 samples into the 8-bit dacBuffer
myFile.read((byte*)aaAudio.dacBuffer, MAX_BUFFER_SIZE);
}else{
//Load 32 samples (64 bytes) into the 16-bit dacBuffer
myFile.read((byte*)aaAudio.dacBuffer16, MAX_BUFFER_SIZE * 2);
//Convert the 16-bit samples to 12-bit
for (int i = 0; i < MAX_BUFFER_SIZE; i++) {
aaAudio.dacBuffer16[i] = (aaAudio.dacBuffer16[i] + 0x8000) >> 4;
}
}
}else{
#if defined (AUDIO_DEBUG)
Serial.println("File close");
#endif
myFile.close();
aaAudio.disableDAC();
}
}
}
The specific part I am concerned with is the second part of the if statement
{
//Load 32 samples (64 bytes) into the 16-bit dacBuffer
myFile.read((byte*)aaAudio.dacBuffer16, MAX_BUFFER_SIZE * 2);
//Convert the 16-bit samples to 12-bit
for (int i = 0; i < MAX_BUFFER_SIZE; i++) {
aaAudio.dacBuffer16[i] = (aaAudio.dacBuffer16[i] + 0x8000) >> 4;
}
}
Despite the comment MAX_BUFFER_SIZE is 256 so 512 bytes are read into
aaAudio.dacBuffer16. That data was originally 16 bit signed integers (+/- 32k) and dacBuffer16 is an array of 16bit unsigned integers (0-64K). The negative sign is removed by going through the array and adding 2^15 (0x8000) to each element. This makes the negative numbers overflow leaving the positive part of the negative number. Positive numbers are just increased by 2^15. thus the values are rescalled to lie in 0 -64K. The result is then shifted 4 places right so that only the highest 12 bits remain which is what the Arduino DAC can handle. This all happens in the line
aaAudio.dacBuffer16[i] = (aaAudio.dacBuffer16[i] + 0x8000) >> 4;
So far so good.
Now I want to be able to programmatically reduce the volume. As far as I can find the library does not provide a function to do that so I thought that the simplest
thing to do was to change the '4' to 'N' and increase the amount of shifting to 5,6,7.. etc
eg
aaAudio.dacBuffer16[i] = (aaAudio.dacBuffer16[i] + 0x8000) >> N;
where N is an integer. I tried this but I got a terribly distorted result which I did not understand.
While fiddling around trying different things I tried the following which works
uint16_t sample;
int N = 5;
for (int i = 0; i < MAX_BUFFER_SIZE; i++)
{
sample = (aaAudio.dacBuffer16[i] + 0x8000);
sample = sample >> N;
// sample = sample / 40;
aaAudio.dacBuffer16[i] = sample;
}
You can also see that I have commented out simply dividing by a number which works if I want finer control.
My problem is I do not see what the difference is between the two bits of code.
Can anybody enlighten me ?

Matrix transpose and population count

I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column.
For instance for n=4:
1101
0101
0001
1001
M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };
result = { 2, 2, 0, 4};
I can obviously
transpose the matrix M into a matrix M'
popcount each row of M'.
Good algorithms exist for matrix transposition and popcounting through bit manipulation.
My question is: would it be possible to "merge" such algorithms into a single one ?
Note that N could be quite large (say 1024 and more) regarding 64 bits architecture.

Related: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 and https://github.com/mklarqvist/positional-popcount
I had another idea which I haven't finished writing up nicely.
Godbolt link to messy work-in-progress which doesn't have correct loop bounds / cleanup, but for large buffers runs ~3x faster than #edrezen's version on my Skylake i7-6700k, with g++7.3 -O3 -march=native. See the test_SWAR_avx2 function. (I know it doesn't compile on Godbolt; Agner Fog's asmlib.h isn't present.)
I might have some columns in the wrong order, too, but from stepping through the asm I think it's doing the right amount of work. i.e. any necessary bugfixes won't slow it down.
I used 16-bit accumulators, so another outer loop might be necessary if you care about inputs large enough to overflow 16-bit per-column counters.
Interesting observation: An earlier buggy version of my loop used sum0123 twice in store_globalsums_from_vec16, leaving sum4567 unused, so it optimized away in the main loop. With less work, gcc fully unrolled the large for(int i=0 ; i<5 ; i++) loop, and the code ran slower, like about 1 cycle per byte instead of 0.5. The loop was probably too big for the uop cache or something (I didn't profile yet but a front-end decode bottleneck would explain it). For some reason #edrezen's version is only running at about 1.5c/B for me, not the ~1.25 reported in the answer. My CPU is actually running 3.9GHz, but Agner Fog's library detects it at 4.0, but that's not enough to explain it.
Also, gcc spills sum4567_16bit to the stack, so we're already pushing the boundary of register pressure without AVX512. It's updated infrequently and isn't a problem, but needing more accumulators in the inner loop could be.
Your data layout isn't clear about when the number of columns isn't 32.
It seems that for each uint32_t chunk of 32 columns, you have all the rows stored contiguously in memory. i.e. looping over the rows for a column is efficient. If you had more than 32 columns, the rows for columns 32..63 will be contiguous and come after all the rows for columns 0..31.
(If instead you have all the columns for a single row contiguous, you could still use this idea, but might need to spill/reload some accumulators to memory, or let the compiler do that for you if it makes good choices.)
So loading a 32-byte (8 dword) vector gets 8 rows of data for one column chunk. That's extremely convenient, and allows widening from 1-bit (in memory) to 2-bit accumulators, then grab more data before we widen to 4-bit, and so on, summing along the way so we get significant work done while the data is still dense. (Rather than only adding 1 bit (0 or 1) per byte to vector accumulators.)
The more we unroll, the more data we can grab from memory to make better use of the coding space in our vectors. i.e. our variables have higher entropy. Throwing around more data (in terms of bits of memory that contributed to it) per vpaddb/w/d/q or unpack/shuffle instruction is a Good Thing.
Accumulators narrower than 1 byte within a SIMD vector is basically an https://en.wikipedia.org/wiki/SWAR technique, where you have to AND away bits that you shift past an element boundary, because we don't have SIMD element boundaries to do it for us. (And we avoid overflow anyway, so ADD carrying into the next element isn't a problem.)
Each inner loop iteration:
take a vector of data from the same columns in each of 2 or 3 (groups of) rows. So you either have 3 * 8 rows from one chunk of 32 columns, or 3 rows of 256 columns.
mask them with set1(0b01010101) to get the even (low) bits, and with (vec>>1) & mask (_mm256_srli_epi32(v,1)) to get the odd (high) bits. Use _mm256_add_epi8 to accumulate within those 2-bit accumulators. They can't overflow with only 3 ones, so carry-propagation boundaries don't actually matter.
Each byte of your vector has 4 separate vertical sums, and you have two vectors (odd/even).
Repeat the above again, to get another pair of vectors from 3 vectors of data from memory.
Combine again to get 4 vectors of 4-bit accumulators (with possible values 0..6). Still without mixing bits from within a single 32-bit element, of course, because we must never do that. Shifts only move bits for odd / high columns to the bottom of the 2-bit or 4-bit unit that contains them so they can be added with bits that were moved the same way in other vectors.
_mm256_unpacklo/hi_epi8 and mask or shift+mask to get 8-bit accumulators
Put the above in a loop that runs up to 5 times, so the 0..12 accumulator values go up to 0..60 (i.e. leaving 2 bits of headroom for unpacking the 8-bit accumulators, using all their coding space.)
If you have the data layout from your answer, then we can add data from dword elements within the same vector. We can do that so we don't run out of registers when widening our accumulators up to 16-bit (because x86-64 only has 16 YMM registers, and we need some for constants.)
_mm256_unpacklo/hi_epi16 and add, to interleave pairs of 8-bit counters so a group of counters for the same column has expanded from a dword to a qword.
Repeat this general idea to reduce the number of registers (or __m256i variables) your accumulators are spread over.
Efficiently handling the lack of a lane-crossing 2-input byte or word shuffle is inconvenient, but it's a pretty small part of the total work. vextracti128 / vpaddb xmm -> vpmovzxbw worked well enough.

I made some benchmark between the two approaches:
transpose + popcount
update row by row
I wrote a naive version and an AVX2 one for both approaches. I used some functions (found on stackoverflow or elsewhere) for the AVX2 "transpose+popcount" approach.
In my test, I make the assumption that the input is a nbRowsx32 matrix in a bits packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array of uint32_t.
The code is the following:
#include <cinttypes>
#include <cstdio>
#include <cstring>
#include <cmath>
#include <cassert>
#include <chrono>
#include <immintrin.h>
#include <asmlib.h>
using namespace std;
using namespace std::chrono;
// see https://stackoverflow.com/questions/24225786/fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector
static __m256i expand_bits_to_bytes (uint32_t x);
// see https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/
static void sse_trans(char const *inp, char *out);
static double deviation (double n, double sum2, double sum);
////////////////////////////////////////////////////////////////////////////////
// Naive approach (matrix transposition)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint8_t transpo[32][32]; memset (transpo, 0, sizeof(transpo));
for (uint64_t k=0; k<nbRows; k+=32)
{
// We unpack and transpose the input into a 32x32 bytes matrix
for (size_t row=0; row<32; row++)
{
for (size_t col=0; col<32; col++) { transpo[col][row] = (bitmap[k+row] >> col) & 1 ; }
}
for (size_t row=0; row<32; row++)
{
// We popcount the current row
u_int8_t sum=0;
for (size_t col=0; col<32; col++) { sum += transpo[row][col]; }
// We update the corresponding global sum
globalSums[row] += sum;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// Naive approach (row by row)
////////////////////////////////////////////////////////////////////////////////
void test_update_row_by_row_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
for (uint64_t row=0; row<nbRows; row++)
{
for (size_t col=0; col<32; col++)
{
globalSums[col] += (bitmap[row] >> col) & 1;
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 (matrix transposition + popcount)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
assert (nbRows%32==0);
uint32_t transpo[32];
const uint32_t* loop = bitmap;
for (uint64_t k=0; k<nbRows; loop+=32, k+=32)
{
// We transpose the input as a 32x32 bytes matrix
sse_trans ((const char*)loop, (char*)transpo);
// We update the global sums
for (size_t i=0; i<32; i++)
{
globalSums[i] += __builtin_popcount (transpo[i]);
}
}
}
////////////////////////////////////////////////////////////////////////////////
// AVX2 approach (update totals row by row)
////////////////////////////////////////////////////////////////////////////////
// Note: we use template specialization to unroll some portions of a loop
template<int N>
void UpdateLocalSums (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
// We update the local sums with the current row
localSums = _mm256_sub_epi8 (localSums, expand_bits_to_bytes (bitmap[k++]));
// Go recursively
UpdateLocalSums<N-1>(localSums, bitmap, k);
}
template<>
void UpdateLocalSums<0> (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
}
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
#define USE_AVX2_FOR_GRAND_TOTALS 1
void test_update_row_by_row_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums)
{
union U256i { __m256i v; uint8_t a[32]; uint32_t b[8]; };
// We use 1 register for updating local totals
__m256i localSums = _mm256_setzero_si256();
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
__m256i globalSumsReg[4]; for (size_t r=0; r<4; r++) { globalSumsReg[r] = _mm256_setzero_si256(); }
#endif
uint64_t steps = nbRows / 255;
uint64_t k=0;
const int divisorOf255 = 5;
// We iterate over all rows
for (uint64_t i=0; i<steps; i++)
{
// we update the local totals (255*32=8160 additions)
for (int j=0; j<255/divisorOf255; j++)
{
// unroll some portion of the 255 loop through template specialization
UpdateLocalSums<divisorOf255>(localSums, bitmap, k);
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums
// We take the 128 high bits of the local sums
__m256i localSums2 = _mm256_broadcastsi128_si256(_mm256_extracti128_si256(localSums,1));
globalSumsReg[0] = _mm256_add_epi32 (globalSumsReg[0],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 0)))
);
globalSumsReg[1] = _mm256_add_epi32 (globalSumsReg[1],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 8)))
);
globalSumsReg[2] = _mm256_add_epi32 (globalSumsReg[2],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 0)))
);
globalSumsReg[3] = _mm256_add_epi32 (globalSumsReg[3],
_mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 8)))
);
#else
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
#endif
// we reset the local totals
localSums = _mm256_setzero_si256();
}
#ifdef USE_AVX2_FOR_GRAND_TOTALS
// We update the global totals into the final uint32_t array
for (size_t r=0; r<4; r++)
{
U256i tmp = { globalSumsReg[r] };
for (size_t k=0; k<8; k++) { globalSums[r*8+k] += tmp.b[k]; }
}
#endif
// we update the remaining local totals
for (uint64_t i=steps*255; i<nbRows; i++)
{
UpdateLocalSums<1>(localSums, bitmap, k);
}
// we update the global totals
U256i tmp = { localSums };
for (size_t k=0; k<32; k++) { globalSums[k] += tmp.a[k]; }
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint64_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint64_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%4.2lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
check,
nbRows * cycleTotal / timeTotal / 1000000.0
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build an bitmap of size nbRows*32
uint32_t* bitmap = new uint32_t[nbRows];
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
// We fill the bitmap with random values
srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("naive (transpo) ", test_transpose_popcnt_naive, nbRuns, nbRows, bitmap);
execute ("naive (row by row)", test_update_row_by_row_naive, nbRuns, nbRows, bitmap);
execute ("AVX2 (transpo) ", test_transpose_popcnt_avx2, nbRuns, nbRows, bitmap);
execute ("AVX2 (row by row)", test_update_row_by_row_avx2, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
delete[] bitmap;
return EXIT_SUCCESS;
}
////////////////////////////////////////////////////////////////////////////////
__m256i expand_bits_to_bytes(uint32_t x)
{
__m256i xbcast = _mm256_set1_epi32(x);
// Each byte gets the source byte containing the corresponding bit
__m256i shufmask = _mm256_set_epi64x(
0x0303030303030303, 0x0202020202020202,
0x0101010101010101, 0x0000000000000000);
__m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);
__m256i andmask = _mm256_set1_epi64x(0x8040201008040201); // every 8 bits -> 8 bytes, pattern repeats.
__m256i isolated_inverted = _mm256_and_si256(shuf, andmask);
// Avoid an _mm256_add_epi8 thanks to Peter Cordes's comment
return _mm256_cmpeq_epi8(isolated_inverted, andmask);
}
////////////////////////////////////////////////////////////////////////////////
void sse_trans(char const *inp, char *out)
{
#define INP(x,y) inp[(x)*4 + (y)/8]
#define OUT(x,y) out[(y)*4 + (x)/8]
int rr, cc, i, h;
union { __m256i x; uint8_t b[32]; } tmp;
for (cc = 0; cc < 32; cc += 8)
{
for (i = 0; i < 32; ++i)
tmp.b[i] = INP(i, cc);
for (i = 8; i--; tmp.x = _mm256_slli_epi64(tmp.x, 1))
*(uint32_t*)&OUT(0, cc + i) = _mm256_movemask_epi8(tmp.x);
}
}
////////////////////////////////////////////////////////////////////////////////
double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
Some remarks:
I used the Agner Fog's asmlib to have a function that returns CPU cycles
The compilation command is g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64
The gcc version is 7.3.1
The CPU is Intel(R) Core(TM) i7-6700HQ CPU # 2.60GHz
I compute some dummy checksum to compare the results of the different tests
Now the results:
------------------------------------------------------------------------------------------------------------
name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum
------------------------------------------------------------------------------------------------------------
naive (transpo) | 4548 ( 36.5) | 43.91 (0.35) | 2.592 | 0x9affeb5a6
naive (row by row) | 3033 ( 11.0) | 29.29 (0.11) | 2.592 | 0x9affeb5a6
AVX2 (transpo) | 767 ( 12.8) | 7.40 (0.12) | 2.592 | 0x9affeb5a6
AVX2 (row by row) | 130 ( 4.0) | 1.25 (0.04) | 2.591 | 0x9affeb5a6
So it seems that the "row by row" in AVX2 is the best so far.
Note that when I saw this result (less than 2 cycles per row), I made no more effort to optimize the AVX2 "transpose+popcount" method, which should be feasable by computing several popcounts in parallel (I may test it later).

I eventually wrote another implementation, following the high entropy SWAR approach proposed by Peter Cordes. This implementation is recursive and relies on C++ template specialization.
The global idea is to fill N-bit accumulators to their maximum without carry overflow (this is where recursion is used). When these accumulators are filled, we update the grand totals and we start again with new N-bit accumulators to fill until all rows have been processed.
Here is the code (see function test_SWAR_recursive):
#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>
using namespace std;
using namespace std::chrono;
// avoid the #include <asmlib.h>
extern "C" u_int64_t ReadTSC();
static double deviation (double n, double sum2, double sum) { return sqrt (sum2/n - (sum/n)*(sum/n)); }
////////////////////////////////////////////////////////////////////////////////
// Recursive SWAR approach (with template specialization)
////////////////////////////////////////////////////////////////////////////////
template<int DEPTH>
struct RecursiveSWAR
{
// Number of accumulators for current depth
static const int N = 1<<DEPTH;
// Array of N-bit accumulators
typedef __m256i Array[N];
// Magic numbers (0x55555555, 0x33333333, ...) computed recursively
static const u_int32_t MAGIC_NUMBER =
RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER
* (1 + (1<<(1<<(DEPTH-1))))
/ (1 + (1<<(1<<(DEPTH+0))));
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array accumulators)
{
// We reset the N-bit accumulators
for (int i=0; i<N; i++) { accumulators[i] = _mm256_setzero_si256(); }
// We check (only for depth big enough) that we have still rows to process
if (DEPTH>=3) if (begin>=end) { return; }
typename RecursiveSWAR<DEPTH-1>::Array accumulatorsMinusOne;
// We load a register with the mask
__m256i mask = _mm256_set1_epi32 (RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER);
// We fill the N-bit accumulators to their maximum capacity without carry overflow
for (int i=0; i<N+1; i++)
{
// We fill (N-1)-bit accumulators recursively
RecursiveSWAR<DEPTH-1>::fillAccumulators (begin, end, accumulatorsMinusOne);
// We update the N-bit accumulators from the (N-1)-bit accumulators
for (int j=0; j<RecursiveSWAR<DEPTH-1>::N; j++)
{
// LOW part
accumulators[2*j+0] = _mm256_add_epi32 (
accumulators[2*j+0],
_mm256_and_si256 (
accumulatorsMinusOne[j],
mask
)
);
// HIGH part
accumulators[2*j+1] = _mm256_add_epi32 (
accumulators[2*j+1],
_mm256_and_si256 (
_mm256_srli_epi32 (
accumulatorsMinusOne[j],
RecursiveSWAR<DEPTH-1>::N
),
mask
)
);
}
}
}
};
// Template specialization for DEPTH=0
template<>
struct RecursiveSWAR<0>
{
static const int N = 1;
typedef __m256i Array[N];
static const u_int32_t MAGIC_NUMBER = 0x55555555;
static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array result)
{
// We just load 8 rows in the AVX2 register
result[0] = _mm256_loadu_si256 ((__m256i*)begin);
// We update the iterator
begin += 1*sizeof(__m256i)/sizeof(u_int32_t);
}
};
template<int DEPTH> struct TypeInfo { };
template<> struct TypeInfo<3> { typedef u_int8_t Type; };
template<> struct TypeInfo<4> { typedef u_int16_t Type; };
template<> struct TypeInfo<5> { typedef u_int32_t Type; };
unsigned char reversebits (unsigned char b)
{
return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}
void test_SWAR_recursive (uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums)
{
static const int DEPTH = 4;
RecursiveSWAR<DEPTH>::Array accumulators;
uint32_t* begin = (uint32_t*) bitmap;
const uint32_t* end = bitmap + nbRows;
// We reset the grand totals
for (int i=0; i<32; i++) { globalSums[i] = 0; }
while (begin < end)
{
// We fill the N-bit accumulators to the maximum without overflow
RecursiveSWAR<DEPTH>::fillAccumulators (begin, end, accumulators);
// We update grand totals from the filled N-bit accumulators
for (int i=0; i<RecursiveSWAR<DEPTH>::N; i++)
{
int r = reversebits(i) >> (8-DEPTH);
u_int32_t* sums = globalSums+r;
TypeInfo<DEPTH>::Type* values = (TypeInfo<DEPTH>::Type*) (accumulators+i);
for (int j=0; j<8*(1<<(5-DEPTH)); j++)
{
sums[(j*RecursiveSWAR<DEPTH>::N) % 32] += values[j];
}
}
}
}
////////////////////////////////////////////////////////////////////////////////
void execute (
const char* name,
void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint32_t* globalSums),
size_t nbRuns,
uint64_t nbRows,
u_int32_t* bitmap
)
{
uint32_t sums[32];
double timeTotal=0;
double cycleTotal=0;
double timeTotal2=0;
double cycleTotal2=0;
uint64_t check=0;
for (size_t n=0; n<nbRuns; n++)
{
memset(sums,0,sizeof(sums));
// We want both time and cpu cycles information
milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
uint64_t c0 = ReadTSC();
// We run the test
(*fct) (nbRows, bitmap, sums);
uint64_t c1 = ReadTSC();
milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
timeTotal += (t1-t0).count();
cycleTotal += (double)(c1-c0) / nbRows;
timeTotal2 += (t1-t0).count() * (t1-t0).count();
cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);
// We compute some dummy checksum
for (size_t k=0; k<32; k++) { check += (k+1)*sums[k]; }
}
printf ("%-21s | %5.0lf (%5.1lf) | %5.2lf (%5.3lf) | %.3lf | 0x%lx\n",
name,
timeTotal / nbRuns,
deviation (nbRuns, timeTotal2, timeTotal),
cycleTotal/nbRuns,
deviation (nbRuns, cycleTotal2, cycleTotal),
nbRows * cycleTotal / timeTotal / 1000000.0,
check/nbRuns
);
}
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
// We set rows number as 2^n where n is the provided argument
// For simplification, we assume that the rows number is a multiple of 32
uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
size_t nbRuns = argc>2 ? atoi(argv[2]) : 10;
// We build an bitmap of size nbRows*32
uint64_t actualNbRows = nbRows + 100000;
uint32_t* bitmap = (uint32_t*)_mm_malloc(sizeof(uint32_t)*actualNbRows, 256);
if (bitmap==nullptr)
{
fprintf(stderr, "unable to allocate the bitmap\n");
exit(1);
}
memset (bitmap, 0, sizeof(u_int32_t)*actualNbRows);
// We fill the bitmap with random values
// srand(time(nullptr));
for (uint64_t i=0; i<nbRows; i++) { bitmap[i] = rand() & 0xFFFFFFFF; }
printf ("\n");
printf ("nbRows=%ld nbRuns=%ld\n", nbRows, nbRuns);
printf ("------------------------------------------------------------------------------------------------------------\n");
printf ("name | time in msec : mean (sd) | cycles/row : mean (sd) | frequency in GHz | checksum\n");
printf ("------------------------------------------------------------------------------------------------------------\n");
// We launch the benchmark
execute ("AVX2 (SWAR rec) ", test_SWAR_recursive, nbRuns, nbRows, bitmap);
printf ("\n");
// Some clean up
_mm_free (bitmap);
return EXIT_SUCCESS;
}
The size of the accumulators is 2DEPTH in this code. Note that this implementation is valid up to DEPTH=5. For DEPTH=4, here are the performance results compared to the implementation of Peter Cordes (named high entropy SWAR):
The graph gives the number of cycles required to process a row (of 32 items) as a function of the number of rows of the matrix. As expected, the results are pretty similar since the main idea is the same. It is interesting to note the three parts of the graph:
constant value for log2(n)<=20
increasing value for log2(n) between 20 and 22
constant value for log2(n)>=22
I guess that CPU caches properties can explain this behaviour.

Why is this slower than memcmp

I am trying to compare two rows of pixels.
A pixel is defined as a struct containing 4 float values (RGBA).
The reason I am not using memcmp is because I need to return the position of the 1st different pixel, which memcmp does not do.
My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128 x = _mm_load_ps((float*)(a + i));
__m128 y = _mm_load_ps((float*)(b + i));
__m128 cmp = _mm_cmpeq_ps(x, y);
if (_mm_movemask_ps(cmp) != 15) return i;
}
return -1;
}
I then found that treating the values as integers instead of floats sped things up a bit, and is now only ~20% slower than memcmp.
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128i x = _mm_load_si128((__m128i*)(a + i));
__m128i y = _mm_load_si128((__m128i*)(b + i));
__m128i cmp = _mm_cmpeq_epi32(x, y);
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
}
return -1;
}
From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is what other tricks does the MS implementation have up it's sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?
Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16 byte boundary?
I am compiling with /o2 and all the optimization flags.

I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".
If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.
The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.
You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.
Edit: The last paragraph may need an example. This code assumes an unrolled loop of at least two:
__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));
for(int i = 0; i < count; i+=2)
{
__m128i cmp = _mm_cmpeq_epi32(x, y);
__m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
__m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
cmp = _mm_cmpeq_epi32(x1, y1);
__m128i x = _mm_load_si128((__m128i*)(a + i + 2));
__m128i y = _mm_load_si128((__m128i*)(b + i + 2));
if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
Roughly something like that.

You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function, it starts with some sanity checks and then checks if the pointers are aligned or not:
aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );
If they are not aligned it compares the pointers byte by byte until the start of an aligned address:
while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
if(*a++ != *b++) return -1;
--len;
}
And then compares the remaining memory with SSE instructions similar to your code:
if(!len) return 0;
while( len && !(len & 7 ) )
{
__m128i x = _mm_load_si128( (__m128i*)&a[i]);
__m128i y = _mm_load_si128( (__m128i*)&b[i]);
....

I cannot help you directly because I'm using Mac, but there's an easy way to figure out what happens:
You just step into memcpy in the debug mode and switch to Disassembly view. As the memcpy is a simple little function, you will easily figure out all the implementation tricks.

Faster alpha blending using a lookup table?

I have made a lookup table that allows you to blend two single-byte channels (256 colors per channel) using a single-byte alpha channel using no floating point values (hence no float to int conversions). Each index in the lookup table corresponds to the value of 256ths of a channel, as related to an alpha value.
In all, to fully calculate a 3-channel RGB blend, it would require two lookups into the array per channel, plus an addition. This is a total of 6 lookups and 3 additions. In the example below, I split the colors into separate values for ease of demonstration. This example shows how to blend three channels, R G and B by an alpha value ranging from 0 to 256.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value
BYTE rem = 255 - av; // Remaining fraction
rDest = _lookup[r1][rem] + _lookup[r2][av];
gDest = _lookup[g1][rem] + _lookup[g2][av];
bDest = _lookup[b1][rem] + _lookup[b2][av];
It works great. Precise as you can get using 256 color channels. In fact, you would get the same exact values using the actual floating point calculations. The lookup table was calculated using doubles to begin with. The lookup table is too big to fit in this post (65536 bytes). (If you would like a copy of it, email me at ten.turtle.toes#gmail.com, but don't expect a reply until tomorrow because I am going to sleep now.)
So... what do you think? Is it worth it or not?

I would be interested in seeing some benchmarks.
There is an algorithm that can do perfect alpha blending without any floating point calculations or lookup tables. You can find more info in the following document (the algorithm and code is described at the end)
I also did an SSE implementation of this long ago, if you are interested...
void PreOver_SSE2(void* dest, const void* source1, const void* source2, size_t size)
{
static const size_t STRIDE = sizeof(__m128i)*4;
static const u32 PSD = 64;
static const __m128i round = _mm_set1_epi16(128);
static const __m128i lomask = _mm_set1_epi32(0x00FF00FF);
assert(source1 != NULL && source2 != NULL && dest != NULL);
assert(size % STRIDE == 0);
const __m128i* source128_1 = reinterpret_cast<const __m128i*>(source1);
const __m128i* source128_2 = reinterpret_cast<const __m128i*>(source2);
__m128i* dest128 = reinterpret_cast<__m128i*>(dest);
__m128i d, s, a, rb, ag, t;
for(size_t k = 0, length = size/STRIDE; k < length; ++k)
{
// TODO: put prefetch between calculations?(R.N)
_mm_prefetch(reinterpret_cast<const s8*>(source128_1+PSD), _MM_HINT_NTA);
_mm_prefetch(reinterpret_cast<const s8*>(source128_2+PSD), _MM_HINT_NTA);
// work on entire cacheline before next prefetch
for(int n = 0; n < 4; ++n, ++dest128, ++source128_1, ++source128_2)
{
// TODO: assembly optimization use PSHUFD on moves before calculations, lower latency than MOVDQA (R.N) http://software.intel.com/en-us/articles/fast-simd-integer-move-for-the-intel-pentiumr-4-processor/
// TODO: load entire cacheline at the same time? are there enough registers? 32 bit mode (special compile for 64bit?) (R.N)
s = _mm_load_si128(source128_1); // AABGGRR
d = _mm_load_si128(source128_2); // AABGGRR
// PRELERP(S, D) = S+D - ((S*D[A]+0x80)>>8)+(S*D[A]+0x80))>>8
// T = S*D[A]+0x80 => PRELERP(S,D) = S+D - ((T>>8)+T)>>8
// set alpha to lo16 from dest_
a = _mm_srli_epi32(d, 24); // 000000AA
rb = _mm_slli_epi32(a, 16); // 00AA0000
a = _mm_or_si128(rb, a); // 00AA00AA
rb = _mm_and_si128(lomask, s); // 00BB00RR
rb = _mm_mullo_epi16(rb, a); // BBBBRRRR
rb = _mm_add_epi16(rb, round); // BBBBRRRR
t = _mm_srli_epi16(rb, 8);
t = _mm_add_epi16(t, rb);
rb = _mm_srli_epi16(t, 8); // 00BB00RR
ag = _mm_srli_epi16(s, 8); // 00AA00GG
ag = _mm_mullo_epi16(ag, a); // AAAAGGGG
ag = _mm_add_epi16(ag, round);
t = _mm_srli_epi16(ag, 8);
t = _mm_add_epi16(t, ag);
ag = _mm_andnot_si128(lomask, t); // AA00GG00
rb = _mm_or_si128(rb, ag); // AABGGRR pack
rb = _mm_sub_epi8(s, rb); // sub S-[(D[A]*S)/255]
d = _mm_add_epi8(d, rb); // add D+[S-(D[A]*S)/255]
_mm_stream_si128(dest128, d);
}
}
_mm_mfence(); //ensure last WC buffers get flushed to memory
}

Today's processors can do a lot of calculation in the time it takes to get one value from memory, especially if it's not in cache. That makes it especially important to benchmark possible solutions, because you can't easily reason what the result will be.
I don't know why you're concerned about floating point conversions, this can all be done as integers.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value BYTE
rem = 255 - av; // Remaining fraction
rDest = (r1*rem + r2*av) / 255;
gDest = (g1*rem + g2*av) / 255;
bDest = (b1*rem + b2*av) / 255;
If you want to get really clever, you can replace the divide with a multiply followed by a right-shift.
Edit: Here's the version using a right-shift. Adding a constant might reduce any truncation errors, but I'll leave that as an exercise for the reader.
BYTE r1, r2, rDest;
BYTE g1, g2, gDest;
BYTE b1, b2, bDest;
BYTE av; // Alpha value BYTE
int aMult = 0x10000 * av / 255;
rem = 0x10000 - aMult; // Remaining fraction
rDest = (r1*rem + r2*aMult) >> 16;
gDest = (g1*rem + g2*aMult) >> 16;
bDest = (b1*rem + b2*aMult) >> 16;

Mixing colors is computationally trivial. I would be surprised if this yielded a significant benefit, but as always, you must first prove that this calculation is a bottleneck in performance in the first place. And I would suggest that a better solution is to use hardware accelerated blending.

Any way to make this relatively simple (nested for memory copy) C++ code more efficient?

I realize this is kind of a goofy question, for lack of a better term. I'm just kind of looking for any outside idea on increasing the efficiency of this code, as it's bogging down the system very badly (it has to perform this function a lot) and I'm running low on ideas.
What it's doing it loading two image containers (imgRGB for a full color img and imgBW for a b&w image) pixel-by-individual-pixel of an image that's stored in "unsigned char *pImage".
Both imgRGB and imgBW are containers for accessing individual pixels as necessary.
// input is in the form of an unsigned char
// unsigned char *pImage
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
imgRGB[y][x].blue = *pImage;
pImage++;
imgRGB[y][x].green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
imgRGB[y][x].red = *pImage;
pImage++;
}
}
Like I said, I was just kind of looking for fresh input and ideas on better memory management and/or copy than this. Sometimes I look at my own code so much I get tunnel vision... a bit of a mental block. If anyone wants/needs more information, by all means let me know.

The obvious question is, do you need to copy the data in the first place? Can't you just define accessor functions to extract the R, G and B values for any given pixel from the original input array?
If the image data is transient so you have to keep a copy of it, you could just make a raw copy of it without any reformatting, and again define accessors to index into each pixel/channel on that.
Assuming the copy you outlined is necessary, unrolling the loop a few times may prove to help.
I think the best approach will be to unroll the loop enough times to ensure that each iteration processes a chunk of data divisible by 4 bytes (so in each iteration, the loop can simply read a small number of ints, rather than a large number of chars)
Of course this requires you to mask out bits of these ints when writing, but that's a fast operation, and most importantly, it is done in registers, without burdening the memory subsystem or the CPU cache:
// First, we need to treat the input image as an array of ints. This is a bit nasty and technically unportable, but you get the idea)
unsigned int* img = reinterpret_cast<unsigned int*>(pImage);
for (int y = 0; y < 640; ++y)
{
for (int x = 0; x < 480; x += 4)
{
// At the start of each iteration, read 3 ints. That's 12 bytes, enough to write exactly 4 pixels.
unsigned int i0 = *img;
unsigned int i1 = *(img+1);
unsigned int i2 = *(img+2);
img += 3;
// This probably won't make a difference, but keeping a reference to the found pixel saves some typing, and it may assist the compiler in avoiding aliasing.
ImgRGB& pix0 = imgRGB[y][x];
pix0.blue = i0 & 0xff;
pix0.green = (i0 >> 8) & 0xff;
pix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
ImgRGB& pix1 = imgRGB[y][x+1];
pix1.blue = (i0 >> 24) & 0xff;
pix1.green = i1 & 0xff;
pix1.red = (i0 >> 8) & 0xff;
imgBW[y][x+1] = i1 & 0xff;
ImgRGB& pix2 = imgRGB[y][x+2];
pix2.blue = (i1 >> 16) & 0xff;
pix2.green = (i1 >> 24) & 0xff;
pix2.red = i2 & 0xff;
imgBW[y][x+2] = (i1 >> 24) & 0xff;
ImgRGB& pix3 = imgRGB[y][x+3];
pix3.blue = (i2 >> 8) & 0xff;
pix3.green = (i2 >> 16) & 0xff;
pix3.red = (i2 >> 24) & 0xff;
imgBW[y][x+3] = (i2 >> 16) & 0xff;
}
}
it is also very likely that you're better off filling a temporary ImgRGB value, and then writing that entire struct to memory at once, meaning that the first block would look like this instead: (the following blocks would be similar, of course)
ImgRGB& pix0 = imgRGB[y][x];
ImgRGB tmpPix0;
tmpPix0.blue = i0 & 0xff;
tmpPix0.green = (i0 >> 8) & 0xff;
tmpPix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
pix0 = tmpPix0;
Depending on how clever the compiler is, this may cut down dramatically on the required number of reads.
Assuming the original code is naively compiled (which is probably unlikely, but will serve as an example), this will get you from 3 reads and 4 writes per pixel (read RGB channel, and write RGB + BW) to 3/4 reads per pixel and 2 writes. (one write for the RGB struct, and one for the BW value)
You could also accumulate the 4 writes to the BW image in a single int, and then write that in one go too, something like this:
bw |= (i0 >> 8) & 0xff;
bw |= (i1 & 0xff) << 8;
bw |= ((i1 >> 24) & 0xff) << 16;
bw |= ((i2 >> 16) & 0xff) << 24;
*(imgBW + y*480+x/4) = bw; // Assuming you can treat imgBW as an array of integers
This would cut down on the number of writes to 1.25 per pixel (1 per RGB struct, and 1 for every 4 BW values)
Again, the benefit will probably be a lot smaller (or even nonexistent), but it may be worth a shot.
Taking this a step further, the same could be done without too much trouble using the SSE instructions, allowing you to process 4 times as many values per iteration. (Assuming you're running on x86)
Of course, an important disclaimer here is that the above is nonportable. The reinterpret_cast is probably an academic point (it'll most likely work no matter what, especially if you can ensure that the original array is aligned on a 32-bit boundary, which will typically be the case for large allocations on all platforms)
A bigger issue is that the bit-twiddling depends on the CPU's endianness.
But in practice, this should work on x86. and with small changes, it should work on big-endian machines too. (modulo any bugs in my code, of course. I haven't tested or even compiled any of it ;))
But no matter how you solve it, you're going to see the biggest speed improvements from minimizing the number of reads and writes, and trying to accumulate as much data in the CPU's registers as possible. Read all you can in large chunks, like ints, reorder it in the registers (accumulate it into a number of ints, or write it into temporary instances of the RGB struct), and then write those combined value out to memory.
Depending on how much you know about low-level optimizations, it may be surprising to you, but temporary variables are fine, while direct memory to memory access can be slow (for example your pointer dereferencing assigned directly into the array). The problem with this is that you may get more memory accesses than necessary, and it's harder for the compiler to guarantee that no aliasing will occur, and so it may be unable to reorder or combine the memory accesses. You're generally better off writing as much as you can early on (top of the loop), doing as much as possible in temporaries (because the compiler can keep everything in registers), and then write everything out at the end. That also gives the compiler as much leeway as possible to wait for the initially slow reads.
Finally, adding a 4th dummy value to the RGB struct (so it has a total size of 32bit) will most likely help a lot too (because then writing such a struct is a single 32-bit write, which is simpler and more efficient than the current 24-bit)
When deciding how much to unroll the loop (you could do the above twice or more in each iteration), keep in mind how many registers your CPU has. Spilling out into the cache will probably hurt you as there are plenty of memory accesses already, but on the other hand, unroll as much as you can afford given the number of registers available (the above uses 3 registers for keeping the input data, and one to accumulate the BW values. It may need one or two more to compute the necessary addresses, so on x86, doubling the above might be pushing it a bit (you have 8 registers total, and some of them have special meanings). On the other hand, modern CPU's do a lot to compensate for register pressure, by using a much larger number of registers behind the scenes, so further unrolling might still be a total performance win.
As always, measure measure measure. It's impossible to say what's fast and what isn't until you've tested it.
Another general point to keep in mind is that data dependencies are bad. This won't be a big deal as long as you're only dealing with integral values, but it still inhibits instruction reordering, and superscalar execution.
In the above, I've tried to keep dependency chains as short as possible. Rather than continually incrementing the same pointer (which means that each increment is dependant on the previous one), adding a different offset to the same base address means that every address can be computed independently, again giving more freedom to the compiler to reorder and reschedule instructions.

I think the array accesses (are they real array accesses or operator []?) are going to kill you. Each one represents a multiply.
Basically, you want something like this:
for (int y=0; y < height; y++) {
unsigned char *destBgr = imgRgb.GetScanline(y); // inline methods are better
unsigned char *destBW = imgBW.GetScanline(y);
for (int x=0; x < width; x++) {
*destBgr++ = *pImage++;
*destBW++ = *destBgr++ = *pImage++; // do this in one shot - don't double deref
*destBgr++ = *pImage++;
}
}
This will do two multiplies per scanline. You code was doing 4 multiplies per PIXEL.

What I like to do in situations like this is go into the debugger and step through the disassembly to see what it is really doing (or have the compiler generate an assembly listing). This can give you a lot of clues about where inefficencies are. They are often not where you think!
By implementing the changes suggested by Assaf and David Lee above, you can get a before and after instruction count. This really helps me in optimizing tight inner loops.

You could optimize away some of the pointer arithmetic you're doing over and over with the subscript operators [][] and use an iterator instead (that is, advance a pointer).

Memory bandwidth is your bottleneck here. There is a theoretical minimum time required to transfer all the data to and from system memory. I wrote a little test to compare the OP's version with some simple assembler to see how good the compiler was. I'm using VS2005 with default release mode settings. Here's the code:
#include <windows.h>
#include <iostream>
using namespace std;
const int
c_width = 640,
c_height = 480;
typedef struct _RGBData
{
unsigned char
r,
g,
b;
// I'm assuming there's no padding byte here
} RGBData;
// similar to the code given
void SimpleTest
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
for (int y = 0 ; y < c_height ; ++y)
{
for (int x = 0 ; x < c_width ; ++x)
{
rgb [x + y * c_width].b = *src;
src++;
rgb [x + y * c_width].g = *src;
bw [x + y * c_width] = *src;
src++;
rgb [x + y * c_width].r = *src;
src++;
}
}
}
// the assembler version
void ASM
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
const int
count = 3 * c_width * c_height / 12;
_asm
{
push ebp
mov esi,src
mov edi,bw
mov ecx,count
mov ebp,rgb
l1:
mov eax,[esi]
mov ebx,[esi+4]
mov edx,[esi+8]
mov [ebp],eax
shl eax,16
mov [ebp+4],ebx
rol ebx,16
mov [ebp+8],edx
shr edx,24
and eax,0xff000000
and ebx,0x00ffff00
and edx,0x000000ff
or eax,ebx
or eax,edx
add esi,12
bswap eax
add ebp,12
stosd
loop l1
pop ebp
}
}
// timing framework
LONGLONG TimeFunction
(
void (*function) (unsigned char *src, RGBData *rgb, unsigned char *bw),
char *description,
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
LARGE_INTEGER
start,
end;
cout << "Testing '" << description << "'...";
memset (rgb, 0, sizeof *rgb * c_width * c_height);
memset (bw, 0, c_width * c_height);
QueryPerformanceCounter (&start);
function (src, rgb, bw);
QueryPerformanceCounter (&end);
bool
ok = true;
unsigned char
*bw_check = bw,
i = 0;
RGBData
*rgb_check = rgb;
for (int count = 0 ; count < c_width * c_height ; ++count)
{
if (bw_check [count] != i || rgb_check [count].r != i || rgb_check [count].g != i || rgb_check [count].b != i)
{
ok = false;
break;
}
++i;
}
cout << (end.QuadPart - start.QuadPart) << (ok ? " OK" : " Failed") << endl;
return end.QuadPart - start.QuadPart;
}
int main
(
int argc,
char *argv []
)
{
unsigned char
*source_data = new unsigned char [c_width * c_height * 3];
RGBData
*rgb = new RGBData [c_width * c_height];
unsigned char
*bw = new unsigned char [c_width * c_height];
int
v = 0;
for (unsigned char *dest = source_data ; dest < &source_data [c_width * c_height * 3] ; ++dest)
{
*dest = v++ / 3;
}
LONGLONG
totals [2] = {0, 0};
for (int i = 0 ; i < 10 ; ++i)
{
cout << "Iteration: " << i << endl;
totals [0] += TimeFunction (SimpleTest, "Initial Copy", source_data, rgb, bw);
totals [1] += TimeFunction ( ASM, " ASM Copy", source_data, rgb, bw);
}
LARGE_INTEGER
freq;
QueryPerformanceFrequency (&freq);
freq.QuadPart /= 100000;
cout << totals [0] / freq.QuadPart << "ns" << endl;
cout << totals [1] / freq.QuadPart << "ns" << endl;
delete [] bw;
delete [] rgb;
delete [] source_data;
return 0;
}
And the ratio between C and assembler I was getting was about 2.5:1, i.e. C was 2.5 times the time of the assembler version.
I've just noticed the original data was in BGR order. If the copy swapped the B and R components then it does make the assembler code a bit more complex. But it would also make the C code more complex too.
Ideally, you need to work out what the theoretical minimum time is and compare it to what you're actually getting. To do that, you need to know the memory frequency and the type of memory and the workings of the CPU's MMU.

You might try using a simple cast to get your RGB data, and just recompute the grayscale data:
#pragma pack(1)
typedef unsigned char bw_t;
typedef struct {
unsigned char blue;
unsigned char green;
unsigned char red;
} rgb_t;
#pragma pack(pop)
rgb_t *imageRGB = (rgb_t*)pImage;
bw_t *imageBW = (bw_t*)calloc(640*480, sizeof(bw_t));
// RGB(X,Y) = imageRGB[Y*480 + X]
// BW(X,Y) = imageBW[Y*480 + X]
for (int y = 0; y < 640; ++y)
{
// try and pull some larger number of bytes from pImage (24 is arbitrary)
// 24 / sizeof(rgb_t) = 8
for (int x = 0; x < 480; x += 24)
{
imageBW[y*480 + x ] = GRAYSCALE(imageRGB[y*480 + x ]);
imageBW[y*480 + x + 1] = GRAYSCALE(imageRGB[y*480 + x + 1]);
imageBW[y*480 + x + 2] = GRAYSCALE(imageRGB[y*480 + x + 2]);
imageBW[y*480 + x + 3] = GRAYSCALE(imageRGB[y*480 + x + 3]);
imageBW[y*480 + x + 4] = GRAYSCALE(imageRGB[y*480 + x + 4]);
imageBW[y*480 + x + 5] = GRAYSCALE(imageRGB[y*480 + x + 5]);
imageBW[y*480 + x + 6] = GRAYSCALE(imageRGB[y*480 + x + 6]);
imageBW[y*480 + x + 7] = GRAYSCALE(imageRGB[y*480 + x + 7]);
}
}

Several steps you can take. Result at the end of this answer.
First, use pointers.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int y=0; y < 640; ++y) {
for (int x=0; x < 480; ++x) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
}
If imgRGB and imgBW are declared as:
unsigned char imgBW[480][640];
RGB imgRGB[480][640];
You can combine the two loops:
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int i=0; i < 640 * 480; ++i) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
You can exploit the fact that word reads are faster than four char reads. We will use a helper macro for this. Note this example assumes a little-endian target system.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = pImage;
for (int i=0; i < 640 * 480; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->blue = pixels; \
pixels >>= 8; \
\
rgbOut->green = pixels; \
*bwOut = pixels; \
pixels >>= 8; \
\
rgbOut->red = pixels; \
pixels >>= 8; \
\
++rgbOut; \
++bwOut;
#define READ_PIXEL(shift) \
pixels |= (*curPixelGroup++) << (shift * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
READ_PIXEL(3); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef COPY_PIXELS
}
(Your compiler will probably optimize away the redundant or operation in the first READ_PIXEL. It will also optimize shifts, removing the redundant << 0, too.)
If the structure of RGB is thus:
struct RGB {
unsigned char blue, green, red;
};
You can optimize even further, copy to the struct directly, instead of through its members (red, green, blue). This can be done using anonymous structs (or casting, but that makes the code a bit more messy and probably more prone to error). (Again, this is dependant on little-endian systems, etc. etc.):
union RGB {
struct {
unsigned char blue, green, red;
};
uint32_t rgb:24; // Make sure it's a bitfield, otherwise the union will strech and ruin the ++ operator.
};
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = pImage;
for (int i=0; i < 640 * 480; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->rgb = pixels; \
pixels >>= 8; \
\
*bwOut = pixels; \
pixels >>= 16; \
\
++rgbOut; \
++bwOut;
#define READ_PIXEL(shift) \
pixels |= (*curPixelGroup++) << (shift * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
READ_PIXEL(3); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef COPY_PIXELS
}
You can optimize writing the pixel similarly as we did with reading (writing in words rather than 24-bits). In fact, that'd be a pretty good idea, and will be a great next step in optimization. Too tired to code it, though. =]
Of course, you can write the routine in assembly language. This makes it less portable than it already is, however.

I'm assuming the following at the moment, so please let me know if my assumptions are wrong:
a) imgRGB is a structure of the type
struct ImgRGB
{
unsigned char blue;
unsigned char green;
unsigned char red;
};
or at least something similar.
b) imgBW looks something like this:
struct ImgBW
{
unsigned char BW;
};
c) The code is single threaded
Assuming the above, I see several problems with your code:
You put the assignment to the BW part right in the middle of the assignments to the other containers. If you're working on a modern CPU, chances are that with the size of your data your L1 cache gets invalidated every time you're switching containers and you're looking at reloading or switching a cache line. Caches are optimised for linear access these days so hopping to and fro doesn't help. Accessing main memory is a lot slower, so that would be a noticeable performance hit. To verify if this is a problem, temporarily I'd remove the assignment to imgBW and measure if there is a noticeable speedup.
The array access doesn't help and it'll potentially slow down the code a little, although a decent optimiser should take care of that. I'd probably write the loop along these lines instead, but would not expect a big performance gain. Maybe a couple percent.
for (int y=0; y blue = *pImage;
...
}
}
For consistency I would change from using postfix to prefix increment but I would not expect to see a big gain.
If you can waste a little storage (well, 25%) you might gain from adding a fourth dummy unsigned char to the structure ImgRGB provided that this would increase the size of the structure to the size of an int. Native ints are usually fastest to access and if you're looking at a structure of chars that are not filling up an int completely, you're potentially running into all sorts of interesting access issues that can slow your code down noticeably because the compiler might have to generate additional instructions to extract the unsigned chars. Again, try this and measure the result - it might make a noticeable difference or none at all. In the same vein, upping the size of the structure members from unsigned char to unsigned int might waste lots of space but potentially can speed up the code. Nevertheless as long as pImage is a pointer to an unsigned char, you would only eliminate half the problem.
All in all you are down to making your loop fit to your underlying hardware, so for specific optimisation techniques you might have to read up on what your hardware does well and what it does badly.

Make sure pImage, imgRGB, and imgBW are marked __restrict.
Use SSE and do it sixteen bytes at a time.
Actually from what you're doing there it looks like you could use a simple memcpy() to copy pImage into imgRGB (since imgRGB is in row-major format and apparently in the same order as pImage). You could fill out imgBW by using a series of SSE swizzle and store ops to pack down the green values but it might be cumbersome since you'd need to work on ( 3*16 =) 48 bytes at a time.
Are you sure pImage and your output arrays are all in dcache when you start this? Try using a prefetch hint to fetch 128 bytes ahead and measure to see if that improves things.
Edit If you're not on x86, replace "SSE" with the appropriate SIMD instruction set for your hardware, of course. (That'd be VMX, Altivec, SPU, VLIW, HLSL, etc.)

If possible, fix this at a higher level then bit or instruction twiddling!
You could specialize the the B&W image class to one that references the green channel of the color image class (thus saving a copy per pixel). If you always create them in pair, you might not even need the naive imgBW class at all.
By taking care about how your store the data in imgRGB, you could copy a triplet at a time from the input data. Better, you might copy the whole thing, or even just store a reference (which makes the previous suggestion easy as well).
If you don't control the implementation of everything here, you might be stuck, then:
Last resort: unroll the loop (cue someone mentioning Duff's device, or just ask the compiler to do it for you...), though I don't think you'll see much improvement...

It seems that you defined each pixel as some kind of structure or object. Using a primitive type (say, int) could be faster. As others have mentioned, the compiler is likely to optimize the array access using pointer increments. If the compile doesn't do that for you, you can do that yourself to avoid multiplications when you use array[][].
Since you only need 3 bytes per pixel, you could pack one pixel into one int. By doing that, you could copy 3 bytes a time instead of byte-by-byte. The only tricky thing is when you want to read individual color components of a pixel, you will need some bit masking and shifting. This could give you more overhead than that saved by using an int.
Or you can use 3 int arrays for 3 color components respectively. You will need a lot more storage, though.

Here is one very tiny, very simple optimization:
You are referring to imageRGB[y][x] repeatedly, and that likely needs to be re-calculated at each step.
Instead, calculate it once, and see if that makes some improvement:
Pixel* apixel;
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
apixel = &imgRGB[y][x];
apixel->blue = *pImage;
pImage++;
apixel->green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
apixel->red = *pImage;
pImage++;
}
}

If pImage is already entirely in memory, why do you need to massage the data? I mean if it is already in pseudo-RGB format, why can't you just write some inline routines/macros that can spit out the values on demand instead of copying it around?
If rearranging the pixel data is important for later operations, consider block operations and/or cache line optimization.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js