Why is this slower than memcmp - c++

I am trying to compare two rows of pixels.
A pixel is defined as a struct containing 4 float values (RGBA).
The reason I am not using memcmp is because I need to return the position of the 1st different pixel, which memcmp does not do.
My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128 x = _mm_load_ps((float*)(a + i));
__m128 y = _mm_load_ps((float*)(b + i));
__m128 cmp = _mm_cmpeq_ps(x, y);
if (_mm_movemask_ps(cmp) != 15) return i;
}
return -1;
}
I then found that treating the values as integers instead of floats sped things up a bit, and is now only ~20% slower than memcmp.
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128i x = _mm_load_si128((__m128i*)(a + i));
__m128i y = _mm_load_si128((__m128i*)(b + i));
__m128i cmp = _mm_cmpeq_epi32(x, y);
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
}
return -1;
}
From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is: what other tricks does the MS implementation have up its sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?
Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16 byte boundary?
I am compiling with /O2 and all the optimization flags.

I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".
If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.
The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.
You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.
Edit: The last paragraph may need an example. This code assumes an unrolled loop of at least two:
__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));
for(int i = 0; i < count; i+=2)
{
__m128i cmp = _mm_cmpeq_epi32(x, y);
// Load the next pair before testing the current one.
__m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
__m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
cmp = _mm_cmpeq_epi32(x1, y1);
// Assign to the outer x and y (no redeclaration) so the next iteration sees them.
// Note: the last iteration loads one pixel past the end; pad the rows or peel it off.
x = _mm_load_si128((__m128i*)(a + i + 2));
y = _mm_load_si128((__m128i*)(b + i + 2));
if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
return -1;
Roughly something like that.
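A slightly fuller sketch of the same idea, for illustration only: the loop is unrolled by two, both pairs are loaded before the first branch, and an odd trailing pixel is handled separately so nothing is read past the end. The name FirstDifferentPixel and the use of unaligned loads are my assumptions, not part of the question; whether this actually beats memcmp would need to be measured.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

struct Pixel { float r, g, b, a; };  // 16 bytes, as described in the question

// Returns the index of the first differing pixel, or -1 if the rows match.
// Compares bit patterns (integer compare), the same way memcmp would.
inline int FirstDifferentPixel(const Pixel* a, const Pixel* b, int count)
{
    int i = 0;
    // Unrolled by two: all four loads are issued before the first branch.
    for (; i + 2 <= count; i += 2)
    {
        __m128i x0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i y0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i x1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i + 1));
        __m128i y1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i + 1));
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(x0, y0)) != 0xffff) return i;
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(x1, y1)) != 0xffff) return i + 1;
    }
    // Tail: at most one pixel remains when count is odd.
    if (i < count)
    {
        __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i y = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        if (_mm_movemask_epi8(_mm_cmpeq_epi32(x, y)) != 0xffff) return i;
    }
    return -1;
}
```

If the rows are known to be 16-byte aligned, the loadu calls can be swapped for _mm_load_si128.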

You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function, it starts with some sanity checks and then checks if the pointers are aligned or not:
aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );
If they are not aligned, it compares the data byte by byte until it reaches an aligned address:
while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
if(*a++ != *b++) return -1;
--len;
}
And then compares the remaining memory with SSE instructions similar to your code:
if(!len) return 0;
while( len && !(len & 7 ) )
{
__m128i x = _mm_load_si128( (__m128i*)&a[i]);
__m128i y = _mm_load_si128( (__m128i*)&b[i]);
....

I cannot help you directly because I'm using Mac, but there's an easy way to figure out what happens:
You just step into memcmp in debug mode and switch to the Disassembly view. As memcmp is a simple little function, you will easily figure out all the implementation tricks.

Related

SIMD -> uint16_t array to float array work on float then back to uint16_t

I am currently working on a project that manipulates images. To speed up the process (and increase my knowledge), I decided to write some of the basic functions using SIMD instructions.
The code using for loops is
int idx;
uint16_t *A, *B, *C; // each one needs its own *, otherwise B and C are plain uint16_t
float gAlpha = 0.8f;
float alpha = 0.2f;
for (size_t rw = 0; rw < height; rw++) {
for (size_t cl = 0; cl < width; cl++) {
idx = rw * width + cl; // row-major index: row * width + column
C[idx] = static_cast<uint16_t>(gAlpha * static_cast<float>(A[idx]) + alpha * static_cast<float>(B[idx]));
}
}
This loop is probably not perfect, but it does its job and my unit test gives me the expected results.
As I said, I am trying to convert these loops to SIMD intrinsics. This is my working code and, as you will see, it is not very pretty... We have access to intrinsics up to AVX2.
size_t n_pixels = height * width;
for (size_t px = 0; px < n_pixels; px += 8) {
__m128i xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
__m128i xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&A[px]), _mm_set1_epi16(0));
__m128 ylo = _mm_cvtepi32_ps(xlo);
__m128 yhi = _mm_cvtepi32_ps(xhi);
__m256 pxMinFl = _mm256_castps128_ps256(ylo);
pxMinFl = _mm256_insertf128_ps(pxMinFl, yhi, 1);
xlo = _mm_unpacklo_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
xhi = _mm_unpackhi_epi16(_mm_load_si128((__m128i*)&B[px]), _mm_set1_epi16(0));
ylo = _mm_cvtepi32_ps(xlo);
yhi = _mm_cvtepi32_ps(xhi);
__m256 pxMaxFl = _mm256_castps128_ps256(ylo);
pxMaxFl = _mm256_insertf128_ps(pxMaxFl, yhi, 1);
__m256 avGain1 = _mm256_set1_ps(gAlpha);
__m256 avGain2 = _mm256_set1_ps(alpha);
__m256 prodUp = _mm256_mul_ps(pxMinFl, avGain1);
__m256 prodBt = _mm256_mul_ps(pxMaxFl, avGain2);
__m256 pxOutFl = _mm256_add_ps(prodUp, prodBt);
__m128 ylo_ps = _mm256_castps256_ps128(pxOutFl);
__m128 yhi_ps = _mm256_extractf128_ps(pxOutFl, 1);
__m128i xlo_ep = _mm_cvtps_epi32(ylo_ps);
__m128i xhi_ep = _mm_cvtps_epi32(yhi_ps); <- POINT 1
int* xl = reinterpret_cast<int*>(&xlo_ep); <- POINT 2
for (int i=0; i < 8; i++) { <- POINT 2
C[px + i] = static_cast<uint16_t>(xl[i]); <- POINT 2
}
}
There are probably tons of optimizations that could be done on this code, but I have checked that the output of pxOutFl corresponds to the expected value. Where it starts to look like black magic to me is when I look at how I had to save the data back into the output array C. First of all, the code doesn't work if I comment out the line at POINT 1 even though, as you can see, I don't use the variable. Secondly, I would have guessed that there is a better solution than the trick I used to store the data back into the uint16_t array (POINT 2), but I can't find one that works.
Could someone point me into the correct direction? What am I missing? How could I improve this code?
Thanks in advance!
PS: We use the Intel compiler 2017 for the parallel studio professional edition 2117 on Linux (Fedora 25).
You can re-write all of POINT 2 as:
_mm_storeu_si128((__m128i *)&C[px], xlo_ep);
Also note that all instances of _mm_load_si128 should probably be _mm_loadu_si128, since you don't seem to be guaranteeing alignment anywhere.
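One caveat worth spelling out: xlo_ep and xhi_ep each hold four 32-bit integers, so the eight results have to be narrowed to 16 bits before a single 128-bit store can fill C[px..px+7]. The following is a minimal sketch under my own assumptions (the helper name StoreAsU16 is hypothetical, and the values are assumed to lie in 0..65535). It sticks to SSE2 by biasing the values into the signed range; with SSE4.1 available, _mm_packus_epi32 performs an unsigned saturating pack in one step.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Narrow eight 32-bit values (each assumed to lie in [0, 65535]) to eight
// uint16_t values and store them at dst. lo holds elements 0..3, hi holds 4..7.
inline void StoreAsU16(std::uint16_t* dst, __m128i lo, __m128i hi)
{
    const __m128i bias32 = _mm_set1_epi32(32768);
    const __m128i bias16 = _mm_set1_epi16(-32768);  // bit pattern 0x8000
    // Shift into the signed 16-bit range so the signed saturating pack is exact.
    __m128i packed = _mm_packs_epi32(_mm_sub_epi32(lo, bias32),
                                     _mm_sub_epi32(hi, bias32));
    // Undo the bias: adding 0x8000 modulo 2^16 is the same as flipping the top bit.
    packed = _mm_xor_si128(packed, bias16);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), packed);
}
```

In the loop from the question this would be called as StoreAsU16(&C[px], xlo_ep, xhi_ep), which also makes use of xhi_ep explicit instead of reaching it by indexing past xlo_ep.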

access violation _mm_store_si128 SSE Intrinsics

I want to create a histogram of vertical gradients in an 8 bit gray image.
The vertical distance to calculate the gradient can be specified.
I already managed to speed up another part of my code using Intrinsics, but it does not work here.
The code runs without exception if the _mm_store_si128 is commented out.
When it is not commented, I get an access violation.
What is going wrong here?
#define _mm_absdiff_epu8(a,b) _mm_adds_epu8(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a)) //from opencv
void CreateAbsDiffHistogramUnmanaged(void* source, unsigned int sourcestride, unsigned int height, unsigned int verticalDistance, unsigned int histogram[])
{
unsigned int xcount = sourcestride / 16;
__m128i absdiffData;
unsigned char* bytes = (unsigned char*) _aligned_malloc(16, 16);
__m128i* absdiffresult = (__m128i*) bytes;
__m128i* sourceM = (__m128i*) source;
__m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride;
for (unsigned int y = 0; y < (height - verticalDistance); y++)
{
for (unsigned int x = 0; x < xcount; x++, ++sourceM, ++sourceVOffset)
{
absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
_mm_store_si128(absdiffresult, absdiffData);
//unroll loop
histogram[bytes[0]]++;
histogram[bytes[1]]++;
histogram[bytes[2]]++;
histogram[bytes[3]]++;
histogram[bytes[4]]++;
histogram[bytes[5]]++;
histogram[bytes[6]]++;
histogram[bytes[7]]++;
histogram[bytes[8]]++;
histogram[bytes[9]]++;
histogram[bytes[10]]++;
histogram[bytes[11]]++;
histogram[bytes[12]]++;
histogram[bytes[13]]++;
histogram[bytes[14]]++;
histogram[bytes[15]]++;
}
}
_aligned_free(bytes);
}
Your function crashed while loading because the input data was not aligned properly. In order to solve this problem you have to change your code from:
absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
to:
absdiffData = _mm_absdiff_epu8(_mm_loadu_si128(sourceM), _mm_loadu_si128(sourceVOffset));
Here I use unaligned loading.
P.S. I have implemented a similar function (SimdAbsSecondDerivativeHistogram) in Simd Library. It has SSE2, AVX2, NEON and Altivec implementations. I hope that it will help you.
P.P.S. I would also strongly recommend checking this line:
__m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride;
It may result in a crash (access to memory outside of the input array bounds). Maybe you had in mind this:
__m128i* sourceVOffset = (__m128i*)((char*)source + verticalDistance * sourcestride);
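The reason the two lines differ is that pointer arithmetic is scaled by the size of the pointed-to type: adding n to a __m128i* moves n * 16 bytes, while adding n to a char* moves n bytes. A small sketch to make that concrete; Block16, RowStart and BytesSkippedByElementPointer are hypothetical names, and Block16 merely stands in for __m128i so the example needs no intrinsics header.

```cpp
#include <cstddef>

struct Block16 { unsigned char bytes[16]; };  // stand-in for __m128i (16 bytes)
static_assert(sizeof(Block16) == 16, "Block16 must be 16 bytes");

// Bytes skipped when n is added to a Block16* (what the original line does).
constexpr std::size_t BytesSkippedByElementPointer(std::size_t n)
{
    return n * sizeof(Block16);
}

// Row start with the offset applied in bytes (what the corrected line does).
inline const Block16* RowStart(const void* source, std::size_t row,
                               std::size_t strideBytes)
{
    return reinterpret_cast<const Block16*>(
        static_cast<const unsigned char*>(source) + row * strideBytes);
}
```

With a stride of 32 bytes and a vertical distance of 2, the byte-based version lands 64 bytes into the image, while the element-scaled version would land 1024 bytes in.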

Performance AVX/SSE assembly vs. intrinsics

I'm just trying to find the best approach to optimizing some basic routines. In this case I tried a very simple example of multiplying 2 float vectors together:
void Mul(float *src1, float *src2, float *dst)
{
for (int i=0; i<cnt; i++) dst[i] = src1[i] * src2[i];
};
Plain C implementation is very slow. I did some external ASM using AVX and also tried using intrinsics. These are the test results (time, smaller is better):
ASM: 0.110
IPP: 0.125
Intrinsics: 0.18
Plain C++: 4.0
(compiled using MSVC 2013, SSE2, tried Intel Compiler, results were pretty much the same)
As you can see, my ASM code beat even Intel Performance Primitives (probably because I did lots of branches to ensure I can use the AVX aligned instructions). But I'd personally like to use the intrinsic approach; it's simply easier to manage, and I was thinking the compiler should do the best job optimizing all the branches and stuff (my ASM code sucks in that matter imho, yet it is faster). So here's the code using intrinsics:
int i;
for (i=0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];
if ((MINTEGER)(src1 + i) % 32 == 0)
{
if ((MINTEGER)(src2 + i) % 32 == 0)
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_load_ps( src1 + i);
__m256 y = _mm256_load_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
}
else
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_load_ps( src1 + i);
__m256 y = _mm256_loadu_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
};
}
else
{
for (; i<cnt-8; i+=8)
{
__m256 x = _mm256_loadu_ps( src1 + i);
__m256 y = _mm256_loadu_ps( src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_store_ps(dst + i, z);
};
};
for (; i<cnt; i++) dst[i] = src1[i] * src2[i];
Simple: First get to an address where dst is aligned to 32 bytes, then branch to check which sources are aligned.
One problem is that the C++ implementations in the beginning and at the end are not using AVX unless I enable AVX in the compiler, which I do NOT want, because this should be just AVX specialization, but the software should work even on a platform, where AVX is not available. And sadly there seems to be no intrinsics for instructions such as vmovss, so there's probably a penalty for mixing AVX code with SSE, which the compiler uses. However even if I enabled AVX in the compiler, it still didn't get below 0.14.
Any ideas how to optimize this to make the intrinsics reach the speed of the ASM code?
Your implementation with intrinsics is not the same function as your implementation in straight C: e.g. what if your function was called with arguments Mul(p, p, p+1)? You'll get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.
If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:
void Mul(float *src1, float *src2, float *__restrict__ dst)
or even better
void Mul(const float *src1, const float *src2, float *__restrict__ dst)
(I think it's enough to have __restrict__ just on the output pointer, although it wouldn't hurt to add it to the input pointers too)
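To see why overlap matters, here is a small sketch comparing the element-by-element loop with a loop that behaves like a 4-wide SIMD version (four loads, then four stores). MulScalar and MulBlocked4 are names I made up for illustration, and the blocked loop uses plain C++ rather than intrinsics so it runs anywhere.

```cpp
// Element-by-element version: each store is visible to later loads.
inline void MulScalar(const float* src1, const float* src2, float* dst, int cnt)
{
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}

// Emulates a 4-wide SIMD loop: four products are computed from the
// old contents first, and only then written back.
inline void MulBlocked4(const float* src1, const float* src2, float* dst, int cnt)
{
    int i = 0;
    for (; i + 4 <= cnt; i += 4)
    {
        float t[4];
        for (int k = 0; k < 4; k++) t[k] = src1[i + k] * src2[i + k];
        for (int k = 0; k < 4; k++) dst[i + k] = t[k];
    }
    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];
}
```

Starting from p = {2, 1, 1, 1, 1}, calling MulScalar(p, p, p + 1, 4) squares each freshly written value in turn, so the array ends as {2, 4, 16, 256, 65536}; MulBlocked4 with the same arguments gives {2, 4, 1, 1, 1}. Without a no-overlap promise, the compiler has to preserve the first behaviour.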
On CPUs with AVX there is very little penalty for using misaligned loads - I would suggest trading this small penalty off against all the extra logic you're using to check for alignment etc and just have a single loop + scalar code to handle any residual elements:
for (i = 0; i <= cnt - 8; i += 8)
{
__m256 x = _mm256_loadu_ps(src1 + i);
__m256 y = _mm256_loadu_ps(src2 + i);
__m256 z = _mm256_mul_ps(x, y);
_mm256_storeu_ps(dst + i, z);
}
for ( ; i < cnt; i++)
{
dst[i] = src1[i] * src2[i];
}
Better still, make sure that your buffers are all 32 byte aligned in the first place and then just use aligned loads/stores.
Note that performing a single arithmetic operation in a loop like this is generally a bad approach with SIMD - execution time will be largely dominated by loads and stores - you should try to combine this multiplication with other SIMD operations to mitigate the load/store cost.

Entrywise addition of two double arrays using AVX

I need a function to entrywise add the elements of two double arrays and store the result in a third array. Currently I use (simplified)
void add( double* result, const double* a, const double* b, size_t size) {
memcpy(result, a, size*sizeof(double));
for(size_t i = 0; i < size; ++i) {
result[i] += b[i];
}
}
As far as I know the memcpy function uses AVX. In order to improve the performance I would like to also enforce AVX use for the addition. This should be one of the most basic examples for AVX, however I couldn't find any description how to do this in C/C++. I would like to avoid the use of external libraries if possible.
You'll need something like this, assuming AVX-512:
void add( double* result, const double* a, const double* b, size_t size)
{
size_t i = 0;
// Note we are doing as many blocks of 8 as we can. If the size is not divisible by 8
// then we will have some left over that will then be performed serially.
// AVX-512 loop
for( ; i < (size & ~0x7); i += 8)
{
const __m512d kA8 = _mm512_load_pd( &a[i] );
const __m512d kB8 = _mm512_load_pd( &b[i] );
const __m512d kRes = _mm512_add_pd( kA8, kB8 );
_mm512_stream_pd( &result[i], kRes );
}
// AVX loop
for ( ; i < (size & ~0x3); i += 4 )
{
const __m256d kA4 = _mm256_load_pd( &a[i] );
const __m256d kB4 = _mm256_load_pd( &b[i] );
const __m256d kRes = _mm256_add_pd( kA4, kB4 );
_mm256_stream_pd( &result[i], kRes );
}
// SSE2 loop
for ( ; i < (size & ~0x1); i += 2 )
{
const __m128d kA2 = _mm_load_pd( &a[i] );
const __m128d kB2 = _mm_load_pd( &b[i] );
const __m128d kRes = _mm_add_pd( kA2, kB2 );
_mm_stream_pd( &result[i], kRes );
}
// Serial loop
for( ; i < size; i++ )
{
result[i] = a[i] + b[i];
}
}
(Though be warned I've just thrown that together off the top of my head).
Something to note from the above code is that I essentially process the remaining values using the next best parallel code. Primarily this is to illustrate the 3 possible ways you could do it in parallel. The loops will work perfectly well on their own. For example, if you can't support AVX-512 then you'd jump straight to the AVX loop. If you can't even support AVX, then you'd jump straight to the SSE2 loop. Either way you'll be using the most performant loop that your hardware can support.
For best performance your arrays should be aligned to the relevant size used in the load; the aligned load and stream-store intrinsics used above actually require it. So for AVX-512 you would want 512-bit or 64-byte alignment. For AVX, 256-bit or 32-byte alignment. For SSE2, 128-bit or 16-byte alignment. If you use 64-byte alignment for all your arrays then you will always have good alignment, though you may want to go for 128-byte alignment to ease moving over to AVX-1024 when that appears ;)
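As a rough illustration of the alignment advice, the sketch below shows one way to request 64-byte alignment for a block of doubles and to check a pointer before choosing an aligned code path. IsAligned and DoubleBlock are hypothetical names. alignas works for static and automatic storage; for heap allocations of over-aligned types you would need C++17 or a platform-specific aligned allocator.

```cpp
#include <cstddef>
#include <cstdint>

// True if p is aligned to `alignment` bytes (alignment must be a power of two).
inline bool IsAligned(const void* p, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// 64-byte alignment satisfies SSE2 (16), AVX (32) and AVX-512 (64) aligned loads.
struct alignas(64) DoubleBlock
{
    double values[8];
};
```

A dispatcher could test IsAligned(a, 64) && IsAligned(b, 64) && IsAligned(result, 64) once up front and fall back to unaligned loads and regular stores otherwise.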

Any way to make this relatively simple (nested for memory copy) C++ code more efficient?

I realize this is kind of a goofy question, for lack of a better term. I'm just kind of looking for any outside idea on increasing the efficiency of this code, as it's bogging down the system very badly (it has to perform this function a lot) and I'm running low on ideas.
What it's doing is loading two image containers (imgRGB for a full color image and imgBW for a b&w image) pixel-by-individual-pixel from an image that's stored in "unsigned char *pImage".
Both imgRGB and imgBW are containers for accessing individual pixels as necessary.
// input is in the form of an unsigned char
// unsigned char *pImage
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
imgRGB[y][x].blue = *pImage;
pImage++;
imgRGB[y][x].green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
imgRGB[y][x].red = *pImage;
pImage++;
}
}
Like I said, I was just kind of looking for fresh input and ideas on better memory management and/or copy than this. Sometimes I look at my own code so much I get tunnel vision... a bit of a mental block. If anyone wants/needs more information, by all means let me know.
The obvious question is, do you need to copy the data in the first place? Can't you just define accessor functions to extract the R, G and B values for any given pixel from the original input array?
If the image data is transient so you have to keep a copy of it, you could just make a raw copy of it without any reformatting, and again define accessors to index into each pixel/channel on that.
Assuming the copy you outlined is necessary, unrolling the loop a few times may prove to help.
I think the best approach will be to unroll the loop enough times to ensure that each iteration processes a chunk of data divisible by 4 bytes (so in each iteration, the loop can simply read a small number of ints, rather than a large number of chars)
Of course this requires you to mask out bits of these ints when writing, but that's a fast operation, and most importantly, it is done in registers, without burdening the memory subsystem or the CPU cache:
// First, we need to treat the input image as an array of ints. This is a bit nasty and technically unportable, but you get the idea)
unsigned int* img = reinterpret_cast<unsigned int*>(pImage);
for (int y = 0; y < 640; ++y)
{
for (int x = 0; x < 480; x += 4)
{
// At the start of each iteration, read 3 ints. That's 12 bytes, enough to write exactly 4 pixels.
unsigned int i0 = *img;
unsigned int i1 = *(img+1);
unsigned int i2 = *(img+2);
img += 3;
// This probably won't make a difference, but keeping a reference to the found pixel saves some typing, and it may assist the compiler in avoiding aliasing.
ImgRGB& pix0 = imgRGB[y][x];
pix0.blue = i0 & 0xff;
pix0.green = (i0 >> 8) & 0xff;
pix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
ImgRGB& pix1 = imgRGB[y][x+1];
pix1.blue = (i0 >> 24) & 0xff;
pix1.green = i1 & 0xff;
pix1.red = (i1 >> 8) & 0xff;
imgBW[y][x+1] = i1 & 0xff;
ImgRGB& pix2 = imgRGB[y][x+2];
pix2.blue = (i1 >> 16) & 0xff;
pix2.green = (i1 >> 24) & 0xff;
pix2.red = i2 & 0xff;
imgBW[y][x+2] = (i1 >> 24) & 0xff;
ImgRGB& pix3 = imgRGB[y][x+3];
pix3.blue = (i2 >> 8) & 0xff;
pix3.green = (i2 >> 16) & 0xff;
pix3.red = (i2 >> 24) & 0xff;
imgBW[y][x+3] = (i2 >> 16) & 0xff;
}
}
It is also very likely that you're better off filling a temporary ImgRGB value, and then writing that entire struct to memory at once, meaning that the first block would look like this instead (the following blocks would be similar, of course):
ImgRGB& pix0 = imgRGB[y][x];
ImgRGB tmpPix0;
tmpPix0.blue = i0 & 0xff;
tmpPix0.green = (i0 >> 8) & 0xff;
tmpPix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
pix0 = tmpPix0;
Depending on how clever the compiler is, this may cut down dramatically on the required number of reads.
Assuming the original code is naively compiled (which is probably unlikely, but will serve as an example), this will get you from 3 reads and 4 writes per pixel (read RGB channel, and write RGB + BW) to 3/4 reads per pixel and 2 writes. (one write for the RGB struct, and one for the BW value)
You could also accumulate the 4 writes to the BW image in a single int, and then write that in one go too, something like this:
unsigned int bw = (i0 >> 8) & 0xff; // start from a fresh value each iteration
bw |= (i1 & 0xff) << 8;
bw |= ((i1 >> 24) & 0xff) << 16;
bw |= ((i2 >> 16) & 0xff) << 24;
*(imgBW + y*(480/4) + x/4) = bw; // Assuming you can treat imgBW as an array of integers (480/4 of them per row)
This would cut down on the number of writes to 1.25 per pixel (1 per RGB struct, and 1 for every 4 BW values)
Again, the benefit will probably be a lot smaller (or even nonexistent), but it may be worth a shot.
Taking this a step further, the same could be done without too much trouble using the SSE instructions, allowing you to process 4 times as many values per iteration. (Assuming you're running on x86)
Of course, an important disclaimer here is that the above is nonportable. The reinterpret_cast is probably an academic point (it'll most likely work no matter what, especially if you can ensure that the original array is aligned on a 32-bit boundary, which will typically be the case for large allocations on all platforms)
A bigger issue is that the bit-twiddling depends on the CPU's endianness.
But in practice, this should work on x86, and with small changes it should work on big-endian machines too (modulo any bugs in my code, of course; I haven't tested or even compiled any of it ;))
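If the endianness dependence is a concern, one option is to assemble each 32-bit value from its bytes explicitly, so the shifts and masks above mean the same thing on any CPU. This is only a sketch; LoadLE32 is a name I made up. Many compilers are able to turn this pattern into a single load on little-endian targets, but that is worth verifying in the generated assembly rather than assuming.

```cpp
#include <cstdint>

// Assemble a 32-bit value from four bytes in little-endian order,
// independent of the host CPU's byte order and of pointer alignment.
inline std::uint32_t LoadLE32(const unsigned char* p)
{
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

With this helper, i0 would be LoadLE32(pImage), i1 would be LoadLE32(pImage + 4) and i2 would be LoadLE32(pImage + 8), which also sidesteps the reinterpret_cast.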
But no matter how you solve it, you're going to see the biggest speed improvements from minimizing the number of reads and writes, and trying to accumulate as much data in the CPU's registers as possible. Read all you can in large chunks, like ints, reorder it in the registers (accumulate it into a number of ints, or write it into temporary instances of the RGB struct), and then write those combined values out to memory.
Depending on how much you know about low-level optimizations, it may be surprising to you, but temporary variables are fine, while direct memory to memory access can be slow (for example your pointer dereferencing assigned directly into the array). The problem with this is that you may get more memory accesses than necessary, and it's harder for the compiler to guarantee that no aliasing will occur, and so it may be unable to reorder or combine the memory accesses. You're generally better off reading as much as you can early on (top of the loop), doing as much as possible in temporaries (because the compiler can keep everything in registers), and then writing everything out at the end. That also gives the compiler as much leeway as possible to wait for the initially slow reads.
Finally, adding a 4th dummy value to the RGB struct (so it has a total size of 32 bits) will most likely help a lot too, because then writing such a struct is a single 32-bit write, which is simpler and more efficient than the current 24-bit one.
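A minimal sketch of what such a padded struct could look like. PaddedRGB, the member name unused and the helper MakePixel are hypothetical; the static_assert simply documents the size assumption so a surprising compiler layout fails at build time.

```cpp
struct PaddedRGB
{
    unsigned char blue;
    unsigned char green;
    unsigned char red;
    unsigned char unused;  // hypothetical padding byte, never read
};
static_assert(sizeof(PaddedRGB) == 4, "PaddedRGB should occupy exactly 32 bits");

// Build the pixel in a temporary so it can be written with one struct assignment.
inline PaddedRGB MakePixel(unsigned char b, unsigned char g, unsigned char r)
{
    PaddedRGB p = { b, g, r, 0 };
    return p;
}
```

The cost is a third more memory for the RGB image, so this is a trade-off to measure rather than a guaranteed win.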
When deciding how much to unroll the loop (you could do the above twice or more in each iteration), keep in mind how many registers your CPU has. Spilling out into the cache will probably hurt you as there are plenty of memory accesses already, but on the other hand, unroll as much as you can afford given the number of registers available. The above uses 3 registers for keeping the input data, and one to accumulate the BW values. It may need one or two more to compute the necessary addresses, so on x86, doubling the above might be pushing it a bit (you have 8 registers total, and some of them have special meanings). On the other hand, modern CPUs do a lot to compensate for register pressure, by using a much larger number of registers behind the scenes, so further unrolling might still be a total performance win.
As always, measure measure measure. It's impossible to say what's fast and what isn't until you've tested it.
Another general point to keep in mind is that data dependencies are bad. This won't be a big deal as long as you're only dealing with integral values, but it still inhibits instruction reordering, and superscalar execution.
In the above, I've tried to keep dependency chains as short as possible. Rather than continually incrementing the same pointer (which means that each increment is dependent on the previous one), adding a different offset to the same base address means that every address can be computed independently, again giving more freedom to the compiler to reorder and reschedule instructions.
I think the array accesses (are they real array accesses or operator []?) are going to kill you. Each one represents a multiply.
Basically, you want something like this:
for (int y=0; y < height; y++) {
unsigned char *destBgr = imgRgb.GetScanline(y); // inline methods are better
unsigned char *destBW = imgBW.GetScanline(y);
for (int x=0; x < width; x++) {
*destBgr++ = *pImage++;
*destBW++ = *destBgr++ = *pImage++; // do this in one shot - don't double deref
*destBgr++ = *pImage++;
}
}
This will do two multiplies per scanline. Your code was doing 4 multiplies per PIXEL.
What I like to do in situations like this is go into the debugger and step through the disassembly to see what it is really doing (or have the compiler generate an assembly listing). This can give you a lot of clues about where inefficiencies are. They are often not where you think!
By implementing the changes suggested by Assaf and David Lee above, you can get a before and after instruction count. This really helps me in optimizing tight inner loops.
You could optimize away some of the pointer arithmetic you're doing over and over with the subscript operators [][] and use an iterator instead (that is, advance a pointer).
Memory bandwidth is your bottleneck here. There is a theoretical minimum time required to transfer all the data to and from system memory. I wrote a little test to compare the OP's version with some simple assembler to see how good the compiler was. I'm using VS2005 with default release mode settings. Here's the code:
#include <windows.h>
#include <iostream>
using namespace std;
const int
c_width = 640,
c_height = 480;
typedef struct _RGBData
{
unsigned char
r,
g,
b;
// I'm assuming there's no padding byte here
} RGBData;
// similar to the code given
void SimpleTest
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
for (int y = 0 ; y < c_height ; ++y)
{
for (int x = 0 ; x < c_width ; ++x)
{
rgb [x + y * c_width].b = *src;
src++;
rgb [x + y * c_width].g = *src;
bw [x + y * c_width] = *src;
src++;
rgb [x + y * c_width].r = *src;
src++;
}
}
}
// the assembler version
void ASM
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
const int
count = 3 * c_width * c_height / 12;
_asm
{
push ebp
mov esi,src
mov edi,bw
mov ecx,count
mov ebp,rgb
l1:
mov eax,[esi]
mov ebx,[esi+4]
mov edx,[esi+8]
mov [ebp],eax
shl eax,16
mov [ebp+4],ebx
rol ebx,16
mov [ebp+8],edx
shr edx,24
and eax,0xff000000
and ebx,0x00ffff00
and edx,0x000000ff
or eax,ebx
or eax,edx
add esi,12
bswap eax
add ebp,12
stosd
loop l1
pop ebp
}
}
// timing framework
LONGLONG TimeFunction
(
void (*function) (unsigned char *src, RGBData *rgb, unsigned char *bw),
char *description,
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
LARGE_INTEGER
start,
end;
cout << "Testing '" << description << "'...";
memset (rgb, 0, sizeof *rgb * c_width * c_height);
memset (bw, 0, c_width * c_height);
QueryPerformanceCounter (&start);
function (src, rgb, bw);
QueryPerformanceCounter (&end);
bool
ok = true;
unsigned char
*bw_check = bw,
i = 0;
RGBData
*rgb_check = rgb;
for (int count = 0 ; count < c_width * c_height ; ++count)
{
if (bw_check [count] != i || rgb_check [count].r != i || rgb_check [count].g != i || rgb_check [count].b != i)
{
ok = false;
break;
}
++i;
}
cout << (end.QuadPart - start.QuadPart) << (ok ? " OK" : " Failed") << endl;
return end.QuadPart - start.QuadPart;
}
int main
(
int argc,
char *argv []
)
{
unsigned char
*source_data = new unsigned char [c_width * c_height * 3];
RGBData
*rgb = new RGBData [c_width * c_height];
unsigned char
*bw = new unsigned char [c_width * c_height];
int
v = 0;
for (unsigned char *dest = source_data ; dest < &source_data [c_width * c_height * 3] ; ++dest)
{
*dest = v++ / 3;
}
LONGLONG
totals [2] = {0, 0};
for (int i = 0 ; i < 10 ; ++i)
{
cout << "Iteration: " << i << endl;
totals [0] += TimeFunction (SimpleTest, "Initial Copy", source_data, rgb, bw);
totals [1] += TimeFunction ( ASM, " ASM Copy", source_data, rgb, bw);
}
LARGE_INTEGER
freq;
QueryPerformanceFrequency (&freq);
freq.QuadPart /= 100000;
// totals cover 10 runs and freq was divided by 100000, so these are average microseconds per run
cout << totals [0] / freq.QuadPart << "us" << endl;
cout << totals [1] / freq.QuadPart << "us" << endl;
delete [] bw;
delete [] rgb;
delete [] source_data;
return 0;
}
And the ratio between C and assembler I was getting was about 2.5:1, i.e. C was 2.5 times the time of the assembler version.
I've just noticed the original data was in BGR order. If the copy swapped the B and R components then it does make the assembler code a bit more complex. But it would also make the C code more complex too.
Ideally, you need to work out what the theoretical minimum time is and compare it to what you're actually getting. To do that, you need to know the memory frequency and the type of memory and the workings of the CPU's MMU.
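As a back-of-the-envelope illustration of that lower bound, the sketch below counts only the bytes that must cross the memory bus for one 640x480 frame and divides by a sustained bandwidth. The constant names and MinMicroseconds are hypothetical, and any bandwidth figure you plug in is an assumption; it ignores caches, write-allocate traffic and everything else that makes real numbers worse.

```cpp
// Bytes moved per frame in the copy discussed above (640x480, 3 bytes per pixel).
constexpr long long kPixels       = 640LL * 480;
constexpr long long kBytesRead    = kPixels * 3;            // source image
constexpr long long kBytesWritten = kPixels * 3 + kPixels;  // RGB copy + BW copy
constexpr long long kBytesTotal   = kBytesRead + kBytesWritten;

// Lower bound in microseconds for a given sustained bandwidth in bytes per second.
constexpr double MinMicroseconds(double bytesPerSecond)
{
    return kBytesTotal / bytesPerSecond * 1e6;
}
```

That is 2,150,400 bytes per frame; at a made-up 6.4 GB/s, MinMicroseconds(6.4e9) comes out at about 336 microseconds, which is the kind of number to hold the measured timings against.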
You might try using a simple cast to get your RGB data, and just recompute the grayscale data:
#pragma pack(push, 1)
typedef unsigned char bw_t;
typedef struct {
unsigned char blue;
unsigned char green;
unsigned char red;
} rgb_t;
#pragma pack(pop)
rgb_t *imageRGB = (rgb_t*)pImage;
bw_t *imageBW = (bw_t*)calloc(640*480, sizeof(bw_t));
// RGB(X,Y) = imageRGB[Y*480 + X]
// BW(X,Y) = imageBW[Y*480 + X]
for (int y = 0; y < 640; ++y)
{
// try and pull some larger number of bytes from pImage (24 is arbitrary)
// 24 / sizeof(rgb_t) = 8
for (int x = 0; x < 480; x += 8) // x counts pixels: 8 pixels (24 bytes) per iteration
{
imageBW[y*480 + x ] = GRAYSCALE(imageRGB[y*480 + x ]);
imageBW[y*480 + x + 1] = GRAYSCALE(imageRGB[y*480 + x + 1]);
imageBW[y*480 + x + 2] = GRAYSCALE(imageRGB[y*480 + x + 2]);
imageBW[y*480 + x + 3] = GRAYSCALE(imageRGB[y*480 + x + 3]);
imageBW[y*480 + x + 4] = GRAYSCALE(imageRGB[y*480 + x + 4]);
imageBW[y*480 + x + 5] = GRAYSCALE(imageRGB[y*480 + x + 5]);
imageBW[y*480 + x + 6] = GRAYSCALE(imageRGB[y*480 + x + 6]);
imageBW[y*480 + x + 7] = GRAYSCALE(imageRGB[y*480 + x + 7]);
}
}
Several steps you can take. Result at the end of this answer.
First, use pointers.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int y=0; y < 640; ++y) {
for (int x=0; x < 480; ++x) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
}
If imgRGB and imgBW are declared as:
unsigned char imgBW[480][640];
RGB imgRGB[480][640];
You can combine the two loops:
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int i=0; i < 640 * 480; ++i) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
You can exploit the fact that word reads are faster than four char reads. We will use a helper macro for this. Note this example assumes a little-endian target system.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = reinterpret_cast<const uint32_t *>(pImage);
for (int i=0; i < 640 * 480; i += 4) { // 3 words = 12 bytes = 4 pixels per iteration
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->blue = pixels; \
pixels >>= 8; \
\
rgbOut->green = pixels; \
*bwOut = pixels; \
pixels >>= 8; \
\
rgbOut->red = pixels; \
pixels >>= 8; \
\
++rgbOut; \
++bwOut;
// Widen to 64 bits before shifting so no bits are lost.
#define READ_PIXEL(shift) \
pixels |= static_cast<uint64_t>(*curPixelGroup++) << ((shift) * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef READ_PIXEL
#undef WRITE_PIXEL
}
(Your compiler will probably optimize away the redundant or operation in the first READ_PIXEL. It will also optimize shifts, removing the redundant << 0, too.)
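For reference, here is a minimal self-contained sketch of the word-read idea without macros. The function name split_bgr_words and its parameters are my own, not from the question. It assumes a little-endian target and a pixel count divisible by 4, and it uses memcpy for the 12-byte read so that an unaligned source pointer is not a problem.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct RGB { unsigned char blue, green, red; };

// Hypothetical helper: splits a packed BGR stream into RGB structs plus a
// green-only grayscale buffer, reading three 32-bit words per four pixels.
// Assumes little-endian byte order and pixelCount % 4 == 0.
void split_bgr_words(const unsigned char *src, RGB *rgbOut,
                     unsigned char *bwOut, std::size_t pixelCount)
{
    for (std::size_t i = 0; i < pixelCount; i += 4) {
        std::uint32_t w[3];
        std::memcpy(w, src, sizeof w);  // 12 bytes = 4 pixels
        src += sizeof w;

        std::uint64_t pixels = 0;
        for (int p = 0; p < 4; ++p) {
            // Top up the bit buffer with the next word while words remain.
            if (p < 3)
                pixels |= static_cast<std::uint64_t>(w[p]) << (p * 8);
            rgbOut->blue  = static_cast<unsigned char>(pixels);
            rgbOut->green = static_cast<unsigned char>(pixels >> 8);
            *bwOut++      = static_cast<unsigned char>(pixels >> 8);
            rgbOut->red   = static_cast<unsigned char>(pixels >> 16);
            pixels >>= 24;  // drop the pixel just written
            ++rgbOut;
        }
    }
}
```

Whether this beats the plain byte loop depends on your compiler and CPU, so measure both.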
If the structure of RGB is thus:
struct RGB {
unsigned char blue, green, red;
};
You can optimize even further by copying to the struct directly instead of through its members (red, green, blue). This can be done using anonymous structs (or casting, but that makes the code a bit messier and probably more prone to error). (Again, this depends on a little-endian system, etc.):
union RGB {
struct {
unsigned char blue, green, red;
};
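uint32_t rgb:24; // Must be a bitfield, otherwise the union will stretch to 4 bytes and break the ++ operator. Check sizeof(RGB) == 3 on your compiler: many ABIs pad a uint32_t bitfield to 4 bytes unless the union is packed.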
};
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = (const uint32_t *)pImage;
// Three 32-bit reads (12 bytes) hold exactly four 3-byte pixels,
// so each iteration handles four pixels.
for (int i=0; i < 640 * 480 / 4; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->rgb = pixels; \
pixels >>= 8; \
\
*bwOut = pixels; \
pixels >>= 16; \
\
++rgbOut; \
++bwOut;
// Widen to 64 bits before shifting so the upper bytes are not lost.
#define READ_PIXEL(shift) \
pixels |= (uint64_t)(*curPixelGroup++) << ((shift) * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef READ_PIXEL
#undef WRITE_PIXEL
}
You can optimize writing the pixels the same way we did reading them (writing whole words rather than 24 bits at a time). In fact, that'd be a pretty good idea, and would be a great next step in optimization. Too tired to code it, though. =]
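A rough sketch of what word-sized writes could look like for the grayscale output only. The name extract_green_words is made up for illustration. It assumes a little-endian target and a pixel count divisible by 4: four green bytes are gathered into one 32-bit value and stored with a single 4-byte write.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: writes the green channel of a packed BGR stream
// to bwOut four bytes at a time. Assumes little-endian byte order and
// pixelCount % 4 == 0.
void extract_green_words(const unsigned char *src, unsigned char *bwOut,
                         std::size_t pixelCount)
{
    for (std::size_t i = 0; i < pixelCount; i += 4) {
        // Green is byte 1 of each 3-byte pixel: offsets 1, 4, 7 and 10.
        const std::uint32_t packed =
              static_cast<std::uint32_t>(src[1])
            | static_cast<std::uint32_t>(src[4])  << 8
            | static_cast<std::uint32_t>(src[7])  << 16
            | static_cast<std::uint32_t>(src[10]) << 24;
        std::memcpy(bwOut, &packed, sizeof packed);  // one 4-byte store
        src   += 12;
        bwOut += 4;
    }
}
```

The same trick applies to the RGB output, where 12 bytes can be assembled into three words before storing.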
Of course, you can write the routine in assembly language. This makes it less portable than it already is, however.
I'm assuming the following at the moment, so please let me know if my assumptions are wrong:
a) imgRGB is a structure of the type
struct ImgRGB
{
unsigned char blue;
unsigned char green;
unsigned char red;
};
or at least something similar.
b) imgBW looks something like this:
struct ImgBW
{
unsigned char BW;
};
c) The code is single threaded
Assuming the above, I see several problems with your code:
You put the assignment to the BW part right in the middle of the assignments to the other containers. If you're working on a modern CPU, chances are that with the size of your data your L1 cache lines get evicted and reloaded every time you switch containers. Caches are optimised for linear access these days, so hopping to and fro doesn't help. Accessing main memory is a lot slower, so that would be a noticeable performance hit. To verify whether this is a problem, I'd temporarily remove the assignment to imgBW and measure whether there is a noticeable speedup.
The array access doesn't help and it'll potentially slow down the code a little, although a decent optimiser should take care of that. I'd probably write the loop along these lines instead, but would not expect a big performance gain. Maybe a couple percent.
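ImgRGB *pRGB = &imgRGB[0][0]; // pointer that walks the output; the name is an assumption
for (int y=0; y < 640; ++y) {
for (int x=0; x < 480; ++x) {
pRGB->blue = *pImage;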
...
}
}
For consistency I would change from using postfix to prefix increment but I would not expect to see a big gain.
If you can waste a little storage (well, 25%) you might gain from adding a fourth dummy unsigned char to the structure ImgRGB provided that this would increase the size of the structure to the size of an int. Native ints are usually fastest to access and if you're looking at a structure of chars that are not filling up an int completely, you're potentially running into all sorts of interesting access issues that can slow your code down noticeably because the compiler might have to generate additional instructions to extract the unsigned chars. Again, try this and measure the result - it might make a noticeable difference or none at all. In the same vein, upping the size of the structure members from unsigned char to unsigned int might waste lots of space but potentially can speed up the code. Nevertheless as long as pImage is a pointer to an unsigned char, you would only eliminate half the problem.
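To make the padding idea concrete, here is a small sketch. The struct name RGBX and the helper load_bgr are hypothetical. The fourth byte is never used. It only rounds the pixel up to the size of a 32-bit int, which the static_assert checks at compile time.

```cpp
#include <cstdint>

// Hypothetical padded pixel: three colour bytes plus one dummy byte.
struct RGBX {
    unsigned char blue, green, red;
    unsigned char pad;  // unused, only there to reach 4 bytes
};

static_assert(sizeof(RGBX) == sizeof(std::uint32_t),
              "padded pixel should be exactly one 32-bit word");

// Builds one padded pixel from three consecutive source bytes (B, G, R).
inline RGBX load_bgr(const unsigned char *src)
{
    return RGBX{src[0], src[1], src[2], 0};
}
```

The input is still read byte by byte here, which is the "half the problem" that stays as long as pImage is an unsigned char pointer.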
All in all you are down to making your loop fit to your underlying hardware, so for specific optimisation techniques you might have to read up on what your hardware does well and what it does badly.
Make sure pImage, imgRGB, and imgBW are marked __restrict.
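As a sketch of the __restrict suggestion: the function name split_bgr and its signature are my own. __restrict is a compiler extension (accepted by GCC, Clang and MSVC in C++), and it is a promise from you that the three buffers never overlap, so only use it if that is really true.

```cpp
#include <cstddef>

struct RGB { unsigned char blue, green, red; };

// Hypothetical helper. The __restrict qualifiers tell the compiler that
// src, rgbOut and bwOut never alias, so it need not reload *src after
// each store to the outputs.
void split_bgr(const unsigned char *__restrict src,
               RGB *__restrict rgbOut,
               unsigned char *__restrict bwOut,
               std::size_t pixelCount)
{
    for (std::size_t i = 0; i < pixelCount; ++i) {
        rgbOut[i].blue  = src[0];
        rgbOut[i].green = src[1];
        bwOut[i]        = src[1];  // grayscale is taken from green
        rgbOut[i].red   = src[2];
        src += 3;
    }
}
```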
Use SSE and do it sixteen bytes at a time.
Actually from what you're doing there it looks like you could use a simple memcpy() to copy pImage into imgRGB (since imgRGB is in row-major format and apparently in the same order as pImage). You could fill out imgBW by using a series of SSE swizzle and store ops to pack down the green values but it might be cumbersome since you'd need to work on ( 3*16 =) 48 bytes at a time.
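A minimal sketch of the memcpy idea, assuming RGB is a tightly packed 3-byte struct in the same byte order as the input (the static_assert checks the size, the order is your responsibility). The function name copy_image is made up. The green extraction is a plain scalar loop here, standing in for the SSE swizzle version, which I have not written out.

```cpp
#include <cstddef>
#include <cstring>

struct RGB { unsigned char blue, green, red; };

static_assert(sizeof(RGB) == 3,
              "the memcpy shortcut needs a tightly packed RGB struct");

// Hypothetical helper: copies the whole BGR stream into the RGB array in
// one call, then fills the grayscale buffer from the green bytes.
void copy_image(const unsigned char *src, RGB *rgbOut,
                unsigned char *bwOut, std::size_t pixelCount)
{
    std::memcpy(rgbOut, src, pixelCount * sizeof(RGB));
    for (std::size_t i = 0; i < pixelCount; ++i)
        bwOut[i] = src[3 * i + 1];  // scalar stand-in for the SIMD pack
}
```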
Are you sure pImage and your output arrays are all in dcache when you start this? Try using a prefetch hint to fetch 128 bytes ahead and measure to see if that improves things.
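One way to try the prefetch hint, as a sketch only. The PREFETCH macro and the function name are hypothetical. It uses __builtin_prefetch on GCC/Clang and compiles to nothing elsewhere. Issuing a hint for every pixel is wasteful, and real code would issue one per cache line, but it is enough to measure whether prefetching helps at all.

```cpp
#include <cstddef>

// Hypothetical wrapper: a real prefetch on GCC/Clang, a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)0)
#endif

// Extracts the green channel while hinting the cache 128 bytes ahead.
void extract_green_prefetch(const unsigned char *src, unsigned char *bwOut,
                            std::size_t pixelCount)
{
    const std::size_t totalBytes = pixelCount * 3;
    const std::size_t ahead = 128;  // distance suggested above, tune it
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const std::size_t offset = i * 3;
        if (offset + ahead < totalBytes)  // stay inside the buffer
            PREFETCH(src + offset + ahead);
        bwOut[i] = src[offset + 1];
    }
}
```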
Edit: If you're not on x86, replace "SSE" with the appropriate SIMD instruction set for your hardware, of course. (That'd be VMX, Altivec, SPU, VLIW, HLSL, etc.)
If possible, fix this at a higher level than bit or instruction twiddling!
You could specialize the B&W image class to one that references the green channel of the color image class (thus saving a copy per pixel). If you always create them in pairs, you might not even need the naive imgBW class at all.
By taking care about how you store the data in imgRGB, you could copy a triplet at a time from the input data. Better, you might copy the whole thing, or even just store a reference (which makes the previous suggestion easy as well).
If you don't control the implementation of everything here, you might be stuck, then:
Last resort: unroll the loop (cue someone mentioning Duff's device, or just ask the compiler to do it for you...), though I don't think you'll see much improvement...
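To illustrate the "reference the green channel" suggestion, here is a minimal sketch of a non-owning grayscale view. The class name GreenView and its interface are invented for this example, and it assumes the colour image is a contiguous row-major array of RGB structs.

```cpp
#include <cstddef>

struct RGB { unsigned char blue, green, red; };

// Hypothetical B&W "image" that copies nothing: it reads the green
// channel of an existing colour buffer on demand.
class GreenView {
public:
    GreenView(const RGB *pixels, std::size_t width)
        : pixels_(pixels), width_(width) {}

    // Grayscale value at row y, column x.
    unsigned char at(std::size_t y, std::size_t x) const
    {
        return pixels_[y * width_ + x].green;
    }

private:
    const RGB *pixels_;   // not owned, must outlive the view
    std::size_t width_;   // pixels per row
};
```

The cost moves from one copy per pixel up front to a strided read on every access, so this pays off mostly when the B&W image is read rarely or only in part.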
It seems that you defined each pixel as some kind of structure or object. Using a primitive type (say, int) could be faster. As others have mentioned, the compiler is likely to optimize the array access using pointer increments. If the compiler doesn't do that for you, you can do it yourself to avoid multiplications when you use array[][].
Since you only need 3 bytes per pixel, you could pack one pixel into one int. By doing that, you could copy 3 bytes at a time instead of byte-by-byte. The only tricky thing is when you want to read individual color components of a pixel, you will need some bit masking and shifting. This could cost you more overhead than you save by using an int.
Or you can use 3 int arrays for 3 color components respectively. You will need a lot more storage, though.
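The packing and the masking/shifting mentioned above could look roughly like this. The function names and the byte layout (blue in the low byte, then green, then red) are my own choices for illustration, not something the question prescribes.

```cpp
#include <cstdint>

// Packs one pixel into the low 24 bits of a 32-bit int:
// bits 0-7 blue, bits 8-15 green, bits 16-23 red.
inline std::uint32_t pack_bgr(std::uint8_t blue, std::uint8_t green,
                              std::uint8_t red)
{
    return static_cast<std::uint32_t>(blue)
         | (static_cast<std::uint32_t>(green) << 8)
         | (static_cast<std::uint32_t>(red) << 16);
}

// Component accessors: this shift-and-mask is the overhead you pay
// every time you need a single channel.
inline std::uint8_t blue_of(std::uint32_t p)
{
    return static_cast<std::uint8_t>(p & 0xFF);
}
inline std::uint8_t green_of(std::uint32_t p)
{
    return static_cast<std::uint8_t>((p >> 8) & 0xFF);
}
inline std::uint8_t red_of(std::uint32_t p)
{
    return static_cast<std::uint8_t>((p >> 16) & 0xFF);
}
```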
Here is one very tiny, very simple optimization:
You are referring to imageRGB[y][x] repeatedly, and that likely needs to be re-calculated at each step.
Instead, calculate it once, and see if that makes some improvement:
Pixel* apixel;
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
apixel = &imgRGB[y][x];
apixel->blue = *pImage;
pImage++;
apixel->green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
apixel->red = *pImage;
pImage++;
}
}
If pImage is already entirely in memory, why do you need to massage the data? I mean if it is already in pseudo-RGB format, why can't you just write some inline routines/macros that can spit out the values on demand instead of copying it around?
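A sketch of such on-demand accessors, assuming pImage is a row-major buffer of 3-byte pixels in blue, green, red order. The enum and function names are hypothetical.

```cpp
#include <cstddef>

// Byte offset of each channel inside one 3-byte pixel (assumed BGR order).
enum Channel { Blue = 0, Green = 1, Red = 2 };

// Reads one channel straight out of the raw buffer, no copy needed.
inline unsigned char channel_at(const unsigned char *pImage,
                                std::size_t width,
                                std::size_t y, std::size_t x, Channel c)
{
    return pImage[(y * width + x) * 3 + static_cast<std::size_t>(c)];
}

// The "black and white" value is just the green channel.
inline unsigned char bw_at(const unsigned char *pImage, std::size_t width,
                           std::size_t y, std::size_t x)
{
    return channel_at(pImage, width, y, x, Green);
}
```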
If rearranging the pixel data is important for later operations, consider block operations and/or cache line optimization.