SSE: reinterpret_cast<__m128*> instead of _mm_load_ps - c++

I am in the process of coding up a simple convolution function in C++, starting from the very basic "sliding-window" convolution with regular products (no FFT stuff for now), up to SEE, AVX and possibly OpenCL. I ran into a problem with SSE though. My code looks like this:
for (x = 0; x < SIZEX - KSIZEX + 1; ++x)
{
for (y = 0; y < SIZEY - KSIZEY + 1; ++y)
{
tmp = 0.0f;
float fDPtmp = 0.0f;
float *Kp = &K[0];
for (xi = 0; xi < KSIZEX; ++xi, Kp=Kp+4)
{
float *Cp = &C[(x+xi)*SIZEY + y];
__m128 *KpSSE = reinterpret_cast<__m128*>(&K);
__m128 *CpSSE = reinterpret_cast<__m128*>(&C[(x + xi)*SIZEY + y]);
__m128 DPtmp = _mm_dp_ps(*KpSSE, *CpSSE, 0xFF);
_mm_store_ss(&fDPtmp, DPtmp);
tmp += fDPtmp;
}
R[k] = tmp;
++k;
}
}
The necessary matrices are initialized like this (the size of those is considerd ok because the simpler implementations work just fine):
__declspec(align(16)) float *C = ReadMatrix("E:\\Code\\conv\\C.bin");
__declspec(align(16)) float *K = ReadMatrix("E:\\Code\\conv\\K.bin");
__declspec(align(16)) float *R = new float[CSIZEX*CSIZEY];
The code crashes at y=1 so I feel there might be a mistake with the way I handle the pointers. The interesting thing is that if I replace the reinterpret_casts with _mm_set_ps, i.e.
__m128 KpSSE = _mm_set_ps(Kp[0], Kp[1], Kp[2], Kp[3]);
__m128 CpSSE = _mm_set_ps(Cp[0], Cp[1], Cp[2], Cp[3]);
__m128 DPtmp = _mm_dp_ps(KpSSE, CpSSE, 0xFF);
_mm_store_ss(&fDPtmp, DPtmp);
the whole thing works just fine although slower, which I blame on all the copy operations.
Can anybody please point me to what exactly I am doing wrong here?
Thank you very much
Pat
Update: Ok, so as pointed out by Paul the problem lies with ReadMatrix (or another solution would be to use _mm_loadu_ps). As for ReadMatrix(), it looks like this:
__declspec(align(16)) float* ReadMatrix(string path)
{
streampos size;
ifstream file(path, ios::in | ios::binary | ios::ate);
if (file.is_open())
{
size = file.tellg();
__declspec(align(16)) float *C = new float[size];
file.seekg(0, ios::beg);
file.read(reinterpret_cast<char*>(&C[0]), size);
file.close();
return C;
}
else cout << "Unable to open file" << endl;
}
It does not do the trick. Is there any other way of doing this elegantly rather than being forced to read the file piece by piece and perform memcpy, which I assume should work?!
Update:
Still does not seem to want to work after
__declspec(align(16)) float* ReadMatrix(string path)
{
streampos size;
ifstream file(path, ios::in | ios::binary | ios::ate);
if (file.is_open())
{
size = file.tellg();
__declspec(align(16)) float *C = static_cast<__declspec(align(16)) float*>(_aligned_malloc(size * sizeof(*C), 16));
file.seekg(0, ios::beg);
file.read(reinterpret_cast<char*>(&C[0]), size);
file.close();
return C;
}
else cout << "Unable to open file" << endl;
}
I added the static_cast up there since it seemed necessary to get Paul's code to compile (i.e. _aligned_malloc returns a void pointer). I am getting close to just read chunks of the file with fread and memcpy them into an alligned array. :/ Yet again I am finding myself asking for advice. Thank you very much all.
Pat
PS: Non-SSE code works fine with these data structures. _mm_loadu_ps is slower than using the non-SSE code.

This doesn't do what you think it does:
__declspec(align(16)) float *C = ReadMatrix("E:\\Code\\conv\\C.bin");
All that the alignment directive achieves here is to align the pointer itself (i.e. C) to a 16 byte boundary, not the contents of the pointer.
You either need to fix ReadMatrix so that it returns suitably aligned data, or use _mm_loadu_ps, as others have already suggested.
Do not use _mm_set_ps as this will tend to generate a lot of instructions under the hood, unlike _mm_loadu_ps, which maps to a single instruction.
UPDATE
You have repeated the same mistake in ReadMatrix:
__declspec(align(16)) float *C = new float[size];
again this does not guarantee the alignment of the data, only of the pointer C itself. To fix this allocation you can use _mm_malloc or _aligned_malloc:
float *C = _mm_malloc(size * sizeof(*C), 16);
or
float *C = _aligned_malloc(size * sizeof(*C), 16);

In ReadMatrix, you have no guarantee whatsoever that the new expression returns a properly aligned pointer. It doesn't matter that you assign to an aligned pointer (and I'm not even sure if your syntax means the pointer itself is aligned, or what it points to).
You need to use _mm_align, or _mm_malloc, or some other aligned allocation facility.

You can't use reinterpret_cast here, and I understand _mmloadu_ps is slow. But there is another way. Unroll your loop, read in aligned data, and shift and mask in the new value before you perform operations on it. This will be fast and correct. That is, you can do some like this in your inner loop:
__m128i x = _mm_load_ps(p);
__m128i y = _mm_load_ps(p + sizeof(float));
__m128i z;
// do your operation on x 1st time this iteration here
z = _mm_slli_si128(y, sizeof(float) * 3);
x = _mm_srli_si128(x, sizeof(float));
x = _mm_or_si128(x, z);
// do your operation on x 2nd time this iteration here
z = _mm_slli_si128(y, sizeof(float) * 2);
x = _mm_srli_si128(x, sizeof(float) * 2);
x = _mm_or_si128(x, z);
// do your operation on x 3rd time this iteration here
z = _mm_slli_si128(y, sizeof(float));
x = _mm_srli_si128(x, sizeof(float) * 3);
x = _mm_or_si128(x, z);
// do your operation on x 4th time this iteration here
x = y; // don’t need to read in x next iteration, only y
loopCounter += 4 * sizeof(float);

Related

access violation _mm_store_si128 SSE Intrinsics

I want to create a histogram of vertical gradients in an 8 bit gray image.
The vertical distance to calculate the gradient can be specified.
I already managed to speed up another part of my code using Intrinsics, but it does not work here.
The code runs without exception if the _mm_store_si128 is commented out.
When it is not commented, I get an access violation.
What is going wrong here?
#define _mm_absdiff_epu8(a,b) _mm_adds_epu8(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a)) //from opencv
void CreateAbsDiffHistogramUnmanaged(void* source, unsigned int sourcestride, unsigned int height, unsigned int verticalDistance, unsigned int histogram[])
{
unsigned int xcount = sourcestride / 16;
__m128i absdiffData;
unsigned char* bytes = (unsigned char*) _aligned_malloc(16, 16);
__m128i* absdiffresult = (__m128i*) bytes;
__m128i* sourceM = (__m128i*) source;
__m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride;
for (unsigned int y = 0; y < (height - verticalDistance); y++)
{
for (unsigned int x = 0; x < xcount; x++, ++sourceM, ++sourceVOffset)
{
absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
_mm_store_si128(absdiffresult, absdiffData);
//unroll loop
histogram[bytes[0]]++;
histogram[bytes[1]]++;
histogram[bytes[2]]++;
histogram[bytes[3]]++;
histogram[bytes[4]]++;
histogram[bytes[5]]++;
histogram[bytes[6]]++;
histogram[bytes[7]]++;
histogram[bytes[8]]++;
histogram[bytes[9]]++;
histogram[bytes[10]]++;
histogram[bytes[11]]++;
histogram[bytes[12]]++;
histogram[bytes[13]]++;
histogram[bytes[14]]++;
histogram[bytes[15]]++;
}
}
_aligned_free(bytes);
}
Your function crashed while loading because the input data was not aligned properly. In order to solve this problem you have to change your code from:
absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
to:
absdiffData = _mm_absdiff_epu8(_mm_loadu_si128(sourceM), _mm_loadu_si128(sourceVOffset));
Here I use unaligned loading.
P.S. I have implemented a similar function (SimdAbsSecondDerivativeHistogram) in Simd Library. It has SSE2, AVX2, NEON and Altivec implementations. I hope that it will help you.
P.P.S. Also I would strongly recommended to check this line:
__m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride);
It may result in a crash (access to memory outside of the input array bounds). Maybe you had in mind this:
__m128i* sourceVOffset = (__m128i*)((char*)source + verticalDistance * sourcestride);

Why is my CUDA kernel returning old values?

Kind of almost at the point of ripping my hair out over this issue.
I have a CUDA kernel that does some math on data stored in a 3D array. While testing this, I used to assign some values (non-zero) to the array and observe results. I commented out those lines since, but the result is still the same. It is as if it is completely ignoring the fact that I'm doing a memset to 0.
The code works correctly when I step through it in Debug... But not in Release! My guess is I have a memory leak from this matrix.
I allocate this array as:
cudaExtent m_extent = make_cudaExtent(sizeof(float)*matdim.x, matdim.y, matdim.z); // width, height, depth
cudaPitchedPtr m_device;
cudaMalloc3D(&m_device, m_extent);
cudaMemset3D(m_device, 0, m_extent);
I call the kernel in a loop like this:
for (int iter = 0; iter < gpu_iterations; iter++)
{
PF_iteration_kernel<<<grids,threads>>>(m_device, m_extent, matdim);
cudaDeviceSynchronize();
}
After which I release the m_device pitched pointer:
cudaFree(m_device.ptr);
matdim is just matrix dimensions held by a dim3.
Within the kernel I do the following (well, I commented everything functional out...):
__global__ void PF_iteration_kernel(cudaPitchedPtr mPtr, cudaExtent mExt, dim3 matrix_dimensions)
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// Find location within the pitched memory
char *m = (char*)mPtr.ptr;
int sof = sizeof(float);
size_t pitch = mPtr.pitch;
size_t slice_pitch = pitch*mExt.height;
char* m_addroff = m + y * pitch + x * sof;
printf("m(%d,%d) is %f \n", x, y, *m_addroff); // display the slice
*m_addroff = 0; // WILL THIS RESET IT?!
__syncthreads();
}
That should be just showing 0s, but it displays my old values (25, 26, 27, 28, etc).
I have cleaned and re-cleaned and re-built everything several times. I have relaunched the IDE.
My IDE is Visual Studio 2010 With NSight 4.6 (CUDA 7.0).
I am on Windows 7 x64
Consider this
char* m_addroff = m + y * pitch + x * sof;
printf("m(%d,%d) is %f \n", x, y, *m_addroff);
The compiler will see a char and promote it to int pushed on to stack - not a float promoted to double that the format requires.
The compiler does not provide arguments to fit the format spec, but some compilers will examine the format specs and warn of problems.
I suggest you cast the argument. I risk guessing and failing, but something like this
printf("m(%d,%d) is %f \n", x, y, *(float*)m_addroff);
Herer is a simple example.
#include <stdio.h>
int main()
{
char car [4] = {0};
char *cptr = car;
printf ("Hello %f\n", *(float*)cptr);
return 0;
}

Optimize a nearest neighbor resizing algorithm for speed

I'm using the next algorithm to perform nearest neighbor resizing. Is there anyway to optimize it's speed? Input and Output buffers are in ARGB format, though images are known to be always opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight) ;
const int colors = 4;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
for (int x = 0; x < targetWidth; x++)
{
int x2 = ((x * x_ratio) >> 16) ;
int y2_x2_colors = (y2_xsource + x2) * colors;
int i_x_colors = (i_xdest + x) * colors;
output[i_x_colors] = input[y2_x2_colors];
output[i_x_colors + 1] = input[y2_x2_colors + 1];
output[i_x_colors + 2] = input[y2_x2_colors + 2];
output[i_x_colors + 3] = input[y2_x2_colors + 3];
}
}
}
restrict keyword will help a lot, assuming no aliasing.
Another improvement is to declare another pointerToOutput and pointerToInput as uint_32_t, so that the four 8-bit copy-assignments can be combined into a 32-bit one, assuming pointers are 32bit aligned.
There's little that you can do to speed this up, as you already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (hoping that the compiler didn't see that yet).
In case of significant enlargement, there is a possibility: you can determine how many times every source pixel needs to be replicated (you'll need to work on the properties of the relation Xd=Wd.Xs/Ws in integers), and perform a single pixel read for k writes. This also works on the y's, and you can memcpy the identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.
But there is a barrier that you will not pass: you need to fill the destination image.
If you are desperately looking for speedup, there could remain the option of using vector operations (SEE or AVX) to handle several pixels at a time. Shuffle instructions are available that might enable to control the replication (or decimation) of the pixels. But due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
The algorithm is fine, but you can utilize massive parallelization by submitting your image to the GPU. If you use opengl, simply creating a context of the new size and providing a properly sized quad can give you inherent nearest neighbor calculations. Also opengl could give you access to other resizing sampling techniques by simply changing the properties of the texture you read from (which would amount to a single gl command which could be an easy paramter to your resize function).
Also later in development, you could simply swap out a shader for other blending techniques which also keeps you utilizing your wonderful GPU processor of image processing glory.
Also, since you aren't using any fancy geometry it can become almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform magnitudes faster depending on image size.
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that is all we got. I did not actually check the destination image to see if it was right.
Changes:
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const uint32_t* input32 = (const uint32_t*)input;
uint32_t* output32 = (uint32_t*)output;
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
int x_ratio_with_color = x_ratio;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
int source_x_offset = 0;
int startingOffset = y2_xsource;
const uint32_t * inputLine = input32 + startingOffset;
for (int x = 0; x < targetWidth; x++)
{
i_xdest += 1;
source_x_offset += x_ratio_with_color;
int sourceOffset = source_x_offset >> 16;
output[i_xdest] = inputLine[sourceOffset];
}
}
}

Why is this slower than memcmp

I am trying to compare two rows of pixels.
A pixel is defined as a struct containing 4 float values (RGBA).
The reason I am not using memcmp is because I need to return the position of the 1st different pixel, which memcmp does not do.
My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128 x = _mm_load_ps((float*)(a + i));
__m128 y = _mm_load_ps((float*)(b + i));
__m128 cmp = _mm_cmpeq_ps(x, y);
if (_mm_movemask_ps(cmp) != 15) return i;
}
return -1;
}
I then found that treating the values as integers instead of floats sped things up a bit, and is now only ~20% slower than memcmp.
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
for (int i = 0; i < count; i++)
{
__m128i x = _mm_load_si128((__m128i*)(a + i));
__m128i y = _mm_load_si128((__m128i*)(b + i));
__m128i cmp = _mm_cmpeq_epi32(x, y);
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
}
return -1;
}
From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is what other tricks does the MS implementation have up it's sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?
Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16 byte boundary?
I am compiling with /o2 and all the optimization flags.
I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".
If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.
The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.
You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.
Edit: The last paragraph may need an example. This code assumes an unrolled loop of at least two:
__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));
for(int i = 0; i < count; i+=2)
{
__m128i cmp = _mm_cmpeq_epi32(x, y);
__m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
__m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));
if (_mm_movemask_epi8(cmp) != 0xffff) return i;
cmp = _mm_cmpeq_epi32(x1, y1);
__m128i x = _mm_load_si128((__m128i*)(a + i + 2));
__m128i y = _mm_load_si128((__m128i*)(b + i + 2));
if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
Roughly something like that.
You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function, it starts with some sanity checks and then checks if the pointers are aligned or not:
aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );
If they are not aligned it compares the pointers byte by byte until the start of an aligned address:
while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
if(*a++ != *b++) return -1;
--len;
}
And then compares the remaining memory with SSE instructions similar to your code:
if(!len) return 0;
while( len && !(len & 7 ) )
{
__m128i x = _mm_load_si128( (__m128i*)&a[i]);
__m128i y = _mm_load_si128( (__m128i*)&b[i]);
....
I cannot help you directly because I'm using Mac, but there's an easy way to figure out what happens:
You just step into memcpy in the debug mode and switch to Disassembly view. As the memcpy is a simple little function, you will easily figure out all the implementation tricks.

Any way to make this relatively simple (nested for memory copy) C++ code more efficient?

I realize this is kind of a goofy question, for lack of a better term. I'm just kind of looking for any outside idea on increasing the efficiency of this code, as it's bogging down the system very badly (it has to perform this function a lot) and I'm running low on ideas.
What it's doing it loading two image containers (imgRGB for a full color img and imgBW for a b&w image) pixel-by-individual-pixel of an image that's stored in "unsigned char *pImage".
Both imgRGB and imgBW are containers for accessing individual pixels as necessary.
// input is in the form of an unsigned char
// unsigned char *pImage
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
imgRGB[y][x].blue = *pImage;
pImage++;
imgRGB[y][x].green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
imgRGB[y][x].red = *pImage;
pImage++;
}
}
Like I said, I was just kind of looking for fresh input and ideas on better memory management and/or copy than this. Sometimes I look at my own code so much I get tunnel vision... a bit of a mental block. If anyone wants/needs more information, by all means let me know.
The obvious question is, do you need to copy the data in the first place? Can't you just define accessor functions to extract the R, G and B values for any given pixel from the original input array?
If the image data is transient so you have to keep a copy of it, you could just make a raw copy of it without any reformatting, and again define accessors to index into each pixel/channel on that.
Assuming the copy you outlined is necessary, unrolling the loop a few times may prove to help.
I think the best approach will be to unroll the loop enough times to ensure that each iteration processes a chunk of data divisible by 4 bytes (so in each iteration, the loop can simply read a small number of ints, rather than a large number of chars)
Of course this requires you to mask out bits of these ints when writing, but that's a fast operation, and most importantly, it is done in registers, without burdening the memory subsystem or the CPU cache:
// First, we need to treat the input image as an array of ints. This is a bit nasty and technically unportable, but you get the idea)
unsigned int* img = reinterpret_cast<unsigned int*>(pImage);
for (int y = 0; y < 640; ++y)
{
for (int x = 0; x < 480; x += 4)
{
// At the start of each iteration, read 3 ints. That's 12 bytes, enough to write exactly 4 pixels.
unsigned int i0 = *img;
unsigned int i1 = *(img+1);
unsigned int i2 = *(img+2);
img += 3;
// This probably won't make a difference, but keeping a reference to the found pixel saves some typing, and it may assist the compiler in avoiding aliasing.
ImgRGB& pix0 = imgRGB[y][x];
pix0.blue = i0 & 0xff;
pix0.green = (i0 >> 8) & 0xff;
pix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
ImgRGB& pix1 = imgRGB[y][x+1];
pix1.blue = (i0 >> 24) & 0xff;
pix1.green = i1 & 0xff;
pix1.red = (i0 >> 8) & 0xff;
imgBW[y][x+1] = i1 & 0xff;
ImgRGB& pix2 = imgRGB[y][x+2];
pix2.blue = (i1 >> 16) & 0xff;
pix2.green = (i1 >> 24) & 0xff;
pix2.red = i2 & 0xff;
imgBW[y][x+2] = (i1 >> 24) & 0xff;
ImgRGB& pix3 = imgRGB[y][x+3];
pix3.blue = (i2 >> 8) & 0xff;
pix3.green = (i2 >> 16) & 0xff;
pix3.red = (i2 >> 24) & 0xff;
imgBW[y][x+3] = (i2 >> 16) & 0xff;
}
}
it is also very likely that you're better off filling a temporary ImgRGB value, and then writing that entire struct to memory at once, meaning that the first block would look like this instead: (the following blocks would be similar, of course)
ImgRGB& pix0 = imgRGB[y][x];
ImgRGB tmpPix0;
tmpPix0.blue = i0 & 0xff;
tmpPix0.green = (i0 >> 8) & 0xff;
tmpPix0.red = (i0 >> 16) & 0xff;
imgBW[y][x] = (i0 >> 8) & 0xff;
pix0 = tmpPix0;
Depending on how clever the compiler is, this may cut down dramatically on the required number of reads.
Assuming the original code is naively compiled (which is probably unlikely, but will serve as an example), this will get you from 3 reads and 4 writes per pixel (read RGB channel, and write RGB + BW) to 3/4 reads per pixel and 2 writes. (one write for the RGB struct, and one for the BW value)
You could also accumulate the 4 writes to the BW image in a single int, and then write that in one go too, something like this:
bw |= (i0 >> 8) & 0xff;
bw |= (i1 & 0xff) << 8;
bw |= ((i1 >> 24) & 0xff) << 16;
bw |= ((i2 >> 16) & 0xff) << 24;
*(imgBW + y*480+x/4) = bw; // Assuming you can treat imgBW as an array of integers
This would cut down on the number of writes to 1.25 per pixel (1 per RGB struct, and 1 for every 4 BW values)
Again, the benefit will probably be a lot smaller (or even nonexistent), but it may be worth a shot.
Taking this a step further, the same could be done without too much trouble using the SSE instructions, allowing you to process 4 times as many values per iteration. (Assuming you're running on x86)
Of course, an important disclaimer here is that the above is nonportable. The reinterpret_cast is probably an academic point (it'll most likely work no matter what, especially if you can ensure that the original array is aligned on a 32-bit boundary, which will typically be the case for large allocations on all platforms)
A bigger issue is that the bit-twiddling depends on the CPU's endianness.
But in practice, this should work on x86. and with small changes, it should work on big-endian machines too. (modulo any bugs in my code, of course. I haven't tested or even compiled any of it ;))
But no matter how you solve it, you're going to see the biggest speed improvements from minimizing the number of reads and writes, and trying to accumulate as much data in the CPU's registers as possible. Read all you can in large chunks, like ints, reorder it in the registers (accumulate it into a number of ints, or write it into temporary instances of the RGB struct), and then write those combined value out to memory.
Depending on how much you know about low-level optimizations, it may be surprising to you, but temporary variables are fine, while direct memory to memory access can be slow (for example your pointer dereferencing assigned directly into the array). The problem with this is that you may get more memory accesses than necessary, and it's harder for the compiler to guarantee that no aliasing will occur, and so it may be unable to reorder or combine the memory accesses. You're generally better off writing as much as you can early on (top of the loop), doing as much as possible in temporaries (because the compiler can keep everything in registers), and then write everything out at the end. That also gives the compiler as much leeway as possible to wait for the initially slow reads.
Finally, adding a 4th dummy value to the RGB struct (so it has a total size of 32bit) will most likely help a lot too (because then writing such a struct is a single 32-bit write, which is simpler and more efficient than the current 24-bit)
When deciding how much to unroll the loop (you could do the above twice or more in each iteration), keep in mind how many registers your CPU has. Spilling out into the cache will probably hurt you as there are plenty of memory accesses already, but on the other hand, unroll as much as you can afford given the number of registers available (the above uses 3 registers for keeping the input data, and one to accumulate the BW values. It may need one or two more to compute the necessary addresses, so on x86, doubling the above might be pushing it a bit (you have 8 registers total, and some of them have special meanings). On the other hand, modern CPU's do a lot to compensate for register pressure, by using a much larger number of registers behind the scenes, so further unrolling might still be a total performance win.
As always, measure measure measure. It's impossible to say what's fast and what isn't until you've tested it.
Another general point to keep in mind is that data dependencies are bad. This won't be a big deal as long as you're only dealing with integral values, but it still inhibits instruction reordering, and superscalar execution.
In the above, I've tried to keep dependency chains as short as possible. Rather than continually incrementing the same pointer (which means that each increment is dependant on the previous one), adding a different offset to the same base address means that every address can be computed independently, again giving more freedom to the compiler to reorder and reschedule instructions.
I think the array accesses (are they real array accesses or operator []?) are going to kill you. Each one represents a multiply.
Basically, you want something like this:
for (int y=0; y < height; y++) {
unsigned char *destBgr = imgRgb.GetScanline(y); // inline methods are better
unsigned char *destBW = imgBW.GetScanline(y);
for (int x=0; x < width; x++) {
*destBgr++ = *pImage++;
*destBW++ = *destBgr++ = *pImage++; // do this in one shot - don't double deref
*destBgr++ = *pImage++;
}
}
This will do two multiplies per scanline. You code was doing 4 multiplies per PIXEL.
What I like to do in situations like this is go into the debugger and step through the disassembly to see what it is really doing (or have the compiler generate an assembly listing). This can give you a lot of clues about where inefficencies are. They are often not where you think!
By implementing the changes suggested by Assaf and David Lee above, you can get a before and after instruction count. This really helps me in optimizing tight inner loops.
You could optimize away some of the pointer arithmetic you're doing over and over with the subscript operators [][] and use an iterator instead (that is, advance a pointer).
Memory bandwidth is your bottleneck here. There is a theoretical minimum time required to transfer all the data to and from system memory. I wrote a little test to compare the OP's version with some simple assembler to see how good the compiler was. I'm using VS2005 with default release mode settings. Here's the code:
#include <windows.h>
#include <iostream>
using namespace std;
const int
c_width = 640,
c_height = 480;
typedef struct _RGBData
{
unsigned char
r,
g,
b;
// I'm assuming there's no padding byte here
} RGBData;
// similar to the code given
void SimpleTest
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
for (int y = 0 ; y < c_height ; ++y)
{
for (int x = 0 ; x < c_width ; ++x)
{
rgb [x + y * c_width].b = *src;
src++;
rgb [x + y * c_width].g = *src;
bw [x + y * c_width] = *src;
src++;
rgb [x + y * c_width].r = *src;
src++;
}
}
}
// the assembler version
void ASM
(
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
const int
count = 3 * c_width * c_height / 12;
_asm
{
push ebp
mov esi,src
mov edi,bw
mov ecx,count
mov ebp,rgb
l1:
mov eax,[esi]
mov ebx,[esi+4]
mov edx,[esi+8]
mov [ebp],eax
shl eax,16
mov [ebp+4],ebx
rol ebx,16
mov [ebp+8],edx
shr edx,24
and eax,0xff000000
and ebx,0x00ffff00
and edx,0x000000ff
or eax,ebx
or eax,edx
add esi,12
bswap eax
add ebp,12
stosd
loop l1
pop ebp
}
}
// timing framework
LONGLONG TimeFunction
(
void (*function) (unsigned char *src, RGBData *rgb, unsigned char *bw),
char *description,
unsigned char *src,
RGBData *rgb,
unsigned char *bw
)
{
LARGE_INTEGER
start,
end;
cout << "Testing '" << description << "'...";
memset (rgb, 0, sizeof *rgb * c_width * c_height);
memset (bw, 0, c_width * c_height);
QueryPerformanceCounter (&start);
function (src, rgb, bw);
QueryPerformanceCounter (&end);
bool
ok = true;
unsigned char
*bw_check = bw,
i = 0;
RGBData
*rgb_check = rgb;
for (int count = 0 ; count < c_width * c_height ; ++count)
{
if (bw_check [count] != i || rgb_check [count].r != i || rgb_check [count].g != i || rgb_check [count].b != i)
{
ok = false;
break;
}
++i;
}
cout << (end.QuadPart - start.QuadPart) << (ok ? " OK" : " Failed") << endl;
return end.QuadPart - start.QuadPart;
}
int main
(
int argc,
char *argv []
)
{
unsigned char
*source_data = new unsigned char [c_width * c_height * 3];
RGBData
*rgb = new RGBData [c_width * c_height];
unsigned char
*bw = new unsigned char [c_width * c_height];
int
v = 0;
for (unsigned char *dest = source_data ; dest < &source_data [c_width * c_height * 3] ; ++dest)
{
*dest = v++ / 3;
}
LONGLONG
totals [2] = {0, 0};
for (int i = 0 ; i < 10 ; ++i)
{
cout << "Iteration: " << i << endl;
totals [0] += TimeFunction (SimpleTest, "Initial Copy", source_data, rgb, bw);
totals [1] += TimeFunction ( ASM, " ASM Copy", source_data, rgb, bw);
}
LARGE_INTEGER
freq;
QueryPerformanceFrequency (&freq);
freq.QuadPart /= 100000;
cout << totals [0] / freq.QuadPart << "ns" << endl;
cout << totals [1] / freq.QuadPart << "ns" << endl;
delete [] bw;
delete [] rgb;
delete [] source_data;
return 0;
}
And the ratio between C and assembler I was getting was about 2.5:1, i.e. C was 2.5 times the time of the assembler version.
I've just noticed the original data was in BGR order. If the copy swapped the B and R components then it does make the assembler code a bit more complex. But it would also make the C code more complex too.
Ideally, you need to work out what the theoretical minimum time is and compare it to what you're actually getting. To do that, you need to know the memory frequency and the type of memory and the workings of the CPU's MMU.
You might try using a simple cast to get your RGB data, and just recompute the grayscale data:
#pragma pack(1)
typedef unsigned char bw_t;
typedef struct {
unsigned char blue;
unsigned char green;
unsigned char red;
} rgb_t;
#pragma pack(pop)
rgb_t *imageRGB = (rgb_t*)pImage;
bw_t *imageBW = (bw_t*)calloc(640*480, sizeof(bw_t));
// RGB(X,Y) = imageRGB[Y*480 + X]
// BW(X,Y) = imageBW[Y*480 + X]
for (int y = 0; y < 640; ++y)
{
// try and pull some larger number of bytes from pImage (24 is arbitrary)
// 24 / sizeof(rgb_t) = 8
for (int x = 0; x < 480; x += 24)
{
imageBW[y*480 + x ] = GRAYSCALE(imageRGB[y*480 + x ]);
imageBW[y*480 + x + 1] = GRAYSCALE(imageRGB[y*480 + x + 1]);
imageBW[y*480 + x + 2] = GRAYSCALE(imageRGB[y*480 + x + 2]);
imageBW[y*480 + x + 3] = GRAYSCALE(imageRGB[y*480 + x + 3]);
imageBW[y*480 + x + 4] = GRAYSCALE(imageRGB[y*480 + x + 4]);
imageBW[y*480 + x + 5] = GRAYSCALE(imageRGB[y*480 + x + 5]);
imageBW[y*480 + x + 6] = GRAYSCALE(imageRGB[y*480 + x + 6]);
imageBW[y*480 + x + 7] = GRAYSCALE(imageRGB[y*480 + x + 7]);
}
}
Several steps you can take. Result at the end of this answer.
First, use pointers.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int y=0; y < 640; ++y) {
for (int x=0; x < 480; ++x) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
}
If imgRGB and imgBW are declared as:
unsigned char imgBW[480][640];
RGB imgRGB[480][640];
You can combine the two loops:
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
for (int i=0; i < 640 * 480; ++i) {
rgbOut->blue = *pImage;
++pImage;
unsigned char tmp = *pImage; // Save to reduce amount of reads.
rgbOut->green = tmp;
*bwOut = tmp;
++pImage;
rgbOut->red = *pImage;
++pImage;
++rgbOut;
++bwOut;
}
You can exploit the fact that word reads are faster than four char reads. We will use a helper macro for this. Note this example assumes a little-endian target system.
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = pImage;
for (int i=0; i < 640 * 480; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->blue = pixels; \
pixels >>= 8; \
\
rgbOut->green = pixels; \
*bwOut = pixels; \
pixels >>= 8; \
\
rgbOut->red = pixels; \
pixels >>= 8; \
\
++rgbOut; \
++bwOut;
#define READ_PIXEL(shift) \
pixels |= (*curPixelGroup++) << (shift * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
READ_PIXEL(3); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef COPY_PIXELS
}
(Your compiler will probably optimize away the redundant or operation in the first READ_PIXEL. It will also optimize shifts, removing the redundant << 0, too.)
If the structure of RGB is thus:
struct RGB {
unsigned char blue, green, red;
};
You can optimize even further, copy to the struct directly, instead of through its members (red, green, blue). This can be done using anonymous structs (or casting, but that makes the code a bit more messy and probably more prone to error). (Again, this is dependant on little-endian systems, etc. etc.):
union RGB {
struct {
unsigned char blue, green, red;
};
uint32_t rgb:24; // Make sure it's a bitfield, otherwise the union will strech and ruin the ++ operator.
};
const unsigned char *pImage;
RGB *rgbOut = imgRGB;
unsigned char *bwOut = imgBW;
const uint32_t *curPixelGroup = pImage;
for (int i=0; i < 640 * 480; ++i) {
uint64_t pixels = 0;
#define WRITE_PIXEL \
rgbOut->rgb = pixels; \
pixels >>= 8; \
\
*bwOut = pixels; \
pixels >>= 16; \
\
++rgbOut; \
++bwOut;
#define READ_PIXEL(shift) \
pixels |= (*curPixelGroup++) << (shift * 8);
READ_PIXEL(0); WRITE_PIXEL;
READ_PIXEL(1); WRITE_PIXEL;
READ_PIXEL(2); WRITE_PIXEL;
READ_PIXEL(3); WRITE_PIXEL;
/* Remaining */ WRITE_PIXEL;
#undef COPY_PIXELS
}
You can optimize writing the pixel similarly as we did with reading (writing in words rather than 24-bits). In fact, that'd be a pretty good idea, and will be a great next step in optimization. Too tired to code it, though. =]
Of course, you can write the routine in assembly language. This makes it less portable than it already is, however.
I'm assuming the following at the moment, so please let me know if my assumptions are wrong:
a) imgRGB is a structure of the type
struct ImgRGB
{
unsigned char blue;
unsigned char green;
unsigned char red;
};
or at least something similar.
b) imgBW looks something like this:
struct ImgBW
{
unsigned char BW;
};
c) The code is single threaded
Assuming the above, I see several problems with your code:
You put the assignment to the BW part right in the middle of the assignments to the other containers. If you're working on a modern CPU, chances are that with the size of your data your L1 cache gets invalidated every time you're switching containers and you're looking at reloading or switching a cache line. Caches are optimised for linear access these days so hopping to and fro doesn't help. Accessing main memory is a lot slower, so that would be a noticeable performance hit. To verify if this is a problem, temporarily I'd remove the assignment to imgBW and measure if there is a noticeable speedup.
The array access doesn't help and it'll potentially slow down the code a little, although a decent optimiser should take care of that. I'd probably write the loop along these lines instead, but would not expect a big performance gain. Maybe a couple percent.
for (int y=0; y blue = *pImage;
...
}
}
For consistency I would change from using postfix to prefix increment but I would not expect to see a big gain.
If you can waste a little storage (well, 25%) you might gain from adding a fourth dummy unsigned char to the structure ImgRGB provided that this would increase the size of the structure to the size of an int. Native ints are usually fastest to access and if you're looking at a structure of chars that are not filling up an int completely, you're potentially running into all sorts of interesting access issues that can slow your code down noticeably because the compiler might have to generate additional instructions to extract the unsigned chars. Again, try this and measure the result - it might make a noticeable difference or none at all. In the same vein, upping the size of the structure members from unsigned char to unsigned int might waste lots of space but potentially can speed up the code. Nevertheless as long as pImage is a pointer to an unsigned char, you would only eliminate half the problem.
All in all you are down to making your loop fit to your underlying hardware, so for specific optimisation techniques you might have to read up on what your hardware does well and what it does badly.
Make sure pImage, imgRGB, and imgBW are marked __restrict.
Use SSE and do it sixteen bytes at a time.
Actually from what you're doing there it looks like you could use a simple memcpy() to copy pImage into imgRGB (since imgRGB is in row-major format and apparently in the same order as pImage). You could fill out imgBW by using a series of SSE swizzle and store ops to pack down the green values but it might be cumbersome since you'd need to work on ( 3*16 =) 48 bytes at a time.
Are you sure pImage and your output arrays are all in dcache when you start this? Try using a prefetch hint to fetch 128 bytes ahead and measure to see if that improves things.
Edit If you're not on x86, replace "SSE" with the appropriate SIMD instruction set for your hardware, of course. (That'd be VMX, Altivec, SPU, VLIW, HLSL, etc.)
If possible, fix this at a higher level then bit or instruction twiddling!
You could specialize the the B&W image class to one that references the green channel of the color image class (thus saving a copy per pixel). If you always create them in pair, you might not even need the naive imgBW class at all.
By taking care about how your store the data in imgRGB, you could copy a triplet at a time from the input data. Better, you might copy the whole thing, or even just store a reference (which makes the previous suggestion easy as well).
If you don't control the implementation of everything here, you might be stuck, then:
Last resort: unroll the loop (cue someone mentioning Duff's device, or just ask the compiler to do it for you...), though I don't think you'll see much improvement...
It seems that you defined each pixel as some kind of structure or object. Using a primitive type (say, int) could be faster. As others have mentioned, the compiler is likely to optimize the array access using pointer increments. If the compile doesn't do that for you, you can do that yourself to avoid multiplications when you use array[][].
Since you only need 3 bytes per pixel, you could pack one pixel into one int. By doing that, you could copy 3 bytes a time instead of byte-by-byte. The only tricky thing is when you want to read individual color components of a pixel, you will need some bit masking and shifting. This could give you more overhead than that saved by using an int.
Or you can use 3 int arrays for 3 color components respectively. You will need a lot more storage, though.
Here is one very tiny, very simple optimization:
You are referring to imageRGB[y][x] repeatedly, and that likely needs to be re-calculated at each step.
Instead, calculate it once, and see if that makes some improvement:
Pixel* apixel;
for (int y=0; y < 640; y++) {
for (int x=0; x < 480; x++) {
apixel = &imgRGB[y][x];
apixel->blue = *pImage;
pImage++;
apixel->green = *pImage;
imgBW[y][x] = *pImage;
pImage++;
apixel->red = *pImage;
pImage++;
}
}
If pImage is already entirely in memory, why do you need to massage the data? I mean if it is already in pseudo-RGB format, why can't you just write some inline routines/macros that can spit out the values on demand instead of copying it around?
If rearranging the pixel data is important for later operations, consider block operations and/or cache line optimization.