C++ GDI+ bitmap manipulation needs speed-up on byte operations

I'm using GDI+ in C++ to manipulate some Bitmap images, changing the colour and resizing them. My code is very slow at one particular point, and I am looking for potential ways to speed up the line that the VS2013 profiler highlighted:
for (UINT y = 0; y < 3000; ++y)
{
//one scanline at a time because bitmaps are stored wrong way up
byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
for (UINT x = 0; x < 4000; ++x)
{
//get grey value from 0.114*Blue + 0.587*Green + 0.299*Red (bytes are stored B, G, R)
byte grey = (oRow[x * 3] * .114) + (oRow[x * 3 + 1] * .587) + (oRow[x * 3 + 2] * .299); //THIS LINE IS THE HIGHLIGHTED ONE
//rest of manipulation code
}
}
Any handy hints on how to handle this arithmetic line better? It's causing massive slowdowns in my code.
Thanks in advance!

Optimization depends heavily on the compiler used and the target system, but there are some hints which may be useful. Avoid the index multiplications:
Instead of:
byte grey = (oRow[x * 3] * .114) + (oRow[x * 3 + 1] * .587) + (oRow[x * 3 + 2] * .299); //THIS LINE IS THE HIGHLIGHTED ONE
use...
//get grey value from 0.114*Blue + 0.587*Green + 0.299*Red
double greyD = (*oRow) * .114;
oRow++;
greyD += (*oRow) * .587;
oRow++;
greyD += (*oRow) * .299;
oRow++;
byte grey = (byte)greyD;
You can put the incrementing of the pointer on the same line; I put it on a separate line for better understanding.
Also, instead of multiplying by a float you can use a lookup table, which can be faster than the arithmetic. This depends on the CPU and the table size, but you can give it a shot:
// somewhere global or class attributes
byte tblue[256];
byte tgreen[256];
byte tred[256];
...at startup...
// Only init once at startup
// I am ignoring the conversion warnings, you should not :-)
for(int i=0;i<256;i++)
{
tblue[i]=i*.114;
tgreen[i]=i*.587;
tred[i]=i*.299;
}
...in the loop...
byte grey = tblue[*oRow];
oRow++;
grey += tgreen[*oRow];
oRow++;
grey += tred[*oRow];
oRow++;
Also, 256*256*256 entries is not such an outlandish size, so you could build one big table indexed by the whole pixel. But as that table would be much larger than the usual CPU cache, I would not expect much more speed from it.

As suggested, you could do the math in integer, but you could also try floats instead of doubles (.114f instead of .114), which are usually quicker, and you don't need the precision.
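For instance, a minimal integer fixed-point sketch (my own illustration, not from the question): the weights are scaled by 256 so the divide becomes a shift, and 29 + 150 + 77 = 256, so the result stays within 0..255 with no clamping needed.
// 29/256 ~ .114, 150/256 ~ .587, 77/256 ~ .299
byte grey = (byte)((oRow[x * 3] * 29 + oRow[x * 3 + 1] * 150 + oRow[x * 3 + 2] * 77) >> 8);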
Do the loop like this instead, to save on pointer math. Creating a temporary pointer like this won't cost anything, because the compiler will understand what you're up to:
for(UINT x = 0; x < 12000; x+=3)
{
byte* pVal = &oRow[x];
....
}
This code is also easily threadable - the compiler can do it for you automatically in various ways; here's one, using parallel_for:
https://msdn.microsoft.com/en-us/library/dd728073.aspx
If you have 4 cores, that's a 4x speedup, just about.
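As a hedged sketch of what that can look like with the PPL from that link (assuming the same bitmapData1 as above; each row is independent, so parallelising the outer loop is safe):
#include <ppl.h>
concurrency::parallel_for(0u, 3000u, [&](UINT y)
{
    byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
    for (UINT x = 0; x < 4000; ++x)
    {
        // per-pixel work as before
    }
});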
Also be sure to check release vs debug build - you don't know the perf until you run it in release/optimized mode.

You could precompute values like oRow[x * 3] * .114 and put them into an array. oRow[x * 3] has only 256 possible values, so you can easily create an array aMul1 of 256 entries for the values 0..255, each multiplied by .114. Then use aMul1[oRow[x * 3]] to look up the multiplied value, and do the same for the other components.
Actually, you could even create such an array for whole RGB values; your pixels are 8-8-8, so you would need an array of size 256*256*256 = 16777216, i.e. ~16 MB. Whether this would speed up your process, you would have to check yourself with the profiler.

In general I've found that more direct pointer management, fewer intermediate instructions, fewer instructions overall (on most CPUs they're all roughly equal cost these days), and fewer memory fetches - e.g. tables are not the answer more often than they are - is the usual optimum, short of writing assembly directly. Vectorization, especially explicit vectorization, also helps, as does dumping the assembly of the function and confirming the inner loop matches your expectations. Try this:
for (UINT y = 0; y < 3000; ++y)
{
//one scanline at a time because bitmaps are stored wrong way up
byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
byte *p = oRow;
byte *pend = p + 4000 * 3;
for(; p != pend; p+=3){
const float grey = p[0] * .114f + p[1] * .587f + p[2] * .299f;
}
//alternatively, with an autovectorizing compiler (reset p first, since the loop above already advanced it to pend):
p = oRow;
for(; p != pend; p+=3){
#pragma unroll //or use a compiler option to unroll loops
//make sure vectorization and relevant instruction sets are enabled - this is effectively a dot product so the following intrinsic fits the bill:
//https://msdn.microsoft.com/en-us/library/bb514054.aspx
//vector types or compiler intrinsics are often more reliable too... but they are compiler-specific or architecture-dependent, respectively.
float grey = 0;
const float w[3] = {.114f, .587f, .299f};
for(int c = 0; c < 3; ++c){
grey += w[c] * p[c];
}
}
}
Consider fooling around with OpenCL targeting your CPU, to see how fast you could go with CPU-specific optimizations and easy use of multiple cores - OpenCL covers this up for you pretty well and provides built-in vector ops and a dot product.
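For flavour, a minimal OpenCL kernel for this weighted sum might look like the following (kernel source only, host setup omitted; the names are mine, not from the question):
const char* greySrc = R"CLC(
__kernel void grey(__global const uchar* row, __global uchar* out)
{
    int i = get_global_id(0);                    // one work-item per pixel
    float3 w = (float3)(0.114f, 0.587f, 0.299f); // B, G, R weights
    float3 p = (float3)(row[i*3], row[i*3+1], row[i*3+2]);
    out[i] = convert_uchar_sat(dot(p, w));       // built-in dot product
}
)CLC";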

Related

Optimize a nearest neighbor resizing algorithm for speed

I'm using the following algorithm to perform nearest-neighbor resizing. Is there any way to optimize its speed? Input and output buffers are in ARGB format, though images are known to be always opaque. Thank you.
void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
const int colors = 4;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
for (int x = 0; x < targetWidth; x++)
{
int x2 = ((x * x_ratio) >> 16);
int y2_x2_colors = (y2_xsource + x2) * colors;
int i_x_colors = (i_xdest + x) * colors;
output[i_x_colors] = input[y2_x2_colors];
output[i_x_colors + 1] = input[y2_x2_colors + 1];
output[i_x_colors + 2] = input[y2_x2_colors + 2];
output[i_x_colors + 3] = input[y2_x2_colors + 3];
}
}
}
The restrict keyword will help a lot, assuming no aliasing.
Another improvement is to declare additional pointers to the output and input as uint32_t, so that the four 8-bit copy-assignments can be combined into a single 32-bit one, assuming the pointers are 32-bit aligned.
There's little you can do to speed this up, as you have already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (hoping that the compiler doesn't already do that).
In case of significant enlargement, there is a possibility: you can determine how many times each source pixel needs to be replicated (you'll need to work out the properties of the relation Xd = Wd * Xs / Ws in integers) and perform a single pixel read for k writes. This also works on the y's, and you can memcpy the identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding. A sketch of the row-copy idea follows.
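A hedged sketch of just the row-copy part, reusing the question's fixed-point mapping (lastSourceRow and the 4-bytes-per-pixel assumption are mine; memcpy is from <cstring>):
int lastSourceRow = -1;
for (int y = 0; y < targetHeight; y++)
{
    int sourceRow = (y * y_ratio) >> 16;
    uint8_t* destRow = output + y * targetWidth * 4;
    if (sourceRow == lastSourceRow)
    {
        // identical source row: duplicate the previous output row wholesale
        memcpy(destRow, destRow - targetWidth * 4, targetWidth * 4);
        continue;
    }
    lastSourceRow = sourceRow;
    // ...otherwise fall through to the per-pixel x loop for this row...
}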
But there is a barrier that you will not pass: you need to fill the destination image.
If you are desperately looking for a speedup, there remains the option of using vector operations (SSE or AVX) to handle several pixels at a time. Shuffle instructions are available that might let you control the replication (or decimation) of the pixels, but due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.
The algorithm is fine, but you can exploit massive parallelization by submitting your image to the GPU. If you use OpenGL, simply creating a context of the new size and providing a properly sized quad gives you inherent nearest-neighbor sampling. OpenGL also gives you access to other resizing/sampling techniques by simply changing the properties of the texture you read from (which amounts to a single GL command and could be an easy parameter to your resize function).
Later in development, you could simply swap in a different shader for other blending techniques, which also keeps you utilizing your wonderful GPU for image-processing glory.
Also, since you aren't using any fancy geometry, it is almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform orders of magnitude faster depending on image size.
I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that's all we got. I did not actually check the destination image to see if it was correct.
Changes:
- remove multiplies from inner loop (10% improvement)
- uint32_t instead of uint8_t (10% improvement)
- __restrict keyword (1% improvement)
This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.
void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
const uint32_t* input32 = (const uint32_t*)input;
uint32_t* output32 = (uint32_t*)output;
const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
const int y_ratio = (int)((sourceHeight << 16) / targetHeight);
int x_ratio_with_color = x_ratio;
for (int y = 0; y < targetHeight; y++)
{
int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
int i_xdest = y * targetWidth;
int source_x_offset = 0;
int startingOffset = y2_xsource;
const uint32_t * inputLine = input32 + startingOffset;
for (int x = 0; x < targetWidth; x++)
{
int sourceOffset = source_x_offset >> 16;
output32[i_xdest] = inputLine[sourceOffset]; // output32, not output: copy whole 32-bit pixels
i_xdest += 1;
source_x_offset += x_ratio_with_color;
}
}
}

Calculating an integral image using CUDA is slower than CPU code

I am implementing an integral image calculation module using CUDA to improve performance, but its speed is slower than the CPU module.
Please let me know what I did wrong.
The CUDA kernels and host code follow.
There is also another problem: in the kernel SumH, using texture memory is slower than global memory. imageTexture was defined as below.
texture<unsigned char, 1> imageTexture;
cudaBindTexture(0, imageTexture, pbImage);
// kernels to scan the image horizontally and vertically.
__global__ void SumH(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rVSpan, int nWidth)
{
int nStartY, nEndY, nIdx;
if (!threadIdx.x)
{
nStartY = 1;
}
else
nStartY = (int)(threadIdx.x * rVSpan);
nEndY = (int)((threadIdx.x + 1) * rVSpan);
for (int i = nStartY; i < nEndY; i ++)
{
for (int j = 1; j < nWidth; j ++)
{
nIdx = i * nWidth + j;
pnIntImage[nIdx] = pnIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i];
pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i] * pbImage[nIdx - nWidth - i];
//pnIntImage[nIdx] = pnIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i);
//pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i) * tex1Dfetch(imageTexture, nIdx - nWidth - i);
}
}
}
__global__ void SumV(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rHSpan, int nHeight, int nWidth)
{
int nStartX, nEndX, nIdx;
if (!threadIdx.x)
{
nStartX = 1;
}
else
nStartX = (int)(threadIdx.x * rHSpan);
nEndX = (int)((threadIdx.x + 1) * rHSpan);
for (int i = 1; i < nHeight; i ++)
{
for (int j = nStartX; j < nEndX; j ++)
{
nIdx = i * nWidth + j;
pnIntImage[nIdx] = pnIntImage[nIdx - nWidth] + pnIntImage[nIdx];
pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - nWidth] + pn64SqrIntImage[nIdx];
}
}
}
// host code
int nW = image_width;
int nH = image_height;
unsigned char* pbImage;
int* pnIntImage;
__int64* pn64SqrIntImage;
cudaMallocManaged(&pbImage, nH * nW);
// assign image gray values to pbimage
cudaMallocManaged(&pnIntImage, sizeof(int) * (nH + 1) * (nW + 1));
cudaMallocManaged(&pn64SqrIntImage, sizeof(__int64) * (nH + 1) * (nW + 1));
float rHSpan, rVSpan;
int nHThreadNum, nVThreadNum;
if (nW + 1 <= 1024)
{
rHSpan = 1;
nVThreadNum = nW + 1;
}
else
{
rHSpan = (float)(nW + 1) / 1024;
nVThreadNum = 1024;
}
if (nH + 1 <= 1024)
{
rVSpan = 1;
nHThreadNum = nH + 1;
}
else
{
rVSpan = (float)(nH + 1) / 1024;
nHThreadNum = 1024;
}
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();
SumV<<<1, nVThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rHSpan, nH + 1, nW + 1);
cudaDeviceSynchronize();
Regarding the code that is currently in the question, there are two things I'd like to mention: launch parameters and timing methodology.
1) Launch parameters
When you launch a kernel there are two main arguments that specify the number of threads you are launching. These go between the <<< and >>> markers and are the number of blocks in the grid and the number of threads per block, as follows:
foo <<< numBlocks, numThreadsPerBlock >>> (args);
For a single kernel to be efficient on a current GPU, you can use the rule of thumb that numBlocks * numThreadsPerBlock should be at least 10,000, i.e. 10,000 pieces of work. This is a rule of thumb, so you may get good results with only 5,000 threads (it varies with GPU: cheaper GPUs can get away with fewer threads), but this is the order of magnitude you need to be looking at as a minimum. You are running 1024 threads. This is almost certainly not enough (hint: the loops inside your kernel look like scan primitives, and these can be done in parallel).
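As one hedged illustration of that hint (the Thrust calls are real, but rowIn/rowOut are my names, and the uchar data is assumed already widened to int):
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
// computes the running sums for one row of nWidth elements in parallel
thrust::inclusive_scan(thrust::device, rowIn, rowIn + nWidth, rowOut);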
Further to this there are a few other things to consider.
The number of blocks should be large compared to the number of SMs on your GPU. A Kepler K40 has 15 SMs, and to avoid a significant tail effect you'd probably want at least ~100 blocks on this GPU. Other GPUs have fewer SMs, but you haven't specified which you have, so I can't be more specific.
The number of threads per block should not be too small. You can only have so many blocks on each SM, so if your blocks are too small you will use the GPU suboptimally. Furthermore, on newer GPUs up to four warps can receive instructions on an SM simultaneously, so it is often a good idea to have block sizes that are multiples of 128.
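A hedged sketch of a launch configuration along those lines; note the kernel body would have to be reworked to index by blockIdx.x * blockDim.x + threadIdx.x instead of assuming a single block:
int totalWork = (nW + 1) * (nH + 1);   // e.g. one thread per output element
int threadsPerBlock = 128;             // a multiple of the warp size
int numBlocks = (totalWork + threadsPerBlock - 1) / threadsPerBlock;
SumH<<<numBlocks, threadsPerBlock>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);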
2) Timing
I'm not going to go into too much depth here, but make sure your timing is sane. GPU code tends to have a one-time initialisation delay; if this falls within your timed region, you will see erroneously large runtimes for a test designed to represent a much larger application. Similarly, data transfer between the CPU and GPU takes time: in a real application you may do this once for thousands of kernel calls, but in a test application you may do it once per kernel launch.
If you want accurate timings you must make your example more representative of the final code, or you must be sure that you are only timing the regions that will be repeated.
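A minimal sketch of timing only the kernel with CUDA events, so the one-time initialisation and the transfers stay out of the measurement:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // kernel time only, in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);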
The only way to be sure is to profile the code, but in this case we can probably make a reasonable guess.
You're basically just doing a single scan through some data, and doing extremely minimal processing on each item.
Given how little processing you're doing on each item, the bottleneck when you process the data with the CPU is probably just reading the data from memory.
When you do the processing on the GPU, the data still needs to be read from memory and copied into the GPU's memory. That means we still have to read all the data from main memory, just like if the CPU did the processing. Worse, it all has to be written to the GPU's memory, causing a further slowdown. By the time the GPU even gets to start doing real processing, you've already used up more time than it would have taken the CPU to finish the job.
For CUDA to make sense, you generally need to be doing a lot more processing on each individual data item. In this case, the CPU is probably already nearly idle most of the time, waiting for data from memory. In such a case, the GPU is unlikely to be of much help, unless the input data is already in the GPU's memory so the GPU can do the processing without any extra copying.
When working with CUDA there are a few things you should keep in mind.
Copying from host memory to device memory is 'slow' - when you copy some data from the host to the device you should do as many calculations as possible (do all the work) before you copy it back to the host.
On the device there are several kinds of memory - global, shared, and registers. You can rank them in speed as global < shared < registers (registers = fastest).
Reading from consecutive memory blocks is faster than random access. When working with an array of structures, you would like to transpose it to a structure of arrays (see the sketch after this list).
You can always consult the CUDA Visual Profiler to show you the bottleneck of your program.
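A tiny sketch of the array-of-structures vs structure-of-arrays point (the types are mine, for illustration):
// AoS: thread i reading pixels[i].r also strides over g, b, a
struct PixelAoS { unsigned char r, g, b, a; };
// SoA: thread i reading r[i] touches consecutive bytes - coalesced
struct PixelsSoA { unsigned char *r, *g, *b, *a; };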
The above-mentioned GTX 750 has 512 CUDA cores (these are the same as the shader units, just driven in a different mode).
http://www.nvidia.de/object/geforce-gtx-750-de.html#pdpContent=2
The task of creating integral images is only partially parallelizable, as any value in the result array depends on a large number of its predecessors. Further, it is only a tiny portion of math per memory transfer, so the unavoidable memory transfers, rather than ALU power, might be the bottleneck. Such an accelerator may provide some speedup, but not a thrilling one, because the task itself does not allow it.
If you were to compute multiple variations of integral images on the same input data, you would be much more likely to see the "thrill", due to much higher parallelism options and a higher amount of math ops, but that would be a different task.
As a wild guess from a Google search, others have already fiddled with this: http://dspace.mit.edu/openaccess-disseminate/1721.1/71883

RGB32 byte array memory access optimization

I do not know much about low-level memory optimization, but I am trying to figure out how to optimize memory access from an unsigned char array holding RGB32 data.
I am building a DirectShow transform filter (CTransInPlaceFilter) that needs to average areas of the data. The data given to me is in RGB32 format (B G R 0xFF B G R 0xFF B G R 0xFF etc.). I loop over the data and calculate the average red, green and blue channels.
I find that when I dereference the data in the byte array, the playing video drives CPU usage up, and without the accesses the CPU doesn't do nearly as much work.
After reading a bunch of posts, I think this is a memory-to-CPU bandwidth bottleneck.
Here is the code for the transform function:
HRESULT CFilter::Transform(IMediaSample *pSample) {
BYTE* data = NULL;
pSample->GetPointer(&data);
if (mVideoType == MEDIASUBTYPE_RGB32) {
Rect roi(0, 0, 400, 400); // Normally this is dynamic
int totalPixels = roi.width * roi.height;
// Find the average color
unsigned int totalR = 0, totalG = 0, totalB = 0;
for (int r = 0; r < roi.height; r++) {
int y = roi.y + r;
BYTE* pixel = data + (roi.x + y * mWidth) * 4; // 4 bytes per pixel
for (int c = 0; c < roi.width; c++) {
totalB += *pixel; // THESE 3 LINES IS THE ISSUE
totalG += *(++pixel);
totalR += *(++pixel);
pixel++;
}
}
int meanR = (int)floor(totalR / totalPixels);
int meanG = (int)floor(totalG / totalPixels);
int meanB = (int)floor(totalB / totalPixels);
// Does work with the averaged data
}
return S_OK; }
So when I run the video without the 3 dereferencing lines I get about 10-14% cpu usage. With those lines I get 30-34% cpu usage.
I also tried copying the data to a buffer and accessing it from there:
memcpy(mData, data, mWidth * mHeight * 4); // mData is only allocated once in constructor
...
totalB += mData[x + y * mWidth];
The cpu usage became 22-25%.
Is it possible to reduce the CPU usage back down close to 10%? Somehow access the data much quicker? Should I try using asm?
Other info: The video is 10-bit 1280x720; I am using GraphEdit to test my filter.
My filter does not change the source image (so it does not copy).
I may thread this process if that helps.
Thanks in advance!
Edit:
For more info, I added the DirectShow graph. The video is 10-bit, but the LAV filters pass RGB32 (8-bit) to me. It is a debug build; release would presumably speed it up (eventually I will compile a release build).
I ran the 2 different ways (mentioned earlier) and benchmarked their elapsed time. With dereferencing I get around 0.126208 milliseconds each time Transform runs. Without the dereferencing I get 0.009 milliseconds.
I also tried to reduce the loops by doing
for (int c = 0; c < roi.width; c += 4) {
totalB += pixel[c] + pixel[c + 3] + pixel[c + 6] + pixel[c + 9];
totalG += pixel[c + 1] + pixel[c + 4] + pixel[c + 7] + pixel[c + 10];
totalR += pixel[c + 2] + pixel[c + 5] + pixel[c + 8] + pixel[c + 11];
}
This did not change the CPU usage and the elapsed time is still around 0.12
Edit 2:
I also built all dependencies and the project itself in release, and I get the same result. Still very slow access.
I solved the issue. The problem was that data was a pointer into video memory. Reading through it directly transferred it from video memory to system RAM one access at a time, which killed memory bandwidth. Copying all the data at once (memcpy) was faster but still very slow. Instead I had to use specific Intel SSE4 instructions to efficiently copy the data from video memory to system memory.
I used this file (gpu_memcpy). It contains a function similar to memcpy but built on those streaming instructions. It is much faster, and after copying, accessing the data is as fast as usual.
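A hedged sketch of the idea behind such a copy (not the actual gpu_memcpy source): MOVNTDQA streaming loads via the SSE4.1 intrinsic, assuming 16-byte-aligned pointers and a size that is a multiple of 16.
#include <smmintrin.h> // SSE4.1
void copyFromVideoMemory(void* dst, const void* src, size_t size)
{
    __m128i* d = (__m128i*)dst;
    __m128i* s = (__m128i*)src; // cast away const for the intrinsic's signature
    for (size_t i = 0; i < size / 16; ++i)
    {
        __m128i v = _mm_stream_load_si128(&s[i]); // non-temporal load from WC/video memory
        _mm_store_si128(&d[i], v);                // normal store into system RAM
    }
}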

Modifying data from uint8_t array very slow?

I'm currently trying to optimize my code to run a bit faster. Currently it takes about 30 ms to update about 3,776,000 bytes. If I remove the outPx updates inside my function, it runs in about 3 ms, meaning that the updates to outPx are what is making the function slow.
Any potential feedback on how to improve the speed of my function below would be greatly appreciated.
uint8_t* outPx = (uint8_t*)out.data;
for (int px=0; px<pxSize; px+=4)
{
newTopAlpha = (alpha*inPx[px+3]);
if (0xff == newTopAlpha)
{
// top is opaque covers entire bottom
// set copy over BGR colors
outPx[px] = inPx[px];
outPx[px+1] = inPx[px+1];
outPx[px+2] = inPx[px+2];
outPx[px+3] = 0xff; //Fully opaque
}
else if (0x00 != newTopAlpha)
{
// top is not completely transparent
topAlpha = newTopAlpha/(float)0xff;
bottomAlpha = outPx[px+3]/(float)0xff;
newAlpha = topAlpha + bottomAlpha*(1-topAlpha);
alphaChange = bottomAlpha*(1-topAlpha);
outPx[px] = (uint8_t)((inPx[px]*topAlpha + outPx[px]*alphaChange)/newAlpha);
outPx[px+1] = (uint8_t)((inPx[px+1]*topAlpha + outPx[px+1]*alphaChange)/newAlpha);
outPx[px+2] = (uint8_t)((inPx[px+2]*topAlpha + outPx[px+2]*alphaChange)/newAlpha);
outPx[px+3] = (uint8_t)(newAlpha*0xff);
}
}
uint8_t is an exact-width integer type, meaning that you demand the compiler allocate exactly that much memory for your type. If your system has an alignment requirement, this may cause the code to run slower.
Change uint8_t to uint_fast8_t. This tells the compiler that you want this variable to be 8 bits if possible, but that it is ok to use a larger size if it makes the code faster.
Apart from that, there are lots of things that could cause bad performance, in which case you need to state what system and compiler you are using.
Your code is doing floating-point divides, and conversion from byte to float and back again. If you use integer math, it is highly likely to be more efficient.
Even doing this simple conversion to multiply instead of divide may help quite a bit:
newAlpha = 1/(topAlpha + bottomAlpha*(1-topAlpha));
...
outpx = (uint8_t)((inPx[px]*topAlpha + outPx[px]*alphaChange)*newAlpha);
Multiply tends to be much faster than divide.
OK, if this really is the bottleneck, and you can't use the GPU / built-in methods for some random reason, then there is a lot you can do:
uint8_t *outPx = (uint8_t*) out.data;
const int cAlpha = (int) (alpha * 256.0f + 0.5f);
for( int px = 0; px < pxSize; px += 4 ) {
const int topAlpha = (cAlpha * (int) inPx[px|3]) >> 8; // note | not + for tiny speed boost
if( topAlpha == 255 ) {
memcpy( &outPx[px], &inPx[px], 4 ); // might be slower than per-component copying; benchmark!
} else if( topAlpha ) {
const int bottomAlpha = (int) outPx[px|3];
const int alphaChange = (bottomAlpha * (255 - topAlpha)) / 255;
const int newAlpha = topAlpha + alphaChange;
outPx[px ] = (uint8_t) ((inPx[px ]*topAlpha + outPx[px ]*alphaChange) / newAlpha);
outPx[px|1] = (uint8_t) ((inPx[px|1]*topAlpha + outPx[px|1]*alphaChange) / newAlpha);
outPx[px|2] = (uint8_t) ((inPx[px|2]*topAlpha + outPx[px|2]*alphaChange) / newAlpha);
outPx[px|3] = (uint8_t) newAlpha;
}
}
The main change is that there is no floating point arithmetic any more (I might have missed a /255 or something, but you get the idea). I also removed repeated calculations and used bit operators where possible. Another optimisation would be to use fixed-precision arithmetic to change the 3 divides into a single divide and 3 multiply/bitshifts. But you'd have to benchmark to confirm that actually helps. The memcpy might be faster. Again, you need to benchmark.
Finally, if you know something about the images, you could give the compiler hints about the branching. For example, in GCC you can say if( __builtin_expect( topAlpha == 255, 1 ) ) if you know that most of the image is solid colour, and alpha is 1.0.
Update based on comments:
And for the love of sanity, never (never) benchmark with optimisations turned off.

exchanging 2 memory positions

I am working with OpenCV and Qt. OpenCV uses BGR while Qt uses RGB, so I have to swap those two bytes for very big images.
Is there a better way of doing the following?
I can't think of anything faster, but it looks so simple and lame...
int width = iplImage->width;
int height = iplImage->height;
uchar *iplImagePtr = (uchar *) iplImage->imageData;
uchar buf;
int limit = height * width;
for (int y = 0; y < limit; ++y) {
buf = iplImagePtr[2];
iplImagePtr[2] = iplImagePtr[0];
iplImagePtr[0] = buf;
iplImagePtr += 3;
}
QImage img((uchar *) iplImage->imageData, width, height,
QImage::Format_RGB888);
We are currently dealing with this issue in a Qt application. We've found the Intel Performance Primitives to be the fastest way to do this; they have extremely optimized code. In the HTML help files at Intel (the ippiSwapChannels documentation) they have an example of exactly what you are looking for.
There are a couple of downsides:
One is the size of the library, but you can statically link just the routines you need.
The other is running on AMD CPUs: the Intel libs run VERY slowly by default on AMD. Check out www.agner.org/optimize/asmlib.zip for details on a workaround.
I think this looks absolutely fine. That the code is simple is not something negative. If you want to make it shorter you could use std::swap:
std::swap(iplImagePtr[0], iplImagePtr[2]);
You could also do the following:
uchar* end = iplImagePtr + height * width * 3;
for ( ; iplImagePtr != end; iplImagePtr += 3) {
std::swap(iplImagePtr[0], iplImagePtr[2]);
}
There's cvConvertImage to do the whole thing in one line, but I doubt it's any faster either.
Couldn't you use one of the following methods?
void QImage::invertPixels ( InvertMode mode = InvertRgb )
or
QImage QImage::rgbSwapped () const
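For example (assuming the data has already been wrapped in a QImage as in the question; note rgbSwapped() returns a swapped copy rather than converting in place):
QImage img((uchar *) iplImage->imageData, width, height, QImage::Format_RGB888);
QImage rgb = img.rgbSwapped(); // R and B exchanged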
Hope this helps a bit!
I would be inclined to do something like the following, working on the basis of the RGB data being in three-byte blocks.
int i = 0;
int limit = width * height * 3; // three bytes per pixel
while (i != limit)
{
buf = iplImagePtr[i]; // the blue colour byte
iplImagePtr[i] = iplImagePtr[i + 2]; // put the red colour byte in the blue slot
iplImagePtr[i + 2] = buf; // put the saved blue byte into what was the red slot
i += 3;
}
I doubt it is any 'faster', but at the end of the day you just have to go through the entire image, pixel by pixel.
You could always do this:
int width = iplImage->width;
int height = iplImage->height;
uchar *start = (uchar *) iplImage->imageData;
uchar *end = start + width * height;
for (uchar *p = start ; p < end ; p += 3)
{
uchar buf = *p;
*p = *(p+2);
*(p+2) = buf;
}
but a decent compiler would do this anyway.
Your biggest overhead in these sorts of operations is going to be memory bandwidth.
If you're using Windows then you can probably do this conversion using the BitBlt and two appropriately set up DIBs. If you're really lucky then this could be done in the graphics hardware.
I hate to ruin anyone's day, but if you don't want to go the IPP route (see photo_tom) or pull in an optimized library, you might get better performance from the following (modifying Andreas' answer):
uchar *iplImagePtr = (uchar *) iplImage->imageData;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
std::swap(iplImagePtr[y * 3], iplImagePtr[y * 3 + 2]);
}
Now hold on, folks, I hear you yelling "but all those extra multiplies and adds!" The thing is, this form of the loop is far easier for a compiler to optimize, especially if it gets smart enough to multithread this sort of algorithm, because each pass through the loop is independent of those before and after. In the other form, the value of iplImagePtr depended on its value in the previous pass. In this form, it is constant throughout the whole loop; only y changes, and that is in a very, very common "count from 0 to N-1" loop construct, so it's easier for an optimizer to digest.
Or maybe it doesn't make a difference these days because optimizers are insanely smart (are they?). I wonder what a benchmark would say...
P.S. If you actually benchmark this, I'd also like to see how well the following performs:
uchar *iplImagePtr = (uchar *) iplImage->imageData;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
uchar *pixel = iplImagePtr + y * 3;
std::swap(pixel[0], pixel[2]);
}
Again, pixel is defined in the loop to limit its scope and keep the optimizer from thinking there's a cycle-to-cycle dependency. If the compiler increments and decrements the stack pointer each time through the loop to "create" and "destroy" pixel, well, it's stupid and I'll apologize for wasting your time.
cvCvtColor(iplImage, iplImage, CV_BGR2RGB);