RGB32 byte array memory access optimization - c++

I do not know much about low-level memory optimization, but I am trying to figure out how to optimize memory access from an unsigned char array holding RGB32 data.
I am building a directshow transform filter (CTransInPlaceFilter) that needs to average areas of the data. The data given to me is in rgb32 format (B G R 0xFF B G R 0xFF B G R 0xFF etc). I am looping the data and calculating the average red, green and blue channels.
I find that when I dereference the data in the byte array while the video is playing, CPU usage shoots up; without the accesses, the CPU does much less work.
After reading a bunch of posts, I think this is a memory to CPU bandwidth bottleneck.
Here is the code for the transform function:
HRESULT CFilter::Transform(IMediaSample *pSample) {
BYTE* data = NULL;
pSample->GetPointer(&data);
if (mVideoType == MEDIASUBTYPE_RGB32) {
Rect roi(0, 0, 400, 400); // Normally this is dynamic
int totalPixels = roi.width * roi.height;
// Find the average color
unsigned int totalR = 0, totalG = 0, totalB = 0;
for (int r = 0; r < roi.height; r++) {
int y = roi.y + r;
BYTE* pixel = data + (roi.x + y * mWidth) * 4; // 4 bytes per pixel
for (int c = 0; c < roi.width; c++) {
totalB += *pixel; // THESE 3 LINES ARE THE ISSUE
totalG += *(++pixel);
totalR += *(++pixel);
pixel++; // step over the 0xFF padding byte
}
}
int meanR = totalR / totalPixels; // integer division already floors
int meanG = totalG / totalPixels;
int meanB = totalB / totalPixels;
// Does work with the averaged data
}
return S_OK; }
So when I run the video without the 3 dereferencing lines I get about 10-14% cpu usage. With those lines I get 30-34% cpu usage.
I also tried copying the data into a buffer first and accessing it from there.
memcpy(mData, data, mWidth * mHeight * 4); // mData is allocated once in the constructor
...
totalB += mData[(x + y * mWidth) * 4]; // 4 bytes per pixel
The cpu usage became 22-25%.
Is it possible to get the CPU usage back down close to 10%? Is there some way to access the data much faster? Should I try using assembly?
Other info: the video is 10-bit 1280x720, and I am using GraphEdit to test my filter.
My filter does not change the source image (so it does not copy).
I may thread this process if that helps.
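For reference, the byte-by-byte accumulation above can also be written as one 32-bit load per pixel, unpacked with shifts. This is a sketch under stated assumptions: a little-endian target (so a B,G,R,0xFF pixel reads back as 0xFFRRGGBB), and the helper name `averageRoi` and the local `Rect` are illustrative, not from the actual filter. If the buffer turns out to live in uncached or write-combined memory, no unpacking trick will rescue the reads, but on ordinary system memory it avoids three separate byte loads per pixel.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative stand-in for the filter's ROI type.
struct Rect { int x, y, width, height; };

// Accumulate channel sums by reading one 32-bit word per pixel and
// unpacking with shifts. Assumes little-endian RGB32 (B,G,R,0xFF in memory).
void averageRoi(const uint8_t* data, int stride /* pixels per row */,
                const Rect& roi, int& meanR, int& meanG, int& meanB) {
    unsigned totalR = 0, totalG = 0, totalB = 0;
    for (int r = 0; r < roi.height; ++r) {
        const uint8_t* row = data + (roi.x + (roi.y + r) * stride) * 4;
        for (int c = 0; c < roi.width; ++c) {
            uint32_t px;
            std::memcpy(&px, row + c * 4, 4); // one 32-bit load per pixel
            totalB +=  px        & 0xFF;
            totalG += (px >> 8)  & 0xFF;
            totalR += (px >> 16) & 0xFF;
        }
    }
    int n = roi.width * roi.height;
    meanR = totalR / n;
    meanG = totalG / n;
    meanB = totalB / n;
}
```

Whether this helps depends on whether the byte loads themselves are the real cost; as it turned out in this question, they were not.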
Thanks in advance!
Edit:
For more info, I added the DirectShow graph. The video is 10-bit, but the LAV filters pass RGB32 (8-bit) to me. It is a debug build; would release speed it up? (Eventually I will compile a release build.)
I ran the two variants (mentioned earlier) and benchmarked their elapsed time. With dereferencing I get around 0.126 ms per Transform call. Without the dereferencing I get about 0.009 ms.
I also tried to reduce loop overhead by unrolling four pixels per iteration (RGB32 is 4 bytes per pixel):
for (int c = 0; c < roi.width * 4; c += 16) {
totalB += pixel[c] + pixel[c + 4] + pixel[c + 8] + pixel[c + 12];
totalG += pixel[c + 1] + pixel[c + 5] + pixel[c + 9] + pixel[c + 13];
totalR += pixel[c + 2] + pixel[c + 6] + pixel[c + 10] + pixel[c + 14];
}
This did not change the CPU usage, and the elapsed time is still around 0.12 ms.
Edit 2:
I also built all dependencies and the project itself in release and I get the same result. Still very slow access.

I solved the issue. The problem was that the pointer pointed at data in video memory (a write-combined region). Every access pulled data across the bus from video memory into system RAM, which stalled on memory bandwidth. Copying all the data at once (memcpy) was faster, but still very slow. Instead I had to use specific Intel SSE4 instructions to copy the data from video memory to system memory efficiently.
I used this file (gpu_memcpy). It contains a function similar to memcpy, but built on SSE4.1 streaming loads (MOVNTDQA), which are designed for reading from write-combined memory; despite the name, the copy runs on the CPU. It is much faster, and after copying, accessing the data is as fast as usual.
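For illustration, here is a minimal sketch in the spirit of that streaming copy. It is an assumption-laden simplification, not the actual gpu_memcpy source: the function name is made up, the source pointer is assumed 16-byte aligned, and a GCC/Clang `target("sse4.1")` attribute is used so the intrinsic compiles without a global -msse4.1 flag. The key instruction is `_mm_stream_load_si128` (MOVNTDQA), which reads through a small streaming buffer instead of issuing slow uncached loads from write-combined memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <smmintrin.h> // SSE4.1: _mm_stream_load_si128

// Sketch of a streaming copy for write-combined (video) memory.
// Assumes: src is 16-byte aligned, CPU supports SSE4.1 at runtime.
__attribute__((target("sse4.1")))
void streaming_copy(void* dst, const void* src, size_t bytes) {
    auto* d = static_cast<__m128i*>(dst);
    auto* s = const_cast<__m128i*>(static_cast<const __m128i*>(src));
    size_t blocks = bytes / 16;
    for (size_t i = 0; i < blocks; ++i) {
        // Non-temporal (streaming) load: the fast path for USWC memory.
        __m128i v = _mm_stream_load_si128(s + i);
        _mm_storeu_si128(d + i, v); // unaligned store into system memory
    }
    // Plain copy for any tail that is not a multiple of 16 bytes.
    std::memcpy(static_cast<char*>(dst) + blocks * 16,
                static_cast<const char*>(src) + blocks * 16,
                bytes - blocks * 16);
}
```

On ordinary cacheable system memory this behaves like a plain copy; the payoff appears specifically when the source is USWC video memory, as in this question.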

Related

Zero-copy exporting image using png++ or libpng

Suppose I have a 2000 x 3000 24-bit RGB image stored in row-major order, with the colour channel as the fastest-varying index. I can export this with png++ as follows:
void export_png(const std::vector<uint8_t>& pixels, const std::string& pngfile) {
png::image<png::rgb_pixel> image(2000, 3000);
for (int y = 0; y < 3000; y++)
for (int x = 0; x < 2000; x++)
{
image[y][x].red = pixels[y*2000*3 + x*3 + 0];
image[y][x].green = pixels[y*2000*3 + x*3 + 1];
image[y][x].blue = pixels[y*2000*3 + x*3 + 2];
}
image.write(pngfile);
}
However, this is inefficient because it copies the image data before exporting it, requiring ~18 MB (2000 x 3000 x 3 bytes) more memory than should be needed, not to mention the time and memory bandwidth it takes to perform the copy.
Is there a way of exporting it directly from the pixels buffer without copying the data into a png::image first? With png++ or libpng? If so, what is the code for that?

C++ GDI+ bitmap manipulation needs speed up on byte operations

I'm using GDI+ in C++ to manipulate some Bitmap images, changing the colour and resizing the images. My code is very slow at one particular point and I was looking for some potential ways to speed up the line that's been highlighted in the VS2013 Profiler
for (UINT y = 0; y < 3000; ++y)
{
//one scanline at a time because bitmaps are stored wrong way up
byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
for (UINT x = 0; x < 4000; ++x)
{
//get grey value from 0.114*Blue + 0.299*Red + 0.587*Green
byte grey = (oRow[x * 3] * .114) + (oRow[x * 3 + 1] * .587) + (oRow[x * 3 + 2] * .299); //THIS LINE IS THE HIGHLIGHTED ONE
//rest of manipulation code
}
}
Any handy hints on how to handle this arithmetic line better? It's causing massive slowdowns in my code.
Thanks in advance!
Optimization depends heavily on the compiler used and the target system, but there are some hints which may be useful. Avoid multiplications:
Instead of:
byte grey = (oRow[x * 3] * .114) + (oRow[x * 3 + 1] * .587) + (oRow[x * 3 + 2] * .299); //THIS LINE IS THE HIGHLIGHTED ONE
use...
//get grey value from 0.114*Blue + 0.299*Red + 0.587*Green
byte grey = (*oRow) * .114;
oRow++;
grey += (*oRow) * .587;
oRow++;
grey += (*oRow) * .299;
oRow++;
You can put the increment of the pointer on the same line; I put it on a separate line for better understanding.
Also, instead of multiplying by a float you can use a lookup table, which can be faster than the arithmetic. This depends on the CPU and table size, but you can give it a shot:
// somewhere global or class attributes
byte tred[256];
byte tgreen[256];
byte tblue[256];
...at startup...
// Only init once at startup
// I am ignoring the warnings, you should not :-)
for(int i=0;i<256;i++)
{
tred[i]=i*.114;
tgreen[i]=i*.587;
tblue[i]=i*.299;
}
...in the loop...
byte grey = tred[*oRow];
oRow++;
grey += tgreen[*oRow];
oRow++;
grey += tblue[*oRow];
oRow++;
Also, 256*256*256 is not such a great size. You could build one big table indexed by all three channels, but as that table would be far larger than the usual CPU cache, I would not expect much speed efficiency from it.
As suggested, you could do the math in integers, but you could also try floats instead of doubles (.114f instead of .114), which are usually quicker, and you don't need the precision.
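To make the integer-math suggestion concrete, here is one common way to do it: scale the weights by 256 so they become small integers (29 + 150 + 77 == 256 for the 0.114/0.587/0.299 split), accumulate in an int, and shift right by 8. The helper name is illustrative; the weights are the standard luma coefficients from the question.

```cpp
#include <cassert>
#include <cstdint>

// Fixed-point version of grey = 0.114*B + 0.587*G + 0.299*R.
// Weights are scaled by 256; the +128 term rounds to nearest before the
// shift. No float<->int conversions remain in the inner loop.
inline uint8_t grey_fixed(uint8_t b, uint8_t g, uint8_t r) {
    return static_cast<uint8_t>((29 * b + 150 * g + 77 * r + 128) >> 8);
}
```

The result differs from the floating-point formula by at most one grey level, which is well below what the .114/.587/.299 precision justifies anyway.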
Do the loop like this instead, to save on pointer math. Creating a temporary pointer like this costs nothing, because the compiler will understand what you're up to.
for(UINT x = 0; x < 12000; x+=3)
{
byte* pVal = &oRow[x];
....
}
This code is also easily threadable - the compiler can do it for you automatically in various ways; here's one, using parallel for:
https://msdn.microsoft.com/en-us/library/dd728073.aspx
If you have 4 cores, that's a 4x speedup, just about.
Also be sure to check release vs debug build - you don't know the perf until you run it in release/optimized mode.
You could premultiply values like oRow[x * 3] * .114 and put them into an array. oRow[x * 3] has 256 possible values, so you can easily create an array aMul1 of 256 values from 0 to 255, each multiplied by .114. Then use aMul1[oRow[x * 3]] to find the multiplied value. And the same for the other components.
Actually, you could even create such an array for whole RGB values; your pixel format is 8-8-8, so you would need an array of size 256*256*256 = 16777216 (~16 MB). Whether this would speed up your process, you would have to check yourself with the profiler.
In general I've found that more direct pointer management, fewer intermediate instructions (on most CPUs, instructions are all roughly equal cost these days), and fewer memory fetches (tables are not the answer more often than they are) give the usual optimum, without going to direct assembly. Vectorization, especially explicit vectorization, also helps, as does dumping the assembly of the function and confirming the inner loop matches your expectations. Try this:
for (UINT y = 0; y < 3000; ++y)
{
//one scanline at a time because bitmaps are stored wrong way up
byte* oRow = (byte*)bitmapData1.Scan0 + (y * bitmapData1.Stride);
byte *p = oRow;
byte *pend = p + 4000 * 3;
for(; p != pend; p+=3){
const float grey = p[0] * .114f + p[1] * .587f + p[2] * .299f;
}
//alternatively with an autovectorizing compiler
for(; p != pend; p+=3){
#pragma unroll //or use a compiler option to unroll loops
//make sure vectorization and relevant instruction sets are enabled - this is effectively a dot product so the following intrinsic fits the bill:
//https://msdn.microsoft.com/en-us/library/bb514054.aspx
//vector types or compiler intrinsics are more reliable often too... but get compiler specific or architecture dependent respectively.
float grey = 0;
const float w[3] = {.114f, .587f, .299f};
for(int c = 0; c < 3; ++c){
grey += w[c] * p[c];
}
}
}
Consider fooling around with OpenCL and targeting your CPU to see how fast you could solve with CPU specific optimizations and easily multiple cores - OpenCL covers this up for you pretty well and provides built in vector ops and dot product.

How to parallelize this for loop for rapidly converting YUV422 to RGB888?

I am using the v4l2 API to grab images from a Microsoft LifeCam and then transferring these images over TCP to a remote computer. I am also encoding the video frames to MPEG2VIDEO using the ffmpeg API. The recorded videos play too fast, probably because not enough frames are being captured, combined with incorrect FPS settings.
The following is the code which converts a YUV422 source to a RGB888 image. This code fragment is the bottleneck in my code as it takes nearly 100 - 150 ms to execute which means I can't log more than 6 - 10 FPS at 1280 x 720 resolution. The CPU usage is 100% as well.
for (int line = 0; line < image_height; line++) {
for (int column = 0; column < image_width; column++) {
*dst++ = CLAMP((double)*py + 1.402*((double)*pv - 128.0)); // R - first byte
*dst++ = CLAMP((double)*py - 0.344*((double)*pu - 128.0) - 0.714*((double)*pv - 128.0)); // G - next byte
*dst++ = CLAMP((double)*py + 1.772*((double)*pu - 128.0)); // B - next byte
vid_frame->data[0][line * frame->linesize[0] + column] = *py;
// increment py, pu, pv here
}
}
'dst' is then compressed as JPEG and sent over TCP, and 'vid_frame' is saved to disk.
How can I make this code fragment faster so that I can get at least 30 FPS at 1280x720, compared to the present 5-6 FPS?
I've tried parallelizing the for loop across three threads using p_thread, processing one third of the rows in each thread.
for (int line = 0; line < image_height/3; line++) // thread 1
for (int line = image_height/3; line < 2*image_height/3; line++) // thread 2
for (int line = 2*image_height/3; line < image_height; line++) // thread 3
This gave me only a minor improvement of 20-30 milliseconds per frame.
What would be the best way to parallelize such loops? Can I use GPU computing or something like OpenMP? Say, spawning some 100 threads to do the calculations?
I also noticed higher frame rates with my laptop webcam as compared to the Microsoft USB Lifecam.
Here are other details:
Ubuntu 12.04, ffmpeg 2.6
AMD A8 quad-core processor with 6 GB RAM
Encoder settings:
codec: AV_CODEC_ID_MPEG2VIDEO
bitrate: 4000000
time_base: (AVRational){1, 20}
pix_fmt: AV_PIX_FMT_YUV420P
gop: 10
max_b_frames: 1
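The row-splitting attempt above can be sketched more generally with std::thread, giving each worker a contiguous band of rows so there is no sharing or locking inside the loop. This is a simplified sketch: the per-pixel transform here (a byte inversion) is a hypothetical stand-in for the YUV-to-RGB math, and both function names are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Process rows [row_begin, row_end) of a single-plane image.
// The inversion is a stand-in for the real per-pixel conversion.
void convert_rows(const uint8_t* src, uint8_t* dst, int width,
                  int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 0; x < width; ++x)
            dst[y * width + x] = 255 - src[y * width + x];
}

// Split the image into nthreads horizontal bands, one worker each.
void convert_parallel(const uint8_t* src, uint8_t* dst,
                      int width, int height, int nthreads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        int begin = height * t / nthreads;
        int end   = height * (t + 1) / nthreads;
        workers.emplace_back(convert_rows, src, dst, width, begin, end);
    }
    for (auto& w : workers) w.join();
}
```

For real frame rates the threads should be created once and handed successive frames, since spawning threads per frame eats much of the gain.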
If all you care about is fps and not ms per frame (latency), another option would be a separate thread per frame.
Threading is not the only option for speed improvements. You could also perform integer operations as opposed to floating point. And SIMD is an option. Using an existing library like sws_scale will probably give you the best performance.
Make sure you are compiling with -O3 (or -Os).
Make sure debug symbols are disabled.
Move repeated operations outside the loop e.g.
// compiler can't optimize this because another thread could change frame->linesize[0]
int row = line * frame->linesize[0];
for (int column = 0; column < image_width; column++) {
...
vid_frame->data[0][row + column] = *py;
You can precompute tables, so there is no math in the loop:
init() {
for(int py = 0; py <= 255 ; ++py)
for(int pv = 0; pv <= 255 ; ++pv)
ytable[pv][py] = CLAMP(py + 1.402*(pv - 128.0));
}
for (int column = 0; column < image_width; column++) {
*dst++ = ytable[*pv][*py];
Just to name a few options.
I think unless you want to reinvent the painful wheel, using pre-existing options (ffmpeg's libswscale or ffmpeg's scale filter, gstreamer's scale plugin, etc.) is a much better option.
But if you want to reinvent the wheel for whatever reason, show the code you used. For example, thread startup is expensive, so you'd want to create the threads before measuring your looptime and reuse threads from frame-to-frame. Better yet is frame-threading, but that adds latency. This is usually ok but depends on your use case. More importantly, don't write C code, learn to write x86 assembly (simd), all previously mentioned libraries use simd for such conversions, and that'll give you a 3-4x speedup (since it allows you to do 4-8 pixels instead of 1 per iteration).
You could build blocks of x lines and convert each block in a separate thread
do not mix integer and floating point arithmetic!
char x;
char y=((double)x*1.5); /* ouch casting double<->int is slow! */
char z=(x*3)>>1; /* fixed point arithmetic rulez */
use SIMD (though this would be easier if both input and output data were properly aligned...e.g. by using RGB8888 as output)
use openMP
an alternative that does not require any coding of the processing, would be to simply do your entire processing using a framework that does proper timestamping throughout the pipeline (starting at image acquisition time), and is hopefully optimized enough to deal with big data. e.g. gstreamer
Would something like this not work?
#pragma omp parallel for
for (int line = 0; line < image_height; line++) {
for (int column = 0; column < image_width; column++) {
dst[ ( image_width*line + column )*3 ] = CLAMP((double)*py + 1.402*((double)*pv - 128.0)); // R - first byte
dst[ ( image_width*line + column )*3 + 1] = CLAMP((double)*py - 0.344*((double)*pu - 128.0) - 0.714*((double)*pv - 128.0)); // G - next byte
dst[ ( image_width*line + column )*3 + 2] = CLAMP((double)*py + 1.772*((double)*pu - 128.0)); // B - next byte
vid_frame->data[0][line * frame->linesize[0] + column] = *py;
// increment py, pu, pv here
}
Of course you have to also handle incrementing py, py, pv part accordingly.
Pixel format conversion is usually performed using only integer variables.
That avoids converting back and forth between floating-point and integer values.
It also allows more effective use of the SIMD extensions of modern CPUs.
For example, this is a code of conversion YUV to BGR:
const int Y_ADJUST = 16;
const int UV_ADJUST = 128;
const int YUV_TO_BGR_AVERAGING_SHIFT = 13;
const int YUV_TO_BGR_ROUND_TERM = 1 << (YUV_TO_BGR_AVERAGING_SHIFT - 1);
const int Y_TO_RGB_WEIGHT = int(1.164*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_BLUE_WEIGHT = int(2.018*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_GREEN_WEIGHT = -int(0.391*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_GREEN_WEIGHT = -int(0.813*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_RED_WEIGHT = int(1.596*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
inline int RestrictRange(int value, int min = 0, int max = 255)
{
return value < min ? min : (value > max ? max : value);
}
inline int YuvToBlue(int y, int u)
{
return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
U_TO_BLUE_WEIGHT*(u - UV_ADJUST) +
YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
inline int YuvToGreen(int y, int u, int v)
{
return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
U_TO_GREEN_WEIGHT*(u - UV_ADJUST) +
V_TO_GREEN_WEIGHT*(v - UV_ADJUST) +
YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
inline int YuvToRed(int y, int v)
{
return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
V_TO_RED_WEIGHT*(v - UV_ADJUST) +
YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
This code is taken from here (http://simd.sourceforge.net/). The same project also has versions optimized for different SIMD instruction sets.
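As a sanity check, the helpers above map studio-range black (Y=16, U=V=128) to RGB(0, 0, 0) and studio-range white (Y=235, U=V=128) to RGB(255, 255, 255). The block below reproduces the constants and functions from the answer verbatim so the claim can be verified standalone.

```cpp
#include <cassert>

// Constants and helpers reproduced from the answer above (simd.sourceforge.net).
const int Y_ADJUST = 16;
const int UV_ADJUST = 128;
const int YUV_TO_BGR_AVERAGING_SHIFT = 13;
const int YUV_TO_BGR_ROUND_TERM = 1 << (YUV_TO_BGR_AVERAGING_SHIFT - 1);
const int Y_TO_RGB_WEIGHT = int(1.164*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_BLUE_WEIGHT = int(2.018*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_GREEN_WEIGHT = -int(0.391*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_GREEN_WEIGHT = -int(0.813*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_RED_WEIGHT = int(1.596*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);

inline int RestrictRange(int value, int min = 0, int max = 255) {
    return value < min ? min : (value > max ? max : value);
}
inline int YuvToBlue(int y, int u) {
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
        U_TO_BLUE_WEIGHT*(u - UV_ADJUST) +
        YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
inline int YuvToGreen(int y, int u, int v) {
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
        U_TO_GREEN_WEIGHT*(u - UV_ADJUST) +
        V_TO_GREEN_WEIGHT*(v - UV_ADJUST) +
        YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
inline int YuvToRed(int y, int v) {
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) +
        V_TO_RED_WEIGHT*(v - UV_ADJUST) +
        YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
```

Note how everything stays in integers: the 1.164/2.018/etc. weights are pre-scaled by 2^13 once at compile time, and the round term plus shift replaces float rounding in the per-pixel path.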

It's slower to calculate integral image using CUDA than CPU code

I am implementing the integral image calculation module using CUDA to improve performance.
But it runs slower than the CPU module.
Please let me know what I did wrong.
cuda kernels and host code follow.
Also, another problem: in the kernel SumH, using texture memory is slower than global memory. imageTexture was defined as below.
texture<unsigned char, 1> imageTexture;
cudaBindTexture(0, imageTexture, pbImage);
// kernels to scan the image horizontally and vertically.
__global__ void SumH(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rVSpan, int nWidth)
{
int nStartY, nEndY, nIdx;
if (!threadIdx.x)
{
nStartY = 1;
}
else
nStartY = (int)(threadIdx.x * rVSpan);
nEndY = (int)((threadIdx.x + 1) * rVSpan);
for (int i = nStartY; i < nEndY; i ++)
{
for (int j = 1; j < nWidth; j ++)
{
nIdx = i * nWidth + j;
pnIntImage[nIdx] = pnIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i];
pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i] * pbImage[nIdx - nWidth - i];
//pnIntImage[nIdx] = pnIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i);
//pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i) * tex1Dfetch(imageTexture, nIdx - nWidth - i);
}
}
}
__global__ void SumV(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rHSpan, int nHeight, int nWidth)
{
int nStartX, nEndX, nIdx;
if (!threadIdx.x)
{
nStartX = 1;
}
else
nStartX = (int)(threadIdx.x * rHSpan);
nEndX = (int)((threadIdx.x + 1) * rHSpan);
for (int i = 1; i < nHeight; i ++)
{
for (int j = nStartX; j < nEndX; j ++)
{
nIdx = i * nWidth + j;
pnIntImage[nIdx] = pnIntImage[nIdx - nWidth] + pnIntImage[nIdx];
pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - nWidth] + pn64SqrIntImage[nIdx];
}
}
}
// host code
int nW = image_width;
int nH = image_height;
unsigned char* pbImage;
int* pnIntImage;
__int64* pn64SqrIntImage;
cudaMallocManaged(&pbImage, nH * nW);
// assign image gray values to pbimage
cudaMallocManaged(&pnIntImage, sizeof(int) * (nH + 1) * (nW + 1));
cudaMallocManaged(&pn64SqrIntImage, sizeof(__int64) * (nH + 1) * (nW + 1));
float rHSpan, rVSpan;
int nHThreadNum, nVThreadNum;
if (nW + 1 <= 1024)
{
rHSpan = 1;
nVThreadNum = nW + 1;
}
else
{
rHSpan = (float)(nW + 1) / 1024;
nVThreadNum = 1024;
}
if (nH + 1 <= 1024)
{
rVSpan = 1;
nHThreadNum = nH + 1;
}
else
{
rVSpan = (float)(nH + 1) / 1024;
nHThreadNum = 1024;
}
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();
SumV<<<1, nVThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rHSpan, nH + 1, nW + 1);
cudaDeviceSynchronize();
Regarding the code that is currently in the question. There are two things I'd like to mention: launch parameters and timing methodology.
1) Launch parameters
When you launch a kernel there are two main arguments that specify the amount of threads you are launching. These are between the <<< and >>> sections, and are the number of blocks in the grid, and the number of threads per block as follows:
foo <<< numBlocks, numThreadsPerBlock >>> (args);
For a single kernel to be efficient on a current GPU, a rule of thumb is that numBlocks * numThreadsPerBlock should be at least 10,000, i.e. 10,000 pieces of work. As a rule of thumb you may get good results with only 5,000 threads (it varies with GPU: cheaper GPUs can get away with fewer threads), but this is the order of magnitude you need to be looking at as a minimum. You are running 1024 threads. This is almost certainly not enough (hint: the loops inside your kernel look like scan primitives, and these can be done in parallel).
Further to this there are a few other things to consider.
The number of blocks should be large in comparison to the number of SMs on your GPU. A Kepler K40 has 15 SMs, and to avoid a significant tail effect you'd probably want at least ~100 blocks on this GPU. Other GPUs have fewer SMs, but you haven't specified which you have, so I can't be more specific.
The number of threads per block should not be too small. You can only have so many blocks on each SM, so if your blocks are too small you will use the GPU suboptimally. Furthermore, on newer GPUs up to four warps can receive instructions on an SM simultaneously, so it is often a good idea to use block sizes that are multiples of 128.
2) Timing
I'm not going to go into so much depth here, but make sure your timing is sane. GPU code tends to have a one-time initialisation delay. If this is within your timing, you will see erroneously large runtimes for codes designed to represent a much larger code. Similarly, data transfer between the CPU and GPU takes time. In a real application you may only do this once for thousands of kernel calls, but in a test application you may do it once per kernel launch.
If you want to get accurate timings you must make your example more representative of the final code, or you must be sure that you are only timing the regions that will be repeated.
The only way to be sure is to profile the code, but in this case we can probably make a reasonable guess.
You're basically just doing a single scan through some data, and doing extremely minimal processing on each item.
Given how little processing you're doing on each item, the bottleneck when you process the data with the CPU is probably just reading the data from memory.
When you do the processing on the GPU, the data still needs to be read from memory and copied into the GPU's memory. That means we still have to read all the data from main memory, just like if the CPU did the processing. Worse, it all has to be written to the GPU's memory, causing a further slowdown. By the time the GPU even gets to start doing real processing, you've already used up more time than it would have taken the CPU to finish the job.
For Cuda to make sense, you generally need to be doing a lot more processing on each individual data item. In this case, the CPU is probably already nearly idle most of the time, waiting for data from memory. In such a case, the GPU is unlikely to be of much help unless the input data was already in the GPU's memory so the GPU could do the processing without any extra copying.
When working with CUDA there are a few things you should keep in mind.
Copying from host memory to device memory is 'slow' - when you copy some data from the host to the device you should do as much calculations as possible (do all the work) before you copy it back to the host.
On the device there are 3 types of memory - global, shared, local. You can rank them in speed like global < shared < local (local = fastest).
Reading from consecutive memory blocks is faster than random access. When working with array of structures you would like to transpose it to a structure of arrays.
You can always consult the CUDA Visual Profiler to show you the bottleneck of your program.
The above-mentioned GTX 750 has 512 CUDA cores (these are the same as the shader units, just driven in a different mode).
http://www.nvidia.de/object/geforce-gtx-750-de.html#pdpContent=2
Computing integral images is only partially parallelizable, because every value in the result array depends on a large set of its predecessors. Furthermore, there is only a tiny amount of math per memory transfer, so the unavoidable memory transfers, rather than ALU power, are likely the bottleneck. Such an accelerator might provide some speedup, but not a thrilling one, because the task itself does not allow it.
If you were computing multiple variations of integral images on the same input data, you would be much more likely to see the "thrill", thanks to far more parallelism and a higher ratio of math ops to transfers. But that would be a different task.
As a wild guess from a Google search, others have already fiddled with this: https://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&cad=rja&uact=8&ved=0CD8QFjAKahUKEwjjnoabw8bIAhXFvhQKHbUpA1Y&url=http%3A%2F%2Fdspace.mit.edu%2Fopenaccess-disseminate%2F1721.1%2F71883&usg=AFQjCNHBbOEB_OHAzLZI9__lXO_7FPqdqA

exchanging 2 memory positions

I am working with OpenCV and Qt, Opencv use BGR while Qt uses RGB , so I have to swap those 2 bytes for very big images.
There is a better way of doing the following?
I can not think of anything faster but looks so simple and lame...
int width = iplImage->width;
int height = iplImage->height;
uchar *iplImagePtr = (uchar *) iplImage->imageData;
uchar buf;
int limit = height * width;
for (int y = 0; y < limit; ++y) {
buf = iplImagePtr[2];
iplImagePtr[2] = iplImagePtr[0];
iplImagePtr[0] = buf;
iplImagePtr += 3;
}
QImage img((uchar *) iplImage->imageData, width, height,
QImage::Format_RGB888);
We are currently dealing with this issue in a Qt application. We've found the Intel Performance Primitives to be the fastest way to do this. They have extremely optimized code. In the HTML help files at Intel (ippiSwapChannels documentation) there is an example of exactly what you are looking for.
There are a couple of downsides:
One is the size of the library, but you can statically link just the routines you need.
The other is running on AMD CPUs: Intel libs run VERY slowly by default on AMD. Check out www.agner.org/optimize/asmlib.zip for details on a workaround.
I think this looks absolutely fine. That the code is simple is not something negative. If you want to make it shorter you could use std::swap:
std::swap(iplImagePtr[0], iplImagePtr[2]);
You could also do the following:
uchar* end = iplImagePtr + height * width * 3;
for ( ; iplImagePtr != end; iplImagePtr += 3) {
std::swap(iplImagePtr[0], iplImagePtr[2]);
}
There's cvConvertImage to do the whole thing in one line, but I doubt it's any faster either.
Couldn't you use one of the following methods ?
void QImage::invertPixels ( InvertMode mode = InvertRgb )
or
QImage QImage::rgbSwapped () const
Hope this helps a bit !
I would be inclined to do something like the following, working on the basis of that RGB data being in three-byte blocks.
int i = 0;
int limit = width * height * 3; // three bytes per pixel
while (i != limit)
{
buf = iplImagePtr[i]; // blue colour byte
iplImagePtr[i] = iplImagePtr[i + 2]; // put the red byte in the blue slot
iplImagePtr[i + 2] = buf; // put the blue byte in the red slot
i += 3;
}
I doubt it is any 'faster' but at end of day, you just have to go through the entire image, pixel by pixel.
You could always do this:
int width = iplImage->width;
int height = iplImage->height;
uchar *start = (uchar *) iplImage->imageData;
uchar *end = start + width * height * 3;
for (uchar *p = start ; p < end ; p += 3)
{
uchar buf = *p;
*p = *(p+2);
*(p+2) = buf;
}
but a decent compiler would do this anyway.
Your biggest overhead in these sorts of operations is going to be memory bandwidth.
If you're using Windows then you can probably do this conversion using the BitBlt and two appropriately set up DIBs. If you're really lucky then this could be done in the graphics hardware.
I hate to ruin anyone's day, but if you don't want to go the IPP route (see photo_tom) or pull in an optimized library, you might get better performance from the following (modifying Andreas answer):
uchar *iplImagePtr = (uchar *) iplImage->imageData;
uchar buf;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
std::swap(iplImagePtr[y * 3], iplImagePtr[y * 3 + 2]);
}
Now hold on, folks, I hear you yelling "but all those extra multiplies and adds!" The thing is, this form of the loop is far easier for a compiler to optimize, especially if they get smart enough to multithread this sort of algorithm, because each pass through the loop is independent of those before or after. In the other form, the value of iplImagePtr was dependent on the value in previous pass. In this form, it is constant throughout the whole loop; only y changes, and that is in a very, very common "count from 0 to N-1" loop construct, so it's easier for an optimizer to digest.
Or maybe it doesn't make a difference these days because optimizers are insanely smart (are they?). I wonder what a benchmark would say...
P.S. If you actually benchmark this, I'd also like to see how well the following performs:
uchar *iplImagePtr = (uchar *) iplImage->imageData;
uchar buf;
size_t limit = height * width;
for (size_t y = 0; y < limit; ++y) {
uchar *pixel = iplImagePtr + y * 3;
std::swap(pixel[0], pixel[2]);
}
Again, pixel is defined in the loop to limit its scope and keep the optimizer from thinking there's a cycle-to-cycle dependency. If the compiler increments and decrements the stack pointer each time through the loop to "create" and "destroy" pixel, well, it's stupid and I'll apologize for wasting your time.
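For anyone who does want to run that benchmark, a harness along these lines keeps both variants honest by first asserting they produce identical output. This is a sketch: function names are made up, and std::chrono timings here are indicative only, varying with compiler and flags.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>

// Marching-pointer form: the pointer itself carries the loop state.
void swap_marching(uint8_t* p, size_t pixels) {
    for (uint8_t* end = p + pixels * 3; p != end; p += 3)
        std::swap(p[0], p[2]);
}

// Indexed form: each iteration is independent, which can be easier for an
// optimizer (or autovectorizer) to reason about.
void swap_indexed(uint8_t* p, size_t pixels) {
    for (size_t i = 0; i < pixels; ++i)
        std::swap(p[i * 3], p[i * 3 + 2]);
}

// Time a callable in milliseconds using a monotonic clock.
template <class F>
double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Run each variant on identical copies of a realistically sized buffer and compare both the outputs and the times; any difference below the noise floor means the compiler already treated the two forms the same.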
cvCvtColor(iplImage, iplImage, CV_BGR2RGB);