So, I'm writing (or have written) a contour mapping application for mapping power frequency in North America, and it works really well... on Linux. I run it in a loop to update a BMP map file, which I'm eventually going to publish on a website. It can run and update itself in about 3 seconds, which is great for me. The problem came when I tried to port the application to Windows. I moved the code into Visual Studio 2012, linked the libraries and compiled. I had to make it ignore a few warnings about float-to-integer conversion, but I got it working.
Then, I had it run, and it didn't seem to do anything. But, after adding a few output commands, I realized it was doing something, it was just doing it probably 100x slower than the Linux executable! I mean, admittedly, the code is pretty intensive (around 400,000 iterations), but this is just ridiculous.
I saw a lot of other topics about VS running slow in debug mode, but even after I compile and run an executable, it's still just as slow.
Here's the relevant code; let me know if you have any ideas. Some of the functions you won't recognize, because either they aren't relevant to performance (I only call them once, and I know they aren't the source of the speed problem) or they come from the EasyBMP library I'm using for image manipulation. Right now I have it set to output a 100x100 image, but originally I was outputting an 800x500 pixel image:
float get_value(int x, int y, int dataNum, vector<vector<float> > data)
vector<float> distance;
float value=0; float distanceTotal = 0;
for(int i=0; i<dataNum; i++)
distance.push_back(sqrt(pow(data[0][i]-x,2) + pow(data[1][i]-y,2)));
if(distance[i] < 2)
return 0;
distance[i] = 1/pow(distance[i],3);
for(int i=0; i<dataNum; i++)
return value;
int _tmain(int argc, _TCHAR* argv[])
//set image attributes
int height=100; int width = 100; int colorScale = 5*255; string dataFile = "data2.txt";
//get data and colormap
vector<vector<float> > data = getData(dataFile, width, height);
vector<RGBApixel> colorMap = makeColorMap(colorScale);
int dataNum = data[0].size();
pair<float,float> range = make_pair(*min_element(data[2].begin(),data[2].end()),*max_element(data[2].begin(),data[2].end()));
//make image
BMP newMap;
for(int x=0; x<width; x++)
for(int y=0; y<height; y++)
//for debug purposes
cout << x << " " << y << endl;
float value = get_value(x,y,dataNum,data);
//get color value based on data value
int colorValue = floor((value-range.first)/(range.second-range.first)*colorScale);
//handle border cases
if(colorValue < 0 )
else if(colorValue > colorScale-1)
return 0;
Any thoughts?
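One thing that stands out in the code above is that get_value takes data by value, so the whole vector<vector<float> > is copied once per pixel, and a temporary distance vector is grown on every call. That is a plausible source of the slowdown, though not a confirmed one. Below is a minimal sketch of the same inverse-distance weighting with the data passed by const reference and no temporary vector. The name get_value_ref is made up, data[0], data[1] and data[2] are assumed to hold x, y and value, and the final normalisation by the total weight is a guess, because that part of the loop is not visible in the post.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical variant of get_value: `data` is taken by const reference,
// so it is not copied for every pixel.
// Assumption: data[0] = x coords, data[1] = y coords, data[2] = values.
float get_value_ref(int x, int y, const std::vector<std::vector<float> >& data)
{
    const std::size_t n = data[0].size();
    float weightedSum = 0.0f;
    float weightTotal = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
    {
        const float dx = data[0][i] - x;
        const float dy = data[1][i] - y;
        const float d = std::sqrt(dx * dx + dy * dy);
        if (d < 2.0f)
            return 0.0f;                     // same early exit as the original
        const float w = 1.0f / (d * d * d);  // 1/d^3 without calling pow
        weightedSum += data[2][i] * w;
        weightTotal += w;
    }
    // Assumed normalisation: weighted average of the sample values
    return weightTotal > 0.0f ? weightedSum / weightTotal : 0.0f;
}
```

Dropping the per-pixel cout << ... << endl from the inner loop is also worth trying, since endl flushes the stream on every pixel.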


Loading a BMP image at a specific index in OpenGL

I have to load a 24-bit BMP file and draw it at a certain (x,y) position in a GLUT window using OpenGL. I have found a function that uses the glaux library to do so. The color passed in ignoreColor is skipped during rendering.
void iShowBMP(int x, int y, char filename[], int ignoreColor)
{
    AUX_RGBImageRec *TextureImage;
    TextureImage = auxDIBImageLoad(filename);
    int i,j,k;
    int width = TextureImage->sizeX;
    int height = TextureImage->sizeY;
    int nPixels = width * height;
    int *rgPixels = new int[nPixels];
    for (i = 0, j=0; i < nPixels; i++, j += 3)
    {
        int rgb = 0;
        for(int k = 2; k >= 0; k--)
        {
            rgb = ((rgb << 8) | TextureImage->data[j+k]);
        }
        rgPixels[i] = (rgb == ignoreColor) ? 0 : 255;
        rgPixels[i] = ((rgPixels[i] << 24) | rgb);
    }
    glRasterPos2f(x, y);
    glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, rgPixels);
    delete []rgPixels;
}
But the problem is that glaux is now obsolete. If I call this function, the image is rendered and shown for a minute, then an error pops up (without any error message) and the glut window disappears. From the returned value shown in the console, it seems like a runtime error.
Is there any alternative to this function that doesn't use glaux? I have seen CImg, DevIL, etc., but none of them seems to work like this iShowBMP function. I am doing my project in Code::Blocks.
I have to load every frame to keep the implementation consistent with other parts of the program. Also, the bmp file whose name has been passed as a parameter to the function has both width and height in powers of 2.
The last two free() statements were not getting executed for some unknown reason, so the memory consumption kept increasing. That's why the program was crashing after a moment. Later I solved it using stb_image.h.
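The pixel conversion itself does not depend on any particular loader. Here is a small sketch, assuming the loader (stb_image or anything else) hands back tightly packed 8-bit RGB data. toRGBA and its parameters are hypothetical names, and ignoreColor is packed as 0xBBGGRR, the same way iShowBMP builds its rgb value. Returning a std::vector means there is nothing to free by hand.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: expand packed RGB bytes to RGBA bytes and make every
// pixel equal to ignoreColor fully transparent.
// Assumption: ignoreColor is 0xBBGGRR, matching the packing in iShowBMP.
std::vector<unsigned char> toRGBA(const unsigned char* rgb, int width, int height,
                                  int ignoreColor)
{
    const std::size_t nPixels = static_cast<std::size_t>(width) * height;
    std::vector<unsigned char> rgba(nPixels * 4);
    for (std::size_t i = 0; i < nPixels; ++i)
    {
        const unsigned char r = rgb[3 * i];
        const unsigned char g = rgb[3 * i + 1];
        const unsigned char b = rgb[3 * i + 2];
        const int packed = (b << 16) | (g << 8) | r;
        rgba[4 * i]     = r;
        rgba[4 * i + 1] = g;
        rgba[4 * i + 2] = b;
        rgba[4 * i + 3] = (packed == ignoreColor) ? 0 : 255;  // alpha
    }
    return rgba;
}
```

The result can then be handed to glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, &rgba[0]) in place of rgPixels.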

Why am I getting Undefined Behavior (EXC_BAD_ACCESS (code=1, address=0x1177c1530)) when I access a position of a matrix (opencv mat) on Xcode

I am trying to develop a C++ program with the OpenCV library on Xcode 9.3, macOS 10.14, using clang. For weeks I've been trying to solve, or at least understand, why I am getting an undefined behavior error that sometimes makes my program crash and sometimes does not.
I am reading a set of images from different cameras and storing them in a multidimensional array: silC[camera][image]. (images are well stored)
I get this error, THREAD 1: EXC_BAD_ACCESS (code=1, address=0x1177c1530), when I do this: currentImage.at<float>(x,y), even though neither the values of currentImage nor the image itself is the problem.
I post the code below if there's any chance someone could help me..
vector< vector<Mat> > silC(8,vector<Mat>()); // Store the pbm images separating from different cameras
/* I read the images and store them in silC. */
for (int z=0; z < nz; z++) {
    for (int y=0; y < ny; y++) {
        for (int x=0; x < nx; x++) {
            // Current voxel coordinates in the 3D space
            float xcoord = x*voxelsize + Ox + voxelsize/2;
            float ycoord = y*voxelsize + Oy + voxelsize/2;
            float zcoord = z*voxelsize + Oz + voxelsize/2;
            for (int camId=0; camId < matricesP.size(); camId++) {
                imgId = 0;
                currentImage = silC[camId][imgId];
                int w = silC[camId][imgId].cols;
                int h = silC[camId][imgId].rows;
                // Project the voxel from the 3D space to the images
                Mat P = matricesP[camId];
                Mat projection = P*(Mat_<float>(4,1) << xcoord,ycoord,zcoord,1.0);
                //We get the point in homog coord.
                float xp = projection.at<float>(0);
                float yp = projection.at<float>(1);
                float zp = projection.at<float>(2);
                // Get the cartesian coord
                int xp2d = cvRound(xp/zp);
                int yp2d = cvRound(yp/zp);
                if(xp2d >= 0 && xp2d < w && yp2d >= 0 && yp2d < h){
                    // all values are correct! :/
                    // int value = silC[camId][imgId].at<float>(xp2d, yp2d); // undefined behaviour: crashes sometimes..
                    int value = currentImage.at<float>(xp2d, yp2d); // undefined behaviour also crashes sometimes..
                    if(value == 255){
                        cout << "Voxel okey \n";
                    }
                }
            }
        }
    }
}
The solution posted in the comments below is to use .at<float>(yp2d,xp2d) instead of .at<float>(xp2d,yp2d), because cv::Mat access takes the row first and then the column.
BUT, I tried several times to parallelize the for loop with OpenMP (#pragma omp parallel for) and it kept crashing. If someone is familiar with parallelization, I'd appreciate any help.
The solution is what @rafix07 posted. Thank you very much, guys; next time I'll try to focus more.
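The row-first convention is easy to trip over because image coordinates are usually written (x, y). Here is a tiny stand-alone model of a row-major image (not OpenCV itself; Grid and its members are made-up names). It shows how swapped arguments can pass a bounds check done on x against the width and y against the height, and still land outside the buffer when the image is not square:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a single-channel float image stored row by row.
struct Grid
{
    int rows;
    int cols;
    std::vector<float> pixels;

    Grid(int r, int c)
        : rows(r), cols(c), pixels(static_cast<std::size_t>(r) * c, 0.0f) {}

    // Offset of element (row, col) in row-major storage
    std::size_t offset(int row, int col) const
    {
        return static_cast<std::size_t>(row) * cols + col;
    }

    // True when (row, col) is a valid element
    bool inside(int row, int col) const
    {
        return row >= 0 && row < rows && col >= 0 && col < cols;
    }
};
```

With a 480x640 (rows x cols) grid and a point x = 600, y = 100, offset(y, x) stays inside the buffer, while offset(x, y) is 600 * 640 + 100, well past the 307200 elements that exist.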

Manipulating pixels of a cv::MAT just doesn't take effect

The following code is just supposed to load an image, fill it with a constant value and save it again.
Of course that doesn't have a purpose yet, but still it just doesn't work.
I can read the pixel values in the loop, but none of the changes take effect, and the file is saved exactly as it was loaded.
I think I followed the "efficient way" described here accurately:
int main()
{
    Mat im = imread("C:\\folder\\input.jpg");
    int channels = im.channels();
    int pixels = im.cols * channels;
    if (!im.isContinuous())
    { return 0; } // Just to show that I've thought of that. It never exits here.
    uchar* f = im.ptr<uchar>(0);
    for (int i = 0; i < pixels; i++)
    {
        f[i] = (uchar)100;
    }
    imwrite("C:\\folder\\output.jpg", im);
    return 0;
}
Normal cv functions like cvtColor() are taking effect as expected.
Are the changes through the array happening on a buffer somehow?
Huge thanks in advance!
The problem is that you are not visiting all the pixels in the image. Your loop only covers im.cols*im.channels() bytes, which is a single row, rather than the whole image (im.cols*im.rows*im.channels() bytes). So the pointer loop only sets the values in the first row (if you look closely at the saved image, you will notice that this row has been changed).
Below is the corrected code:
int main()
{
    Mat im = imread("C:\\folder\\input.jpg");
    int channels = im.channels();
    int pixels = im.cols * im.rows * channels;
    if (!im.isContinuous())
    { return 0; } // Just to show that I've thought of that. It never exits here.
    uchar* f = im.ptr<uchar>(0);
    for (int i = 0; i < pixels; i++)
    {
        f[i] = (uchar)100;
    }
    imwrite("C:\\folder\\output.jpg", im);
    return 0;
}
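If the matrix were not continuous, a single flat loop would also be wrong, because each row is then followed by padding bytes. Here is a sketch of the row-by-row variant on a plain byte buffer, where step plays the role of the row stride in bytes. fillRows and its parameters are hypothetical; with OpenCV one would take each row pointer from im.ptr<uchar>(r) instead of computing it by hand.

```cpp
#include <cstddef>

// Hypothetical helper: set every channel of every pixel to `value`,
// walking the buffer one row at a time so that row padding is skipped.
// Assumption: `step` is the distance between row starts, in bytes.
void fillRows(unsigned char* data, int rows, int cols, int channels,
              std::size_t step, unsigned char value)
{
    const int rowBytes = cols * channels;  // bytes that belong to pixels
    for (int r = 0; r < rows; ++r)
    {
        unsigned char* row = data + static_cast<std::size_t>(r) * step;
        for (int i = 0; i < rowBytes; ++i)
        {
            row[i] = value;
        }
    }
}
```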

How to use cv::parallel_for_ for execution time reduction

I created an image processing algorithm using OpenCV and currently I'm trying to improve the time efficiency of my own, simple function which is similar to LUT, but with interpolation between values (double calibRI::corr(double)).
I optimized the pixel loop according to the OpenCV docs.
The non-parallel function (calib(cv::Mat), an object of the calibRI functor class) takes about 0.15s. I decided to use cv::parallel_for_ to make it faster.
First I implemented it as image tiling, according to this document. The time was reduced to 0.12s (4 threads).
virtual void operator()(const cv::Range& range) const
{
    for(int i = range.start; i < range.end; i++)
    {
        // divide image in 'thr' number of parts and process simultaneously
        cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
        cv::Mat in = img(roi);
        cv::Mat out = retVal(roi);
        out = calib(in); //loops over all pixels and does out[u,v]=calibRI::corr(in[u,v])
    }
}
I thought that running my function in parallel on subimages/tiles/ROIs was not yet optimal, so I implemented it as below:
template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
    typedef boost::function<T(T)> pixelProcessingFuntionPtr;
private:
    cv::Mat& image; //source and result image (to be overwritten)
    bool cont; //if the image is continuous
    size_t rows;
    size_t cols;
    size_t threads;
    std::vector<cv::Range> ranges;
    pixelProcessingFuntionPtr pixelProcessingFunction; //pixel modif. function
public:
    ParallelPixelLoop(cv::Mat& img, pixelProcessingFuntionPtr fun, size_t thr = 4)
        : image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), pixelProcessingFunction(fun), threads(thr)
    {
        int groupSize = 1;
        if (cont) {
            cols *= rows;
            rows = 1;
            groupSize = ceil( cols / threads );
        }
        else {
            groupSize = ceil( rows / threads );
        }
        int t = 0;
        for(t=0; t<threads-1; ++t) {
            ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
        }
        ranges.push_back( cv::Range( t*groupSize, rows<=1?cols:rows ) ); //last range must be to the end of image (ceil used before)
    }
    virtual void operator()(const cv::Range& range) const
    {
        for(int r = range.start; r < range.end; r++)
        {
            T* Ip = nullptr;
            cv::Range ran = ranges.at(r);
            if(cont) {
                Ip = image.ptr<T>(0);
                for (int j = ran.start; j < ran.end; ++j)
                    Ip[j] = pixelProcessingFunction(Ip[j]);
            }
            else {
                for(int i = ran.start; i < ran.end; ++i)
                {
                    Ip = image.ptr<T>(i);
                    for (int j = 0; j < cols; ++j)
                        Ip[j] = pixelProcessingFunction(Ip[j]);
                }
            }
        }
    }
};
Then I run it on a 1280x1024 64FC1 image, on an i5 processor, Win8, and get a time of around 0.4s using the code below:
double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
I have no idea why my implementation is so much slower than iterating over all the pixels in subimages... Is there a bug in my code, or are OpenCV ROIs optimized in some special way?
I do not think there is a time measurement error issue, as described here. I'm using OpenCV time functions.
Is there any other way to reduce the time of this function?
Thanks in advance!
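One detail in the constructor may be worth checking, although it is unlikely to explain the timing on its own: cols / threads is an integer division when both operands are size_t, so ceil() has nothing left to round. Here is a sketch of splitting n items into roughly equal half-open chunks with integer ceiling division. splitRange is a made-up helper, and std::pair stands in for cv::Range so the example has no OpenCV dependency.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, n) into at most `parts` half-open chunks
// [first, second) of near-equal size.
std::vector<std::pair<std::size_t, std::size_t> > splitRange(std::size_t n,
                                                             std::size_t parts)
{
    std::vector<std::pair<std::size_t, std::size_t> > chunks;
    if (parts == 0)
        return chunks;
    // Integer form of ceil(n / parts)
    const std::size_t chunk = (n + parts - 1) / parts;
    for (std::size_t start = 0; start < n; start += chunk)
    {
        chunks.push_back(std::make_pair(start, std::min(start + chunk, n)));
    }
    return chunks;
}
```

For n = 10 and parts = 4 this gives chunks of 3, 3, 3 and 1 items, and the last chunk always ends exactly at n.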
Generally, it's really hard to say why using cv::parallel_for_ failed to speed up the whole process. One possibility is that the problem is not related to processing or multithreading, but to time measurement. About 2 months ago I tried to optimize this algorithm and I noticed a strange thing: the first time I use it, it takes x ms, but if I use it a second, third, ... time (of course without restarting the application) it takes about x/2 (or even x/3) ms. I'm not sure what causes this behaviour. Most likely (in my opinion) it's caused by branch prediction: when the code is executed for the first time, the branch predictor "learns" which paths are usually taken, so next time it can predict which branch to take (and usually the guess will be correct). You can read more about it here; it's a really good question and it can open your eyes to some quite important things.
So, in your situation I would try a few things:
Measure it many times, 100 or 1000 should be enough (if it takes 0.12-0.4s it won't take much time), and see whether the last version of your code is still the slowest one. So just replace your code with this:
double t = cv::getTickCount();
for (unsigned int i=0; i<1000; i++) {
    ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
}
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
Test it on a bigger image. Maybe in your situation you just "don't need" 4 cores, but on a bigger image 4 cores will make a positive difference.
Use a profiler (for example Very Sleepy) to see which part of your code is critical.
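The repeated-measurement idea can be sketched without OpenCV. Assuming C++11 and std::chrono (averageSeconds is a made-up helper, and runs is expected to be positive), run the work once as an untimed warm-up, then time many runs and report the mean:

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: mean wall-clock time of `work` over `runs` calls,
// after one untimed warm-up call. Assumes runs > 0.
double averageSeconds(const std::function<void()>& work, int runs)
{
    work();  // warm-up: caches, branch predictor, lazy initialisation
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
    {
        work();
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / runs;
}
```

Each candidate (the tiled version, the per-pixel version, the serial loop) can be wrapped in a lambda and passed in, so that all of them are measured the same way.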

iOS - C/C++ - Speed up Integral Image calculation

I have a method which calculates an integral image (description here) commonly used in computer vision applications.
float *Integral(unsigned char *grayscaleSource, int height, int width, int widthStep)
{
    // convert the image to single channel 32f
    unsigned char *img = grayscaleSource;
    // set up variables for data access
    int step = widthStep/sizeof(float);
    uint8_t *data = (uint8_t *)img;
    float *i_data = (float *)malloc(height * width * sizeof(float));
    // first row only
    float rs = 0.0f;
    for(int j=0; j<width; j++)
    {
        rs += (float)data[j];
        i_data[j] = rs;
    }
    // remaining cells are sum above and to the left
    for(int i=1; i<height; ++i)
    {
        rs = 0.0f;
        for(int j=0; j<width; ++j)
        {
            rs += data[i*step+j];
            i_data[i*step+j] = rs + i_data[(i-1)*step+j];
        }
    }
    // return the integral image
    return i_data;
}
I am trying to make it as fast as possible. It seems to me like this should be able to take advantage of Apple's Accelerate.framework, or perhaps ARM's NEON intrinsics, but I can't see exactly how. It seems like that nested loop is potentially quite slow (for real-time applications at least).
Does anyone think it is possible to speed this up using any other techniques?
You can certainly vectorize the row by row summation. That is vDSP_vadd(). The horizontal direction is vDSP_vrsum().
If you want to write your own vector code, the horizontal sum might be sped up by something like psadbw, but that is Intel. Also, take a look at prefix sum algorithms, which are famously parallelizable.
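For reference, here is a plain C++ version of the structure described above: a running (prefix) sum along each row, followed by an element-wise add of the row above. It is only a sketch under stated assumptions: the source is tightly packed 8-bit grayscale (row stride equal to width), the result is returned in a std::vector rather than malloc'd memory, and integralImage is a made-up name. The second inner loop, the add over a whole row, is the part that would map onto vDSP_vadd or a NEON loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for an integral image.
// Assumption: `src` holds height * width bytes with no row padding.
std::vector<float> integralImage(const unsigned char* src, int height, int width)
{
    if (height <= 0 || width <= 0)
        return std::vector<float>();

    std::vector<float> out(static_cast<std::size_t>(height) * width);
    for (int i = 0; i < height; ++i)
    {
        const unsigned char* in = src + static_cast<std::size_t>(i) * width;
        float* row = &out[static_cast<std::size_t>(i) * width];

        // Horizontal pass: running sum along the current row
        float rs = 0.0f;
        for (int j = 0; j < width; ++j)
        {
            rs += in[j];
            row[j] = rs;
        }

        // Vertical pass: add the already finished row above
        if (i > 0)
        {
            const float* above = row - width;
            for (int j = 0; j < width; ++j)
            {
                row[j] += above[j];
            }
        }
    }
    return out;
}
```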